[CWB] Trying to deal with tag problems when encoding

Stefan Evert stefanML at collocations.de
Fri May 19 13:15:23 CEST 2017


> On 19 May 2017, at 08:27, Scott Sadowsky <ssadowsky at gmail.com> wrote:
> 
> By the way, what's the logic at work in this command? Are corpus contents ordered according to size?

No, my logic was quite simple: If there's a missing </text> tag in one of your files, that region isn't closed and will extend to the very end of the corpus (unless there is a superfluous </text> tag or a damaged <text> tag in another file).  So there was a good chance that the last <text> region in the corpus would be the critical one.
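
The same reasoning can be applied to the input files themselves before encoding. Below is a rough sketch in Python, not part of CWB: the function names are hypothetical, and it assumes vertical-format input in which every <text ...> and </text> tag sits on a line of its own without indentation. It reports unclosed, superfluous and damaged text tags together with the file name and line number.

```python
import re

# Assumption: tags occupy a whole line and are not indented.
TEXT_OPEN = re.compile(r"^<text(\s[^>]*)?>\s*$")
TEXT_CLOSE = re.compile(r"^</text>\s*$")


def check_text_tags(lines):
    """Return a list of (line_number, message) problems for one file's lines."""
    problems = []
    open_line = None  # line number of the currently open <text>, if any
    for n, line in enumerate(lines, start=1):
        if TEXT_OPEN.match(line):
            if open_line is not None:
                problems.append(
                    (n, f"<text> opened while region from line {open_line} is still open")
                )
            open_line = n
        elif TEXT_CLOSE.match(line):
            if open_line is None:
                problems.append((n, "superfluous </text>"))
            open_line = None
        elif line.startswith("<text") or line.startswith("</text"):
            # Looks like a text tag but matches neither pattern.
            # Note: this would also flag other tags whose names begin with "text".
            problems.append((n, "damaged text tag"))
    if open_line is not None:
        problems.append((open_line, "<text> never closed"))
    return problems


def check_files(paths):
    """Print every problem found, prefixed with file name and line number."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for n, msg in check_text_tags(fh):
                print(f"{path}:{n}: {msg}")
```

Calling check_files() with the list of input files prints one line per problem, so the file name is available without a round trip through the encoded corpus.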

> Wonderful -- this found another two cases of bad tags! Unfortunately, this method doesn't show the filename, but the corpus index number. Any suggestion on getting the name of the file from this? It's encoded as the unique text identifier, so the info should be there, but show +text_id doesn't change what is shown.

That will only show <text> tags in the kwic lines, i.e. if the start of the text happens to fall within the displayed context.  What you want is

	set PrintStructures text_id;
	cat;

>> As a hotfix, just encode without the :0 nesting specifier, i.e.
>> 
>>         … -S text+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source
>> 
>> cwb-encode then assumes that <text> regions cannot be nested and will automatically insert a </text> end tag whenever it encounters a new region.
> 
> I'll give this a shot if I can't manage to identify the two files I need to fix. 

One of the advantages of -S text:0 is that it shows you there is a problem – with the "hotfix" solution, it's completely hidden.

Best,
Stefan

