[CWB] Trying to deal with tag problems when encoding

Stefan Evert stefanML at collocations.de
Fri May 19 08:01:27 CEST 2017


> On 19 May 2017, at 05:52, Scott Sadowsky <ssadowsky at gmail.com> wrote:
> 
> 1. Can a single missing closing tag really have this massive cascading effect on all texts processed after it?

Yes, because you're encoding with -S text:0, telling cwb-encode that <text> elements can be nested (but any nested ones should be ignored).  So if there's a single missing </text> in one of your files, the current <text> region will stay open until the end of the corpus and all following .vrt files will be nested in it (and hence dropped).

This is also how you can find the culprit quite easily.  Use cwb-s-decode to look for a huge <text> region (very probably the last one in the corpus), then check its id to locate the problematic file.

	cwb-s-decode CC-C -S text_id | tail -1

should do the trick (but do check whether that's really a very large region).
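
If the oversized region turns out not to be the last one, a small script can pick the longest region from the complete cwb-s-decode listing. The following is only a sketch: it assumes the listing has one region per line, with start position, end position and annotation value separated by tabs, and the function name largest_region is made up for this example.

```python
def largest_region(lines):
    """Return (length, start, end, value) of the longest region, or None.

    Assumes each line looks like "start<TAB>end<TAB>value"; lines that
    do not fit this pattern are skipped.
    """
    best = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        try:
            start, end = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # not a region line
        value = fields[2] if len(fields) > 2 else ""
        length = end - start + 1  # corpus positions are inclusive
        if best is None or length > best[0]:
            best = (length, start, end, value)
    return best
```

To use it, save the output of cwb-s-decode CC-C -S text_id to a file and pass the open file object to largest_region(); the value field of the result is then the id of the suspicious text.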


Based on recent experience, you should also look out for invalid control characters in your files.  We had similar problems with extra DEL chars (ASCII 127) before </text> tags:

	<text id="…" …>
	…
	DEL DEL </text>

Since the last line doesn't start with "<", it isn't recognized as an XML tag and is inserted as a literal token (without annotations, so lemma, syn, … will be __UNDEF__ at this point).  If you have indexed the corpus completely, you can also use CWB to search for such tokens:

	".*<.+>.*";
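
The same check can be run on the .vrt files before encoding. The sketch below is not part of CWB; it is a hypothetical helper (suspicious_lines is an invented name) that flags lines containing control characters such as DEL, as well as lines that contain something tag-like but do not start with "<".

```python
import re

# C0 control characters except tab, LF and CR, plus DEL (ASCII 127)
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Anything that looks like an XML tag somewhere in the line
TAG_LIKE = re.compile(r"<[^<>]+>")

def suspicious_lines(lines):
    """Return (line_number, reason, text) for every questionable line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if CONTROL_CHARS.search(text):
            hits.append((number, "control character", text))
        elif not text.startswith("<") and TAG_LIKE.search(text):
            # would be encoded as a literal token rather than as a tag
            hits.append((number, "tag not at line start", text))
    return hits
```

Printing the text field with repr() makes invisible characters such as '\x7f' show up in the report.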


> 2. If the CWB encoder is actually inserting missing </text> tags like the error message says it's doing, why is this problem happening at all?

It only inserted the single missing </text> tag at the end of the corpus.  CWB doesn't assume that .vrt files are independent components, so XML regions may very well span across multiple .vrt files.
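
Because the encoder does not check each file on its own, it can help to test the files individually beforehand. The following sketch assumes that every .vrt file is meant to be self-contained and that <text> elements are never legitimately nested; unclosed_text_regions is a name invented for this example.

```python
import re

OPEN_TEXT = re.compile(r"^<text(\s|>)")   # <text> or <text id="..." ...>
CLOSE_TEXT = re.compile(r"^</text>")

def unclosed_text_regions(lines):
    """Return the line numbers of <text> start tags that are never closed.

    A start tag counts as unclosed if another <text> start tag or the
    end of the input is reached before a matching </text>.
    """
    problems = []
    open_line = None
    for number, line in enumerate(lines, start=1):
        if OPEN_TEXT.match(line):
            if open_line is not None:
                problems.append(open_line)
            open_line = number
        elif CLOSE_TEXT.match(line):
            open_line = None
    if open_line is not None:
        problems.append(open_line)
    return problems
```

Running this over each file in turn and reporting the files with a non-empty result should point directly at the missing </text>.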

> And while the corpus compiles this way, less than 10% of the texts have the extensive metadata that makes them so very useful. So I need to deal with this somehow.

As a hotfix, just encode without the :0 nesting specifier, i.e.

	… -S text+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source

cwb-encode then assumes that <text> regions cannot be nested and will automatically insert a </text> end tag whenever it encounters a new <text> region.

Best,
Stefan

