<div dir="ltr">I'm trying to encode a corpus made up of ~1.2 million tagged <font face="monospace, monospace">.vrt</font> files which have about 15 attributes in the <font face="monospace, monospace">&lt;text&gt;</font> tag. Each file starts with a tag like this:
<div><br></div><div><font face="monospace, monospace">&lt;text id="cc0699457" corpus="cc-c" tagger="connexor" label="prof_per_diar_gen_san_mrc" channel="written" audience="public" purpose="professional" type="press" medium="newspaper" field="journalism" area="news" location="stgo" source="em"&gt;</font><br></div><div><br></div><div>...and ends with this:</div><div><br></div><div><font face="monospace, monospace">&lt;/text&gt;<br></font></div><div><br></div><div>And I'm using the following options to encode it (I've split the lines for legibility):</div><div><br></div><div><font face="monospace, monospace">sudo /usr/local/cwb-3.4.11/bin/cwb-encode </font></div><div><font face="monospace, monospace"> -c utf8 </font></div><div><font face="monospace, monospace"> -d /home/usuario/corpus/cc-c/ims-data </font></div><div><font face="monospace, monospace"> -F /home/usuario/</font><span style="font-family:monospace,monospace">corpus</span><font face="monospace, monospace">/cc-c/src-txt</font></div><div><font face="monospace, monospace"> -R /usr/local/share/cwb/registry/cc-c </font></div><div><font face="monospace, monospace"> -xsB </font></div><div><font face="monospace, monospace"> -P lemma -P syn -P pos -P spos </font></div><div><font face="monospace, monospace"> -S s -S p -S text:0+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source</font><br></div><div><br></div><div><b>So the problem is this:</b> when I encode this corpus, CWB throws the following error:</div><div><br></div><div><div><font face="monospace, monospace">Warning: missing &lt;/text&gt; tag inserted at end of input.</font></div><div><font face="monospace, monospace">1184930 &lt;text&gt; regions dropped because of deep nesting.</font></div></div><div><br></div><div>Upon reading this, I assumed that my metadata tagging script had made a mess of things, omitting the <span style="font-family:monospace,monospace">&lt;/text&gt;</span> tag left, right and center. In over 90% of the files it tagged, in fact. 
That seemed fishy, so I started looking at the tail of the <font face="monospace, monospace">.vrt</font> files more or less randomly, and after half an hour I've yet to find a single case of a missing <span style="font-family:monospace,monospace">&lt;/text&gt;</span> tag. Not one. And while the corpus compiles this way, fewer than 10% of the texts have the extensive metadata that makes them so very useful. So I need to deal with this somehow.</div><div><br></div><div>So to better understand the nature of the problem, <b>two questions</b>:</div><div><br></div><div>1. Can a single missing closing tag really have this massive cascading effect on all texts processed after it?</div><div><br></div><div>2. If the CWB encoder is actually inserting missing <span style="font-family:monospace,monospace">&lt;/text&gt; </span>tags like the error message says it's doing, why is this problem happening at all? Am I analyzing the problem incorrectly, or is the encoder not doing what it says it's doing?</div><div><br></div><div>Thanks in advance,</div><div>Scott</div></div>