[CWB] Trying to deal with tag problems when encoding

Fri May 19 05:52:07 CEST 2017

I'm trying to encode a corpus made up of ~1.2 million tagged .vrt files
which have about 15 attributes in the <text> tag. Each file starts with a
tag like this:

<text id="cc0699457" corpus="cc-c" tagger="connexor"
label="prof_per_diar_gen_san_mrc" channel="written" audience="public"
purpose="professional" type="press" medium="newspaper" field="journalism"
area="news" location="stgo" source="em">

...and ends with this:

</text>

And I'm using the following options to encode it (I've split the lines for
legibility):

sudo /usr/local/cwb-3.4.11/bin/cwb-encode
  -c utf8
  -d /home/usuario/corpus/cc-c/ims-data
  -F /home/usuario/corpus/cc-c/src-txt
  -R /usr/local/share/cwb/registry/cc-c
  -xsB
  -P lemma -P syn -P pos -P spos
  -S s -S p
  -S text:0+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source
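
One thing that can be ruled out mechanically is a mismatch between the
attributes in the <text> tag and the ones declared after -S. The following
is a minimal sketch of such a comparison; the helper names (tag_attributes,
declared_attributes, compare) are hypothetical and not part of CWB, and it
assumes all attribute values are double-quoted as in the example above.

```python
import re

def tag_attributes(start_tag):
    """Return the attribute names found in an XML start tag, in order."""
    # Assumes every value is double-quoted, as in the example <text> tag.
    return re.findall(r'([\w.-]+)\s*=\s*"[^"]*"', start_tag)

def declared_attributes(spec):
    """Split a -S declaration such as 'text:0+id+corpus' into attribute names."""
    # The first piece ('text:0') is the region name and nesting depth, not an attribute.
    return spec.split("+")[1:]

def compare(start_tag, spec):
    """Return (names only in the tag, names only in the declaration)."""
    in_tag = set(tag_attributes(start_tag))
    declared = set(declared_attributes(spec))
    return sorted(in_tag - declared), sorted(declared - in_tag)

TAG = '''<text id="cc0699457" corpus="cc-c" tagger="connexor"
label="prof_per_diar_gen_san_mrc" channel="written" audience="public"
purpose="professional" type="press" medium="newspaper" field="journalism"
area="news" location="stgo" source="em">'''

SPEC = ("text:0+id+corpus+tagger+label+channel+audience+purpose+type"
        "+medium+field+area+location+source")

if __name__ == "__main__":
    only_in_tag, only_in_spec = compare(TAG, SPEC)
    print("in tag but not declared:", only_in_tag)
    print("declared but not in tag:", only_in_spec)
```

For the tag and declaration quoted in this message, both lists come out
empty, so the declaration itself does not look like the culprit.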

*So the problem is this:* when I encode this corpus, CWB prints the
following warning:

Warning: missing </text> tag inserted at end of input.
1184930 <text> regions dropped because of deep nesting.

Upon reading this, I assumed that my metadata tagging script had made a
mess of things, omitting the </text> tag left, right and center. In over
90% of the files it tagged, in fact. That seemed fishy, so I started
looking at the tail of the .vrt files more or less randomly, and after half
an hour I've yet to find a single case of a missing </text> tag. Not one.
And while the corpus compiles this way, less than 10% of the texts have the
extensive metadata that makes them so very useful. So I need to deal with
this somehow.
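
A systematic scan would be more reliable than sampling file tails by hand.
The following is a minimal sketch, assuming each .vrt file should contain
exactly one <text ...> start tag and exactly one </text> end tag, each
beginning on its own line. The function names check_vrt and scan are
hypothetical helpers, not part of CWB, and the directory passed on the
command line is whatever holds the source files.

```python
import sys
from pathlib import Path

def check_vrt(path):
    """Return a list of problems found in one .vrt file (empty if none)."""
    opens = 0
    closes = 0
    last = ""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            stripped = line.strip()
            if not stripped:
                continue  # ignore blank lines, including trailing ones
            last = stripped
            if stripped.startswith("<text ") or stripped.startswith("<text>"):
                opens += 1
            elif stripped == "</text>":
                closes += 1
    problems = []
    if opens != 1:
        problems.append(f"{opens} <text> start tags")
    if closes != 1:
        problems.append(f"{closes} </text> end tags")
    if last != "</text>":
        problems.append("last non-empty line is not </text>")
    return problems

def scan(directory):
    """Map each problematic .vrt file under directory to its list of problems."""
    report = {}
    for path in sorted(Path(directory).rglob("*.vrt")):
        problems = check_vrt(path)
        if problems:
            report[str(path)] = problems
    return report

if __name__ == "__main__" and len(sys.argv) > 1:
    for name, problems in scan(sys.argv[1]).items():
        print(name + ": " + "; ".join(problems))
```

If this reports nothing for the whole source directory, the files are
balanced one by one, and the cause has to be sought in how the input is
read rather than in the tagging script.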

So to better understand the nature of the problem, *two questions*:

1. Can a single missing closing tag really have this massive cascading
effect on all texts processed after it?

2. If the CWB encoder is actually inserting missing </text> tags, as the
warning says it is, why is this problem happening at all? Am I analyzing
the problem incorrectly, or is the encoder not doing what it says it's
doing?

Thanks in advance,
Scott