[CWB] Trying to deal with tag problems when encoding
Scott Sadowsky
ssadowsky at gmail.com
Fri May 19 05:52:07 CEST 2017
I'm trying to encode a corpus made up of ~1.2 million tagged .vrt files
which have about 15 attributes in the <text> tag. Each file starts with a
tag like this:
<text id="cc0699457" corpus="cc-c" tagger="connexor"
label="prof_per_diar_gen_san_mrc" channel="written" audience="public"
purpose="professional" type="press" medium="newspaper" field="journalism"
area="news" location="stgo" source="em">
...and ends with this:
</text>
And I'm using the following options to encode it (I've split the lines for
legibility):
sudo /usr/local/cwb-3.4.11/bin/cwb-encode \
  -c utf8 \
  -d /home/usuario/corpus/cc-c/ims-data \
  -F /home/usuario/corpus/cc-c/src-txt \
  -R /usr/local/share/cwb/registry/cc-c \
  -xsB \
  -P lemma -P syn -P pos -P spos \
  -S s -S p \
  -S text:0+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source
*So the problem is this:* when I encode this corpus, CWB throws the
following error:
Warning: missing </text> tag inserted at end of input.
1184930 <text> regions dropped because of deep nesting.
Upon reading this, I assumed that my metadata tagging script had made a
mess of things, omitting the </text> tag left, right and center. In over
90% of the files it tagged, in fact. That seemed fishy, so I started
looking at the tail of the .vrt files more or less randomly, and after half
an hour I've yet to find a single case of a missing </text> tag. Not one.
And while the corpus compiles this way, less than 10% of the texts have the
extensive metadata that makes them so very useful. So I need to deal with
this somehow.
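Since spot-checking file tails by hand doesn't scale to ~1.2 million files, a systematic check seems worth trying. What follows is only a rough sketch, not something I've run on the full corpus: it walks through every .vrt file in a directory and reports any file where a <text> is opened inside another open <text>, where a </text> appears with nothing open, or where a <text> is still open at the end of the file. The function names (check_vrt, scan_directory) and the "*.vrt" glob pattern are my own choices for illustration, and the regular expression assumes the literal string "<text" never occurs as a token.

```python
import re
from pathlib import Path

# Matches "<text" or "</text" followed by whitespace or ">".
# Group 1 is "/" for a closing tag and "" for an opening tag.
TEXT_TAG = re.compile(r"<(/?)text(?=[\s>])")


def check_vrt(path):
    """Return a list of (line_number, problem) tuples for one .vrt file."""
    problems = []
    depth = 0       # number of <text> regions currently open
    last_line = 0   # last line number seen (0 for an empty file)
    with open(path, encoding="utf-8", errors="replace") as handle:
        for last_line, line in enumerate(handle, start=1):
            for match in TEXT_TAG.finditer(line):
                if match.group(1):  # closing tag
                    if depth == 0:
                        problems.append((last_line, "</text> without open <text>"))
                    else:
                        depth -= 1
                else:  # opening tag
                    if depth > 0:
                        problems.append((last_line, "<text> opened inside open <text>"))
                    depth += 1
    if depth > 0:
        problems.append((last_line, "missing </text> at end of file"))
    return problems


def scan_directory(src_dir, pattern="*.vrt"):
    """Map file name -> list of problems, for files that have any."""
    report = {}
    for path in sorted(Path(src_dir).glob(pattern)):
        problems = check_vrt(path)
        if problems:
            report[path.name] = problems
    return report
```

Calling scan_directory("/home/usuario/corpus/cc-c/src-txt") should return an empty dict if every file is balanced; otherwise the keys are the names of the files to look at first.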
So to better understand the nature of the problem, *two questions*:
1. Can a single missing closing tag really have this massive cascading
effect on all texts processed after it?
2. If the CWB encoder is actually inserting missing </text> tags like the
error message says it's doing, why is this problem happening at all? Am I
analyzing the problem incorrectly, or is the encoder not doing what it says
it's doing?
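To make question 1 concrete, here is a toy model of the behaviour I suspect. It is NOT taken from the cwb-encode source. It rests on two assumptions of mine: that each </text> closes the innermost open <text> region, and that any <text> opened at a nesting level deeper than the one declared with -S text:0 is dropped. The names simulate_encoder and corpus_tags are made up for this example.

```python
def corpus_tags(n_files, missing_close_at=None):
    """Build the tag sequence for n_files files, one <text> region each.

    The file at index missing_close_at (0-based) lacks its </text>.
    """
    tags = []
    for index in range(n_files):
        tags.append("open")
        if index != missing_close_at:
            tags.append("close")
    return tags


def simulate_encoder(tags, max_nesting=0):
    """Toy model only; an assumption about the encoder, not its real logic.

    Returns (stored, dropped, inserted_at_end).
    """
    depth = 0     # number of <text> regions currently open
    stored = 0    # regions opened at an allowed nesting level
    dropped = 0   # regions opened deeper than max_nesting
    for tag in tags:
        if tag == "open":
            if depth > max_nesting:
                dropped += 1
            else:
                stored += 1
            depth += 1
        elif tag == "close" and depth > 0:
            depth -= 1  # assumed to close the innermost open region
    # Regions still open at end of input need a closing tag inserted.
    return stored, dropped, depth


# Five files, the second one missing its </text>:
print(simulate_encoder(corpus_tags(5, missing_close_at=1)))  # (2, 3, 1)
# Five well-formed files:
print(simulate_encoder(corpus_tags(5)))  # (5, 0, 0)
```

If the real encoder works anything like this, a single unclosed <text> would leave every later region nested one level too deep, and only one closing tag would be inserted at the very end of the input. That would match the single warning and the large dropped count I'm seeing, but I can't confirm that this is what actually happens.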
Thanks in advance,
Scott