[CWB] Indexing problems

Stefan Evert stefanML at collocations.de
Thu Jul 22 11:54:40 CEST 2010


Hi Eros!

>  hope someone can help me with this because it's driving me crazy. I'm trying to encode a corpus with cwb-encode, the syntax I use is:
> 
> cwb-encode -d PARAPEDIA_EN -f parapedia_en.tgd -R /usr/local/share/cwb/registry/parapedia_en -P pos -P lemma -S corpus -S text:0+id+target+keywords -S s >parapedia_en_indexing.out 2>parapedia_en_indexing.err
> 
> there appears to be something wrong with the corpus, unfortunately I can't figure out what it is (I attached the error stream from the encoding process to this e-mail).
> 
> What baffles me is the error reports, I assume that when it says:
> 
> Attributes of open tag <text ...> ignored because of syntax error (file [...], line #1021648).
> 
> it means that at line 1021648 of the input file there is a <text> tag with some kind of syntax error,

Exactly.  cwb-encode has found a line starting with "<text ", tried to parse the tag attributes to extract the values of id="...", target="..." and keywords="...", and failed because of some format error (or some valid XML syntax that cwb-encode doesn't support).

Have you tried validating the input file with an XML parser?  If you didn't generate it with a proper XML library, there may be a stray " or similar character in one of the attribute values.  That's the most common reason for such errors.
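
If a full XML validation is impractical for a file of this size, a quick scan of the <text ...> lines alone may already locate the problem. The following is a minimal sketch in Python, not the parser cwb-encode actually uses: the function name suspicious_tags and the regular expression are illustrative assumptions, and the pattern is deliberately strict (double-quoted values only, no " inside a value), so it may flag tags that cwb-encode would in fact accept.

```python
import re

# One attribute of the form name="value", where the value contains no
# double quote. This is an assumed, deliberately strict pattern.
ATTR = r'\s+[A-Za-z_][\w.-]*="[^"]*"'

# A complete open tag on a line of its own: <name attr="..." ...>
OPEN_TAG = re.compile(r'<[A-Za-z_][\w.-]*(?:' + ATTR + r')*\s*/?>\s*')


def suspicious_tags(lines, tag="text"):
    """Yield (line_number, line) for open tags of `tag` that do not parse.

    `lines` is any iterable of strings, e.g. an open file object.
    Line numbers start at 1 and count newline-terminated lines.
    """
    prefix = "<" + tag
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Only look at open tags that carry attributes.
        if line.startswith(prefix + " ") or line.startswith(prefix + "\t"):
            if OPEN_TAG.fullmatch(line) is None:
                yield number, line


# Hypothetical usage on the input file from the command line above:
#   with open("parapedia_en.tgd", encoding="latin-1", newline="\n") as fh:
#       for number, line in suspicious_tags(fh):
#           print(number, line)
```

Opening the file with newline="\n" makes Python split lines on LF only, so the reported numbers should be comparable to those in the error messages, provided cwb-encode also counts LF-terminated lines. The latin-1 encoding is only a placeholder that never fails to decode; use the real encoding of the corpus if it is known.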


> but there's no <text> tag at that line (I obviously tried a few other lines, but not all of them since it's a very large file). Am I reading the error report wrong?

The line numbers should be correct.  cwb-encode used to report token numbers, but some years ago I added code that keeps track of the input line number, specifically for the purpose of making these error messages more useful.

If the line numbers are really off, possibly there's a CR/LF problem with the line endings, or there are oversized input lines (you're using an outdated version of the CWB which doesn't properly check this yet).
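
Both possibilities can be checked without cwb-encode by reading the file in binary mode. The sketch below is only an illustration: the function name line_ending_report is made up, and the default limit of 65536 bytes is an assumption for the sake of the example, not a documented cwb-encode value. A non-zero count of bare CR characters would mean that an editor treating CR as a line break shows different line numbers than a program that counts LF only.

```python
def line_ending_report(lines, max_bytes=65536):
    """Summarise line endings and over-long lines.

    `lines` is an iterable of bytes objects split on LF, e.g. a file
    opened with open(path, "rb"). `max_bytes` is an assumed limit.
    """
    report = {"lf": 0, "crlf": 0, "bare_cr": 0, "oversized": []}
    for number, raw in enumerate(lines, start=1):
        if raw.endswith(b"\r\n"):
            report["crlf"] += 1
            body = raw[:-2]
        elif raw.endswith(b"\n"):
            report["lf"] += 1
            body = raw[:-1]
        else:
            body = raw  # final line without a terminating newline
        # CR characters inside the line (old Mac-style line breaks).
        report["bare_cr"] += body.count(b"\r")
        if len(raw) > max_bytes:
            report["oversized"].append(number)
    return report


# Hypothetical usage:
#   with open("parapedia_en.tgd", "rb") as fh:
#       print(line_ending_report(fh))
```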

If you send me the first 3M lines of your input file, I can take a closer look at the problem.

> I use version 2.2.100 of cwb.

Can you try with 3.0.0?  You don't have to make a full install, just run the cwb-encode binary from the official release.

Cheers,
Stefan

