[CWB] Indexing problems
Eros Zanchetta
eros at sslmit.unibo.it
Thu Jul 22 12:05:18 CEST 2010
Andrew, Serge,
there were indeed a few rogue <text> elements (the corpus contains
wikipedia articles) that had escaped my regexp searches (I didn't expect
elements with no attributes...). There were also a few angle brackets.
Removing them solved most of my problems. I still get the syntax error
messages, but they are probably caused by stray double quotes in the
attributes (line numbers are still not very helpful in identifying the
problem, though...)
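
For anyone hitting the same thing, a check along the following lines may help locate the offending lines before running cwb-encode. It is only a rough sketch and not part of CWB: the function name check_lines and the regular expressions are my own, and it assumes the usual one-token-per-line input where every XML tag sits on a line of its own. It flags lines that start with "<" but are not well-formed tags (missing "=", unterminated quote), token lines containing a bare angle bracket, and tags that are nested or never closed (which is how a rogue <text> from the article text would show up).

```python
import re
import sys

# Assumption: one token or one XML tag per line; tag lines start with "<".
# An open tag is a name followed by zero or more name="value" pairs.
OPEN_TAG = re.compile(
    r'^<([A-Za-z_][\w.-]*)((?:\s+[A-Za-z_][\w.:-]*="[^"<>]*")*)\s*/?>$'
)
CLOSE_TAG = re.compile(r'^</([A-Za-z_][\w.-]*)\s*>$')


def check_lines(lines):
    """Return a list of (line_number, message, text) for suspicious lines."""
    problems = []
    open_at = {}  # tag name -> stack of line numbers where it was opened

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("<"):
            close = CLOSE_TAG.match(line)
            opening = OPEN_TAG.match(line)
            if close:
                stack = open_at.setdefault(close.group(1), [])
                if stack:
                    stack.pop()
                else:
                    problems.append((number, "close tag without open tag", line))
            elif opening:
                if not line.endswith("/>"):
                    stack = open_at.setdefault(opening.group(1), [])
                    if stack:
                        # e.g. a <text> from the article inside a real <text>
                        problems.append((number, "nested open tag", line))
                    stack.append(number)
            else:
                # stray "<" token, missing "=", unterminated quote, ...
                problems.append((number, "starts with '<' but is not a well-formed tag", line))
        elif "<" in line or ">" in line:
            problems.append((number, "bare angle bracket in token line", line))

    # Anything still open at the end of the input was never closed.
    for name, stack in open_at.items():
        for number in stack:
            problems.append((number, "open tag <%s> never closed" % name, ""))
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # errors="replace" so that a bad byte does not abort the scan
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, message, text in check_lines(handle):
            print("%d: %s: %s" % (number, message, text))
```

Run with the input file as the only argument, it prints one line per suspicious spot with the line number of the input file itself, so it does not depend on the line numbers cwb-encode reports. Whether cwb-encode actually complains about every case flagged here is something I have not verified.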
Thanks a lot for your help,
Eros
On 07/22/2010 06:10 AM, Hardie, Andrew wrote:
> Hi Eros,
>
> That error message is not as clear as it might be, but having had a quick glance at the code, it seems that the warning is being raised for one of the following reasons:
> 1) you have an XML attribute that is not followed by an =
> 2) you have an XML attribute-value that is not terminated by a quote mark
> 3) you have an XML attribute or value that is an empty string
>
> A common cause of these sorts of problems is the open-angle-bracket character < occurring within the text itself, but not safe-encoded as either &lt; or &#60; . This causes the following text to be parsed as XML - almost inevitably leading to a parse error since it ISN'T XML.
>
> You often get stray < characters if your source corpus includes mathematical formulae or probability-level statements -- any instance of, say, p < 0.05 in the original text will mess up the XML encoding.
>
> Can you check whether this is the case? (Note that the < character may not be on the specific line on which the error is detected - it may be shortly before. )
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Serge Heiden
> Sent: 21 July 2010 19:11
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Indexing problems
>
> Eros,
>
> I don't know if the error report is true or false but
> it is difficult to analyze without an excerpt of your
> parapedia_en.tgd file.
> Are you sure that all your <text> tags have a corresponding
> </text> ending tag ?
>
> Best,
> Serge
>
> Selon Eros Zanchetta:
>
>> Hi there,
>>
>> I hope someone can help me with this because it's driving me crazy. I'm
>> trying to encode a corpus with cwb-encode, the syntax I use is:
>>
>> cwb-encode -d PARAPEDIA_EN -f parapedia_en.tgd -R
>> /usr/local/share/cwb/registry/parapedia_en -P pos -P lemma -S corpus -S
>> text:0+id+target+keywords -S s >parapedia_en_indexing.out
>> 2>parapedia_en_indexing.err
>>
>> there appears to be something wrong with the corpus, unfortunately I
>> can't figure out what it is (I attached the error stream from the
>> encoding process to this e-mail).
>>
>> What baffles me is the error reports. I assume that when it says:
>>
>> Attributes of open tag <text ...> ignored because of syntax error (file
>> [...], line #1021648).
>>
>> it means that at line 1021648 of the input file there is a <text> tag
>> with some kind of syntax error, but there's no <text> tag at that line
>> (I obviously tried a few other lines, but not all of them since it's a
>> very large file). Am I reading the error report wrong?
>>
>> I use version 2.2.100 of cwb.
>>
>> Thanks in advance,
>> Eros
>>
>>
>