[CWB] Indexing problems

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Jul 22 06:10:22 CEST 2010


Hi Eros,

That error message is not as clear as it might be, but having had a quick glance at the code, it seems that the warning is being raised for one of the following reasons:
1) you have an XML attribute that is not followed by an =
2) you have an XML attribute-value that is not terminated by a quote mark
3) you have an XML attribute or value that is an empty string

A common cause of these sorts of problem is the open-angle-bracket character < occurring within the text itself without being escaped as either &lt; or &#x3c;. This causes the following text to be parsed as XML, almost inevitably leading to a parse error since it ISN'T XML.

You often get stray < characters if your source corpus includes mathematical formulae or probability-level statements: any instance of, say, p < 0.05 in the original text will mess up the XML encoding.

Can you check whether this is the case? (Note that the < character may not be on the specific line on which the error is detected; it may be shortly before.)
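
Since the input file is very large, a short script may be quicker than checking by eye. The following is only a rough sketch in Python, assuming the input has one token per line with XML tags on lines of their own. The regular expression is my approximation of a well-formed tag line based on the three causes listed above; it is not the parser that cwb-encode itself uses, and the function name suspicious_lines is made up for this example.

```python
import re
import sys

# Rough approximation of a well-formed tag line: </name>, <name attr="value" ...> or <name/>.
# Assumption: this is NOT cwb-encode's own parser; it only mirrors the three causes above
# (attribute without =, unterminated quoted value, empty value).
TAG_LINE = re.compile(
    r'''^<(?:
        /[A-Za-z_][\w.-]*\s*                                   # closing tag
      | [A-Za-z_][\w.-]*                                       # element name
        (?:\s+[A-Za-z_][\w.-]*\s*=\s*(?:"[^"<]+"|'[^'<]+'))*   # attribute with non-empty quoted value
        \s*/?                                                  # optional self-closing slash
    )>\s*$''',
    re.VERBOSE,
)


def suspicious_lines(lines):
    """Yield (line_number, text) for lines that start with '<' but do not look like a well-formed tag."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.startswith("<") and not TAG_LINE.match(text):
            yield number, text


if __name__ == "__main__":
    # Usage: python find_stray_lt.py parapedia_en.tgd   (the script name is hypothetical)
    if len(sys.argv) == 2:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            for number, text in suspicious_lines(handle):
                print(f"{number}: {text}")
```

Any line it reports that is really a token (for example a literal < from "p < 0.05") would need the < replaced with &lt; before encoding. The script only looks at the start of each line, so it will not notice a < in the middle of a token line.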

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Serge Heiden
Sent: 21 July 2010 19:11
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Indexing problems

Eros,

I don't know whether the error report is accurate, but
it is difficult to analyze without an excerpt of your
parapedia_en.tgd file.
Are you sure that all your <text> tags have a corresponding
</text> closing tag?
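
One way to check this without reading the whole file is a small script along these lines. It is a minimal sketch in Python, assuming every <text ...> and </text> tag sits on a line of its own and that <text> regions are not supposed to nest (which the text:0 declaration in the command seems to imply). The function name unbalanced_text_tags is made up for this example.

```python
import re

# Assumption: tags occupy a whole line, as in one-token-per-line input.
OPEN_TEXT = re.compile(r"^<text(?:\s[^>]*)?>\s*$")
CLOSE_TEXT = re.compile(r"^</text\s*>\s*$")


def unbalanced_text_tags(lines):
    """Return (line_number, problem) pairs for <text> regions that are nested, unopened, or unclosed."""
    problems = []
    open_at = None  # line number of the currently open <text>, if any
    for number, line in enumerate(lines, start=1):
        if OPEN_TEXT.match(line):
            if open_at is not None:
                problems.append(
                    (number, f"<text> opened while the one from line {open_at} is still open")
                )
            open_at = number
        elif CLOSE_TEXT.match(line):
            if open_at is None:
                problems.append((number, "</text> without a matching <text>"))
            open_at = None
    if open_at is not None:
        problems.append((open_at, "<text> never closed"))
    return problems
```

Calling it as unbalanced_text_tags(open("parapedia_en.tgd", encoding="utf-8", errors="replace")) would list the offending line numbers; an empty list means the <text> tags pair up under the assumptions above.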

Best,
Serge

Eros Zanchetta wrote:
> Hi there,
> 
> I hope someone can help me with this because it's driving me crazy. I'm 
> trying to encode a corpus with cwb-encode, the syntax I use is:
> 
> cwb-encode -d PARAPEDIA_EN -f parapedia_en.tgd -R 
> /usr/local/share/cwb/registry/parapedia_en -P pos -P lemma -S corpus -S 
> text:0+id+target+keywords -S s >parapedia_en_indexing.out 
> 2>parapedia_en_indexing.err
> 
> there appears to be something wrong with the corpus, unfortunately I 
> can't figure out what it is (I attached the error stream from the 
> encoding process to this e-mail).
> 
> What baffles me are the error reports. I assume that when it says:
> 
> Attributes of open tag <text ...> ignored because of syntax error (file 
> [...], line #1021648).
> 
> it means that at line 1021648 of the input file there is a <text> tag 
> with some kind of syntax error, but there's no <text> tag at that line 
> (I obviously tried a few other lines, but not all of them since it's a 
> very large file). Am I reading the error report wrong?
> 
> I use version 2.2.100 of cwb.
> 
> Thanks in advance,
> Eros
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
********
*FR* Merci d'utiliser ma nouvelle adresse mail slh at ens-lyon.fr ****
*EN* Please use my new email address slh at ens-lyon.fr           ****
********
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lsh.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883