[CWB] problems encoding xml header data

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon May 5 13:10:13 CEST 2014


Hi Gertrud,

Here are the things that can cause that particular error:

- The encode tool reaches a position where it expects an equals sign, and does not find one
e.g. <xxx yyy"zzz" />

- A value is missing an end quote
e.g. <xxx yyy="zzz />

- An attribute name is of zero length
e.g. <xxx ="zzz" />

- A value is empty, but is not written explicitly as "" or ''
e.g. <xxx yyy= />
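
To make the four cases concrete, here is a rough Python sketch that classifies a single open tag. It is not the cwb-encode parser and does not reproduce its messages. The function name `diagnose_tag`, the returned strings, and the set of characters allowed in attribute names are all assumptions made for illustration. In particular, the sketch insists on quoted values, which may be stricter than what cwb-encode actually accepts.

```python
import re
from typing import Optional


def diagnose_tag(tag: str) -> Optional[str]:
    """Return a description of the first attribute problem found in an
    open tag, or None if the attributes look well-formed.

    Hypothetical helper: a simplified check, not cwb-encode's own logic.
    """
    m = re.match(r"<\s*([^\s/>]+)", tag)
    if not m:
        return "not an open tag"
    i, n = m.end(), len(tag)
    while True:
        # Skip whitespace between attributes.
        while i < n and tag[i].isspace():
            i += 1
        if i >= n:
            return "tag is not closed"
        if tag.startswith("/>", i) or tag[i] == ">":
            return None  # reached the end of the tag without problems
        # Read the attribute name (assumed character set).
        start = i
        while i < n and (tag[i].isalnum() or tag[i] in "_-.:"):
            i += 1
        if i == start:
            if tag[i] == "=":
                return "zero-length attribute name"
            return "unexpected character"
        while i < n and tag[i].isspace():
            i += 1
        if i >= n or tag[i] != "=":
            return "missing equals sign"
        i += 1
        while i < n and tag[i].isspace():
            i += 1
        # Assumption: every value must be quoted, even an empty one.
        if i >= n or tag[i] not in "\"'":
            return "value missing or not quoted"
        end = tag.find(tag[i], i + 1)
        if end == -1:
            return "missing end quote"
        i = end + 1


for example in ('<xxx yyy"zzz" />', '<xxx yyy="zzz />',
                '<xxx ="zzz" />', '<xxx yyy= />'):
    print(example, "->", diagnose_tag(example))
```

Running a check like this over the <beitrag ...> lines that cwb-encode complains about should show which of the four patterns is involved.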

I suggest you look at the *first* line of the file where the error is reported, and see which of these might apply. The rest are probably more examples of the same thing!
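
If the file is too large to open comfortably in an editor, a small helper along these lines can print one line by number. This is only a sketch: `get_line` is a made-up name, and it assumes the line number in the error message is 1-based and that the file is UTF-8.

```python
from itertools import islice
from typing import Optional


def get_line(path: str, line_number: int, encoding: str = "utf-8") -> Optional[str]:
    """Return the given line of a text file, or None if the file is shorter.

    Assumes line_number is 1-based. Reads lazily, so the whole file is
    never held in memory at once.
    """
    with open(path, encoding=encoding, errors="replace") as fh:
        return next(islice(fh, line_number - 1, line_number), None)
```

For example, `print(get_line("/resources/corpora/zeitung/source/zeitung.tagged", 1400718))` would show the line named in the error message quoted below.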

Alternatively, if you are using the cutting-edge version in SVN, you might find it handy to update and recompile: I have just amended the code for cwb-encode so that each of the four problems above now gives a slightly different error message, which should make problems easier to track down.

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of "Gertrud Faaß"
Sent: 05 May 2014 10:46
To: cwb at sslmit.unibo.it
Subject: [CWB] problems encoding xml header data

Dear all,
I am trying to encode data from a newspaper with cwb-encode while keeping all the available metadata flags. However, though I think I got the syntax right (see below), cwb-encode tells me that there are syntax errors. Unfortunately, the error message does not give any more details. Are there restrictions on the characters that may appear in such XML data? That is, could it cause problems when ":" or "/" appear in such attribute fields?

The error message looks as follows:

Attributes of open tag <beitrag ...> ignored because of syntax error (file /resources/corpora/zeitung/source/zeitung.tagged, line #1400718).

The encoding command is as follows:

cwb-encode -c utf8 -t /resources/corpora/zeitung/source/zeitung.tagged -d /resources/corpora/zeitung -R /resources/registry/zeitung -xsB -P pos -P lemma -P ressort -S s:0 -S p:0 -S beitrag:0+jahr+land+datum+ressort+ausgabe+seite+stichwort+titel+unternehmen+kateg1+quelle+untertitel+kateg2+autor

A typical entry looks as follows:

<beitrag jahr="9999" land="USA C1USA" datum="99999999" ressort="Reise" ausgabe="99" seite="99" stichwort="Lufthansa/Streik" titel="Flug LH 99: gestrichen" unternehmen="" kateg1="Urlaub" quelle="zeitung 99/9999 vom 99.99.9999" untertitel="" kateg2="" autor="nachname, vorname"/>

I'd be very thankful for help: there are several thousand such error messages, and I really do not know what problem to look for.

Kind regards & thanks a lot
Gertrud
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

