[CWB] problems encoding xml header data

"Gertrud Faaß" faassg at uni-hildesheim.de
Mon May 5 11:45:49 CEST 2014


Dear all,
I am trying to encode newspaper data with cwb-encode while keeping all of the available metadata. Although I think I got the syntax right (see below), cwb-encode reports syntax errors, and the error message gives no further details. Are there restrictions on the characters that may appear in such XML attributes, i.e. could a ":" or "/" in an attribute value cause problems?

The error message looks as follows:

    Attributes of open tag <beitrag ...> ignored because of syntax error (file /resources/corpora/zeitung/source/zeitung.tagged, line #1400718).
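
With several thousand messages of this kind, it may help to first collect the affected line numbers in one place. The following is only a minimal sketch in Python, not part of CWB: it assumes the messages have been saved to a text file and that they all follow the wording quoted above (other cwb-encode versions may phrase them differently). The names `LINE_REF` and `failing_lines` are made up for illustration.

```python
import re

# Pattern taken from the message quoted above; this is an assumption about
# the message format, not a documented cwb-encode interface.
LINE_REF = re.compile(r"\(file (?P<path>.+?), line #(?P<line>\d+)\)")


def failing_lines(log_text):
    """Return the sorted, de-duplicated input line numbers named in the log."""
    return sorted({int(m.group("line")) for m in LINE_REF.finditer(log_text)})
```

The resulting list can then be used to pull the corresponding lines out of the input file and compare them with entries that were accepted.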

The encoding command is as follows:

    cwb-encode -c utf8 -t /resources/corpora/zeitung/source/zeitung.tagged -d /resources/corpora/zeitung -R /resources/registry/zeitung -xsB -P pos -P lemma -P ressort -S s:0 -S p:0 -S beitrag:0+jahr+land+datum+ressort+ausgabe+seite+stichwort+titel+unternehmen+kateg1+quelle+untertitel+kateg2+autor

A typical entry looks as follows:

    <beitrag jahr="9999" land="USA C1USA" datum="99999999" ressort="Reise" ausgabe="99" seite="99" stichwort="Lufthansa/Streik" titel="Flug LH 99: gestrichen" unternehmen="" kateg1="Urlaub" quelle="zeitung 99/9999 vom 99.99.9999" untertitel="" kateg2="" autor="nachname, vorname"/>
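
One way to see which entries differ from this typical form is to test every `<beitrag ...>` line against a deliberately strict pattern of `name="value"` pairs. The sketch below is hypothetical and is not cwb-encode's own parser, which may accept or reject other forms. It only flags tags that deviate from the pattern, for instance because a value contains an unescaped double quote. Under this pattern, ":" and "/" inside a quoted value are not flagged, which says nothing about how cwb-encode itself treats them.

```python
import re

# Strict approximation (an assumption, not CWB's grammar): the tag name,
# zero or more name="value" pairs whose values contain no '"', '<' or '>',
# then an optional '/' before the closing '>'.
STRICT_TAG = re.compile(r'<beitrag(?:\s+[A-Za-z_][\w-]*="[^"<>]*")*\s*/?>')


def suspicious_lines(lines):
    """Yield (line_number, text) for <beitrag ...> lines that deviate from STRICT_TAG."""
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text.startswith("<beitrag") and not STRICT_TAG.fullmatch(text):
            yield number, text

# Possible use on the input file (path as in the command above):
#   with open("/resources/corpora/zeitung/source/zeitung.tagged", encoding="utf-8") as f:
#       for number, text in suspicious_lines(f):
#           print(number, text)
```

If the line numbers reported this way match those in the error messages, the flagged lines should show what the rejected tags have in common.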

I'd be very grateful for any help: there are several thousand such error messages, and I really do not know what problem to look for.

Kind regards & thanks a lot
Gertrud

