[CWB] problems encoding xml header data
Hardie, Andrew
a.hardie at lancaster.ac.uk
Mon May 5 15:35:48 CEST 2014
Aaaaaah - the light is dawning now. One of us (probably me) has actually started adding support for empty elements:
1831-1833:
    /* first non-valid XML element name character must be whitespace or '>' or '/' (for empty XML element) */
    if (! ((buf[i] == ' ') || (buf[i] == '\t') || (buf[i] == '>') || (buf[i] == '/')) )
      i = k; /* no valid element name found */
... but it does not seem to be finished, as a few lines down, there are only two possibilities (open, close) and not open/close/empty.
I have, I think, now completed empty element support: cwb-encode should obediently create an s-attribute instance with identical start and end cpos.
    if (k == 1) { /* XML start tag or empty tag */
      i++; /* identify annotation string, i.e. tag attributes (if there are any) */
      while ((buf[i] == ' ') || (buf[i] == '\t')) /* skip whitespace between element name and first attribute */
        i++;
      j = i + strlen(buf+i); /* find last '>' character on line */
      while ((j > i) && (buf[j] != '>'))
        j--;
      if (buf[j-1] == '/') {
        /* empty tag : open and close */
        buf[j-1] = '\0';
        range_open(&ranges[rng], line, buf+i);
        range_close(&ranges[rng], line);
      }
      else {
        /* start tag: open */
        buf[j] = '\0';
        range_open(&ranges[rng], line, buf+i);
      }
    }
    else { /* XML end tag */
      range_close(&ranges[rng], line - 1); /* end tag belongs to previous line! */
    }
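For anyone who wants to try the open/close/empty distinction outside cwb-encode, here is a minimal standalone sketch. It is not the cwb-encode code: the names tag_kind and classify_tag are made up for illustration, it copies into caller-supplied buffers instead of editing buf in place, and it assumes one tag per line with no surrounding whitespace.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; none of them exist in cwb-encode. */
typedef enum { TAG_NONE, TAG_OPEN, TAG_CLOSE, TAG_EMPTY } tag_kind;

static int is_blank(char c) { return c == ' ' || c == '\t'; }

/*
 * Classify a line that consists of a single XML tag.
 * On success the element name is copied to `name` and the raw attribute
 * string (possibly empty) to `attrs`. Returns TAG_NONE if the line does
 * not look like a tag or a buffer is too small.
 */
tag_kind classify_tag(const char *line, char *name, size_t name_len,
                      char *attrs, size_t attrs_len)
{
  size_t len, gt, i, start, end;
  int is_close = 0, is_empty = 0;

  if (name_len == 0 || attrs_len == 0)
    return TAG_NONE;
  name[0] = attrs[0] = '\0';

  len = strlen(line);
  if (len < 3 || line[0] != '<')
    return TAG_NONE;

  gt = len;                       /* find last '>' character on line */
  while (gt > 0 && line[gt - 1] != '>')
    gt--;
  if (gt == 0)
    return TAG_NONE;
  gt--;                           /* gt is now the index of that '>' */

  i = 1;
  if (line[i] == '/') {           /* end tag: </name> */
    is_close = 1;
    i++;
  }

  start = i;                      /* element name ends at blank, '/' or '>' */
  while (i < gt && !is_blank(line[i]) && line[i] != '/')
    i++;
  if (i == start || i - start >= name_len)
    return TAG_NONE;
  memcpy(name, line + start, i - start);
  name[i - start] = '\0';

  if (is_close) {                 /* only whitespace may follow the name */
    while (i < gt && is_blank(line[i]))
      i++;
    return (i == gt) ? TAG_CLOSE : TAG_NONE;
  }

  end = gt;
  if (gt > i && line[gt - 1] == '/') {  /* "/>" marks an empty element */
    is_empty = 1;
    end = gt - 1;
  }
  while (i < end && is_blank(line[i]))        /* skip leading whitespace */
    i++;
  while (end > i && is_blank(line[end - 1]))  /* trim trailing whitespace */
    end--;
  if (end - i >= attrs_len)
    return TAG_NONE;
  memcpy(attrs, line + i, end - i);
  attrs[end - i] = '\0';

  return is_empty ? TAG_EMPTY : TAG_OPEN;
}
```

A caller that gets TAG_EMPTY back would open and close the region at the same position, which is what the empty-tag branch above does with range_open and range_close. Unlike the in-place version, the sketch checks that a '>' was actually found before looking at the character in front of it.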
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 05 May 2014 13:24
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] problems encoding xml header data
On 5 May 2014, at 11:45, Gertrud Faaß <faassg at uni-hildesheim.de> wrote:
> A typical entry looks as follows: <beitrag jahr="9999" land="USA C1USA" datum="99999999" ressort="Reise" ausgabe="99" seite="99" stichwort="Lufthansa/Streik" titel="Flug LH 99: gestrichen" unternehmen="" kateg1="Urlaub" quelle="zeitung 99/9999 vom 99.99.9999" untertitel="" kateg2="" autor="nachname, vorname"/>
Note that this is not a valid _open_ tag, but an empty XML element (< ... />).
CWB doesn't support empty elements, only regular open/close tags that are transformed into structural annotations. I'm not sure how strict the XML tag parser is and whether it would throw a syntax error in this case.
Hope this helps,
Stefan