[CWB] problems encoding xml header data

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon May 5 15:35:48 CEST 2014


Aaaaaah - the light is dawning now. One of us (probably me) has actually started adding support for empty elements:

Lines 1831-1833:
        /* first non-valid XML element name character must be whitespace or '>' or '/' (for empty XML element) */
        if (! ((buf[i] == ' ') || (buf[i] == '\t') || (buf[i] == '>') || (buf[i] == '/')) ) 
          i = k;                /* no valid element name found */

... but it does not seem to be finished: a few lines further down, the code handles only two possibilities (open, close) rather than three (open, close, empty).

I have, I think, now completed empty element support: cwb-encode should obediently create an s-attribute instance with identical start and end cpos.

              if (k == 1) {     /* XML start tag or empty tag */
                i++;            /* identify annotation string, i.e. tag attributes (if there are any) */
                while ((buf[i] == ' ') || (buf[i] == '\t')) /* skip whitespace between element name and first attribute */
                  i++;
                j = i + strlen(buf+i); /* find last '>' character on line */
                while ((j > i) && (buf[j] != '>'))
                  j--;
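                if (buf[j-1] == '/') {
                  /* empty tag : open and close */
                  buf[j-1] = '\0';
                  if (j - 1 < i)  /* '/' directly follows the element name, as in <br/>: no attributes */
                    i = j - 1;    /* point at the new terminator so the annotation string is empty, not ">" */
                  range_open(&ranges[rng], line, buf+i);
                  range_close(&ranges[rng], line);
                }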
                else {
                  /* start tag: open */
                  buf[j] = '\0';
                  range_open(&ranges[rng], line, buf+i);
                }
              }
              else {            /* XML end tag */
                range_close(&ranges[rng], line - 1); /* end tag belongs to previous line! */
              }
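
For anyone who wants to test the three cases (start, end, empty) outside cwb-encode, here is a minimal standalone sketch of the same classification. It is not CWB code: tag_kind, is_name_char() and classify_tag() are hypothetical names made up for this illustration, and the name-character test is a simplified ASCII-only assumption. It also covers the attribute-less form <br/>, where the '/' directly follows the element name.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration, not part of the CWB sources. */
typedef enum { TAG_NONE, TAG_START, TAG_END, TAG_EMPTY } tag_kind;

/* Simplified element-name test (assumption: ASCII only). */
static int is_name_char(unsigned char c) {
  return isalnum(c) || c == '_' || c == '-' || c == '.' || c == ':';
}

static int is_blank(char c) {
  return c == ' ' || c == '\t';
}

/* Classify one input line as start tag, end tag, empty element or plain text.
   For start tags and empty elements, the attribute string (without the
   closing '>' or '/>') is copied into attrs, truncated to attrs_size - 1. */
tag_kind classify_tag(const char *line, char *attrs, size_t attrs_size) {
  size_t k, i, j, end, len;
  int empty;

  if (attrs_size > 0)
    attrs[0] = '\0';
  if (line[0] != '<')
    return TAG_NONE;

  k = (line[1] == '/') ? 2 : 1;        /* 2 = end tag, 1 = start or empty tag */
  i = k;
  while (is_name_char((unsigned char) line[i]))
    i++;
  if (i == k)
    return TAG_NONE;                   /* no element name found */
  if (!(is_blank(line[i]) || line[i] == '>' || line[i] == '/'))
    return TAG_NONE;                   /* name followed by an invalid character */
  if (k == 2)
    return TAG_END;

  j = strlen(line);                    /* find last '>' character on the line */
  while (j > i && line[j] != '>')
    j--;
  if (line[j] != '>')
    return TAG_NONE;                   /* tag is never closed on this line */

  empty = (line[j - 1] == '/');        /* safe: j >= i >= 2 */
  end = empty ? j - 1 : j;             /* attribute text ends here (exclusive) */

  while (i < end && is_blank(line[i])) /* skip whitespace after the element name */
    i++;
  while (end > i && is_blank(line[end - 1])) /* trim whitespace before '/' or '>' */
    end--;

  len = end - i;
  if (attrs_size > 0) {
    if (len >= attrs_size)
      len = attrs_size - 1;
    memcpy(attrs, line + i, len);
    attrs[len] = '\0';
  }
  return empty ? TAG_EMPTY : TAG_START;
}
```

With this sketch, a line such as <beitrag jahr="9999"/> comes back as TAG_EMPTY with the attribute string jahr="9999", <br/> comes back as TAG_EMPTY with an empty attribute string, and </beitrag> comes back as TAG_END. A caller mimicking the encoder would then open and close the region on the same line for TAG_EMPTY, and close at line - 1 for TAG_END.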

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 05 May 2014 13:24
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] problems encoding xml header data


On 5 May 2014, at 11:45, Gertrud Faaß <faassg at uni-hildesheim.de> wrote:

> A typical entry looks as follows: <beitrag jahr="9999" land="USA C1USA" datum="99999999" ressort="Reise" ausgabe="99" seite="99" stichwort="Lufthansa/Streik" titel="Flug LH 99: gestrichen" unternehmen="" kateg1="Urlaub" quelle="zeitung 99/9999 vom 99.99.9999" untertitel="" kateg2="" autor="nachname, vorname"/>

Note that this is not a valid _open_ tag, but an empty XML element (< ... />).

CWB doesn't support empty elements, only regular open/close tags that are transformed into structural annotations.  I'm not sure how strict the XML tag parser is and whether it would throw a syntax error in this case.

Hope this helps,
Stefan

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

