[CWB] Strange behaviour of /region[]?

Thomas Proisl tsproisl at linguistik.uni-erlangen.de
Fri Feb 26 09:57:24 CET 2010


Hello everybody,

My name is Thomas Proisl. I'm a computational linguist from Erlangen and I'm 
new to the list ;o).

I have a question concerning the behaviour of the /region[] command. First, a 
little bit of background information:

I have a corpus that consists of XML-encoded files that validate against the 
following simple RNC schema:
start = element file {
      attribute name { xsd:ID },
      element s {
              attribute id { xsd:ID },
              attribute len { xsd:int },
              text
      }+
}

The corpus contains 6026213 sentences:
egrep -o '<s[^>]*>' * | wc -l
6026213
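
For anyone who wants to reproduce the count without egrep, the following is a 
minimal Python sketch of the same idea. The function names and the directory 
argument are made up for illustration. Note one difference: unlike egrep, which 
works line by line, this also counts a start tag whose attributes are spread 
over several lines.

```python
import re
from pathlib import Path

# Same pattern as in the egrep call above: any start tag beginning with "<s".
# With the schema shown, <s> is the only element whose name starts with "s".
S_TAG = re.compile(r"<s[^>]*>")


def count_s_tags(text):
    """Return the number of <s ...> start tags in a string."""
    return len(S_TAG.findall(text))


def count_in_directory(directory):
    """Sum the <s ...> start tags over all regular files in a directory.

    Assumes the files are UTF-8 encoded (an assumption, adjust as needed).
    """
    total = 0
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            total += count_s_tags(path.read_text(encoding="utf-8"))
    return total
```

Calling count_in_directory(".") in the corpus directory should then correspond 
to the egrep pipeline above.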

From what I've read in the manual, I expected
/region[s]
or
<s> []* </s>
to return all sentences. However:
A = /region[s];
size A;
6016995
A = <s> []* </s>;
size A;
6016995

cwb-s-decode, on the other hand, produces the expected number of 6026213 
regions (one output line per region):
cwb-s-decode TEST -S s | wc -l
6026213

Why does /region[s] return only 6016995 sentences? Is this normal behaviour?
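
One way to narrow this down would be to find out which regions are missing from 
the query result. The following is only a sketch under some assumptions: that 
each line printed by cwb-s-decode starts with the start and end corpus position 
of a region, separated by whitespace, and that the CQP result has been saved in 
the same two-column form (for example with something like 
tabulate A match, matchend > "cqp_regions.txt"; in CQP). The file and function 
names are hypothetical.

```python
def parse_regions(lines):
    """Parse lines of the form 'start end [...]' into a set of (start, end).

    Only the first two whitespace-separated fields are used; any further
    columns (such as an annotated value) are ignored.
    """
    regions = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        regions.add((int(fields[0]), int(fields[1])))
    return regions


def missing_regions(decoded_lines, cqp_lines):
    """Return the regions present in the decoded list but not in the CQP result."""
    return sorted(parse_regions(decoded_lines) - parse_regions(cqp_lines))


# Hypothetical usage, assuming the two files were created beforehand:
#   cwb-s-decode TEST -S s > s_regions.txt
#   (in CQP) tabulate A match, matchend > "cqp_regions.txt";
#
# with open("s_regions.txt") as dec, open("cqp_regions.txt") as cqp:
#     for start, end in missing_regions(dec, cqp):
#         print(start, end, end - start + 1)  # last column: length in tokens
```

Looking at the start positions and lengths of the missing regions might show 
whether they have something in common.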

Best regards
Thomas Proisl


-- 
Department Germanistik und Komparatistik
Professur für Computerlinguistik
Bismarckstr. 6
91054 Erlangen

Tel.: 09131 85-25908
Fax:  09131 85-29251
http://www.linguistik.uni-erlangen.de/clue/de/personen/thomas-proisl-ma.html