[CWB] Strange behaviour of /region[]?
Thomas Proisl
tsproisl at linguistik.uni-erlangen.de
Fri Feb 26 09:57:24 CET 2010
Hello everybody,
my name is Thomas Proisl. I'm a computational linguist from Erlangen and I'm
new to the list ;o).
I have a question concerning the behaviour of the /region[] command. First a
little bit of background information:
I have a corpus that consists of XML-encoded files that validate against the
following simple RNC schema:
start = element file {
  attribute name { xsd:ID },
  element s {
    attribute id { xsd:ID },
    attribute len { xsd:int },
    text
  }+
}
The corpus contains 6026213 sentences:
egrep -o '<s[^>]*>' * | wc -l
6026213
From what I've read in the manual, I expected
/region[s]
or
<s> []* </s>
to return all sentences. However:
A = /region[s];
size A;
6016995
A = <s> []* </s>;
size A;
6016995
cwb-s-decode, on the other hand, produces the expected number of 6026213
regions (one output line per region):
cwb-s-decode TEST -S s | wc -l
6026213
Why does /region[s] return only 6016995 sentences? Is this normal behaviour?
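
The two counts differ by 9218 regions (6026213 - 6016995). One way to narrow
this down is to list exactly which regions are missing from the CQP result.
The following is only a minimal sketch under some assumptions: that
cwb-s-decode prints one region per line as start and end corpus position
separated by whitespace, and that the CQP result has been dumped to a file
whose first two columns are match and matchend. The function names and the
example data are hypothetical.

```python
def read_regions(lines):
    """Parse lines of the form 'start end [...]' into a set of (start, end).

    Only the first two whitespace-separated fields are used, so extra
    columns (e.g. target/keyword in a CQP dump) are ignored.
    Note: a set collapses duplicate regions.
    """
    regions = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        regions.add((int(fields[0]), int(fields[1])))
    return regions


def missing_regions(decoded_lines, dumped_lines):
    """Return regions present in the decoded list but absent from the dump."""
    decoded = read_regions(decoded_lines)
    dumped = read_regions(dumped_lines)
    return sorted(decoded - dumped)


# Tiny made-up example (not real corpus data):
decoded_example = ["0\t4", "5\t9", "10\t12"]
dumped_example = ["0\t4\t-1\t-1", "10\t12\t-1\t-1"]
example_missing = missing_regions(decoded_example, dumped_example)
print(example_missing)
```

On real data, the two inputs could presumably be produced with
cwb-s-decode TEST -S s > s_regions.txt
and, inside CQP,
dump A > "cqp_regions.txt";
and then passed to missing_regions() as open file objects. Looking at the
missing regions in the source files might show what they have in common.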
Best regards
Thomas Proisl
--
Department Germanistik und Komparatistik
Professur für Computerlinguistik
Bismarckstr. 6
91054 Erlangen
Tel.: 09131 85-25908
Fax: 09131 85-29251
http://www.linguistik.uni-erlangen.de/clue/de/personen/thomas-proisl-ma.html