[CWB] Strange behaviour of /region[]?
Stefan Evert
stefanML at collocations.de
Fri Feb 26 16:00:09 CET 2010
> my name is Thomas Proisl, I'm a computational linguist from Erlangen
> and I'm new to the list ;o).
Welcome to the list! Nice to hear you're using the CWB.
> The corpus contains 6026213 sentences:
> egrep -o '<s[^>]*>' * | wc -l
> 6026213
>
> From what I've read in the manual, I expected
> /region[s]
> or
> <s> []* </s>
> to return all sentences. However:
> A = /region[s];
> size A;
> 6016995
> A = <s> []* </s>;
> size A;
> 6016995
Let's try the easiest explanation first: CQP has a built-in hard limit
on the length of a query match, which defaults to 100 tokens (to keep
queries like "is" []* "a_very_rare_word"; from running forever). So my
guess is that you've got some very long sentences that fail to match
because of this limit.
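One way to test this guess outside CQP is to count the long sentences
directly in the input files. Here is a rough sketch in Python, not
anything that ships with the CWB. It assumes the usual verticalized
format (one token per line, <s ...> and </s> on lines of their own),
treats any line that starts with "<" and ends with ">" as a tag, and
assumes that a sentence is lost when it has more than `limit` tokens.
The function name count_long_sentences is made up for illustration.

```python
def count_long_sentences(lines, limit=100):
    """Return (total_sentences, sentences_with_more_than_limit_tokens).

    Assumptions: one token per line, <s ...> and </s> tags on their own
    lines, and every line of the form <...> is a tag rather than a token.
    """
    total = 0
    too_long = 0
    length = None  # None while we are outside an <s> region

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        is_tag = line.startswith("<") and line.endswith(">")
        if is_tag and (line == "<s>" or line.startswith("<s ")):
            length = 0  # a new sentence starts here
        elif is_tag and line == "</s>":
            if length is not None:
                total += 1
                if length > limit:
                    too_long += 1
            length = None
        elif not is_tag and length is not None:
            length += 1  # an ordinary token inside a sentence

    return total, too_long


if __name__ == "__main__":
    sample = ["<s>", "Hello", "world", "</s>", "<s>", "a", "b", "c", "</s>"]
    print(count_long_sentences(sample, limit=2))  # (2, 1)
```

If the second number for your files comes out close to 9218
(6026213 - 6016995), that would support the hard limit explanation.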
A reliable and much more efficient way to get a list of all sentences is
A = <s> [] expand to s;
Can you check whether this query returns the expected number of
matches? If you want to stick with the original query, you can change
the hard limit with
set HardBoundary <n>;
Hope this helps & best wishes,
Stefan