[CWB] An easy way to find out about the no. of tokens in a subcorpus?
Stefan Evert
stefanML at collocations.de
Mon Mar 1 12:56:42 CET 2010
> may I ask if there is an easy way to find out about the number of
> tokens a subcorpus contains? So far, I only know the "size" command
> telling me the number of matches (or number of sentences when
> "expand to s" is used).
> Basic idea was to write out the sentences to a file and count them -
> but maybe there's a cheaper way?
Assuming that you mean a set of sentences by "subcorpus" (on which you
might perform a sensible subquery), my usual trick is as follows:
Subcorpus = Subcorpus expand to s;
tabulate Subcorpus match, matchend > "| awk '{n += $2 - $1 + 1} END {print n}'";
Of course, you can also expand to paragraphs or documents if you want
to build a subcorpus of that granularity.
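To see what the awk one-liner computes, here is a small sketch that can be run without CQP. It assumes that tabulate writes one line per match, with the two corpus positions separated by a tab. The function name count_tokens and the three position pairs are made up for illustration. Since matchend is the position of the last token of a match, each region contributes matchend - match + 1 tokens.

```shell
# Hypothetical helper (not part of CWB): reads "match<TAB>matchend" lines
# from stdin and prints the total number of tokens covered by the regions.
count_tokens() {
  # matchend is inclusive, hence the "+ 1"; "n + 0" prints 0 on empty input
  awk '{ n += $2 - $1 + 1 } END { print n + 0 }'
}

# Invented corpus positions for three sentences: 0-4, 5-11, 12-14
printf '0\t4\n5\t11\n12\t14\n' | count_tokens
```

For the three invented regions this prints 15, i.e. 5 + 7 + 3 tokens.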
Best,
Stefan