[CWB] An easy way to find out about the no. of tokens in a subcorpus?

Stefan Evert stefanML at collocations.de
Mon Mar 1 12:56:42 CET 2010


> may I ask if there is an easy way to find out about the number of  
> tokens a subcorpus contains? So far, I only know the "size" command  
> telling me the number of matches (or number of sentences when  
> "expand to s" is used).
> Basic idea was to write out the sentences to a file and count them -  
> but maybe there's a cheaper way?

Assuming that you mean a set of sentences by "subcorpus" (on which you  
might perform a sensible subquery), my usual trick is as follows:

	Subcorpus = Subcorpus expand to s;
	tabulate Subcorpus match, matchend > "| awk '{n += $2 - $1 + 1} END {print n}'";

Of course, you can also expand to paragraphs or documents if you want  
to build a subcorpus of that granularity.
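
In case the awk part looks cryptic: tabulate prints one row per match with
the corpus positions of its first and last token, and since both ends are
inclusive each row contributes matchend - match + 1 tokens. Here is a
minimal sketch you can run outside CQP to check the arithmetic. The
count_tokens helper name and the three sample rows are made up for
illustration; real input would come from the tabulate command above.

```shell
# Hypothetical helper: sums the inclusive span lengths of "match matchend" rows.
# "n + 0" makes awk print 0 instead of an empty line when there is no input.
count_tokens() {
    awk '{n += $2 - $1 + 1} END {print n + 0}'
}

# Invented sample rows: three sentences spanning positions 0-4, 5-11 and 12-12,
# i.e. 5 + 7 + 1 tokens.
printf '0\t4\n5\t11\n12\t12\n' | count_tokens
```

For the sample rows this prints 13.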

Best,
Stefan


