[CWB] finding number of types

Stefan Evert stefanML at collocations.de
Tue Dec 13 10:50:53 CET 2016


> On 13 Dec 2016, at 10:29, Stefania Spina <stefania.spina at unistrapg.it> wrote:
> 
> I'm running CWB 3.0 with a corpus of 350 texts, annotated with 350 different <text id="number"> tags.
> Can you suggest a way to extract the number of types for each of the 350 texts? 
> With   cwb-lexdecode -S   I only get the total number of types.

That's because CWB doesn't keep statistics on individual texts, only for the entire corpus.  You'll have to count types for each text yourself.

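On a Linux / Mac command line, the following should work. cwb-scan-corpus prints one line for each distinct combination of text ID and word form, so counting the lines per text ID gives the number of types in each text: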

	cwb-scan-corpus DICKENS text_id+0 word+0 | LANG=C cut -f2 | sort | uniq -c

This will count the literal word forms. If you want lemmatized or case-insensitive counts, annotate the lemma or a lowercased word form as a p-attribute and use it instead of word+0 above. Case-insensitive counts can also be achieved with some additional trickery on the command line.
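
One possible version of the command-line trickery for case-insensitive counts is sketched below. The function name count_types_ci is made up for illustration. The sketch assumes that cwb-scan-corpus writes frequency, text ID and word form separated by TABs, and it only folds ASCII letters because everything runs in the C locale. For accented characters you would need a UTF-8 locale and an awk that handles multibyte characters (e.g. gawk).

```shell
# Hypothetical helper (not part of CWB): reads cwb-scan-corpus output on stdin,
# assumed to be "frequency<TAB>text_id<TAB>word", and prints the number of
# case-folded types for each text ID.
count_types_ci() {
  # keep text ID and lowercased word form (ASCII-only folding in the C locale)
  LC_ALL=C awk -F'\t' '{ print $2 "\t" tolower($3) }' |
    # merge word forms that differ only in case
    LC_ALL=C sort -u |
    # keep the text ID only; identical IDs are now adjacent
    cut -f1 |
    # count the remaining lines (= types) per text ID
    uniq -c
}

# Usage, with the corpus name from the command above:
#   cwb-scan-corpus DICKENS text_id+0 word+0 | count_types_ci
```

The text IDs are left untouched; only the word column is lowercased before duplicates are removed.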

Best,
Stefan


	

