[CWB] Corpus statistics

Sun Mar 23 14:52:20 CET 2014

> I was wondering whether I had missed something when reading the CWB documentation, or whether there simply is no trivial way to generate per-text corpus statistics (e.g. text_id, text_author, word_count, types_count etc.). I have already tried both the external (cwb-scan-corpus) and the internal (query = []; then tabulate) approach, but without major success. I have also started to analyse the CQPweb PHP scripts in order to see how it populates its MySQL tables with frequency data, but that is not precisely what I was looking for (I am still digging, though).

You seem to be asking for two different things here:

(a) A metadata table that associates each text (ID) with text-level metadata such as "author", "genre", etc.

(b) Various kinds of type-token statistics for each text.  

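Concerning (a), if the metadata are encoded in the CWB index as s-attributes (text_id, text_author, ...), you can easily "tabulate" them within CQP. The query matches the first token of each text, so you get one row per text: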

	Texts = <text> [];
	tabulate Texts match text_id, match text_author, ... ;

Concerning (b), the CWB doesn't keep track of per-text corpus statistics (it doesn't even have a built-in notion of "text": <text> is just one s-attribute among others).  CQPweb keeps full word frequency counts for each text in its internal database, from which most type-token statistics can be derived.  I'm not sure whether there's a way to access them directly, though.

To generate the necessary counts, CQPweb runs through the full corpus and collects the frequency counts in hash variables.  As Yannick suggested, it is fairly easy to do this from Perl, Python or R using the low-level corpus access APIs.  You can also do this quite efficiently from the command line with cwb-scan-corpus:

	cwb-scan-corpus -o text_word_counts.gz CORPUS text_id+0 word+0

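will produce frequency counts for every combination of text_id and word form that occurs in the corpus.  Each output line holds the frequency, the text ID and the word form, separated by TABs.  The lines are saved in unsorted order, so you'll probably want to sort by the second column (the text ID) and, within each text, by descending frequency: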

	cwb-scan-corpus CORPUS text_id+0 word+0 | sort -k2,2 -k1,1nr | gzip > text_word_counts.gz
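
For comparison, the hash-based counting mentioned above can be sketched in awk.  This is only an illustration, not CQPweb's actual code: the function name count_by_text is made up, and the input is assumed to be a stream with one token per line in the form "text ID <TAB> word form", however you produce it (e.g. with one of the low-level APIs).  The output uses the same layout and ordering as the sorted table above.

```shell
# Hypothetical helper (not part of the CWB): build a per-text word
# frequency table from a token stream.
# Assumed input on stdin: one token per line, "text ID <TAB> word form".
# Output: "frequency <TAB> text ID <TAB> word form", sorted by text ID
# and, within each text, by descending frequency.
count_by_text() {
	awk -F'\t' '
		{ freq[$1 FS $2]++ }    # hash keyed on the (text ID, word) pair
		END { for (key in freq) printf "%d\t%s\n", freq[key], key }
	' | LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k1,1nr
}
```

The hash holds one entry per distinct (text ID, word) pair, so memory use grows with the size of the frequency table rather than with the number of tokens.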

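For type counts, check how often each text ID occurs in the table (this relies on the table being sorted by text ID, so that uniq sees each ID as one contiguous run):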

	gzip -cd  text_word_counts.gz | cut -f2 | uniq -c

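For token counts, add up the word frequencies (first column) for each text ID, or get them directly from cwb-s-decode, which prints the start and end corpus position of each text region, so that the region size is end - start + 1: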

	cwb-s-decode CORPUS -S text_id | awk '{print $2 - $1 + 1, $3}'
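
Token and type counts can also be collected in a single pass over the frequency table.  The following is only a rough sketch: the function name per_text_stats and the output layout (text ID, tokens, types, type-token ratio) are made up for illustration, and the input is assumed to be the TAB-separated table (frequency, text ID, word form) produced by cwb-scan-corpus above.  Unlike the uniq approach, it does not depend on the table being sorted.

```shell
# Hypothetical helper (not part of the CWB): per-text token count,
# type count and type-token ratio from a frequency table on stdin.
# Assumed input columns (TAB-separated): frequency, text ID, word form.
per_text_stats() {
	awk -F'\t' '
		{
			tokens[$2] += $1    # tokens: sum of the word frequencies
			types[$2]++         # types: one table row per distinct word
		}
		END {
			for (id in tokens)
				printf "%s\t%d\t%d\t%.4f\n", id, tokens[id], types[id], types[id] / tokens[id]
		}
	' | LC_ALL=C sort
}

# Usage, with the file name from the commands above:
#   gzip -cd text_word_counts.gz | per_text_stats > text_stats.tsv
```

The token counts obtained this way should agree with the region sizes reported by cwb-s-decode, which makes for an easy sanity check.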

Hope this helps,
Stefan