[CWB] Corpus statistics

Sun Mar 23 22:04:43 CET 2014

Dear Yannick and Stefan,

Thank you both for your feedback. I have implemented Stefan's algorithm
(many thanks!) in R to calculate statistics from CWB-generated frequency
tables, and it works just great. I am working on a 100-million-token
corpus, so dealing with raw data (the corpus read into an R data frame) has
always been extremely tedious, which made any calculation hardly feasible. I
had already used "cwb-scan-corpus" as well as the "tabulate" solution;
however, astonished by how fast cwb-scan-corpus is, I thought that there
might be some undocumented way to do this in one pass. Your
clarification helped me a lot!
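
For anyone reading along, the post-processing of such a frequency table can be sketched in a few lines. This is a Python illustration, not the R code mentioned above. It assumes the table is tab-separated with the frequency in column 1, the text ID in column 2 and the word form in column 3 (which is what the sort and cut commands quoted below imply), and that the file is UTF-8. The function names are made up for this example.

```python
import gzip
from collections import defaultdict


def read_table(path):
    """Yield lines from a gzip-compressed table (UTF-8 is an assumption)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from fh


def per_text_stats(lines):
    """Map text_id -> (types, tokens, type-token ratio).

    Each input line is expected to look like: frequency<TAB>text_id<TAB>word
    """
    types = defaultdict(int)
    tokens = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        freq, text_id, _word = line.split("\t", 2)
        types[text_id] += 1           # one row per distinct word form in a text
        tokens[text_id] += int(freq)  # frequencies add up to the token count
    return {tid: (types[tid], tokens[tid], types[tid] / tokens[tid])
            for tid in types}
```

Calling per_text_stats(read_table("text_word_counts.gz")) would then give all three figures in a single pass, and the input does not have to be sorted.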

Thank you once again for your help,
Chris

2014-03-23 14:52 GMT+01:00 Stefan Evert <stefanML at collocations.de>:

>
> > I was wondering whether I had missed something when reading the CWB
> > documentation, or whether there is simply no trivial way to generate
> > per-text corpus statistics (e.g. text_id, text_author, word_count,
> > types_count etc.). I have already tried both the external
> > (cwb-scan-corpus) and the internal (query = []; then tabulate) approach,
> > but without major success. I have also started to analyse the CQPweb PHP
> > scripts in order to see how it populates MySQL tables with frequency
> > data, but that is not precisely what I was looking for (I am still
> > digging, though).
>
> You seem to be asking for two different things here:
>
> (a) A metadata table that associates each text (ID) with text-level
> metadata such as "author", "genre", etc.
>
> (b) Various kinds of type-token statistics for each text.
>
>
> Concerning (a), if the metadata are encoded in the CWB index, you can
> easily "tabulate" them within CQP:
>
> > Texts = <text> [];
> > tabulate Texts match text_id, match text_author, ... ;
>
>
> Concerning (b), the CWB doesn't keep track of per-text corpus statistics
> (and it doesn't have a notion of "text" in the first place, anyway).
> CQPweb keeps full word frequency counts for each text in its internal
> database, from which most type-token statistics can be derived.  I'm not
> sure if there's a way to access them directly, though.
>
> To generate the necessary counts, CQPweb runs through the full corpus and
> collects the frequency counts in hash variables.  As Yannick suggested, it
> is fairly easy to do this from Perl, Python or R using the low-level corpus
> access APIs.  You can also do this quite efficiently from the command line
> with cwb-scan-corpus:
>
>         cwb-scan-corpus -o text_word_counts.gz CORPUS text_id+0 word+0
>
> will produce frequency counts for every combination of text_id and word
> form that occurs in the corpus.  They are saved in unsorted order, so
> you'll probably want to sort by the second column (the text ID):
>
>         cwb-scan-corpus CORPUS text_id+0 word+0 | sort -k2,2 -k1,1nr \
>           | gzip > text_word_counts.gz
>
> For type counts, check how often each text ID occurs in the sorted table
> (uniq only merges adjacent lines, so the table must be sorted by text ID):
>
>         gzip -cd  text_word_counts.gz | cut -f2 | uniq -c
>
> For token counts, add up the word frequencies for each text ID, or get
> them directly from cwb-s-decode:
>
>         cwb-s-decode CORPUS -S text_id | awk '{print $2 - $1 + 1, $3}'
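
The same region arithmetic can be done in a script when the counts are needed alongside other per-text figures. The following is a minimal Python sketch, assuming that cwb-s-decode prints one region per line as start position, end position and annotation value separated by whitespace (the layout the awk command above relies on). The function name is hypothetical, and summing over repeated IDs is an addition for the case where one text ID spans several regions.

```python
def region_token_counts(lines):
    """Map text_id -> number of tokens, from 'start end value' lines."""
    counts = {}
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start, end = int(fields[0]), int(fields[1])
        text_id = fields[2].strip()
        # Corpus positions are inclusive on both ends, hence the + 1.
        counts[text_id] = counts.get(text_id, 0) + (end - start + 1)
    return counts
```

The output of the cwb-s-decode command could be passed in line by line, for example by iterating over sys.stdin.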
>
>
> Hope this helps,
> Stefan
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>