[CWB] Efficient way to count frequencies on large data

Stefan Evert stefanML at collocations.de
Fri Dec 18 14:51:42 CET 2015


> On 18 Dec 2015, at 14:37, Sébastien Jacquot <sebastien.jacquot at univ-fcomte.fr> wrote:
> 
> I'm looking for an efficient way to get the frequencies of repeated token sequences on large corpora.
> At the moment I use:
> R = ([][][][]);
> count R by word cut 20;
> 
> Is there a faster way to do this in terms of performance? (For example, by directly grouping and counting the results rather than retrieving all the results and then counting them?)

Yep, there is: the cwb-scan-corpus command-line tool.  It's both faster and more memory-efficient (usually).

e.g.

	cwb-scan-corpus -o 4grams.txt.gz MYCORPUS -f 10 word+0 word+1 word+2 word+3

Each key of the form word+N extracts the "word" attribute at offset N from the start of the n-gram, and -f 10 applies a frequency threshold of f >= 10 occurrences.  Note that the n-grams in the output file are unsorted; pipe it through the standard "sort" command-line tool if you want frequency-sorted output like that produced by the "count" command.
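For instance, assuming the default output format with the frequency as the first tab-separated field (file names here are just illustrative), something along these lines should give you the list in descending frequency order:

	gunzip -c 4grams.txt.gz | sort -nr | gzip > 4grams-sorted.txt.gz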

Best,
Stefan

