[CWB] Efficient way to count frequencies on large data

Sébastien Jacquot sebastien.jacquot at univ-fcomte.fr
Fri Dec 18 14:54:16 CET 2015


Thank you very much Stefan, I will try this tool.
Cheers,
Sébastien

On 18/12/2015 14:51, Stefan Evert wrote:
>> On 18 Dec 2015, at 14:37, Sébastien Jacquot <sebastien.jacquot at univ-fcomte.fr> wrote:
>>
>> I'm looking for an efficient way to get the frequencies of repeated token sequences in large corpora.
>> At the moment I use:
>> R = ([][][][]);
>> count R by word cut 20;
>>
>> Is there a faster way to do this in terms of performance? (For example, by grouping and counting the matches directly, rather than retrieving all the results and then counting them?)
> Yep, there is: the cwb-scan-corpus command-line tool.  It's both faster and more memory-efficient (usually).
>
> e.g.
>
> 	cwb-scan-corpus -o 4grams.txt.gz MYCORPUS -f 10 word+0 word+1 word+2 word+3
>
> where -f 10 sets a frequency threshold of >= 10 occurrences.  Note that the n-grams are unsorted in the output file; use the "sort" command-line tool if you want frequency-sorted output like that of the "count" command.
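> For example, assuming the frequency count appears as the first tab-separated field of each output line, a short shell pipeline such as this sketch should print the 20 most frequent n-grams:
>
> 	# decompress, sort numerically on the frequency field (descending), keep the top 20
> 	zcat 4grams.txt.gz | sort -k1,1nr | head -n 20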
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
ELLIADD, EA 4661
UFR SLHS - Université de Franche-Comté
30-32 rue Mégevand
25030 Besançon cedex
03.81.66.54.22


