[CWB] size of a subcorpus - how?

Pavel Vondřička Pavel.Vondricka at ff.cuni.cz
Mon Apr 15 23:07:21 CEST 2013


Hi,
yes, I think that is a rather dirty way. At the moment I prefer the
approach I tried with rcqp: computing the size from the subcorpus dump in
R. I am still puzzled that CWB lacks such basic functionality.
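
The dump-based computation can be sketched roughly as follows, here in
Python rather than R. This is only an illustration: it assumes the
subcorpus was saved with CQP's "dump" command as plain text, one line per
match, with the whitespace-separated columns match, matchend, target and
keyword, and that corpus positions are inclusive. The function name
subcorpus_size and the sample data are made up for the example.

```python
def subcorpus_size(dump_text: str) -> int:
    """Sum the token lengths of all (match, matchend) ranges in a dump.

    Assumes each non-empty line starts with two integer columns:
    the first and last corpus position of one match (both inclusive).
    """
    total = 0
    for line in dump_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        match, matchend = int(fields[0]), int(fields[1])
        total += matchend - match + 1  # inclusive range length
    return total


# Hypothetical dump with three ranges of 5, 1 and 10 tokens.
example_dump = "0\t4\t-1\t-1\n10\t10\t-1\t-1\n20\t29\t-1\t-1\n"
print(subcorpus_size(example_dump))  # prints 16
```

Unlike "size", which counts matches, this counts tokens, so it should also
work when each match spans a whole region rather than a single token.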

Best regards,
Pavel

P.S.: Any news from María? I have not heard from her for a very long
time... [Oh, sorry, let's use private reply for this subject!]


> Hi, Pavel:
> My dirty way is to launch a query with the restrictions for that
> subcorpus. Take EUROPARL-EN: the element speaker has an attribute called
> language, indicating the source language of the tokens contained in that
> element. If I only want the tokens originally in German, I run this query:
> 
>     []:: match.speaker_language="DE";
> 
> If you do:
> 
>     size Last;
> 
> You get the size in tokens, in this case 5532412.
> 
> When I want to calculate the same for all the subcorpora at once (in
> my case, all subcorpora according to the source language):
> 
>     [];
> 
>     group Last match verbalization_language;
> 
> Then you get a table similar to:
> 
> DE    5532412
> FR    4921250
> NL    3003754
> ES    2772929
> IT    2407213
> PT    1665839
> EL    1382710
> SV    1378828
> DA    698575
> FI    571006
> PL    363083
> ...    ...
> 
> I hope it helps!
> 
> Best,
> 
> jmm
> 
> El 15/04/13 20:53, Pavel Vondřička escribió:
>> Excuse me, but is there any way to get the size of a subcorpus in tokens? 
>> Somehow I cannot find such a basic thing, sorry.
>>
>> Thanks,
>> Pavel
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> 
> 
> 
> 


