[CWB] size of a subcorpus - how?
Pavel Vondřička
Pavel.Vondricka at ff.cuni.cz
Mon Apr 15 23:07:21 CEST 2013
Hi,
yes, that is a rather dirty way, I think. At the moment I prefer the one I
tried using rcqp: compute the size from the subcorpus dump in R. I am still
puzzled that CWB lacks such basic functionality.
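
To make the dump idea concrete, here is a rough sketch of the arithmetic (in
Python rather than R, just for illustration). It assumes the dump is a list of
rows whose first two columns are the match and matchend corpus positions, as
in a CQP dump, and that the ranges do not overlap. The function name and the
example positions are made up.

```python
def subcorpus_size(dump):
    """Sum the token counts of all (match, matchend) ranges in a dump.

    Each row is assumed to start with two inclusive corpus positions:
    match and matchend. Any further columns (e.g. target, keyword)
    are ignored. Overlapping ranges would be counted twice.
    """
    total = 0
    for row in dump:
        match, matchend = row[0], row[1]
        total += matchend - match + 1  # inclusive range length
    return total


# Hypothetical dump with three ranges covering 1, 5 and 2 tokens.
dump = [
    (0, 0, -1, -1),
    (10, 14, -1, -1),
    (20, 21, -1, -1),
]
print(subcorpus_size(dump))
```

For a subcorpus of single-token matches this is simply the number of rows;
the sum only matters when the ranges span several tokens.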
Best regards,
Pavel
P.S.: Any news from María? I have not heard from her for a very long
time... [Oh, sorry, let's use private reply for this subject!]
> Hi, Pavel:
> My dirty way is to launch a query with the restrictions for that
> subcorpus. Take EUROPARL-EN: the element speaker has an attribute called
> language, indicating the source language of the tokens contained in that
> element. If I only want the tokens whose source language is German, I run
> this query:
>
> []:: match.speaker_language="DE";
>
> If you do:
>
> size Last;
>
> You get the size in tokens, in this case 5532412.
>
> When I want to calculate the same for all the subcorpora at once (in
> my case, all subcorpora according to the source language):
>
> [];
>
> group Last match verbalization_language;
>
> Then you get a table similar to:
>
> DE 5532412
> FR 4921250
> NL 3003754
> ES 2772929
> IT 2407213
> PT 1665839
> EL 1382710
> SV 1378828
> DA 698575
> FI 571006
> PL 363083
> ... ...
>
> I hope it helps!
>
> Best,
>
> jmm
>
> El 15/04/13 20:53, Pavel Vondřička escribió:
>> Excuse me, but is there any way to get the size of a subcorpus in tokens?
>> Somehow I cannot find such a basic thing, sorry.
>>
>> Thanks,
>> Pavel
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb