[CWB] size of a subcorpus - how?

Jose Manuel Martinez Martinez jmmtra at gmail.com
Mon Apr 15 22:51:58 CEST 2013


Hi, Pavel:
My quick-and-dirty way is to launch a query with the restrictions for that
subcorpus. Take EUROPARL-EN: the element speaker has an attribute called
language, indicating the source language of the tokens contained in that
element. If I only want the tokens whose source language is German, I run
this query:

     []:: match.speaker_language="DE";

If you do:

     size Last;

You get the size in tokens, in this case 5532412.

When I want to calculate the same for all the subcorpora at once (in
my case, all subcorpora according to the source language), I run:

     [];

     group Last match verbalization_language;

Then you get a table similar to:

DE    5532412
FR    4921250
NL    3003754
ES    2772929
IT    2407213
PT    1665839
EL    1382710
SV    1378828
DA    698575
FI    571006
PL    363083
... ...
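
If the table is needed for further processing, it can be read into a
script. Below is a minimal Python sketch, assuming the output of group
has been saved as two whitespace-separated columns exactly as shown
above. The function name parse_group_output is hypothetical, and since
the table is truncated, the total covers only the rows listed here, not
the whole corpus.

```python
# Sample copied from the table above (truncated in the original message).
sample = """\
DE    5532412
FR    4921250
NL    3003754
ES    2772929
IT    2407213
PT    1665839
EL    1382710
SV    1378828
DA    698575
FI    571006
PL    363083
... ...
"""


def parse_group_output(text):
    """Map each attribute value to its token count.

    Assumes two whitespace-separated columns per line; any line that
    does not fit that shape (such as the "... ..." row) is skipped.
    """
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        sizes[parts[0]] = int(parts[1])
    return sizes


sizes = parse_group_output(sample)
listed_total = sum(sizes.values())  # total over the listed rows only

# Print each language with its token count and its share of the listed rows.
for lang, count in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{lang}\t{count}\t{count / listed_total:.1%}")
```

The percentages are relative to the listed languages only; with the
full output of group they would be relative to the whole corpus.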

I hope it helps!

Best,

jmm

On 15/04/13 20:53, Pavel Vondřička wrote:
> Excuse me, but is there any way to get the size of a subcorpus in tokens?
> Somehow I cannot find such a basic thing, sorry.
>
> Thanks,
> Pavel
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
