[CWB] size of a subcorpus - how?
Jose Manuel Martinez Martinez
jmmtra at gmail.com
Mon Apr 15 22:51:58 CEST 2013
Hi, Pavel:
My quick-and-dirty way is to launch a query with the restrictions that
define that subcorpus. Take EUROPARL-EN: the element speaker has an attribute
called language, which indicates the source language of the tokens contained
in that element. If I only want the tokens whose source language is German, I
run this query:
[]:: match.speaker_language="DE";
If you do:
size Last;
You get the number of matches, which is the size in tokens because each match
of [] is exactly one token; in this case 5532412.
When I want to calculate the same for all the subcorpora at once (in my case,
all subcorpora according to the source language), I run:
[];
group Last match speaker_language;
Then you get a table similar to:
DE 5532412
FR 4921250
NL 3003754
ES 2772929
IT 2407213
PT 1665839
EL 1382710
SV 1378828
DA 698575
FI 571006
PL 363083
... ...
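If you save a table like the one above to a text file, a few lines of Python can turn it into per-language sizes and shares. This is only a rough sketch, not part of CWB: the function name parse_group_table is made up for illustration, it assumes each row is "value count" separated by whitespace, and it skips the elided "..." row, so the total covers only the rows listed here and not the whole corpus.

```python
def parse_group_table(text):
    """Parse 'VALUE COUNT' rows into a dict, skipping any other line."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only rows of the form "<value> <integer count>".
        if len(parts) == 2 and parts[1].isdigit():
            sizes[parts[0]] = int(parts[1])
    return sizes


# Rows copied from the table above; the elided "..." row is ignored.
table = """\
DE 5532412
FR 4921250
NL 3003754
ES 2772929
IT 2407213
PT 1665839
EL 1382710
SV 1378828
DA 698575
FI 571006
PL 363083
... ...
"""

sizes = parse_group_table(table)
listed_total = sum(sizes.values())  # total over the listed rows only

# Print each language with its token count and its share of the listed rows.
for lang, count in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{lang}\t{count}\t{count / listed_total:.1%}")
```

The percentages are relative to the listed rows only; for shares of the full corpus you would need the complete table.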
I hope it helps!
Best,
jmm
On 15/04/13 20:53, Pavel Vondřička wrote:
> Excuse me, but is there any way to get the size of a subcorpus in tokens?
> Somehow I cannot find such a basic thing, sorry.
>
> Thanks,
> Pavel