[CWB] finding number of types

Stefania Spina stefania.spina at unistrapg.it
Tue Dec 13 11:07:10 CET 2016


Thank you so much Stefan, it's exactly what I was looking for :-)
Best,
Stefania

2016-12-13 10:50 GMT+01:00 Stefan Evert <stefanML at collocations.de>:

>
> > On 13 Dec 2016, at 10:29, Stefania Spina <stefania.spina at unistrapg.it> wrote:
> >
> > I'm running CWB 3.0 with a corpus of 350 texts, annotated with 350
> > different <text id="number"> tags.
> > Can you suggest a way to extract the number of types for each of the 350
> > texts?
> > With   cwb-lexdecode -S   I only get the total number of types.
>
> That's because CWB doesn't keep statistics on individual texts, only for
> the entire corpus.  You'll have to count types for each text yourself.
>
> On a Linux / Mac command line, the following should work:
>
>         cwb-scan-corpus DICKENS text_id+0 word+0 | LANG=C cut -f2 | sort | uniq -c
>
> This will count the literal word forms. If you want lemmatized or
> case-insensitive counts, annotate them as a p-attribute and use it instead
> of word+0 above.  Case-insensitive counts can also be achieved with some
> additional trickery on the command line.
>
> Best,
> Stefan
>
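
The "additional trickery" for case-insensitive counts mentioned in the quoted reply could look roughly like the sketch below. It is not from Stefan's message: the helper name count_types_ci is made up for illustration, and it assumes that cwb-scan-corpus prints one line per distinct (text_id, word) pair in the form frequency, TAB, text_id, TAB, word. Note that awk's tolower() may only fold ASCII letters, depending on the awk implementation and the locale.

```shell
# Hypothetical helper (not part of CWB): count case-folded types per text.
# Reads tab-separated lines "frequency<TAB>text_id<TAB>word" on stdin,
# which is the assumed output layout of the cwb-scan-corpus call above.
count_types_ci() {
    # 1. keep text_id and the lowercased word form
    # 2. drop duplicate (text_id, word) pairs; the byte-wise sort also
    #    keeps all lines of one text_id next to each other
    # 3. count the remaining lines per text_id
    awk -F'\t' '{ print $2 "\t" tolower($3) }' |
        LC_ALL=C sort -u |
        cut -f1 |
        uniq -c
}

# Intended use (needs CWB and the DICKENS corpus installed):
#   cwb-scan-corpus DICKENS text_id+0 word+0 | count_types_ci
```

Each output line holds a type count followed by the text id, in the same layout as the literal-word-form pipeline.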



-- 
Prof. Stefania Spina
Università per Stranieri di Perugia
Dipartimento di Scienze Umane e Sociali
stefania.spina at unistrapg.it
https://unistrapg.academia.edu/StefaniaSpina