[CWB] Counting tokens in CQP

Teresa Molés Cases teresamoles at gmail.com
Sat Oct 18 12:44:02 CEST 2014


Hi Serge,

Thank you very much for your answer, but this query does not seem to work on my corpus. Could you please tell me how I can find out the surface forms of the punctuation marks in my corpus? Only if it is not too much trouble, of course.

Thanks a lot! Best,

Teresa


On 17/10/2014, at 23:10, Serge Heiden <slh at ens-lyon.fr> wrote:

> Hi,
> 
> On 17/10/2014 20:28, Teresa Molés Cases wrote:
>> I have a question regarding the counting of tokens in CQP. I know that the exact query would be DICKENS> Q1 = []; size Q1;
>> 
>> But I have also read that this search would count not only tokens but also punctuation marks. Is that right?
> Yes
>> Is it possible in CQP to count just tokens (not including punctuation marks)?
> Sure, just ask for something different from a punctuation mark in your query instead of any "word"/token.
> For example: DICKENS> Q1 = [word!="."&word!="''|``"|word="[ai]"%c]; size Q1;
> (to formulate such a query, you need to know the surface forms of the punctuation marks in your corpus)
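
One way to find out which punctuation surface forms occur in a corpus is to count them in the tokenized source before it is encoded. The Python sketch below is only an illustration under stated assumptions: the tokens are already available as a list of strings, and a token is treated as punctuation when all of its characters fall in a Unicode punctuation (P*) or symbol (S*) category. The names is_punctuation and punctuation_forms are hypothetical helpers, not part of CWB or CQP.

```python
import unicodedata
from collections import Counter


def is_punctuation(token):
    """Return True if the token is non-empty and consists only of
    characters in a Unicode punctuation (P*) or symbol (S*) category.
    Assumption: this is what "punctuation mark" means for the corpus."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in "PS" for ch in token
    )


def punctuation_forms(tokens):
    """Count each distinct punctuation-only surface form."""
    return Counter(t for t in tokens if is_punctuation(t))


# Hypothetical token list, for illustration only.
sample = ["It", "was", "the", "best", "of", "times", ",",
          "``", "Oh", "!", "''", "."]
forms = punctuation_forms(sample)

# Print each surface form with its frequency.
for form, freq in forms.most_common():
    print(form, freq)
```

The forms reported this way are the ones that a query like the one above would need to exclude. Including the symbol categories catches forms such as the double-backtick opening quote, but it may also flag tokens such as "$" that you would rather keep as words.
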
> 
> Of course, it would be better to run a tagger or a syntactic analyzer on your sources before CQP, to tell it
> which property can be used to filter out punctuation (and not only 'word' forms).
> 
> You can also filter punctuation out of the sources before running cwb-encode and cwb-makeall, in which case your original query will work.
> But a corpus without punctuation is difficult to read. Another strategy is to have two versions of your corpus: one with
> punctuation and one without, depending on the queries you need to run.
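
The pre-filtering idea can be sketched in Python as follows. This is a rough illustration, not a tested pipeline: it assumes the source is in the one-token-per-line "vertical" format expected by cwb-encode, with the word form in the first tab-separated column and XML tags on lines of their own. The function names is_punctuation and strip_punctuation are hypothetical, and the punctuation test (all characters in a Unicode P* or S* category) is an assumption that may need adjusting for a given corpus.

```python
import unicodedata


def is_punctuation(token):
    """True if the token is non-empty and made up only of Unicode
    punctuation (P*) or symbol (S*) characters (an assumption)."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in "PS" for ch in token
    )


def strip_punctuation(lines):
    """Yield the lines of a vertical file, dropping punctuation tokens.

    Lines that look like XML tags (<s>, </s>, ...) are kept unchanged.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped.startswith("<") and stripped.endswith(">"):
            yield line  # structural tag, keep as-is
            continue
        word = stripped.split("\t", 1)[0]  # first column = word form
        if not is_punctuation(word):
            yield line


# Hypothetical vertical-format input, for illustration only.
vertical = ["<s>\n", "Hello\tUH\n", ",\t,\n", "world\tNN\n",
            ".\tSENT\n", "</s>\n"]
filtered = list(strip_punctuation(vertical))
```

Writing the filtered lines to a second file and encoding it under a different corpus name would give the two-version setup mentioned above: one corpus for reading, one for counting word tokens.
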
> 
> Best,
> Serge
> -- 
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

Teresa Molés Cases
Translator EN/DE/FR > ES/CAT
teresamoles at gmail.com
667848390



