[CWB] Counting tokens in CQP

Serge Heiden slh at ens-lyon.fr
Fri Oct 17 23:10:11 CEST 2014


Hi,

On 17/10/2014 20:28, Teresa Molés Cases wrote:
> I have a question regarding the counting of tokens in CQP. I know that 
> the exact query would be DICKENS> Q1 = []; size Q1;
>
> But I have also read that this search would count not only tokens but 
> also punctuation marks. Is that right?
Yes
> Is it possible in CQP to count just tokens (not including punctuation 
> marks)?
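Sure, just ask for anything that is not a punctuation mark in your 
query, instead of any token.
For example: DICKENS> Q1 = [(word != "." & word != "''|``") | word = "[ai]"%c]; size Q1;
The regular expression "." matches every one-character token, so the 
query excludes all of those plus the two quote tokens, then adds back 
the one-letter words "a" and "I". Note that it also drops any other 
one-character token, such as a single digit.
(To formulate such a query, you need to know the surface forms of 
the punctuation marks in your corpus.)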
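
To see which tokens such a condition keeps, here is a small Python sketch 
that emulates it outside CQP. It is only an illustration: the token list 
is made up, and the function name is_counted is hypothetical. It relies 
on re.fullmatch because CQP regular expressions have to match the whole 
token.

```python
import re

# Made-up sample tokens; in practice these would come from your corpus.
tokens = ["``", "I", "saw", "a", "ghost", ",", "''", "he", "said", "."]

def is_counted(word):
    """Emulate (word != "." & word != "''|``") | word = "[ai]"%c."""
    # CQP regexes must match the entire token, hence fullmatch.
    single_char = re.fullmatch(r".", word) is not None
    quote_mark = re.fullmatch(r"''|``", word) is not None
    # %c in CQP means case-insensitive matching.
    a_or_i = re.fullmatch(r"[ai]", word, flags=re.IGNORECASE) is not None
    return (not single_char and not quote_mark) or a_or_i

counted = [w for w in tokens if is_counted(w)]
print(len(counted), counted)
```

On this sample the commas, full stop and quote tokens are dropped, while 
"I" and "a" survive thanks to the last clause of the condition.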

Of course, it would be better to run a tagger or a syntactic analyzer 
on your sources before encoding them for CQP, so that you have a 
property (for instance a part-of-speech tag) that can be used to filter 
punctuation, and not only 'word' forms.

You can also filter punctuation out of the sources before running 
cwb-encode and cwb-makeall, in which case your original query will work.
But a corpus without punctuation is difficult to read. Another strategy 
is to have two versions of your corpus, one with punctuation and one 
without, and to use whichever suits the queries you need to run.
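
As a rough sketch of that pre-filtering step, the Python below drops 
punctuation-only tokens from input in the one-token-per-line format that 
cwb-encode reads. Several things here are assumptions on my part: the 
function name strip_punctuation, the rule that a token is punctuation 
when it contains no letter or digit, and the rule that any line of the 
form <...> is a structural tag to keep. The sample lines and their tags 
are invented.

```python
import re

# Assumption: a token counts as a word if it has at least one letter or digit.
HAS_ALNUM = re.compile(r"[^\W_]")

def strip_punctuation(lines):
    """Yield vertical-format lines, skipping punctuation-only tokens."""
    for line in lines:
        stripped = line.rstrip("\n")
        # Keep blank lines and structural tags such as <s> and </s> as they are.
        if not stripped or (stripped.startswith("<") and stripped.endswith(">")):
            yield stripped
            continue
        word = stripped.split("\t")[0]  # first column holds the word form
        if HAS_ALNUM.search(word):
            yield stripped

# Invented sample: word form, then a part-of-speech column.
sample = ["<s>", "Marley\tNNP", "was\tVBD", "dead\tJJ", ",\t,",
          "to\tTO", "begin\tVB", "with\tIN", ".\t.", "</s>"]
filtered = list(strip_punctuation(sample))
print("\n".join(filtered))
```

The filtered lines could then be written to a second file and encoded as 
the punctuation-free version of the corpus, next to the full one.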

Best,
Serge

-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
