[CWB] Counting tokens in CQP

Serge Heiden slh at ens-lyon.fr
Sat Oct 18 13:23:38 CEST 2014


Teresa,

You need to know how your corpus has been tokenized (segmented into a 
sequence of tokens and punctuation marks, to use your terminology); 
tokenization is done before and outside of CQP.
If your corpus provides word properties that encode punctuation status, 
or something equivalent, you should also be able to query that 
information.
If your corpus has no documentation about this, you should ask the 
provider of the corpus.
As a last resort, and as an approximation at least for Romance languages, 
you can search your corpus for frequent words with a short form.
For example, the most frequent words matching [word="."] (tokens of 
exactly one character) are mostly punctuation marks, mixed with some 
grammatical words (auxiliaries, pronouns...).
Then you can explore frequent words of length two: [word=".."], etc.
This is why I suggested searching for words longer than one 
character: [word!="."]
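
The same exploration can be sketched outside CQP in a few lines of Python. 
This is only an illustration under stated assumptions: the token list below 
is a hypothetical toy sample, not taken from any real corpus, and the helper 
name forms_matching is made up for the example. CQP matches a regular 
expression against the whole token, so the sketch uses re.fullmatch to 
mimic that behaviour.

```python
import re
from collections import Counter

# Hypothetical toy token stream standing in for an already tokenized corpus.
tokens = ["It", "was", "a", "dark", "night", ",", "and", "I",
          "said", ":", "``", "No", ".", "''"]

def forms_matching(tokens, pattern):
    """Frequency list of the surface forms that match `pattern` in full."""
    rx = re.compile(pattern)
    return Counter(t for t in tokens if rx.fullmatch(t))

# Rough analogue of [word="."]: tokens of exactly one character.
one_char = forms_matching(tokens, r".")

# Rough analogue of [word=".."]: tokens of exactly two characters.
two_char = forms_matching(tokens, r"..")

# Rough analogue of [word!="."]: every token that is not one character long.
longer = [t for t in tokens if not re.fullmatch(r".", t)]

print(one_char.most_common())
print(two_char.most_common())
print(len(tokens), len(longer))
```

On this toy list, the one-character forms are mostly punctuation but also 
include "a" and "I", and the two-character forms include the quote marks 
`` and '', which is why a query based on length alone is only an 
approximation.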

Best,
Serge

On 18/10/2014 12:44, Teresa Molés Cases wrote:
> Hi Serge,
>
> Thank you very much for your answer, but this query does not seem to 
> work in my corpus. Could you please tell me how I can get information 
> about the surface forms of punctuation marks in my corpus? If it is not 
> too much effort, of course.
>
> Thanks a lot! Best,
>
> Teresa
>
>
> On 17/10/2014, at 23:10, Serge Heiden <slh at ens-lyon.fr> wrote:
>
>> Hi,
>>
>> On 17/10/2014 20:28, Teresa Molés Cases wrote:
>>> I have a question regarding the counting of tokens in CQP. I know 
>>> that the exact query would be DICKENS> Q1 = []; size Q1;
>>>
>>> But I have also read that this search would count not only tokens 
>>> but also punctuation marks. Is that right?
>> Yes
>>> Is it possible in CQP to count just tokens (not including 
>>> punctuation marks)?
>> Sure, just query for anything other than a punctuation mark instead 
>> of any "word"/token.
>> For example: DICKENS> Q1 = [word!="."&word!="''|``"|word="[ai]"%c]; 
>> size Q1;
>> (to formulate such a query, you need to know the surface forms of the 
>> punctuation marks in your corpus)
>>
>> Of course it would be better to run a tagger or a syntactic 
>> analyzer on your sources before CQP, to give it a property that 
>> can be used to filter punctuation marks (and not only 
>> 'word' forms).
>>
>> You can also filter punctuation marks out of the sources before running 
>> cwb-encode and cwb-makeall, in which case your original query will work.
>> But a corpus without punctuation is difficult to read. Another 
>> strategy is to have two versions of your corpus, one with 
>> punctuation and one without, and to pick one depending on the queries 
>> you need to run.
>>
>> Best,
>> Serge
>> -- 
>> Dr. Serge Heiden,slh at ens-lyon.fr,http://textometrie.ens-lyon.fr
>> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
>> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
>
> Teresa Molés Cases
> Traductora EN/DE/FR > ES/CAT
> teresamoles at gmail.com <mailto:teresamoles at gmail.com>
> 667848390


