[CWB] Counting tokens in CQP

Serge Heiden slh at ens-lyon.fr
Sat Oct 18 13:43:38 CEST 2014


Teresa,
I forgot to mention Unicode punctuation character classes.
If your corpus is encoded in Unicode, you can use punctuation character 
classes on word forms in your queries.
For example, a search for [word="\p{P}+"] should give you all the 
punctuation marks in your corpus,
and [word!="\p{P}+"] the remaining tokens.
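
To check outside CQP which forms such a class would catch, here is a rough Python sketch. It is only an approximation: Python's re module has no \p{P}, so the sketch tests the Unicode general category of each character instead, and the token list is invented for illustration.

```python
import unicodedata

def is_punctuation(token: str) -> bool:
    # Approximates the regex \p{P}+ : every character must belong to a
    # Unicode general category starting with "P" (Po, Pd, Ps, Pe, ...).
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# Hypothetical token sequence, not taken from any real corpus.
tokens = ["``", "Oliver", "!", "''", "said", "Mr.", "Bumble", "...", "-", "$"]

punctuation = [t for t in tokens if is_punctuation(t)]
words = [t for t in tokens if not is_punctuation(t)]

print(len(words), "non-punctuation tokens:", words)
print(len(punctuation), "punctuation tokens:", punctuation)
```

Note that the backtick and the dollar sign are Unicode symbols (categories Sk and Sc), not punctuation, so forms such as `` or $ are not caught by a pure punctuation class and may need to be listed separately.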
Best,
Serge

On 18/10/2014 13:23, Serge Heiden wrote:
> Teresa,
>
> You need to know how your corpus has been tokenized (segmented into a 
> sequence of tokens and punctuation marks, to use your terminology), 
> which is done before and outside of CQP.
> If your corpus provides word properties that give the punctuation 
> status or something equivalent, you should also be able to query 
> that information.
> If your corpus has no documentation about that, you should ask the 
> provider of the corpus.
> As a last resort, and as an approximation at least for roman languages, 
> you can search your corpus for frequent words with a short form.
> For example, the most frequent words matching [word="."] are mostly 
> punctuation marks, mixed with some grammatical words (auxiliaries, 
> pronouns...).
> Then you can explore frequent words of length two: [word=".."], etc.
> This is why I suggested searching for words longer than one 
> character: [word!="."]
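
The short-form heuristic above can be tried on any token list exported from the corpus. The following is a minimal Python sketch under that assumption; the function name and the sample tokens are hypothetical.

```python
from collections import Counter

def short_form_frequencies(tokens, length):
    """Count the forms with exactly `length` characters, most frequent first."""
    counts = Counter(t for t in tokens if len(t) == length)
    return counts.most_common()

# Hypothetical tokens, for illustration only.
tokens = [",", "a", ".", "I", ",", "it", ".", ",", "is", "a", "pen", "it"]

one_char = short_form_frequencies(tokens, 1)   # like [word="."]
two_char = short_form_frequencies(tokens, 2)   # like [word=".."]

print(one_char)
print(two_char)
```

Reading the top of each list by eye should show which short forms are punctuation and which are grammatical words to keep.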
>
> Best,
> Serge
>
> On 18/10/2014 12:44, Teresa Molés Cases wrote:
>> Hi Serge,
>>
>> Thanks a lot for your answer, but this query does not seem to work 
>> in my corpus. Could you please tell me how I can find out the 
>> surface forms of the punctuation marks in my corpus? If it is not 
>> too much effort, of course.
>>
>> Thanks a lot! Best,
>>
>> Teresa
>>
>>
>> On 17/10/2014, at 23:10, Serge Heiden <slh at ens-lyon.fr> wrote:
>>
>>> Hi,
>>>
>>> On 17/10/2014 20:28, Teresa Molés Cases wrote:
>>>> I have a question regarding the counting of tokens in CQP. I know 
>>>> that the exact query would be DICKENS> Q1 = []; size Q1;
>>>>
>>>> But I have also read that this search would count not only tokens 
>>>> but also punctuation marks. Is that right?
>>> Yes
>>>> Is it possible in CQP to count just tokens (not including 
>>>> punctuation marks)?
>>> Sure, just ask for something other than a punctuation mark in 
>>> your query instead of any "word"/token.
>>> For example: DICKENS> Q1 = [word!="."&word!="''|``"|word="[ai]"%c]; 
>>> size Q1;
>>> (to formulate such a query, you need to know the surface forms of 
>>> the punctuation marks in your corpus)
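
To make the logic of that example query easier to follow, here is a small Python emulation. It is a sketch, not CQP itself: it assumes that & binds more tightly than | in the query, and it uses re.fullmatch because CQP matches a regular expression against the whole word form. The token list is invented.

```python
import re

def keep_token(word: str) -> bool:
    # word != "."        : the form is not a single character
    not_single_char = re.fullmatch(r".", word) is None
    # word != "''|``"    : the form is not a pair of quote characters
    not_quote_pair = re.fullmatch(r"''|``", word) is None
    # word = "[ai]" %c   : the form is "a" or "i", ignoring case
    is_a_or_i = re.fullmatch(r"[ai]", word, flags=re.IGNORECASE) is not None
    # Assumed precedence: (A & B) | C
    return (not_single_char and not_quote_pair) or is_a_or_i

# Hypothetical tokens, for illustration only.
tokens = ["``", "I", "have", "a", "dog", ",", "''", "he", "said", "."]

kept = [t for t in tokens if keep_token(t)]
print(len(kept), kept)
```

The same logic also drops every other one-character form (a digit such as 5, for instance), which is why the query is only an approximation.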
>>>
>>> Of course it would be better to run a tagger or a syntactic 
>>> analyzer on your sources before CQP, so that a dedicated 
>>> property can be used to filter punctuation (and not only 
>>> 'word' forms).
>>>
>>> You can also filter punctuation out of the sources before running 
>>> cwb-encode and cwb-makeall, in which case your original query will work.
>>> But a corpus without punctuation is difficult to read. Another 
>>> strategy is to have two versions of your corpus, one with 
>>> punctuation and one without, depending on the queries you need to run.
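
For the pre-filtering route, a possible Python sketch follows. It assumes the sources are already in one-token-per-line vertical format with the word form in the first tab-separated column and structural tags on their own lines; the function names, the sample lines and the part-of-speech tags are hypothetical.

```python
import unicodedata

def is_punctuation(form: str) -> bool:
    # True when every character is in a Unicode punctuation category.
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )

def strip_punctuation(lines):
    """Yield the vertical-format lines to keep, skipping punctuation tokens."""
    for line in lines:
        line = line.rstrip("\n")
        if len(line) > 2 and line.startswith("<") and line.endswith(">"):
            yield line  # assumed to be structural markup such as <s> or </s>
            continue
        form = line.split("\t", 1)[0]  # first column is taken as the word form
        if form and not is_punctuation(form):
            yield line

# Hypothetical vertical-format input, for illustration only.
vertical = ["<s>", "It\tPRP", "rained\tVBD", ",\t,", "again\tRB", ".\t.", "</s>"]

filtered = list(strip_punctuation(vertical))
print("\n".join(filtered))
```

The filtered lines could then be written to a second file and encoded as the punctuation-free version of the corpus.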
>>>
>>> Best,
>>> Serge
>>> -- 
>>> Dr. Serge Heiden,slh at ens-lyon.fr,http://textometrie.ens-lyon.fr
>>> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
>>> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
>>> _______________________________________________
>>> CWB mailing list
>>> CWB at sslmit.unibo.it
>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>>
>> Teresa Molés Cases
>> Traductora EN/DE/FR > ES/CAT
>> teresamoles at gmail.com
>> 667848390


-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
