[CWB] Counting tokens in CQP

Sat Oct 18 21:13:03 CEST 2014

Thanks a lot for your help, Serge! I will study all this information and I hope to solve the problem.

Best,

Teresa

On 18/10/2014, at 13:43, Serge Heiden <slh at ens-lyon.fr> wrote:

> Teresa,
> I also forgot to mention Unicode punctuation character classes.
> If your corpus is encoded in Unicode, you can use punctuation character classes on word forms in your queries.
> For example, a search for [word="\p{P}+"] should give you all the punctuation marks in your corpus,
> and [word!="\p{P}+"] your tokens.
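As a rough offline illustration of what \p{P}+ selects, the Python sketch below classifies word forms with the standard unicodedata module. It only approximates the CQP query; the helper names and the sample token list are invented for the example. One thing it makes visible: the backtick belongs to a symbol category (Sk), not to punctuation, so a form such as `` would not be matched by \p{P}+.

```python
import unicodedata


def is_punctuation(word):
    # Mirrors the regex \p{P}+ : non-empty, and every character is in a
    # Unicode general category starting with "P" (Po, Pd, Ps, Pe, ...).
    return bool(word) and all(
        unicodedata.category(ch).startswith("P") for ch in word
    )


def count_non_punctuation(tokens):
    # Counts the forms that a query like [word!="\p{P}+"] would keep.
    return sum(1 for tok in tokens if not is_punctuation(tok))


# Hypothetical sample, one word form per corpus position.
sample = ["It", "was", "the", "best", "of", "times", ",", "``", "yes", "''", "."]

if __name__ == "__main__":
    print(count_non_punctuation(sample))
    print([tok for tok in sample if is_punctuation(tok)])
```

If forms like `` or $ should also be excluded, the test would have to cover symbol categories as well, which is a separate decision about what counts as a token.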
> Best,
> Serge
> 
> On 18/10/2014 13:23, Serge Heiden wrote:
>> Teresa,
>> 
>> You need to know how your corpus has been tokenized (segmented into a sequence of tokens and punctuation marks, to use your terminology), which is a process done before and outside of CQP.
>> If your corpus provides word properties giving information about punctuation status or something equivalent, you should also be able to access that information.
>> If your corpus has no documentation about that, you should ask the provider of the corpus.
>> As a last resort, and as an approximation at least for Romance languages, you can search your corpus for frequent words with a short form.
>> For example, the most frequent words matching [word="."] are mostly punctuation marks, mixed with some grammatical words (auxiliaries, pronouns...).
>> Then you can explore frequent words of length two: [word=".."], etc.
>> This is why I suggested searching for words longer than one character: [word!="."]
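The same exploration can be sketched outside CQP on a plain list of word forms. This is only an illustration under the assumption that the word forms have been exported to a Python list; the function name and the sample data are hypothetical.

```python
from collections import Counter


def frequent_forms_of_length(tokens, length, top=10):
    # Frequency list of the word forms that have exactly `length` characters,
    # comparable to inspecting the matches of [word="."] or [word=".."].
    counts = Counter(tok for tok in tokens if len(tok) == length)
    return counts.most_common(top)


# Hypothetical sample of word forms.
tokens = [".", ",", ".", "a", "I", ".", "of", "to", "of"]

if __name__ == "__main__":
    print(frequent_forms_of_length(tokens, 1))
    print(frequent_forms_of_length(tokens, 2))
```

Reading the top of each list by eye is what separates punctuation marks from short grammatical words; the code does not make that decision.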
>> 
>> Best,
>> Serge
>> 
>> On 18/10/2014 12:44, Teresa Molés Cases wrote:
>>> Hi Serge,
>>> 
>>> Thanks a lot for your answer, but this query does not seem to work on my corpus. Could you please tell me how I can find out the surface forms of the punctuation marks in my corpus? If it is not too much trouble, of course.
>>> 
>>> Thanks a lot! Best,
>>> 
>>> Teresa
>>> 
>>> 
>>> On 17/10/2014, at 23:10, Serge Heiden <slh at ens-lyon.fr> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> On 17/10/2014 20:28, Teresa Molés Cases wrote:
>>>>> I have a question regarding the counting of tokens in CQP. I know that the exact query would be DICKENS> Q1 = []; size Q1;
>>>>> 
>>>>> But I have also read that this search would count not only tokens but also punctuation marks. Is that right?
>>>> Yes
>>>>> Is it possible in CQP to count just tokens (not including punctuation marks)?
>>>> Sure, just ask for something other than a punctuation mark in your query instead of any "word"/token.
>>>> For example: DICKENS> Q1 = [word!="."&word!="''|``"|word="[ai]"%c]; size Q1;
>>>> (to formulate such a query, you need to know the surface forms of the punctuation marks in your corpus)
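To make the logic of this query explicit, here is a small Python sketch of the same condition. It rests on two assumptions: that & binds more tightly than | in the query, and that CQP matches a regular expression against the whole word form (emulated here with re.fullmatch). The function name keep_token is hypothetical.

```python
import re


def keep_token(word):
    # (word != "."  and  word != "''|``")  or  word = "[ai]" ignoring case
    not_single_char = re.fullmatch(r".", word) is None
    not_quote_mark = re.fullmatch(r"''|``", word) is None
    single_letter_word = re.fullmatch(r"[ai]", word, re.IGNORECASE) is not None
    return (not_single_char and not_quote_mark) or single_letter_word


if __name__ == "__main__":
    for form in ["a", "I", ".", ",", "``", "''", "Dickens", "x"]:
        print(repr(form), keep_token(form))
```

Read this way, the query drops every one-character form and the two quote forms, then brings back the one-letter words "a" and "I". Any other one-character word is still excluded, which is why the surface forms in the corpus need to be checked first.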
>>>> 
>>>> Of course, it would be better to run a tagger or a syntactic analyzer on your sources before CQP, so that
>>>> a property other than the 'word' form can be used to filter punctuation.
>>>> 
>>>> You can also filter punctuation out of the sources before running cwb-encode and cwb-makeall, in which case your original query will work.
>>>> But a corpus without punctuation is difficult to read. Another strategy is to have two versions of your corpus, one with
>>>> punctuation and one without, and to use whichever suits the queries you need to run.
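A minimal sketch of such a pre-encoding filter follows. It assumes the sources are already in the one-token-per-line format with tab-separated columns, with the word form in the first column and structural tags on lines starting with "<". It also assumes that punctuation means forms made up only of Unicode punctuation characters. The function name is hypothetical.

```python
import unicodedata


def strip_punctuation_lines(lines):
    # Yields the input lines unchanged, except that token lines whose first
    # column consists only of Unicode punctuation characters are dropped.
    for line in lines:
        content = line.rstrip("\n")
        # Keep empty lines and structural tag lines such as <s> or </s>.
        if not content or content.startswith("<"):
            yield line
            continue
        word = content.split("\t", 1)[0]
        if word and all(unicodedata.category(ch).startswith("P") for ch in word):
            continue
        yield line


# Hypothetical fragment of a verticalized source.
vertical = ["<s>\n", "Hello\tUH\n", ",\t,\n", "world\tNN\n", ".\tSENT\n", "</s>\n"]

if __name__ == "__main__":
    print("".join(strip_punctuation_lines(vertical)), end="")
```

Writing the filtered lines to a second file, rather than overwriting the original, fits the two-version strategy: both files can then be encoded as separate corpora.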
>>>> 
>>>> Best,
>>>> Serge
>>>> -- 
>>>> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
>>>> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
>>>> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33622003883
>>>> _______________________________________________
>>>> CWB mailing list
>>>> CWB at sslmit.unibo.it
>>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>>> 
>>> Teresa Molés Cases
>>> Traductora EN/DE/FR > ES/CAT
>>> teresamoles at gmail.com
>>> 667848390

