[CWB] Difference in token number between CQP and CQPweb

Hannah Kermes h.kermes at mx.uni-saarland.de
Fri Feb 14 15:42:00 CET 2014


Hi Andrew
On 14.02.2014 15:04, Hardie, Andrew wrote:
> Hannah & Stefan,
>
> Can you tell me (a) which function you used to get the CQP word count (b) where you got the CQPweb wordcount (corpus metadata, or concordance infobar)?
(a) Not answered here. (b) Corpus metadata, concordance infobar and distribution; all three show the same figure.
>
> The most obvious explanation is that there are tokens outside <text> elements, since CQPweb calculates the size of the corpus by summing the tokens in each individual text. This in turn is based on calculating cpos differences.
I checked that: the query [!text] returns no results in either CQP or CQPweb.
So all tokens should be within <text> elements.
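
For reference, the per-text size calculation Andrew describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CQPweb's actual code: each <text> region is taken to be an inclusive (start, end) pair of corpus positions, and the region list and corpus size below are made-up toy values. For a real corpus the region boundaries would have to be exported from the CWB index first (e.g. with cwb-s-decode).

```python
def tokens_in_regions(regions):
    """Sum region sizes; each region is an inclusive (start_cpos, end_cpos) pair."""
    return sum(end - start + 1 for start, end in regions)


def tokens_outside_regions(corpus_size, regions):
    """Tokens not covered by any region (assumes regions do not overlap)."""
    return corpus_size - tokens_in_regions(regions)


# Hypothetical toy corpus: 12 tokens (cpos 0..11) and two <text> regions.
# Tokens at cpos 5 and 6 lie outside any <text> element.
text_regions = [(0, 4), (7, 11)]
corpus_size = 12

per_text_total = tokens_in_regions(text_regions)
uncovered = tokens_outside_regions(corpus_size, text_regions)

print(per_text_total, uncovered)
```

If the per-text total came out smaller than the overall corpus size, the difference would be the number of tokens outside <text> elements. Since [!text] finds nothing here, that difference should be zero for this corpus.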

Best
Hannah
>
> But I would like to investigate on my own server first.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
> Sent: 14 February 2014 12:49
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Difference in token number between CQP and CQPweb
>
>
> On 14 Feb 2014, at 12:07, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:
>
>> I just realized a difference in the token numbers between CQP and CQPweb.
>> The encoded corpus in CQPweb is a copy of the CQP corpus. The encoding has been performed with CQP on the command line and has been installed in CQPweb as an encoded corpus.
>>
>> Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)
>>
>> The difference is also present if you look at subcorpora.
> Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed. Everything else is fine.
>
> Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?
>
> Otherwise, the only explanation I can think of is that you may have re-encoded the CWB corpus, changing its size, and forgot to re-install it in CQPweb (so CQPweb still has the old frequency information etc. and all subcorpora and distributions will be totally messed up)?
>
> Cheers,
> Stefan
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universität des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbrücken
phone: +49-(0)681-302-70077
