[CWB] Difference in token number between CQP and CQPweb

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Feb 14 15:04:05 CET 2014


Hannah & Stefan,

Can you tell me (a) which function you used to get the CQP word count, and (b) where you got the CQPweb word count (corpus metadata, or concordance infobar)?

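The most obvious explanation is that there are tokens outside <text> elements: CQPweb calculates the size of the corpus by summing the tokens in each individual text, and each text's size is in turn calculated from the difference between its corpus positions (cpos). Tokens that fall outside every <text> region would therefore not be counted.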
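
To illustrate the arithmetic, here is a minimal Python sketch. It is not CQPweb's actual code: the function names and the toy numbers are hypothetical, and it assumes that <text> regions do not overlap and are given as inclusive (start cpos, end cpos) pairs.

```python
def tokens_in_texts(regions):
    """Sum the sizes of all regions.

    Each region is an inclusive (start_cpos, end_cpos) pair, so its
    size is the cpos difference plus one.
    """
    return sum(end - start + 1 for start, end in regions)


def tokens_outside_texts(corpus_size, regions):
    """Tokens not covered by any region (assumes non-overlapping regions)."""
    return corpus_size - tokens_in_texts(regions)


# Toy corpus (hypothetical numbers): 12 tokens in total, two <text>
# regions, and 4 tokens that fall outside any <text> element.
corpus_size = 12
text_regions = [(0, 3), (6, 9)]

print(tokens_in_texts(text_regions))                    # per-text sum
print(tokens_outside_texts(corpus_size, text_regions))  # uncounted tokens
```

If something like this is what is happening, the per-text sum would correspond to the smaller figure and the full corpus size to the larger one.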

But I would like to investigate on my own server first.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 14 February 2014 12:49
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Difference in token number between CQP and CQPweb


On 14 Feb 2014, at 12:07, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:

> I just noticed a difference in the token numbers between CQP and CQPweb.
> The corpus in CQPweb is a copy of the CQP corpus: it was encoded with CQP on the command line and then installed in CQPweb as an already-encoded corpus.
> 
> Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)
> 
> The difference is also present if you look at subcorpora.

Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed. Everything else is fine.

Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?

Otherwise, the only explanation I can think of is that you may have re-encoded the CWB corpus, changing its size, and forgotten to re-install it in CQPweb (so CQPweb still has the old frequency information etc., and all subcorpora and distributions will be totally messed up)?

Cheers,
Stefan


