[CWB] Difference in token number between CQP and CQPweb

Stefan Evert stefanML at collocations.de
Fri Feb 14 13:49:04 CET 2014


On 14 Feb 2014, at 12:07, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:

> I just realized a difference in the token numbers between CQP and CQPweb.
> The encoded corpus in CQPweb is a copy of the CQP corpus. The encoding has been performed with CQP on the command line and has been installed in CQPweb as an encoded corpus.
> 
> Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)
> 
> The difference is also present if you look at subcorpora.

Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed. Everything else is fine.

Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?

Otherwise, the only explanation I can think of is that you may have re-encoded the CWB corpus, changing its size, and forgotten to re-install it in CQPweb. In that case CQPweb still has the old frequency information etc., and all subcorpora and distributions will be totally messed up.
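
One way to tell which of the two figures is stale would be to count the tokens in the source text independently and compare. Below is a minimal sketch in Python, assuming the corpus was encoded from a vertical (one-token-per-line) file, that every line starting with `<` is an XML tag rather than a token, and that blank lines were skipped at encoding time. The function names and the sample data are hypothetical and not part of CWB or CQPweb; if your input contains tokens that begin with `<`, the count will be too low.

```python
from typing import Iterable


def count_vrt_tokens(lines: Iterable[str]) -> int:
    """Count token lines in vertical (one-token-per-line) text.

    Assumption: lines starting with '<' are XML tags and blank lines
    were skipped when encoding, so neither counts as a token.
    """
    count = 0
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("<"):
            continue
        count += 1
    return count


def size_gap(cqp_size: int, cqpweb_size: int) -> int:
    """Return how many more tokens CQP reports than CQPweb."""
    return cqp_size - cqpweb_size


# Hypothetical sample input, only to illustrate the counting rule.
sample = ['<text id="t1">', "<s>", "The\tDT", "corpus\tNN", "</s>", "", "</text>"]
print(count_vrt_tokens(sample))         # 2
print(size_gap(2_076_963, 1_961_752))   # 115211
```

For a real check you would pass an open file handle for the source file instead of `sample`. Whichever reported size disagrees with the independent count is the one based on outdated data.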

Cheers,
Stefan




