[CWB] Difference in token number between CQP and CQPweb
Stefan Evert
stefanML at collocations.de
Fri Feb 14 13:49:04 CET 2014
On 14 Feb 2014, at 12:07, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:
> I just realized a difference in the token numbers between CQP and CQPweb.
> The encoded corpus in CQPweb is a copy of the CQP corpus. The encoding has been performed with CQP on the command line and has been installed in CQPweb as an encoded corpus.
>
> Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)
>
> The difference is also present if you look at subcorpora.
Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed. Everything else is fine.
Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?
Otherwise, the only explanation I can think of is that you re-encoded the CWB corpus, changing its size, and forgot to re-install it in CQPweb. In that case CQPweb still has the old frequency information etc., and all subcorpora and distributions will be totally messed up.
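One quick way to test the re-encoding theory is to check how many tokens the CWB data on disk actually contains and compare that with what CQPweb reports. The following is only a rough sketch, assuming the uncompressed token stream of the default attribute (word.corpus, 4 bytes per token) is still present in the data directory, i.e. it has not been removed after compression with cwb-huffcode. The function names and the example path are made up for illustration.

```python
import os


def token_count_from_corpus_file(path):
    """Return the token count implied by an uncompressed CWB token stream.

    Assumption: the file stores one 4-byte integer per corpus position,
    so its size in bytes divided by 4 is the number of tokens.
    """
    size = os.path.getsize(path)
    if size % 4 != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of 4 bytes")
    return size // 4


def report_mismatch(cwb_tokens, cqpweb_tokens):
    """Return the absolute difference and the difference relative to the CWB count."""
    diff = cwb_tokens - cqpweb_tokens
    return diff, diff / cwb_tokens


if __name__ == "__main__":
    # Hypothetical location of the data directory; adjust to your installation:
    # cwb_tokens = token_count_from_corpus_file("/corpora/data/mycorpus/word.corpus")

    # Token numbers quoted in this thread (CQP vs. CQPweb).
    diff, rel = report_mismatch(2_076_963, 1_961_752)
    print(f"difference: {diff} tokens ({rel:.1%} of the CWB corpus)")
```

If the count derived from the data directory matches CQP but not CQPweb, the CQPweb metadata is probably stale and the corpus needs to be re-installed there.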
Cheers,
Stefan