[CWB] Difference in token number between CQP and CQPweb

Stefan Evert stefanML at collocations.de
Sat Feb 15 10:24:19 CET 2014


Now you really made me hurt my laptop. :-}

I did the same for my problematic corpus (which is encoded in UTF-8, by the by, so it's not an encoding issue) and got:

> Your query “[]” returned 98,511,777 matches in 9,802 different texts (in 96,982,906 words [9,802 texts]; frequency: 1015764.33 instances per million words)


This is still consistent with having an old version of the MySQL frequency databases, but that shouldn't happen if Hannah reinstalled the corpus in CQPweb, should it?


In my case, counts seem to be off in weird ways.  For instance, searching for all tokens in texts with category "Romance" (using a CQP query, so I'm directly accessing the corpus metadata) and then doing a frequency distribution, I get:

> Romance	 7015012	 6912129	 665 out of 680	 985333.88


So CQPweb thinks there are _more_ texts and tokens in this category than the query has found (while the overall token count is too _low_).  The query doesn't have any matches in texts from other categories, either.  In CQP, I do get 680 different texts in the category, too.  So for some reason, the bridge between the CWB corpus and CQPweb's frequency databases is seriously broken.
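
As a quick sanity check on the figures quoted above, here is a minimal Python sketch that recomputes the "instances per million words" values and flags the impossible case where a single-token query returns more hits than the word count it is compared against. The function names and the `reports` table are my own illustration, not anything CQPweb provides; the numbers are copied from the output quoted in this message.

```python
# Sanity check for CQPweb's reported figures (illustrative sketch only;
# per_million and is_plausible are hypothetical helper names).

def per_million(hits, words):
    """Relative frequency in instances per million words."""
    return hits / words * 1_000_000

def is_plausible(hits, words):
    """A query matching single tokens cannot have more hits than words."""
    return hits <= words

# (hits, words) pairs copied from the CQPweb output quoted above
reports = {
    "whole corpus, query []": (98_511_777, 96_982_906),
    "Romance, all tokens": (6_912_129, 7_015_012),
}

for label, (hits, words) in reports.items():
    status = "ok" if is_plausible(hits, words) else "INCONSISTENT"
    print(f"{label}: {per_million(hits, words):.2f} per million ({status})")
```

The whole-corpus line is flagged as inconsistent because the query finds more tokens than the frequency database's word count, while the Romance line passes this particular check even though its word and text counts are too high.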


If I query just the first token from each text with

	<text_genre = "Romance"> []

I get 680 matches in CQPweb, and CQPweb realizes these are all different texts:

> Your query “<text_genre = "Romance"> []” returned 680 matches in 680 different texts (in 96,982,906 words [9,802 texts]; frequency: 7.01 instances per million words)

but the metadata distribution still shows only 665 hits in 665 texts:

> Romance	 7015012	 665	 665 out of 680	 94.8


I'm sorry I don't have time right now to investigate further and/or re-install the corpus in order to see whether this is a reproducible problem or a temporary glitch.

Best,
Stefan




On 14 Feb 2014, at 16:26, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:

> 
> It seems that for 10 (out of 310) texts, the word count is wrong.
> I simply looked for all tokens ("[]") and made a frequency distribution across texts.
> The result was:
> Your query “[]” returned 2,076,963 matches in 310 different texts (in 1,961,752 words [310 texts]; frequency: 1058728.63 instances per million words).


