[CWB] Difference in token number between CQP and CQPweb

Tue Feb 18 18:09:37 CET 2014

Hi everyone!

Re-installing the problematic corpus in CQPweb didn't make any difference: lots of texts and tokens are still missing.  My corpus is also encoded in UTF-8, so the character encoding cannot be the issue here.

Is there an upper limit on the length of text IDs in CQPweb?  My corpus has long ID strings containing up to 114 characters.  Truncating them internally shouldn't create many duplicates (if any), but if CQPweb somehow fails to match long IDs between the corpus and the metadata table, that would explain everything.

A look at the MySQL database confirms this suspicion: the table text_metadata_for_subtitles_en has 9802 entries covering all texts, but after the begin/end offset positions have been created, the offsets of 156 entries remain at 0.  The IDs of these entries have been truncated in the MySQL database.
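
To illustrate the kind of mismatch I mean, here is a minimal Python sketch.  It is not CQPweb code: the function name is made up, and the default limit of 50 characters is simply my reading of the text ID column size.  It assumes the database keeps only the first max_len characters of each ID and then reports which corpus IDs can no longer be found among the stored ones.

```python
def unmatched_text_ids(corpus_ids, max_len=50):
    """Return corpus text IDs that would fail to match a length-limited column.

    Hypothetical helper for illustration only; max_len is an assumed limit.
    """
    # What a column limited to max_len characters would keep of each ID.
    stored = {text_id[:max_len] for text_id in corpus_ids}
    # An ID is "lost" if its full form is not among the stored strings.
    return [text_id for text_id in corpus_ids if text_id not in stored]


if __name__ == "__main__":
    ids = ["short_id", "x" * 114]
    for lost in unmatched_text_ids(ids):
        print(f"no match after truncation: {lost[:20]}... ({len(lost)} chars)")
```

Running something like this over the list of text IDs in the corpus should point to the same texts whose offsets stay at 0, if truncation is indeed the cause.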

Andrew, has this problem been fixed in a more recent CQPweb release or is it on the list of known bugs?  Is there a good reason to enforce the size limits both on text IDs (50 chars) and category IDs (20 chars) in the metadata?  If so, and if you do not plan to allow longer IDs, I think CQPweb should check this and throw an error when installing the corpus, rather than producing incorrect results later on.
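
The kind of check I have in mind could look roughly like the following Python sketch.  CQPweb itself is written in PHP, so this is only meant to show the logic; the function name and parameters are hypothetical, and the limits are the 50 and 20 characters mentioned above.

```python
def check_metadata_ids(text_ids, category_ids, max_text_len=50, max_cat_len=20):
    """Raise ValueError if any ID exceeds the assumed metadata size limits.

    Hypothetical pre-installation check, not part of CQPweb.
    """
    too_long_texts = [t for t in text_ids if len(t) > max_text_len]
    too_long_cats = [c for c in category_ids if len(c) > max_cat_len]
    if too_long_texts or too_long_cats:
        raise ValueError(
            f"{len(too_long_texts)} text ID(s) exceed {max_text_len} characters; "
            f"{len(too_long_cats)} category ID(s) exceed {max_cat_len} characters"
        )


if __name__ == "__main__":
    try:
        check_metadata_ids(["x" * 114, "ok_id"], ["genre"])
    except ValueError as err:
        print("refusing to install corpus:", err)
```

Failing at this point would make the problem visible at installation time instead of showing up later as missing texts and tokens.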

On 14 Feb 2014, at 16:26, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:

> Hi Andrew,
> 
> It seems that for 10 (out of 310) texts, the word count is wrong.
> I simply looked for all tokens ("[]") and made a frequency distribution across texts.
> The result was:
> Your query “[]” returned 2,076,963 matches in 310 different texts (in 1,961,752 words [310 texts]; frequency: 1058728.63 instances per million words).