[CWB] Difference in token number between CQP and CQPweb

Tue Feb 18 18:48:46 CET 2014

Aaaahhhhhh, that explains it.

Text IDs are indeed limited to 50 chars. All other handles are limited to 20: corpus SQL/CWB IDs, p/s-attribute identifiers, metadata field handles, and metadata category values.

This was part of the change I was working towards over the New Year: to limit handles to 20 chars, but allow descriptive text snippets of up to 255 chars. This replaced the rather ad hoc field widths that had accumulated over time.

Handles are limited for two reasons. First, to make the display at least somewhat predictable (e.g. text IDs appear in the first column of the concordance table). Second, because they are repeated in cross-table references (I was too witless to use primary/foreign key integer fields when I started writing CQPweb): every time a text references a metadata category, that's X bytes of handle string, and I was trying to keep that down in size.

But I wasn't committed to that; in fact I thought that 20 was a bit mean, but I stuck with it out of inertia.

Do people want really, really long handles? It seems like the sort of thing liable to provoke user complaints ("I can't see my data cos the text IDs have pushed it to the right"), but if people want it, I am happy to comply.

I am open to going with the consensus on this.

Stefan - re file check, yes I can see that.

On the list.

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 18 February 2014 17:10
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Difference in token number between CQP and CQPweb

Hi everyone!

An attempt to re-install the problematic corpus in CQPweb didn't make any difference: still missing lots of texts and tokens.  My corpus is also encoded in UTF-8, so this cannot be an issue here.

Is there an upper limit on the length of text IDs in CQPweb?  My corpus has long ID strings containing up to 114 characters.  Truncating them internally shouldn't create many duplicates (if any), but if CQPweb somehow fails to match long IDs between the corpus and the metadata table, that would explain everything.

A look at the MySQL database confirms this suspicion: the table text_metadata_for_subtitles_en contains 9802 entries, one for each text, but after creating the begin/end offset positions, 156 entries still have offsets of 0.  The IDs of these entries have been truncated in the MySQL database.
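
The suspected failure mode can be illustrated with a minimal Python sketch. It assumes that the text ID column is 50 characters wide and that over-long values are silently truncated on insert (as MySQL does when strict mode is off). The function names and sample IDs are invented for illustration; this is not CQPweb code.

```python
# Hypothetical illustration of the ID mismatch; not CQPweb code.
TEXT_ID_LIMIT = 50  # width of the text ID column, as stated in this thread


def store_in_metadata_table(text_ids, limit=TEXT_ID_LIMIT):
    """Mimic a fixed-width string column that silently truncates."""
    return [text_id[:limit] for text_id in text_ids]


def unmatched_rows(corpus_ids, stored_ids):
    """Return stored IDs that no longer match any ID in the corpus."""
    corpus = set(corpus_ids)
    return [stored for stored in stored_ids if stored not in corpus]


# One short ID and one 114-character ID (the maximum length reported above).
corpus_ids = ["short_id", "x" * 114]
stored_ids = store_in_metadata_table(corpus_ids)
orphans = unmatched_rows(corpus_ids, stored_ids)

# The truncated ID cannot be matched back to the corpus, so its row
# would never receive begin/end offsets.
print(len(orphans), len(orphans[0]))
```

Under these assumptions, only the row with the long ID is left without a match, which is consistent with a small subset of entries keeping offsets of 0.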

Andrew, has this problem been fixed in a more recent CQPweb release or is it on the list of known bugs?  Is there a good reason to enforce the size limits both on text IDs (50 chars) and category IDs (20 chars) in the metadata?  If so, and if you do not plan to allow longer IDs, I think CQPweb should check this and throw an error when installing the corpus, rather than producing incorrect results later on.
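
An install-time check of this kind could look roughly like the Python sketch below. The limits (50 for text IDs, 20 for other handles) are the ones mentioned in this thread; the names `LIMITS`, `HandleTooLongError`, and `check_handles` are hypothetical and do not correspond to anything in CQPweb.

```python
# Hypothetical install-time validation; names are not from CQPweb.
LIMITS = {"text_id": 50, "handle": 20}  # limits as stated in this thread


class HandleTooLongError(ValueError):
    """Raised when an identifier exceeds the permitted length."""


def check_handles(values, kind):
    """Raise an error listing every value longer than the limit for `kind`."""
    limit = LIMITS[kind]
    too_long = [value for value in values if len(value) > limit]
    if too_long:
        raise HandleTooLongError(
            f"{len(too_long)} {kind} value(s) exceed {limit} characters, "
            f"e.g. {too_long[0]!r}"
        )


try:
    check_handles(["text_0001", "y" * 114], "text_id")
except HandleTooLongError as err:
    print(f"Installation aborted: {err}")
```

Failing early like this means the corpus is rejected with a clear message instead of being installed with truncated IDs.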

On 14 Feb 2014, at 16:26, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:

> Hi Andrew,
> 
> It seems that for 10 (out of 310) texts, the word count is wrong.
> I simply looked for all tokens ("[]") and made a frequency distribution across texts.
> The result was:
> Your query "[]" returned 2,076,963 matches in 310 different texts (in 1,961,752 words [310 texts]; frequency: 1058728.63 instances per million words).
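
The figures in this report can be checked with a few lines of Python. Since "[]" matches every token, the number of matches should not exceed the word count; here it does, which points to an undercounted word total. The variable names are illustrative only.

```python
# Figures copied from the CQPweb result line quoted above.
matches = 2_076_963  # hits for the query "[]"
words = 1_961_752    # word count reported by CQPweb for the 310 texts

per_million = matches / words * 1_000_000
surplus = matches - words  # tokens found by the query but missing from the count

print(f"{per_million:.2f} instances per million words")
print(f"{surplus} tokens unaccounted for")
```

A frequency above 1,000,000 per million words is the giveaway: the query found 115,211 more tokens than the reported word count contains.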
