[CWB] Difference in token number between CQP and CQPweb

Stefan Evert stefanML at collocations.de
Tue Feb 18 19:26:31 CET 2014


> Do people want really, really long handles? It seems like the sort of thing liable to provoke user complaints ("I can't see my data cos the text IDs have pushed it to the right") but if people want it I am happy to comply......

Yes, for some of our corpora the kwic display already looks quite awkward because of the long text IDs.

Long IDs are usually created automatically by taking the human-readable annotations and transforming them into ASCII + underscores.  Here are some (extreme) examples of text IDs in my corpus:

	film_1965_those_magnificent_men_in_their_flying_machines_or_how_i_flew_from_london_to_paris_in_25_hours_11_minutes
	film_2006_borat_cultural_learnings_of_america_for_make_benefit_glorious_nation_of_kazakhstan
	film_2010_the_41_year_old_virgin_who_knocked_up_sarah_marshall_and_felt_superbad_about_it
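
For illustration, the kind of automatic transformation described above might look roughly like the following Python sketch. The function name `make_text_id` and the category/year/title layout are assumptions made for this example; this is not the actual script used to build the corpus.

```python
import re
import unicodedata


def make_text_id(category, year, title):
    """Build an ASCII-and-underscore text ID from human-readable metadata.

    Hypothetical helper: the name and the category_year_title layout are
    assumptions for this sketch, not the real corpus pipeline.
    """
    # Decompose accented characters, then drop everything outside ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_title.lower()).strip("_")
    return f"{category}_{year}_{slug}"


print(make_text_id(
    "film", 2010,
    "The 41-Year-Old Virgin Who Knocked Up Sarah Marshall "
    "and Felt Superbad About It",
))
```

Nothing in such a transformation bounds the length of the result, which is why long titles lead directly to long IDs.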

Shortening those IDs isn't trivial because it might easily create duplicates.  And it's nice to have meaningful IDs in the kwic display for this corpus rather than just a cryptic "t048310" (which would be the obvious alternative).
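
To make the duplicate problem concrete, here is a small Python sketch that reports which IDs would collide if they were simply truncated to a fixed length. The function name `find_truncation_clashes`, the limit of 30 characters, and the first two sample IDs are made up for this example and do not come from the corpus.

```python
from collections import defaultdict


def find_truncation_clashes(text_ids, max_len):
    """Map each truncated ID to the original IDs that collapse onto it.

    Only truncated forms shared by two or more originals are returned.
    """
    groups = defaultdict(list)
    for text_id in text_ids:
        groups[text_id[:max_len]].append(text_id)
    return {short: full for short, full in groups.items() if len(full) > 1}


# Hypothetical IDs, invented for illustration only.
ids = [
    "film_1999_star_wars_episode_i_the_phantom_menace",
    "film_1999_star_wars_episode_i_the_phantom_menace_making_of",
    "film_2006_borat",
]
for short, originals in find_truncation_clashes(ids, max_len=30).items():
    print(short, "<-", originals)
```

Any shortening scheme would need a check along these lines, plus some rule for disambiguating the clashes it finds.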

> I am open to going with the consensus on this.

PRO limits on ID lengths:

 - not really a problem if the limits are documented (in 10-foot-high, flaming letters, preferably) and one has to pre-process the corpus anyway

CON:

 - still a hassle
 - major incompatibility with virtually all corpus standards such as TEI, XML, ... that put no arbitrary restrictions on IDs
 - especially the 20-char limits on category IDs are unpleasantly tight and could easily be exceeded by automatically generated IDs; can we at least increase them to 30 chars or so?


> Stefan - re file check, yes I can see that.
> On the list.

Thanks, that's really important.

Best,
Stefan

