[CWB] Difference in token number between CQP and CQPweb

Hannah Kermes h.kermes at mx.uni-saarland.de
Tue Feb 18 23:33:11 CET 2014


First of all, thanks to you two for coming up with the solution.

And I can only underline Stefan's response.
We use automatically generated IDs as well.  At the moment these are usually 
AuthorYear combinations, sometimes titles. I don't think we can top the 
length of Stefan's IDs, but we definitely exceed 20 chars sometimes.
Besides, we have text metadata categories, such as journal titles, which 
can easily exceed that number of chars.
It would be possible to limit them to the given number of chars, but I 
would also prefer not to do so. Long IDs might make lines long, but they 
make them more meaningful.
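
To illustrate why cutting IDs down to a fixed length is risky, here is a 
minimal Python sketch. It assumes IDs are built by reducing a label to 
lowercase ASCII plus underscores and then truncated to 20 chars. The 
function names and the sample labels are hypothetical and are not part of 
CWB or CQPweb.

```python
import re
import unicodedata
from collections import defaultdict


def make_text_id(label):
    """Reduce a human-readable label to lowercase ASCII + underscores."""
    # Split accented characters into base letter + combining mark,
    # then drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", label)
    ascii_label = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_label.lower()).strip("_")


def truncation_collisions(ids, limit):
    """Return {truncated ID: [full IDs]} for IDs that clash after cutting."""
    groups = defaultdict(list)
    for text_id in ids:
        groups[text_id[:limit]].append(text_id)
    return {short: full for short, full in groups.items() if len(full) > 1}


# Hypothetical metadata values, for illustration only.
labels = [
    "Journal of Computational Linguistics",
    "Journal of Computational Biology",
    "Corpus Notes",
]
ids = [make_text_id(label) for label in labels]
collisions = truncation_collisions(ids, 20)

for short, full in collisions.items():
    print(short, "<-", full)
```

With these sample labels the two journal titles become indistinguishable 
once cut to 20 chars, whereas a limit of 30 chars keeps them apart.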

Best
Hannah
On 18.02.2014 19:26, Stefan Evert wrote:
>> Do people want really, really long handles? It seems like the sort of thing liable to provoke user complaints ("I can't see my data cos the text IDs have pushed it to the right") but if people want it I am happy to comply......
> Yes, for some of our corpora the kwic display already looks quite awkward because of the long text IDs.
>
> Long IDs are usually created automatically by taking the human-readable annotations and transforming them into ASCII + underscores.  Here are some (extreme) examples of text IDs in my corpus:
>
> 	film_1965_those_magnificent_men_in_their_flying_machines_or_how_i_flew_from_london_to_paris_in_25_hours_11_minutes
> 	film_2006_borat_cultural_learnings_of_america_for_make_benefit_glorious_nation_of_kazakhstan
> 	film_2010_the_41_year_old_virgin_who_knocked_up_sarah_marshall_and_felt_superbad_about_it
>
> Shortening those IDs isn't trivial because it might easily create duplicates.  And it's nice to have meaningful IDs in the kwic display for this corpus rather than just a cryptic "t048310" (which would be the obvious alternative).
>
>> I am open to going with the consensus on this.
> PRO limits on ID lengths:
>
>   - not really a problem if the limits are documented (in 10-foot-high, flaming letters, preferably) and one has to pre-process the corpus anyway
>
> CON:
>
>   - still a hassle
>   - major incompatibility with virtually all corpus standards such as TEI, XML, ... that put no arbitrary restrictions on IDs
>   - especially the 20-char limits on category IDs are unpleasantly tight and could easily be exceeded by automatically generated IDs; can we at least increase them to 30 chars or so?
>
>
>> Stefan - re file check, yes I can see that.
>> On the list.
> Thanks, that's really important.
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universität des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbrücken
phone: +49-(0)681-302-70077


