[CWB] Difference in token number between CQP and CQPweb

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Feb 19 04:40:51 CET 2014


OK, so would everybody be on board with an increase to, say, 100 bytes for handles? Or should I just go all the way to 767 bytes (which is the maximum length for an indexed column in InnoDB, the new default table format in MySQL)?

Andrew.

(PS - Note, Hannah, that re "Besides, we have text metadata categories which can easily exceed the number of chars such as journal titles, etc.": handles will still have to be a C-word, that is, /^[a-zA-Z_][a-zA-Z0-9_]*$/, so you will never be able to use an *unprocessed* journal title as a text metadata category handle. Extended descriptions for categories can already be up to 255 characters and are not restricted to being alphanumeric. The preferred approach is to use short codes for handles and, if long, descriptive values are required, to load them as descriptions.)
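
To make the C-word restriction concrete, here is a minimal Python sketch of a pre-processing check. It is not CQPweb code: the function names is_valid_handle and to_handle are hypothetical, and the 100-character default is only the figure floated above, not a confirmed limit.

```python
import re

# The C-word pattern quoted above. fullmatch() supplies the anchoring,
# so a trailing newline is rejected as well.
C_WORD = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_handle(handle, max_len=100):
    """Return True if handle is a C-word of at most max_len characters.

    max_len=100 is an assumption (the proposed limit, not a confirmed one).
    For a C-word, characters and bytes are the same count (ASCII only).
    """
    return len(handle) <= max_len and C_WORD.fullmatch(handle) is not None


def to_handle(text):
    """Hypothetical helper: squash arbitrary text into a C-word.

    Every run of characters outside [a-zA-Z0-9] becomes one underscore,
    so non-ASCII letters are lost; the result is lower-cased.
    """
    handle = re.sub(r"[^a-zA-Z0-9]+", "_", text).strip("_").lower()
    # A C-word may not be empty or start with a digit.
    if not handle or handle[0].isdigit():
        handle = "_" + handle
    return handle
```

A journal title would fail is_valid_handle() as it stands (spaces, punctuation), whereas the output of to_handle() passes the pattern check. Note that to_handle() does not guard against two different titles collapsing into the same handle, nor against exceeding the length limit; both still need checking during pre-processing.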

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hannah Kermes
Sent: 18 February 2014 22:33
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Difference in token number between CQP and CQPweb

First of all, thanks to you two for coming up with the solution.

And I can only underline Stefan's response.
We use automatically generated IDs as well. At the moment these are usually AuthorYear combinations, sometimes titles. I don't think we can top the ID length of Stefan's, but we definitely exceed 20 chars sometimes.
Besides, we have text metadata categories which can easily exceed the number of chars such as journal titles, etc.
It would be possible to limit them to the given char number, but I also would prefer not to do so. It might make lines long, but it makes them more meaningful.

Best
Hannah
On 18.02.2014 19:26, Stefan Evert wrote:
>> Do people want really, really long handles? It seems like the sort of thing liable to provoke user complaints ("I can't see my data cos the text IDs have pushed it to the right") but if people want it I am happy to comply......
> Yes, for some of our corpora the kwic display already looks quite awkward because of the long text IDs.
>
> Long IDs are usually created automatically by taking the human-readable annotations and transforming them into ASCII + underscores.  Here are some (extreme) examples of text IDs in my corpus:
>
> 	film_1965_those_magnificent_men_in_their_flying_machines_or_how_i_flew_from_london_to_paris_in_25_hours_11_minutes
> 	film_2006_borat_cultural_learnings_of_america_for_make_benefit_glorious_nation_of_kazakhstan
> 	film_2010_the_41_year_old_virgin_who_knocked_up_sarah_marshall_and_felt_superbad_about_it
>
> Shortening those IDs isn't trivial because it might easily create duplicates.  And it's nice to have meaningful IDs in the kwic display for this corpus rather than just a cryptic "t048310" (which would be the obvious alternative).
>
>> I am open to going with the consensus on this.
> PRO limits on ID lengths:
>
>   - not really a problem if the limits are documented (in 10-foot-high, flaming letters, preferably) and one has to pre-process the corpus anyway
>
> CON:
>
>   - still a hassle
>   - major incompatibility with virtually all corpus standards such as TEI, XML, ... that put no arbitrary restrictions on IDs
>   - especially the 20-char limits on category IDs are unpleasantly tight and could easily be exceeded by automatically generated IDs; can we at least increase them to 30 chars or so?
>
>
>> Stefan - re file check, yes I can see that.
>> On the list.
> Thanks, that's really important.
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universität des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbrücken
phone: +49-(0)681-302-70077
