[CWB] Difference in token number between CQP and CQPweb

Wed Feb 19 16:06:16 CET 2014

Hi Hannah,

If you explicitly deleted the corpus before reinstalling then nothing should have been left over.

Looks as if the bug persists. 

I will investigate further.

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hannah Kermes
Sent: 19 February 2014 10:10
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Difference in token number between CQP and CQPweb

I hate to spoil the party, but I shortened the text_ids (to a max of 20 chars) of one of the problematic corpora (both in the metadata table and in the CQP corpus) and re-installed the corpus, and the problem stayed the same: still the same wrong token numbers.
Do I have to delete some other kind of memory?
We also updated CQPweb (or rather installed it from scratch, because something went wrong with the update) and tried to import as much as possible. We managed to import the groups, but unfortunately not the users. Could that have spoiled something?

Best
Hannah

PS: Is there any documentation of the new "privileges"? And do we have to set something up for the automatic user registration? (I ran a test and still haven't received any email to respond to.)

On 18.02.2014 19:26, Stefan Evert wrote:
>> Do people want really, really long handles? It seems like the sort of thing liable to provoke user complaints ("I can't see my data cos the text IDs have pushed it to the right") but if people want it I am happy to comply......
> Yes, for some of our corpora the kwic display already looks quite awkward because of the long text IDs.
>
> Long IDs are usually created automatically by taking the human-readable annotations and transforming them into ASCII + underscores.  Here are some (extreme) examples of text IDs in my corpus:
>
> 	film_1965_those_magnificent_men_in_their_flying_machines_or_how_i_flew_from_london_to_paris_in_25_hours_11_minutes
> 	film_2006_borat_cultural_learnings_of_america_for_make_benefit_glorious_nation_of_kazakhstan
> 	film_2010_the_41_year_old_virgin_who_knocked_up_sarah_marshall_and_felt_superbad_about_it
>
> Shortening those IDs isn't trivial because it might easily create duplicates.  And it's nice to have meaningful IDs in the kwic display for this corpus rather than just a cryptic "t048310" (which would be the obvious alternative).
>
>> I am open to going with the consensus on this.
> PRO limits on ID lengths:
>
>   - not really a problem if the limits are documented (in 10-foot-high, flaming letters, preferably) and one has to pre-process the corpus anyway
>
> CON:
>
>   - still a hassle
>   - major incompatibility with virtually all corpus standards such as TEI, XML, ... that put no arbitrary restrictions on IDs
>   - especially the 20-char limits on category IDs are unpleasantly tight and could easily be exceeded by automatically generated IDs; can we at least increase them to 30 chars or so?
>
>
>> Stefan - re file check, yes I can see that.
>> On the list.
> Thanks, that's really important.
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
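
As an illustration of the point quoted above, that shortening long IDs can easily create duplicates, here is a minimal sketch of one way to truncate IDs to a fixed length while keeping them unique. It is not part of CWB or CQPweb. The function name shorten_ids, the digest suffix and the default of 20 characters (the limit mentioned in this thread) are assumptions made for the example.

```python
import hashlib


def shorten_ids(ids, max_len=20, digest_len=6):
    """Map each ID to a handle of at most max_len characters.

    Hypothetical helper, not a CWB/CQPweb function. IDs that already fit
    are kept; longer ones are cut and given a short hash suffix so that
    IDs sharing a long common prefix do not collapse into one handle.
    """
    if max_len < digest_len + 2:
        raise ValueError("max_len is too small for the chosen digest_len")

    mapping = {}
    used = set()
    for original in ids:
        if original in mapping:
            continue  # the same input ID was already handled
        candidate = original
        salt = 0
        # Re-derive the handle until it fits and is not taken yet.
        while len(candidate) > max_len or candidate in used:
            data = f"{original}#{salt}".encode("utf-8")
            digest = hashlib.sha1(data).hexdigest()[:digest_len]
            candidate = original[: max_len - digest_len - 1] + "_" + digest
            salt += 1
        mapping[original] = candidate
        used.add(candidate)
    return mapping


example = shorten_ids([
    "film_2006_borat_cultural_learnings_of_america_for_make_benefit_glorious_nation_of_kazakhstan",
    "film_2010_the_41_year_old_virgin_who_knocked_up_sarah_marshall_and_felt_superbad_about_it",
    "t048310",
])
for long_id, short_id in example.items():
    print(short_id, "<-", long_id)
```

The resulting handles are unique but less readable than the full IDs, which is the trade-off discussed above. The mapping would also have to be applied consistently to the metadata table and to the corpus itself.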

-- 
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universität des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbrücken
phone: +49-(0)681-302-70077
