[CWB] RE: Badly-formatted text ID codes

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Jan 16 14:21:37 CET 2012


No, because URL-encoding still produces non-word characters (such as % and +). Even if you rolled your own recoding that avoided those characters, many of the resulting values would still be too long, and truncating them could create duplicates. You need to add a proper ID attribute.
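
As a rough illustration of the first point, the Python sketch below URL-encodes a made-up URL and checks the result against the rule described further down in this thread (ASCII letters, numbers and underscore only). The pattern, the helper name is_valid_text_id and the sample URL are assumptions for illustration; this is not CQPweb's actual validation code.

```python
import re
from urllib.parse import quote

# Assumed rule, taken from the description in this thread:
# a text ID may contain only ASCII letters, digits and underscores.
VALID_ID = re.compile(r"[A-Za-z0-9_]+")


def is_valid_text_id(candidate: str) -> bool:
    """Return True if the whole string consists of allowed characters."""
    return VALID_ID.fullmatch(candidate) is not None


# Hypothetical URL of the kind used as a text ID in a web corpus.
url = "http://www.example.org/some page/index.html"

# Percent-encode every reserved character, including ':' and '/'.
encoded = quote(url, safe="")

print(encoded)                          # still contains '%' and '.'
print(is_valid_text_id(encoded))        # False
print(is_valid_text_id("text_000001"))  # True
```

The encoded string is also longer than the original URL, which is the second problem mentioned above.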

I can send you the script I wrote to do this in itWaC if you'd like.
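
For readers without that script, here is a minimal sketch of the same kind of rewrite. It is not the itWaC script itself. It assumes that each text in the input starts with a line of the form <text id="URL">, and the function name renumber_text_ids and the sample data are made up for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed input format: each text opens with a line such as
#   <text id="http://www.example.org/page.html">
TEXT_TAG = re.compile(r'<text id="([^"]*)"')


def renumber_text_ids(lines: Iterable[str], start: int = 1) -> Iterator[str]:
    """Move the URL into a url attribute and use a running number as the id."""
    counter = start
    for line in lines:
        match = TEXT_TAG.match(line)
        if match:
            # Keep the rest of the tag (other attributes, '>') unchanged.
            line = (
                f'<text id="{counter}" url="{match.group(1)}"'
                + line[match.end():]
            )
            counter += 1
        yield line


# Hypothetical sample input in a one-token-per-line layout.
sample = [
    '<text id="http://www.example.org/a_b/c.html">\n',
    'parola\tNOUN\tparola\n',
    '</text>\n',
    '<text id="http://www.example.org/d">\n',
    '</text>\n',
]

result = list(renumber_text_ids(sample))
print("".join(result), end="")
```

Because the function works line by line, a large corpus file can be streamed through it without loading everything into memory. The new url attribute would also need to be declared when the corpus is re-encoded.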

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Emiliano Guevara
Sent: 16 January 2012 13:15
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] RE: Badly-formatted text ID codes

What about a corpus-wide re-encoding of the URLs into ASCII-safe characters?

something like this...

http://www.albionresearch.com/misc/urlencode.php

E.



On Jan 16, 2012, at 13:34 PM, Eros Zanchetta wrote:

> OK, thanks for the tip!
> 
> Best,
> Eros
> 
> On Jan 16, 2012, at 1:21 PM, Hardie, Andrew wrote:
> 
>> There isn't one. Text IDs may contain only ASCII letters, numbers and underscores.
>> 
>> The "easy" way is to rename the attribute containing the URL to url="" and then add a separate id alongside it. When I installed itWaC, I just used numbers for the ids.
>> 
>> best
>> 
>> Andrew.
>> 
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Eros Zanchetta
>> Sent: 16 January 2012 12:11
>> To: Open source development of the Corpus WorkBench
>> Subject: [CWB] Badly-formatted text ID codes
>> 
>> Hi everyone,
>> 
>> I'm trying to install itWaC and deWaC on CQPweb, but I keep getting the following error when I click on "Create minimalist metadata table":
>> 
>> "The data source you specified for the text metadata contains badly-formatted text ID codes"
>> 
>> The text IDs of the corpus are URLs. The problem seems to be that CQPweb doesn't like underscores and slashes.
>> 
>> Can anyone suggest a workaround that doesn't include changing the text IDs?
>> 
>> Best,
>> Eros Zanchetta
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
