[CWB] Encoding problem in CWB version 3.2.4

Tue Sep 28 10:53:28 CEST 2010

Peter - your output proves it: the cwb-encode input strings are *definitely* being validated with the \n still on the end. We need to allow 0xd and 0xa alongside 0x9 as exceptions in INVALID_CTRL, and that should fix this along with the bug Tomaz initially reported.
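
A minimal sketch of the proposed change, for illustration only. The first macro is the one quoted further down in this thread (with extra parentheses); INVALID_CTRL_RELAXED and string_is_clean() are hypothetical names invented here, not part of the CWB source, and the actual fix in special-chars may look different.

```c
#include <stddef.h>

/* Check as quoted in this thread: any byte below 0x20 except TAB is invalid. */
#define INVALID_CTRL(c) ((c) < 0x20 && (c) != 0x09)

/* Hypothetical relaxed variant: LF (0x0a) and CR (0x0d) are also allowed. */
#define INVALID_CTRL_RELAXED(c) \
  ((c) < 0x20 && (c) != 0x09 && (c) != 0x0a && (c) != 0x0d)

/* Returns 1 if no byte of s is flagged as an invalid control character,
 * 0 otherwise. relaxed != 0 selects the relaxed macro. Bytes are read as
 * unsigned char so that values above 0x7f are not mistaken for small ones. */
int string_is_clean(const char *s, int relaxed)
{
  const unsigned char *p;
  for (p = (const unsigned char *)s; *p != '\0'; p++) {
    if (relaxed ? INVALID_CTRL_RELAXED(*p) : INVALID_CTRL(*p))
      return 0;
  }
  return 1;
}
```

With a query string such as "\"kaj\";\n", string_is_clean(..., 0) returns 0 because of the trailing line break, while string_is_clean(..., 1) returns 1.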

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Peter Ljunglöf
Sent: 28 September 2010 09:29
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Encoding problem in CWB version 3.2.4

On 28 Sep 2010, at 10:18, Hardie, Andrew wrote:

> Oh that's good, because I think I know what is causing the first bug,
> though I haven't tested. 
> 
> @Stefan - it's probably the INVALID_CTRL macro in special-chars. It
> treats anything under 0x20 as invalid, which I think you put in because
> you can't have \n or \r in CWB tokens. 
> 
> #define INVALID_CTRL(c) (c < 0x20 && c != 0x09)
> 
> BUT when that same function is used to validate query strings it will
> fail - they have a \n on the end because the user just pressed ENTER. I
> *think* that's it, anyway. The parser code is a source of immense and
> continuing mystery to me.

That sounds reasonable! If I try to create a Latin-1 corpus, I have to use the -C option, and then every token ends with a "?", like this:

TEST> [];
        0:                           <Hi?> my? name? is? Peter? 
        1:                       Hi? <my?> name? is? Peter? 
        2:                   Hi? my? <name?> is? Peter? 
        3:             Hi? my? name? <is?> Peter? 
        4:         Hi? my? name? is? <Peter?> 

/Peter
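
The output above is what one would expect if each input line is checked with its line break still attached and every rejected byte is turned into "?". A rough sketch of that reading and of the alternative of trimming first, assuming that this is how the clean-up step behaves; chomp_line() and replace_invalid_ctrl() are hypothetical helpers written for this illustration, not functions from cwb-encode.

```c
#include <string.h>

/* Hypothetical helper: strips trailing LF/CR characters in place and
 * returns the new length of the line. */
size_t chomp_line(char *line)
{
  size_t n = strlen(line);
  while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == '\r'))
    line[--n] = '\0';
  return n;
}

/* Hypothetical stand-in for the clean-up step: replaces every control
 * byte other than TAB with '?' and returns the number of replacements. */
int replace_invalid_ctrl(char *s)
{
  int replaced = 0;
  unsigned char *p;
  for (p = (unsigned char *)s; *p != '\0'; p++) {
    if (*p < 0x20 && *p != 0x09) {
      *p = '?';
      replaced++;
    }
  }
  return replaced;
}
```

Applied directly to "Hi\n", replace_invalid_ctrl() yields "Hi?"; calling chomp_line() first leaves "Hi" with nothing to replace.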

> (Also, I believe I made cwb-encode run the validation check on entire
> input lines, not on individual tokens - so that may also now be a bug if
> the line break is not trimmed off before that happens.)
> 
> Andrew.
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
> On Behalf Of Stefan Evert
> Sent: 28 September 2010 09:06
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Encoding problem in CWB version 3.2.4
> 
> Hi Tomaz! (<-- ASCII, to avoid validation problems ;-)
> 
>> this is probably some silly error, but I can't figure out where I made
>> it. I tried using cwb-3.2.4 and made the corpus ok. The corpus is in
>> utf-8 and I used cwb-encode -c utf8
>> But when I try using it:
>> 
>> [tomaz at mantra ~]$ cqp
>> [no corpus]> JOS100K-EN;
>> JOS100K-EN> "kaj";
>> CQP Error:
>>       Query includes a character or character sequence that is invalid
>> in the encoding specified for this corpus
>> JOS100K-EN>
>> 
>> which is strange, and the query uses only ASCII. Any hints?
> 
> This is quite mysterious to me, too.  Did you check that the declared
> encoding in the registry file is "utf8"?
> 
> The only possible explanation I can imagine is that CQP thinks it's an
> 8-bit encoding and somehow ASCII control characters end up in the string
> (perhaps some problem with your terminal?) -- at least the newest
> versions of the CWB check aggressively for invalid control chars.
> 
> If you're really working with UTF-8, then GLib thinks your string isn't
> valid -- which GLib version do you use and how did you install it?
> 
> Of course, Andrew probably has a much better idea about what the cause
> of the problem could be ...
> 
>> Also, I noticed that if no corpus is selected, cqp dies:
>> 
>> [tomaz at mantra ~]$ cqp
>> [no corpus]> "kdo";
>> CQP Error:
>>       No corpus activated
>> Segmentation fault
>> [tomaz at mantra ~]$
> 
> I can reproduce this bug on Mac OS X, fixed now in the SVN.
> 
> @Andrew: it was a simple mistake in prepare_Query(), where these lines
> 
>>  /* validate character encoding according to that corpus, now we know it's loaded */
>>  if (!cl_string_validate_encoding(QueryBuffer, current_corpus->corpus->charset, 0)) {
>>    cqpmessage(Error, "Query includes a character or character sequence that is invalid\n"
>>        "in the encoding specified for this corpus");
>>    generate_code = 0;
>>  }
>> 
> 
> should go into the "if (generate_code)" block, otherwise they'll try to
> obtain the charset current_corpus->corpus->charset from a NULL pointer
> if no corpus has been activated yet.
> 
> (Yes, I admit that the whole structure of that function, with
> generate_code=0 etc., is totally weird. It wasn't my idea. :-)
> 
> Best wishes,
> Stefan
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

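The ordering problem in prepare_Query() described in the quoted message can be pictured with a simplified sketch. Everything here is a hypothetical stand-in (the two struct types, validate_encoding() and prepare_query_sketch()); only the names current_corpus and generate_code are taken from the thread. The point is that the charset is read only inside the "if (generate_code)" block, after the NULL check has had a chance to clear the flag.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for CQP's data structures. */
typedef struct { int charset; } Corpus;
typedef struct { Corpus *corpus; } CorpusList;

CorpusList *current_corpus = NULL;  /* NULL until a corpus is activated */
int generate_code = 1;

/* Hypothetical stub validator: rejects control bytes other than TAB, LF
 * and CR. The charset argument is ignored in this sketch. */
int validate_encoding(const char *s, int charset)
{
  const unsigned char *p;
  (void)charset;
  for (p = (const unsigned char *)s; *p != '\0'; p++) {
    if (*p < 0x20 && *p != 0x09 && *p != 0x0a && *p != 0x0d)
      return 0;
  }
  return 1;
}

/* Returns the final value of generate_code for the given query. */
int prepare_query_sketch(const char *query)
{
  generate_code = 1;
  if (current_corpus == NULL) {
    /* corresponds to the "No corpus activated" error */
    generate_code = 0;
  }
  if (generate_code) {
    /* safe: current_corpus is known to be non-NULL here */
    if (!validate_encoding(query, current_corpus->corpus->charset))
      generate_code = 0;
  }
  return generate_code;
}
```

With no corpus activated, prepare_query_sketch() returns 0 without touching current_corpus->corpus, which is the dereference that caused the segmentation fault.
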
________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb