[CWB] Encoding problem in CWB version 3.2.4
Peter Ljunglöf
peter.ljunglof at gu.se
Tue Sep 28 10:29:20 CEST 2010
On 28 Sep 2010, at 10:18, Hardie, Andrew wrote:
> Oh that's good, because I think I know what is causing the first bug,
> though I haven't tested.
>
> @Stefan - it's probably the INVALID_CTRL macro in special-chars. It
> treats anything under 0x20 as invalid, which I think you put in because
> you can't have \n or \r in CWB tokens.
>
> #define INVALID_CTRL(c) (c < 0x20 && c != 0x09)
>
> BUT when that same function is used to validate query strings it will
> fail - they have a \n on the end because the user just pressed ENTER. I
> *think* that's it, anyway. The parser code is a source of immense and
> continuing mystery to me.
That sounds reasonable! If I try to create a Latin-1 corpus, I have to use the -C option, and then every token ends with a "?", like this:
TEST> [];
0: <Hi?> my? name? is? Peter?
1: Hi? <my?> name? is? Peter?
2: Hi? my? <name?> is? Peter?
3: Hi? my? name? <is?> Peter?
4: Hi? my? name? is? <Peter?>
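If Andrew's guess is right, the effect can be reproduced in isolation. The following is only a minimal sketch, not the actual CWB code: the macro is the one quoted above (with its argument parenthesised), while has_invalid_ctrl() and trim_line_break() are hypothetical helpers made up for illustration. It assumes the string being checked still carries its trailing line break.

```c
#include <string.h>

/* Macro as quoted in this thread; the argument is parenthesised here. */
#define INVALID_CTRL(c) ((c) < 0x20 && (c) != 0x09)

/* Hypothetical helper: returns 1 if s contains a rejected control character. */
static int has_invalid_ctrl(const char *s) {
  const unsigned char *p;
  for (p = (const unsigned char *)s; *p; p++)
    if (INVALID_CTRL(*p))
      return 1;
  return 0;
}

/* Hypothetical helper: strips trailing \n and \r in place. */
static void trim_line_break(char *s) {
  size_t n = strlen(s);
  while (n > 0 && (s[n - 1] == '\n' || s[n - 1] == '\r'))
    s[--n] = '\0';
}
```

With these helpers, a pure-ASCII query such as "\"kaj\";\n" is rejected by has_invalid_ctrl() as it stands and accepted once trim_line_break() has been called on it. That would fit both the query error and the "?" after every token, assuming -C replaces each rejected character with "?".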
/Peter
> (Also, I believe I made cwb-encode run the validation check on entire
> input lines, not on individual tokens - so that may also now be a bug if
> the line break is not trimmed off before that happens.)
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
> On Behalf Of Stefan Evert
> Sent: 28 September 2010 09:06
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Encoding problem in CWB version 3.2.4
>
> Hi Tomaz! (<-- ASCII, to avoid validation problems ;-)
>
>> this is probably some silly error, but I can't figure out where I made
>> it. I tried using cwb-3.2.4 and made the corpus ok. The corpus is in
>> utf-8 and I used cwb-encode -c utf8
>> But when I try using it:
>>
>> [tomaz at mantra ~]$ cqp
>> [no corpus]> JOS100K-EN;
>> JOS100K-EN> "kaj";
>> CQP Error:
>> Query includes a character or character sequence that is invalid
>> in the encoding specified for this corpus
>> JOS100K-EN>
>>
>> which is strange, and the query uses only ASCII. Any hints?
>
> This is quite mysterious to me, too. Did you check that the declared
> encoding in the registry file is "utf8"?
>
> The only possible explanation I can imagine is that CQP thinks it's an
> 8-bit encoding and somehow ASCII control characters end up in the string
> (perhaps some problem with your terminal?) -- at least the newest
> versions of the CWB check aggressively for invalid control chars.
>
> If you're really working with UTF-8, then GLib thinks your string isn't
> valid -- which GLib version do you use and how did you install it?
>
> Of course, Andrew probably has a much better idea about what the cause
> of the problem could be ...
>
>> Also, I noticed that if no corpus is selected, cqp dies:
>>
>> [tomaz at mantra ~]$ cqp
>> [no corpus]> "kdo";
>> CQP Error:
>> No corpus activated
>> Segmentation fault
>> [tomaz at mantra ~]$
>
> I can reproduce this bug on Mac OS X, fixed now in the SVN.
>
> @Andrew: it was a simple mistake in prepare_Query(), where these lines
>
>> /* validate character encoding according to that corpus, now we know it's loaded */
>> if (!cl_string_validate_encoding(QueryBuffer, current_corpus->corpus->charset, 0)) {
>>   cqpmessage(Error, "Query includes a character or character sequence that is invalid\n"
>>                     "in the encoding specified for this corpus");
>>   generate_code = 0;
>> }
>>
>
> should go into the "if (generate_code)" block, otherwise they'll try to
> obtain the charset current_corpus->corpus->charset from a NULL pointer
> if no corpus has been activated yet.
>
> (Yes, I admit that the whole structure of that function, with
> generate_code=0 etc., is totally weird. It wasn't my idea. :-)
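The fix Stefan describes can be sketched as follows. This is an illustration under stated assumptions, not the real prepare_Query(): the Corpus and CorpusList structs, validate_encoding() and query_ok() are all hypothetical stand-ins, and the stub validator simply accepts 7-bit ASCII. The only point is that the charset is read inside the "if (generate_code)" block, after the NULL check has had a chance to clear the flag.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the CQP structures named in the thread. */
typedef struct { int charset; } Corpus;
typedef struct { Corpus *corpus; } CorpusList;

/* Hypothetical validator stub: accepts only 7-bit ASCII, ignores charset. */
static int validate_encoding(const char *s, int charset) {
  (void)charset;
  for (; *s; s++)
    if ((unsigned char)*s > 0x7f)
      return 0;
  return 1;
}

/* Returns 1 if the query may be compiled, 0 otherwise.
   The charset is only dereferenced once the NULL checks have passed. */
static int query_ok(const CorpusList *current_corpus, const char *query) {
  int generate_code = 1;

  if (current_corpus == NULL || current_corpus->corpus == NULL)
    generate_code = 0;            /* the "No corpus activated" case */

  if (generate_code) {
    if (!validate_encoding(query, current_corpus->corpus->charset))
      generate_code = 0;          /* invalid in the corpus encoding */
  }
  return generate_code;
}
```

Calling query_ok(NULL, "\"kdo\";") returns 0 without touching the charset, which is the case that previously ended in a segmentation fault.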
>
> Best wishes,
> Stefan
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet