[CWB] Encoding problem in CWB version 3.2.4

Tue Sep 28 10:18:32 CEST 2010

Oh that's good, because I think I know what is causing the first bug,
though I haven't tested. 

@Stefan - it's probably the INVALID_CTRL macro in special-chars. It
treats anything under 0x20 as invalid, which I think you put in because
you can't have \n or \r in CWB tokens. 

#define INVALID_CTRL(c) (c < 0x20 && c != 0x09)

BUT when that same function is used to validate query strings it will
fail - they have a \n on the end because the user just pressed ENTER. I
*think* that's it, anyway. The parser code is a source of immense and
continuing mystery to me.

(Also, I believe I made cwb-encode run the validation check on entire
input lines, not on individual tokens - so that may also now be a bug if
the line break is not trimmed off before that happens.)
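
To illustrate the suspicion above, here is a minimal, untested-against-CWB
sketch. Only the INVALID_CTRL macro is copied from the message; the helper
names has_no_invalid_ctrl() and chomp() are hypothetical and are not part
of the CWB sources. It shows a query buffer failing the control-character
check while it still carries its trailing \n, and passing once the line
break has been trimmed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Macro as quoted above: any byte below 0x20 other than TAB is rejected. */
#define INVALID_CTRL(c) (c < 0x20 && c != 0x09)

/* Hypothetical helper (not a CWB function): returns 1 if no byte of s
 * trips INVALID_CTRL, 0 otherwise. Bytes are read as unsigned char so
 * that values of 0x80 and above do not compare as negative numbers. */
static int has_no_invalid_ctrl(const char *s)
{
  const unsigned char *p;
  assert(s != NULL);
  for (p = (const unsigned char *)s; *p; p++) {
    if (INVALID_CTRL(*p))
      return 0;
  }
  return 1;
}

/* Hypothetical helper (not a CWB function): strips trailing \n and \r
 * in place and returns the new string length. */
static size_t chomp(char *s)
{
  size_t len;
  assert(s != NULL);
  len = strlen(s);
  while (len > 0 && (s[len - 1] == '\n' || s[len - 1] == '\r'))
    s[--len] = '\0';
  return len;
}
```

If this guess is right, calling something like chomp() on the query buffer
(or on each input line in cwb-encode) before validation would avoid the
false positive, while \n and \r inside a token would still be rejected.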

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
On Behalf Of Stefan Evert
Sent: 28 September 2010 09:06
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Encoding problem in CWB version 3.2.4

Hi Tomaz! (<-- ASCII, to avoid validation problems ;-)

> this is probably some silly error, but I can't figure out where I made
> it. I tried using cwb-3.2.4 and made the corpus ok. The corpus is in
> utf-8 and I used cwb-encode -c utf8
> But when I try using it:
> 
> [tomaz at mantra ~]$ cqp
> [no corpus]> JOS100K-EN;
> JOS100K-EN> "kaj";
> CQP Error:
>        Query includes a character or character sequence that is invalid
> in the encoding specified for this corpus
> JOS100K-EN>
> 
> which is strange, and the query uses only ASCII. Any hints?

This is quite mysterious to me, too.  Did you check that the declared
encoding in the registry file is "utf8"?

The only possible explanation I can imagine is that CQP thinks it's an
8-bit encoding and somehow ASCII control characters end up in the string
(perhaps some problem with your terminal?) -- at least the newest
versions of the CWB check aggressively for invalid control chars.

If you're really working with UTF-8, then GLib thinks your string isn't
valid -- which GLib version do you use and how did you install it?

Of course, Andrew probably has a much better idea about what the cause
of the problem could be ...

> Also, I noticed that if no corpus is selected, cqp dies:
> 
> [tomaz at mantra ~]$ cqp
> [no corpus]> "kdo";
> CQP Error:
>        No corpus activated
> Segmentation fault
> [tomaz at mantra ~]$

I can reproduce this bug on Mac OS X, fixed now in the SVN.

@Andrew: it was a simple mistake in prepare_Query(), where these lines

>   /* validate character encoding according to that corpus, now we know it's loaded */
>   if (!cl_string_validate_encoding(QueryBuffer, current_corpus->corpus->charset, 0)) {
>     cqpmessage(Error, "Query includes a character or character sequence that is invalid\n"
>         "in the encoding specified for this corpus");
>     generate_code = 0;
>   }
> 

should go into the "if (generate_code)" block, otherwise they'll try to
obtain the charset current_corpus->corpus->charset from a NULL pointer
if no corpus has been activated yet.
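
The control flow described above can be sketched roughly as follows. This
is only an illustration, not the actual prepare_Query() code: the Corpus
and CorpusList types are cut-down stand-ins, and validate_stub() and
prepare_query_sketch() are hypothetical names. The point is simply that
the charset is only read inside the "if (generate_code)" block, after the
missing-corpus case has already cleared the flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CQP structures; only the fields that
 * this sketch touches are declared. */
typedef struct { int charset; } Corpus;
typedef struct { Corpus *corpus; } CorpusList;

/* Hypothetical stub in place of the real validation call: rejects any
 * byte below 0x20 other than TAB and ignores the charset argument. */
static int validate_stub(const char *s, int charset)
{
  const unsigned char *p;
  (void)charset;
  for (p = (const unsigned char *)s; *p; p++) {
    if (*p < 0x20 && *p != 0x09)
      return 0;
  }
  return 1;
}

/* Returns the final value of generate_code: 1 if the query may be
 * compiled, 0 if it was rejected. */
static int prepare_query_sketch(const char *query,
                                const CorpusList *current_corpus)
{
  int generate_code = 1;
  assert(query != NULL);

  if (current_corpus == NULL) {
    /* corresponds to the "No corpus activated" error */
    generate_code = 0;
  }

  if (generate_code) {
    /* safe: current_corpus is known to be non-NULL in this block */
    if (!validate_stub(query, current_corpus->corpus->charset))
      generate_code = 0;
  }

  return generate_code;
}
```

With the validation placed outside that block, the first call a user makes
with no corpus selected would dereference the NULL current_corpus, which
matches the segmentation fault reported above.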

(Yes, I admit that the whole structure of that function, with
generate_code=0 etc., is totally weird. It wasn't my idea. :-)

Best wishes,
Stefan

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb