[CWB] Encoding problem in CWB version 3.2.4

Stefan Evert stefanML at collocations.de
Tue Sep 28 10:06:29 CEST 2010


Hi Tomaz! (<-- ASCII, to avoid validation problems ;-)

> this is probably some silly error, but I can't figure out where I made it. I tried using cwb-3.2.4 and made the corpus ok. The corpus is in utf-8 and I used cwb-encode -c utf8
> But when I try using it:
> 
> [tomaz at mantra ~]$ cqp
> [no corpus]> JOS100K-EN;
> JOS100K-EN> "kaj";
> CQP Error:
>        Query includes a character or character sequence that is invalid
> in the encoding specified for this corpus
> JOS100K-EN>
> 
> which is strange, and the query uses only ASCII. Any hints?

This is quite mysterious to me, too.  Did you check that the declared encoding in the registry file is "utf8"?

The only possible explanation I can imagine is that CQP thinks it's an 8-bit encoding and somehow ASCII control characters end up in the string (perhaps some problem with your terminal?) -- at least the newest versions of the CWB check aggressively for invalid control chars.
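
To see whether a stray control character sneaks into the query string, a check along the following lines can help. This is only a rough sketch: find_control_char() is a made-up helper, not a CWB function, and the set of characters CWB actually rejects may well differ.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the CWB API.
 * Returns the index of the first ASCII control character in s
 * (0x01-0x1F or 0x7F), or -1 if the string contains none.
 * Assumption: every control character counts as invalid; the
 * real CWB check may be more lenient or stricter than this. */
int find_control_char(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c < 0x20 || c == 0x7F)
            return (int)i;
    }
    return -1;
}
```

For a clean query such as "kaj"; this returns -1; if the terminal inserted e.g. an escape character, it returns the position of that character.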

If you're really working with UTF-8, then GLib thinks your string isn't valid -- which GLib version do you use and how did you install it?

Of course, Andrew probably has a much better idea about what the cause of the problem could be ...

> Also, I noticed that if no corpus is selected, cqp dies:
> 
> [tomaz at mantra ~]$ cqp
> [no corpus]> "kdo";
> CQP Error:
>        No corpus activated
> Segmentation fault
> [tomaz at mantra ~]$

I can reproduce this bug on Mac OS X; it is fixed now in the SVN.

@Andrew: it was a simple mistake in prepare_Query(), where these lines

>   /* validate character encoding according to that corpus, now we know it's loaded */
>   if (!cl_string_validate_encoding(QueryBuffer, current_corpus->corpus->charset, 0)) {
>     cqpmessage(Error, "Query includes a character or character sequence that is invalid\n"
>         "in the encoding specified for this corpus");
>     generate_code = 0;
>   }
> 

should go into the "if (generate_code)" block, otherwise they'll try to obtain the charset current_corpus->corpus->charset from a NULL pointer if no corpus has been activated yet.
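
Roughly, the corrected control flow looks like the sketch below. The types and the validation stub are simplified stand-ins I made up for illustration (the real structures and the rest of prepare_Query() look different); the only point is that the charset is read inside the guarded block.

```c
#include <stddef.h>

/* Simplified stand-ins for the CQP structures (assumption: the real
 * definitions contain many more fields and a different charset type). */
typedef struct { int charset; } Corpus;
typedef struct { Corpus *corpus; } CorpusList;

/* Hypothetical stub standing in for cl_string_validate_encoding();
 * it accepts every string, since only the control flow matters here. */
static int validate_encoding_stub(const char *s, int charset)
{
    (void)s;
    (void)charset;
    return 1;
}

/* Returns the final generate_code flag: 1 if the query may be
 * compiled, 0 if an error was detected. */
int prepare_query_sketch(const CorpusList *current_corpus, const char *query)
{
    int generate_code = 1;

    if (current_corpus == NULL) {
        /* corresponds to the "No corpus activated" error */
        generate_code = 0;
    }

    if (generate_code) {
        /* safe: current_corpus is known to be non-NULL in this block */
        if (!validate_encoding_stub(query, current_corpus->corpus->charset))
            generate_code = 0;
    }

    return generate_code;
}
```

With no corpus activated (a NULL pointer), the function returns 0 without ever touching current_corpus->corpus, so the segmentation fault goes away.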

(Yes, I admit that the whole structure of that function, with generate_code=0 etc., is totally weird. It wasn't my idea. :-)

Best wishes,
Stefan


