[CWB] Encoding problem in CWB version 3.2.4
Peter Ljunglöf
peter.ljunglof at gu.se
Tue Sep 28 10:27:05 CEST 2010
Hi,
I had the same problem, but I can't reproduce it now. I got the "invalid character" error when I called CQP through a web service (with a very tiny wrapper in Python), but not when I called CQP directly from within Python. And it only happened on Latin-1 encoded corpora. Some kind of terminal encoding problem, I guess.
I can't reproduce it now because I can't create Latin-1 corpora anymore... I'll write another mail about that.
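For anyone who wants to rule out the wrapper or the terminal, here is a minimal sketch of the kind of check I would run on the raw bytes before they reach CQP. The helper check_query is hypothetical and not part of CWB. I don't know exactly which characters CQP rejects, so treating every control character except tab as suspicious is only an assumption.

```python
import unicodedata


def check_query(raw: bytes, charset: str) -> list[str]:
    """Hypothetical helper (not part of CWB): list reasons why a query
    might be rejected for a corpus with the given declared charset."""
    try:
        text = raw.decode(charset)
    except UnicodeDecodeError as exc:
        # The bytes are not even well-formed in the declared encoding.
        return [f"not valid {charset}: {exc.reason} at byte {exc.start}"]

    problems = []
    for pos, ch in enumerate(text):
        # Category "Cc" covers the C0 and C1 control characters.
        # Assumption: tab is harmless, everything else is suspicious.
        if unicodedata.category(ch) == "Cc" and ch != "\t":
            problems.append(
                f"control character U+{ord(ch):04X} at position {pos}"
            )
    return problems


if __name__ == "__main__":
    # Plain ASCII query, as in Tomaz's report.
    print(check_query(b'"kaj";', "utf8"))
    # UTF-8 bytes sent to a corpus declared as Latin-1.
    print(check_query('"\u00d6";'.encode("utf8"), "latin1"))
```

The second call is the case I suspect in my setup: UTF-8 bytes read as Latin-1 never fail to decode, but some of them turn into C1 control characters, which a strict validity check would flag.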
/Peter
On 28 Sep 2010, at 10:06, Stefan Evert wrote:
> Hi Tomaz! (<-- ASCII, to avoid validation problems ;-)
>
>> this is probably some silly error, but I can't figure out where I made it. I tried using cwb-3.2.4 and the corpus built OK. The corpus is in UTF-8 and I used cwb-encode -c utf8.
>> But when I try using it:
>>
>> [tomaz@mantra ~]$ cqp
>> [no corpus]> JOS100K-EN;
>> JOS100K-EN> "kaj";
>> CQP Error:
>> Query includes a character or character sequence that is invalid
>> in the encoding specified for this corpus
>> JOS100K-EN>
>>
>> which is strange, and the query uses only ASCII. Any hints?
>
> This is quite mysterious to me, too. Did you check that the declared encoding in the registry file is "utf8"?
>
> The only possible explanation I can imagine is that CQP thinks it's an 8-bit encoding and somehow ASCII control characters end up in the string (perhaps some problem with your terminal?) -- at least the newest versions of the CWB check aggressively for invalid control chars.
>
> If you're really working with UTF-8, then GLib thinks your string isn't valid -- which GLib version do you use and how did you install it?
>
> Of course, Andrew probably has a much better idea about what the cause of the problem could be ...
>
>> Also, I noticed that if no corpus is selected, cqp dies:
>>
>> [tomaz@mantra ~]$ cqp
>> [no corpus]> "kdo";
>> CQP Error:
>> No corpus activated
>> Segmentation fault
>> [tomaz@mantra ~]$
>
> I can reproduce this bug on Mac OS X, fixed now in the SVN.
Great!
________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet