[CWB] Encoding problem in CWB version 3.2.4

Peter Ljunglöf peter.ljunglof at gu.se
Tue Sep 28 10:27:05 CEST 2010


Hi,

I had the same problem, but I can't reproduce it now. I got the "invalid character" error when I called CQP through a web service (with a very tiny wrapper in Python), but not when I called CQP directly from within Python, and only on Latin-1 encoded corpora. Some kind of terminal encoding problem, I guess.
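
One way to test the encoding guess would be to log the raw bytes the wrapper is about to send to CQP and check them before they go out. The snippet below is only a minimal sketch: `find_query_problems` is a made-up helper name, and it does not reproduce the checks CQP itself performs. It reports bytes that cannot be decoded in the corpus encoding, and any control characters that slipped into the query.

```python
import unicodedata


def find_query_problems(raw, encoding):
    """Return a list of human-readable problems found in raw query bytes.

    Hypothetical diagnostic helper; this is NOT how CQP validates queries.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError as err:
        # The bytes are not valid in the encoding declared for the corpus.
        return ["byte %d is not valid %s" % (err.start, encoding)]

    problems = []
    for pos, ch in enumerate(text):
        # Unicode category "Cc" covers the ASCII and C1 control characters.
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n":
            problems.append(
                "control character U+%04X at position %d" % (ord(ch), pos)
            )
    return problems


if __name__ == "__main__":
    # A plain ASCII query, a Latin-1 byte checked as UTF-8,
    # and a query with a stray escape sequence from a terminal.
    print(find_query_problems(b'"kaj";', "utf-8"))
    print(find_query_problems(b'"Ljungl\xf6f";', "utf-8"))
    print(find_query_problems(b'"kaj"\x1b[A;', "latin-1"))
```

If the web service path reports problems that the direct Python path does not, the difference is probably introduced somewhere between the HTTP request and the CQP process.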

But now I can't reproduce it, since I can't create Latin-1 corpora anymore... I'll write another mail about that.

/Peter

28 sep 2010 kl. 10.06 skrev Stefan Evert:

> Hi Tomaz! (<-- ASCII, to avoid validation problems ;-)
> 
>> this is probably some silly error, but I can't figure out where I made it. I tried using cwb-3.2.4 and made the corpus ok. The corpus is in utf-8 and I used cwb-encode -c utf8
>> But when I try using it:
>> 
>> [tomaz at mantra ~]$ cqp
>> [no corpus]> JOS100K-EN;
>> JOS100K-EN> "kaj";
>> CQP Error:
>>       Query includes a character or character sequence that is invalid
>> in the encoding specified for this corpus
>> JOS100K-EN>
>> 
>> which is strange, and the query uses only ASCII. Any hints?
> 
> This is quite mysterious to me, too.  Did you check that the declared encoding in the registry file is "utf8"?
> 
> The only possible explanation I can imagine is that CQP thinks it's an 8-bit encoding and somehow ASCII control characters end up in the string (perhaps some problem with your terminal?) -- at least the newest versions of the CWB check aggressively for invalid control chars.
> 
> If you're really working with UTF-8, then GLib thinks your string isn't valid -- which GLib version do you use and how did you install it?
> 
> Of course, Andrew probably has a much better idea about what the cause of the problem could be ...
> 
>> Also, I noticed that if no corpus is selected, cqp dies:
>> 
>> [tomaz at mantra ~]$ cqp
>> [no corpus]> "kdo";
>> CQP Error:
>>       No corpus activated
>> Segmentation fault
>> [tomaz at mantra ~]$
> 
> I can reproduce this bug on Mac OS X, fixed now in the SVN.

Great!


________________________________________________________________________________
peter ljunglöf, språkbanken, göteborgs universitet



