[CWB] Encoding error in Windows

Sun Apr 10 03:27:35 CEST 2011

Andrew,
Changing the code page to UTF-8 doesn't seem to work either. As soon as I enter a UTF-8 string, cqp simply quits:

C:\Windows\system32>chcp 65001
Active code page: 65001

C:\Windows\system32>cqp
[no corpus]> ANGELINA;
ANGELINA> "што";

C:\Windows\system32>

I searched online for chcp 65001, and it seems others have had problems with the console in that code page. Here is what one person had to say: "CHCP 65001 indeed works quite well, but whenever I tried to run a batch file (pure ASCII!) from such a console, it never worked. There no output, no error error message, and the commands are not executed." Do we know whether the Windows console in fact supports 65001?

My original corpus was in CP1251, and others are urging me to re-encode it as UTF-8, but now I'm wondering whether that is worth the effort if I can't work with it in Windows. When I change the Windows console code page to 1251 (chcp 1251), I have no problems searching the corpus.
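
For reference, the re-encoding step itself is small. The following is a minimal Python sketch, not part of CWB; the function name reencode_file and the file names in the usage note are hypothetical. It works on raw bytes so that line endings are left untouched, and it fails loudly if the input contains a byte that CP1251 does not define.

```python
from pathlib import Path


def reencode_file(src, dst, src_encoding="cp1251", dst_encoding="utf-8"):
    """Convert a text file from src_encoding to dst_encoding.

    Hypothetical helper for illustration only. Returns the number of
    characters converted. Raises UnicodeDecodeError if the input holds
    bytes that are undefined in src_encoding.
    """
    raw = Path(src).read_bytes()          # read bytes, so no newline translation
    text = raw.decode(src_encoding)       # strict decoding: bad bytes raise
    Path(dst).write_bytes(text.encode(dst_encoding))
    return len(text)
```

A call such as reencode_file("angelina_cp1251.txt", "angelina_utf8.txt") would produce the UTF-8 copy to pass to cwb-encode with -c utf8; both file names are placeholders.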

Regards,
George.

________________________________
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
To: George Goce Mitrevski <podmocani at yahoo.com>; Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Sent: Saturday, April 9, 2011 6:10 PM
Subject: Re: [CWB] Encoding error in Windows

As I noted in a previous mail, the ???? message indicates that the console is not passing well-formed UTF-8 characters to CQP. Changing the cmd.exe code page to UTF-8 before running CQP may help (chcp 65001).
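
To illustrate what "not well-formed UTF-8" means for the query in this thread, here is a small Python sketch (it is not CWB code). It compares the bytes of the query in UTF-8 and in CP1251, and simulates a console whose code page has no Cyrillic letters. The helper names are hypothetical, and cp437 is an assumption chosen only because it lacks Cyrillic; the thread does not say which code page the console was actually using.

```python
QUERY = "што"


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def through_console(text: str, codepage: str) -> bytes:
    """Simulate a console that can only deliver characters in its code page.

    Characters missing from the code page are replaced by '?'.
    """
    return text.encode(codepage, errors="replace")


utf8_bytes = QUERY.encode("utf-8")      # two bytes per Cyrillic letter
cp1251_bytes = QUERY.encode("cp1251")   # one byte per Cyrillic letter

if __name__ == "__main__":
    print(utf8_bytes.hex(" "), is_well_formed_utf8(utf8_bytes))
    print(cp1251_bytes.hex(" "), is_well_formed_utf8(cp1251_bytes))
    print(through_console(QUERY, "cp437"))
```

Under that assumption, the simulated console turns the three Cyrillic letters into three question marks, which would be consistent with the "Illegal regular expression: ???" message, while the CP1251 bytes are simply not valid UTF-8 input.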

The -c option for cwb-encode is documented in cwb-encode -h, but not yet in man cwb-encode. The corpus encoding tutorial document is still targeted at v3.0, which does not have Unicode (or Windows) support.

best

Andrew.

________________________________

From: George Goce Mitrevski [mailto:podmocani at yahoo.com]
Sent: 09 April 2011 16:39
To: Hardie, Andrew; Open source development of the Corpus WorkBench
Subject: Re: [CWB] Encoding error in Windows

Andrew,
>Thanks for the suggestion. It may be a good idea to include this info in the instruction page for cwb-encode. The corpus was encoded just fine. However, I'm still having a hell of a problem getting cqp to accept Cyrillic characters, even in utf8. Has anyone been successful in encoding and searching a Cyrillic corpus in Windows? I didn't encounter any such problems on Unix. Below are my encoding command and the search error:
>cwb-encode -d "C:\CWB\ANGELINA\data" -f "C:\CWB\ANGELINA\angelina.txt" -c utf8 -R "C:\CWB\registry\angelina" -xsB -S s:0 -S text:0+id+title+author+genre -S subject:0 -S publisher:0 -S dateOrigonal:0 -S dateDigital:0 -S identifier:0 -S citation:0 -S source:0 -S relation:0 -S hasPart:0 -S isPartOf:0
>
>
>C:\Windows\system32>cqp
>[no corpus]> ANGELINA;
>ANGELINA> "што";
>CL: Regex Compile Error: unrecognized character after (? or (?-
>CQP Error:
>        Illegal regular expression: ???
>
>
>Regards,
>George.
>
>
>
>________________________________
>From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
>To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
>Cc: George Goce Mitrevski <podmocani at yahoo.com>
>Sent: Friday, April 8, 2011 4:15 PM
>Subject: Re: [CWB] Encoding error in Windows
>
>
>It means the encoding hasn't been set to utf8. This is possibly because you haven't specified the encoding using -c utf8 (cwb-encode defaults to Latin-1 unless told explicitly which encoding to use).
> 
>On the other hand, if you have specified that it is utf-8, then it may be a bug. If this is the case, could you specify precisely what command line you've been using? Thanks.
> 
>best
> 
>Andrew.
>
>
>>________________________________
>>From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of George Goce Mitrevski
>>Sent: 08 April 2011 22:09
>>To: Open source development of the Corpus WorkBench
>>Subject: [CWB] Encoding error in Windows
>>
>>
>>Can someone please explain what's causing this encoding error when I try to encode a corpus in Windows in utf8?
>>
>>
>>"Encoding error: an invalid byte or byte sequence for charset "latin1" was encountered."
>>
>>
>>
>>Thanks much.
>
>