<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=koi8-r">
<META content="MSHTML 6.00.6000.17095" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=593533101-10042011><FONT face=Verdana
color=#000080 size=2>Does the same thing happen if you try</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=593533101-10042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=593533101-10042011>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,0); FONT-STYLE: normal; FONT-FAMILY: 'times new roman', 'new york', times, serif; BACKGROUND-COLOR: transparent">C:\Windows\system32>cqp</DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,0); FONT-STYLE: normal; FONT-FAMILY: 'times new roman', 'new york', times, serif; BACKGROUND-COLOR: transparent">[no
corpus]> set DataDirectory ".";</DIV>
<DIV>[no corpus]> ANGELINA;</DIV>
<DIV>ANGELINA> <SPAN class=593533101-10042011>Query</SPAN><SPAN
class=593533101-10042011> = </SPAN>"што";</DIV>
<DIV>ANGELINA> <SPAN class=593533101-10042011>cat Query >
"Query.txt";</SPAN></DIV></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=593533101-10042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=593533101-10042011><FONT face=Verdana
color=#000080 size=2>?</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=593533101-10042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=593533101-10042011><FONT face=Verdana
color=#000080 size=2>Andrew.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=593533101-10042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV><BR>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000080 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> George Goce Mitrevski
[mailto:podmocani@yahoo.com] <BR><B>Sent:</B> 10 April 2011
02:28<BR><B>To:</B> Hardie, Andrew; Open source development of the Corpus
WorkBench<BR><B>Subject:</B> Re: [CWB] Encoding error in
Windows<BR></FONT><BR></DIV>
<DIV></DIV>
<DIV
style="FONT-SIZE: 12pt; COLOR: #000; FONT-FAMILY: times new roman, new york, times, serif; BACKGROUND-COLOR: #fff">
<DIV><SPAN>
<DIV>Andrew,</DIV>
<DIV>Changing the code page to UTF-8 doesn't seem to work either; cqp simply
quits. <SPAN class=Apple-style-span
style="FONT-SIZE: 16px; COLOR: rgb(0,0,102); FONT-FAMILY: monospace; WHITE-SPACE: pre">Trying
to write a UTF-8 string terminates </SPAN><SPAN class=Apple-style-span
style="COLOR: rgb(0,0,102); FONT-FAMILY: monospace; WHITE-SPACE: pre">cqp.</SPAN></DIV>
<DIV><SPAN class=Apple-style-span
style="COLOR: rgb(0,0,102); FONT-FAMILY: monospace; WHITE-SPACE: pre"><BR></SPAN></DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,0); FONT-STYLE: normal; FONT-FAMILY: 'times new roman', 'new york', times, serif; BACKGROUND-COLOR: transparent">C:\Windows\system32>chcp
65001</DIV>
<DIV>Active code page: 65001</DIV>
<DIV><BR></DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,0); FONT-STYLE: normal; FONT-FAMILY: 'times new roman', 'new york', times, serif; BACKGROUND-COLOR: transparent">C:\Windows\system32>cqp</DIV>
<DIV>[no corpus]> ANGELINA;</DIV>
<DIV>ANGELINA> "што";</DIV>
<DIV><BR></DIV>
<DIV>C:\Windows\system32></DIV></SPAN></DIV>
<DIV
style="FONT-SIZE: 13px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: monospace; BACKGROUND-COLOR: transparent"><BR></DIV>
<DIV
style="FONT-SIZE: 13px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: monospace; BACKGROUND-COLOR: transparent">I
searched the internet for chcp 65001, and it seems others have had
problems with the console under that code page. Here is what one person had to
say: "<SPAN class=Apple-style-span
style="FONT-SIZE: 16px; LINE-HEIGHT: 18px; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif">CHCP
65001 indeed works quite well, but whenever I tried to run a batch file (pure
ASCII!) from such a console, it never worked. There is no output, no error
message, and the commands are not executed." Do we know whether the Windows console
in fact supports 65001?</SPAN></DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; BACKGROUND-COLOR: transparent"><SPAN
class=Apple-style-span
style="FONT-SIZE: 16px; LINE-HEIGHT: 18px; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif"><BR></SPAN></DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; BACKGROUND-COLOR: transparent"><SPAN
class=Apple-style-span
style="FONT-SIZE: 16px; LINE-HEIGHT: 18px; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif">My
original corpus was in CP1251 and others are convincing me to re-encode it in
utf8, but now I'm wondering whether it's worth the effort if I can't work with
it in Windows. When I change the Windows console codepage to 1251 (chcp 1251)
I have no problems searching the corpus.</SPAN></DIV>
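<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; BACKGROUND-COLOR: transparent">A
minimal Python sketch of the re-encoding step discussed above, assuming the
corpus source is a plain CP1251 text file. The function name
reencode_cp1251_to_utf8 and the file names are hypothetical and not part of
CWB; the converted file would still have to be passed to cwb-encode with -c
utf8.</DIV>
<PRE>
```python
import os
import tempfile


def reencode_cp1251_to_utf8(src_path, dst_path):
    """Read src_path as CP1251 and write the same text to dst_path as UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="cp1251", newline="") as src:
        with open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)


# Small demonstration on a temporary file (hypothetical file names).
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "sample_cp1251.txt")
dst_file = os.path.join(workdir, "sample_utf8.txt")

# Write one Cyrillic word (U+0448 U+0442 U+043E) as CP1251 bytes.
with open(src_file, "wb") as handle:
    handle.write("\u0448\u0442\u043e\r\n".encode("cp1251"))

reencode_cp1251_to_utf8(src_file, dst_file)

with open(dst_file, "rb") as handle:
    converted = handle.read()
print(converted)
```
</PRE>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; BACKGROUND-COLOR: transparent">Reading
line by line keeps memory use flat for a large corpus file; a strict CP1251
decode will raise an error if the input is not what it is assumed to
be.</DIV>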
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; BACKGROUND-COLOR: transparent"><SPAN
class=Apple-style-span
style="FONT-SIZE: 16px; LINE-HEIGHT: 18px; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif"><BR></SPAN></DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,102); FONT-STYLE: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; BACKGROUND-COLOR: transparent"><SPAN
class=Apple-style-span
style="FONT-SIZE: 16px; LINE-HEIGHT: 18px; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif">Regards,</SPAN></DIV>
<DIV
style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new york', times, serif"><SPAN
class=Apple-style-span
style="FONT-SIZE: 16px; LINE-HEIGHT: 18px; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif">George.</SPAN></DIV>
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new
york', times, serif">
<DIV
style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new york', times, serif"><FONT
face=Arial size=2>
<HR SIZE=1>
<B><SPAN style="FONT-WEIGHT: bold">From:</SPAN></B> "Hardie, Andrew"
&lt;a.hardie@lancaster.ac.uk&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">To:</SPAN></B> George Goce Mitrevski
&lt;podmocani@yahoo.com&gt;; Open source development of the Corpus WorkBench
&lt;cwb@sslmit.unibo.it&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">Sent:</SPAN></B> Saturday, April 9, 2011 6:10
PM<BR><B><SPAN style="FONT-WEIGHT: bold">Subject:</SPAN></B> Re: [CWB]
Encoding error in Windows<BR></FONT><BR>
<DIV id=yiv1814182613>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2>As I noted in a previous mail, the ????
message indicates that the console is not passing well-formed UTF-8 characters
to CQP. Changing the cmd.exe code page to UTF-8 before running CQP may help
(chcp 65001). </FONT></SPAN></DIV>
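<DIV dir=ltr align=left><FONT face=Verdana color=#000080 size=2>A minimal
Python sketch of what "not well-formed UTF-8" means here. It assumes, purely
for illustration, that the console sends the query in a legacy code page such
as CP1251; the thread does not say which code page was active, and the sample
word and the helper is_well_formed_utf8 are hypothetical.</FONT></DIV>
<PRE>
```python
# Sample Cyrillic query (U+0448 U+0442 U+043E), written with escapes so the
# source stays pure ASCII.
query = "\u0448\u0442\u043e"

# Bytes a UTF-8 encoded corpus expects to receive.
utf8_bytes = query.encode("utf-8")

# Bytes a console would send under the assumed legacy code page CP1251.
legacy_bytes = query.encode("cp1251")


def is_well_formed_utf8(data):
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(), is_well_formed_utf8(utf8_bytes))
print(legacy_bytes.hex(), is_well_formed_utf8(legacy_bytes))
```
</PRE>
<DIV dir=ltr align=left><FONT face=Verdana color=#000080 size=2>Under these
assumptions the legacy bytes fail a strict UTF-8 decode, which is consistent
with a UTF-8 corpus seeing only unprintable input.</FONT></DIV>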
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2>The -c option for cwb-encode is documented
in cwb-encode -h, but not yet in man cwb-encode. The corpus encoding tutorial
document is still targeted at v3.0 which does not have Unicode (or Windows)
support. </FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN><SPAN
class=yiv1814182613328345622-09042011><FONT face=Verdana color=#000080
size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2>best</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2>Andrew.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613328345622-09042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left>
<HR tabIndex=-1>
</DIV>
<DIV dir=ltr align=left><FONT face=Tahoma size=2><B>From:</B> George Goce
Mitrevski [mailto:podmocani@yahoo.com] <BR><B>Sent:</B> 09 April 2011
16:39<BR><B>To:</B> Hardie, Andrew; Open source development of the Corpus
WorkBench<BR><B>Subject:</B> Re: [CWB] Encoding error in
Windows<BR></FONT><BR></DIV>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000080 2px solid; MARGIN-RIGHT: 0px">
<DIV
style="FONT-SIZE: 12pt; COLOR: rgb(0,0,0); FONT-FAMILY: 'times new roman', 'new york', times, serif; BACKGROUND-COLOR: rgb(255,255,255)">
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif">Andrew,</DIV>
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif">Thanks for the
suggestion. It may be a good idea to include this info in the instruction
page for cwb-encode. The corpus was encoded just fine. However, I'm still
having a hell of a problem getting cqp to accept Cyrillic characters,
even in utf8. Has anyone been successful in encoding and searching a
Cyrillic corpus in Windows? I didn't encounter any such problems on Unix.
Below is my encoding script and the search error:</DIV>
<DIV>
<DIV>cwb-encode -d "C:\CWB\ANGELINA\data" -f "C:\CWB\ANGELINA\angelina.txt"
-c utf8 -R "C:\CWB\registry\angelina" -xsB -S s:0 -S
text:0+id+title+author+genre -S subject:0 -S publisher:0 -S dateOrigonal:0
-S dateDigital:0 -S identifier:0 -S citation:0 -S source:0 -S relation:0 -S
hasPart:0 -S isPartOf:0</DIV>
<DIV><BR></DIV>
<DIV>C:\Windows\system32>cqp</DIV>
<DIV>[no corpus]> ANGELINA;</DIV>
<DIV>ANGELINA> "што";</DIV>
<DIV>CL: Regex Compile Error: unrecognized character after (? or (?-</DIV>
<DIV>CQP Error:</DIV>
<DIV> Illegal regular expression: ???</DIV>
<DIV><BR></DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,0); FONT-STYLE: normal; FONT-FAMILY: times, serif; BACKGROUND-COLOR: transparent">Regards,</DIV>
<DIV>George.</DIV></DIV>
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif"><BR></DIV>
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif">
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif"><FONT face=Arial
size=2>
<HR SIZE=1>
<B><SPAN style="FONT-WEIGHT: bold">From:</SPAN></B> "Hardie, Andrew"
&lt;a.hardie@lancaster.ac.uk&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">To:</SPAN></B> Open source development of the
Corpus WorkBench &lt;cwb@sslmit.unibo.it&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">Cc:</SPAN></B> George Goce Mitrevski
&lt;podmocani@yahoo.com&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">Sent:</SPAN></B> Friday, April 8, 2011 4:15
PM<BR><B><SPAN style="FONT-WEIGHT: bold">Subject:</SPAN></B> Re: [CWB]
Encoding error in Windows<BR></FONT><BR>
<DIV id=yiv1814182613>
<DIV dir=ltr align=left><SPAN class=yiv1814182613-08042011><FONT
face=Verdana color=#000080 size=2>It means the encoding hasn't been set to
utf8. This is possibly because you haven't specified the encoding using
<B>-c utf8 </B>(cwb-encode defaults to Latin-1 if not told specifically what
encoding to use). </FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613-08042011></SPAN><SPAN
class=yiv1814182613-08042011><FONT face=Verdana color=#000080
size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613-08042011><FONT
face=Verdana color=#000080 size=2>On the other hand, if you
<B><I>have</I></B> specified that it is utf-8, then it may be a bug. If
this is the case, could you specify precisely what command line you've been
using? Thanks.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613-08042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613-08042011><FONT
face=Verdana color=#000080 size=2>best</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613-08042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv1814182613-08042011><FONT
face=Verdana color=#000080 size=2>Andrew.</FONT></SPAN></DIV><BR>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000080 2px solid; MARGIN-RIGHT: 0px">
<DIV class=yiv1814182613OutlookMessageHeader lang=en-us dir=ltr
align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> cwb-bounces@sslmit.unibo.it
[mailto:cwb-bounces@sslmit.unibo.it] <B>On Behalf Of </B>George Goce
Mitrevski<BR><B>Sent:</B> 08 April 2011 22:09<BR><B>To:</B> Open source
development of the Corpus WorkBench<BR><B>Subject:</B> [CWB] Encoding
error in Windows<BR></FONT><BR></DIV>
<DIV
style="FONT-SIZE: 12pt; COLOR: rgb(0,0,0); FONT-FAMILY: times, serif; BACKGROUND-COLOR: rgb(255,255,255)">
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif">Can someone please
explain what's causing this encoding error when I try to encode a corpus in
Windows in utf8?</DIV>
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif"><BR></DIV>
<DIV style="FONT-FAMILY: times, serif">
<DIV style="FONT-FAMILY: times, serif">
<DIV id=yiv1814182613>
<DIV class=yiv1814182613Section1 dir=rtl>
<DIV class=yiv1814182613MsoNormal dir=ltr
style="DIRECTION: ltr; unicode-bidi: embed; TEXT-ALIGN: left"><FONT
class=yiv1814182613Apple-style-span face=Arial><FONT
class=yiv1814182613Apple-style-span size=2>"Encoding error: an invalid
byte or byte sequence for charset "latin1" was
encountered."</FONT><BR></FONT></DIV>
<DIV class=yiv1814182613MsoNormal dir=ltr
style="DIRECTION: ltr; unicode-bidi: embed; TEXT-ALIGN: left"><FONT
class=yiv1814182613Apple-style-span face=Arial><FONT
class=yiv1814182613Apple-style-span size=2><BR></FONT></FONT></DIV>
<DIV class=yiv1814182613MsoNormal dir=ltr
style="DIRECTION: ltr; unicode-bidi: embed; TEXT-ALIGN: left"><FONT
class=yiv1814182613Apple-style-span face=Arial><FONT
class=yiv1814182613Apple-style-span size=2>Thanks
much.</FONT></FONT></DIV></DIV></DIV></DIV></DIV></DIV></BLOCKQUOTE></DIV><BR><BR></DIV></DIV></DIV></BLOCKQUOTE></DIV>
<BR><BR></DIV></DIV></DIV></BLOCKQUOTE></BODY></HTML>