<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=koi8-r">
<META content="MSHTML 6.00.6000.17095" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2>As I noted in a previous mail, the "???" in the error message
indicates that the console is not passing well-formed UTF-8 characters to CQP.
Changing the cmd.exe code page to UTF-8 before running CQP (chcp 65001) may
help. </FONT></SPAN></DIV>
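<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2>A minimal Python sketch of why the console code page matters. This is an illustration under assumptions, not a trace of what cmd.exe or CQP actually do: the helper names are hypothetical, and cp437 and KOI8-R simply stand in for "some non-UTF-8 code page". It shows two ways Cyrillic input can fail to arrive as well-formed UTF-8: characters the code page cannot represent are replaced by "?", and bytes in another Cyrillic encoding do not decode as UTF-8.</FONT></SPAN></DIV>
<PRE>
```python
# Illustrative sketch only; the helper names are hypothetical and not part of CWB.

def to_console_bytes(text, code_page):
    """Encode text as a console using `code_page` might, replacing
    characters the code page cannot represent with '?'."""
    return text.encode(code_page, errors="replace")


def is_well_formed_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


query = "што"  # a short Cyrillic query

# cp437 has no Cyrillic letters, so each one is replaced by '?'.
print(to_console_bytes(query, "cp437"))

# KOI8-R bytes are a different encoding and are not valid UTF-8.
print(is_well_formed_utf8(to_console_bytes(query, "koi8_r")))

# With a UTF-8 code page (65001 on Windows) the bytes are well-formed UTF-8.
print(is_well_formed_utf8(to_console_bytes(query, "utf-8")))
```
</PRE>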
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2>The -c option for cwb-encode is documented in cwb-encode
-h, but not yet in man cwb-encode. The corpus encoding tutorial document is
still targeted at v3.0, which does not have Unicode (or Windows) support.
</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN><SPAN class=328345622-09042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2>best</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2>Andrew.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=328345622-09042011><FONT face=Verdana
color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left>
<HR tabIndex=-1>
</DIV>
<DIV dir=ltr align=left><FONT face=Tahoma size=2><B>From:</B> George Goce
Mitrevski [mailto:podmocani@yahoo.com] <BR><B>Sent:</B> 09 April 2011
16:39<BR><B>To:</B> Hardie, Andrew; Open source development of the Corpus
WorkBench<BR><B>Subject:</B> Re: [CWB] Encoding error in
Windows<BR></FONT><BR></DIV>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000080 2px solid; MARGIN-RIGHT: 0px">
<DIV></DIV>
<DIV
style="FONT-SIZE: 12pt; COLOR: #000; FONT-FAMILY: times new roman, new york, times, serif; BACKGROUND-COLOR: #fff">
<DIV
style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new york', times, serif">Andrew,</DIV>
<DIV
style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new york', times, serif">Thanks
for the suggestion. It may be a good idea to include this info in the
instruction page for cwb-encode. The corpus was encoded just fine. However,
I'm still having a hell of a problem getting cqp to accept Cyrillic character
encoding, even in utf8. Has anyone been successful in encoding and searching a
Cyrillic corpus in Windows? I didn't encounter any such problems on Unix.
Below is my encoding script and the search error:</DIV>
<DIV>
<DIV>cwb-encode -d "C:\CWB\ANGELINA\data" -f "C:\CWB\ANGELINA\angelina.txt" -c
utf8 -R "C:\CWB\registry\angelina" -xsB -S s:0 -S text:0+id+title+author+genre
-S subject:0 -S publisher:0 -S dateOrigonal:0 -S dateDigital:0 -S identifier:0
-S citation:0 -S source:0 -S relation:0 -S hasPart:0 -S isPartOf:0</DIV>
<DIV><BR></DIV>
<DIV>C:\Windows\system32>cqp</DIV>
<DIV>[no corpus]> ANGELINA;</DIV>
<DIV>ANGELINA> "што";</DIV>
<DIV>CL: Regex Compile Error: unrecognized character after (? or (?-</DIV>
<DIV>CQP Error:</DIV>
<DIV> Illegal regular expression: ???</DIV>
<DIV><BR></DIV>
<DIV
style="FONT-SIZE: 16px; COLOR: rgb(0,0,0); FONT-STYLE: normal; FONT-FAMILY: 'times new roman', 'new york', times, serif; BACKGROUND-COLOR: transparent">Regards,</DIV>
<DIV>George.</DIV></DIV>
<DIV
style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new york', times, serif"><BR></DIV>
<DIV
style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new york', times, serif">
<DIV
style="FONT-SIZE: 12pt; FONT-FAMILY: 'times new roman', 'new york', times, serif"><FONT
face=Arial size=2>
<HR SIZE=1>
<B><SPAN style="FONT-WEIGHT: bold">From:</SPAN></B> "Hardie, Andrew"
&lt;a.hardie@lancaster.ac.uk&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">To:</SPAN></B> Open source development of the Corpus
WorkBench &lt;cwb@sslmit.unibo.it&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">Cc:</SPAN></B> George Goce Mitrevski
&lt;podmocani@yahoo.com&gt;<BR><B><SPAN
style="FONT-WEIGHT: bold">Sent:</SPAN></B> Friday, April 8, 2011 4:15
PM<BR><B><SPAN style="FONT-WEIGHT: bold">Subject:</SPAN></B> Re: [CWB]
Encoding error in Windows<BR></FONT><BR>
<DIV id=yiv306971045>
<DIV dir=ltr align=left><SPAN class=yiv306971045390151121-08042011><FONT
face=Verdana color=#000080 size=2>It means the encoding hasn't been set to
utf8, possibly because you haven't specified it using <B>-c
utf8</B> (cwb-encode defaults to Latin-1 unless told specifically which
encoding to use). </FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN
class=yiv306971045390151121-08042011></SPAN><SPAN
class=yiv306971045390151121-08042011><FONT face=Verdana color=#000080
size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv306971045390151121-08042011><FONT
face=Verdana color=#000080 size=2>On the other hand, if you
<B><EM>have</EM></B> specified that it is utf-8, then it may be a bug. If
this is the case, could you specify precisely what command line you've been
using? Thanks.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv306971045390151121-08042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv306971045390151121-08042011><FONT
face=Verdana color=#000080 size=2>best</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=yiv306971045390151121-08042011><FONT
face=Verdana color=#000080 size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=yiv306971045390151121-08042011><FONT
face=Verdana color=#000080 size=2>Andrew.</FONT></SPAN></DIV><BR>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000080 2px solid; MARGIN-RIGHT: 0px">
<DIV class=yiv306971045OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> cwb-bounces@sslmit.unibo.it
[mailto:cwb-bounces@sslmit.unibo.it] <B>On Behalf Of </B>George Goce
Mitrevski<BR><B>Sent:</B> 08 April 2011 22:09<BR><B>To:</B> Open source
development of the Corpus WorkBench<BR><B>Subject:</B> [CWB] Encoding error
in Windows<BR></FONT><BR></DIV>
<DIV
style="FONT-SIZE: 12pt; COLOR: rgb(0,0,0); FONT-FAMILY: 'times new roman', 'new york', times, serif; BACKGROUND-COLOR: rgb(255,255,255)">
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif">Can someone please
explain what's causing this encoding error when I try to encode a corpus in
Windows in utf8?</DIV>
<DIV style="FONT-SIZE: 12pt; FONT-FAMILY: times, serif"><BR></DIV>
<DIV style="FONT-FAMILY: times, serif">
<DIV style="FONT-FAMILY: times, serif">
<DIV id=yiv306971045>
<DIV class=yiv306971045Section1 dir=rtl>
<DIV class=yiv306971045MsoNormal dir=ltr
style="DIRECTION: ltr; unicode-bidi: embed; TEXT-ALIGN: left"><FONT
class=yiv306971045Apple-style-span face=Arial><FONT
class=yiv306971045Apple-style-span size=2>"Encoding error: an invalid byte
or byte sequence for charset "latin1" was
encountered."</FONT><BR></FONT></DIV>
<DIV class=yiv306971045MsoNormal dir=ltr
style="DIRECTION: ltr; unicode-bidi: embed; TEXT-ALIGN: left"><FONT
class=yiv306971045Apple-style-span face=Arial><FONT
class=yiv306971045Apple-style-span size=2><BR></FONT></FONT></DIV>
<DIV class=yiv306971045MsoNormal dir=ltr
style="DIRECTION: ltr; unicode-bidi: embed; TEXT-ALIGN: left"><FONT
class=yiv306971045Apple-style-span face=Arial><FONT
class=yiv306971045Apple-style-span size=2>Thanks
much.</FONT></FONT></DIV></DIV></DIV></DIV></DIV></DIV></BLOCKQUOTE></DIV>
<BR><BR></DIV></DIV></DIV></BLOCKQUOTE></BODY></HTML>