<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 12 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:Wingdings;
        panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
        {font-family:"MS Gothic";
        panose-1:2 11 6 9 7 2 5 8 2 4;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:"\@MS Gothic";
        panose-1:2 11 6 9 7 2 5 8 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0cm;
        margin-right:0cm;
        margin-bottom:0cm;
        margin-left:36.0pt;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
span.EmailStyle17
        {mso-style-type:personal-reply;
        font-family:"Verdana","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:1890606752;
        mso-list-type:hybrid;
        mso-list-template-ids:-452702824 134807553 134807555 134807557 134807553 134807555 134807557 134807553 134807555 134807557;}
@list l0:level1
        {mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Symbol;}
@list l0:level2
        {mso-level-number-format:bullet;
        mso-level-text:o;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:"Courier New";}
@list l0:level3
        {mso-level-number-format:bullet;
        mso-level-text:;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Wingdings;}
ol
        {margin-bottom:0cm;}
ul
        {margin-bottom:0cm;}
--></style>
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Hmmm. Ray’s initial problem is solved, but there does seem to be an underlying problem: the charset of the English corpus defaults to Latin1 when unspecified,
so the aligned Chinese data is treated as Latin1 on output. (Though I don’t understand why the Chinese data, once output, wasn’t rendered as UTF8 by the terminal...)<o:p></o:p></span></p>
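<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">A possible answer to the parenthetical question above, offered as a guess and not as confirmed CQP behaviour: if the output routine, believing the data to be Latin1, escapes bytes in the range 0x80 to 0x9F (control codes in Latin1) as &lt;hex&gt; and passes all other bytes through, then the UTF8 sequences are broken up before the terminal ever sees them. The Python sketch below simulates that assumed escaping on Ray's sentence. The function name escape_c1_bytes is hypothetical and is not part of CWB.</span></p>

```python
def escape_c1_bytes(data: bytes) -> bytes:
    """Escape bytes 0x80-0x9F as <hex>; an ASSUMED model of a Latin1-minded printer."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += f"<{b:02x}>".encode("ascii")
        else:
            out.append(b)
    return bytes(out)


# Ray's Chinese sentence, stored as UTF-8 in the "cn" corpus.
utf8_bytes = "我 是 一个 兵 。".encode("utf-8")

# What a UTF-8 terminal would receive after the assumed escaping.
escaped = escape_c1_bytes(utf8_bytes)

# Broken sequences become U+FFFD; only intact sequences survive.
shown = escaped.decode("utf-8", errors="replace")
print(shown)
```

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Under this assumption the only character that survives intact is 个, because all three of its UTF8 bytes lie above 0x9F. That is consistent with the garbled line in Ray's output, but it remains a hypothesis until checked against the CQP source.</span></p>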
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">There is clearly an issue here, but I’m not quite sure what the answer is.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-size:10.0pt;font-family:Symbol;color:#1F497D"><span style="mso-list:Ignore">·<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Clearly, it would be advantageous to allow alignment to be declared between two corpora that are in different charsets.<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-size:10.0pt;font-family:Symbol;color:#1F497D"><span style="mso-list:Ignore">·<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">However, that creates problems for display, since...<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-size:10.0pt;font-family:Symbol;color:#1F497D"><span style="mso-list:Ignore">·<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">...it’s equally clearly undesirable for CQP to be outputting two charsets in the same chunk of output.<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-size:10.0pt;font-family:Symbol;color:#1F497D"><span style="mso-list:Ignore">·<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Possible solutions:<o:p></o:p></span></p>
<p class="MsoListParagraph" style="margin-left:72.0pt;text-indent:-18.0pt;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-size:10.0pt;font-family:"Courier New";color:#1F497D"><span style="mso-list:Ignore">o<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">If the aligned chunk comes from a corpus with a different charset from the main corpus, recode it<o:p></o:p></span></p>
<p class="MsoListParagraph" style="margin-left:108.0pt;text-indent:-18.0pt;mso-list:l0 level3 lfo1">
<![if !supportLists]><span style="font-size:10.0pt;font-family:Wingdings;color:#1F497D"><span style="mso-list:Ignore">§<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">(using iconv or something)<o:p></o:p></span></p>
<p class="MsoListParagraph" style="margin-left:72.0pt;text-indent:-18.0pt;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-size:10.0pt;font-family:"Courier New";color:#1F497D"><span style="mso-list:Ignore">o<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">If the aligned chunk comes from a corpus with a different charset from the main corpus, print an error message<o:p></o:p></span></p>
<p class="MsoListParagraph" style="margin-left:72.0pt;text-indent:-18.0pt;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-size:10.0pt;font-family:"Courier New";color:#1F497D"><span style="mso-list:Ignore">o<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">If the aligned chunk comes from a corpus with a different charset from the main corpus, print its position in the corpus rather than its text<o:p></o:p></span></p>
<p class="MsoListParagraph" style="margin-left:72.0pt;text-indent:-18.0pt;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-size:10.0pt;font-family:"Courier New";color:#1F497D"><span style="mso-list:Ignore">o<span style="font:7.0pt "Times New Roman"">
</span></span></span><![endif]><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Disallow alignment between corpora with different charsets.<o:p></o:p></span></p>
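<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">To make the first two options concrete, here is a rough Python sketch (CQP itself is written in C, where iconv would play this role). It recodes an aligned chunk into the main corpus's charset and falls back to an error message when the target charset cannot represent the text. The function name recode_aligned_chunk and the message wording are illustrative assumptions, not CWB code.</span></p>

```python
def recode_aligned_chunk(chunk: bytes, src_charset: str, dst_charset: str) -> bytes:
    """Recode an aligned chunk from its own charset into the main corpus's charset."""
    if src_charset == dst_charset:
        return chunk  # nothing to do
    text = chunk.decode(src_charset)
    try:
        # Option 1: recode the aligned text.
        return text.encode(dst_charset)
    except UnicodeEncodeError:
        # Option 2: the target charset cannot represent the text, so report it.
        message = f"[aligned text not representable in {dst_charset}]"
        return message.encode(dst_charset)


# Latin1 main corpus, UTF-8 aligned corpus: Chinese cannot be recoded.
print(recode_aligned_chunk("我 是 一个 兵 。".encode("utf-8"), "utf-8", "latin-1"))

# UTF-8 main corpus, Latin1 aligned corpus: recoding is lossless.
print(recode_aligned_chunk("café".encode("latin-1"), "latin-1", "utf-8"))
```

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Note that in Ray's case (Latin1 main corpus, UTF8 aligned corpus) recoding alone would not help, since Latin1 has no Chinese characters. Recoding only works when the main corpus's charset can represent everything in the aligned text.</span></p>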
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Thoughts, everyone?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Andrew.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span lang="EN-US" style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> cwb-bounces@sslmit.unibo.it [mailto:cwb-bounces@sslmit.unibo.it]
<b>On Behalf Of </b>Ray Wu<br>
<b>Sent:</b> 10 July 2012 03:54<br>
<b>To:</b> Open source development of the Corpus WorkBench<br>
<b>Subject:</b> Re: [CWB] Does cwb-align-encode support utf8?<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">Ah, I switched BOTH "en" and "cn" to utf8 and it works now. The moral seems to be: always use utf8 when dealing with CJK scripts.<o:p></o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">--<br>
Best,<br>
Ray<o:p></o:p></span></p>
</div>
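<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">Ray's fix above (giving both corpora the same charset) can be checked mechanically before aligning. The Python sketch below assumes the charset is declared in the registry file as a corpus property line of the form ##:: charset = "utf8", and that a missing property means latin1, as described earlier in the thread. The helper names declared_charset and same_charset, and the sample registry texts, are hypothetical.</span></p>

```python
import re

# Assumed form of the corpus property line in a registry file.
CHARSET_RE = re.compile(r'^##::\s*charset\s*=\s*"([^"]*)"', re.MULTILINE)


def declared_charset(registry_text: str, default: str = "latin1") -> str:
    """Return the charset property of a registry entry, or the default if absent."""
    match = CHARSET_RE.search(registry_text)
    return match.group(1) if match else default


def same_charset(registry_a: str, registry_b: str) -> bool:
    """True if both registry entries declare (or default to) the same charset."""
    return declared_charset(registry_a) == declared_charset(registry_b)


# Illustrative registry texts, not Ray's actual files.
cn_registry = 'NAME "cn"\nID cn\n##:: charset = "utf8"\nATTRIBUTE word\n'
en_registry = 'NAME "en"\nID en\nATTRIBUTE word\n'  # no charset property

print(declared_charset(cn_registry))
print(declared_charset(en_registry))
print(same_charset(cn_registry, en_registry))
```

<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">With these sample texts the two corpora disagree (utf8 against the latin1 default), which is the situation that produced the garbled output quoted below.</span></p>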
<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
At 2012-07-10 09:29:56,"Ray Wu" <liangpingwu@126.com> wrote:<br>
<br>
<o:p></o:p></span></p>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">Hi all,<br>
Does anyone know whether cwb-align-encode supports utf8 now? I'm running into a problem when trying to align English with Chinese (encoded in utf8). It seems that I can query Chinese-English pairs but not the reverse: the English-Chinese pairs do not give proper Chinese.<br>
<br>
Ok, here is my scenario for your reference.<br>
<br>
I have two toy corpora named "cn" and "en" respectively.<br>
======================================<br>
cn:<br>
<a_cn_en id="cn_en_1"><br>
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">我</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">是</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">一个</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">兵</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">。</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
</a_cn_en><br>
------------------------------------------------<br>
en:<br>
<a_cn_en id="cn_en_1"><br>
I<br>
am<br>
a<br>
soldier<br>
.<br>
</a_cn_en><br>
<br>
<br>
registry files:<br>
-----------------------------------------------<br>
cn:<br>
NAME "cn"<br>
ID cn<br>
HOME /home/ray/bilingual/cn<br>
ATTRIBUTE word<br>
STRUCTURE a_cn_en<br>
----------------------------------------------<br>
en:<br>
NAME "en"<br>
ID en<br>
HOME /home/ray/bilingual/en<br>
ATTRIBUTE word<br>
STRUCTURE a_cn_en<br>
=====================================<br>
<br>
I ran the following, step by step:<br>
ray@ray-desktop:~$ export CORPUS_REGISTRY=/home/ray/bilingual/registry<br>
ray@ray-desktop:~$ cwb-encode -c utf8 -d /home/ray/bilingual/cn -f /home/ray/bilingual/data/cn -R /home/ray/bilingual/registry/cn -S a_cn_en<br>
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/cn, line #1, warning issued only once).<br>
<br>
ray@ray-desktop:~$ cwb-encode -d /home/ray/bilingual/en -f /home/ray/bilingual/data/en -R /home/ray/bilingual/registry/en -S a_cn_en<br>
Annotations of s-attribute <a_cn_en> not stored (file /home/ray/bilingual/data/en, line #1, warning issued only once).<br>
<br>
ray@ray-desktop:~$ cwb-make -V EN<br>
ray@ray-desktop:~$ cwb-make -V CN<br>
ray@ray-desktop:~$ cwb-align -v -o out.align CN EN a_cn_en <br>
ray@ray-desktop:~$ cwb-align-show out.align <br>
Displaying alignment for [CN, EN] from file out.align<br>
Enter 'h' for help.<br>
>> p<br>
1:1-alignment [0, 4] x [0, 4] (12)============================================<br>
<br>
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">我</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">是</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">一个</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">兵</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">。</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> I am a soldier .
<br>
>> <br>
<br>
ray@ray-desktop:~$ cwb-align -v -o out2.align EN CN a_cn_en <br>
ray@ray-desktop:~$ cwb-align-show out2.align <br>
Displaying alignment for [EN, CN] from file out2.align<br>
Enter 'h' for help.<br>
>> <br>
1:1-alignment [0, 4] x [0, 4] (12)============================================<br>
<br>
I am a soldier . </span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">我</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">是</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">一个</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">兵</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">。</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
<br>
>> <br>
<br>
<br>
I added the following line to /home/ray/bilingual/registry/en:<br>
ALIGNED cn<br>
Similarly, in /home/ray/bilingual/registry/cn I added:<br>
ALIGNED en<br>
<br>
ray@ray-desktop:~$ cwb-align-encode -D out.align <br>
ray@ray-desktop:~$ cwb-align-encode -D out2.align<br>
<br>
ray@ray-desktop:~$ cqp<br>
[no corpus]> CN;<br>
CN> show +en;<br>
CN> "</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">我</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">";<br>
0: <</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">我</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">>
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">是</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">一个</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">兵</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
</span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">。</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
-->en: I am a soldier .<br>
CN> <br>
<br>
ray@ray-desktop:~$ cqp<br>
[no corpus]> EN;<br>
EN> show +cn;<br>
EN> "I";<br>
0: <I> am a soldier .<br>
-->cn: <88><91> <98></span><span style="font-size:10.5pt;font-family:"Tahoma","sans-serif";color:black">�</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> <80></span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">个</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">
<85></span><span style="font-size:10.5pt;font-family:"Tahoma","sans-serif";color:black">�</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> <80><82><br>
EN> <br>
<br>
As you can see, the English-Chinese alignment doesn't yield proper Chinese.<br>
<br>
My question is: is this a cwb-align-encode problem or a cqp problem? Thanks for any tips.<o:p></o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><br>
Best,<br>
Ray<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"><o:p> </o:p></span></p>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><o:p> </o:p></p>
</div>
</body>
</html>