<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 12 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"MS Gothic";
        panose-1:2 11 6 9 7 2 5 8 2 4;}
@font-face
        {font-family:MingLiU;
        panose-1:2 2 5 9 0 0 0 0 0 0;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:"\@MingLiU";
        panose-1:2 2 5 9 0 0 0 0 0 0;}
@font-face
        {font-family:"\@MS Gothic";
        panose-1:2 11 6 9 7 2 5 8 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-reply;
        font-family:"Verdana","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">>>></span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> What's puzzling me is that if the culprit is the browser, why do the standard
query/restricted<br>
query pages yield good results (the browser's character set on the corresponding<br>
pages is also UTF-8)? To my knowledge, the same browser is unlikely to treat pages differently if<br>
their encodings are forced to be identical (UTF-8 in this case).</span><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">This shows the problem is in your database (textual data in query results is from CQP, textual data in frequency lists or in collocation lists is from MySQL).
The fact that it shows up on the command line correctly may not mean anything. You could try adjusting the
</span><span style="font-family:"Courier New";color:black">$utf8_set_required </span>
<span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">variable in your config file, and see if that helps.<o:p></o:p></span></p>
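<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">To illustrate why a correct-looking command-line result may not mean anything, here is a minimal Python sketch of one possible scenario. It is an assumption for illustration, not a diagnosis of this installation: text that was double-encoded on its way into the database can still look right through a latin1 client connection, while a utf8 connection shows the mojibake. The word is taken from the frequency table quoted further down, and cp1252 stands in for what MySQL and browsers call Latin-1.<o:p></o:p></span></p>
<pre>
```python
# Illustrative only: one item from the frequency table quoted in this thread.
word = "的"

# Correct UTF-8 bytes for the word: E7 9A 84.
utf8_bytes = word.encode("utf-8")

# Case 1: the bytes are correct, but a reader decodes them as cp1252.
# This is the classic "UTF-8 treated as Latin-1" mojibake.
mojibake = utf8_bytes.decode("cp1252")

# Case 2 (assumed scenario): the connection was labelled latin1 when the data
# was loaded, so each "Latin-1 character" was re-encoded as UTF-8 for storage.
stored = mojibake.encode("utf-8")

# A latin1 command-line client reverses that conversion, so a UTF-8 terminal
# shows the right character even though the stored bytes are wrong.
seen_by_latin1_client = stored.decode("utf-8").encode("cp1252").decode("utf-8")

# A utf8 client (for example a web front end) sees the stored text as it is.
seen_by_utf8_client = stored.decode("utf-8")

print(utf8_bytes.hex(), stored.hex())
print(seen_by_latin1_client, seen_by_utf8_client)
```
</pre>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">If something like this is happening, comparing the stored bytes (for instance with MySQL's HEX() function) against the expected UTF-8 bytes would tell the two cases apart.<o:p></o:p></span></p>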
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">>>></span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> SELECT * FROM table ORDER BY CONVERT( chinese_field USING gbk )</span><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Yeah, this feature will not be appearing in CQPweb :)<o:p></o:p></span></p>
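<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">For readers wondering what the quoted query is trying to achieve, here is a rough Python sketch. It sorts by raw GBK bytes, which only approximates what MySQL would do with a gbk collation (the exact collation rules are not reproduced here), and it uses a few items from the frequency table in this thread as sample data.<o:p></o:p></span></p>
<pre>
```python
# Sample items taken from the frequency table quoted in this thread.
items = ["网友", "爱好者", "表示", "的"]

# Plain code point order, roughly what a binary ordering of UTF-8 text gives.
by_codepoint = sorted(items)

# Order by GBK bytes, loosely mimicking ORDER BY CONVERT(chinese_field USING gbk).
# Common (GB2312 level 1) characters are laid out by pinyin in GBK, so this
# approximates pinyin order; rarer characters are ordered differently.
by_gbk = sorted(items, key=lambda s: s.encode("gbk"))

print(by_codepoint)
print(by_gbk)
```
</pre>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">The second list follows the pinyin of the first characters (ai, biao, de, wang), while the first list does not.<o:p></o:p></span></p>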
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">best<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D">Andrew.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span lang="EN-US" style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> cwb-bounces@sslmit.unibo.it [mailto:cwb-bounces@sslmit.unibo.it]
<b>On Behalf Of </b>Ray Wu<br>
<b>Sent:</b> 27 May 2012 18:42<br>
<b>To:</b> Open source development of the Corpus WorkBench<br>
<b>Subject:</b> Re: RE: [CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">Hi Andrew,<br>
<br>
>>> That’s not gibberish, it’s UTF-8 being treated as if it was Latin-1. For<br>
instance, “æƒ¯” is “</span><span style="font-size:10.5pt;font-family:MingLiU;color:black">惯</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black">”. I think this problem is very likely at the browser end.<br>
Check this by looking at how your browser is treating the pages. My guess is that it<br>
is set to “Western (ISO 8859-1)”. If you change the encoding to “UTF-8”, you<br>
should see the Chinese characters. CQPweb does issue an HTTP header declaring<br>
the encoding of each page as UTF-8. However, I don’t know the details of how<br>
different browsers respond to that header; it’s possible your browser is set up to<br>
enforce some other encoding.<br>
<br>
I double-checked those pages and found that my browser (Firefox 10.0.2) sets them<br>
to exactly UTF-8, but the problem persists.<br>
<br>
What's puzzling me is that if the culprit is the browser, why do the standard query/restricted<br>
query pages yield good results (the browser's character set on the corresponding<br>
pages is also UTF-8)? To my knowledge, the same browser is unlikely to treat pages differently if<br>
their encodings are forced to be identical (UTF-8 in this case).<br>
<br>
Even more puzzling is the output of the MySQL command line, which shows that the Chinese<br>
characters are stored there in good shape:<br>
mysql> select * from freq_corpus_test_word;<br>
+------+-----------+<br>
| freq | item |<br>
+------+-----------+<br>
...<br>
| 2 | </span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">。</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> |<br>
| 3 | </span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">的</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> |<br>
| 1 | </span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">网友</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> |<br>
| 1 | </span><span style="font-size:10.5pt;font-family:MingLiU;color:black">爱好者</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> |<br>
| 1 | </span><span style="font-size:10.5pt;font-family:"MS Gothic";color:black">表示</span><span style="font-size:10.5pt;font-family:"Arial","sans-serif";color:black"> |<br>
...<br>
25 rows in set (0.00 sec)<br>
<br>
So what's the real story? <br>
<br>
>>>The sort order used is the MySQL utf8_general_ci collation – which is far from<br>
satisfactory, but which is generally the best of a bad bunch for most purposes. I have<br>
plans for a replacement, but they are too big for this margin. I don’t know how<br>
utf8_general_ci works for Chinese I’m afraid, and a google does not turn up<br>
anything. I suspect it might be binary ordering.<br>
<br>
I googled some Chinese pages regarding MySQL's sorting mechanism and found<br>
some information that might be helpful in our situation (although I haven't tried any of it myself).<br>
<br>
The pages ranked 1st and 3rd suggest changing the columns that store Chinese to gbk (compiling MySQL<br>
with the directive --with-charset=gbk or --with-charset=gb2312) to make sorting PINYIN-aware:<br>
<br>
SELECT * FROM table ORDER BY CONVERT( chinese_field USING gbk )<br>
<br>
<a href="http://www.chinaunix.net/jh/17/15706.html">http://www.chinaunix.net/jh/17/15706.html</a><br>
<a href="http://topic.csdn.net/u/20080730/11/32a3a5a3-40a9-4240-b2f6-64c6d230d302.html">http://topic.csdn.net/u/20080730/11/32a3a5a3-40a9-4240-b2f6-64c6d230d302.html</a><br>
<br>
The page ranked 2nd refers to another page at<br>
<a href="http://blog.chinaunix.net/space.php?uid=259788&do=blog&id=2139261">http://blog.chinaunix.net/space.php?uid=259788&do=blog&id=2139261</a> (a page encoded in gbk)<br>
<br>
Basically, it recommends setting up another PINYIN column in MySQL by<br>
extracting the PINYIN of each character automatically, using a function as illustrated on<br>
that page.<br>
<br>
Best,<br>
Ray<br>
<br>
<o:p></o:p></span></p>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><o:p> </o:p></p>
</div>
</body>
</html>