<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 12 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:MingLiU;
        panose-1:2 2 5 9 0 0 0 0 0 0;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:"\@MingLiU";
        panose-1:2 2 5 9 0 0 0 0 0 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
        {mso-style-priority:99;
        mso-style-link:"Balloon Text Char";
        margin:0cm;
        margin-bottom:.0001pt;
        font-size:8.0pt;
        font-family:"Tahoma","sans-serif";}
span.BalloonTextChar
        {mso-style-name:"Balloon Text Char";
        mso-style-priority:99;
        mso-style-link:"Balloon Text";
        font-family:"Tahoma","sans-serif";}
span.EmailStyle19
        {mso-style-type:personal;
        font-family:"Verdana","sans-serif";
        color:#1F497D;}
span.EmailStyle20
        {mso-style-type:personal-reply;
        font-family:"Verdana","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">&gt;&gt;&gt;</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:black;background:white"> frequency list reads like gibberish to me</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">That’s not gibberish; it’s UTF-8 being treated as if it were Latin-1. For instance, “</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:black;background:white">æƒ¯</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">”
 is “</span><span style="font-size:10.0pt;font-family:MingLiU;color:#1F497D">惯</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">”. I think this problem is very likely at the browser end. You can check by looking at which encoding your
 browser is applying to the pages. My guess is that it is set to “Western (ISO 8859-1)”. If you change the encoding to “UTF-8”, you should see the Chinese characters.
<o:p></o:p></span></p>
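<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">To illustrate the mechanism, here is a minimal Python sketch, not taken from CQPweb: it encodes a string as UTF-8 and then decodes the bytes with the wrong codec. I am assuming Windows-1252 as the wrong codec, since that is what browsers generally apply to pages labelled ISO 8859-1; the function names are made up for the example.<o:p></o:p></span></p>
<pre>
```python
# Illustrative only: the function names are hypothetical, not CQPweb code.

def garble(text, wrong_codec="cp1252"):
    """Return what UTF-8 text looks like when its bytes are read as wrong_codec."""
    return text.encode("utf-8").decode(wrong_codec)


def ungarble(garbled, wrong_codec="cp1252"):
    """Undo garble(): recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_codec).decode("utf-8")


# The character from the example above: UTF-8 bytes E6 83 AF.
garbled_form = garble("惯")          # "æƒ¯"
restored_form = ungarble("æƒ¯")      # "惯"
```
</pre>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">The round trip only works while every byte has a mapping in the wrong codec, so treat it as a diagnostic aid rather than a general repair tool.<o:p></o:p></span></p>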
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">CQPweb does issue an HTTP header declaring the encoding of each page as UTF-8. However, I don’t know the details of how different browsers respond to that header;
 it’s possible your browser is set up to enforce some other encoding.<o:p></o:p></span></p>
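<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">If you want to see what a server actually declares, the charset can be read out of the Content-Type header. The following Python sketch parses a header value with the standard library; the header strings are made-up examples, not captured from a CQPweb installation, and the function name is hypothetical.<o:p></o:p></span></p>
<pre>
```python
from email.message import Message


def declared_charset(content_type_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


# Example header values (assumed for illustration only).
with_charset = declared_charset("text/html; charset=UTF-8")   # "utf-8"
without_charset = declared_charset("text/html")               # None
```
</pre>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Against a live server, the headers object on the response returned by urllib.request.urlopen offers the same get_content_charset() method, so it should report what is declared for a given page.<o:p></o:p></span></p>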
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">&gt;&gt;&gt;
</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:black;background:white">I also want to know how sorting is done for languages other than English. For Chinese, there are usually two types of sorting: pinyin (bopomofo) and character
 strokes. Is it possible to do that kind of thing in CQPweb? If not present in CQPweb yet, is there an interface (even just envisaged) to do so?</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">The sort order used is the MySQL utf8_general_ci collation, which is far from satisfactory but is generally the best of a bad bunch for most purposes.
 I have plans for a replacement, but they are too big for this margin. I’m afraid I don’t know how utf8_general_ci works for Chinese, and a Google search does not turn up anything. I suspect it might be binary ordering, that is, ordering by code point.
<o:p></o:p></span></p>
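<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">To make the difference concrete, here is a small Python sketch comparing code-point order with pinyin order for three words from your lists. It says nothing about MySQL’s actual implementation, which I have not checked, and the pinyin table is a hand-written stand-in for the example only.<o:p></o:p></span></p>
<pre>
```python
# Hypothetical romanisation table covering only the example words;
# a real implementation would need a full dictionary or a dedicated library.
PINYIN = {"的": "de", "了": "le", "习惯": "xiguan"}

words = ["的", "了", "习惯"]

# Default string comparison in Python 3 orders by Unicode code point:
# U+4E60 (习) sorts before U+4E86 (了), which sorts before U+7684 (的).
by_code_point = sorted(words)                          # ["习惯", "了", "的"]

# Pinyin order: sort on the romanised key instead of the characters.
by_pinyin = sorted(words, key=lambda w: PINYIN[w])     # ["的", "了", "习惯"]
```
</pre>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Stroke-count ordering would work the same way, with a stroke-count lookup as the sort key in place of the romanisation.<o:p></o:p></span></p>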
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">best<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Andrew.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;">From:</span></b><span lang="EN-US" style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;">
<a href="mailto:cwb-bounces@sslmit.unibo.it">cwb-bounces@sslmit.unibo.it</a> [<a href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>]
<b>On Behalf Of </b>Ray Wu<br>
<b>Sent:</b> 27 May 2012 15:09<br>
<b>To:</b> Open source development of the Corpus WorkBench<br>
<b>Subject:</b> Re: RE: [CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:10.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:black">Hi Andrew,<br>
<span style="background:white">Thanks for the new commit. I recompiled v 3.4.4 and
</span></span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:black;background:white">cwb-scan-corpus complains no more.
</span><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:black"><br>
<span style="background:white">The result is, however, mixed, using the corpus I posted earlier. Good news first.<br>
<br>
Success 1: Standard Query-&gt;Start Query (OK)<br>
Success 2: Restricted query (CQP syntax) (OK)<br>
<br>
The bad news is that the frequency list reads like gibberish to me (definitely not Chinese).<br>
Issue 1: Standard query-&gt;Collocation-&gt;Create collocation database-&gt;Collocation controls.<br>
<br>
NO.&nbsp;&nbsp;&nbsp; word<br>
1&nbsp;&nbsp;&nbsp; ã€‚<br>
2&nbsp;&nbsp;&nbsp; ä¹ æƒ¯<br>
...<br>
Issue 2: Frequency lists-&gt;Show frequency list<br>
No.&nbsp;&nbsp;&nbsp; Word&nbsp;&nbsp;&nbsp; Frequency<br>
1&nbsp;&nbsp;&nbsp; çš„&nbsp;&nbsp;&nbsp; 3<br>
2&nbsp;&nbsp;&nbsp; äº†&nbsp;&nbsp;&nbsp; 2<br>
...<br>
<br>
Also, no word on the frequency list page links back to its concordance view.<br>
<br>
After checking the freq_corpus_test_word table, I can see that the item column contains just gibberish. That might explain something.<br>
<br>
Meanwhile, I also want to know how sorting is done for languages other than English. For Chinese, there are usually two types of sorting: pinyin (bopomofo) and character strokes. Is it possible to do that kind of thing in CQPweb? If not present in CQPweb yet,
 is there an interface (even just envisaged) to do so?<br>
<br>
BTW: I will check your update for CQPweb shortly and post my findings in that thread. Thanks.<br>
<br>
Best,<br>
Ray<br>
<br>
</span></span><span style="color:black"><o:p></o:p></span></p>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><o:p>&nbsp;</o:p></p>
</div>
</body>
</html>