[CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Ray Wu liangpingwu at 126.com
Sun May 27 19:42:17 CEST 2012


Hi Andrew,

>>>  That’s not gibberish, it’s UTF-8 being treated as if it was Latin-1. For
instance, “æƒ¯” is “惯”. I think this problem is very likely at the browser end.
Check this by looking at how your browser is treating the pages. My guess is that it
is set to “Western (ISO 8859-1)”. If you change the encoding to “UTF-8”, you
should see the Chinese characters. CQPweb does issue an HTTP header declaring
the encoding of each page as UTF-8. However, I don’t know the details of how
different browsers respond to that header; it’s possible your browser is set up to
enforce some other encoding.

I double-checked those pages and found that my browser (Firefox 10.0.2) sets them
to exactly UTF-8. But the problem persists.

What's puzzling me is this: if the culprit is the browser, why do the standard query/restricted
query pages yield good results (the browser's character set on the corresponding
pages is also UTF-8)? To my knowledge, the same browser is unlikely to treat pages differently if
their encodings are enforced to be identical (UTF-8 in this case).

What's more puzzling is the output from the MySQL command line, which shows that the Chinese
characters are stored there in good shape:
mysql> select * from freq_corpus_test_word;
+------+-----------+
| freq | item      |
+------+-----------+
...
|    2 | 。       |
|    3 | 的       |
|    1 | 网友    |
|    1 | 爱好者 |
|    1 | 表示    |
...
25 rows in set (0.00 sec)

So what's the real story?

 >>>The sort order used is the MySQL utf8_general_ci collation – which is far from
satisfactory, but which is generally the best of a bad bunch for most purposes. I have
plans for a replacement, but they are too big for this margin. I don’t know how
utf8_general_ci works for Chinese I’m afraid, and a google does not turn up
anything. I suspect it might be binary ordering.

I googled some Chinese pages regarding MySQL's sorting mechanism and found
some info which might be helpful in our situation (although I haven't tried any of it myself).

The pages ranked 1st and 3rd suggest changing the columns that store Chinese to gbk (compiling MySQL
with the directive --with-charset=gbk or --with-charset=gb2312) to make the ordering PINYIN-aware.

SELECT * FROM table ORDER BY CONVERT( chinese_field USING gbk )

http://www.chinaunix.net/jh/17/15706.html
http://topic.csdn.net/u/20080730/11/32a3a5a3-40a9-4240-b2f6-64c6d230d302.html
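
To illustrate why converting to gbk changes the order, here is a small Python sketch (not MySQL, and not taken from the pages above). It relies on the fact that the common characters in GB2312 level 1 are laid out in pinyin order, so comparing GBK bytes approximates a pinyin sort for them; rarer characters are ordered differently, so this is only an approximation. The name gbk_sort_key is made up for the example, and the words come from the frequency table earlier in this message.

```python
# Sketch: compare Unicode code point order with GBK byte order.
# Assumption: all first characters are in GB2312 level 1, which is
# arranged by pinyin; rarer characters would not sort by pinyin this way.

def gbk_sort_key(word):
    """Return the GBK-encoded bytes of a word, used as the sort key."""
    return word.encode("gbk")

words = ["的", "网友", "爱好者", "表示"]

# Plain Python string comparison goes by Unicode code point.
print(sorted(words))                    # ['爱好者', '的', '网友', '表示']

# Sorting by GBK bytes gives pinyin order: ai, biao, de, wang.
print(sorted(words, key=gbk_sort_key))  # ['爱好者', '表示', '的', '网友']
```

The ORDER BY CONVERT(... USING gbk) query above appears to rely on the same property, with the comparison done by MySQL's gbk collation rather than on raw bytes.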

The page ranked 2nd refers to another page at
http://blog.chinaunix.net/space.php?uid=259788&do=blog&id=2139261  (a page encoded in gbk)

Basically, it recommends setting up a separate PINYIN column in MySQL,
filled automatically by extracting the PINYIN of each character with a function as illustrated on
that page.

Best,
Ray
