[CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon May 28 13:57:12 CEST 2012


>>> What's puzzling me is that, if the culprit is the browser, why do the standard query/restricted
query pages yield good results (the browser's character set on the corresponding
pages is also UTF-8)? To my knowledge, the same browser is unlikely to treat pages differently if
their encodings are forced to be identical (UTF-8 in this case).

This shows the problem is in your database (textual data in query results is from CQP, textual data in frequency lists or in collocation lists is from MySQL). The fact that it shows up on the command line correctly may not mean anything. You could try adjusting the $utf8_set_required variable in your config file, and see if that helps.
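
To illustrate why a correct-looking command line may not mean anything, here is a minimal sketch in plain Python (no MySQL involved) of one common way this can happen. It assumes the table was filled over a connection whose character set was latin1 while the column itself is utf8; whether that is what happened in your installation is only a guess, and the variable names are made up for illustration.

```python
# Simulation only: models MySQL's connection-charset conversion with Python codecs.
# Assumption: the table was filled over a latin1 connection into a utf8 column.
word = "的"
sent_bytes = word.encode("utf-8")  # the bytes the client actually sent

# The server believes these bytes are latin1 text and converts them to utf8
# for storage, so every original byte becomes its own (wrong) character.
stored_bytes = sent_bytes.decode("latin-1").encode("utf-8")

# Command-line client, also on a latin1 connection: the server converts back,
# and a UTF-8 terminal then renders the original bytes correctly.
cli_bytes = stored_bytes.decode("utf-8").encode("latin-1")
cli_view = cli_bytes.decode("utf-8")

# Web client on a utf8 connection: the stored bytes arrive unconverted.
web_view = stored_bytes.decode("utf-8")

print(repr(cli_view))  # looks fine
print(repr(web_view))  # three junk characters instead of one Chinese character
```

In this scenario the two wrong conversions cancel out on the command line, so the data looks healthy there even though what is stored is not the intended text.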

>>> SELECT * FROM table ORDER BY CONVERT( chinese_field USING gbk )

Yeah, this feature will not be appearing in CQPweb :)

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ray Wu
Sent: 27 May 2012 18:42
To: Open source development of the Corpus WorkBench
Subject: Re: RE: [CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Hi Andrew,

>>> That’s not gibberish, it’s UTF-8 being treated as if it was Latin-1. For
instance, “æƒ¯” is “惯”. I think this problem is very likely at the browser end.
Check this by looking at how your browser is treating the pages. My guess is that it
is set to “Western (ISO 8859-1)”. If you change the encoding to “UTF-8”, you
should see the Chinese characters. CQPweb does issue an HTTP header declaring
the encoding of each page as UTF-8. However, I don’t know the details of how
different browsers respond to that header; it’s possible your browser is set up to
enforce some other encoding.
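
The misreading described in the quoted paragraph can be reproduced in a few lines of Python. This is only a sketch: it uses Python's cp1252 codec as a stand-in for what browsers do when a page is labelled ISO 8859-1, which is an assumption about the browser rather than anything CQPweb-specific.

```python
original = "惯"

# UTF-8 stores this character as three bytes.
utf8_bytes = original.encode("utf-8")

# Reading those three bytes with a one-byte-per-character encoding
# produces three unrelated Western characters.
misread = utf8_bytes.decode("cp1252")

# As long as no byte was lost, the damage can be reversed.
repaired = misread.encode("cp1252").decode("utf-8")

print(utf8_bytes)
print(misread)
print(repaired)
```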

I double-checked those pages and found that my browser (Firefox 10.0.2) sets them
to exactly UTF-8. But the problem persists.

What's puzzling me is that, if the culprit is the browser, why do the standard query/restricted
query pages yield good results (the browser's character set on the corresponding
pages is also UTF-8)? To my knowledge, the same browser is unlikely to treat pages differently if
their encodings are forced to be identical (UTF-8 in this case).

What's more puzzling is the output from the MySQL command line, which shows that the Chinese
characters are stored there in good shape:
mysql> select * from freq_corpus_test_word;
+------+-----------+
| freq | item      |
+------+-----------+
...
|    2 | 。       |
|    3 | 的       |
|    1 | 网友    |
|    1 | 爱好者 |
|    1 | 表示    |
...
25 rows in set (0.00 sec)

So what's the real story?

>>> The sort order used is the MySQL utf8_general_ci collation – which is far from
satisfactory, but which is generally the best of a bad bunch for most purposes. I have
plans for a replacement, but they are too big for this margin. I don’t know how
utf8_general_ci works for Chinese I’m afraid, and a google does not turn up
anything. I suspect it might be binary ordering.

I googled some Chinese pages about MySQL's sorting mechanism and found
some information that might be helpful in our situation (although I haven't tried these approaches myself).

The pages ranked 1st and 3rd suggest changing the columns that store Chinese to gbk (compiling MySQL
with the directive --with-charset=gbk or --with-charset=gb2312) to make sorting Pinyin-aware:

SELECT * FROM table ORDER BY CONVERT( chinese_field USING gbk )

http://www.chinaunix.net/jh/17/15706.html
http://topic.csdn.net/u/20080730/11/32a3a5a3-40a9-4240-b2f6-64c6d230d302.html
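
For what it's worth, the reason the CONVERT ... USING gbk trick gives a roughly Pinyin order is that the common (level 1) characters in GB2312 are laid out in Pinyin order, so comparing GBK bytes compares Pinyin for those characters. Here is a minimal Python sketch of the same idea, using words from the frequency table above. It does not touch MySQL, and rarer characters outside that block would not sort by Pinyin this way.

```python
words = ["的", "网友", "爱好者", "表示"]

# Ordering by Unicode code point, which is roughly what a binary
# ordering of UTF-8 text would give.
by_codepoint = sorted(words)

# Ordering by the GBK byte sequence, analogous to
# ORDER BY CONVERT(chinese_field USING gbk).
by_gbk = sorted(words, key=lambda w: w.encode("gbk"))

print(by_codepoint)
print(by_gbk)
```

With the GBK key the words come out as ai, biao, de, wang (爱好者, 表示, 的, 网友), whereas the code-point order puts 表示 last.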

The page ranked 2nd, meanwhile, refers to another page at
http://blog.chinaunix.net/space.php?uid=259788&do=blog&id=2139261  (a page encoded in gbk).

Basically, it recommends setting up a separate Pinyin column in MySQL by
extracting the Pinyin of each character automatically, using a function as illustrated on
that page.
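
A rough sketch of what the extra-column approach amounts to is below. It is written in Python with SQLite standing in for MySQL so that it runs on its own. The table name, the column names, and the hand-written PINYIN lookup are all hypothetical; the function on the page above would derive the Pinyin automatically rather than from a fixed dictionary.

```python
import sqlite3

# Hypothetical hand-written lookup, for illustration only.
PINYIN = {"的": "de", "网友": "wangyou", "爱好者": "aihaozhe", "表示": "biaoshi"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE freq (freq INTEGER, item TEXT, item_pinyin TEXT)")

rows = [(3, "的"), (1, "网友"), (1, "爱好者"), (1, "表示")]
conn.executemany(
    "INSERT INTO freq VALUES (?, ?, ?)",
    [(freq, item, PINYIN[item]) for freq, item in rows],
)

# Sort on the plain-ASCII Pinyin column instead of the Chinese column.
ordered = [r[0] for r in conn.execute("SELECT item FROM freq ORDER BY item_pinyin")]
print(ordered)
conn.close()
```

The sort then depends only on the ASCII column, so it behaves the same whatever collation is applied to the Chinese text.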

Best,
Ray

