[CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Sun May 27 16:58:40 CEST 2012

>>> frequency list reads like gibberish to me

That’s not gibberish, it’s UTF-8 being treated as if it were Latin-1. For instance, “æƒ¯” is “惯”. I think this problem is very likely at the browser end. Check this by looking at how your browser is treating the pages. My guess is that it is set to “Western (ISO 8859-1)”. If you change the encoding to “UTF-8”, you should see the Chinese characters.
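To make the mismatch concrete, here is a minimal Python sketch that reproduces it. The function names garble and repair are made up for illustration, and it assumes the wrong decoding is Windows-1252 (which browsers typically apply when a page is labelled ISO 8859-1); it is not how CQPweb or the browser does anything internally.

```python
def garble(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(mojibake: str) -> str:
    """Undo garble(): recover the original UTF-8 bytes and decode them properly."""
    return mojibake.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    # Words taken from the frequency list quoted in this thread.
    for word in ["惯", "的", "了"]:
        bad = garble(word)
        print(word, "->", bad, "->", repair(bad))
```

Note that a few byte values are undefined in Windows-1252, so garble() can raise UnicodeDecodeError for some characters; the words used here are not affected.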

CQPweb does issue an HTTP header declaring the encoding of each page as UTF-8. However, I don’t know the details of how different browsers respond to that header; it’s possible your browser is set up to enforce some other encoding.

>>> I also want to know how sorting is done for languages other than English. For Chinese, there are usually two types of sorting: PINYIN (bopomofo) and character strokes. Is it possible to do that kind of thing in CQPweb? If not present in CQPweb yet, is there an interface (even just envisaged) to do so?

The sort order used is the MySQL utf8_general_ci collation, which is far from satisfactory, but which is generally the best of a bad bunch for most purposes. I have plans for a replacement, but they are too big for this margin. I’m afraid I don’t know how utf8_general_ci handles Chinese, and a Google search does not turn up anything. I suspect it might be binary ordering.
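For what it is worth, here is a small Python sketch of what binary ordering would mean for Chinese words, if that suspicion is right. It does not query MySQL and says nothing definite about utf8_general_ci; binary_sort is a made-up helper that simply sorts by the UTF-8 bytes of each string.

```python
def binary_sort(items):
    """Sort strings by their raw UTF-8 bytes (an assumed 'binary' collation)."""
    return sorted(items, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    # Words taken from the frequency list quoted in this thread.
    words = ["的", "了", "习惯", "。"]
    print(binary_sort(words))
    # UTF-8 byte order preserves Unicode code point order, so this matches
    # Python's default string sort. It is neither pinyin nor stroke order.
    print(binary_sort(words) == sorted(words))
```

An ordering like this reflects where each character sits in Unicode, so it is stable and repeatable but has no linguistic meaning for Chinese.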

best

Andrew.

From: cwb-bounces at sslmit.unibo.it On Behalf Of Ray Wu
Sent: 27 May 2012 15:09
To: Open source development of the Corpus WorkBench
Subject: Re: RE: [CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Hi Andrew,
Thanks for the new commit. I recompiled v3.4.4 and cwb-scan-corpus no longer complains.
The results are mixed, however, using my corpus posted earlier. Good news first.

Success 1: Standard Query -> Start Query (OK)
Success 2: Restricted query (CQP syntax) (OK)

The bad news is that the frequency list reads like gibberish to me (definitely not Chinese).
Issue 1: Standard query -> Collocation -> Create collocation database -> Collocation controls.

NO.    word
1    ã€‚
2    ä¹ æƒ¯
...
Issue 2: Frequency lists -> Show frequency list
No.    Word    Frequency
1    çš„    3
2    äº†    2
...

Also, no word on the Frequency list page can be linked back to its concordance view.

After checking the freq_corpus_test_word table, I can see that the item column contains just gibberish as well. That might explain something.

Meanwhile, I also want to know how sorting is done for languages other than English. For Chinese, there are usually two types of sorting: PINYIN (bopomofo) and character strokes. Is it possible to do that kind of thing in CQPweb? If not present in CQPweb yet, is there an interface (even just envisaged) to do so?

BTW: I will check your update for CQPweb a moment later and will post my findings in that thread. Thanks.

Best,
Ray
