[CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Sun May 27 16:09:02 CEST 2012

Hi Andrew,
Thanks for the new commit. I recompiled v3.4.4 and cwb-scan-corpus no longer complains.
The results with the corpus I posted earlier are mixed, however. Good news first.

Success 1: Standard Query -> Start Query (OK)
Success 2: Restricted query (CQP syntax) (OK)

The bad news is that the frequency list looks like gibberish to me (definitely not Chinese).
Issue 1: Standard query -> Collocation -> Create collocation database -> Collocation controls.

No.    Word
1    ã€‚
2    ä¹ æƒ¯
...
Issue 2: Frequency lists -> Show frequency list
No.    Word    Frequency
1    çš„    3
2    äº†    2
...

Also, none of the words on the Frequency list page links back to its concordance view.

After checking the freq_corpus_test_word table, I can see that its item column contains the same gibberish. That might explain part of the problem.
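
The garbled strings above have the typical shape of UTF-8 text whose bytes were read as Windows-1252. The following is only a sketch of that effect; that such a mix-up happens somewhere between CWB, PHP and MySQL is an assumption, not something confirmed in this thread, and the function names are made up for illustration.

```python
# Illustration only: how UTF-8 text turns into "gibberish" when its bytes
# are misread as Windows-1252. Whether this is what actually happens
# inside CQPweb/MySQL is an assumption.

def to_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Reverse the mistake: recover the original bytes, decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


for word in ["的", "了", "。"]:
    garbled = to_mojibake(word)
    print(word, "->", garbled, "->", repair(garbled))
```

If the printed middle column matches the rows in the frequency list, the data itself is probably intact and only the character set used when writing to or reading from the table is wrong.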

Meanwhile, I would also like to know how sorting is done for languages other than English. For Chinese, there are usually two sort orders: by Pinyin (bopomofo) and by character strokes. Is that possible in CQPweb? If it is not available yet, is there an interface (even just a planned one) for doing so?
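
To make the question concrete, here is a minimal sketch of the two orderings next to plain code-point order. The CHAR_INFO table is a hand-written, hypothetical lookup covering only five characters from the test corpus (toneless Pinyin and stroke count); it is not part of CWB or CQPweb, and a real implementation would need complete data for every character.

```python
# Hypothetical lookup table for illustration only: (toneless Pinyin, strokes).
CHAR_INFO = {
    "的": ("de", 8),
    "了": ("le", 2),
    "们": ("men", 5),
    "也": ("ye", 3),
    "而": ("er", 6),
}

words = ["的", "了", "们", "也", "而"]

# Default string sorting compares Unicode code points, which corresponds
# to neither pronunciation nor stroke count.
by_codepoint = sorted(words)

# Sort by the Pinyin reading stored in the table.
by_pinyin = sorted(words, key=lambda w: CHAR_INFO[w][0])

# Sort by the stroke count stored in the table.
by_strokes = sorted(words, key=lambda w: CHAR_INFO[w][1])

print("code point:", by_codepoint)
print("pinyin:    ", by_pinyin)
print("strokes:   ", by_strokes)
```

The three results differ, which is why a configurable sort key (rather than a fixed byte or code-point order) would be needed.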

BTW: I will check your CQPweb update shortly and post my findings in that thread. Thanks.

Best,
Ray

At 2012-05-27 19:13:56,"Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:

Hi Ray,

The segfault had nothing to do with your data; it was in an internal structure used by cwb-scan-corpus for pattern matching. There is a fix in SVN now; recompile v3.4.4.

best

Andrew.

From:cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ray Wu
Sent: 26 May 2012 13:49
To: Open source development of the Corpus WorkBench
Subject: [CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Hi all,
I want to get CQPweb to process Chinese (my native language) on my Ubuntu 8.04 machine, so I updated to CWB 3.4.3. Compilation succeeded and I could query a tiny Chinese text via cqp from the terminal.

I could also load the Chinese text into CQPweb and complete part of the metadata page. But when I chose "Manage metadata->Create frequency tables", CQPweb reported that it had encountered an error and could not continue. Here is the error message:
cwb-scan-corpus error! Segmentation fault
... in file /usr/local/apache2/htdocs/cqp/lib/freqtable.inc.php line 100.

This seems strange to me: I have browsed the entire mailing list archive and learned that this error message is most likely to occur when a token is too long. But my toy corpus is only a few lines long. I also tried a small English text and the same thing happened.

To make the picture clearer, here is a list of what I have done.

My compiling context for CWB 3.4.3: CWB from svn: 3.4.3; PCRE: 7.4; glib-2.0; gcc: Ubuntu 4.2.4-1ubuntu3.

Compilation seemed normal and I could build a tiny Chinese corpus from the text given at the end of this post (hopefully it makes it across the net to your computer and remains intelligible).

I then ran the following to index it:

ray at ray-laptop:~$ cwb-encode -c utf8 -d /home/ray/cqputf8 -f cqpweb_chinese_test_utf8.txt -R /usr/local/share/cwb/registry/test -P pos -S text -S s -S text_id

Annotations of s-attribute <text> not stored (file cqpweb_chinese_test_utf8.txt, line #1, warning issued only once).

ray at ray-laptop:~$ cwb-makeall -V TEST (everything says OK)

ray at ray-laptop:~$ cwb-huffcode -A TEST (fine, nothing wrong)

ray at ray-laptop:~$ cwb-compress-rdx  -A TEST (fine again)

I queried the new corpus and nothing was broken:

ray at ray-laptop:~$ cqp -eC

[no corpus]> TEST

TEST> "了";

        7: 们的行为也引来 <了> 不少公园游客的

       29:  ，他们早已习惯 <了> 。

TEST> <s> []* "了" []* </s>;  (query is OK)

Finally, I resorted to running cwb-scan-corpus manually and did find something unusual:

ray at ray-laptop:~$ cwb-scan-corpus -C TEST pos (fully OK)

ray at ray-laptop:~$ cwb-scan-corpus TEST pos+0 pos+1 (segmentation fault)

ray at ray-laptop:~$ cwb-scan-corpus TEST pos+0 pos+1 pos+2

Scanning corpus TEST for 3-tuples ...

Scan complete.                          

Printing frequency table on stdout ...

...

段错误 ("segmentation fault" in English)
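
For reference, the table that "pos+0 pos+1" is meant to produce is a frequency count of adjacent pos-tag pairs. The sketch below computes such counts in plain Python so there is something to compare against once the tool works; it is not how cwb-scan-corpus is implemented, and TAGS was copied by hand from the pos column of the test file at the end of this post.

```python
from collections import Counter

# The pos column of the test corpus, in corpus order (31 tokens).
TAGS = ("r n k u n d v u m n n u v w "
        "c p n u v w r t v n v w r d v y w").split()


def count_ngrams(tags, n):
    """Count every run of n consecutive tags (sentence boundaries ignored)."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


bigrams = count_ngrams(TAGS, 2)   # analogous to: pos+0 pos+1
trigrams = count_ngrams(TAGS, 3)  # analogous to: pos+0 pos+1 pos+2

# Print the most frequent pairs as "frequency tag1 tag2".
for (first, second), freq in bigrams.most_common(5):
    print(freq, first, second)
```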

I know very little C, so I cannot investigate any further myself.

Does anyone know where the problem is? Thanks for any input.

Best,
Ray

Hunan University of Commerce, China

PS: My computer parameters:

System: Ubuntu 8.04
Apache: 2.0.63
MySQL: 5.0.88
PHP: 5.2.12 (lower than expected 5.3.0)
Perl: 5.8.8
CWB: 3.4.3 (compiled from svn source)
Linux utilities: awk, tar, gzip, iconv

LANG=zh_CN.UTF-8

GDM_LANG=zh_CN.UTF-8

Inside cqpweb_chinese_test_utf8.txt:

<text id="test">
<s>
这些    r
网友    n
们    k
的    u
行为    n
也    d
引来    v
了    u
不少    m
公园    n
游客    n
的    u
围观    v
。    w
</s>
<s>
而    c
对于    p
人们    n
的    u
议论    v
，    w
这些    r
汉    t
服    v
爱好者    n
表示    v
，    w
他们    r
早已    d
习惯    v
了    y
。    w
</s>
</text>
