[CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Sun May 27 13:13:56 CEST 2012

Hi Ray,

The segfault had nothing to do with your data; it was caused by an internal structure that cwb-scan-corpus uses for pattern matching. There is a fix in SVN now; please update and recompile (v 3.4.4).

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ray Wu
Sent: 26 May 2012 13:49
To: Open source development of the Corpus WorkBench
Subject: [CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault

Hi all,
I want to get CQPweb to process Chinese (my native tongue) on my Ubuntu 8.04 machine, so I updated to CWB 3.4.3. Compilation was successful and I could query a tiny Chinese text via cqp from the terminal.

I could also load the Chinese text into CQPweb and complete part of the metadata page. But when I tried "Manage metadata->Create frequency tables", CQPweb reported that it had encountered an error and could not continue. Here is the error message:
cwb-scan-corpus error! Segmentation fault
... in file /usr/local/apache2/htdocs/cqp/lib/freqtable.inc.php line 100.

This seems strange to me: I have browsed the entire mailing list archive and learned that this error message is most likely to occur when a token is too long. But my toy corpus is just a few lines long. I also tried a small English text and the same thing happened.
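
One way to rule out the over-long-token explanation is to measure the longest token in the input file directly. The following is a minimal Python sketch, not part of CWB: the function name `longest_token` is hypothetical, and it assumes the usual vertical format with one token per line, whitespace between columns, and XML-style tags on lines that start with `<`.

```python
def longest_token(lines):
    """Return (token, length in UTF-8 bytes) for the longest token,
    or None if the input contains no token lines."""
    best = None
    for line in lines:
        line = line.strip()
        # Skip blank lines and structural tags such as <s> or </text>.
        if not line or line.startswith("<"):
            continue
        token = line.split()[0]  # first column is the word form
        size = len(token.encode("utf-8"))
        if best is None or size > best[1]:
            best = (token, size)
    return best


# A few lines taken from the test corpus at the end of this post.
sample = [
    '<text id="test">',
    "<s>",
    "这些\tr",
    "爱好者\tn",
    "。\tw",
    "</s>",
    "</text>",
]
print(longest_token(sample))
```

For a real file, the same function can be given an open file handle, for example `open("cqpweb_chinese_test_utf8.txt", encoding="utf-8")`.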

To make the picture clearer, I will try to illustrate my experiment by listing what I have done.

My compiling context for CWB 3.4.3: CWB from svn: 3.4.3; PCRE: 7.4; glib-2.0; gcc: Ubuntu 4.2.4-1ubuntu3.

The compiling process seemed normal, and I could build a tiny Chinese corpus from the text given at the end of this post (hopefully it makes it through the wild net to your computer and remains intelligible).

I then ran the following to index it:
ray at ray-laptop:~$ cwb-encode -c utf8 -d /home/ray/cqputf8 -f cqpweb_chinese_test_utf8.txt -R /usr/local/share/cwb/registry/test -P pos -S text -S s -S text_id
Annotations of s-attribute <text> not stored (file cqpweb_chinese_test_utf8.txt, line #1, warning issued only once).
ray at ray-laptop:~$ cwb-makeall -V TEST (everything says OK)
ray at ray-laptop:~$ cwb-huffcode -A TEST (fine, nothing wrong)
ray at ray-laptop:~$ cwb-compress-rdx  -A TEST (fine again)
I queried the new corpus and nothing was broken:
ray at ray-laptop:~$ cqp -eC
[no corpus]> TEST
TEST> "了";
        7: 们 的 行为 也 引来 <了> 不少 公园 游客 的
       29:  ， 他们 早已 习惯 <了> 。
TEST> <s> []* "了" []* </s>;  (query is OK)

Finally, I resorted to running cwb-scan-corpus manually and did find something unusual:
ray at ray-laptop:~$ cwb-scan-corpus -C TEST pos (fully OK)
ray at ray-laptop:~$ cwb-scan-corpus TEST pos+0 pos+1 (segmentation fault)
ray at ray-laptop:~$ cwb-scan-corpus TEST pos+0 pos+1 pos+2
Scanning corpus TEST for 3-tuples ...
Scan complete.
Printing frequency table on stdout ...
...
段错误 ("segmentation fault" in English)
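
As a cross-check that does not depend on cwb-scan-corpus, the counts that `pos+0 pos+1` is expected to produce (frequencies of adjacent pos-tag pairs) can be approximated in a few lines of Python. This is only a rough sketch under assumptions: the helper name `pos_bigrams` is hypothetical, the input is the vertical text shown at the end of this post, and pairs are counted across sentence boundaries on the assumption that the scan runs over the whole token stream. How the real tool treats the last token of the corpus is not modelled here.

```python
from collections import Counter


def pos_bigrams(lines):
    """Count adjacent (pos, next pos) pairs over the token stream."""
    tags = []
    for line in lines:
        fields = line.split()
        # Keep token lines only: skip blank lines and XML-style tags.
        if len(fields) < 2 or line.lstrip().startswith("<"):
            continue
        tags.append(fields[1])  # second column is the pos tag
    return Counter(zip(tags, tags[1:]))


# First sentence of the test corpus (word form, TAB, pos tag).
sample = [
    '<text id="test">', "<s>",
    "这些\tr", "网友\tn", "们\tk", "的\tu", "行为\tn", "也\td", "引来\tv",
    "了\tu", "不少\tm", "公园\tn", "游客\tn", "的\tu", "围观\tv", "。\tw",
    "</s>", "</text>",
]
counts = pos_bigrams(sample)
for (first, second), freq in counts.most_common():
    print(freq, first, second, sep="\t")
```

If the real tool runs cleanly on the same data, its frequency table should be comparable to this output, apart from ordering and the handling of the corpus end.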
I have very little knowledge of C, so I cannot investigate any further.
Does anyone know where the problem is? Thanks for any input.
Best,
Ray
Hunan University of Commerce, China
PS: My computer parameters:
System: Ubuntu 8.04
Apache: 2.0.63
MySQL: 5.0.88
PHP: 5.2.12 (lower than expected 5.3.0)
Perl: 5.8.8
CWB: 3.4.3 (compiled from svn source)
Linux utilities: awk, tar, gzip, iconv
LANG=zh_CN.UTF-8
GDM_LANG=zh_CN.UTF-8
Inside cqpweb_chinese_test_utf8.txt:
<text id="test">
<s>
这些    r
网友    n
们    k
的    u
行为    n
也    d
引来    v
了    u
不少    m
公园    n
游客    n
的    u
围观    v
。    w
</s>
<s>
而    c
对于    p
人们    n
的    u
议论    v
，    w
这些    r
汉    t
服    v
爱好者    n
表示    v
，    w
他们    r
早已    d
习惯    v
了    y
。    w
</s>
</text>
