[CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault
Ray Wu
liangpingwu at 126.com
Sun May 27 16:09:02 CEST 2012
Hi Andrew,
Thanks for the new commit. I recompiled v 3.4.4 and cwb-scan-corpus no longer complains.
The results are mixed, however, using the corpus I posted earlier. Good news first.
Success 1: Standard Query->Start Query (OK)
Success 2: Restricted query (CQP syntax) (OK)
The bad news is that the frequency list reads like gibberish to me (definitely not Chinese).
Issue 1: Standard query->Collocation->Create collocation database->Collocation controls
No. Word
1 。
2 ä¹ æƒ¯
...
Issue 2: Frequency lists->Show frequency list
No. Word Frequency
1 çš„ 3
2 了 2
...
Also, no word on the frequency list page links back to its concordance view.
After checking the freq_corpus_test_word table, I can see that the item column contains the same gibberish, which might explain the problem.
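The garbled items above look like what appears when UTF-8 bytes are decoded as a single-byte Western encoding somewhere along the way. The Python sketch below reproduces the strings shown in the frequency list. That cp1252 is the encoding involved, and where in the chain the mix-up happens, are assumptions that this thread does not confirm; `garble` and `repair` are hypothetical helper names.

```python
# Illustration only: reproduce the gibberish by decoding UTF-8 bytes as cp1252.
# The choice of cp1252 is an assumption; the thread does not say which
# single-byte encoding (if any) is actually involved.

def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the same bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo garble(): re-encode as cp1252 and decode the bytes as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    for word in ["的", "习惯"]:
        bad = garble(word)
        print(repr(word), "->", repr(bad), "->", repr(repair(bad)))
```

If the stored strings round-trip cleanly through `repair`, the bytes themselves are probably intact and only the declared character set is wrong at some step.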
Meanwhile, I would also like to know how sorting is done for languages other than English. For Chinese, there are usually two types of sorting: by pinyin (or bopomofo) and by character strokes. Is it possible to do that kind of thing in CQPweb? If it is not in CQPweb yet, is there an interface (even just an envisaged one) for doing so?
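To make the two orders concrete, here is a minimal Python sketch that sorts four characters from the test corpus both ways. The lookup table `CHAR_INFO` and both functions are hypothetical and hand-written for this illustration; they are not part of CQPweb or CWB, and a real implementation would need a full dictionary or a locale-aware collation library.

```python
# Hypothetical lookup table: character -> (toneless pinyin, stroke count).
# Hand-written for four characters only, purely for illustration.
CHAR_INFO = {
    "的": ("de", 8),
    "了": ("le", 2),
    "们": ("men", 5),
    "也": ("ye", 3),
}


def sort_by_pinyin(chars):
    """Order characters alphabetically by their romanised reading."""
    return sorted(chars, key=lambda c: CHAR_INFO[c][0])


def sort_by_strokes(chars):
    """Order characters by ascending stroke count."""
    return sorted(chars, key=lambda c: CHAR_INFO[c][1])


if __name__ == "__main__":
    sample = ["们", "的", "也", "了"]
    print("pinyin: ", sort_by_pinyin(sample))
    print("strokes:", sort_by_strokes(sample))
```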
BTW: I will check your update for CQPweb shortly and will post my findings in that thread. Thanks.
Best,
Ray
At 2012-05-27 19:13:56,"Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:
Hi Ray,
The segfault had nothing to do with your data; it came from an internal structure used by cwb-scan-corpus for pattern matching. There is a fix in SVN now; please recompile v 3.4.4.
best
Andrew.
From:cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ray Wu
Sent: 26 May 2012 13:49
To: Open source development of the Corpus WorkBench
Subject: [CWB] CQPweb 3.0.7 on CWB 3.4.3 cwb-scan-corpus error! Segmentation fault
Hi all,
I want to get CQPweb to process Chinese (my native tongue) on Ubuntu 8.04, so I updated to CWB 3.4.3. Compilation was successful and I could query a tiny Chinese text via cqp from the terminal.
I could also load the Chinese text into CQPweb and complete part of the metadata page. But when I tried "Manage metadata->Create frequency tables", CQPweb reported that it had encountered an error and could not continue. Here is the error message:
cwb-scan-corpus error! Segmentation fault
... in file /usr/local/apache2/htdocs/cqp/lib/freqtable.inc.php line 100.
This seems strange to me: I have browsed the entire mailing list archive and learned that this error message most likely occurs when a token is too long, but my toy corpus is only a few lines long. I tried a small English text and the same thing happened.
To make the picture clearer, I will try to illustrate my experiment by listing what I have done.
My build environment for CWB 3.4.3: CWB from svn: 3.4.3; PCRE: 7.4; glib-2.0; gcc: Ubuntu 4.2.4-1ubuntu3.
Compilation seemed normal, and I could build a tiny Chinese corpus from the text given at the end of this post (hopefully it makes it through the wild net to your computer still intelligible).
I then ran the following to index it:
ray at ray-laptop:~$ cwb-encode -c utf8 -d /home/ray/cqputf8 -f cqpweb_chinese_test_utf8.txt -R /usr/local/share/cwb/registry/test -P pos -S text -S s -S text_id
Annotations of s-attribute <text> not stored (file cqpweb_chinese_test_utf8.txt, line #1, warning issued only once).
ray at ray-laptop:~$ cwb-makeall -V TEST (everything says OK)
ray at ray-laptop:~$ cwb-huffcode -A TEST (fine, nothing wrong)
ray at ray-laptop:~$ cwb-compress-rdx -A TEST (fine again)
I queried the new corpus and nothing was broken:
ray at ray-laptop:~$ cqp -eC
[no corpus]> TEST
TEST> "了";
7: 们的行为也引来 <了> 不少公园游客的
29: ,他们早已习惯 <了> 。
TEST> <s> []* "了" []* </s>; (query is OK)
Finally, I resorted to running cwb-scan-corpus manually and did find something unusual:
ray at ray-laptop:~$ cwb-scan-corpus -C TEST pos (fully OK)
ray at ray-laptop:~$ cwb-scan-corpus TEST pos+0 pos+1 (segmentation fault)
ray at ray-laptop:~$ cwb-scan-corpus TEST pos+0 pos+1 pos+2
Scanning corpus TEST for 3-tuples ...
Scan complete.
Printing frequency table on stdout ...
...
段错误 ("segmentation fault" in English)
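For readers unfamiliar with the tool: a key such as `pos+0 pos+1` asks cwb-scan-corpus for a frequency table of the pos values at each corpus position and the position after it. The Python below sketches that counting over the POS tags of the test corpus given at the end of this post. It only illustrates the kind of table a working scan should produce; it is not CWB's implementation, the function name `scan_tuples` is hypothetical, and counting only positions where every offset falls inside the corpus is an assumption about edge handling.

```python
from collections import Counter

# POS tags of the 31 tokens in cqpweb_chinese_test_utf8.txt, in corpus order
# (first sentence on the first line, second sentence on the second).
POS_TAGS = ("r n k u n d v u m n n u v w "
            "c p n u v w r t v n v w r d v y w").split()


def scan_tuples(tags, offsets=(0, 1)):
    """Count tuples of tags found at the given non-negative offsets.

    Assumption: a position is counted only if all offsets stay inside the
    corpus; sentence boundaries are ignored.
    """
    span = max(offsets)
    counts = Counter()
    for i in range(len(tags) - span):
        counts[tuple(tags[i + off] for off in offsets)] += 1
    return counts


if __name__ == "__main__":
    # Print the five most frequent pairs as tab-separated lines.
    for key, freq in scan_tuples(POS_TAGS).most_common(5):
        print(freq, *key, sep="\t")
```

Passing `offsets=(0, 1, 2)` gives the 3-tuple counts corresponding to `pos+0 pos+1 pos+2`.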
I have very little knowledge of C, so I cannot investigate any further.
Does anyone know where the problem is? Thanks for any input.
Best,
Ray
Hunan University of Commerce, China
PS: My computer parameters:
System: Ubuntu 8.04
Apache: 2.0.63
MySQL: 5.0.88
PHP: 5.2.12 (lower than expected 5.3.0)
Perl: 5.8.8
CWB: 3.4.3 (compiled from svn source)
Linux utilities: awk, tar, gzip, iconv
LANG=zh_CN.UTF-8
GDM_LANG=zh_CN.UTF-8
Inside cqpweb_chinese_test_utf8.txt:
<text id="test">
<s>
这些 r
网友 n
们 k
的 u
行为 n
也 d
引来 v
了 u
不少 m
公园 n
游客 n
的 u
围观 v
。 w
</s>
<s>
而 c
对于 p
人们 n
的 u
议论 v
, w
这些 r
汉 t
服 v
爱好者 n
表示 v
, w
他们 r
早已 d
习惯 v
了 y
。 w
</s>
</text>