[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

Tue Jun 19 17:06:26 CEST 2018

Did you get any odd messages when you ran the frequency-list setup on CQPweb?

If not – what version of the code do you have?

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hermann Lai
Sent: 19 June 2018 11:32
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

part of the output of "cwb-decode -C CANTON1 -ALL | less"

<s>
<text>
<text_id T01>
中環    N       中環
保育    V       保育
奇觀    N       奇觀
：      PU      ：
孫中山  N       孫中山
史蹟    N       史蹟
徑      N       徑
至      CONJ    至
大館    N       大館
</text_id>
</text>
</s>

part of the output of "cwb-describe-corpus -s CANTON1"

============================================================
Corpus: CANTON1
============================================================

description:
registry file:  /usr/local/share/cwb/registry/canton1
home directory: /usr/local/corpora/data/canton1/
info file:      /usr/local/corpora/data/canton1/.info
size (tokens):  23

  3 positional attributes
  3 structural attributes
  0 alignment  attributes

p-ATT word                     23 tokens,       22 types
p-ATT pos                      23 tokens,        8 types
p-ATT lemma                    23 tokens,       22 types
s-ATT s                         2 regions
s-ATT text                      2 regions
s-ATT text_id                   2 regions (with annotations)

It seems that CWB counts the words correctly (23 tokens) but CQPweb does not.
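
To cross-check the figure CWB reports against the source file, the token lines and text regions in the .vrt input can be counted directly. The following is only a minimal sketch: the function name count_vrt is hypothetical, and it assumes that every line starting with "<" is a structural tag, that blank lines are skipped (as with the -s option used when encoding), and that each text region opens with a <text ...> tag.

```python
import sys


def count_vrt(lines):
    """Return (tokens, texts) for an iterable of VRT lines.

    Assumptions (not taken from CWB itself): any non-blank line that
    does not start with '<' is one token; every line starting with
    '<text ' or equal to '<text>' opens one text region.
    """
    tokens = 0
    texts = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no token
        if line.startswith("<"):
            if line == "<text>" or line.startswith("<text "):
                texts += 1
            continue  # structural tag, not a token
        tokens += 1
    return tokens, texts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_vrt.py /path/to/corpus.vrt
    with open(sys.argv[1], encoding="utf-8") as handle:
        n_tokens, n_texts = count_vrt(handle)
    print(f"tokens: {n_tokens}, text regions: {n_texts}")
```

If the token count printed here agrees with the "size (tokens)" line above, the index itself is probably fine and the discrepancy is more likely on the CQPweb side.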

Regards,
Lai

2018-06-19 15:43 GMT+08:00 Stefan Evert <stefanML at collocations.de>:
What does the corpus look like if you decode it from the CWB index with the following command?

        cwb-decode -C CANTON1 -ALL | less

Can you show us part of the output?  It would also be useful to see the output of

        cwb-describe-corpus -s CANTON1

One possibility I can think of is that your linebreaks are messed up so that CWB treats everything within the text region as a single long line.
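
One way to test the linebreak hypothesis is to count the kinds of line endings in the raw input file. This is a rough sketch only; the function name count_line_endings is made up for illustration, and the script simply reads whatever file path is given on the command line. A file consisting mostly of bare CR endings (old Mac style) would be a likely candidate for being read as one long line.

```python
import sys


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one LF and one CR, so subtract the pairs
    # to obtain the number of bare occurrences of each.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python line_endings.py /path/to/corpus.vrt
    with open(sys.argv[1], "rb") as handle:
        print(count_line_endings(handle.read()))
```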

Best,
Stefan

> On 19 Jun 2018, at 09:26, Hermann Lai <halflifelai at gmail.com> wrote:
>
> I am using CQPwebinabox and I have indexed a Traditional Chinese corpus called "canton1" using two commands:
>
> sudo cwb-encode -d /usr/local/corpora/data/canton1 -f /home/user/Desktop/corpora/canton1/canton1.vrt -R /usr/local/share/cwb/registry/canton1 -c utf8 -xsB -P pos -P lemma -S s:0 -S text:0+id
>
> sudo cwb-make -V CANTON1
>
> After that, I installed the corpus on CQPweb. Most things are correct. However, the total number of corpus texts is the same as the total number of words in all corpus texts.


--
Gaspard Germannson