[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

Tue Jun 19 12:32:05 CEST 2018

Part of the output of "cwb-decode -C CANTON1 -ALL | less":

<s>
<text>
<text_id T01>
中環    N       中環
保育    V       保育
奇觀    N       奇觀
：      PU      ：
孫中山  N       孫中山
史蹟    N       史蹟
徑      N       徑
至      CONJ    至
大館    N       大館
</text_id>
</text>
</s>

Part of the output of "cwb-describe-corpus -s CANTON1":

============================================================
Corpus: CANTON1
============================================================

description:
registry file:  /usr/local/share/cwb/registry/canton1
home directory: /usr/local/corpora/data/canton1/
info file:      /usr/local/corpora/data/canton1/.info
size (tokens):  23

  3 positional attributes
  3 structural attributes
  0 alignment  attributes

p-ATT word                     23 tokens,       22 types
p-ATT pos                      23 tokens,        8 types
p-ATT lemma                    23 tokens,       22 types
s-ATT s                         2 regions
s-ATT text                      2 regions
s-ATT text_id                   2 regions (with annotations)

It seems that CWB recognizes the number of words correctly (23 tokens), but CQPweb does not.
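
To cross-check these numbers against the source file, and to test the line-break hypothesis quoted below, the .vrt input can be counted independently of CWB. The following is only a minimal sketch: the function name vrt_stats is hypothetical, and it assumes that any line starting with "<" and ending with ">" is an XML tag and every other non-empty line is one token, which is a heuristic and not the exact rule cwb-encode applies.

```python
import re


def vrt_stats(data: bytes) -> dict:
    """Count token lines and <text> start tags in raw VRT bytes.

    Heuristic (an assumption, not cwb-encode's exact logic): a line that
    starts with '<' and ends with '>' is a tag; any other non-empty line
    is one token.
    """
    # Line-ending check: bare carriage returns (no following '\n') would
    # make many tokens look like a single long line.
    crlf = data.count(b"\r\n")
    bare_cr = data.count(b"\r") - crlf

    tokens = 0
    texts = 0
    for line in data.decode("utf-8").split("\n"):
        line = line.rstrip("\r")  # tolerate Windows-style CRLF endings
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<") and line.endswith(">"):
            # Count opening <text> or <text ...> tags only,
            # not </text> and not other tags such as <text_id ...>.
            if re.match(r"<text(\s|>)", line):
                texts += 1
        else:
            tokens += 1

    return {"tokens": tokens, "texts": texts, "crlf": crlf, "bare_cr": bare_cr}
```

Calling vrt_stats(open(path, "rb").read()) on the .vrt file passed to cwb-encode returns the four counts. If "tokens" is close to "texts", or "bare_cr" is non-zero, the input file is a more likely culprit than CQPweb.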

Regards,
Lai

2018-06-19 15:43 GMT+08:00 Stefan Evert <stefanML at collocations.de>:

> What does the corpus look like if you decode it from the CWB index with
> the following command?
>
>         cwb-decode -C CANTON1 -ALL | less
>
> Can you show us part of the output?  It would also be useful to see the
> output of
>
>         cwb-describe-corpus -s CANTON1
>
>
> One possibility I can think of is that your linebreaks are messed up so
> that CWB treats everything within the text region as a single long line.
>
> Best,
> Stefan
>
>
> > On 19 Jun 2018, at 09:26, Hermann Lai <halflifelai at gmail.com> wrote:
> >
> > I am using CQPwebinabox and I have indexed a Traditional Chinese corpus
> > called "canton1" using two commands:
> >
> > sudo cwb-encode -d /usr/local/corpora/data/canton1 -f
> > /home/user/Desktop/corpora/canton1/canton1.vrt -R
> > /usr/local/share/cwb/registry/canton1 -c utf8 -xsB -P pos -P lemma -S s:0
> > -S text:0+id
> >
> > sudo cwb-make -V CANTON1
> >
> > After that, I installed the corpus in CQPweb. Most things are
> > correct. However, the total number of corpus texts is the same as the
> > total number of words in all corpus texts.
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>

-- 

*Gaspard Germannson*