[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

Hermann Lai halflifelai at gmail.com
Tue Jun 19 21:10:12 CEST 2018


No, I didn't get any messages when I used the frequency list controls.

I am using CQPwebinabox Esmeralda (CQPweb 3.2.11) and CWB 3.4.8 (checked by
running "cqp -v").

Regards,
Lai

2018-06-19 23:06 GMT+08:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:

> Did you get any odd messages when you ran the frequency-list setup on
> CQPweb?
>
> If not – what version of the code do you have?
>
> best
>
> Andrew.
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Hermann Lai
> *Sent:* 19 June 2018 11:32
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it
> >
> *Subject:* Re: [CWB] Incorrect total words count in a Traditional Chinese
> corpus on CQPweb
>
>
>
> part of the output of "cwb-decode -C CANTON1 -ALL | less"
>
>
>
> <s>
> <text>
> <text_id T01>
> 中環    N       中環
> 保育    V       保育
> 奇觀    N       奇觀
> :      PU      :
> 孫中山  N       孫中山
> 史蹟    N       史蹟
> 徑      N       徑
> 至      CONJ    至
> 大館    N       大館
> </text_id>
> </text>
> </s>
>
>
>
>
>
> part of the output of "cwb-describe-corpus -s CANTON1"
>
>
>
> ============================================================
> Corpus: CANTON1
> ============================================================
>
> description:
> registry file:  /usr/local/share/cwb/registry/canton1
> home directory: /usr/local/corpora/data/canton1/
> info file:      /usr/local/corpora/data/canton1/.info
> size (tokens):  23
>
>   3 positional attributes
>   3 structural attributes
>   0 alignment  attributes
>
> p-ATT word                     23 tokens,       22 types
> p-ATT pos                      23 tokens,        8 types
> p-ATT lemma                    23 tokens,       22 types
> s-ATT s                         2 regions
> s-ATT text                      2 regions
> s-ATT text_id                   2 regions (with annotations)
>
>
>
>
>
> It seems that CWB counts the words correctly but CQPweb doesn't.
>
>
>
> Regards,
>
> Lai
>
>
>
> 2018-06-19 15:43 GMT+08:00 Stefan Evert <stefanML at collocations.de>:
>
> What does the corpus look like if you decode it from the CWB index with
> the following command?
>
>         cwb-decode -C CANTON1 -ALL | less
>
> Can you show us part of the output?  It would also be useful to see the
> output of
>
>         cwb-describe-corpus -s CANTON1
>
>
> One possibility I can think of is that your linebreaks are messed up so
> that CWB treats everything within the text region as a single long line.
>
> Best,
> Stefan
>
>
> > On 19 Jun 2018, at 09:26, Hermann Lai <halflifelai at gmail.com> wrote:
> >
> > I am using CQPwebinabox and I have indexed a Traditional Chinese corpus
> > called "canton1" using two commands:
> >
> > sudo cwb-encode -d /usr/local/corpora/data/canton1 \
> >     -f /home/user/Desktop/corpora/canton1/canton1.vrt \
> >     -R /usr/local/share/cwb/registry/canton1 \
> >     -c utf8 -xsB -P pos -P lemma -S s:0 -S text:0+id
> >
> > sudo cwb-make -V CANTON1
> >
> > After that, I installed the corpus on CQPweb. Most things are
> > correct. However, the total number of corpus texts is shown as the same
> > as the total number of words in all corpus texts.
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>
>
>
>
>
> --
>
> *Gaspard Germannson*
>

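A quick way to test the line-break hypothesis raised in the quoted thread is to inspect the .vrt file before encoding. The following is only a minimal Python sketch, not part of CWB or CQPweb: the function name vrt_line_stats and the sample bytes are made up for illustration, and it assumes (as in the decoded output above) one token per line with tab-separated columns and XML tags on their own lines.

```python
def vrt_line_stats(raw: bytes) -> dict:
    """Count line-ending styles and token/tag lines in raw VRT bytes.

    Hypothetical helper for illustration; lines are split on LF only,
    which is an assumption about how an encoder might see the input.
    """
    crlf = raw.count(b"\r\n")
    bare_cr = raw.count(b"\r") - crlf   # CR not followed by LF
    bare_lf = raw.count(b"\n") - crlf   # LF not preceded by CR

    tokens = 0
    tags = 0
    for line in raw.decode("utf-8").split("\n"):
        line = line.rstrip("\r")
        if not line.strip():
            continue                    # skip empty lines
        if line.startswith("<"):
            tags += 1                   # XML tag line such as <text id="T01">
        else:
            tokens += 1                 # anything else is treated as a token line
    return {
        "crlf": crlf,
        "bare_cr": bare_cr,
        "bare_lf": bare_lf,
        "tokens": tokens,
        "tags": tags,
    }


# Made-up sample: two token lines wrapped in one <text> region.
sample = '<text id="T01">\n中環\tN\t中環\n保育\tV\t保育\n</text>\n'.encode("utf-8")
print(vrt_line_stats(sample))

# The same content with CR-only line endings collapses into a single line.
print(vrt_line_stats(sample.replace(b"\n", b"\r")))
```

To run it on the real file, pass the bytes from open(path, "rb").read(), where path is the file given to cwb-encode with -f. A non-zero bare_cr count, or a token count far below what is expected, would be worth a closer look.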

-- 

*Gaspard Germannson*