[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb
Hardie, Andrew
a.hardie at lancaster.ac.uk
Tue Jun 19 17:06:26 CEST 2018
Did you get any odd messages when you ran the frequency-list setup on CQPweb?
If not – what version of the code do you have?
best
Andrew.
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hermann Lai
Sent: 19 June 2018 11:32
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb
Part of the output of "cwb-decode -C CANTON1 -ALL | less":
<s>
<text>
<text_id T01>
中環 N 中環
保育 V 保育
奇觀 N 奇觀
: PU :
孫中山 N 孫中山
史蹟 N 史蹟
徑 N 徑
至 CONJ 至
大館 N 大館
</text_id>
</text>
</s>
Part of the output of "cwb-describe-corpus -s CANTON1":
============================================================
Corpus: CANTON1
============================================================
description:
registry file: /usr/local/share/cwb/registry/canton1
home directory: /usr/local/corpora/data/canton1/
info file: /usr/local/corpora/data/canton1/.info
size (tokens): 23
3 positional attributes
3 structural attributes
0 alignment attributes
p-ATT word 23 tokens, 22 types
p-ATT pos 23 tokens, 8 types
p-ATT lemma 23 tokens, 22 types
s-ATT s 2 regions
s-ATT text 2 regions
s-ATT text_id 2 regions (with annotations)
It seems that CWB counts the words correctly, but CQPweb does not.
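To compare these figures with what CQPweb reports, the summary can be turned into numbers automatically. The following is only a sketch: `parse_describe_output` is a hypothetical helper, not part of CWB or CQPweb, and the line format it expects is assumed from the output quoted in this message.

```python
import re


def parse_describe_output(text: str) -> dict:
    """Extract token and region counts from `cwb-describe-corpus -s` output.

    The line patterns are assumptions based on the output quoted in this
    thread; other CWB versions may format the summary differently.
    """
    summary = {"size": None, "p_attributes": {}, "s_attributes": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"size \(tokens\):\s*(\d+)", line)
        if m:
            summary["size"] = int(m.group(1))
            continue
        m = re.match(r"p-ATT\s+(\S+)\s+(\d+) tokens,\s*(\d+) types", line)
        if m:
            summary["p_attributes"][m.group(1)] = {
                "tokens": int(m.group(2)),
                "types": int(m.group(3)),
            }
            continue
        m = re.match(r"s-ATT\s+(\S+)\s+(\d+) regions", line)
        if m:
            summary["s_attributes"][m.group(1)] = int(m.group(2))
    return summary
```

For the output above this would give a corpus size of 23 tokens and 2 regions for the "text" attribute, which can then be checked against the word and text totals shown in the CQPweb interface.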
Regards,
Lai
2018-06-19 15:43 GMT+08:00 Stefan Evert <stefanML at collocations.de>:
What does the corpus look like if you decode it from the CWB index with the following command?
cwb-decode -C CANTON1 -ALL | less
Can you show us part of the output? It would also be useful to see the output of
cwb-describe-corpus -s CANTON1
One possibility I can think of is that your linebreaks are messed up so that CWB treats everything within the text region as a single long line.
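A quick way to test the linebreak hypothesis, independent of CWB, is to count the line terminators in the input file. This is a minimal sketch; `count_line_endings` is a made-up name for illustration, and the file has to be read in binary mode so that Python does not normalise the line endings first.

```python
import re
from collections import Counter

# Map each raw terminator to a readable label.
_LABELS = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}


def count_line_endings(data: bytes) -> Counter:
    """Count CRLF, lone LF, and lone CR terminators in raw file bytes."""
    counts = Counter()
    # CRLF is listed first so it is not counted as a CR followed by an LF.
    for match in re.finditer(rb"\r\n|\r|\n", data):
        counts[_LABELS[match.group()]] += 1
    return counts
```

Usage would be along the lines of `count_line_endings(open("canton1.vrt", "rb").read())`. A file that contains only lone CR terminators would look like a single long line to any tool that splits on LF.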
Best,
Stefan
> On 19 Jun 2018, at 09:26, Hermann Lai <halflifelai at gmail.com> wrote:
>
> I am using CQPwebInABox and I have indexed a Traditional Chinese corpus called "canton1" with two commands:
>
> sudo cwb-encode -d /usr/local/corpora/data/canton1 -f /home/user/Desktop/corpora/canton1/canton1.vrt -R /usr/local/share/cwb/registry/canton1 -c utf8 -xsB -P pos -P lemma -S s:0 -S text:0+id
>
> sudo cwb-make -V CANTON1
>
> After that, I installed the corpus on CQPweb. Most things are correct. However, the total number of corpus texts is the same as the total number of words in all corpus texts.
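As an independent cross-check of both numbers, the tokens and text regions can be counted directly in the .vrt input. The sketch below assumes the usual verticalised layout (one token per line, XML tags on lines of their own); `count_vrt` is a hypothetical helper, and the tag test is a heuristic that would miscount a token line that itself starts with "<" and ends with ">".

```python
def count_vrt(lines, text_tag="text"):
    """Return (token_count, text_count) for an iterable of VRT lines.

    Assumes one token per line and each XML tag on its own line.
    """
    tokens = 0
    texts = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<") and line.endswith(">"):
            # Count only opening tags of the text element, with or without attributes.
            if line.startswith(f"<{text_tag} ") or line == f"<{text_tag}>":
                texts += 1
            continue
        tokens += 1
    return tokens, texts
```

It could be called as `count_vrt(open("canton1.vrt", encoding="utf-8"))`, and the two results compared with the token count from CWB and with the text and word totals in CQPweb.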
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb
--
Gaspard Germannson