[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

Hermann Lai halflifelai at gmail.com
Tue Jun 19 09:26:26 CEST 2018


Hello everyone,

I am using CQPwebInABox and have indexed a Traditional Chinese corpus
called "canton1" with the following two commands:

sudo cwb-encode -d /usr/local/corpora/data/canton1 \
  -f /home/user/Desktop/corpora/canton1/canton1.vrt \
  -R /usr/local/share/cwb/registry/canton1 \
  -c utf8 -xsB -P pos -P lemma -S s:0 -S text:0+id

sudo cwb-make -V CANTON1

After that, I installed the corpus in CQPweb. Most things are correct.
However, the total number of corpus texts is reported as the same as the
total number of words in all corpus texts.

Excerpt from the .vrt file of "canton1" (the spaces shown here are tabs in the actual file):

<text id="T01">
<s>
中環 N 中環
保育 V 保育
奇觀 N 奇觀
</s>
</text>
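
For reference, the text and token counts of a .vrt file can be checked independently of CQPweb. Below is a minimal sketch, assuming one token per line with three tab-separated columns (word, pos, lemma) as in the excerpt above. The temporary sample file and the variable names vrt and summary are hypothetical and only stand in for the real canton1.vrt.

```shell
# Hypothetical sample standing in for canton1.vrt; printf writes real tabs (\t).
vrt=$(mktemp)
printf '<text id="T01">\n<s>\n中環\tN\t中環\n保育\tV\t保育\n奇觀\tN\t奇觀\n</s>\n</text>\n' > "$vrt"

# Count <text> start tags, token lines, and token lines that do not
# have exactly three tab-separated columns.
summary=$(awk -F'\t' '
  /^<text[ >]/ { texts++; next }   # opening tag of a text region
  /^</         { next }            # any other XML tag line is not a token
  NF > 0       { tokens++; if (NF != 3) bad++ }
  END { printf "texts=%d tokens=%d malformed=%d\n", texts, tokens, bad }
' "$vrt")

echo "$summary"
rm -f "$vrt"
```

If the counts from such a check look right for the real file, the discrepancy is more likely to come from a later step than from the .vrt input itself.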

Excerpt from the metadata file of "canton1":

T01 beginning
T02 ending

How can I fix this problem? Thank you.

Best regards from Hong Kong,
Lai