[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

Tue Jun 19 09:43:50 CEST 2018

What does the corpus look like if you decode it from the CWB index with the following command?

	cwb-decode -C CANTON1 -ALL | less

Can you show us part of the output?  It would also be useful to see the output of

	cwb-describe-corpus -s CANTON1

One possibility I can think of is that your line breaks are messed up, so that CWB treats everything within each text region as a single long line and therefore as a single token.
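
A quick way to test this hypothesis is to count the line terminators in the input file before encoding. The following is only a minimal sketch: the function name count_line_endings is made up for illustration, and it assumes a POSIX shell with the standard tr and wc utilities.

```shell
# Hypothetical helper (not part of CWB): report how many LF and CR
# bytes a .vrt file contains, to spot unusual line endings.
count_line_endings() {
    vrt="$1"
    # Delete everything except the byte of interest, then count what is left.
    # The $(( ... )) wrapper strips the padding some wc implementations add.
    lf=$(( $(tr -cd '\n' < "$vrt" | wc -c) ))
    cr=$(( $(tr -cd '\r' < "$vrt" | wc -c) ))
    printf 'LF=%d CR=%d\n' "$lf" "$cr"
}
```

For example, you could run count_line_endings /home/user/Desktop/corpora/canton1/canton1.vrt. A result with LF=0 and a large CR count would point to old Mac-style line endings, which would fit the single-long-line explanation. Roughly equal counts would suggest Windows-style CRLF. A large LF count with CR=0 would mean the line breaks are probably fine and the cause lies elsewhere.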

Best,
Stefan

> On 19 Jun 2018, at 09:26, Hermann Lai <halflifelai at gmail.com> wrote:
> 
> I am using CQPwebinabox and I have indexed a Traditional Chinese corpus called "canton1" using two commands:
> 
> sudo cwb-encode -d /usr/local/corpora/data/canton1 -f /home/user/Desktop/corpora/canton1/canton1.vrt -R /usr/local/share/cwb/registry/canton1 -c utf8 -xsB -P pos -P lemma -S s:0 -S text:0+id
> 
> sudo cwb-make -V CANTON1
> 
> After that, I installed the corpus on CQPweb. Most things are correct. However, the total number of corpus texts is the same as the total number of words in all corpus texts.