[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

Stefan Evert stefanML at collocations.de
Tue Jun 19 09:43:50 CEST 2018


What does the corpus look like if you decode it from the CWB index with the following command?

	cwb-decode -C CANTON1 -ALL | less

Can you show us part of the output?  It would also be useful to see the output of

	cwb-describe-corpus -s CANTON1


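One possibility I can think of is that the line breaks in your input file are messed up, so that CWB treats everything within a text region as a single long line and hence as a single token. That would explain why the number of words equals the number of texts.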
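
If you want to test this hypothesis on the input file before re-encoding, here is a rough sketch for a POSIX shell. The function name check_linebreaks is just something I made up for illustration (it is not part of CWB); it counts LF and CR characters in a file.

```shell
# Hypothetical helper (not a CWB tool): count line-feed and
# carriage-return characters in a file to see which line endings it uses.
check_linebreaks() {
    f="$1"
    # Delete everything except LF (resp. CR), then count the remaining bytes.
    lf=$(tr -cd '\n' < "$f" | wc -c | tr -d ' ')
    cr=$(tr -cd '\r' < "$f" | wc -c | tr -d ' ')
    echo "LF=$lf CR=$cr"
}

# Example call, using the path from your cwb-encode command:
#   check_linebreaks /home/user/Desktop/corpora/canton1/canton1.vrt
```

A file with CR characters but no (or very few) LF characters would look like one long line to line-oriented tools. If LF is roughly the number of tokens plus tags and CR is 0, the line breaks are probably not the problem.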

Best,
Stefan


> On 19 Jun 2018, at 09:26, Hermann Lai <halflifelai at gmail.com> wrote:
> 
> I am using CQPwebinabox and I have indexed a Traditional Chinese corpus called "canton1" using two commands:
> 
> sudo cwb-encode -d /usr/local/corpora/data/canton1 -f /home/user/Desktop/corpora/canton1/canton1.vrt -R /usr/local/share/cwb/registry/canton1 -c utf8 -xsB -P pos -P lemma -S s:0 -S text:0+id
> 
> sudo cwb-make -V CANTON1
> 
> After that, I installed the corpus in CQPweb. Most things are correct. However, the total number of corpus texts is the same as the total number of words in all corpus texts.


