[CWB] problems with lemma query and corpus word count

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon May 9 17:30:23 CEST 2016


Hi Benedikt,

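1. This is almost certainly due to your input files having CR-LF line breaks (Windows style), whereas CWB on Unix expects LF line breaks. The stray CR most likely ends up attached to the last column of each line, i.e. the lemma, so the indexed lemma is "example" plus one invisible trailing character. That would explain why {example?} matches (the ? wildcard stands for exactly one character) while {example} does not. Converting the files to LF line breaks and re-indexing should fix it.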

2. This can have a number of causes. Try running a query for [] (in CQP syntax this means "any word") and see how many results are returned. If it returns only 4 hits, then the problem is at the indexing stage: only 4 words of your text have actually been indexed. If it returns a number equal to the number of tokens in the corpus, then the index is fine but there is a database error: try re-running the frequency list setup and see whether that fixes it.
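
To make both checks concrete, here is a minimal Python sketch. It is not part of CWB or CQPweb, and the function names are made up for illustration. It assumes the input is a tab-separated, one-token-per-line file like the one quoted below, and that any structural tags sit on their own lines beginning with "<". The first function strips the CR characters; the second gives a rough token count to compare against the number of hits for [].

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Turn Windows CR-LF line breaks into Unix LF line breaks."""
    return data.replace(b"\r\n", b"\n")


def count_tokens(data: bytes) -> int:
    """Count lines that look like tokens: non-empty and not starting with '<'.

    Assumption: lines beginning with '<' are structural tags, not tokens.
    """
    count = 0
    for line in crlf_to_lf(data).split(b"\n"):
        if line.strip() and not line.startswith(b"<"):
            count += 1
    return count


# Tiny made-up sample with CR-LF line breaks (the <text> tag is hypothetical).
sample = b'<text id="t1">\r\nVorwort\tNN\tVorwort\r\nEs\tPPER\tes\r\n</text>\r\n'

# Splitting on LF alone leaves the CR glued to the last column (the lemma).
lemma_with_cr = sample.split(b"\n")[1].split(b"\t")[-1]

# After conversion the lemma column is clean.
lemma_clean = crlf_to_lf(sample).split(b"\n")[1].split(b"\t")[-1]

print(lemma_with_cr, lemma_clean, count_tokens(sample))
```

To convert a real file, read it in binary mode, pass the bytes through crlf_to_lf, write the result to a new file, and index that file instead.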

Hope this helps,

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Benedikt Singpiel
Sent: 09 May 2016 15:23
To: cwb at sslmit.unibo.it
Subject: [CWB] problems with lemma query and corpus word count

Hello everyone,

I have two minor problems bugging me, using CQPweb 3.2.11 (maybe the
answers are a bit more obvious to you than to me...):


1. When searching for lemma annotations with {example}, I don't get any
results back. Only if I enter {example?} do hits for the lemma 'example'
show up. What is the problem with my lemma column here (something
wrong with the line breaks in my indexed text file)?

My text file format (regular TreeTagger output):
Vorwort	NN	Vorwort
Es	PPER	es
ist	VAFIN	sein
Dienstagmorgen	NN	Dienstagmorgen



2. The corpus metadata summary states only 4 'Total words in all corpus
texts' for a corpus that actually has around 1 million tokens. Why could
this corpus word count be so wrong (stated no. of words = no. of texts)?


best


Benedikt Singpiel



