[CWB] problems with lemma query and corpus word count
Hardie, Andrew
a.hardie at lancaster.ac.uk
Mon May 9 17:30:23 CEST 2016
Hi Benedikt,
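1. This is almost certainly due to your input files having CR-LF linebreaks (Windows style), whereas CWB on Unix expects LF linebreaks. The stray CR is then indexed as part of the last column on each line (here, the lemma), so the stored lemma is not exactly 'example'; that would explain why {example?} matches, since ? stands for any single character. Converting the files to LF linebreaks and re-indexing should fix it.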
2. This can have a number of causes. Try running a query for [] (in CQP syntax, this means "any word") and see how many results are returned. If only 4, then the problem is at the indexing stage: only 4 words of your text have actually been indexed. If it returns a number equal to the number of tokens in the corpus, then the index is fine, but there is a database error: try re-running the frequency list setup and see whether that fixes it.
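For point 1, a minimal Python sketch of how the linebreaks could be checked and converted before re-indexing. It is an illustration only, not part of CWB or CQPweb: the function names are made up for this example, and the sample assumes the columns are tab-separated, which the message does not state explicitly.

```python
from pathlib import Path


def count_crlf(data: bytes) -> int:
    """Return the number of CR-LF sequences in the raw file contents."""
    return data.count(b"\r\n")


def normalize_line_endings(data: bytes) -> bytes:
    """Convert CR-LF linebreaks to LF, then turn any remaining lone CR into LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> int:
    """Write an LF-only copy of src to dst; return how many CR-LF pairs were found."""
    raw = Path(src).read_bytes()
    Path(dst).write_bytes(normalize_line_endings(raw))
    return count_crlf(raw)


# Hypothetical two-token sample in the layout from the original message,
# assuming tab-separated columns and Windows linebreaks.
sample = b"Vorwort\tNN\tVorwort\r\nEs\tPPER\tes\r\n"

# Splitting on LF only (as a Unix tool would) leaves the CR glued to the lemma.
first_lemma_before = sample.split(b"\n")[0].split(b"\t")[-1]
first_lemma_after = normalize_line_endings(sample).split(b"\n")[0].split(b"\t")[-1]

print(count_crlf(sample))   # 2
print(first_lemma_before)   # b'Vorwort\r'
print(first_lemma_after)    # b'Vorwort'
```

If count_crlf reports anything above zero for your input files, converting them (for example with convert_file, or a standard tool such as dos2unix) and re-indexing would be the thing to try first.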
Hope this helps,
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Benedikt Singpiel
Sent: 09 May 2016 15:23
To: cwb at sslmit.unibo.it
Subject: [CWB] problems with lemma query and corpus word count
Hello everyone,
I have two minor problems bugging me, using CQPweb 3.2.11 (maybe the
answers to them are a bit more obvious to you than to me...):
1. When searching for lemma annotations with {example}, I don't get any
results back. Only if I enter {example?} do hits for the lemma 'example'
show up. What is the problem with my lemma column here (something
wrong with the line breaks in my indexed text file)?
My text file schema (regular TreeTagger format):
Vorwort NN Vorwort
Es PPER es
ist VAFIN sein
Dienstagmorgen NN Dienstagmorgen
2. The corpus metadata summary states only 4 'Total words in all corpus
texts' for a corpus that actually has around 1 million tokens. Why could
this corpus word count be so wrong (stated no. of words = no. of texts)?
best
Benedikt Singpiel