[CWB] problems with lemma query and corpus word count

Benedikt Singpiel benedikt.singpiel at uni-leipzig.de
Tue May 10 11:16:39 CEST 2016


Hello Andrew,

thanks again for your quick help.

1. My lemma problem is solved: just as you suspected, it was due to
wrong linebreak settings.
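
For anyone hitting the same issue, here is a minimal sketch of the linebreak fix, assuming the input is a tab-separated TreeTagger-style file that can be read into memory. The function names and the sample data are illustrative only and not part of CWB or CQPweb. With CR-LF line endings, the carriage return presumably stays attached to the last column (the lemma), which would explain why {example?} matched while {example} did not.

```python
def normalize_linebreaks(data: bytes) -> bytes:
    """Replace Windows CR-LF line endings with Unix LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Hypothetical helper: write an LF-only copy of src_path to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_linebreaks(data))


# Illustrative sample with CR-LF line endings (not taken from the real corpus).
sample = b"Es\tPPER\tes\r\nist\tVAFIN\tsein\r\n"

# Splitting on LF alone leaves the CR glued to the lemma column.
lemma_before = sample.split(b"\n")[0].split(b"\t")[-1]

fixed = normalize_linebreaks(sample)
lemma_after = fixed.split(b"\n")[0].split(b"\t")[-1]
```

The converted file would then need to be indexed again for the change to take effect.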


2. The word count error remains. I ran the 'any word' search you
suggested and it returned half a million tokens. That should be about
right, so indexing was successful. I then recreated all the
frequency/wordcount lists in the 'manage frequency lists' section, but
the output did not change. Any further suggestions on the issue?
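
As a cross-check that does not depend on CQPweb, the number of tokens in the input file can be counted directly and compared with the result of the 'any word' query. The following is only a sketch: it assumes one token per line, and it assumes that every line starting with "<" is markup rather than a token, which would miscount a literal token that starts with "<". The function name and the sample lines are made up for illustration.

```python
def count_tokens(lines):
    """Count token lines, skipping blank lines and lines that look like XML tags."""
    count = 0
    for line in lines:
        stripped = line.strip()
        # Assumption: blank lines and lines starting with "<" are not tokens.
        if not stripped or stripped.startswith("<"):
            continue
        count += 1
    return count


# Illustrative sample only; the text id is hypothetical.
sample = [
    '<text id="t1">',
    "Vorwort\tNN\tVorwort",
    "Es\tPPER\tes",
    "ist\tVAFIN\tsein",
    "</text>",
    "",
]
n_tokens = count_tokens(sample)
```

For a real file, the same function can be applied to an open file handle, for example `count_tokens(open(path, encoding="utf-8"))`, where `path` stands for the input file.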


best


Benedikt









Quoting cwb-request at sslmit.unibo.it:


> Message: 3
> Date: Mon, 9 May 2016 15:30:23 +0000
> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
> To: Open source development of the Corpus WorkBench
> 	<cwb at sslmit.unibo.it>
> Subject: Re: [CWB] problems with lemma query and corpus word count
> Message-ID:
> 	<28078EC3FBF1B940A3EF3D0D19BE351D7FB3536B at EX-0-MB1.lancs.local>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Benedikt,
>
> 1. This is almost certainly due to your input files having CR-LF  
> linebreaks (Windows style), whereas CWB on Unix expects LF linebreaks.
>
> 2. This can have a number of causes. Try doing a query for [] (using  
> CQP syntax: this means "any word"), and see how many results are  
> returned. If only 4, then the problem is at the indexing stage: only  
> 4 words of your text have actually been indexed. If it returns a  
> number equal to the N of tokens in the corpus, then the index is  
> fine, but there is a database error: try re-running frequency list  
> setup and see if it is fixed.
>
> Hope this helps,
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it  
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Benedikt Singpiel
> Sent: 09 May 2016 15:23
> To: cwb at sslmit.unibo.it
> Subject: [CWB] problems with lemma query and corpus word count
>
> Hello everyone,
>
> I have two minor problems bugging me, using CQPweb 3.2.11 (maybe the
> answers are a bit more obvious to you than to me...):
>
>
> 1. When searching for lemma annotations {example}, I don't get any
> results back. Only if I enter {example?} will hits for the lemma 'example'
> show. What is the problem with my lemma column here (something
> wrong with the line breaks in my indexed text file)?
>
> My text file format (regular TreeTagger format):
> Vorwort	NN	?Vorwort
> Es	PPER	es
> ist	VAFIN	sein
> Dienstagmorgen	NN	Dienstagmorgen
>
>
>
> 2. The corpus metadata summary states only 4 'Total words in all corpus
> texts' for a corpus of actually around 1 million tokens. Why could
> this corpus word count be so wrong (stated no. of words = no. of texts)?
>
>
> best
>
>
> Benedikt Singpiel




