[CWB] problems with lemma query and corpus word count

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue May 10 12:47:18 CEST 2016


Without more info on what is going on in your database, I can't tell what the problem is.

Try the following queries in the mysql client:

  select count(*) from freq_corpus_YOURCORPUS_word;
  select sum(freq) from freq_corpus_YOURCORPUS_word;

That will tell you whether or not the frequency table in the database is complete (type/token counts respectively).

Then try

  select size_types, size_tokens, size_texts from corpus_info where corpus = "YOURCORPUS";

That will tell you whether the info in the corpus log is accurate. 
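Once you have the numbers from those queries, the comparison can be sketched as a small Python helper. This is only an illustration: the function name diagnose and its parameters are hypothetical (they are not part of CQPweb), and it assumes that the type and token counts from the frequency table ought to match size_types and size_tokens in corpus_info.

```python
def diagnose(freq_types, freq_tokens, info_types, info_tokens):
    """Compare counts from the frequency table with those in corpus_info.

    freq_types  -- result of: select count(*) from freq_corpus_YOURCORPUS_word
    freq_tokens -- result of: select sum(freq) from freq_corpus_YOURCORPUS_word
    info_types  -- size_types from corpus_info
    info_tokens -- size_tokens from corpus_info

    Returns a list of human-readable mismatch descriptions (empty if the
    two sources agree). Assumption: the two sources should be identical.
    """
    problems = []
    if freq_types != info_types:
        problems.append(
            f"types differ: frequency table {freq_types}, corpus_info {info_types}"
        )
    if freq_tokens != info_tokens:
        problems.append(
            f"tokens differ: frequency table {freq_tokens}, corpus_info {info_tokens}"
        )
    return problems


if __name__ == "__main__":
    # Made-up numbers for illustration only; paste in your own query results.
    for line in diagnose(30000, 500000, 30000, 4):
        print(line)
```

If the list comes back empty, both places in the database agree; if only the corpus_info values are off, that would point at the stored corpus metadata rather than at the frequency table.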

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Benedikt Singpiel
Sent: 10 May 2016 10:17
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] problems with lemma query and corpus word count

Hello Andrew,

thanks again for your quick help.

1. My lemma problem is solved; just as you gathered, it was due to  
wrong linebreak settings.


2. The problem of the word count error remains: I did the 'any word'  
search you suggested and it returned half a million tokens. That  
should be about right, indexing was successful. I then recreated all  
the frequency/wordcount lists in the 'manage frequency lists' section  
- no changes in the output. Any further suggestions on the issue?
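
For anyone hitting the same linebreak problem as in point 1: a minimal sketch of one way to convert CR-LF linebreaks to LF before indexing, in Python. The function names and the file names in the usage comment are hypothetical, and existing tools such as dos2unix do the same job.

```python
def normalise_linebreaks(data: bytes) -> bytes:
    """Replace Windows-style CR-LF linebreaks with Unix-style LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Read src as raw bytes, normalise linebreaks, and write the result to dst."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(normalise_linebreaks(data))


# Example call (file names are placeholders):
# convert_file("corpus_crlf.vrt", "corpus_lf.vrt")
```

Working on bytes rather than decoded text means the character encoding of the file is left untouched; only the linebreak bytes change.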


best


Benedikt


Quoting cwb-request at sslmit.unibo.it:


> Message: 3
> Date: Mon, 9 May 2016 15:30:23 +0000
> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
> To: Open source development of the Corpus WorkBench
> 	<cwb at sslmit.unibo.it>
> Subject: Re: [CWB] problems with lemma query and corpus word count
> Message-ID:
> 	<28078EC3FBF1B940A3EF3D0D19BE351D7FB3536B at EX-0-MB1.lancs.local>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Benedikt,
>
> 1. This is almost certainly due to your input files having CR-LF  
> linebreaks (Windows style), whereas CWB on Unix expects LF linebreaks.
>
> 2. This can have a number of causes. Try doing a query for [] (using  
> CQP syntax: this means "any word"), and see how many results are  
> returned. If only 4, then the problem is at the indexing stage: only  
> 4 words of your text have actually been indexed. If it returns a  
> number equal to the N of tokens in the corpus, then the index is  
> fine, but there is a database error: try re-running frequency list  
> setup and see if it is fixed.
>
> Hope this helps,
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it  
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Benedikt Singpiel
> Sent: 09 May 2016 15:23
> To: cwb at sslmit.unibo.it
> Subject: [CWB] problems with lemma query and corpus word count
>
> Hello everyone,
>
> I have got two minor problems bugging me, using CQPweb 3.2.11 (maybe
> the answers to them are a bit more obvious to you than to me...):
>
>
> 1. When searching for lemma annotations {example}, I don't get any
> results back. Only if I enter {example?} hits for the lemma 'example'
> will show. What is the problem with my lemma column here (something
> wrong with the line breaks in my indexed text file)?
>
> my text file schema (regular treetagger format):
> Vorwort	NN	?Vorwort
> Es	PPER	es
> ist	VAFIN	sein
> Dienstagmorgen	NN	Dienstagmorgen
>
>
>
> 2. The corpus metadata summary states only 4 'Total words in all corpus
> texts' in a corpus of actually something around 1 million tokens. Why could
> this corpus word count be so wrong (stated no. of words = no. of texts)?
>
>
> best
>
>
> Benedikt Singpiel



_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

