[CWB] problems with corpus word count

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu May 19 02:22:09 CEST 2016


Oh, one further note: having checked the code, a possible cause of this bug is that the text metadata table contains incorrect start/end points for the texts. You can check this with the following query:

select text_id, words, cqp_begin, cqp_end from text_metadata_for_PILOT;

If the three numeric columns contain zero, that explains your problem.
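For a corpus with many texts, the same check can be scripted rather than read off by eye. The following is only a rough sketch: it uses an in-memory SQLite table as a stand-in for the real MySQL table, the sample rows are invented, and the helper name find_suspect_texts is hypothetical. Only the table and column names come from the query above.

```python
import sqlite3


def find_suspect_texts(conn, table="text_metadata_for_PILOT"):
    """Return the text_ids whose word count and CQP positions are all zero."""
    # The table name is interpolated directly, so only pass trusted names.
    rows = conn.execute(
        f"SELECT text_id FROM {table} "
        "WHERE words = 0 AND cqp_begin = 0 AND cqp_end = 0 "
        "ORDER BY text_id"
    ).fetchall()
    return [row[0] for row in rows]


# Stand-in data (invented): one healthy text and two with zeroed positions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE text_metadata_for_PILOT "
    "(text_id TEXT, words INTEGER, cqp_begin INTEGER, cqp_end INTEGER)"
)
conn.executemany(
    "INSERT INTO text_metadata_for_PILOT VALUES (?, ?, ?, ?)",
    [("t1", 120, 0, 119), ("t2", 0, 0, 0), ("t3", 0, 0, 0)],
)

suspect = find_suspect_texts(conn)
print(suspect)
```

Against the real database, the same WHERE clause can simply be appended to the select statement above.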

The cause of this would be a failure to get accurate text-size information from CQP. To work out why *that* is, I'd need to see the errors from running the "Generate CWB text-position records" process, which is the first step in frequency list setup. You can re-run it on its own by going to the CQPweb web-root and then typing:

cd bin
php execute-cli.php populate_corpus_cqp_positions PILOT

and see what error message you get.

One possible source of error is that it looks like you've used an all-uppercase corpus name, i.e. "PILOT" not "pilot"... this may interact badly with the way the CWB registry works, which in turn could have caused the problem. Possibly.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hardie, Andrew
Sent: 19 May 2016 01:10
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] problems with corpus word count

Hi Benedikt,

The word-counting code has been updated recently. I am not sure, off the top of my head, what version is currently on the VM image. It looks to me like it is a version containing conflicting assumptions, with the result that the number of texts somehow gets inserted into the number-of-tokens field... I'll have to fix that. It's not something I've seen on my own server or my development machine, so I am not 100% sure how it happened.

In the meantime you can patch things manually by running the following SQL statement:

update corpus_info set size_texts = NUMBER_GOES_HERE, size_tokens = 458874 where corpus = "PILOT";
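The two numbers can also be derived from the other tables instead of being typed in by hand. The sketch below assumes that size_tokens should equal sum(freq) over the word frequency table (which matches the 458874 in the output quoted below) and that size_texts should equal the number of rows in the text metadata table; that second point is an assumption, and it only holds if the metadata table itself is intact. SQLite stands in for MySQL here, and all the sample rows are invented.

```python
import sqlite3

# In-memory stand-in for the CQPweb MySQL database (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE freq_corpus_pilot_word (item TEXT, freq INTEGER);
    CREATE TABLE text_metadata_for_PILOT (text_id TEXT, words INTEGER);
    CREATE TABLE corpus_info (
        corpus TEXT, size_types INTEGER, size_tokens INTEGER, size_texts INTEGER
    );
    INSERT INTO freq_corpus_pilot_word VALUES ('the', 5), ('cat', 2), ('sat', 1);
    INSERT INTO text_metadata_for_PILOT VALUES ('t1', 5), ('t2', 3);
    -- Mimic the bug: the number of texts has landed in size_tokens.
    INSERT INTO corpus_info VALUES ('PILOT', 3, 2, 2);
    """
)

# Recompute both sizes from the tables that hold the underlying data.
conn.execute(
    "UPDATE corpus_info SET "
    "size_tokens = (SELECT SUM(freq) FROM freq_corpus_pilot_word), "
    "size_texts = (SELECT COUNT(*) FROM text_metadata_for_PILOT) "
    "WHERE corpus = ?",
    ("PILOT",),
)

result = conn.execute(
    "SELECT size_types, size_tokens, size_texts FROM corpus_info WHERE corpus = ?",
    ("PILOT",),
).fetchone()
print(result)
```

The update statement uses only standard subqueries, so it should carry over to the mysql prompt with the placeholder replaced by "PILOT", though it has not been tested there.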

and you can fix things for future corpora by running "svn up" within the VM's web-directory for CQPweb (enable networking to do this).

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Benedikt Singpiel
Sent: 18 May 2016 11:58
To: cwb at sslmit.unibo.it
Subject: [CWB] problems with corpus word count

Hello Andrew,

Now I got it! Here are my values:


mysql> select count(*) from freq_corpus_pilot_word;
+----------+
| count(*) |
+----------+
|    37250 |
+----------+
1 row in set (0.14 sec)

mysql> select sum(freq) from freq_corpus_pilot_word;
+-----------+
| sum(freq) |
+-----------+
|    458874 |
+-----------+
1 row in set (0.07 sec)

mysql> select size_types, size_tokens, size_texts from corpus_info where corpus = "PILOT";
+------------+-------------+------------+
| size_types | size_tokens | size_texts |
+------------+-------------+------------+
|      37250 |           4 |          4 |
+------------+-------------+------------+

Frequency tables seem complete, corpus log isn't.


best

Benedikt


> ------------------------------
>
> Message: 3
> Date: Tue, 17 May 2016 12:16:11 +0000
> From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
> To: Open source development of the Corpus WorkBench
> 	<cwb at sslmit.unibo.it>
> Subject: Re: [CWB] problems with lemma query and corpus word count
> Message-ID:
> 	<28078EC3FBF1B940A3EF3D0D19BE351D7FB3DA97 at EX-0-MB1.lancs.local>
> Content-Type: text/plain; charset="utf-8"
>
> You have to select the database first.
>
> http://dev.mysql.com/doc/refman/5.7/en/use.html
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it  
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Benedikt Singpiel
> Sent: 17 May 2016 13:15
> To: cwb at sslmit.unibo.it
> Subject: [CWB] problems with lemma query and corpus word count



_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb