[CWB] CQPweb: offline-freqlists.php text-by-text frequency indexing fails

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Jun 8 04:27:52 CEST 2015


A followup on this:

With Stefan's help, I think this bug is squashed. If you are experiencing this bug, please update to v3.1.16 (which will be available in my next commit to the repo) and let me know if it *doesn't* go away.

Thanks

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hardie, Andrew
Sent: 31 May 2015 16:12
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] CQPweb: offline-freqlists.php text-by-text frequency indexing fails

Hi Noah,

Apologies for the delayed reply on this one.

I have in fact recently encountered exactly the same problem, only on a rather different corpus (Malaysian Wikipedia, word-tokens only, no tagging). So far I have not worked out what the cause is. I might get back to you with some questions once I get to grips with the bug on my own data.

I think that I will probably have to rewrite that whole chunk of the code to get better error reporting. Very likely either cwb-encode or cwb-decode (or both) is emitting some kind of alert that may make sense of the problem, but it is not being caught at the upper level of the code. 
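
As a rough illustration of the kind of error reporting meant here, the following Python sketch pipes lines to a child process and collects whatever the child writes to stderr. It is not CQPweb code: the child is a hypothetical stand-in script rather than cwb-encode, and the function name feed_child and the "extra column" warning are made up for the example.

```python
import subprocess
import sys

# Hypothetical stand-in child (NOT cwb-encode): it reads lines from stdin,
# counts them, and warns on stderr about lines with more than 3 columns.
CHILD = r"""
import sys
n = 0
for line in sys.stdin:
    n += 1
    if line.count("\t") > 2:
        sys.stderr.write("warning: extra column on line %d\n" % n)
print(n)
"""

def feed_child(lines):
    """Pipe lines to the child; return (exit code, stdout text, stderr text)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    # communicate() writes the input and drains both output pipes at the
    # same time, so a chatty child cannot fill a pipe buffer and stall us.
    out, err = proc.communicate("".join(lines))
    return proc.returncode, out, err

code, out, err = feed_child(["the\tDT\tthe\n", "cat\tNN\tcat\tw2\n"])
if err:
    print("child reported:", err.strip())
```

The point of the sketch is only that the child's stderr is read by the parent instead of being discarded, so any alert it emits becomes visible at the upper level.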

If you come across any more possibly-useful info on this bug, I'd be very interested to hear it!

All that said, including a unique token identifier as a p-attribute is going to cause you problems no matter what, I'm afraid. The whole point of CWB's data storage is that the unique token identifiers are the sequence positions, which are implicit, rather than having an explicitly-stored ID as in an RDBMS. 
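
A toy Python sketch of that idea follows. It is only an illustration of "position as implicit ID" and makes no claim about CWB's actual on-disk format; the names lexicon and stream are invented for the example.

```python
tokens = ["the", "cat", "saw", "the", "dog"]

# Map each distinct value to a small integer the first time it is seen.
lexicon = {}
# The corpus itself is then just a sequence of those integers.
stream = [lexicon.setdefault(tok, len(lexicon)) for tok in tokens]

# The token ID is simply the index into the stream; nothing extra is stored.
for position, type_id in enumerate(stream):
    print(position, type_id)

# With an explicit unique ID per token, every value is distinct, so the
# lexicon grows to the size of the corpus instead of staying compact.
ids = ["w%d" % i for i in range(len(tokens))]
id_lexicon = {}
for value in ids:
    id_lexicon.setdefault(value, len(id_lexicon))
print(len(lexicon), len(id_lexicon))
```

In this toy case the word attribute needs 4 lexicon entries for 5 tokens, while the explicit ID attribute needs 5, one per token, and that gap widens with corpus size.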

best

Andrew.



-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Noah Bubenhofer
Sent: 27 May 2015 13:57
To: cwb at sslmit.unibo.it
Subject: [CWB] CQPweb: offline-freqlists.php text-by-text frequency indexing fails

Hi,

I have encountered the following problem with CQPweb: I have corpus
files with not only word, pos and lemma as positional attributes, but
also a word ID as the fourth attribute. The calculation of the
text-by-text frequencies using the offline-freqlists.php script then
fails if the corpus is larger than a certain size (about 200,000
tokens). If I remove the column with the word IDs, everything works fine.

I looked into the problem a little, using CQPwebInABox with version 3.1.13
of CQPweb (and also the most recent version). The problem appears in the
file freqtable-cwb.inc.php, in the foreach loop at line 181: after a
certain number of lines have been written (or rather, piped to
cwb-encode), the loop simply stops executing, without any error. The PHP
process and cwb-encode are both still active and the script does not
die, but nothing more happens, and the iteration through the array $F
is left unfinished. It seems that the pipe to cwb-encode via fputs() is
broken. The loop also does not always stop at precisely the same
point, so I guess there is a buffer which is filled before its content
is written to the pipe.

I checked whether the array $F, which is looped through, is incomplete or
otherwise strange, but that is not the case. Of course, as each token line
of the corpus is unique (because of the word ID each token has), the
array is rather pointless: its length equals the number of tokens in the
text, and it does not contain a list of "types" with their counts, but
instead a list of all tokens, each with the value 1. I guess that's not
the idea...

Still, it is strange that the pipe to cwb-encode breaks.

Perhaps the best solution would be to let the user tell the
offline-frequency script which positional attributes make up the
string that is treated as the unit when counting types? Of course I
can also provide the corpus files if someone wants to reproduce
the problem.
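
To illustrate both points (the degenerate counts caused by the word ID, and the idea of choosing which attributes are counted), here is a small Python sketch. It is not how offline-freqlists.php works; the function count_types and its columns parameter are hypothetical, and the three rows are invented sample data.

```python
from collections import Counter

# Invented sample rows: word, pos, lemma, word ID.
rows = [
    ("the", "DT", "the", "w1"),
    ("cat", "NN", "cat", "w2"),
    ("the", "DT", "the", "w3"),
]

def count_types(rows, columns):
    """Count rows, treating only the selected columns as the type string."""
    return Counter(tuple(row[i] for i in columns) for row in rows)

# All four attributes: every row is unique, so every count is 1.
with_id = count_types(rows, (0, 1, 2, 3))

# Only word, pos and lemma: identical tokens collapse into one type.
without_id = count_types(rows, (0, 1, 2))

print(len(with_id), max(with_id.values()))
print(len(without_id), without_id[("the", "DT", "the")])
```

With the ID column included, the table has as many entries as there are tokens; with it excluded, "the" is counted twice as a single type.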

Best,
Noah

PS: Despite the minor problems: CQPweb is wonderful, thanks Andrew!
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

