[CWB] CQPweb: offline-freqlists.php text-by-text frequency indexing fails

Noah Bubenhofer lists at bubenhofer.com
Wed May 27 14:56:44 CEST 2015


Hi,

I have encountered the following problem with CQPweb: my corpus files
have not only word, pos and lemma as positional attributes, but also a
word ID as a fourth attribute. Calculating the text-by-text frequencies
with the offline-freqlists.php script then fails if the corpus is larger
than a certain size (about 200,000 tokens). If I remove the column with
the word IDs, everything works fine.

I investigated the problem a bit, using CQPwebInABox with version 3.1.13
of CQPweb (and also the most recent version). The problem appears in the
file freqtable-cwb.inc.php, in the foreach loop at line 181: after a
certain number of lines have been written (or rather, piped to
cwb-encode), the execution of the loop just stops, without any error.
The PHP process is still active, as is cwb-encode; the script does not
die, but nothing more happens, and the iteration through the array $F
is left unfinished. It seems that the pipe to cwb-encode, written to via
fputs(), is broken. Also, the loop does not always stop at precisely the
same point. I guess there is a buffer which is filled before its content
is written to the pipe.
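
One possible reading of this symptom, and not a confirmed diagnosis: a
pipe has a limited kernel buffer, and a blocking write such as fputs()
simply waits, without any error, when that buffer is full and the other
end is not reading. Why cwb-encode would stop reading is not known here.
A minimal Python sketch (POSIX only, not CQPweb code; pipe_capacity is a
made-up name) that measures how much fits into a pipe nobody reads from:

```python
import os


def pipe_capacity() -> int:
    """Count how many bytes fit into a pipe that nobody reads from."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode: a full pipe raises BlockingIOError instead of hanging.
    os.set_blocking(write_fd, False)
    written = 0
    chunk = b"x" * 1024
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError:
        # The kernel buffer is full. A blocking writer would stall right here,
        # silently, until the reader drains the pipe.
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print(f"pipe filled after {pipe_capacity()} bytes")
```

The exact capacity depends on the operating system, which would fit the
observation that the stall only shows up above a certain corpus size.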

I checked whether the array $F that is looped through is incomplete or
otherwise odd, but that is not the case. Of course, as each token line
of the corpus is unique (because of the word ID each token has), the
array is rather pointless: its length equals the number of tokens in
the text, and it does not contain a list of "types" with their counts,
but instead a list of all tokens, each with the value 1. I guess that's
not the idea...
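
To illustrate the effect with made-up data (a Python sketch, not the
CQPweb code; the three token lines and the "w1" style IDs are invented):
counting whole lines that include a unique ID gives one entry per token,
while dropping the ID column gives real type counts.

```python
from collections import Counter

# Hypothetical token lines: word, pos, lemma, word ID (tab-separated).
lines = [
    "the\tDT\tthe\tw1",
    "cat\tNN\tcat\tw2",
    "the\tDT\tthe\tw3",
]

# Counting full lines: every line is unique because of the ID.
with_id = Counter(lines)

# Counting without the last column: identical tokens collapse into one type.
without_id = Counter(line.rsplit("\t", 1)[0] for line in lines)

print(len(with_id), set(with_id.values()))  # one entry per token, all counts 1
print(without_id["the\tDT\tthe"])           # "the" is now counted twice
```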

Still, it is strange that the pipe to cwb-encode breaks.

Perhaps the best solution would be to let the user tell the
offline-frequency script which positional attributes make up the word
string that should be treated as the unit for counting types? Of course,
I can also provide the corpus files if someone wants to reproduce the
problem.
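
A rough sketch of what such an option could look like, in Python rather
than PHP and purely as an illustration: count_types and its parameters
are hypothetical names, not part of CQPweb or CWB. The caller names the
attributes in column order and says which of them form the counting key.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_types(
    lines: Iterable[str],
    attributes: Sequence[str],
    key: Sequence[str],
) -> Counter:
    """Count types using only the positional attributes named in `key`.

    `attributes` lists the column names in file order; `key` is the subset
    that defines what counts as the same type.
    """
    positions = [attributes.index(name) for name in key]
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        counts[tuple(fields[i] for i in positions)] += 1
    return counts


if __name__ == "__main__":
    corpus = ["the\tDT\tthe\tw1", "cat\tNN\tcat\tw2", "the\tDT\tthe\tw3"]
    table = count_types(
        corpus,
        attributes=["word", "pos", "lemma", "id"],
        key=["word", "pos", "lemma"],
    )
    for type_key, freq in table.most_common():
        print("\t".join(type_key), freq)
```

With the ID left out of the key, the table stays as small as the number
of types, however many tokens the text has.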

Best,
Noah

PS: Despite these minor problems, CQPweb is wonderful. Thanks, Andrew!

