[CWB] [ cwb-Bugs-2906451 ] CQPweb: compilation of text-frequency-index CWB corpora

SourceForge.net noreply at sourceforge.net
Tue Jun 1 02:13:10 CEST 2010


Bugs item #2906451, was opened at 2009-12-01 02:12
Message generated for change (Comment added) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=2906451&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQPweb
Group: None
>Status: Closed
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: CQPweb: compilation of text-frequency-index CWB corpora

Initial Comment:
This process seems very prone to (a) running out of PHP memory, (b) timing out, or (c) filling up the hard disk and then falling over.

A full, proper investigation is needed.

Stefan suggests two improvements:

First, you should use the -M switch for cwb-makeall so that it doesn't try to do the entire indexing in memory.
Second, it would be even better to use the cwb-make script from the CWB/Perl interface (or the corresponding Perl module directly), which minimises disk usage by compressing data files as early as possible.
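
As a rough illustration of the first suggestion, the sketch below builds a cwb-makeall command line with a memory cap. It is written in Python purely for illustration (CQPweb itself is PHP); the helper names, the 512 MB figure and the corpus name are assumptions, not values taken from this report.

```python
import subprocess


def makeall_command(corpus, memory_mb=512):
    """Build the argument list for cwb-makeall with a memory cap.

    -M takes a limit in megabytes; 512 is an arbitrary illustrative value.
    """
    # Corpus IDs are conventionally given in upper case on the CWB command line.
    return ["cwb-makeall", "-M", str(memory_mb), corpus.upper()]


def run_makeall(corpus, memory_mb=512):
    """Run cwb-makeall and raise CalledProcessError if it exits non-zero."""
    subprocess.run(makeall_command(corpus, memory_mb), check=True)
```

Calling run_makeall("mycorpus") would then execute "cwb-makeall -M 512 MYCORPUS", provided the CWB tools are on the PATH.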

----------------------------------------------------------------------

>Comment By: Andrew Hardie (andrewhardie)
Date: 2010-06-01 00:13

Message:
I am closing this, because the following changes have been made in bits and
pieces:

(1) we have the command-line script for making frequency tables, and it
prints plenty of progress messages. It's been tested quite a bit by now.

(2) there is now a warning in the setup manual that creating freq tables
for big corpora may need lots of temporary disk space for MySQL (not much
we can do about this other than warn people!)

(3) cwb-makeall uses the -M switch - this is incidental to the
frequency-table problem, but helps the corpus setup procedure overall.

----------------------------------------------------------------------

Comment By: Andrew Hardie (andrewhardie)
Date: 2009-12-14 06:53

Message:
The command-line version, with progress messages (among other things, every
MySQL query is printed to the command line), is now done but not tested.

----------------------------------------------------------------------

Comment By: Andrew Hardie (andrewhardie)
Date: 2009-12-05 16:05

Message:
The current state of affairs is that the functions print nothing out,
either as they go or when they finish. I will amend this so that they print
progress (either as HTML with a backlink for the online interface, or as
plaintext to STDOUT for the offline scripts).

Each is a single function with parameters passed to it via HTTP get and
execute.php. Making this work from the commandline should be quick.
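
The two output modes described above might look roughly like the following. This is a Python sketch for illustration only (the real functions are PHP); format_progress, report and the "Back" link text are hypothetical names, not CQPweb's own.

```python
import html
import sys


def format_progress(message, as_html=False, back_url=None):
    """Render one progress message as plain text or as an HTML fragment."""
    if not as_html:
        return message
    fragment = "<p>" + html.escape(message) + "</p>"
    if back_url is not None:
        # Backlink so a user of the online interface can return to the previous page.
        fragment += '<p><a href="' + html.escape(back_url) + '">Back</a></p>'
    return fragment


def report(message, as_html=False, back_url=None, stream=None):
    """Write a progress message immediately, so a long job shows signs of life."""
    out = sys.stdout if stream is None else stream
    out.write(format_progress(message, as_html, back_url) + "\n")
    out.flush()
```

Keeping the formatting in one function means the same indexing code can serve both the web interface and an offline script, with only the as_html flag differing.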

----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2009-12-01 11:10

Message:
In my tests with a 500-million-word corpus, the MySQL indexing of frequency
tables ("Create frequency tables") turned out to be much worse.  It quickly
ate up some 5 GB of disk space, and then kept running for hours without any
tangible result until Firefox and/or Apache collapsed.  As of now, MySQL
still seems to be busy building the table and index (perhaps this is a side
effect of the LOAD DATA LOCAL INFILE, which first has to store all data in the
server and then creates table-cum-index in a single go?), but it will
probably never get past the word attribute (for all I can tell, the PHP
script has long since aborted).

I think the only reasonable way to use large corpora (>> 100M words) with
CQPweb is to have a command-line version of the indexing script, which can
be run in the background without any time constraints.  This may not be all
that hard: the only input required is the name of the corpus and the action
to be performed, i.e. two CGI parameters.  Such a command-line version
could also print a few reassuring progress messages while it's working.
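
A command-line wrapper of the kind proposed here could be sketched as below. It is Python for illustration only; the action name "freq-tables", the function names and the message format are assumptions, and the actual indexing work is left as a placeholder comment.

```python
import argparse
import time


def build_parser():
    """Two positional arguments, mirroring the two CGI parameters."""
    parser = argparse.ArgumentParser(
        description="Run a corpus indexing action outside the web server."
    )
    parser.add_argument("corpus", help="name of the corpus")
    parser.add_argument("action", choices=["freq-tables"],
                        help="action to perform (hypothetical name)")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    started = time.time()
    print(f"[{args.corpus}] starting {args.action} ...", flush=True)
    # Placeholder: the real indexing steps would run here, each one
    # printing its own progress line before and after it executes.
    elapsed = time.time() - started
    print(f"[{args.corpus}] {args.action} finished in {elapsed:.1f}s", flush=True)
    return 0
```

Calling main(["mycorpus", "freq-tables"]) runs the job in the foreground; started from a shell with nohup or inside a screen session, the same entry point would be free of the browser and Apache time limits.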

----------------------------------------------------------------------
