[CWB] [ cwb-Bugs-2906451 ] CQPweb: compilation of text-frequency-index CWB corpora

SourceForge.net noreply at sourceforge.net
Mon Dec 14 07:53:40 CET 2009


Bugs item #2906451, was opened at 2009-12-01 02:12
Message generated for change (Comment added) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=2906451&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQPweb
Group: None
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: CQPweb: compilation of text-frequency-index CWB corpora

Initial Comment:
This process seems very prone to (a) running out of PHP memory, (b) timing out, or (c) filling up the hard disk and then falling over.

A full, proper investigation is needed.

Stefan suggests two improvements:

First, you should use the -M switch for cwb-makeall so that it doesn't try to do the entire indexing in memory.
Second, it would be even better to use the cwb-make script from the CWB/Perl interface (or the corresponding Perl module directly), which minimises disk usage by compressing data files as early as possible.
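
As a rough illustration of the first suggestion, the sketch below builds a cwb-makeall command line with a memory cap. The helper name makeall_command, the registry path, the corpus name and the 500 MB limit are all assumptions made for the example; only the -M switch itself comes from the suggestion above.

```python
import shlex


def makeall_command(corpus, registry, memory_mb=500):
    """Build an argv list for cwb-makeall with a memory cap (hypothetical helper)."""
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    return [
        "cwb-makeall",
        "-r", registry,         # registry directory (assumed path)
        "-M", str(memory_mb),   # limit the RAM used for indexing, in MB
        "-V",                   # validate the indexes after building them
        corpus.upper(),         # CWB corpus IDs are written in upper case
    ]


# Example values only; nothing is executed here, the command is just printed.
cmd = makeall_command("mycorpus", "/corpora/registry", memory_mb=500)
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a machine where CWB is installed; a suitable memory limit depends on the server.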

----------------------------------------------------------------------

>Comment By: Andrew Hardie (andrewhardie)
Date: 2009-12-14 06:53

Message:
The command-line version, with progress messages (among other things, every
MySQL query is printed to the command line), is now done but not tested.

----------------------------------------------------------------------

Comment By: Andrew Hardie (andrewhardie)
Date: 2009-12-05 16:05

Message:
The current state of affairs is that the functions print nothing out,
either as they go or when they finish. I will amend this so that they print
progress (either as HTML with a backlink for the online interface, or as
plaintext to STDOUT for the offline scripts).

Each is a single function with parameters passed to it via HTTP GET and
execute.php. Making this work from the command line should be quick.
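
The two output modes described above might be sketched as follows. This is not the CQPweb code (which is PHP); the names format_progress and report, the mode strings and the back-link text are assumptions chosen for illustration.

```python
import html
import sys


def format_progress(message, mode, back_url=None):
    """Render one progress message as HTML or as plain text (hypothetical helper)."""
    if mode == "html":
        line = "<p>" + html.escape(message) + "</p>"
        if back_url:
            # Backlink for the online interface; the URL is supplied by the caller.
            href = html.escape(back_url, quote=True)
            line += '\n<p><a href="' + href + '">Back</a></p>'
        return line
    if mode == "text":
        return message
    raise ValueError("unknown mode: " + repr(mode))


def report(message, mode="text", back_url=None, stream=sys.stdout):
    """Write a progress message immediately so long-running jobs show signs of life."""
    stream.write(format_progress(message, mode, back_url) + "\n")
    stream.flush()


report("Creating frequency table for attribute 'word'")
report("Done", mode="html", back_url="index.php")
```

Keeping the formatting separate from the indexing function means the same function can serve both the web interface and an offline script.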

----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2009-12-01 11:10

Message:
In my tests with a 500-million-word corpus, the MySQL indexing of frequency
tables ("Create frequency tables") turned out to be much worse.  It quickly
ate up some 5 GB of disk space, and then kept running for hours without any
tangible result until Firefox and/or Apache collapsed.  As of now, MySQL
still seems to be busy building the table and index (perhaps this is a side
effect of the LOAD DATA LOCAL INFILE, which first has to store all data in the
server and then creates table-cum-index in a single go?), but it will
probably never get past the word attribute (for all I can tell, the PHP
script has long since aborted).

I think the only reasonable way to use large corpora (>> 100M words) with
CQPweb is to have a command-line version of the indexing script, which can
be run in the background without any time constraints.  This may not be all
that hard: the only input required is the name of the corpus and the action
to be performed, i.e. two CGI parameters.  Such a command-line version
could also print a few reassuring progress messages while it's working.
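
A minimal sketch of such a command-line entry point is given below, assuming exactly the two inputs mentioned above (corpus name and action). The action names, the step lists and the function names are hypothetical placeholders, and the real indexing work is left out.

```python
import argparse
import sys
import time

# Hypothetical actions and their steps; the real names live in the CQPweb code.
ACTIONS = {
    "freq-tables": ["export frequency lists", "load into MySQL", "build indexes"],
    "text-index": ["scan text boundaries", "write index"],
}


def index_corpus(corpus, action, out=sys.stdout):
    """Run every step of an action, printing a progress line before each one."""
    steps = ACTIONS[action]
    start = time.monotonic()
    for number, step in enumerate(steps, 1):
        print(f"[{corpus}] {action}: step {number}/{len(steps)}: {step}",
              file=out, flush=True)
        # The actual work for this step would be done here.
    elapsed = time.monotonic() - start
    print(f"[{corpus}] {action}: finished in {elapsed:.1f}s", file=out, flush=True)
    return len(steps)


def parse_args(argv):
    """Parse the two required parameters: corpus name and action."""
    parser = argparse.ArgumentParser(description="Offline corpus indexing (sketch)")
    parser.add_argument("corpus", help="name of the corpus")
    parser.add_argument("action", choices=sorted(ACTIONS), help="action to perform")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    index_corpus(args.corpus, args.action)
    return 0


# Demonstration call with example arguments instead of the real command line.
main(["mycorpus", "freq-tables"])
```

Saved as a script that calls main() with the real arguments, something like this could be started under nohup or in a screen session, so that it runs without the time limits of the web server.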

----------------------------------------------------------------------
