[CWB] Unable to index a corpus

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Jul 26 14:51:33 CEST 2017


Hi Jorge,

As suspected, seeing a few of the error messages makes it clear what is happening.

“is not a valid feature set” is the error message. You are getting one per POS tag in the corpus (since your POS tags aren’t feature sets), and that many error messages is what exhausts the available RAM.

The answer is simple: don’t tick the “Feature set?” box when your tags aren’t actually feature sets.

To read about what feature sets are / aren’t in CWB/CQPweb, have a look at the CWB Encoding Tutorial (search for the word “feature” in the PDF to find the relevant section.)
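If you want to check the input file yourself before uploading, here is a rough Python sketch. It assumes, as I understand the tutorial, that a feature-set value begins and ends with “|” (for example |sg|fem|, with a bare | for the empty set). It also assumes a tab-separated vertical file with the POS tag in the second column. The function names and the sample lines are made up for illustration (only the three tags come from your message), and this is not how cwb-encode itself does the check.

```python
def is_feature_set(value):
    """Return True if value looks like feature-set notation, e.g. "|" or "|a|b|".

    Simplified assumption: the only test is that the value begins and
    ends with a bar. cwb-encode may apply further rules.
    """
    return value.startswith("|") and value.endswith("|")


def non_feature_set_values(lines, column=1, limit=10):
    """Collect up to `limit` distinct values from one tab-separated column
    that are not written in feature-set notation."""
    bad = []
    for line in lines:
        line = line.rstrip("\n")
        # Simplification: skip blank lines and anything that looks like an
        # XML tag. A token that itself starts with "<" is skipped as well.
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if column >= len(fields):
            continue
        value = fields[column]
        if not is_feature_set(value) and value not in bad:
            bad.append(value)
            if len(bad) >= limit:
                break
    return bad


# Invented sample lines: word, POS tag, lemma.
sample = [
    "<s>",
    "a\tP\ta",
    "casa\tN5-FS\tcasa",
    "roja\tJQ--FS\trojo",
    "x\t|sg|fem|\tx",
    "</s>",
]
print(non_feature_set_values(sample))
```

If this reports ordinary tags such as N5-FS, the column is not a feature set and the box should stay unticked.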

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of VIVALDI PALATRESI, JORGE
Sent: 26 July 2017 12:23
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Unable to index a corpus

Hi Andrew,
I made the suggested modifications in the file lib/admin-install.inc.php without any positive result: the browser always goes blank, with similar messages in the file error.log.
Then I reduced the size of the corpus to 20% of the original. Some messages then appeared in the browser, but after a while the browser crashed. Anyway, I captured the attached screenshot.
It seems to be a problem related to cwb-encode and the POS tags (N5-FS, JQ--FS, P, ...).
As I mentioned in a previous message, the corpus (and its POS tags) is the same one used with a previous version of CQP and CQPweb.
Bests,
Jorge

2017-07-26 10:22 GMT+02:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:
As I noted before, the problem is actually error messages. Line 644 simply collects error messages – so an out-of-memory error here indicates you have generated > 4GB of error messages.

I suggested increasing the memory previously because it would let you see the problem – but actually, with 4GB of error messages, I’d suggest that doing that is not likely to help much.

So what I would suggest instead is hacking the code to find out the error message.

Open admin-lib.inc.php
Go to line 644
Find the line nearby that says $output_lines_from_cwb = array($encode_command);

AFTER THAT LINE, but before the line that says exec($encode_command, $output_lines_from_cwb, $exit_status_from_cwb); add the following:

if (count($output_lines_from_cwb) > 1000) {show_var($output_lines_from_cwb); exiterror("abort"); }

What this line does is make things abort if it detects too many error messages.

If you then get a readable error message, that might give you a hint what the real problem is. If not, try again, moving the location of the hack line down the file, before the following lines:

before exec($makeall_command, $output_lines_from_cwb, $exit_status_from_cwb);
before exec($compress_command, $compression_output, $exit_status_from_cwb);
before the second instance of exec($makeall_command, $output_lines_from_cwb, $exit_status_from_cwb);
before } /* end else (from if cwb index already exists) */

Hopefully, as I say, doing this will get you a glimpse of the first 1,000 lines of error messages, which may tell you what the underlying problem is.
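The same idea can be tried outside CQPweb: run a command, keep only the first N lines of its output, and stop it as soon as that limit is reached. Below is a minimal Python sketch of that. The function name is made up, and the demo runs a stand-in child process that just prints warnings. The real command would be whatever CQPweb builds in $encode_command, which is not shown in this thread.

```python
import subprocess
import sys


def first_output_lines(command, limit=1000):
    """Run `command` and return at most `limit` lines of its combined
    stdout/stderr, killing the process once the limit is reached."""
    lines = []
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # error messages go to the same stream
        text=True,
    )
    try:
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            if len(lines) >= limit:
                proc.kill()  # stop before the output can pile up in memory
                break
    finally:
        proc.stdout.close()
        proc.wait()
    return lines


# Stand-in for a command that floods its output with messages.
noisy = [sys.executable, "-c", "for i in range(5000): print('warning', i)"]
print(first_output_lines(noisy, limit=3))
```

Because the output is read line by line and the process is stopped at the limit, memory use stays small however many messages the command would otherwise produce.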

Hope this helps

best

Andrew.

--
Jorge Vivaldi Palatresi
Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
C/ Roc Boronat, 138
08018 Barcelona
Espanya

+34 93 542 2332