[CWB] Unable to index a corpus

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Aug 2 02:54:02 CEST 2017


Hi Jorge,

(sorry for delayed reply)

There is no really super-satisfactory way to deal with this in CQPweb. CWB really doesn’t like self-nestings of the same XML element. S-attributes are designed to represent disjoint, non-overlapping regions; thus, pseudo-XML rather than actual XML. (Of course the design of CWB long predates XML, and at the time even SGML was in its early days, I believe.) This is one of the things that the new Ziggurat engine will fix when Stefan and I finally get to it, incidentally.

At present you have a choice of 3 bodges available in command-line cwb-encode, chosen by what you append to the s-attribute name in its -S declaration: (a) with :N, to automatically rename nested elements so you get tag1, tag2, tag3 as your attributes; (b) with no suffix, to treat every new <tag> as the beginning of a new non-nested region even if the previous one is unclosed; (c) with :0, to totally ignore nested regions.
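
To make the difference between the three concrete, here is a toy Python model of what each one does to a self-nested element. It is only a sketch of the behaviour as described above, not cwb-encode's actual code; the function name flatten_regions, the mode names and the event format are all made up for illustration.

```python
def flatten_regions(events, mode, tag="np"):
    """Toy model of three ways to flatten self-nested <tag> regions.

    events: sequence of "<tag>", "</tag>" or token strings (hypothetical format).
    mode:   "rename" for (a), "restart" for (b), "ignore" for (c).
    Returns {attribute name: [(first token, last token), ...]}.
    Regions still open at the end of the input are not closed in this sketch.
    """
    if mode not in ("rename", "restart", "ignore"):
        raise ValueError(f"unknown mode: {mode}")
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    regions = {}
    stack = []  # start positions of the currently open regions
    pos = 0     # corpus position of the next token

    def emit(name, start):
        if pos > start:  # skip regions that contain no tokens
            regions.setdefault(name, []).append((start, pos - 1))

    for ev in events:
        if ev == open_tag:
            if mode == "restart" and stack:
                emit(tag, stack.pop())  # (b): a new open tag ends the old region
            stack.append(pos)
        elif ev == close_tag:
            if not stack:
                continue  # close tag without matching open tag: ignored
            start = stack.pop()
            depth = len(stack)  # 0 means outermost
            if depth == 0:
                emit(tag, start)
            elif mode == "rename":
                emit(f"{tag}{depth}", start)  # (a): np1, np2, ...
            # mode "ignore": the nested region is simply dropped, as in (c)
        else:
            pos += 1  # ordinary token
    return regions


events = ["<np>", "a", "<np>", "b", "</np>", "c", "</np>"]
for mode in ("rename", "restart", "ignore"):
    print(mode, flatten_regions(events, mode))
```

Note that in "restart" mode the final </np> has nothing left to close, which looks like the situation behind the "Close tag without matching open tag ignored" warnings in your log.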

Since these are all bodges, and the situation is going to change in the reasonably near future, I did not make any of these methods transparent in the CQPweb interface. If you want to use them, create the corpus manually with cwb-encode, then insert it as an already-indexed corpus in CQPweb.
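
If you do go the manual route with option (a), you will need to know how deeply each element nests inside itself. A rough Python sketch for checking that on a vertical-format input file follows. It assumes one XML tag or one token per line and does only naive tag matching; max_self_nesting is a hypothetical helper, not part of CWB.

```python
import re
from collections import Counter

# One tag per line: optional "/", element name, optional attributes, optional "/".
TAG = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>\s*$")


def max_self_nesting(lines):
    """Return {element: deepest level at which it is nested inside itself}.

    0 means the element never occurs inside another instance of itself.
    """
    open_now = Counter()  # currently open instances per element
    deepest = Counter()   # largest number open at the same time
    for line in lines:
        m = TAG.match(line)
        if not m:
            continue  # token line, not a tag
        closing, name, self_closing = m.groups()
        if self_closing:
            continue  # <ptr/> style tags open and close at once
        if closing:
            if open_now[name] > 0:
                open_now[name] -= 1  # unmatched close tags are skipped
        else:
            open_now[name] += 1
            deepest[name] = max(deepest[name], open_now[name])
    return {name: count - 1 for name, count in deepest.items()}


sample = ['<text id="t1">', "<hi>", "a", "<hi>", "b", "</hi>", "</hi>", "</text>"]
print(max_self_nesting(sample))
```

To run it over a real file, pass an open file handle instead of the sample list, e.g. open("ca_allDocsReduced.cqp", encoding="utf8").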

best

Andrew.







From: cwb-bounces at sslmit.unibo.it On Behalf Of VIVALDI PALATRESI, JORGE
Sent: 27 July 2017 08:53
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Unable to index a corpus

Hi Andrew,
As you suggested, with the "Feature set" box unticked the process continues, but now the browser shows a new error:

-->$<--

array(7137) {

  [0]=>

  string(442) "/usr/local/cwb-3.4.12/bin/cwb-encode -xsB -c utf8 -d /var/local/CQPweb/index/iulact_ca -f /var/local/CQPweb/upload/ca_allDocsReduced.cqp -R "/var/local/CQPweb/registry/iulact_ca"  -P lemma -P pos -P token -S text+id -S txt -S p -S s -S head -S hi -S abbr -S name -S loc -S na -S num -S date -S foreign -S ptr -S div1 -S div2 -S div3 -S div4 -S div5 -S div6 -S div7 -S div8 -S list -S note -S figure -S table -S row -S item -S cell -S gap 2>&1"

  [1]=>

  string(117) "Close tag  without matching open tag ignored (file /var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #3470)."

  [2]=>

  string(117) "Close tag  without matching open tag ignored (file /var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #3971)."

  [3]=>

  string(117) "Close tag  without matching open tag ignored (file /var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #6713)."
...
  [7133]=>

  string(120) "Close tag  without matching open tag ignored (file /var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #4991289)."

  [7134]=>

  string(53) "Warning: missing  tag inserted at end of input."

  [7135]=>

  string(53) "Warning: missing  tag inserted at end of input."

  [7136]=>

  string(87) "/usr/local/cwb-3.4.12/bin/cwb-makeall -r "/var/local/CQPweb/registry" -V IULACT_CA 2>&1"

}
My test corpus includes some nested tags and I think that this may be the cause of the error. The "CWB Encoding Tutorial" indicates that in this case cwb-encode should be called with the parameter -S <tag>:0, where <tag> is a tag that is nested. This situation occurs quite often in my corpora, so I need to find a solution. Therefore, my questions are:
- Is there any way to indicate this situation to CQPweb?
- How do I force CQPweb to invoke cwb-encode with the updated parameter?
- This parameter includes ":0"; must this "0" be replaced by the maximum nesting level?
Any suggestion for solving this is very welcome.
Thank you in advance,
Jorge
PS. Perhaps I should open a new thread on the CWB list

2017-07-26 14:51 GMT+02:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:
Hi Jorge,

As suspected, seeing a few of the error messages makes it clear what is happening.

“is not a valid feature set” is the error message. You are getting one per POS tag in the corpus (since your POS tags aren’t feature sets). One error message per POS tag in the corpus exhausts the available RAM.

The answer is simple: don’t tick the “Feature set?” box when your tags aren’t actually feature sets.

To read about what feature sets are / aren’t in CWB/CQPweb, have a look at the CWB Encoding Tutorial (search for the word “feature” in the PDF to find the relevant section.)

best

Andrew.

From: cwb-bounces at sslmit.unibo.it On Behalf Of VIVALDI PALATRESI, JORGE
Sent: 26 July 2017 12:23
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Unable to index a corpus

Hi Andrew,
I made the suggested modifications to the file lib/admin-install.inc.php, without any positive result: the browser always goes blank, with similar messages in the file error.log.
Then I reduced the corpus to 20% of its original size. This time some messages appeared in the browser, but after a while the browser crashed. Anyway, I captured the attached screenshot.
It seems to be a problem related to cwb-encode and the POS tags (N5-FS, JQ--FS, P, ...).
As I mentioned in a previous message, the corpus (and its POS tags) is the same one used with a previous version of CQP and CQPweb.
Best,
Jorge

2017-07-26 10:22 GMT+02:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:
As I noted before, the problem is actually error messages. Line 644 simply collects error messages – so an out-of-memory error here indicates you have generated > 4GB of error messages.

I suggested increasing the memory previously because it would let you see the problem, but actually, with 4GB of error messages, I’d suggest that doing that is not likely to help much.

So what I would suggest instead is hacking the code to find out the error message.

Open admin-lib.inc.php
Go to line 644
Find the line nearby that says $output_lines_from_cwb = array($encode_command);

AFTER THAT LINE, but before the line that says exec($encode_command, $output_lines_from_cwb, $exit_status_from_cwb); add the following:

if (count($output_lines_from_cwb) > 1000) {show_var($output_lines_from_cwb); exiterror("abort"); }

What this line does is make things abort if it detects too many error messages.

If you then get a readable error message, that might give you a hint what the real problem is. If not, try again, moving the location of the hack line down the file, before the following lines:

before exec($makeall_command, $output_lines_from_cwb, $exit_status_from_cwb);
before exec($compress_command, $compression_output, $exit_status_from_cwb);
before the second example of exec($makeall_command, $output_lines_from_cwb, $exit_status_from_cwb);
before } /* end else (from if cwb index already exists) */

Hopefully, as I say, doing this will get you a glimpse of the first 1,000 lines of errors, which may tell you what the underlying problem is.
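
The same idea, stopping once a fixed number of output lines has been collected instead of buffering everything, can also be tried outside CQPweb. The Python below is not CQPweb code: run_capped and its parameters are invented for illustration. It just runs a command, reads its combined stdout/stderr line by line, and kills the process once the limit is reached, so the full output never has to fit in memory.

```python
import subprocess
import sys


def run_capped(cmd, limit=1000):
    """Run cmd and return (lines, hit_limit).

    lines holds at most `limit` lines of combined stdout and stderr;
    hit_limit is True if the process was stopped because the cap was reached.
    """
    lines = []
    hit_limit = False
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as the 2>&1 in the encode command
        text=True,
    )
    for line in proc.stdout:
        lines.append(line.rstrip("\n"))
        if len(lines) >= limit:
            hit_limit = True
            proc.kill()  # no point in letting it produce more messages
            break
    proc.stdout.close()
    proc.wait()
    return lines, hit_limit


if __name__ == "__main__":
    # Stand-in for a noisy command; replace with the real command as a list.
    noisy = [sys.executable, "-c", "for i in range(5000): print('error', i)"]
    first_lines, hit_limit = run_capped(noisy, limit=5)
    print(first_lines, hit_limit)
```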

Hope this helps

best

Andrew.


--
Jorge Vivaldi Palatresi
Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
C/ Roc Boronat, 138
08018 Barcelona
Espanya

+34 93 542 2332