[CWB] Unable to index a corpus

VIVALDI PALATRESI, JORGE jorge.vivaldi at upf.edu
Thu Jul 27 09:52:53 CEST 2017


Hi Andrew,
As you suggested, with the "Feature set" box left unticked the process
continues, but now the browser shows a new error:

-->$<--
array(7137) {
  [0]=>
  string(442) "/usr/local/cwb-3.4.12/bin/cwb-encode -xsB -c utf8 -d
/var/local/CQPweb/index/iulact_ca -f
/var/local/CQPweb/upload/ca_allDocsReduced.cqp -R
"/var/local/CQPweb/registry/iulact_ca"  -P lemma -P pos -P token -S
text+id -S txt -S p -S s -S head -S hi -S abbr -S name -S loc -S na -S
num -S date -S foreign -S ptr -S div1 -S div2 -S div3 -S div4 -S div5
-S div6 -S div7 -S div8 -S list -S note -S figure -S table -S row -S
item -S cell -S gap 2>&1"
  [1]=>
  string(117) "Close tag  without matching open tag ignored (file
/var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #3470)."
  [2]=>
  string(117) "Close tag  without matching open tag ignored (file
/var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #3971)."
  [3]=>
  string(117) "Close tag  without matching open tag ignored (file
/var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #6713)."
...
  [7133]=>
  string(120) "Close tag  without matching open tag ignored (file
/var/local/CQPweb/upload/ca_allDocsReduced.cqp, line #4991289)."
  [7134]=>
  string(53) "Warning: missing  tag inserted at end of input."
  [7135]=>
  string(53) "Warning: missing  tag inserted at end of input."
  [7136]=>
  string(87) "/usr/local/cwb-3.4.12/bin/cwb-makeall -r
"/var/local/CQPweb/registry" -V IULACT_CA 2>&1"
}
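
Since the tag names are missing from the messages above, one way to see which tags are involved is to pull the reported line numbers out of the output and look them up in the input file. A minimal sketch in Python, assuming the messages keep the "line #N" form shown above; the function names are hypothetical helpers and not part of CWB or CQPweb:

```python
import re

# Assumes each warning reports its position as "line #<number>", as in the output above.
LINE_NO = re.compile(r"line #(\d+)")

def warning_line_numbers(messages):
    """Return the sorted, de-duplicated input line numbers mentioned in the messages."""
    numbers = set()
    for msg in messages:
        match = LINE_NO.search(msg)
        if match:
            numbers.add(int(match.group(1)))
    return sorted(numbers)

def lines_at(path, wanted):
    """Map each wanted 1-based line number to the text on that line of the file."""
    wanted = set(wanted)
    found = {}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, text in enumerate(handle, start=1):
            if number in wanted:
                found[number] = text.rstrip("\n")
    return found
```

Calling lines_at() on /var/local/CQPweb/upload/ca_allDocsReduced.cqp with the numbers returned by warning_line_numbers() would show the close tags that cwb-encode is ignoring.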

My test corpus includes some nested tags, and I think this may be the
cause of the error. The "CWB Encoding Tutorial" indicates that in this
case cwb-encode should be called with the parameter -S <tag>:0, where
<tag> is a tag that can be nested. This situation occurs quite often
in my corpora, so I need to find a solution. Therefore, my questions are:
- Is there any way to tell CQPweb about this situation?
- How do I force CQPweb to invoke cwb-encode with the updated parameter?
- This parameter includes ":0"; must this "0" be replaced by the maximum
nesting level?
Any suggestions for solving this would be very welcome.
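
To find out how deeply each tag is actually nested inside itself, the input file can be scanned before encoding. A rough sketch in Python, assuming the usual one-tag-per-line vertical format and assuming that the number after the colon is meant to be the number of extra embedding levels (this should be checked against the CWB Encoding Tutorial); max_self_nesting is a hypothetical helper, not a CWB tool:

```python
import re
from collections import defaultdict

# Matches an XML-style tag that occupies a whole line, e.g. <div1 id="x">, </div1> or <gap/>.
TAG = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>\s*$")

def max_self_nesting(lines):
    """Return {tag: levels}, where 0 means the tag is never open inside itself."""
    depth = defaultdict(int)    # how many regions of each tag are currently open
    deepest = defaultdict(int)  # largest number of simultaneously open regions seen
    for line in lines:
        match = TAG.match(line)
        if not match:
            continue            # token line or anything that is not a lone tag
        closing, name, self_closing = match.groups()
        if self_closing:
            continue            # empty elements such as <gap/> cannot nest
        if closing:
            if depth[name] > 0: # ignore close tags with no matching open tag
                depth[name] -= 1
        else:
            depth[name] += 1
            deepest[name] = max(deepest[name], depth[name])
    return {name: count - 1 for name, count in deepest.items()}
```

Run over the lines of the corpus file, this gives one number per tag. Under the assumption stated above, a tag reported with 1 would be a candidate for a declaration such as -S <tag>:1, and tags reported with 0 would need no change.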

Thank you in advance,

Jorge

P.S. Perhaps I should open a new thread on the CWB list.


2017-07-26 14:51 GMT+02:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:

> Hi Jorge,
>
>
>
> As suspected, seeing a few of the error messages makes it clear what is
> happening.
>
>
>
> “is not a valid feature set” is the error message. You are getting one
> per POS tag in the corpus (since your POS tags aren’t feature sets). One
> error message per POS tag in the corpus exhausts the available RAM.
>
>
>
> The answer is simple: don’t tick the “Feature set?” box when your tags
> aren’t actually feature sets.
>
>
>
> To read about what feature sets are / aren’t in CWB/CQPweb, have a look at
> the CWB Encoding Tutorial (search for the word “feature” in the PDF to find
> the relevant section.)
>
>
>
> best
>
>
>
> Andrew.
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *VIVALDI PALATRESI, JORGE
> *Sent:* 26 July 2017 12:23
> *To:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] Unable to index a corpus
>
>
>
> Hi Andrew,
>
> I made the suggested modifications in the file lib/admin-install.inc.php,
> but without any positive result: the browser always goes blank, with
> similar messages in the file error.log.
>
> Then I reduced the size of the corpus to 20% of the original. Some
> messages then appeared in the browser, but after a while the browser
> crashed. Anyway, I captured the included screenshot.
>
> It seems to be a problem related to the cwb-encode and the POS tags
> (N5-FS, JQ--FS, P, ...).
> As I mentioned in a previous message, the corpus (and its POS tags) is the
> same one used with a previous version of CQP and CQPweb.
>
> Bests,
>
> Jorge
>
>
>
> 2017-07-26 10:22 GMT+02:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:
>
> As I noted before, the problem is actually error messages. Line 644 simply
> collects error messages – so an out-of-memory error here indicates you have
> generated > 4GB of error messages.
>
>
>
> I suggested increasing the memory previously because it would let you see
> the problem – but actually, with 4GB of error messages, I’d suggest that
> doing that is not likely to help much.
>
>
>
> So what I would suggest instead is hacking the code to find out the error
> message.
>
>
>
> Open admin-lib.inc.php
>
> Go to line 644
>
> Find the line nearby that says *$output_lines_from_cwb* = array(
> *$encode_command*);
>
>
>
> AFTER THAT LINE, but before the line that says exec(*$encode_command*,
> *$output_lines_from_cwb*, *$exit_status_from_cwb*); add the following:
>
>
>
> *if (count($output_lines_from_cwb) > 1000)
> {show_var($output_lines_from_cwb); exiterror("abort"); }*
>
>
>
> What this line does is make things abort if it detects too many error
> messages.
>
>
>
> If you then get a readable error message, that might give you a hint what
> the real problem is. If not, try again, moving the location of the hack line
> down the file, before the following lines:
>
>
>
> before exec($makeall_command, $output_lines_from_cwb,
> $exit_status_from_cwb);
>
> before exec($compress_command, $compression_output, $exit_status_from_cwb
> );
>
> before the second instance of exec($makeall_command, $output_lines_from_cwb
> , $exit_status_from_cwb);
>
> before } */* end else (from if cwb index already exists) */*
>
>
>
> Hopefully, as I say, doing this will get you a glimpse of the first 1,000
> lines of error, which may tell you what the underlying problem is.
>
>
>
> Hope this helps
>
>
>
> best
>
>
>
> Andrew.
>


-- 
Jorge Vivaldi Palatresi
Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
C/ Roc Boronat, 138
08018 Barcelona
Espanya

+34 93 542 2332