[CWB] [cwb:bugs] #59 cwb-makeall aborts uninformatively if there are too many attributes
andrewhardie at users.sf.net
Mon Jun 30 00:08:33 CEST 2014
Some investigation shows: there is a hard limit on the number of attributes that cwb-encode can produce from an input file, imposed by the OS limit on the number of files that can be open at once. That limit is discoverable programmatically with getrlimit, or in bash with `ulimit -n`. On my system it is 1024. Remember, however, that an attribute can require up to 3 files (fewer only in the case of s-attributes), and that the limit may be a lot lower than that.
Using one p-attribute and as many s-attributes as I could fit without hitting the 1024 limit (374 s-attributes, one-seventh of them without annotations and the rest with), however, I was not able to reproduce the bug: the registry parsed fine.
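For anyone who wants to check the limit on their own system, here is a minimal sketch in Python (not part of CWB) that reads it via the standard `resource` module, which wraps getrlimit on Unix-like systems. The function name `open_file_limits` is just an illustrative choice.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    # RLIMIT_NOFILE is the per-process cap on simultaneously open files;
    # the soft limit is the value that `ulimit -n` reports in bash.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```

The soft limit is the one that a process such as cwb-encode would actually run into; it can usually be raised up to the hard limit with `ulimit -n` before the process is started.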
I did track down what triggers this error message in the registry parser: it is triggered by the built-in Bison symbol [error], which is part of the top-level rule for the registry file:
Registry: Header Declaration | error ;
(this compiles to rule 4 in the output of bison). However, this is also uninformative, since the error might have bubbled up from any one of the attribute declarations.
Two actions taken:
(1) I have documented in man cwb-encode that the limit on the number of attributes should be considered to be 341 (not exactly true, but a rough rule of thumb).
(2) I have changed the uninformative “Parse Error” message to “Error parsing the main Registry structure”, which is still not explanatory, but at least gives some indication of where the error has occurred. (This is in the action applied to the “error” symbol in the Registry rule in registry.y.)
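The figure of 341 is consistent with dividing a 1024-file limit by the worst case of 3 files per attribute. A minimal sketch of that arithmetic in Python follows; the function name and its parameters are illustrative only and are not part of CWB, and the result deliberately ignores files the process already has open and s-attributes that need fewer than 3 files.

```python
def rule_of_thumb_max_attributes(open_file_limit, files_per_attribute=3):
    """Conservative estimate of how many attributes fit under an open-file limit.

    Assumes every attribute needs the worst-case number of files.
    """
    return open_file_limit // files_per_attribute


print(rule_of_thumb_max_attributes(1024))  # 1024 // 3 = 341
```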
Otherwise, we'll consider this closed.
**[bugs:#59] cwb-makeall aborts uninformatively if there are too many attributes**
**Labels:** CL low-level library
**Created:** Fri Sep 13, 2013 11:15 AM UTC by Andrew Hardie
**Last Updated:** Fri Sep 13, 2013 11:15 AM UTC
**Owner:** Andrew Hardie
It is possible to create a corpus using cwb-encode that cannot be processed by cwb-makeall (or any tool that uses the CL, but cwb-makeall is the one where you notice it!) because it has too many attributes.
The bug is as follows: the YACC parser produces an error, with the message printed twice:
(plus boilerplate inserted by cregerror_cleanup())
This seems to arise from the "error" component in the Yacc grammar in parser.y.
I triggered this error with S-attributes (I was indexing *lots* of XML elements, each of which had 3 or 4 attributes). Removing most of them from the registry file made the problem go away. Thus, I infer that the problem was caused by too many attributes - I assume it could not have been a syntax error in the reg file, because that was written by cwb-encode!
Solution: I am not even sure one is needed because this entire area of the code will go away in v 3.9/4.0, and it is clearly not hampering most users. If there *is* an effective maximum number of attributes, perhaps that should be mentioned in the cwb-encode man file for 3.5.