[CWB] CL: Out of memory. (killed)

Stefan Evert stefanML at collocations.de
Fri Mar 31 10:50:06 CEST 2017


> On 30 Mar 2017, at 20:04, Scott Sadowsky <ssadowsky at gmail.com> wrote:
> 
> CC-C> "ábaco" 
> CL: Out of memory. (killed)                                  
> CL: [cl_realloc(block at 0x7f7e78c99010 to -2147479552 bytes)] 

The immediate cause of this crash is that something in CQP attempts to allocate a buffer of more than 2 GiB but uses a signed 32-bit int to calculate the size, so it wraps around to a negative number.
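The wraparound can be reproduced with a little arithmetic. The sketch below is only an illustration, not CQP's actual code: wrap_int32 is a hypothetical helper that mimics two's-complement truncation to 32 bits, and the "intended" size of 2 GiB + 4 KiB is an assumption, namely the smallest request that would produce the logged value.

```python
def wrap_int32(n: int) -> int:
    """Return the value a signed 32-bit int would hold after truncating n
    (two's-complement wraparound)."""
    n &= 0xFFFFFFFF  # keep only the low 32 bits
    return n - 0x100000000 if n >= 0x80000000 else n


# Assumed request: 2 GiB + 4 KiB, the smallest size consistent with the log.
intended = 2**31 + 4096

print(wrap_int32(intended))  # -2147479552, the size shown in the cl_realloc message
```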

The ensuing discussion suggests that the culprit is Andrew's implementation of automatically growing strings, which was never expected to deal with such huge strings.  It would probably be better to fail with a CQP error if the KWIC context gets larger than 1M characters (or perhaps 100M), but I'm not sure how easy that is to fit into CQP's haphazard error handling.
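The idea of a size cap can be sketched as follows. This is a minimal illustration in Python, not a description of how CQP's growing strings are implemented: BoundedBuffer, ContextTooLargeError and the max_chars parameter are all hypothetical names, and the 1M default simply mirrors the limit suggested above.

```python
class ContextTooLargeError(Exception):
    """Hypothetical error type for this sketch; CQP has no such exception."""


class BoundedBuffer:
    """Growing string buffer that refuses to exceed a fixed size."""

    def __init__(self, max_chars: int = 1_000_000):
        self.max_chars = max_chars
        self._parts = []
        self._length = 0

    def append(self, text: str) -> None:
        # Check the new size before growing, so the buffer is never left
        # in an oversized or half-updated state.
        new_length = self._length + len(text)
        if new_length > self.max_chars:
            raise ContextTooLargeError(
                f"context would grow to {new_length} characters "
                f"(limit {self.max_chars})"
            )
        self._parts.append(text)
        self._length = new_length

    def value(self) -> str:
        return "".join(self._parts)
```

The check happens before any allocation, so an oversized request is reported as an ordinary error instead of reaching the allocator with a nonsensical size.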

(@Andrew: shouldn't we consider moving to C++, if only for the sake of exceptions?)

As Andrew pointed out, the root cause of the problem is that your corpus seems to contain a sentence of several hundred million tokens (so it formats to over 2 GiB).  This easily happens if there's a missing </s> tag somewhere in the middle and you encode with "-S s:0" (because the following sentences are nested in the one that hasn't been closed).  You probably got warnings about missing </s> tags when you encoded the corpus, didn't you?

If you can't be sure that the structural annotation in a corpus is well-formed XML, it's often better to do a flat encode with "-S s".
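Before encoding, the input can also be checked for unbalanced sentence tags. The following is a rough sketch under several assumptions: the input is in one-item-per-line vertical format with each <s> and </s> tag on its own line, sentences are not supposed to nest, and check_s_regions is a hypothetical helper, not part of CWB.

```python
import re

# Matches "<s>" or "<s attr=...>" on a line by itself, but not e.g. "<sentence>".
OPEN_S = re.compile(r"^<s(\s[^>]*)?>$")
CLOSE_S = "</s>"


def check_s_regions(lines):
    """Return a list of (line_number, message) for unbalanced <s> tags."""
    problems = []
    open_line = None  # line number of the currently open <s>, if any
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if OPEN_S.match(line):
            if open_line is not None:
                problems.append(
                    (lineno,
                     f"<s> opened while the <s> from line {open_line} is still open")
                )
            open_line = lineno
        elif line == CLOSE_S:
            if open_line is None:
                problems.append((lineno, "</s> without a matching <s>"))
            open_line = None
    if open_line is not None:
        problems.append((open_line, "<s> never closed"))
    return problems
```

Since the function only iterates over lines, an open file object can be passed in directly, so a large corpus file does not have to be read into memory first.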

Best,
Stefan

