[CWB] CL: Out of memory. (killed)

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Mar 31 06:35:24 CEST 2017


To my mind, a better way around the problem would be to work out why there is a sentence in your corpus big enough to fill a 1 or 2 GB string! That must be many tens of millions of words, i.e. clearly not an actual sentence.

The obvious possibility is a glitch in the coding of the tags for that s-attribute at some point in your input file, with the error cascading so that you end up with an s-region of this size.

Try the following commands:

Temp = <s> [] expand to s;
dump Temp;

… the difference between columns 1 and 2 (the begin/end points) of the dump should show you a really big <s> somewhere along the line, and the token numbers will give you a hint of where to start looking.
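
If the dump is too long to scan by eye, one rough sketch (untested, and the file name is arbitrary) would be to write it to a file:

dump Temp > "s_regions.txt";

and then, since the first two tab-separated columns of each dump line are the begin and end corpus positions, pull out the longest region at the Linux shell:

awk '{ len = $2 - $1 + 1; if (len > max) { max = len; line = $0 } } END { print "longest <s>: " max " tokens"; print line }' s_regions.txt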

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 31 March 2017 04:40
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] CL: Out of memory. (killed)

On Thu, Mar 30, 2017 at 11:57 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

Hi Andrew,

Can you try changing the concordance width to a fixed number of characters, say 50, and see if the error persists?

I just did set c 50 and ran the "ábaco" query, and it worked perfectly.

I then did set c 1s and the query crashed. As the doomed query was being performed, my RAM usage went from 10.8 GB to 12.7 GB (I have 32 GB in total, though).

Also, querying the next couple of words in the sentence that appears with "ábaco" causes the same crash.


It rather looks like the cause of the error is that the construction of a concordance string is attempting to allocate more memory than is available. It also looks like your concordance width is set to 1 sentence (s or a similar s-attribute). A hit in a very long sentence could thus exhaust your memory. But this won't happen with a character-mode width. So, if the bug persists with a 50-character-wide concordance, I'm wrong.

I think you've hit the nail on the head.

Given that using a linguistic unit like the sentence produces (to me, at least) much more useful query results than an arbitrary number of characters or words, is there any way to work around this? Something like a high but hard limit on context size (say, 1000 words, which I just tried successfully), in addition to the user's word- or sentence-based limit?
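
For reference, the 1000-word cap I just tried was set roughly like this at the CQP prompt (a sketch; the exact number is arbitrary):

set Context 1000 words;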

Cheers,
Scott




From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 31 March 2017 00:43
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] CL: Out of memory. (killed)

Sure!

The IMS Open Corpus Workbench (CWB)

Copyright (C) 1993-2006 by IMS, University of Stuttgart
Original developer:       Oliver Christ
    with contributions by Bruno Maximilian Schulze
Version 3.0 developed by: Stefan Evert
    with contributions by Arne Fitschen

Copyright (C) 2007-today by the CWB open-source community
    individual contributors are listed in source file AUTHORS

Download and contact: http://cwb.sourceforge.net/

Compiled:  Sun 26 Mar 19:37:22 CLST 2017
Version:   3.4.11

Mind you, I downloaded and compiled the latest development version about a week ago, and that build number isn't shown here. If you need it and can tell me how to get it, I'll be glad to do so.

Cheers!
Scott

On Thu, Mar 30, 2017 at 3:07 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
Hi Scott,

Could you check what version  this is with cqp -v please?

thanks

best

Andrew

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Scott Sadowsky
Sent: 30 March 2017 19:04
To: CWBdev Mailing List
Subject: [CWB] CL: Out of memory. (killed)

When it rains, it pours, I guess!

I have a fairly large corpus (880m words) which I've been using for some time without incident (this is NOT related to the corpus I asked about yesterday, the processing of which topped out at 2^31 tokens).

Unfortunately, I've just happened upon a specific word which, when I search for it with cqp, crashes the program with the following error:

CC-C> "ábaco"
CL: Out of memory. (killed)
CL: [cl_realloc(block at 0x7f7e78c99010 to -2147479552 bytes)]

135515175:  Ahí aparecen : un retrato iluminado de l mandarín Van-ta-gin ; un junco ; un molino de arroz ; los retratos iluminados de un chino y un hotentote ; diversos caracteres de la escritura china ; la reproducción de una moneda en anverso y reverso ; la reproducción de los signos grabados en una caparazón de tortuga utilizada para la adivinación , con el nombre de " tortue mistique " ; una vista de la parte oriental de Parque de Gé-hol ; el ciclo chino ; un <ábaco> ; el proceso de formación de letras ; reproducción de diversas armas de artillería ; instrumentos musicales como flautas , violines , guitarras , trompetas , liras , gongs , tambores , campanas ; un puente ; una aldea y sus habitantes ; la casa de un mandarín y diversas melodías en llave de sol : Mon-lie-ouha , aires chinos y un aire musical cantado en una chalupa china .
{ ~ } $

The prompt above is the Linux terminal, rather than CQP's command line, by the way. The error comes after pegging the processor core at 100% for a good 30-45 seconds. Results for simple queries like this are normally returned in milliseconds.

Further testing has produced what are to me strange results. "árbol" works fine, but "ébola" crashes CQP, as seen below:

CC-C> "ébola"
CL: Out of memory. (killed)
CL: [cl_realloc(block at 0x7f02d14b7010 to -2147479552 bytes)]

146356674:  SIDA y el <ébola> son corresponde y es falso ,
147036486:  pertenece a l mismo grupo de l mortal virus <ébola> .
178273950:  Hay muchas enfermedades , como el caso de l hanta , de l <ébola> , de l lassa , de l dengue , etcétera , para las cuales no existen vacunas , y nuestro Instituto de Salud Pública podría enfrentar las suficientemente .
{ ~ } $

Other searches for words with initial non-ASCII characters, such as "ácaro", have also produced crashes. But, as seen above with "árbol", at least one doesn't.

The errors are also happening with words which have non-ASCII characters in other places, such as "esdrújula".

Note that this corpus is UTF-8 encoded.

Any ideas? I've never had this problem before, and I still don't have it with other corpora of similar size.

Cheers,
Scott



--
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/


