[CWB] CL: Out of memory. (killed)

Scott Sadowsky ssadowsky at gmail.com
Sat Apr 1 02:34:15 CEST 2017


On Fri, Mar 31, 2017 at 1:35 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

Hi Andrew,

> To my mind, a better way around the problem would be to work out why there
> is a sentence in your corpus big enough to fill a 1 or 2GB string! That
> must be many tens of millions of words, i.e. clearly not an actual sentence.
>

No doubt that would be a better way. But at the same time, this kind of
situation could easily cause big problems in a server environment, and it's
not at all obvious -- I've been using this corpus on and off for a couple
years, and only now stumbled upon this landmine. And the same corpus, but
tagged with different software, doesn't seem to have the problem as far as
I can tell. So a failsafe would still be a good idea, assuming it doesn't
require too much coding.
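
A minimal sketch of what such a failsafe could look like, written in Python purely for illustration (CWB itself is written in C, and this is not taken from its source). The function name `clamp_context`, its parameters, and the 1000-token default are hypothetical. The idea is to use the enclosing s-attribute region as context, but never to extend more than a fixed number of tokens beyond the match on either side.

```python
def clamp_context(match_start, match_end, region_start, region_end, max_tokens=1000):
    """Return (left, right) corpus positions for the concordance context.

    The context is the enclosing region (e.g. the <s> containing the hit),
    but it never extends more than max_tokens beyond the match on either side.
    All names here are hypothetical; this is not CWB's API.
    """
    left = max(region_start, match_start - max_tokens)
    right = min(region_end, match_end + max_tokens)
    return left, right


# A normal-sized sentence is returned unchanged.
normal = clamp_context(500, 500, 480, 530)

# A runaway "sentence" is cut to +/- 1000 tokens around the hit.
# The match position is the one from the "ábaco" hit; the region bounds are made up.
runaway = clamp_context(135515175, 135515175, 100000000, 400000000)

print(normal, runaway)
```

With a cap like this, the user's sentence-based setting still decides the context in the ordinary case, and the hard limit only takes effect when a region is implausibly large.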


> The obvious possibility is some glitch in the coding of the tags for that
> s-attribute at some point in your input file and that the error cascades so
> that you get an s-region of this size. Try the following commands:
>
>
>
> Temp = <s> [] expand to s;
>
> dump Temp;
>
>
>
> … the difference between col 1 and 2 (begin/end points) of the dump should
> show you a really big <s> at some point along the line, and the token
> numbers will give you a hint where to start looking.
>

I wrote the results to a file, but it's 371 MB of numbers. I'm afraid I
won't be able to make heads or tails of that!
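
A file that size does not need to be read by eye. The following is a rough sketch, assuming each line of the dump starts with two whitespace-separated integers (the begin and end corpus positions of one region) and that any further columns can be ignored. The function name `longest_regions` and the file name in the usage comment are hypothetical.

```python
import heapq


def longest_regions(lines, top=10):
    """Return the `top` longest regions as (length, start, end) tuples.

    `lines` is any iterable of text lines whose first two fields are the
    begin and end corpus positions of a region (assumed dump layout).
    """
    def regions():
        for line in lines:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            start, end = int(fields[0]), int(fields[1])
            yield (end - start + 1, start, end)

    # nlargest streams through the input, so the whole file never sits in memory.
    return heapq.nlargest(top, regions())


# Usage on a saved dump (file name is an assumption):
#
#     with open("temp_dump.txt", encoding="utf-8") as handle:
#         for length, start, end in longest_regions(handle):
#             print(length, start, end)
```

The first few rows of the result should show any s-region that is millions of tokens long, and its start position indicates where in the input file to begin looking for the broken tag.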

Cheers,
Scott




>
>
> best
>
>
>
> Andrew.
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Scott Sadowsky
> *Sent:* 31 March 2017 04:40
>
> *To:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] CL: Out of memory. (killed)
>
>
>
> On Thu, Mar 30, 2017 at 11:57 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> Hi Andrew,
>
>
>
> Can you try changing the concordance width to a fixed number of
> characters, say 50, and see if the error persists?
>
>
> I just did set c 50 and ran the "ábaco" query, and it worked perfectly.
>
> I then did set c 1s and the query crashed. As the doomed query was being
> performed, my RAM usage went from 10.8 GB to 12.7 GB (I have 32 GB in
> total, though).
>
>
>
> Also, querying the next couple words in the sentence that appears with
> "ábaco" causes the same crash.
>
>
>
>
>
> It rather looks like the cause of the error is an attempt to allocate more
> memory than is available to the construction of a concordance string. It
> also looks like your concordance width is set to 1 sentence (s or similar
> s-attribute). A hit in a very long sentence could, thus, exhaust your
> memory. But this won’t happen in character-mode width. So, if the bug
> persists in a 50-character-width concordance, I’m wrong.
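
The negative byte count in the `cl_realloc` message (-2147479552) fits this explanation. As a hedged illustration, and assuming the requested size is held in a signed 32-bit integer somewhere along the way (an assumption, not something confirmed from the CWB source), the arithmetic below shows which request would wrap around to exactly that value. The helper `as_int32` is a hypothetical name.

```python
def as_int32(n):
    """Reinterpret an integer as a signed 32-bit value (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31


reported = -2147479552          # byte count shown in the error message
requested = reported + 2**32    # the size that wraps to the reported value
excess = requested - 2**31      # how far past 2 GiB the request went

print(requested, excess, as_int32(requested) == reported)
```

Under that assumption, the buffer for the concordance line had just grown past 2 GiB when the size went negative, which is what a single hit inside an enormous s-region would produce.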
>
>
>
> I think you've hit the nail on the head.
>
>
>
> Since a linguistic unit like the sentence produces (to me, at
> least) much more useful query results than an arbitrary number of characters
> or words, is there any way to work around this? Something like a high but
> hard limit on context size (say, 1000 words, which I just tried
> successfully), *in addition to* the user's word or sentence-based limit?
>
>
>
> Cheers,
>
> Scott
>
>
>
>
>
>
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Scott Sadowsky
> *Sent:* 31 March 2017 00:43
> *To:* Open source development of the Corpus WorkBench
> *Subject:* Re: [CWB] CL: Out of memory. (killed)
>
>
>
> Sure!
>
>
>
> The IMS Open Corpus Workbench (CWB)
>
>
>
> Copyright (C) 1993-2006 by IMS, University of Stuttgart
>
> Original developer:       Oliver Christ
>
>     with contributions by Bruno Maximilian Schulze
>
> Version 3.0 developed by: Stefan Evert
>
>     with contributions by Arne Fitschen
>
>
>
> Copyright (C) 2007-today by the CWB open-source community
>
>     individual contributors are listed in source file AUTHORS
>
>
>
> Download and contact: http://cwb.sourceforge.net/
>
>
>
> Compiled:  Sun 26 Mar 19:37:22 CLST 2017
>
> Version:   3.4.11
>
>
>
> Mind you, I downloaded and compiled the latest development version about a
> week ago, and that build number isn't shown here. If you need it and can
> tell me how to get it, I'll be glad to do so.
>
>
>
> Cheers!
>
> Scott
>
>
>
> On Thu, Mar 30, 2017 at 3:07 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
> Hi Scott,
>
>
>
> Could you check what version this is with *cqp -v* please?
>
>
>
> thanks
>
>
>
> best
>
>
>
> Andrew
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Scott Sadowsky
> *Sent:* 30 March 2017 19:04
> *To:* CWBdev Mailing List
> *Subject:* [CWB] CL: Out of memory. (killed)
>
>
>
> When it rains, it pours, I guess!
>
>
>
> I have a fairly large corpus (880m words) which I've been using for some
> time without incident (this is NOT related to the corpus I asked about
> yesterday, the processing of which topped out at 2^31 tokens).
>
>
>
> Unfortunately, I've just happened upon a specific word, which when I
> search for it with cqp, crashes the program with the following error:
>
>
>
> CC-C> "ábaco"
> CL: Out of memory. (killed)
> CL: [cl_realloc(block at 0x7f7e78c99010 to -2147479552 bytes)]
>
> 135515175:  Ahí aparecen : un retrato iluminado de l mandarín Van-ta-gin
> ; un junco ; un molino de arroz ; los retratos iluminados de un chino y un
> hoten
> tote ; diversos caracteres de la escritura china ; la reproducción de una
> moneda en anverso y reverso ; la reproducción de los signos grabados en una
> cap
> arazón de tortuga utilizada para la adivinación , con el nombre de "
> tortue mistique " ; una vista de la parte oriental de Parque de Gé-hol ; el
> ciclo ch
> ino ; un <ábaco> ; el proceso de formación de letras ; reproducción de
> diversas armas de artillería ; instrumentos musicales como flautas ,
> violines , gu
> itarras , trompetas , liras , gongs , tambores , campanas ; un puente ;
> una aldea y sus habitantes ; la casa de un mandarín y diversas melodías en
> llave
> de sol : Mon-lie-ouha , aires chinos y un aire musical cantado en una
> chalupa china .
> *{ ~ } $*
>
>
> The prompt above is the Linux terminal, rather than CQP's command line, by
> the way. The error comes after pegging the processor core at 100% for a
> good 30-45 seconds. Results for simple queries like this are normally
> returned in milliseconds.
>
>
>
> Further testing has produced what are to me strange results. "árbol" works
> *fine*, but "ébola" crashes CQP, as seen below:
>
>
>
> CC-C> "ébola"
> CL: Out of memory. (killed)
> CL: [cl_realloc(block at 0x7f02d14b7010 to -2147479552 bytes)]
>
> 146356674:  SIDA y el <ébola> son corresponde y es falso ,
> 147036486:  pertenece a l mismo grupo de l mortal virus <ébola> .
> 178273950:  Hay muchas enfermedades , como el caso de l hanta , de l <
> ébola> , de l lassa , de l dengue , etcétera , para las cuales no existen
> vacunas ,
> y nuestro Instituto de Salud Pública podría enfrentar las suficientemente
> .
> *{ ~ } $*
>
>
> Other searches with word-initial non-ASCII characters have also produced
> crashes, such as "ácaro". But, as seen above with "árbol", at least one
> doesn't.
>
>
>
> The errors are also happening with words which have non-ASCII characters
> in other places, such as "esdrújula".
>
>
>
> Note that this corpus is UTF-8 encoded.
>
>
>
> Any ideas? I've never had this problem before, and I still don't with
> other corpora of similar size.
>
>
>
> Cheers,
>
> Scott
>
>
>
>
>
> --
>
> Dr. Scott Sadowsky
> Profesor Asistente de Lingüística
>
> Pontificia Universidad Católica de Chile
>
>
>
> ssadowsky gmail com
>
> scsadowsky uc cl
> http://sadowsky.cl/
>
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>
>
>
>
>
>
>


-- 
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/