[CWB] CL: Out of memory. (killed)

Scott Sadowsky ssadowsky at gmail.com
Thu Mar 30 20:04:08 CEST 2017


When it rains, it pours, I guess!

I have a fairly large corpus (880m words) which I've been using for some
time without incident (this is NOT related to the corpus I asked about
yesterday, the processing of which topped out at 2^31 tokens).

Unfortunately, I've just happened upon a specific word, which when I search
for it with cqp, crashes the program with the following error:

CC-C> "ábaco"
CL: Out of memory. (killed)
CL: [cl_realloc(block at 0x7f7e78c99010 to -2147479552 bytes)]

135515175:  Ahí aparecen : un retrato iluminado de l mandarín Van-ta-gin ;
un junco ; un molino de arroz ; los retratos iluminados de un chino y un
hotentote ; diversos caracteres de la escritura china ; la reproducción de una
moneda en anverso y reverso ; la reproducción de los signos grabados en una
caparazón de tortuga utilizada para la adivinación , con el nombre de " tortue
mistique " ; una vista de la parte oriental de Parque de Gé-hol ; el ciclo
chino ; un <ábaco> ; el proceso de formación de letras ; reproducción de
diversas armas de artillería ; instrumentos musicales como flautas ,
violines , guitarras , trompetas , liras , gongs , tambores , campanas ; un
puente ; una aldea y sus habitantes ; la casa de un mandarín y diversas
melodías en llave de sol : Mon-lie-ouha , aires chinos y un aire musical
cantado en una chalupa china .
{ ~ } $

By the way, the final prompt above ({ ~ } $) is the Linux shell rather than
CQP's command line: CQP prints the error and the hit, then dies. The error
appears only after CQP has pegged a processor core at 100% for a good 30-45
seconds, whereas results for simple queries like this are normally returned
in milliseconds.
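
One observation about the number in the error message, offered as a guess
rather than a diagnosis: -2147479552 is exactly what you get when a request
for 2^31 + 4096 bytes is squeezed into a signed 32-bit integer. Whether CL
really stores the block size in such a variable is an assumption on my part;
the sketch below only checks the arithmetic. The name wrap_int32 is a
hypothetical helper, not anything from CWB.

```python
# Arithmetic check only. ASSUMPTION: the size passed to cl_realloc is held
# in a signed 32-bit integer somewhere; this is not confirmed from CWB source.

reported = -2147479552  # size printed in the cl_realloc error message


def wrap_int32(n: int) -> int:
    """Wrap an integer to signed 32-bit two's complement (hypothetical helper)."""
    return ((n + 2**31) % 2**32) - 2**31


# Reinterpret the negative value as an unsigned 32-bit quantity.
as_unsigned = reported & 0xFFFFFFFF

print(as_unsigned)           # 2147487744
print(as_unsigned - 2**31)   # 4096, i.e. the request was 2^31 + 4096 bytes
print(wrap_int32(as_unsigned) == reported)  # True
```

If that reading is right, the buffer was being grown past the 2 GiB mark when
the size went negative.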

Further testing has produced what are to me strange results. "árbol" works
*fine*, but "ébola" crashes CQP, as seen below:

CC-C> "ébola"
CL: Out of memory. (killed)
CL: [cl_realloc(block at 0x7f02d14b7010 to -2147479552 bytes)]

146356674:  SIDA y el <ébola> son corresponde y es falso ,
147036486:  pertenece a l mismo grupo de l mortal virus <ébola> .
178273950:  Hay muchas enfermedades , como el caso de l hanta , de l <ébola>
, de l lassa , de l dengue , etcétera , para las cuales no existen vacunas ,
y nuestro Instituto de Salud Pública podría enfrentar las suficientemente .
{ ~ } $

Other searches for words with a word-initial non-ASCII character, such as
"ácaro", have also produced crashes. But, as noted above with "árbol", at
least one doesn't.

The errors are also happening with words which have non-ASCII characters in
other places, such as "esdrújula".

Note that this corpus is UTF-8 encoded.

Any ideas? I've never had this problem before, and I still don't with other
corpora of similar size.

Cheers,
Scott



--
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/