<div dir="ltr">When it rains, it pours, I guess!<div><br></div><div>I have a fairly large corpus (880 million words) which I've been using for some time without incident (this is NOT related to the corpus I asked about yesterday, the processing of which topped out at 2^31 tokens). </div><div><br></div><div>Unfortunately, I've just happened upon a specific word which, when I search for it with CQP, crashes the program with the following error:</div><div><br></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)">CC-C> "ábaco"
</span><br>CL: Out of memory. (killed) <br>CL: [cl_realloc(block at 0x7f7e78c99010 to -2147479552 bytes)]
<br>
<br><span style="color:rgb(24,24,178)">135515175:</span><span style="color:rgb(0,0,0)"> Ahí aparecen : un retrato iluminado de l mandarín Van-ta-gin ; un junco ; un molino de arroz ; los retratos iluminados de un chino y un hoten</span><br>tote ; diversos caracteres de la escritura china ; la reproducción de una moneda en anverso y reverso ; la reproducción de los signos grabados en una cap<br>arazón de tortuga utilizada para la adivinación , con el nombre de " tortue mistique " ; una vista de la parte oriental de Parque de Gé-hol ; el ciclo ch<br>ino ; un <<span style="color:rgb(255,255,255);background-color:rgb(0,0,0)">ábaco</span><span style="color:rgb(0,0,0)">> ; el proceso de formación de letras ; reproducción de diversas armas de artillería ; instrumentos musicales como flautas , violines , gu</span><br>itarras , trompetas , liras , gongs , tambores , campanas ; un puente ; una aldea y sus habitantes ; la casa de un mandarín y diversas melodías en llave <br>de sol : Mon-lie-ouha , aires chinos y un aire musical cantado en una chalupa china .
<br><span style="font-weight:bold;color:rgb(84,84,255)">{ ~ }</span><span style="font-weight:bold;color:rgb(84,255,84)"> $</span><br></span></div><div><br clear="all"><div>Note that the final prompt above is the Linux shell, not CQP's command line: CQP has exited by that point. The error appears after a processor core has been pegged at 100% for a good 30-45 seconds, whereas results for simple queries like this are normally returned in milliseconds.</div><div><br></div><div>Further testing has produced results that seem strange to me: "árbol" works <u>fine</u>, but "ébola" crashes CQP, as seen below:</div><div><br></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0)">CC-C> "ébola"
</span><br>CL: Out of memory. (killed) <br>CL: [cl_realloc(block at 0x7f02d14b7010 to -2147479552 bytes)]
<br>
<br><span style="color:rgb(24,24,178)">146356674:</span><span style="color:rgb(0,0,0)"> SIDA y el <</span><span style="color:rgb(255,255,255);background-color:rgb(0,0,0)">ébola</span><span style="color:rgb(0,0,0)">> son corresponde y es falso ,
</span><br><span style="color:rgb(24,24,178)">147036486:</span><span style="color:rgb(0,0,0)"> pertenece a l mismo grupo de l mortal virus <</span><span style="color:rgb(255,255,255);background-color:rgb(0,0,0)">ébola</span><span style="color:rgb(0,0,0)">> .
</span><br><span style="color:rgb(24,24,178)">178273950:</span><span style="color:rgb(0,0,0)"> Hay muchas enfermedades , como el caso de l hanta , de l <</span><span style="color:rgb(255,255,255);background-color:rgb(0,0,0)">ébola</span><span style="color:rgb(0,0,0)">> , de l lassa , de l dengue , etcétera , para las cuales no existen vacunas ,</span><br> y nuestro Instituto de Salud Pública podría enfrentar las suficientemente .
<br><span style="font-weight:bold;color:rgb(84,84,255)">{ ~ }</span><span style="font-weight:bold;color:rgb(84,255,84)"> $</span><br></span></div><br>Searches for other words with word-initial non-ASCII characters, such as "ácaro", have also crashed CQP. But, as seen above with "árbol", at least one such word works.</div><div><br></div><div>The crashes also occur with words that have non-ASCII characters in other positions, such as "esdrújula".</div><div><br></div><div>Note that this corpus is UTF-8 encoded.</div><div><br></div><div>Any ideas? I've never had this problem before, and I still don't have it with other corpora of similar size.</div><div><br></div><div>Cheers,</div><div>Scott</div><div><br></div><div><br><br>--<div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div style="font-size:12.7273px">Dr. Scott Sadowsky<br>Assistant Professor of Linguistics</div><div dir="ltr" style="font-size:12.7273px">Pontificia Universidad Católica de Chile<br></div><div dir="ltr" style="font-size:12.7273px"><br></div><div dir="ltr" style="font-size:12.7273px">ssadowsky gmail com</div><div dir="ltr" style="font-size:12.7273px">scsadowsky uc cl<br><a href="http://sadowsky.cl/" target="_blank">http://sadowsky.cl/</a></div><div dir="ltr" style="font-size:12.7273px"> </div></div></div></div></div></div></div></div></div></div></div>
</div></div>