[CWB] CL: Out of memory. (killed)

Scott Sadowsky ssadowsky at gmail.com
Sat Apr 1 02:50:54 CEST 2017


On Fri, Mar 31, 2017 at 5:48 AM, Stefan Evert <stefanML at collocations.de>
wrote:

Hi Stefan,

> As Andrew pointed out, the root cause of the problem is that your corpus
> seems to contain a sentence of several hundred million tokens (so it
> formats to over 2 GiB).  This easily happens if there's a missing </s> tag
> somewhere in the middle and you encode with "-S s:0" (because the following
> sentences are nested in the one that hasn't been closed).  You probably got
> warnings about missing </s> tags when you encoded the corpus, didn't you?
>
> If you can't be sure that the structural annotation in a corpus is
> well-formed XML, it's often better to do a flat encode with "-S s".


I encoded this corpus some years ago, so I have no recollection of what
warnings I received. But I can say this was the set of options I used:

-xsB -P lemma -P pos -P spos -P tag -P subtag -S s:0 -S p:0 -S
text:0+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source

And I do indeed have -S s:0, as well as -S p:0 and even -S text:0+... From
reading the encoding tutorial, the :0 option seems to prevent nested
elements, which sounded like a good idea... at the time. Would it be
advisable to drop the :0 from all three elements above, or only from s:0?
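
One way to check the source file for the missing </s> tag described in the quoted message, before re-encoding, is to scan it for <s> regions that are opened twice in a row or never closed. The following is only a minimal sketch, assuming the input is verticalized text with each structural tag on a line of its own; find_unclosed is a hypothetical helper written for illustration and is not part of CWB.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Matches a structural tag that occupies a whole line,
# e.g. <s>, <s id="3"> or </s>. Token lines do not match.
TAG_RE = re.compile(r"^<(/?)([A-Za-z_][\w-]*)(?:\s[^>]*)?>\s*$")


def find_unclosed(lines: Iterable[str], tag: str = "s") -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for unbalanced <tag> regions."""
    problems = []
    open_line = None  # line number of the currently open <tag>, if any
    for lineno, line in enumerate(lines, start=1):
        m = TAG_RE.match(line.rstrip("\n"))
        if not m or m.group(2) != tag:
            continue  # a token line or a different structural tag
        if m.group(1) != "/":
            # Opening tag: the previous region should already be closed.
            if open_line is not None:
                problems.append((open_line, "opened but not closed before next open"))
            open_line = lineno
        else:
            # Closing tag: there should be a region to close.
            if open_line is None:
                problems.append((lineno, "close without matching open"))
            open_line = None
    if open_line is not None:
        problems.append((open_line, "opened but never closed"))
    return problems


if __name__ == "__main__":
    # Usage (the file name is a placeholder): python check_s.py corpus.vrt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for lineno, problem in find_unclosed(fh):
            print(f"line {lineno}: <s> {problem}")
```

The same function can be called with tag="p" or tag="text" to check the other two regions declared with :0. It streams the file line by line, so it should not need much memory even for a large corpus.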

Cheers,
Scott

-- 
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/