[CWB] CL: Out of memory. (killed)
ssadowsky at gmail.com
Sat Apr 1 02:50:54 CEST 2017
On Fri, Mar 31, 2017 at 5:48 AM, Stefan Evert <stefanML at collocations.de>
wrote:
> As Andrew pointed out, the root cause of the problem is that your corpus
> seems to contain a sentence of several hundred million tokens (so it
> formats to over 2 GiB). This easily happens if there's a missing </s> tag
> somewhere in the middle and you encode with "-S s:0" (because the following
> sentences are nested in the one that hasn't been closed). You probably got
> warnings about missing </s> tags when you encoded the corpus, didn't you?
> If you can't be sure that the structural annotation in a corpus is
> well-formed XML, it's often better to do a flat encode with "-S s".
I encoded this corpus some years ago, so I have no recollection of what
warnings I received. But I can say this was the set of options I used:
-xsB -P lemma -P pos -P spos -P tag -P subtag -S s:0 -S p:0 -S
And I do indeed have -S s:0, as well as -S p:0 and even -S text:0+... From
reading the encoding tutorial, the :0 option seems to prevent nested
elements, which sounded like a good idea... at the time. Would it be
advisable to drop the :0 from all three elements above, or only from s:0?
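
One way to check whether a missing </s> really is the culprit is to scan the input file for unbalanced tags before re-encoding. What follows is only a rough sketch, not part of CWB: it assumes the usual one-token-per-line vertical format with each structural tag alone on its own line and no self-closing tags, and the function name find_unclosed is made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches a structural tag alone on a line, e.g. <s>, <s id="3"> or </s>.
TAG_RE = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)(?:\s[^>]*)?>\s*$")


def find_unclosed(lines: Iterable[str], tag: str = "s") -> List[Tuple[int, str]]:
    """Report <tag> regions that are opened twice, never closed, or closed
    without being opened. Returns (line_number, problem) pairs, 1-based."""
    problems: List[Tuple[int, str]] = []
    open_line = None  # line number of the currently open <tag>, if any
    for lineno, line in enumerate(lines, start=1):
        m = TAG_RE.match(line)
        if not m or m.group(2) != tag:
            continue  # token line, or a different structural tag
        if m.group(1) == "/":
            # Closing tag: only valid if a region is currently open.
            if open_line is None:
                problems.append((lineno, f"</{tag}> without matching <{tag}>"))
            open_line = None
        else:
            # Opening tag: the previous region should have been closed by now.
            if open_line is not None:
                problems.append(
                    (open_line, f"<{tag}> not closed before next <{tag}> at line {lineno}")
                )
            open_line = lineno
    if open_line is not None:
        problems.append((open_line, f"<{tag}> still open at end of input"))
    return problems
```

To run it on a real file, pass an open file handle, for example find_unclosed(open("corpus.vrt", encoding="utf-8"), tag="s"), where corpus.vrt is a placeholder name. Because the file is read line by line, memory use stays small even for a very large corpus.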
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile
ssadowsky gmail com
scsadowsky uc cl