[CWB] CL: Out of memory. (killed)
Scott Sadowsky
ssadowsky at gmail.com
Sat Apr 1 02:50:54 CEST 2017
On Fri, Mar 31, 2017 at 5:48 AM, Stefan Evert <stefanML at collocations.de>
wrote:
Hi Stefan,
> As Andrew pointed out, the root cause of the problem is that your corpus
> seems to contain a sentence of several hundred million tokens (so it
> formats to over 2 GiB). This easily happens if there's a missing </s> tag
> somewhere in the middle and you encode with "-S s:0" (because the following
> sentences are nested in the one that hasn't been closed). You probably got
> warnings about missing </s> tags when you encoded the corpus, didn't you?
>
> If you can't be sure that the structural annotation in a corpus is
> well-formed XML, it's often better to do a flat encode with "-S s".
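
The missing </s> situation described in the quoted text can be looked for in the input file before re-encoding. What follows is only a rough Python sketch and not part of CWB: it assumes the input is a verticalized text file with each structural tag on a line of its own, and the function name find_unbalanced is made up for illustration. It reports every opening tag that appears while a region of the same name is still open, plus stray closing tags and a region left open at the end of the file.

```python
import re


def find_unbalanced(lines, name="s"):
    """Return (line_number, message) pairs for suspicious <name> tags.

    Hypothetical helper, not a CWB tool. Assumes one tag per line, as in
    typical verticalized input, and does not handle self-closing tags.
    """
    open_re = re.compile(r"<%s(\s[^>]*)?>\s*$" % re.escape(name))
    close_re = re.compile(r"</%s>\s*$" % re.escape(name))
    problems = []
    open_at = None  # line number of the region that is currently open

    for lineno, line in enumerate(lines, start=1):
        if open_re.match(line):
            if open_at is not None:
                # A new region starts before the previous one was closed.
                problems.append(
                    (lineno, "nested open; previous open at line %d" % open_at)
                )
            open_at = lineno
        elif close_re.match(line):
            if open_at is None:
                problems.append((lineno, "close without open"))
            open_at = None

    if open_at is not None:
        problems.append((open_at, "never closed"))
    return problems
```

To run it over a file, open the file and pass the file object as lines, for example find_unbalanced(open("corpus.vrt", encoding="utf-8")), where corpus.vrt is a placeholder name. The same check can be repeated with name="p" or name="text" for the other regions.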
I encoded this corpus some years ago, so I have no recollection of what
warnings I received. But I can say this was the set of options I used:
-xsB -P lemma -P pos -P spos -P tag -P subtag -S s:0 -S p:0 -S
text:0+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source
And I do indeed have -S s:0, as well as -S p:0 and even -S text:0+... From
reading the encoding tutorial, the :0 option seems to prevent nested
elements, which sounded like a good idea... at the time. Would it be
advisable to drop the :0 from all three elements above, or only from s:0?
Cheers,
Scott
--
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile
ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/