[CWB] CL: Out of memory. (killed)

Stefan Evert stefanML at collocations.de
Sat Apr 1 08:58:34 CEST 2017


> On 1 Apr 2017, at 02:50, Scott Sadowsky <ssadowsky at gmail.com> wrote:
> 
> -xsB -P lemma -P pos -P spos -P tag -P subtag -S s:0 -S p:0 -S text:0+id+corpus+tagger+label+channel+audience+purpose+type+medium+field+area+location+source
> 
> And I do indeed have -S s:0, as well as -S p:0 and even -S text:0+... From reading the encoding tutorial, the :0 option seems to prevent nested elements, which sounded like a good idea... at the time. Would it be advisable to drop the :0 from all three elements above, or only from s:0?

If you have unvalidated input data and the elements in question are not supposed to be nested, it's in fact better to drop the :0.  The encoding tutorial makes the somewhat unrealistic assumption that the input file does not have any missing or superfluous tags.
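
One way to check for missing or superfluous tags before encoding is a small script along the following lines. This is only a rough sketch, not part of CWB: it assumes a verticalized input file with each XML tag on a line of its own, and the function name check_regions is made up for illustration.

```python
import re

# Matches a line consisting of a single start or end tag, e.g. <s>, </s>, <text id="x">.
# Assumption: tags sit on their own lines, as in a typical one-token-per-line input file.
TAG = re.compile(r"^<(/?)([A-Za-z_][\w-]*)[^>]*>\s*$")

def check_regions(lines, element="s"):
    """Return (line_number, message) pairs for unbalanced tags of one element type."""
    problems = []
    open_line = None  # line number of the currently open start tag, if any
    for number, line in enumerate(lines, start=1):
        m = TAG.match(line)
        if not m or m.group(2) != element:
            continue  # token line or a different element
        if m.group(1) != "/":
            # start tag: a region of this type must not already be open
            if open_line is not None:
                problems.append(
                    (number, f"<{element}> opened again, line {open_line} still open"))
            open_line = number
        else:
            # end tag: there must be an open region to close
            if open_line is None:
                problems.append((number, f"</{element}> without start tag"))
            open_line = None
    if open_line is not None:
        problems.append((open_line, f"<{element}> never closed"))
    return problems
```

It could be run once per element type, for instance check_regions(open("corpus.vrt", encoding="utf-8"), "s"), where corpus.vrt is a placeholder file name. An empty result means the tags of that type are balanced and not nested.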

If you're on Linux, you can do the following to find exceedingly long sentences:

	Temp = <s> [] expand to s;
	dump Temp > " | awk '($2 - $1 > 10000)' ";

which should display all <s> units that are longer than 10,000 tokens.
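
Where awk is not available, the same filter can be applied to a saved dump with a few lines of Python. This is a sketch under the same assumption the awk command makes, namely that each dump line starts with the corpus positions of match and matchend; the function name long_regions and the file name long_s.txt are made up for illustration.

```python
def long_regions(dump_lines, min_span=10000):
    """Yield (match, matchend) pairs with matchend - match > min_span."""
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        start, end = int(fields[0]), int(fields[1])
        # same condition as the awk filter: $2 - $1 > 10000
        if end - start > min_span:
            yield start, end
```

After saving the query result to a file (e.g. with dump Temp > "long_s.txt"; in CQP), the long regions could be listed with: for start, end in long_regions(open("long_s.txt")): print(start, end).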

Best,
Stefan

