[CWB] Ooops.. Cannot allocate memory :)

Stefan Evert stefanML at collocations.de
Sun Dec 12 23:07:48 CET 2010


> I was trying to import a.. erm.. big corpus.
> 7258905 translation units

How many tokens is that?
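
A rough token count can be read straight off the verticalized input files. The sketch below is only an approximation: it assumes one token per line and that every line starting with "<" is an XML tag. The helper name count_tokens is made up for illustration; source.cqp is the input file from the quoted commands.

```shell
# Count token lines in a verticalized file: every line that does not
# start with "<" is assumed to be one token (an approximation).
count_tokens() {
    grep -c -v '^<' "$1"
}

# Only runs if the input file from the quoted cwb-encode call is present.
if [ -f source.cqp ]; then
    echo "source.cqp: $(count_tokens source.cqp) tokens (approx.)"
fi
```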

> These ones did not complain:
> 
> Running [cwb-encode -c utf8 -d /home/corpora/eurlex_pt_en_tmx_pt -f source.cqp -R /usr/local/share/cwb/registry/eurlex_pt_en_tmx_pt -S tu+id]
> Running [cwb-make -v EURLEX_PT_EN_TMX_PT]
> Running [cwb-encode -c utf8 -d /home/corpora/eurlex_pt_en_tmx_en -f target.cqp -R /usr/local/share/cwb/registry/eurlex_pt_en_tmx_en -S tu+id]
> Running [cwb-make -v EURLEX_PT_EN_TMX_EN]

Yes, that should be fine if you didn't get any errors.

> So, hopefully, CWB imported the corpora.
> But, then...
> 
> Running [cwb-align-import -v align.txt]
> CWB::OpenFile: Can't open file/pipe '/tmp/imported_alignment.16716.gz' in mode '>': Cannot allocate memory at /usr/local/share/perl5/CWB.pm line 371

Sounds like Perl ran out of memory, but I find that rather surprising.  I assume you're using tu_id to identify the alignment regions?

The Perl script builds the entire alignment data structure in memory before writing it to disk (in case the alignment beads are unordered), so there may be problems with very large corpora.  But Perl should be able to handle a little over 7 million alignment beads.  However, if your IDs for the grid regions are very long strings, they may take up too much memory.  How big are the "tu_id.avs" files for the two corpora?
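
One way to answer the size question is sketched below. It assumes the "tu_id.avs" files sit in the data directories given with -d in the quoted cwb-encode calls, and that the file consists mostly of the ID strings, so that dividing its size by the number of translation units gives a rough average ID length. The helper name avg_id_bytes is hypothetical.

```shell
# Rough average number of bytes per region ID.
#   $1 = size of tu_id.avs in bytes, $2 = number of regions
avg_id_bytes() {
    echo $(( $1 / $2 ))
}

# Data directories taken from the quoted cwb-encode commands;
# 7258905 is the number of translation units reported above.
for dir in /home/corpora/eurlex_pt_en_tmx_pt /home/corpora/eurlex_pt_en_tmx_en; do
    f="$dir/tu_id.avs"
    if [ -f "$f" ]; then
        size=$(wc -c < "$f")
        echo "$f: $size bytes, about $(avg_id_bytes "$size" 7258905) bytes per ID"
    fi
done
```

If the average comes out at a few dozen bytes or less, the ID strings themselves are unlikely to be the main problem.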

Any chance that you're running 32-bit Perl? (You can find out with "perl -V | grep ptrsize" if you're not sure.)  I've found that I often need the 64-bit version when I do corpus work in Perl -- one of the best reasons for upgrading to Snow Leopard. :)
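
As an alternative to grepping the full "perl -V" output, the pointer size can be read from Perl's standard Config module. A ptrsize of 4 means a 32-bit build, which cannot address more than 4 GB per process; 8 means a 64-bit build. The helper name ptr_bits below is made up for this sketch.

```shell
# Convert a pointer size in bytes to an address width in bits.
ptr_bits() {
    echo $(( $1 * 8 ))
}

# ptrsize is a standard Config variable: 4 on 32-bit builds, 8 on 64-bit ones.
if command -v perl >/dev/null 2>&1; then
    ptrsize=$(perl -MConfig -e 'print $Config{ptrsize}')
    echo "perl is a $(ptr_bits "$ptrsize")-bit build"
fi
```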

Best,
Stefan

