[CWB] [CQPWeb] problem of memory

Sylvain Loiseau sylvain.loiseau at wanadoo.fr
Wed Feb 8 17:47:47 CET 2012


Thank you for the answers.
This is a large corpus (> 500,000,000 tokens) on a 64-bit server; after running make-all, the problem disappeared.

I first tried to use cwb-huffcode, but it failed on the second attribute (pos):

---8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<

sloiseau at projldi:~/corpus/CWB/data/le_monde$ cwb-huffcode -A -r /home/sloiseau/corpus/CWB/registry/ le_monde
COMPRESSING TOKEN STREAM of le_monde.word
- writing code descriptor block to /home/sloiseau/corpus/CWB/data/le_monde/word.hcd
- writing compressed item sequence to /home/sloiseau/corpus/CWB/data/le_monde/word.huf
- writing sync (every 128 tokens) to /home/sloiseau/corpus/CWB/data/le_monde/word.huf.syn
VALIDATING le_monde.word
- reading code descriptor block from /home/sloiseau/corpus/CWB/data/le_monde/word.hcd
- reading compressed item sequence from /home/sloiseau/corpus/CWB/data/le_monde/word.huf
- reading sync (mod 128) from /home/sloiseau/corpus/CWB/data/le_monde/word.huf.syn
!! You can delete the file </home/sloiseau/corpus/CWB/data/le_monde/word.corpus> now.
COMPRESSING TOKEN STREAM of le_monde.pos
mmapfile()<storage.c>: Can't mmap() file /home/sloiseau/corpus/CWB/data/le_monde/pos.corpus ...
	You have probably run out of memory / address space!
	Error Message: Cannot allocate memory
attributes:load_component(): Warning:
  Data of CORPUS component of attribute pos can't be loaded
Computation of huffman codes needs the CORPUS component

---8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<
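One way to narrow down an mmap() failure like the one in this log is to compare the size of the file with the limits of the shell that runs the tool. The sketch below is only an illustration: the function name check_mmap_limits is made up here and is not part of CWB, and it assumes a Linux-like system where getconf, ulimit -v and wc are available. Note that LONG_BIT describes the system's default ABI, not the CWB binaries themselves.

```shell
#!/bin/bash
# Hypothetical helper (not part of CWB): print facts relevant to an mmap() failure.
check_mmap_limits() {
    local file="$1"
    # Word size of the system's default ABI (32 or 64).
    echo "LONG_BIT: $(getconf LONG_BIT)"
    # Per-process virtual memory limit in kB, or "unlimited".
    echo "virtual memory limit: $(ulimit -v)"
    if [ -f "$file" ]; then
        # Size of the file that mmap() would have to map.
        echo "file size (bytes): $(wc -c < "$file")"
    else
        echo "file not found: $file"
    fi
}

# Default path is the file named in the error message above.
check_mmap_limits "${1:-/home/sloiseau/corpus/CWB/data/le_monde/pos.corpus}"
```

If the reported limit is a number smaller than the file size, or the word size is 32, address space rather than physical RAM is the more likely culprit.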

Best and thanks again,
Sylvain

On 8 Feb 2012, at 12:00, Stefan Evert wrote:

> 
>> There are two possible causes of this error message. One is that, as it says, you have run out of memory. This is unlikely to be the case, but maybe (I'm speculating here) if your webserver runs under a username with restrictions on how much RAM it can use at once, you might get this - I assume a file called le_monde/word.corpus.rev is going to be rather big!
> 
> Or is it possible that you're using a 32-bit version of the CWB, which is likely to run out of address space for very large corpora?
> 
> This explanation is quite plausible because you don't seem to have compressed the index files (cwb-huffcode and cwb-compress-rdx; or simply run cwb-make from the CWB/Perl package).  Even with a 64-bit CWB, compression is highly recommended!
> 
>> The other possible cause is that the file exists, but is empty (or, perhaps, is not readable by the webserver's username??). So you should check that out as well.
> 
> I don't think this is possible, because then the previous open() [line 301] should already fail. Also note that empty files are virtually mapped beyond the end of file (MMAP_EMPTY_LEN bytes) in order to avoid throwing spurious errors.
> 
>> (NB to self (and Stefan), this is in cl/storage.c, see lines 316 and 350 to 354 - and I am not sure why the __svr4__ macro is used at the latter point; POSIX-compliant systems should have MAP_FAILED defined and therefore the presence of the #ifdef seems pointless.)
> 
> Probably remnants from the good old times when POSIX could still have been the name of a gentlemen's magazine.
> 
> I guess it's about time that we require full ANSI C + POSIX compliance and throw out all the #ifdef's that work around bugs in other platforms.
> 
> Cheers,
> Stefan
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
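The compression step recommended in the quoted reply (cwb-huffcode and cwb-compress-rdx) could be scripted roughly as follows. This is a sketch under assumptions: the -A and -r flags are taken from the cwb-huffcode call shown in this thread and are assumed to work the same way for cwb-compress-rdx; the function name compress_corpus and the RUN, REGISTRY and CORPUS variables are invented for this example. By default the script only prints the commands, so it can be inspected without CWB installed.

```shell
#!/bin/bash
# Sketch: print (or, with RUN=1, execute) the compression commands for one corpus.
# Defaults are the registry path and corpus name used earlier in this thread.
REGISTRY="${REGISTRY:-/home/sloiseau/corpus/CWB/registry/}"
CORPUS="${CORPUS:-le_monde}"

compress_corpus() {
    local tool
    for tool in cwb-huffcode cwb-compress-rdx; do
        # -A: all positional attributes; -r: registry directory.
        # Assumption: cwb-compress-rdx accepts the same two flags as cwb-huffcode.
        if [ "${RUN:-0}" = "1" ]; then
            "$tool" -A -r "$REGISTRY" "$CORPUS" || return 1
        else
            echo "$tool -A -r $REGISTRY $CORPUS"
        fi
    done
}

compress_corpus
```

As the log above shows, cwb-huffcode only reports that the uncompressed file can be deleted; it does not appear to remove it itself, so the disk space is freed only after a manual cleanup.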


