[CWB] Corpus size and filtering

Emiliano Guevara emiliano.guevara at unibo.it
Wed Apr 9 15:04:19 CEST 2008


Hi!

On 4 Apr 2008, at 02:00, Stefan Evert wrote:
>> I have to make available a corpus of about 400 million words  
>> online. Are there any known efficiency
>> issues with CWB when dealing with corpora this large that I should
>> take into consideration?
>
> 400 million words should just be okay, but you'll have to use the  
> cwb-make Perl script to build the index on a 32-bit machine  
> (because of memory limitations).  This is very close to the size  
> limit that CWB can handle on 32-bit platforms, but if you have a
> 64-bit machine, there won't be any problems (we've tested up to the
> theoretical limit of 2 billion words).
>
> A question to everyone else on the list: What's the largest corpus  
> you've used with the CWB? Did you run into problems? At what size  
> does performance begin to degrade?

I just finished installing ITWAC, DEWAC and UKWAC (about 2 billion
words each) on a 64-bit machine (4 GB RAM, 20 GB swap).
I haven't had time to test performance properly, but simple
concordances run very fast (I can't see any difference from my
previous 32-bit installation with corpora of 400 M words).

Encoding with cwb-encode went fine with all the corpora.

I ran into only one problem, while indexing UKWAC. The
corpus is made up of 25 chunks, with a total of 2,433,384,097
lines (including XML tags), and this made cwb-makeall choke (I tried
many different memory settings, but none worked).

The (unfortunate) solution was to leave out part of the corpus:
the first 22 chunks, a total of 2,200,021,672 lines (including
XML tags), were just fine for cwb-makeall.
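
For reference, the arithmetic behind these figures can be checked with a
few lines of Python. This is only a sketch: the comparison with 2**31 - 1
is an assumption based on the "theoretical limit of 2 billion words"
quoted above, not a confirmed cause of the cwb-makeall failure. Also note
that the counts are lines (tokens plus XML tags), so they overstate the
number of corpus positions.

```python
# Line counts reported in this message (token lines plus XML tag lines).
LINE_COUNTS = {
    "UKWAC, all 25 chunks (failed)": 2_433_384_097,
    "UKWAC, first 22 chunks (ok)": 2_200_021_672,
    "ITWAC (ok)": 2_049_857_284,
    "DEWAC (ok)": 1_815_463_881,
}

# Assumption: "2 billion" refers to the largest signed 32-bit integer.
INT32_MAX = 2**31 - 1


def summarize(counts=LINE_COUNTS, limit=INT32_MAX):
    """Return the number of lines left out of UKWAC and, for each corpus,
    whether its line count exceeds the assumed limit."""
    full = counts["UKWAC, all 25 chunks (failed)"]
    partial = counts["UKWAC, first 22 chunks (ok)"]
    dropped = full - partial
    return {
        "dropped_lines": dropped,
        "dropped_percent": round(100 * dropped / full, 1),
        "over_limit": {name: n > limit for name, n in counts.items()},
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

The 22-chunk version already has more lines than 2**31 - 1, so if that
value is the relevant limit, it must apply to tokens rather than to raw
input lines.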

So maybe we are hitting an upper corpus-size limit for cwb-makeall
that lies between 2.2 and 2.4 billion lines (I count lines rather than
words because we are indexing the XML/structural attributes
as well: <text id=""> and <s>).
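
To separate the two kinds of lines, something like the following could
be run over the input chunks. It is a minimal sketch with a hypothetical
function name, and it assumes one token or one XML tag per line, with
any line that starts with "<" and ends with ">" counted as a tag. That
heuristic would miscount a token that itself looks like a tag.

```python
from typing import Iterable, Tuple


def count_vertical_lines(lines: Iterable[str]) -> Tuple[int, int]:
    """Count (token_lines, tag_lines) in one-item-per-line input.

    Assumption: a non-empty line that starts with "<" and ends with ">"
    is an XML tag; every other non-empty line is a token.
    """
    tokens = 0
    tags = 0
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped.startswith("<") and stripped.endswith(">"):
            tags += 1
        else:
            tokens += 1
    return tokens, tags


# Tiny made-up sample, for illustration only.
sample = [
    '<text id="example">',
    "<s>",
    "Hello\tNN\thello",
    "world\tNN\tworld",
    "</s>",
    "</text>",
]

if __name__ == "__main__":
    print(count_vertical_lines(sample))  # (2, 4)
```

An open file object can be passed in directly, since the function only
iterates over lines and never holds the whole chunk in memory.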

One last remark: everything below 2.2 billion lines has worked
perfectly. ITWAC (2,049,857,284 lines) and DEWAC (1,815,463,881 lines)
didn't present any problems.

Bye,

E.


****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia

Homepage: http://morbo.lingue.unibo.it/

E-mail:   emiliano.guevara at unibo.it
           emiguevara at gmail.com
****************************************


