[CWB] Corpus size and filtering
Emiliano Guevara
emiliano.guevara at unibo.it
Wed Apr 9 15:04:19 CEST 2008
Hi!
On 4 Apr 2008, at 02:00, Stefan Evert wrote:
>> I have to make available a corpus of about 400 million words
>> online. Is there any known efficiency
>> issues with CWB when dealing with corpora this large that I should
>> take into consideration?
>
> 400 million words should just be okay, but you'll have to use the
> cwb-make Perl script to build the index on a 32-bit machine
> (because of memory limitations). This is very close to the size
> limit that CWB can handle on 32-bit platforms, but if you have a 64-
> bit machine, there won't be any problems (we've tested up to the
> theoretical limit of 2 billion words).
>
> A question to everyone else on the list: What's the largest corpus
> you've used with the CWB? Did you run into problems? At what size
> does performance begin to degrade?
I just finished installing ITWAC, DEWAC and UKWAC (about 2 billion
words each) on a 64-bit machine (4 GB RAM, 20 GB swap).
I haven't had time to really test performance, but simple
concordances run very fast (I can't see any difference from my
previous 32-bit installation with corpora of 400 M words).
Encoding with cwb-encode went fine with all the corpora.
I only found one problem, while indexing UKWAC. The corpus consists
of 25 different chunks, with a total of 2,433,384,097 lines
(including XML), and this made cwb-makeall choke (I tried many
different memory settings, but none worked).
The (unfortunate) solution was to leave out part of the corpus:
the first 22 chunks, a total of 2,200,021,672 lines (including XML),
were just fine for cwb-makeall.
So maybe we are hitting an upper corpus-size limit for cwb-makeall
somewhere between 2.2 and 2.4 billion lines (I count lines rather
than words because we are indexing the XML/structural attributes
as well: <text id=""> and <s>).
A last remark: anything below 2.2 billion lines has worked
perfectly. ITWAC (2,049,857,284 lines) and DEWAC (1,815,463,881
lines) didn't present any problems.
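
As a rough sanity check on the figures above, here is a minimal Python
sketch that compares each line count with 2**31 - 1. That value is only
my assumption about what the "theoretical limit of 2 billion words"
means (the largest signed 32-bit integer); the names LINE_COUNTS,
ASSUMED_TOKEN_LIMIT, headroom and report are made up for illustration
and are not part of CWB. The counts are lines including XML tags, not
tokens, so this can only hint at where the limit lies.

```python
# Line counts (tokens plus XML tag lines) as reported in this message.
LINE_COUNTS = {
    "DEWAC": 1_815_463_881,
    "ITWAC": 2_049_857_284,
    "UKWAC (first 22 chunks)": 2_200_021_672,
    "UKWAC (all 25 chunks)": 2_433_384_097,
}

# Assumption: the "2 billion words" limit is the largest signed
# 32-bit integer. This is not confirmed anywhere in this thread.
ASSUMED_TOKEN_LIMIT = 2**31 - 1


def headroom(line_count, limit=ASSUMED_TOKEN_LIMIT):
    """Return limit minus line_count; negative means over the limit."""
    return limit - line_count


def report(counts=LINE_COUNTS):
    """Map each corpus name to its headroom under the assumed limit."""
    return {name: headroom(n) for name, n in counts.items()}


if __name__ == "__main__":
    for name, room in report().items():
        status = "under" if room >= 0 else "over"
        print(f"{name}: {abs(room):,} lines {status} the assumed limit")
```

Under this assumption the 22-chunk subset of UKWAC is already over the
limit in lines, yet it indexed fine. That may simply mean the XML tag
lines do not count towards the limit the way tokens do, but I have not
verified this.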
Bye,
E.
****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia
Homepage: http://morbo.lingue.unibo.it/
E-mail: emiliano.guevara at unibo.it
emiguevara at gmail.com
****************************************