[CWB] Corpus size and filtering
Stefan Evert
stefan.evert at uos.de
Wed Apr 9 23:24:36 CEST 2008
Hi Emiliano!
Thanks for sharing your experiences.
> I just finished installing ITWAC, DEWAC and UKWAC (about 2 billion
> words each) on a 64-bit machine (4 GB RAM, 20 GB swap).
> I haven't had the time to really test performance, but simple
> concordances run very fast (I can't notice any differences with my
> previous 32bit installation with corpora of 400 M words).
That sounds good!
> Encoding with cwb-encode went fine with all the corpora.
>
> I only found one problem during the indexation process of UKWAC.
> The corpus is made of 25 different chunks, with a total of
> 2,433,384,097 lines (including xml), and this made cwb-makeall
> choke (I tried many different memory settings, but none worked).
>
> The (unfortunate) solution was to leave out a part of the corpus:
> a total of 2,200,021,672 lines (including xml), or the first 22
> chunks of the corpus, were just fine for cwb-makeall.
>
> So maybe we are hitting an upper corpus-size limit for cwb-makeall
> that is between 2.2 and 2.4 billion lines (I don't consider the
> number of words because we are indexing the XML/structural
> attributes as well: <text id=""> and <s>).
It's worse than that ... you've hit the theoretical size limit of the
CWB. On a 64-bit machine, it's actually the number of tokens that
matters, since XML attributes are stored separately and do not count
against the theoretical limit.
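
To check a corpus against the limit, one would therefore count token lines rather than all lines. A minimal sketch in Python, assuming one-token-per-line input in which XML tags sit on lines of their own; the function name count_tokens and the "starts with < and ends with >" test are illustrative assumptions, not how cwb-encode itself decides what is a tag:

```python
def count_tokens(lines):
    """Count token lines in one-token-per-line input, skipping XML tag lines.

    Heuristic (an assumption for this sketch): a line that starts with '<'
    and ends with '>' is treated as an XML tag, everything else non-blank
    is treated as a token.
    """
    count = 0
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no token
        if stripped.startswith("<") and stripped.endswith(">"):
            continue  # structural markup, not counted as a token
        count += 1
    return count


# Six input lines, but only two of them are tokens.
sample = ['<text id="doc1">', "<s>", "Hello", "world", "</s>", "</text>"]
print(count_tokens(sample))
```

For a real corpus the same function could be fed an open file object instead of a list, so that the input is streamed line by line.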
The reason for the limit is that internally, tokens are addressed
with signed 32-bit integers, which translates into a maximum of 2^31
(approx. 2.1 billion) tokens. Unfortunately, there is no easy way to
get around this limit. We could go up to 2^32 (approx. 4.3 billion)
tokens if we used unsigned integers, but that would lead to enormous
compatibility problems (inside the CWB as well as with other software
using the CL library), and Web corpora will soon hit that ceiling as
well.
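
The arithmetic behind these figures can be sketched in a few lines of Python. The helper max_addressable is a hypothetical name used only for this illustration, and the exact cutoff enforced by the CWB tools may differ slightly from the round powers of two shown here:

```python
def max_addressable(bits=32, signed=True):
    """Number of distinct non-negative positions an integer type can address."""
    return 2 ** (bits - 1) if signed else 2 ** bits


SIGNED_LIMIT = max_addressable(32, signed=True)     # 2^31
UNSIGNED_LIMIT = max_addressable(32, signed=False)  # 2^32

print(f"signed 32-bit:   {SIGNED_LIMIT:,}")
print(f"unsigned 32-bit: {UNSIGNED_LIMIT:,}")

# Line counts quoted in this thread (XML lines included, so these are
# upper bounds on the token counts, not the token counts themselves).
all_25_chunks = 2_433_384_097
first_22_chunks = 2_200_021_672

for label, n in [("25 chunks", all_25_chunks), ("22 chunks", first_22_chunks)]:
    print(f"{label}: {n:,} lines, exceeds 2^31: {n > SIGNED_LIMIT}")
```

Both line counts exceed 2^31, which fits the point that the raw line count, XML included, is not the quantity that decides whether a corpus can be indexed.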
On our machine, I managed to encode the first 23 parts of ukwac-1.0
with slightly above 2.1 billion tokens -- very close to the
theoretical limit.
Best wishes,
Stefan