[CWB] Corpus size and filtering

Wed Apr 9 23:24:36 CEST 2008

Hi Emiliano!

Thanks for sharing your experiences.

> I just finished installing ITWAC, DEWAC and UKWAC (about 2 billion  
> words each) on a 64bit machine (4Gb ram, 20 Gb swap).
> I haven't had the time to really test performance, but simple  
> concordances run very fast (I can't notice any differences with my  
> previous 32bit installation with corpora of 400 M words).

That sounds good!

> Encoding with cwb-encode went fine with all the corpora.
>
> I only found one problem during the indexation process of UKWAC.  
> The corpus is made of 25 different chunks, with a total of  
> 2,433,384,097 lines (including xml), and this made cwb-makeall  
> choke (I tried many different memory settings, but none worked).
>
> The (unfortunate) solution was to leave out a part of the corpus:
> a total of 2,200,021,672 lines (including xml), or the first 22  
> chunks of the corpus, were just fine for cwb-makeall
>
> So maybe we are hitting an upper corpus-size limit for cwb-makeall  
> that is between 2.2 and 2.4 billion lines (I don't consider the  
> number of words because we are indexing the XML/structural  
> attributes as well: <text id=""> and <s>).

It's worse than that ... you've hit the theoretical size limit of the  
CWB. On a 64-bit machine, it's actually the number of tokens that  
matters, since XML attributes are stored separately and do not count  
against the theoretical limit.

The reason for the limit is that internally, tokens are addressed  
with signed 32-bit integers, which translates into a maximum of 2^31  
(approx. 2.1 billion) tokens. Unfortunately, there is no easy way to  
get around this limit. We could go up to about 4 billion tokens if we  
used unsigned integers, but that would lead to enormous compatibility  
problems (inside the CWB as well as with other software using the CL  
library), and Web corpora will hit the 4 billion ceiling soon enough  
as well.
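
To put numbers on this, here is a minimal sketch in Python of the arithmetic only. It is not taken from the CWB sources; the helper name fits_signed_32 and the assumption that token positions are counted from 0 are mine, for illustration. The two line counts are the ones you reported, and since they include XML lines they are only upper bounds on the token counts.

```python
# Illustrative arithmetic only; not code from the CWB itself.
INT32_MAX = 2**31 - 1    # largest value a signed 32-bit integer can hold
UINT32_MAX = 2**32 - 1   # largest value an unsigned 32-bit integer can hold

# Line counts reported in this thread (tokens plus XML lines),
# so the real token counts are somewhat lower.
UKWAC_ALL_CHUNKS_LINES = 2_433_384_097
UKWAC_22_CHUNKS_LINES = 2_200_021_672


def fits_signed_32(n_tokens: int) -> bool:
    """True if positions 0 .. n_tokens-1 all fit in a signed 32-bit integer.

    Assumes positions are numbered from 0 (an assumption of this sketch).
    """
    return n_tokens - 1 <= INT32_MAX


if __name__ == "__main__":
    print("signed 32-bit ceiling:  ", INT32_MAX + 1, "tokens")
    print("unsigned 32-bit ceiling:", UINT32_MAX + 1, "tokens")
    for label, n in [("all 25 chunks", UKWAC_ALL_CHUNKS_LINES),
                     ("first 22 chunks", UKWAC_22_CHUNKS_LINES)]:
        print(label, "lines:", n, "| would fit as tokens:", fits_signed_32(n))
```

Note that even the 22-chunk line count is above 2^31, which is consistent with XML lines not counting against the limit: only the token count has to stay below the ceiling.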

On our machine, I managed to encode the first 23 parts of ukwac-1.0  
with slightly above 2.1 billion tokens -- very close to the  
theoretical limit.

Best wishes,
Stefan