[CWB] Corpus size and filtering

Stefan Evert stefan.evert at uos.de
Fri Apr 4 02:00:16 CEST 2008


> I have to make available a corpus of about 400 million words  
> online. Is there any known efficiency
> issues with CWB when dealing with corpora this large that I should  
> take into consideration?

400 million words should just about be okay, but you'll have to use  
the cwb-make Perl script to build the index on a 32-bit machine  
(because of memory limitations).  This is very close to the size limit  
that the CWB can handle on 32-bit platforms, but if you have a 64-bit  
machine, there won't be any problems (we've tested up to the  
theoretical limit of 2 billion words).
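For illustration, indexing with cwb-make might look like the sketch below; the corpus name, registry path and memory limit are made up, so adapt them to your installation:

```sh
# Build all indexes and compress the corpus with the cwb-make wrapper
# (part of the CWB/Perl tools) instead of calling cwb-makeall directly,
# keeping memory usage within the given limit in MB.
# MYCORPUS and the registry path are placeholders for illustration.
cwb-make -r /corpora/registry -M 500 MYCORPUS
```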

A question to everyone else on the list: What's the largest corpus  
you've used with the CWB? Did you run into problems? At what size  
does performance begin to degrade?

> Moreover, parts of the corpus contain XHTML tags that are relevant  
> for the researchers building the corpus, but which are not important  
> for the queries of online users at large. In order to deal with  
> this, I suppose one would need a filter that operates on the fly  
> and ignores this special-purpose information, allowing users to  
> search and visualize the corpora as if no code were there. Does CWB  
> support this kind of mechanism (and if not, can it be easily and  
> efficiently implemented)?

Do I understand correctly that your corpus contains XML or XHTML tags  
that you normally don't want to display to users?

The CWB doesn't treat XML tags as separate tokens (if you've declared  
them properly), but simply remembers the start and end position of  
each XML region. By default, the tags won't be displayed at all, but  
you can switch on individual tags if you like, and you can use the  
region boundaries in CQP queries.
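As a rough sketch of what that looks like in practice (the tag name "note", the file names and the query are invented for illustration):

```sh
# At encoding time, declare each tag as a structural attribute with -S,
# so the CWB stores only the start/end positions of its regions
# (paths, corpus name and tag names are placeholders).
cwb-encode -d /corpora/data/mycorpus -f mycorpus.vrt \
           -R /corpora/registry/mycorpus -S s -S note
```

Inside CQP, the tags are then hidden by default; a user who does want to see them can switch an individual tag on, and anyone can query against the region boundaries:

```
show +note;                 # display <note>...</note> tags in query results
"interesting" within note;  # find matches only inside <note> regions
```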

The CQP Query Language and Corpus Encoding tutorials contain more  
information on data formats and the handling of XML tags.

Does this answer your questions?

Best wishes,
Stefan
