[CWB] Corpus size and filtering
Stefan Evert
stefan.evert at uos.de
Fri Apr 4 02:00:16 CEST 2008
> I have to make available a corpus of about 400 million words
> online. Are there any known efficiency
> issues with CWB when dealing with corpora this large that I should
> take into consideration?
400 million words should just be okay, but you'll have to use the
cwb-make Perl script to build the index on a 32-bit machine (because
of memory limitations). This is very close to the size limit that CWB
can handle on 32-bit platforms, but if you have a 64-bit machine,
there won't be any problems (we've tested up to the theoretical limit
of 2 billion words).
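For reference, the encoding and indexing steps might look roughly like this (corpus name, paths, and attribute names below are just placeholders for illustration, not a recipe for your particular setup):

```
# Encode a verticalized corpus with XML-aware input handling (-xsB),
# declaring positional (-P) and structural (-S) attributes:
cwb-encode -d /corpora/data/mycorpus -f mycorpus.vrt \
    -R /corpora/registry/mycorpus -xsB \
    -P pos -P lemma -S text:0+id -S s:0

# Build the full index with cwb-make instead of cwb-makeall,
# limiting memory use (here ~700 MB) so a 32-bit machine copes:
cwb-make -r /corpora/registry -M 700 MYCORPUS
```

The key point on 32-bit platforms is the -M option, which keeps the index-building passes within the available address space.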
A question to everyone else on the list: What's the largest corpus
you've used with the CWB? Did you run into problems? At what size
does performance begin to degrade?
> Moreover, parts of the corpus contain XHTML tags that are relevant
> for the researchers building the
> corpus but which are not important for the queries of online users
> at large. In order to deal with
> this, I suppose one would need a filter that operates on-the-fly
> and ignores this special-purpose
> information, allowing users to search and visualize the corpora as
> if no code were there. Does CWB
> support this kind of mechanism (and if not, can it be easily and
> efficiently implemented)?
Do I understand correctly that your corpus contains XML or XHTML tags
that you normally don't want to display to users?
The CWB doesn't treat XML tags as separate tokens (if you've declared
them properly), but simply remembers the start and end position of
each XML region. By default, the tags won't be displayed at all, but
you can switch on individual tags if you like, and you can use the
region boundaries in CQP queries.
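To illustrate both points, here is a small CQP sketch (it assumes a structural attribute called "head" has been declared at encoding time; the attribute and corpus names are purely illustrative):

```
MYCORPUS> show +head;                     # switch on display of <head> tags in the KWIC view
MYCORPUS> [word = "corpus"] within head;  # find "corpus" only inside <head> regions
MYCORPUS> <head> []+ </head>;             # match entire <head> regions using the stored boundaries
```

So the tags never appear as tokens in the query results; they only come back into view if you explicitly "show" them.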
The CQP Query Language and Corpus Encoding tutorials have some more
information on data formats and handling of XML tags.
Does this answer your questions?
Best wishes,
Stefan