[CWB] Corpus size and filtering

Mon Mar 31 23:30:18 CEST 2008

Hi,
I've got a few general questions. 

I have to make a corpus of about 400 million words available online. Are there any known efficiency
issues with CWB when dealing with corpora this large that I should take into consideration?

Moreover, parts of the corpus contain XTML tags that are relevant for the researchers building the
corpus but not important for the queries of online users at large. To deal with this, I suppose one
would need a filter that operates on the fly and ignores this special-purpose information, allowing
users to search and view the corpus as if the markup were not there. Does CWB support this kind of
mechanism (and if not, can it be implemented easily and efficiently)?

Many thanks in advance,
Rui
