[CWB] Corpus size and filtering

Rui Pedro Chaves ruipedrochaves at yahoo.com
Mon Mar 31 23:30:18 CEST 2008


Hi,
I've got a few general questions. 

I have to make a corpus of about 400 million words available online. Are there any known efficiency
issues with CWB when dealing with corpora this large that I should take into consideration?

Moreover, parts of the corpus contain XTML tags that are relevant for the researchers building the
corpus but not for the queries of online users at large. To deal with this, I suppose one would
need a filter that operates on the fly and ignores this special-purpose information, allowing
users to search and view the corpus as if the tags were not there. Does CWB support this kind of
mechanism (and if not, can it be implemented easily and efficiently)?
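
To make the idea concrete, here is a rough Python sketch of the kind of display-time filter I have
in mind. It is only an illustration and does not use any CWB API; the tag names "note" and "rev"
are hypothetical stand-ins for the researcher-only tags, and the function name is made up.

```python
import re

# Hypothetical names of the researcher-only tags; the real tag set would go here.
INTERNAL_TAGS = ("note", "rev")

# Matches opening and closing forms of the listed tags, with or without attributes.
_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(map(re.escape, INTERNAL_TAGS)))


def strip_internal_tags(text):
    """Remove researcher-only tags, keep the words they enclose, and tidy whitespace."""
    without_tags = _TAG_RE.sub(" ", text)
    return " ".join(without_tags.split())


if __name__ == "__main__":
    sample = 'the <note id="3">old</note> house <s>stands</s>'
    print(strip_internal_tags(sample))
```

Tags that are not on the list (such as <s> in the sample) are left untouched, so anything that is
useful to ordinary users would still be shown.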

Many thanks in advance,
Rui



