[CWB] CWB or other: does it meet the needs?

Stefan Evert stefan.evert at uos.de
Sun Apr 20 18:00:36 CEST 2008


Dear Pedro,

sorry for the late reply ... as you can see, we're still a rather  
small group on this list and it's usually fairly quiet when people  
are too busy to check the list.

From your detailed list of requirements, I suspect that you're  
really looking for a commercial off-the-shelf solution. Modern  
relational databases (like Oracle) offer fairly sophisticated  
full-text search and have converters for all the file formats you mention.  
I don't think you would be able to perform POS and lemma queries in  
such a database, though. You might also be able to get a special  
service from the Sketch Engine adapted to your requirements. I  
suppose that both options would cost you a few thousand euros a year,  
but that's just a wild guess.

The Corpus Workbench does in principle support all your query  
requirements, but it is a specialised indexing and corpus query  
engine. It does not include any of the other components you might  
want in a corpus environment (perhaps you've been misled into  
thinking that because of its name): there are no input converters for  
different file formats, no GUI or Web interface, no database  
integration, no collocation feature etc. You will have to build such  
an environment yourself around the query engine.
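To give a flavour of what the query engine itself offers: once word, POS and lemma attributes are encoded, CQP queries can combine them freely. The corpus name, attribute names and tag below are assumptions (they depend on your own encoding and tagset):

```
DEMO;
[pos = "JJ"] [lemma = "corpus"];
"interess.*" %c;
```

The first line activates a corpus named DEMO, the second finds an adjective followed by any form of the lemma "corpus", and the third is a case-insensitive regular-expression search on word forms.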

The input format for the CWB is text files in one-word-per-line form,  
with linguistic annotations (typically POS and lemma) as additional  
TAB-delimited fields and XML tags for basic structural annotation on  
separate lines. If you can convert your corpus data to this format  
(i.e. convert .doc and .html to plain text, tokenise it, annotate  
with part-of-speech tags and add lemmas from your lemmatiser), then  
you can easily read it into the CWB and query it with CQP.
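The conversion step can be sketched in a few lines. This is only a minimal illustration of the target format, not a replacement for a real tokeniser or tagger; tag() and lemmatise() are placeholders for your own tools:

```python
import re

def tag(token):
    # placeholder: a real POS tagger would go here
    return "UNK"

def lemmatise(token):
    # placeholder: look the word form up in your lemma database
    return token.lower()

def to_vertical(text):
    """Turn plain text into CWB 'vertical' format: one token per line,
    TAB-separated pos and lemma columns, XML sentence tags on their own lines."""
    lines = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        lines.append("<s>")
        # naive tokenisation: split punctuation off word forms
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            lines.append("\t".join([token, tag(token), lemmatise(token)]))
        lines.append("</s>")
    return "\n".join(lines)

print(to_vertical("O corpus cresce. Ele muda!"))
```

A file written in this format (conventionally with a .vrt extension) can then be indexed with cwb-encode and queried with CQP.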


Of course, a number of groups have implemented Web interfaces /  
environments for their own purposes, or are working on such  
solutions. Most of these people are willing to share their code, but  
they cannot offer a complete off-the-shelf solution or a technical  
support hotline.

It might be a good idea to post your question on the corpora list --  
they will give you more details on Sketch Engine and other commercial  
options, and someone might know about an open-source system that  
meets your needs.

Kind regards,
Stefan Evert

> First of all, I'm very sorry if this kind of message is not  
> appropriate for this forum…
>
> I belong to the Linguistics Center of the University of Lisbon  
> (CLUL), a department for interdisciplinary research, training and  
> scientific outreach within the University of Lisbon, reporting  
> directly to the Faculdade de Letras.
>
> I will briefly describe the corpora that we have compiled and the  
> queries that we want to be able to run, and ask you a big favour:  
> could you tell me, drawing on your long experience, which database  
> and interface would best meet our needs?
>
> Our corpus contains around 350M words (2.5M spoken) and constitutes  
> a monitor corpus (in the sense of John Sinclair). We collect all the  
> materials that we find available into this corpus without aiming  
> for balance and representativeness. Based on this monitor corpus,  
> we have designed some smaller corpora that are variety- or  
> genre-specific, as well as a 50M-word balanced corpus. A subpart of  
> our corpus (1M words) has been automatically POS-tagged and revised.  
> We do not have a syntactically or semantically annotated corpus at  
> the moment, but one could be developed from the existing tagged  
> corpus.
>
> Does the database query engine require the corpus to be indexed?
> If so, are there any requirements on the file format?
> Our corpus has been compiled since the 1970s, so the files are in  
> very different formats (txt, doc, html, and others) and we need to  
> ensure that any corpus management and query software can process  
> different file formats. We also need software that accommodates a  
> large number of files without requiring us to merge them all into a  
> single file. We would also need to know whether the system requires  
> the data to be tokenised and, if so, whether it provides this  
> facility itself.
> We have been using software designed at our research center, but  
> the program lacks important functionality as well as a user-friendly  
> interface, which leads us to seek other options for corpus  
> management and exploitation. We want a program that can be used both  
> internally by our research staff to manage and search the corpus,  
> and externally to give access to the corpus through our webpage.
>
> The corpus users (whether internal or external via the web) must  
> first be able to define the subcorpus over which they want to run  
> the search, based on several fields such as written, spoken, tagged,  
> newspaper and fiction, and even more specific criteria such as a  
> single author or all authors born in the 19th century. A 50M-word  
> balanced corpus will also be available if the user prefers a  
> pre-designed corpus.
>
> After defining or selecting the corpus, the user will specify the  
> search. Besides the usual queries (frequencies and concordances of  
> words, parts of words, regular expressions…, sorting…), we would  
> like to know whether the system allows queries on POS tags (we have  
> different tagsets for different corpora). Another question concerns  
> lemmatisation: would it be possible to integrate our lemmatiser into  
> the system in order to search for lemmas (based on our  
> lemma/word-form database)?
>
> We have done some work on collocations for Portuguese based on a  
> 50M-word corpus, and we would find it interesting to integrate this  
> kind of search into the available queries. Our software extracts  
> n-grams from the corpus and ranks the results by their Mutual  
> Information values. Would this be possible?
>
> As for response times, we believe they will depend strongly on the  
> corpus defined by the user. For external queries, we will have to  
> limit the size of the corpus to be queried, since 350M words would  
> probably overwhelm any query attempt over the internet (we currently  
> have an indexed 11M-word corpus available for online queries, and it  
> performs fine as long as no sorting or lemma queries are requested).  
> For internal queries, which may be complex, no time limit needs to  
> be established, but for online queries, 10 seconds is probably a  
> good target.
>
> I hope this gives you a precise idea of our objectives and of the  
> feasibility of using an existing system for this purpose. If you  
> have any doubts, please contact me so I can explain our concerns in  
> more detail.
>



