[CWB] CWB or other: does it meet the needs?

Tue Mar 11 14:04:12 CET 2008

Dear Sirs,

First of all, I apologise if this kind of message does not belong in this
forum.

I belong to the Linguistics Center of the University of Lisbon (CLUL), a
department of interdisciplinary research, training and scientific promotion
that is part of the University of Lisbon and reports directly to the
Faculdade de Letras.

I will briefly describe the corpora that we have compiled and the queries
that we want to be able to run, and then ask a big favour: drawing on your
long experience, can you tell me which database and interface would best
meet our needs?

Our corpus is around 350M words (2.5M spoken) and is a monitor corpus (in
the sense of John Sinclair). We add all the materials that we find available
to this corpus without aiming for balance or representativeness. Based on
this monitor corpus, we have designed some smaller corpora that are variety
or genre specific, and a 50M-word balanced corpus. A subpart of our corpus
(1M words) has been automatically POS-tagged and manually revised. We do not
have a syntactically or semantically annotated corpus for the moment, but
one could be developed from the existing tagged corpus.

Does the database query engine require the corpus to be indexed?

If so, are there any requirements on the file format?

Our corpus has been compiled since the 70s, so the files are in very
different formats (txt, doc, html, and others), and we need to ensure that
any corpus management and query software can process different file
formats. We also need software that can handle a large number of files
without requiring them to be merged into a single one. We would also need
to know whether the system requires the data to be tokenized and, if so,
whether it includes a tokenizer.

We have been using software designed at our research center, but the program
lacks important functionality and a good, user-friendly interface, and this
leads us to look for other options for corpus management and exploitation.
We want a program that can be used both internally, by our research staff
to manage and search the corpus, and externally, to give access to the
corpus through our webpage.

The corpus users (whether internal or external via the web) must first be
able to design the subcorpus over which they want to run the search, based
on several fields such as written, spoken, tagged, newspaper, fiction, and
on more specific criteria such as a given author or all authors born in the
19th century. A 50M-word balanced corpus will also be available if the user
wishes to use a pre-designed corpus.
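
To make the kind of subcorpus selection we have in mind more concrete, here
is a minimal sketch in Python. It is only an illustration: the record fields
(text_id, mode, genre, author, author_birth_year, pos_tagged) and the
function name select_subcorpus are hypothetical and do not describe our
current software or any existing system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextRecord:
    """Metadata for one corpus text (all field names are hypothetical)."""
    text_id: str
    mode: str                          # "written" or "spoken"
    genre: str                         # e.g. "newspaper", "fiction"
    author: str
    author_birth_year: Optional[int]   # None when unknown
    pos_tagged: bool


def select_subcorpus(records, mode=None, genre=None, author=None,
                     born_between=None, pos_tagged=None):
    """Return the records matching every criterion that is not None.

    born_between is an inclusive (first_year, last_year) pair.
    """
    selected = []
    for rec in records:
        if mode is not None and rec.mode != mode:
            continue
        if genre is not None and rec.genre != genre:
            continue
        if author is not None and rec.author != author:
            continue
        if pos_tagged is not None and rec.pos_tagged != pos_tagged:
            continue
        if born_between is not None:
            first, last = born_between
            year = rec.author_birth_year
            # Texts with an unknown birth year cannot satisfy the range.
            if year is None or not (first <= year <= last):
                continue
        selected.append(rec)
    return selected
```

For example, all written fiction by authors born in the 19th century would
be select_subcorpus(records, mode="written", genre="fiction",
born_between=(1801, 1900)).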

After designing or selecting the corpus, the user will define the search.
Besides the usual queries (frequencies and concordances of words, parts of
words, regular expressions, sorting, and so on), we would like to know
whether the system allows queries on POS tags (we have different tagsets
for different corpora). Another question concerns lemmatisation: would it
be possible to integrate our lemmatiser into the system in order to search
for lemmas (based on our database of lemmas and word forms)?

We have done some work on collocations for Portuguese based on a 50M-word
corpus, and we would find it interesting to be able to integrate this
search into the available queries. Our software extracts n-grams from the
corpus and sorts the results by their Mutual Information values. Would this
be possible?
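
As an illustration of this kind of extraction, here is a minimal sketch in
Python. It is not our actual software: it assumes bigrams only, uses
pointwise mutual information in base 2 estimated from raw relative
frequencies, and the function name bigram_pmi and the min_count threshold
are chosen purely for the example.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated as relative frequencies in the token list.
    Returns a list of ((w1, w2), pmi) sorted by descending PMI.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scored = []
    for (w1, w2), count in bigrams.items():
        # Rare pairs get unreliable (often inflated) PMI values.
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_tokens
        p_y = unigrams[w2] / n_tokens
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))

    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

The frequency threshold matters in practice, because Mutual Information
tends to favour very rare pairs when no minimum count is applied.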

As for response times, we believe they would depend strongly on the corpus
designed by the user. For external queries, we will have to limit the size
of the corpus that can be queried, since 350M words would probably make any
online query fail (we currently have an indexed 11M-word corpus available
for online queries, and it performs well as long as no sorting and no lemma
query is requested). For internal queries, which may be complex, no time
limit needs to be set, but for online queries 10 seconds is probably a
reasonable limit.

I hope this gives you a clear idea of our objectives and of whether any
existing system could be used for this purpose. If anything is unclear,
please contact me so that I can explain our concerns better.

Best regards,

Pedro Sa