<p><span style="" lang="EN-GB">Dear Sirs,</span></p>


<p><span style="" lang="EN-GB">First of all, I apologise if this
kind of message does not belong in this forum.</span></p>


<p><span style="" lang="EN-GB">I belong to the Linguistics Center
of the University of Lisbon (CLUL), a department of interdisciplinary
research, training and scientific promotion that is part of the University of Lisbon and reports directly to the
Faculdade de Letras.</span></p>


<p><span style="" lang="EN-GB">I&nbsp;will briefly
describe the corpora that we have compiled and the queries that we want to be
able to run, and then ask you a big favour: given your long
experience, can you tell me which database and interface would best meet our needs?</span></p>


<p><span style="" lang="EN-GB">Our corpus contains around 350M
words (2.5M spoken) and is a monitor corpus (in the sense of John
Sinclair): we add all the materials that we find available to it
without aiming for balance or representativeness. Based on this monitor
corpus, we have designed some smaller corpora that are variety- or
genre-specific, as well as a 50M-word balanced corpus. A subpart of our corpus (1M words) has been
automatically POS-tagged and manually revised. We do not have a syntactically or
semantically annotated corpus for the moment, but one could be developed from
the existing tagged corpus.</span></p>


<p style="margin: 0cm 0cm 0.0001pt;"><span style="" lang="EN-GB">Does the database query engine require the corpus to be indexed?</span></p>


<p style="margin: 0cm 0cm 0.0001pt;"><span style="" lang="EN-GB">If so, are there any requirements on the file format?</span></p>


<p style="margin: 0cm 0cm 0.0001pt;"><span style="" lang="EN-GB">Our corpus has been compiled since the 1970s, so the files are in very
different formats (txt, doc, html and others), and we need to ensure
that any corpus management and query software can process different file
formats. We also need software that can handle a large number of
files without requiring them to be merged into a single one. Finally, we
would need to know whether the system requires the data to be tokenized and, if
so, whether it includes a tokenizer.</span></p>


<p><span style="" lang="EN-GB">We have been using software
designed at our research center, but it lacks important
functionality and a user-friendly interface, which leads us
to look for other options for corpus management and exploitation. We want a program
that can be used both internally, by our research staff to manage and
search the corpus, and externally, to give access to the corpus through
our webpage.</span></p>


<p><span style="" lang="EN-GB">Corpus users (whether
internal or external via the web) must first be able to define
the subcorpus over which they want to run the search, based on several fields
such as written, spoken, tagged, newspaper or fiction, and on even more specific
criteria such as a single author or all authors born in the 19th century. A
50M-word balanced corpus will also be available for users who prefer a
pre-designed corpus.</span></p>


<p><span style="" lang="EN-GB">After designing or
selecting the corpus, the user will define the search. Besides the usual
queries (frequencies and concordances of words, parts of words and regular
expressions, with sorting of results), we would like to know whether the system allows
queries on POS tags (we have different tagsets for different corpora). Another
question concerns lemmatisation: would it be possible to integrate our
lemmatiser into the system in order to search for lemmas (based on our
database of lemmas and word forms)?</span></p>


<p><span style="" lang="EN-GB">We have done some work on
collocations for Portuguese based on a 50M-word corpus, and we would find it
useful to be able to offer this kind of search among the available queries. Our
software extracts n-grams from the corpus and sorts the results according to
their Mutual Information values. Would this be possible?</span></p>
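<p><span style="" lang="EN-GB">To illustrate the kind of query meant here, the following is a minimal Python sketch. It assumes bigrams only and the usual pointwise Mutual Information formula, log2(P(x,y) / (P(x) P(y))). It is not the CLUL software itself, and the names <code>rank_bigrams_by_pmi</code> and <code>min_count</code> are hypothetical.</span></p>

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Return (bigram, count, pmi) tuples sorted by descending PMI.

    Assumption: probabilities are plain relative frequencies, and
    bigrams seen fewer than `min_count` times are dropped, because
    PMI is unreliable for rare pairs.
    """
    if len(tokens) < 2:
        return []

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    ranked = []
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigram_counts[first] / n_unigrams
        p_second = unigram_counts[second] / n_unigrams
        pmi = math.log2(p_pair / (p_first * p_second))
        ranked.append(((first, second), count, pmi))

    # Highest PMI first; ties broken by raw frequency.
    ranked.sort(key=lambda item: (item[2], item[1]), reverse=True)
    return ranked


if __name__ == "__main__":
    sample = "o primeiro ministro disse que o primeiro ministro vai a Lisboa".split()
    for bigram, count, pmi in rank_bigrams_by_pmi(sample):
        print(bigram, count, round(pmi, 3))
```

<p><span style="" lang="EN-GB">A real system would presumably compute the counts from its index rather than from a token list held in memory, which is part of what we are asking about.</span></p>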


<p><span style="" lang="EN-GB">As for
response times, we believe they would depend strongly on the corpus
designed by the user. For external queries, we will have to limit the size of the
corpus that can be queried, since 350M words would probably make any query
over the internet fail (we have an indexed 11M-word corpus available for online queries
right now, and it performs well as long as no sorting and no lemma query is requested).
For internal queries, which may be complex, no time limit needs to be
set, but for online queries 10 seconds is probably a reasonable limit.</span></p>


<p><span style="" lang="EN-GB">I hope this gives you a clear
idea of our objectives and allows you to judge whether an existing system could serve
this purpose. If you have any questions, please contact me so that I
can explain our needs in more detail.</span></p>


<p><span style="" lang="EN-GB">Best regards,</span></p>


<p><span style="" lang="EN-GB">Pedro Sa</span></p>