[CWB] Multi-word units

Josep M. Fontana josepm.fontana at upf.edu
Wed Feb 20 09:56:05 CET 2013


Thank you very much, Diana.
We will have a close look at the information you sent us. It looks like
you guys have confronted many of the problems we have found besides the 
issue of the multi-word expressions. Of course, Portuguese has those 
pesky clitics as well and those have caused us some headaches, too. As I 
said, we'll study the solutions you have adopted and come back with some 
questions if there is something we don't understand. Again, thanks for 
your help.

JM
> Sorry that I have not answered before, I thought about it and actually
> intended to.
> You may want to look at how the AC/DC project
> (http://www.linguateca.pt/ACDC/) solved this and related issues. We
> are happy with the solutions we came up with, and I think the same is
> true of our users.
> See our 2000 paper in LREC for a start, and the documentation in the
> Web pages. It is of course a bronze age CWB we were dealing with, but
> I hope you will be able to update the solutions if you like them.
>
> best
> Diana
>
> Diana Santos & Eckhard Bick. "Providing Internet access to Portuguese
> corpora: the AC/DC project". In Maria Gavrilidou, George Carayannis,
> Stella Markantonatou, Stelios Piperidis & Gregory Stainhauer (eds.),
> Proceedings of the Second International Conference on Language
> Resources and Evaluation (LREC 2000) (Athens, Greece, 31 May to 2
> June 2000), pp. 205-210.
> http://www.linguateca.pt/documentos/SantosBickLREC2000.pdf
>
>
> 2013/2/19 Josep M. Fontana <josepm.fontana at upf.edu>:
>> Sorry. I should have considered this possibility. I apologize. I don't mean
>> to rush you guys. Just some confirmation that this is a legitimate concern,
>> and that you think the question is worth giving some thought to, would
>> satisfy us. Just knowing that this or a similar solution might be
>> possible to implement in the future would already be useful for us since it
>> would justify making some particular choices right now.
>>
>> One of the avenues we are considering if the solution we suggest were at all
>> possible is to pre-process our tokenized texts so that all the possible
>> multiword expressions would be added to the dictionary for our tagger with
>> the appropriate labels; something like:
>>
>> Saint_Anselm, Sir_Lancelot_of_the_lake, in_order_to, etc.
>>
>> This would at least allow the tagger to learn about the existing multi-word
>> units and their distribution and tag the texts with what is for us the most
>> important information. We want to parse the resulting corpus. So making the
>> encoding of the specific syntactic information about relationships between
>> different expressions as simple and intuitive as possible is our main
>> concern at this point. Later, if the solution we suggest or a different one
>> that achieved the same goals were available, we would reprocess the texts of
>> the corpus to include the information about the different components of the
>> multi-word expressions by eliminating the '_' and adding the new labels for
>> the individual words via some encoding scheme possibly involving XML.
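A minimal sketch of the pre-processing step described above, in Python. It assumes the texts are already tokenized into lists of strings and that the multi-word expressions are available as tuples of tokens. The function names `join_mwes` and `split_mwe` are hypothetical, and the greedy longest-match strategy is just one possible choice, not something settled in this thread.

```python
from typing import Iterable


def join_mwes(tokens: list[str], mwes: Iterable[tuple[str, ...]]) -> list[str]:
    """Merge known multi-word expressions into single '_'-joined tokens."""
    known = {tuple(m) for m in mwes}
    max_len = max((len(m) for m in known), default=1)
    out: list[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest possible expression starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in known:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No expression starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def split_mwe(token: str) -> list[str]:
    """Recover the individual words of a '_'-joined token."""
    return token.split("_")


mwes = [("in", "order", "to"), ("Saint", "Anselm")]
tokens = ["He", "came", "in", "order", "to", "see", "Saint", "Anselm"]
joined = join_mwes(tokens, mwes)
print(joined)
```

Splitting on '_' is only safe if no original token contains an underscore, so a real pipeline would probably want a rarer separator or an explicit marker.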
>>
>>
>> JM
>>
>> On my part it means it is a more complex question than I have had time to
>> write an email about yet!
>>
>>
>>
>> Andrew.
>>
>>
>>
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
>> Behalf Of Josep M. Fontana
>> Sent: 19 February 2013 10:39
>> To: cwb at sslmit.unibo.it
>> Subject: Re: [CWB] Multi-word units
>>
>>
>>
>> Hi again,
>>
>> There hasn't been any reply to our previous message from anybody in the
>> list. Does this mean this problem has no possible solution within CQP? Would
>> the method we suggested be too hard or impossible to implement? We would
>> really appreciate your input because we have to decide at this point how
>> to pre-process our texts, and depending on the options we have with CQP
>> we would go one way or another. Thanks for all your help.
>>
>> Josep M
>>
>> Hi Andrew and Stefan. I work with Eva and now it is my turn to write. First
>> thanks for your help.
>> Your answers have given us some ideas that we explain below. What we don't
>> really know is what potential pitfalls the implementation we suggest would
>> have for processing via CQP. Below we'll try to explain why we would
>> want to do it the way we are proposing.
>>
>> But this would break the alignment between the two attributes, if one has
>> two tokens and the other only a single token, wouldn't it?
>>
>> I was thinking of this kind of arrangement:
>>
>>
>>
>> apressurada        apressuradamientre
>>
>> mientre    {some kind of ditto mark or just __NULL__}
>>
>>
>>
>> .... so that subsequent tokens on the two attributes stay in sync.
>>
>>
>>
>> OR, going the other way
>>
>>
>>
>> apressuradamientre apressurada mientre
>>
>>
>>
>> I'm quite open to alternatives, though the XML way strikes me as liable to
>> cause trouble.
>>
>>
>> OK, first: Andrew's suggestion in (a) below is less likely to cause
>> problems, but it would be a bit less desirable because, with something like
>> the following, we would miss the fact that the two words for all intents
>> and purposes work as a single unit. To give you an idea, it is exactly as
>> if the same texts contained both "hurriedly" and "hurried ly". So, by
>> default we want these multi-word expressions to be found as a single unit
>> any time a user searches for an adverb or for the lemma 'apresuradamente'.
>>
>> (a)
>>
>> apressurada      apressuradamientre
>>
>> mientre  {some kind of ditto mark or just __NULL__}
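As an illustration, arrangement (a) can be generated mechanically. The Python sketch below assumes a two-column, tab-separated, one-token-per-line layout and uses `__NULL__` as the filler. The function names `ditto_rows` and `to_vertical` are hypothetical.

```python
def ditto_rows(parts, whole, filler="__NULL__"):
    """One row per written part; the whole-unit form sits on the first row only.

    Every row has the same number of columns, so later tokens stay in sync
    across the two attributes.
    """
    return [(part, whole if i == 0 else filler) for i, part in enumerate(parts)]


def to_vertical(rows):
    """Serialize rows as tab-separated lines, one token per line."""
    return "\n".join("\t".join(row) for row in rows)


rows = ditto_rows(["apressurada", "mientre"], "apressuradamientre")
print(to_vertical(rows))
```

The filler value is a placeholder only; whether a ditto mark or `__NULL__` is preferable is left open in the message above.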
>>
>>
>> Andrew's suggestion in (b) below would overcome this problem but then we
>> don't really know how it could be implemented in CQP. What we usually have
>> in our tagged corpora are entries with 3 columns: 1) the form, 2) the lemma
>> and 3) the POS tag. So (b) would be problematic because there is apparently
>> no way to say that the lemma is in fact 'apresuradamente' and that
>> "apressurada mientre" is a multi-word instance/form of that lemma.
>> Furthermore, for reasons that have to do with the kind of research potential
>> users of this corpus are likely to do, it would be ideal to consider the two
>> parts of the multi-word expression also as two independent words, each one
>> with its lemma and its part of speech. This is so because, in this
>> particular example of adverbs with -mente, in the early stages of the change
>> that resulted in the creation of the current manner adverbs, the strings
>> with the two forms could have been ambiguous between a single adverb (the
>> interpretation we want to be the default interpretation when doing a normal
>> search) and two independent words: one an adjective and the other a noun.
>> So, 'apresurada' (which means 'hurried') is not a really good example for
>> this development but in the earlier stages of this change, the string
>> "fuerte mientre" (lit. "strong mind") could literally have meant "with a
>> strong mind" (I think the origin of adverbs with -ly in English is similar)
>> as well as "strongly". So we would like for these expressions to be also
>> searchable as two separate items each one with its lemma and its POS in case
>> a particular researcher was interested in studying this phenomenon. For the
>> majority of researchers, though, the fact that the expression is written in
>> two separate words would not matter. For this reason, we would like the
>> default assumption in CQP to be that there is a single word.
>>
>> (b)
>>
>> apressuradamientre       apressurada mientre
>>
>>
>> Now, what Stefan proposed made us think of the following possibility:
>>
>> <X>
>>   word="apresurada mientre"    lemma="apresuradamente"  pos="ADV"
>>   <wp word="apresurada" lemma="apresurada" pos="ADJ"></wp>
>>   <wp word="mientre" lemma="mente" pos="N"></wp>
>> </X>
>>
>> We chose the label <X> for lack of a better one, but the idea is that by
>> default CQP would interpret <X>....</X> as it interprets the entry for any
>> single word. Then we would have an extra p-attribute, <wp>, available (the
>> users would know this) for cases where a user was interested in working
>> with the differentiated parts of the expression: just finding the specific
>> forms and their POS tags, or doing some quantitative analysis of the
>> different parts.
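One way the same information might be laid out as verticalized text (one token per line, tab-separated columns, XML tags on their own lines) is sketched below in Python. It is a variant, not the <X> proposal itself: the parts are the tokens, and the whole-unit lemma and POS sit on the enclosing tag, so a default search would see two words unless the query targets the region. The tag name `mwe` and the function name `mwe_region` are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def mwe_region(lemma, pos, parts, tag="mwe"):
    """Render a multi-word unit as an XML region around its component tokens.

    `parts` is a list of (word, lemma, pos) triples, one per written word.
    """
    lines = [f"<{tag} lemma={quoteattr(lemma)} pos={quoteattr(pos)}>"]
    # One tab-separated line per component token.
    lines += ["\t".join(part) for part in parts]
    lines.append(f"</{tag}>")
    return "\n".join(lines)


block = mwe_region(
    "apresuradamente",
    "ADV",
    [("apresurada", "apresurada", "ADJ"), ("mientre", "mente", "N")],
)
print(block)
```

Whether such a layout supports the single-word default asked for above is exactly the open question of this thread; the sketch only shows that both levels of annotation can be written down side by side.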
>>
>> Being able to do this is extremely important for diachronic corpora but it
>> would have advantages for all kinds of corpora since all of them contain
>> multi-word expressions where you might need their components to be processed
>> independently at some point. So, in our corpora we have trouble not only
>> with these types of expressions but also with many others like the
>> following:
>>
>> "compte Guifré de Montblanc". This is a proper name literally composed of the
>> words count + Wilfred + of + Montblanc.
>>
>> In the texts you find independent instances of 'Guifré', 'compte' or
>> 'Montblanc'. What is most important is to be able to tag the whole string as
>> a noun. To do this is kind of trivial because you could artificially create
>> single strings of the type 'compte_Guifré_de_Montblanc' at the
>> pre-processing stage and add them to the dictionary as proper nouns. But
>> then imagine that some user is interested in studying the variation in the
>> types of prepositional phrases that occur within proper nouns, the place
>> names used in proper nouns of people or some such legitimate research goal.
>>
>> Having created a single word obscures all this information that could be
>> valuable for some. There are many more examples. Another typical case is
>> subordinating conjunctions formed by more than one word (e.g. "Puis que",
>> literally "since that"). If you give them to the tagger as
>> independent words the resulting sentence structure is grammatically weird
>> because the two words are really working as one (just like 'since') so it is
>> better to tag them as a single subordinating conjunction. Again, though,
>> people interested in doing research on how these combinations of functional
>> words evolved would lose all the information if you tag them only as a
>> single expression. I'm sure modern languages have lots of cases like this.
>>
>> You see what I mean? This is part of a more general problem with linguistic
>> annotation of corpora but it poses very specific challenges for CWB/CQP
>> which we would like to overcome if possible.
>>
>> JM
>>
>>
>>
>>
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
