[CWB] Multi-word units

Tue Feb 19 12:45:28 CET 2013

Sorry, I should have considered this possibility, and I apologize. I
don't mean to rush you guys. Some confirmation that this is a legitimate
concern, and that you think the question is worth giving some thought
to, would be enough for us. Just knowing that this or a similar solution
might be possible to implement in the future would already be useful for
us, since it would justify making some particular choices right now.

One of the avenues we are considering, in case the solution we suggest
turns out to be possible at all, is to pre-process our tokenized texts
so that all the possible multi-word expressions are added to the
dictionary for our tagger with the appropriate labels; something like:

Saint_Anselm, Sir_Lancelot_of_the_lake, in_order_to, etc.

This would at least allow the tagger to learn about the existing
multi-word units and their distribution, and to tag the texts with what
is for us the most important information. We want to parse the resulting
corpus, so making the encoding of the specific syntactic information
about relationships between different expressions as simple and
intuitive as possible is our main concern at this point. Later, if the
solution we suggest, or a different one that achieved the same goals,
became available, we would reprocess the texts of the corpus to include
the information about the different components of the multi-word
expressions by eliminating the '_' and adding the new labels for the
individual words via some encoding scheme, possibly involving XML.
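
To make the idea concrete, here is a minimal sketch in Python of the
pre-processing step and of its later reversal. It is only an
illustration under our own assumptions: the function names join_mwes
and split_mwe are made up for this message, the lexicon is a
hand-written list of token sequences, matching is greedy longest-match,
and we assume no original token already contains '_'. It is not tied to
any particular tagger or to CWB.

```python
def join_mwes(tokens, lexicon, sep="_"):
    """Join known multi-word expressions into single tokens.

    tokens  -- list of strings (already tokenized text)
    lexicon -- iterable of token sequences, e.g. ("in", "order", "to")
    Greedy longest-match, left to right (an assumption of this sketch).
    """
    entries = {tuple(entry) for entry in lexicon}
    max_len = max((len(entry) for entry in entries), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in entries:
                out.append(sep.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word entry starts here: keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out


def split_mwe(token, sep="_"):
    """Recover the component words of a joined token (later reprocessing)."""
    return token.split(sep)


lexicon = [("Sir", "Lancelot", "of", "the", "lake"), ("in", "order", "to")]
text = ["Sir", "Lancelot", "of", "the", "lake", "rode", "in", "order", "to", "win"]
joined = join_mwes(text, lexicon)
```

With the lexicon above, joined contains the single tokens
Sir_Lancelot_of_the_lake and in_order_to, and split_mwe applied to
either of them gives back the individual words, which is what we would
need when reprocessing the corpus later on.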

JM
>
> On my part it means it is a more complex question than I have had time 
> to write an email about yet!
>
> Andrew.
>
> *From:*cwb-bounces at sslmit.unibo.it 
> [mailto:cwb-bounces at sslmit.unibo.it] *On Behalf Of *Josep M. Fontana
> *Sent:* 19 February 2013 10:39
> *To:* cwb at sslmit.unibo.it
> *Subject:* Re: [CWB] Multi-word units
>
> Hi again,
>
> There hasn't been any reply to our previous message from anybody on 
> the list. Does this mean this problem has no possible solution within 
> CQP? Would the method we suggested be too hard or impossible to 
> implement? We would really appreciate your input because we have to 
> make decisions at this point on how to pre-process our texts, and 
> depending on the options we have with CQP we would go one way or 
> another. Thanks for all your help.
>
> Josep M
>
>     Hi Andrew and Stefan. I work with Eva and now it is my turn to
>     write. First thanks for your help.
>     Your answers have given us some ideas that we explain below. What
>     we don't really know is what potential pitfalls the implementation
>     we suggest would have for processing via CQP. Below we'll try
>     to explain why we would want to do it the way we are proposing.
>
>                 But this would break the alignment between the two attributes, if one has two tokens and the other only a single token, wouldn't it?
>
>         I was thinking of this kind of arrangement:
>
>         apressurada        apressuradamientre
>         mientre            {some kind of ditto mark or just __NULL__}
>
>         .... so that subsequent tokens on the two attributes stay in sync.
>
>         OR, going the other way
>
>         apressuradamientre apressurada mientre
>
>         I'm quite open to alternatives, though the XML way strikes me as liable to cause trouble.
>
>
>     OK, first: Andrew's suggestion in (a) below, even though it is
>     less likely to cause problems, would be a bit less desirable
>     because with something like the following we would miss the
>     fact that the two words for all intents and purposes work as a
>     single unit. To give you an idea, this is exactly the same as if
>     in the same texts you found strings like "hurriedly" and
>     "hurried ly". So, by default we want these multi-word expressions
>     to be found as a single unit any time a user searches for an
>     adverb or for the lemma 'apresuradamente'.
>
>     (a)
>
>     apressurada      apressuradamientre
>
>     mientre  {some kind of ditto mark or just __NULL__}
>
>
>     Andrew's suggestion in (b) below would overcome this problem but
>     then we don't really know how it could be implemented in CQP. What
>     we usually have in our tagged corpora are entries with 3 columns:
>     1) the form, 2) the lemma and 3) the POS tag. So (b) would be
>     problematic because there is apparently no way to say that the
>     lemma is in fact 'apresuradamente' and that "apressurada mientre"
>     is a multi-word instance/form of that lemma. Furthermore, for
>     reasons that have to do with the kind of research potential users
>     of this corpus are likely to do, it would be ideal to consider the
>     two parts of the multi-word expression also as two independent
>     words, each one with its lemma and its part of speech. This is so
>     because, in this particular example of adverbs with -mente, in the
>     early stages of the change that resulted in the creation of the
>     current manner adverbs, the strings with the two forms could have
>     been ambiguous between a single adverb (the interpretation we want
>     to be the default interpretation when doing a normal search) and
>     two independent words: one an adjective and the other a noun.
>     'apresurada' (which means 'hurried') is not really a good example
>     of this development, but in the earlier stages of this change the
>     string "fuerte mientre" (lit. "strong mind") could literally have
>     meant "with a strong mind" (I think the origin of adverbs with
>     -ly in English is similar) as well as "strongly". So we would like
>     for these expressions to be also searchable as two separate items
>     each one with its lemma and its POS in case a particular
>     researcher was interested in studying this phenomenon. For the
>     majority of researchers, though, the fact that the expression is
>     written in two separate words would not matter. For this reason,
>     we would like the default assumption in CQP to be that there is a
>     single word.
>
>     (b)
>
>     apressuradamientre       apressurada mientre
>
>
>     Now, what Stefan proposed made us think of the following possibility:
>
>     <X>
>      word="apresurada mientre"    lemma="apresuradamente" pos="ADV"
>      <wp word="apresurada" lemma="apresurada" pos="ADJ"></wp>
>      <wp word="mientre" lemma="mente" pos="N"></wp>
>     </X>
>
>     We chose the label <X> for lack of a better one, but the idea is
>     that by default CQP would interpret <X>....</X> the way it
>     interprets the entry for any single word. Then we would have an
>     extra p-attribute available, <wp> (the users would know this), for
>     cases where a user was interested in working with the
>     differentiated parts of the expression (just finding the specific
>     forms and their POS tags, or doing some quantitative analysis with
>     the different parts).
>
>     Being able to do this is extremely important for diachronic
>     corpora but it would have advantages for all kinds of corpora
>     since all of them contain multi-word expressions where you might
>     need their components to be processed independently at some point.
>     So, in our corpora we have trouble not only with these types of
>     expressions but also with many others like the following:
>
>     "compte Guifré de Montblanc" This is a proper name literally
>     composed of the words count + Wilfred + of + Montblanc
>
>     In the texts you find independent instances of 'Guifré', 'compte'
>     or 'Montblanc'. What is most important is to be able to tag the
>     whole string as a noun. To do this is kind of trivial because you
>     could artificially create single strings of the type
>     'compte_Guifré_de_Montblanc' at the pre-processing stage and add
>     them to the dictionary as proper nouns. But then imagine that some
>     user is interested in studying the variation in the types of
>     prepositional phrases that occur within proper nouns, the place
>     names used in proper nouns of people or some such legitimate
>     research goal.
>
>     Having created a single word obscures all this information that
>     could be valuable for some. There are many more examples. Another
>     typical case is subordinating conjunctions formed by more than one
>     word (e.g. "Puis que", literally "since that"), etc. If you
>     give them to the tagger as independent words the resulting
>     sentence structure is grammatically weird because the two words
>     are really working as one (just like 'since'), so it is better to
>     tag them as a single subordinating conjunction. Again, though,
>     people interested in doing research on how these combinations of
>     functional words evolved would lose all the information if you
>     tag them only as a single expression. I'm sure modern languages
>     have lots of cases like this.
>
>     You see what I mean? This is part of a more general problem with
>     linguistic annotation of corpora but it poses very specific
>     challenges for CWB/CQP which we would like to overcome if possible.
>
>     JM
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
