[CWB] Multi-word units

Josep M. Fontana josepm.fontana at upf.edu
Fri Feb 15 18:45:50 CET 2013


Hi Andrew and Stefan. I work with Eva and now it is my turn to write. 
First, thanks for your help.
Your answers have given us some ideas, which we explain below. What we 
don't really know is what pitfalls the implementation we suggest might 
have when it is processed via CQP. Below we'll try to explain why we 
would want to do it the way we are proposing.

>>> But this would break the alignment between the two attributes, if one has two tokens and the other only a single token, wouldn't it?
> I was thinking of this kind of arrangement:
>
> apressurada	apressuradamientre
> mientre	{some kind of ditto mark or just __NULL__}
>
> .... so that subsequent tokens on the two attributes stay in sync.
>
> OR, going the other way
>
> apressuradamientre	apressurada mientre
>
> I'm quite open to alternatives, though the XML way strikes me as liable to cause trouble.

OK, first: Andrew's suggestion in (a) below is less likely to cause 
problems, but it would be a bit less desirable because with something 
like it we would miss the fact that the two words for all intents and 
purposes work as a single unit. To give you an idea, this is exactly 
as if in the same texts you found strings like "hurriedly" and 
"hurried ly". So, by default we want these multi-word expressions to be 
found as a single unit any time a user searches for an adverb or for 
the lemma 'apresuradamente'.

(a)

apressurada	apressuradamientre
mientre	{some kind of ditto mark or just __NULL__}
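
To make the concern with (a) concrete, here is a small Python sketch 
(a toy model, not CQP itself) that treats the two aligned columns as a 
list of rows and runs a naive lookup on one column. The names `rows` 
and `find` and the column labels `split` and `joined` are our own 
inventions, and `__NULL__` is just the placeholder from Andrew's 
example. The lookup on the joined column hits a single corpus position, 
so nothing in the result says that the next token belongs to the same 
unit.

```python
# Toy model of arrangement (a): one row per corpus position,
# two aligned columns (labels "split" and "joined" are hypothetical).
rows = [
    {"split": "apressurada", "joined": "apressuradamientre"},
    {"split": "mientre", "joined": "__NULL__"},
]


def find(rows, column, value):
    """Return the corpus positions whose `column` equals `value`."""
    return [i for i, row in enumerate(rows) if row[column] == value]


hits = find(rows, "joined", "apressuradamientre")
print(hits)  # prints [0]: only the first half of the expression is matched
```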


Andrew's suggestion in (b) below would overcome this problem, but then 
we don't really know how it could be implemented in CQP. What we usually 
have in our tagged corpora are entries with 3 columns: 1) the form, 2) 
the lemma and 3) the POS tag. So (b) would be problematic because there 
is apparently no way to say that the lemma is in fact 'apresuradamente' 
and that "apressurada mientre" is a multi-word instance/form of that 
lemma.

Furthermore, for reasons that have to do with the kind of research 
potential users of this corpus are likely to do, it would be ideal to 
also treat the two parts of the multi-word expression as two 
independent words, each with its own lemma and part of speech. This is 
because, in this particular example of adverbs with -mente, in the 
early stages of the change that resulted in the creation of the current 
manner adverbs, the two-form strings could have been ambiguous between 
a single adverb (the interpretation we want to be the default when 
doing a normal search) and two independent words: an adjective and a 
noun. 'apresurada' (which means 'hurried') is not a really good example 
of this development, but in the earlier stages of this change the 
string "fuerte mientre" (lit. "strong mind") could literally have meant 
"with a strong mind" (I think the origins of adverbs with -ly in 
English are similar) as well as "strongly". So we would like these 
expressions to also be searchable as two separate items, each with its 
lemma and its POS, in case a particular researcher is interested in 
studying this phenomenon. For the majority of researchers, though, the 
fact that the expression is written as two separate words would not 
matter. For this reason, we would like the default assumption in CQP to 
be that there is a single word.

(b)

apressuradamientre	apressurada mientre


Now, what Stefan proposed made us think of the following possibility:

<X>
  word="apressurada mientre"    lemma="apresuradamente"  pos="ADV"
  <wp word="apressurada" lemma="apresurada" pos="ADJ"></wp>
  <wp word="mientre" lemma="mente" pos="N"></wp>
</X>

We chose the label <X> for lack of a better one, but the idea is that by 
default CQP would interpret <X>....</X> as it interprets the entry for 
any single word. Then we would have an extra p-attribute, <wp>, 
available (the users would know this) for cases where a user is 
interested in working with the individual parts of the expression (just 
finding the specific forms and their POS tags, or doing some 
quantitative analysis with the different parts).
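
One way to picture this in terms of verticalised text is to put the 
whole-unit annotation on an XML region and keep the parts as ordinary 
token lines (word, lemma, POS separated by tabs). The Python sketch 
below only generates such text. The tag name `mwu`, the function 
`to_vertical` and the dictionary layout are our own assumptions, and 
whether CQP could then treat the region as a single word by default is 
exactly the open question.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(unit):
    """Render one multi-word unit as verticalised text.

    The unit-level lemma and POS go on a hypothetical <mwu> region;
    each part becomes a token line: word TAB lemma TAB pos.
    """
    open_tag = "<mwu lemma=%s pos=%s>" % (
        quoteattr(unit["lemma"]),
        quoteattr(unit["pos"]),
    )
    lines = [open_tag]
    for part in unit["parts"]:
        lines.append("\t".join((part["word"], part["lemma"], part["pos"])))
    lines.append("</mwu>")
    return "\n".join(lines)


# Example unit taken from the discussion; the structure is illustrative.
unit = {
    "lemma": "apresuradamente",
    "pos": "ADV",
    "parts": [
        {"word": "apressurada", "lemma": "apresurada", "pos": "ADJ"},
        {"word": "mientre", "lemma": "mente", "pos": "N"},
    ],
}
print(to_vertical(unit))
```

With this layout the parts stay searchable as normal tokens, while the 
region carries the single-adverb reading.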

Being able to do this is extremely important for diachronic corpora, but 
it would have advantages for all kinds of corpora, since all of them 
contain multi-word expressions whose components you might need to 
process independently at some point. In our corpora we have trouble not 
only with these types of expressions but also with many others, like 
the following:

"compte Guifré de Montblanc": this is a proper name literally composed 
of the words count + Wilfred + of + Montblanc

In the texts you find independent instances of 'Guifré', 'compte' or 
'Montblanc'. What is most important is to be able to tag the whole 
string as a noun. Doing this is fairly trivial, because you could 
artificially create single strings of the type 
'compte_Guifré_de_Montblanc' at the pre-processing stage and add them to 
the dictionary as proper nouns. But then imagine that some user is 
interested in studying the variation in the types of prepositional 
phrases that occur within proper nouns, the place names used in 
people's proper names, or some other such legitimate research goal.

Having created a single word obscures all this information, which could 
be valuable for some. There are many more examples. Another typical 
case is subordinating conjunctions formed by more than one word (e.g. 
"Puis que", literally "since that"). If you give them to the tagger as 
independent words, the resulting sentence structure is grammatically 
weird because the two words are really working as one (just like 
'since'), so it is better to tag them as a single subordinating 
conjunction. Again, though, people interested in doing research on how 
these combinations of function words evolved would lose all the 
information if you tagged them only as a single expression. I'm sure 
modern languages have lots of cases like this.
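
The loss of information with fused tokens can be shown in a few lines 
of Python. This is only an illustration: the helper names `join_parts` 
and `split_parts` are ours, and the per-word tags (N, NP, P) are 
placeholders, not the tagset we actually use. Splitting the fused 
string gives back the word forms, but the tags of the individual words 
are gone unless they are stored somewhere else.

```python
def join_parts(words, sep="_"):
    """Fuse the words of a multi-word expression into one token string."""
    return sep.join(words)


def split_parts(token, sep="_"):
    """Recover the word forms from a fused token (forms only, no tags)."""
    return token.split(sep)


# Word forms with placeholder tags (the tags are hypothetical).
parts = [("compte", "N"), ("Guifré", "NP"), ("de", "P"), ("Montblanc", "NP")]

fused = join_parts([word for word, _tag in parts])
print(fused)               # compte_Guifré_de_Montblanc
print(split_parts(fused))  # the four forms come back, without their tags
```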

You see what I mean? This is part of a more general problem with 
linguistic annotation of corpora but it poses very specific challenges 
for CWB/CQP which we would like to overcome if possible.

JM