[CWB] Multi-word units

Tue Feb 19 13:45:40 CET 2013

Sorry that I have not answered before; I thought about it and actually
intended to.
You may want to look at how the AC/DC project
(http://www.linguateca.pt/ACDC/) solved this and related issues. We
are happy with the solutions we came up with, and I think the same is
true of our users.
See our 2000 paper in LREC for a start, and the documentation on the
Web pages. It is of course a bronze age CWB we were dealing with, but
I hope you will be able to update the solutions if you like them.

best
Diana

Diana Santos & Eckhard Bick. "Providing Internet access to Portuguese
corpora: the AC/DC project". In Maria Gavrilidou, George Carayannis,
Stella Markantonatou, Stelios Piperidis & Gregory Stainhauer (eds.),
Proceedings of the Second International Conference on Language
Resources and Evaluation (LREC 2000) (Athens, Greece, 31 May to 2
June 2000), pp. 205-210.
http://www.linguateca.pt/documentos/SantosBickLREC2000.pdf

2013/2/19 Josep M. Fontana <josepm.fontana  upf.edu>:
> Sorry. I should have considered this possibility. I apologize. I don't mean
> to rush you guys. Some confirmation that this is a legitimate concern, and
> that you think the question is worth some thought, would be enough for us.
> Just knowing that this or a similar solution might be possible to implement
> in the future would already be useful for us, since it would justify making
> some particular choices right now.
>
> One of the avenues we are considering, in case the solution we suggest turns
> out to be possible, is to pre-process our tokenized texts so that all the
> possible multi-word expressions are added to the dictionary for our tagger
> with the appropriate labels; something like:
>
> Saint_Anselm, Sir_Lancelot_of_the_lake, in_order_to, etc.
>
> This would at least allow the tagger to learn about the existing multi-word
> units and their distribution and tag the texts with what is for us the most
> important information. We want to parse the resulting corpus. So making the
> encoding of the specific syntactic information about relationships between
> different expressions as simple and intuitive as possible is our main
> concern at this point. Later, if the solution we suggest or a different one
> that achieved the same goals were available, we would reprocess the texts of
> the corpus to include the information about the different components of the
> multi-word expressions by eliminating the '_' and adding the new labels for
> the individual words via some encoding scheme possibly involving XML.
>
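The pre-processing step described above (joining known multi-word
expressions with '_' before tagging, and splitting them again later)
could be sketched roughly as follows. This is only an illustration in
Python, not part of CWB or of any tagger: the function names and the
lexicon format (a collection of word tuples) are assumptions made for
the example.

```python
from typing import Iterable, List, Set, Tuple


def join_multiword_units(tokens: List[str],
                         lexicon: Iterable[Tuple[str, ...]]) -> List[str]:
    """Greedily merge known multi-word units into underscore-joined tokens."""
    units: Set[Tuple[str, ...]] = {tuple(u) for u in lexicon}
    max_len = max((len(u) for u in units), default=1)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, so that a long unit such as
        # "Sir Lancelot of the lake" wins over any shorter unit it contains.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in units:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word unit starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def split_multiword_unit(token: str) -> List[str]:
    """Recover the component words of an underscore-joined token."""
    return token.split("_")
```

Note that the later splitting step is only safe if '_' never occurs in
the original tokens; otherwise a different joining character would be
needed.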
>
> JM
>
> On my part it means it is a more complex question than I have had time to
> write an email about yet!
>
>
>
> Andrew.
>
>
>
> From: cwb-bounces  sslmit.unibo.it [mailto:cwb-bounces  sslmit.unibo.it] On
> Behalf Of Josep M. Fontana
> Sent: 19 February 2013 10:39
> To: cwb  sslmit.unibo.it
> Subject: Re: [CWB] Multi-word units
>
>
>
> Hi again,
>
> There hasn't been any reply to our previous message from anybody in the
> list. Does this mean this problem has no possible solution within CQP? Would
> the method we suggested be too hard or impossible to implement? We would
> really appreciate your input because we have to make decisions at this point
> on how we have to pre-process and depending on the options we have with CQP
> we would go one way or another. Thanks for all your help.
>
> Josep M
>
> Hi Andrew and Stefan. I work with Eva and now it is my turn to write. First,
> thanks for your help.
> Your answers have given us some ideas that we explain below. What we don't
> really know is what potential pitfalls the implementation we suggest would
> have for processing via CQP. Below we'll try to explain why we would
> want to do it the way we are proposing.
>
> But this would break the alignment between the two attributes, if one has
> two tokens and the other only a single token, wouldn't it?
>
> I was thinking of this kind of arrangement:
>
>
>
> apressurada        apressuradamientre
>
> mientre    {some kind of ditto mark or just __NULL__}
>
>
>
> .... so that subsequent tokens on the two attributes stay in sync.
>
>
>
> OR, going the other way
>
>
>
> apressuradamientre apressurada mientre
>
>
>
> I'm quite open to alternatives, though the XML way strikes me as liable to
> cause trouble.
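
To make the two arrangements above concrete, here is a small Python
sketch that builds the rows for each of them and prints them as
tab-separated, one-token-per-line text. The function names are
hypothetical, and `__NULL__` is simply the placeholder mentioned in the
message; nothing here is a CWB API.

```python
from typing import List, Tuple

NULL = "__NULL__"  # placeholder value, as suggested in the message above

Row = Tuple[str, str]


def layout_a(parts: List[str], merged: str) -> List[Row]:
    """One row per written word; the merged form sits on the first row only,
    and the remaining rows are padded so both columns stay in sync."""
    return [(part, merged if i == 0 else NULL) for i, part in enumerate(parts)]


def layout_b(parts: List[str], merged: str) -> List[Row]:
    """One row for the whole unit; the written words share a single cell."""
    return [(merged, " ".join(parts))]


def to_vertical(rows: List[Row]) -> str:
    """Render rows as tab-separated lines, one corpus position per line."""
    return "\n".join("\t".join(row) for row in rows)


if __name__ == "__main__":
    parts = ["apressurada", "mientre"]
    print(to_vertical(layout_a(parts, "apressuradamientre")))
    print(to_vertical(layout_b(parts, "apressuradamientre")))
```

In the first layout every written word keeps its own corpus position; in
the second the whole unit occupies one position, which is what keeps the
columns aligned in each case.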
>
>
> OK, first: Andrew's suggestion in (a) below is less likely to cause problems,
> but it would be a bit less desirable because with something like the
> following we would miss the fact that the two words for all intents and
> purposes work as a single unit. To give you an idea, this is exactly as if
> in the same texts you found strings like "hurriedly" and "hurried ly". So,
> by default, we want these multi-word expressions to be found as a single
> unit any time a user searches for an adverb or for the lemma
> 'apresuradamente'.
>
> (a)
>
> apressurada      apressuradamientre
>
> mientre  {some kind of ditto mark or just __NULL__}
>
>
> Andrew's suggestion in (b) below would overcome this problem but then we
> don't really know how it could be implemented in CQP. What we usually have
> in our tagged corpora are entries with 3 columns: 1) the form, 2) the lemma
> and 3) the POS tag. So (b) would be problematic because there is apparently
> no way to say that the lemma is in fact 'apresuradamente' and that
> "apressurada mientre" is a multi-word instance/form of that lemma.
> Furthermore, for reasons that have to do with the kind of research potential
> users of this corpus are likely to do, it would be ideal to consider the two
> parts of the multi-word expression also as two independent words, each one
> with its lemma and its part of speech. This is so because, in this
> particular example of adverbs with -mente, in the early stages of the change
> that resulted in the creation of the current manner adverbs, the strings
> with the two forms could have been ambiguous between a single adverb (the
> interpretation we want to be the default interpretation when doing a normal
> search) and two independent words: one an adjective and the other a noun.
> So, 'apresurada' (which means 'hurried') is not a really good example for
> this development, but in the earlier stages of this change the string
> "fuerte mientre" (lit. "strong mind") could literally have meant "with a
> strong mind" (I think the origin of adverbs with -ly in English is similar)
> as well as "strongly". So we would like these expressions to be also
> searchable as two separate items, each one with its lemma and its POS, in
> case a particular researcher is interested in studying this phenomenon. For
> the majority of researchers, though, the fact that the expression is written
> as two separate words would not matter. For this reason, we would like the
> default assumption in CQP to be that there is a single word.
>
> (b)
>
> apressuradamientre       apressurada mientre
>
>
> Now, what Stefan proposed made us think of the following possibility:
>
> <X>
>  word="apresurada mientre"    lemma="apresuradamente"  pos="ADV"
>  <wp word="apresurada" lemma="apresurada" pos="ADJ"></wp>
>  <wp word="mientre" lemma="mente" pos="N"></wp>
> </X>
>
> We chose the label <X> for lack of a better one, but the idea is that by
> default CQP would interpret <X>....</X> as it interprets entries for any
> single word. Then we would have an extra p-attribute <wp> available (the
> users would know this) for cases where a user is interested in working with
> the differentiated parts of the expression (just finding the specific forms
> and their POS tags, or doing some quantitative analysis with the different
> parts).
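
The intended behaviour (one token by default, component parts on
request) can be modelled outside CQP with a small Python sketch. This
says nothing about how CWB would store or query such data; the class
and function names (`Token`, `Unit`, `iter_tokens`, `find_by_lemma`) are
hypothetical and exist only to illustrate the two views of the same
unit.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Token:
    word: str
    lemma: str
    pos: str


@dataclass(frozen=True)
class Unit:
    """A corpus position: a single token by default, optionally with parts."""
    whole: Token
    parts: Tuple[Token, ...] = ()


def iter_tokens(corpus: List[Unit], expand_parts: bool = False) -> Iterator[Token]:
    """Yield whole units by default, or their component parts on request."""
    for unit in corpus:
        if expand_parts and unit.parts:
            yield from unit.parts
        else:
            yield unit.whole


def find_by_lemma(corpus: List[Unit], lemma: str,
                  expand_parts: bool = False) -> List[str]:
    """Return the word forms whose lemma matches, in the chosen view."""
    return [t.word for t in iter_tokens(corpus, expand_parts) if t.lemma == lemma]


# The example unit from the message above.
corpus = [
    Unit(
        whole=Token("apresurada mientre", "apresuradamente", "ADV"),
        parts=(
            Token("apresurada", "apresurada", "ADJ"),
            Token("mientre", "mente", "N"),
        ),
    ),
]
```

With this model, a default search for the lemma 'apresuradamente' finds
the whole unit, while a search for the lemma 'mente' only finds
something when the component view is switched on.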
>
> Being able to do this is extremely important for diachronic corpora but it
> would have advantages for all kinds of corpora since all of them contain
> multi-word expressions where you might need their components to be processed
> independently at some point. So, in our corpora we have trouble not only
> with these types of expressions but also with many others like the
> following:
>
> "compte Guifré de Montblanc" This is a proper name literally composed of the
> words count + Wilfred + of + Montblanc
>
> In the texts you find independent instances of 'Guifré', 'compte' or
> 'Montblanc'. What is most important is to be able to tag the whole string as
> a noun. To do this is kind of trivial because you could artificially create
> single strings of the type 'compte_Guifré_de_Montblanc' at the
> pre-processing stage and add them to the dictionary as proper nouns. But
> then imagine that some user is interested in studying the variation in the
> types of prepositional phrases that occur within proper nouns, the place
> names used in proper nouns of people or some such legitimate research goal.
>
> Having created a single word obscures all this information that could be
> valuable for some. There are many more examples. Another typical case is
> subordinating conjunctions formed by more than one word (e.g. "Puis que",
> literally "since that"). If you give them to the tagger as independent
> words, the resulting sentence structure is grammatically weird because the
> two words are really working as one (just like 'since'), so it is better to
> tag them as a single subordinating conjunction. Again, though, people
> interested in doing research on how these combinations of function words
> evolved would lose all the information if you tag them only as a single
> expression. I'm sure modern languages have lots of cases like this.
>
> You see what I mean? This is part of a more general problem with linguistic
> annotation of corpora but it poses very specific challenges for CWB/CQP
> which we would like to overcome if possible.
>
> JM
>
>
>
>
>
> _______________________________________________
> CWB mailing list
> CWB  sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb