[CWB] Multi-word units

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Feb 14 22:44:37 CET 2013


Hi Eva,

If I understand the problem correctly, the normal way I would do this is to encode both the original orthography (with token breaks as they appear in the text) and a normalised orthography (with normalised token breaks) as two separate attributes: either two p-attributes, or one p-attribute holding the normalised version and one s-attribute holding the original version.
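
To make that concrete, here is a rough Python sketch of the second option (normalised form as the token, original spelling kept on a surrounding s-attribute). It only builds verticalised text of the kind CWB takes as input: one token per line with tab-separated annotations, and XML-style tags for structural regions. The names `orig` and `form`, the sample data, the `ADV` tag, and the choice of "apressuradamientre" as the normalised token are all illustrative assumptions; your own normalisation policy and attribute names may well differ.

```python
from xml.sax.saxutils import quoteattr

# Each unit: (original spelling with the scribe's token breaks,
#             normalised single token, part-of-speech tag).
# Hypothetical sample data, for illustration only.
UNITS = [
    ("apressurada mientre", "apressuradamientre", "ADV"),
    ("apressuradamente", "apressuradamente", "ADV"),
]


def to_vrt(units):
    """Render units as verticalised text: one token per line,
    wrapped in a region whose attribute holds the original spelling."""
    lines = []
    for original, normalised, pos in units:
        # quoteattr adds the surrounding quotes and escapes special characters.
        lines.append(f"<orig form={quoteattr(original)}>")
        lines.append(f"{normalised}\t{pos}")
        lines.append("</orig>")
    return "\n".join(lines)


def written_as_separate_words(units, ending):
    """Return original spellings where `ending` stands as its own word
    after at least one preceding word."""
    hits = []
    for original, _normalised, _pos in units:
        parts = original.split()
        if len(parts) > 1 and parts[-1] == ending:
            hits.append(original)
    return hits


if __name__ == "__main__":
    print(to_vrt(UNITS))
    print(written_as_separate_words(UNITS, "mientre"))
```

With a layout like this, the single normalised token carries the adverb tag, and the original spelling stays available on the enclosing region, so the one-word and two-word spellings can still be told apart. Both attributes would need to be declared when the corpus is encoded.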

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of BOFÍAS ALBERCH, EVA
Sent: 14 February 2013 18:53
To: cwb at sslmit.unibo.it
Subject: [CWB] Multi-word units

Hi,

I don't know whether this is possible at all, but it doesn't hurt to ask. Here is the problem. We are developing a corpus to be queried via CQP, and we would like future users to be able to access information in different ways. It is a diachronic corpus, and sometimes it is important to know what parts a given multi-word expression has. For instance, in Old Spanish we have expressions such as "apressurada mientre" ('mientre' is the equivalent of English -ly), which clearly work like their contemporary Spanish equivalents: "apresuradamente". It is important to encode such an expression as a single word tagged as 'adverb', but some potential users might be interested in studying the evolution of these forms and might want to distinguish forms that the scribes wrote as a single word (the same texts also have these adverbs with "mente" as single words) from those written as two separate words. The idea would be to find some way of coding the corpus so that multi-word expressions like these can be tagged as a single word, while a user who wants to find all instances of 'mientre', whether or not it is attached to the preceding word, can do that as well. Any suggestions? Or are we asking for something that is not possible?

Eva