[CWB] A question on CQP attribute sets

Ruprecht von Waldenfels waldenfels at issl.unibe.ch
Tue Jul 10 15:50:04 CEST 2012


Hi Igor,

I have integrated this kind of annotation into CWB for the ParaSol 
corpus (parasol.unibe.ch). The solution I used is similar and 
straightforward; the main challenge, I think, is (a) providing for an 
unspecified number of analyses and (b) making sure that the different 
analyses don't get mixed up.

For example, Russian "dam" is EITHER the 1.SG of the verb /dat'/ 'give' 
OR the GEN.PL of the noun /dama/ 'lady'; but it is neither the first 
person singular of the noun nor the genitive plural of the verb.

Therefore, I feel one must go for a rather complex annotation, which can 
then be queried by using a regular expression. The machinery is

FORM ANNOTATION
dam  1:SG:PF-dat::GEN:PL-dama-

and then you can query for, say, Genitive by searching for

[annot=".*:GEN:.*"]

for /dama/ 'lady'

[annot=".*-dama-.*"]

and for the combination (genitives of /dama/)

[annot=".*GEN[^-]*-dama-.*"]

The "[^-]*" part will ensure that the GEN part does not belong to a 
different lemma.
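
The effect of the "[^-]*" guard can be checked outside CQP with a small 
Python sketch. It assumes that Python's re.fullmatch is a fair stand-in 
for CQP's behaviour of matching a regular expression against the whole 
attribute value; the helper name "matches" is made up for illustration, 
and the annotation string is copied from the example above.

```python
import re

# Annotation of the form "dam", copied verbatim from the example above.
ANNOT = "1:SG:PF-dat::GEN:PL-dama-"

def matches(pattern, annotation=ANNOT):
    """Hypothetical helper: True if the pattern matches the whole value."""
    return re.fullmatch(pattern, annotation) is not None

results = {
    "genitive":            matches(".*:GEN:.*"),
    "lemma dama":          matches(".*-dama-.*"),
    "genitive of dama":    matches(".*GEN[^-]*-dama-.*"),
    # Without the guard, SG (which belongs to /dat'/) is wrongly
    # combined with the lemma /dama/.
    "SG of dama, naive":   matches(".*SG.*-dama-.*"),
    # With the guard, the match cannot run across the "-" that closes
    # the tag part of the first analysis.
    "SG of dama, guarded": matches(".*SG[^-]*-dama-.*"),
}

for name, hit in results.items():
    print(f"{name}: {hit}")
```

Only the guarded pattern rejects the spurious "SG of dama" reading; the 
naive ".*" version accepts it.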

I believe this might constitute a complete solution, and it should be 
possible to hide the complexity from the user by wrapping this in a more 
convenient interface.
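
As a rough sketch of what such a wrapper might look like (not the 
actual ParaSol interface), the following Python function builds the 
three kinds of query shown above from a tag and/or a lemma. The 
function name, its parameters, and the restriction to plain word 
characters are assumptions made for this illustration; the tag-only 
pattern simply follows the ":GEN:" example literally.

```python
import re

def build_annot_query(tag=None, lemma=None, attr="annot"):
    """Hypothetical wrapper: return a CQP token expression as a string."""
    for value in (tag, lemma):
        # Assumption: tags and lemmas are plain symbols, so they cannot
        # contain the separators ":" and "-" or regex metacharacters.
        if value is not None and not re.fullmatch(r"\w+", value):
            raise ValueError(f"unsupported value: {value!r}")
    if tag and lemma:
        # "[^-]*" keeps the tag inside the same analysis as the lemma.
        pattern = f".*{tag}[^-]*-{lemma}-.*"
    elif tag:
        pattern = f".*:{tag}:.*"
    elif lemma:
        pattern = f".*-{lemma}-.*"
    else:
        raise ValueError("give a tag, a lemma, or both")
    return f'[{attr}="{pattern}"]'

print(build_annot_query(tag="GEN"))
print(build_annot_query(lemma="dama"))
print(build_annot_query(tag="GEN", lemma="dama"))
```

The user then only supplies a tag and a lemma, and never sees the 
regular expression itself.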

Best,
Ruprecht











On 10.07.2012 15:43, Serge Heiden wrote:
> Hi Igor,
>
> One way to do this in CWB would be to split your
> pos and lemma values into several positional attributes.
> For example, in this way:
> form    lemma1    lemma2    pos1    pos2    agr_set1    agr_set2    sem_set1    sem_set2
>
> And force your queries to work coherently with the
> corresponding attribute sets.
> Your example query would become:
> [lemma1=".*valuelemma.*" & pos1=".*valuepos.*"]
>
> What do you think?
>
> Best,
> Serge
>
>
> On 10/07/2012 15:20, Igor Shalyminov wrote:
> > Hello!
> >
> > My name is Igor, I'm a developer of the Russian National Corpus
> > search engine, and I'm trying to get it working with CWB. The main
> > problem I have is the following: RNC texts are annotated ambiguously
> > for the most part, and each word has sets of lemmas, grammatical and
> > semantic features, just as in the GERMAN-LAW example in the tutorial.
> > Suppose we have a word:
> >
> > word    lemma              pos            agr                    sem
> > -----------------------------------------------------------------------------------
> > form    |lemma1|lemma2|    |pos1|pos2|    |agr_set1|agr_set2|    |sem_set1|sem_set2|
> >
> > And, if I type the query:
> >
> > [(lemma contains "lemma1") & (pos contains "pos2")]
> >
> > I will get that very word matched, and this will be a mistake in my
> > case since there is only one strict correspondence: "lemma1 -> pos1
> > -> agr_set1 -> sem_set1", and the same for lemma2.
> >
> > So, my question is: is there an out-of-the-box way of performing
> > such queries (i.e., controlling the positions of corresponding sets
> > while matching attribute sets with 'contains'), or does it have to be
> > implemented?
> >
> > -- Best Regards, Igor Shalyminov
> > _______________________________________________ CWB mailing list
> > CWB at sslmit.unibo.it
> > http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>
> -- 
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883


-- 
------------------------------------------------
Ruprecht v. Waldenfels, waldenfels at issl.unibe.ch
Institut fuer slavische Sprachen und Literaturen
Universität Bern Laenggassstr. 49 CH 3005 Bern 9
Tel: +41  31 631 35 83 /  Fax: +41 31  631 39 90
------------------------------------------------
