[CWB] A question on CQP attribute sets
Ruprecht von Waldenfels
waldenfels at issl.unibe.ch
Tue Jul 10 15:50:04 CEST 2012
Hi Igor,
I have integrated this kind of annotation into CWB for the ParaSol
corpus (parasol.unibe.ch). The solution I used is similar and
straightforward; the main challenge, I think, is (a) allowing for an
unspecified number of analyses and (b) making sure that the different
analyses don't get mixed up.
For example, Russian "dam" is EITHER 1.SG of the verb /dat'/ 'give' OR
GEN.PL of the noun /dama/ 'lady'; it is neither the first person
singular of the noun nor the genitive plural of the verb.
Therefore, I feel one must go for a rather complex annotation, which can
then be queried with a regular expression. The scheme is:
FORM  ANNOTATION
dam   1:SG:PF-dat::GEN:PL-dama-
You can then query for, say, genitive forms by searching for
[annot=".*:GEN:.*"]
for the lemma /dama/ 'lady' with
[annot=".*-dama-.*"]
and for the combination (genitives of /dama/) with
[annot=".*GEN[^-]*-dama-.*"]
The "[^-]*" part ensures that no hyphen, and hence no other lemma,
stands between GEN and -dama-, so the GEN tag cannot belong to a
different lemma.
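
The effect of the "[^-]*" guard can be checked outside CQP. The following is only a minimal sketch in Python, not part of the ParaSol setup: it assumes that CQP matches a regular expression against the whole attribute value (mimicked here with re.fullmatch), and the helper name cqp_matches is made up for illustration.

```python
import re

# Annotation string for the form "dam", copied from the example above.
ANNOT = "1:SG:PF-dat::GEN:PL-dama-"


def cqp_matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: match the regex against the whole value,
    which is how CQP is assumed to treat attribute regexes."""
    return re.fullmatch(pattern, value) is not None


# The three queries from the message all match the annotation of "dam".
genitive = cqp_matches(r".*:GEN:.*", ANNOT)
lemma_dama = cqp_matches(r".*-dama-.*", ANNOT)
gen_of_dama = cqp_matches(r".*GEN[^-]*-dama-.*", ANNOT)

# Guarded queries for the wrong pairings do not match:
# "genitive of /dat'/" and "first person of /dama/".
gen_of_dat = cqp_matches(r".*GEN[^-]*-dat.*", ANNOT)
first_of_dama = cqp_matches(r".*1:[^-]*-dama-.*", ANNOT)

# Without the guard, ".*" may cross a lemma boundary, so the
# wrong pairing "first person of /dama/" does match.
unguarded = cqp_matches(r".*1:.*-dama-.*", ANNOT)

print(genitive, lemma_dama, gen_of_dama)     # True True True
print(gen_of_dat, first_of_dama, unguarded)  # False False True
```

The last line shows why a plain ".*" between tag and lemma is not enough: it happily skips over "-dat::" and pairs the verb's tag with the noun's lemma.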
I believe this might constitute a complete solution, and it should be
possible to hide the complexity from the user by wrapping this in a more
convenient interface.
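
One way such a wrapper could look is sketched below in Python. This is an illustration under assumptions, not the ParaSol interface: the function name build_annot_query and its parameters are hypothetical, the attribute is assumed to be called annot as in the queries above, and tags and lemmas are restricted to alphanumeric strings so that no regex escaping is needed.

```python
from typing import Optional


def build_annot_query(tag: Optional[str] = None,
                      lemma: Optional[str] = None,
                      attribute: str = "annot") -> str:
    """Hypothetical wrapper: build a CQP token query for the combined
    annotation format (tags, then the lemma between hyphens)."""
    for part in (tag, lemma):
        # Keep the sketch simple: refuse anything that could contain
        # regex metacharacters or the delimiters ':' and '-'.
        if part is not None and not part.isalnum():
            raise ValueError(f"unsupported value: {part!r}")

    if tag is not None and lemma is not None:
        # "[^-]*" keeps the tag inside the same analysis as the lemma.
        pattern = f".*{tag}[^-]*-{lemma}-.*"
    elif tag is not None:
        pattern = f".*:{tag}:.*"
    elif lemma is not None:
        pattern = f".*-{lemma}-.*"
    else:
        pattern = ".*"
    return f'[{attribute}="{pattern}"]'


print(build_annot_query(tag="GEN"))                # [annot=".*:GEN:.*"]
print(build_annot_query(lemma="dama"))             # [annot=".*-dama-.*"]
print(build_annot_query(tag="GEN", lemma="dama"))  # [annot=".*GEN[^-]*-dama-.*"]
```

The user then only supplies a tag and/or a lemma, and the three query shapes from the message are generated for them.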
Best,
Ruprecht
On 10.07.2012 15:43, Serge Heiden wrote:
> Hi Igor,
>
> One way to do this in CWB would be to split your
> pos and lemma values into several positional attributes.
> For example, in this way:
> form lemma1 lemma2 pos1 pos2 agr_set1 agr_set2 sem_set1 sem_set2
>
> Then make sure your queries combine only attributes
> from corresponding sets.
> Your example query would become:
> [lemma1=".*valuelemma.*" & pos1=".*valuepos.*"]
>
> What do you think?
>
> Best,
> Serge
>
>
> On 10/07/2012 15:20, Igor Shalyminov wrote:
> > Hello!
> >
> > My name is Igor, I'm a developer of Russian National Corpus search
> > engine, and I'm trying to get it working with CWB. The main problem I
> > have is the following: RNC texts are annotated ambiguously for the
> > most part, and each word has got sets of lemmas, grammar and semantic
> > features, just as the GERMAN-LAW example in the tutorial. Suppose we
> > have a word:
> >
> > word  lemma            pos          agr                  sem
> > ----------------------------------------------------------------------------
> > form  |lemma1|lemma2|  |pos1|pos2|  |agr_set1|agr_set2|  |sem_set1|sem_set2|
> >
> > And, if I type the query:
> >
> > [(lemma contains "lemma1") & (pos contains "pos2")]
> >
> > I will get that very word matched, and this will be a mistake in my
> > case since there is only one strict correspondence: "lemma1 -> pos1
> > -> agr_set1 -> sem_set1", and the same for lemma2.
> >
> > So, my question: is there an out-of-the-box way of performing
> > such queries (i.e., controlling positions of corresponding sets while
> > matching attribute sets with 'contains'), or does it have to be
> > implemented?
> >
> > -- Best Regards, Igor Shalyminov
> > _______________________________________________ CWB mailing list
> > CWB at sslmit.unibo.it
> > http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>
> --
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
>
--
------------------------------------------------
Ruprecht v. Waldenfels, waldenfels at issl.unibe.ch
Institut fuer slavische Sprachen und Literaturen
Universität Bern Laenggassstr. 49 CH 3005 Bern 9
Tel: +41 31 631 35 83 / Fax: +41 31 631 39 90
------------------------------------------------