[CWB] [ cwb-Feature Requests-2966922 ] CQPweb: Annotate query

Mon Aug 1 02:07:48 CEST 2011

Hi Yannick,

thanks very much for the suggestions, especially re: adding annotation fields automatically. I have added a note on this to the feature request thread.

This request has actually been on the books for quite a while, and I have yet to start implementing it due to the presence of more urgent things on the TODO list for CQPweb. So if anyone else has design suggestions, feel free to make them, either on the list or directly into the FR database:

https://sourceforge.net/tracker/index.php?func=detail&aid=2966922&group_id=131809&atid=722306 

... as it is much easier to expand the design before programming work starts than to go back and rework things afterwards!

(This FR was reposted to the list, incidentally, because a spam-bot was probing the bug database input form on sourceforge. This happened quite a few times in June, but seems to have died off now. If it becomes a problem again, and is really getting on people's nerves, we can stop sourceforge tracker updates from being sent to the list. -- Of course, that said, I've just now had to commit much worse list-abuse in my sorting-out of the items in the tracker, but this kind of spring-clean is only necessary once every 3 or 4 years, so hopefully no one was too annoyed.)

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Yannick Versley
Sent: 24 June 2011 14:50
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] [ cwb-Feature Requests-2966922 ] CQPweb: Annotate query

Dear Andrew et al.,

we've been using something roughly similar for different tasks, including
the senses of discourse connectives, although with spans that come out
of a program and not directly from a CQP query result.
80-90% of all ad-hoc tasks can be solved using simple categorization,
the next ~5% can be solved using some kind of simple schema
(multiple attributes, which can either be from a set of categories,
or free text), and beyond that come annotation tasks that are beyond
the scope of a simple query tool (I think).
If you delegate the annotation to students, you may also want to
have a text field for comments (which is typically larger than one
that has just a lemma, and used ).

In our case, I included support for multiple annotators (i.e., you have
corpus x span x annotator as the primary key for a given annotation)
and the whole thing is saved to a MongoDB database instead of MySQL
(which means that you can change the scheme after annotating some
data, which is possible but difficult in MySQL).
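
A minimal sketch of the "corpus x span x annotator" key with schema-free values, using a plain in-memory dictionary rather than MongoDB. The class and method names are hypothetical and are not taken from the system described above:

```python
from typing import Any, Dict, Tuple

# Primary key: (corpus, span_start, span_end, annotator).
AnnotationKey = Tuple[str, int, int, str]


class AnnotationStore:
    """Keeps one free-form attribute dict per (corpus, span, annotator)."""

    def __init__(self) -> None:
        self._records: Dict[AnnotationKey, Dict[str, Any]] = {}

    def annotate(self, corpus: str, start: int, end: int,
                 annotator: str, **attrs: Any) -> None:
        # New attributes can appear at any time; no schema migration needed.
        key = (corpus, start, end, annotator)
        self._records.setdefault(key, {}).update(attrs)

    def by_span(self, corpus: str, start: int, end: int) -> Dict[str, Dict[str, Any]]:
        # Return every annotator's record for one span, keyed by annotator.
        return {
            key[3]: dict(values)
            for key, values in self._records.items()
            if key[:3] == (corpus, start, end)
        }
```

Because the annotator is part of the key, two annotators can label the same span without overwriting each other, and one of them can add a "comment" attribute that the other never uses.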

I would picture myself writing bits of code for the case you describe, but
you could probably also make this easier for non-programming users if
you allow them to pre-populate a field with, say, something like
"[p-attr] lemma of next token to the [right/left]right, within [number]10 words,
where [p-attr] POS is [regex] V.*"
which would reduce the annotation effort somewhat.
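
To make the pre-population rule concrete, here is a rough sketch. It assumes each token is a dict of p-attribute values and that the condition regex must match the whole attribute value; the function name and parameters are hypothetical, not part of CQPweb:

```python
import re
from typing import Dict, List


def prefill_label(tokens: List[Dict[str, str]], anchor: int,
                  target_attr: str = "lemma", direction: str = "right",
                  window: int = 10, cond_attr: str = "pos",
                  cond_regex: str = r"V.*") -> str:
    """Return target_attr of the nearest token in the given direction,
    within `window` tokens of `anchor`, whose cond_attr matches cond_regex.
    Returns an empty string if no such token is found."""
    step = 1 if direction == "right" else -1
    pattern = re.compile(cond_regex)
    for offset in range(1, window + 1):
        i = anchor + step * offset
        if i < 0 or i >= len(tokens):
            break  # ran off the edge of the available context
        if pattern.fullmatch(tokens[i][cond_attr]):
            return tokens[i][target_attr]
    return ""
```

The annotator would then only have to correct the pre-filled value where the heuristic picks the wrong token, instead of typing every label from scratch.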

In a board meeting after the second CQP tutorial at DGfS (German
linguistics society;
the tutorial was by Stefan Evert and Heike Zinsmeister), some people
asked about more advanced possibilities, including annotation - I guess that
"categorize query" or some more advanced "annotate query" would be right
up the alley of linguists who would like to do serious work with corpora but
are put off by the technical complications of it all (and who, through whatever
means, can get access to a CQPweb installation ;-) ).

Best wishes,
Yannick

On Fri, Jun 24, 2011 at 3:00 PM, SourceForge.net
<noreply at sourceforge.net> wrote:
> Feature Requests item #2966922, was opened at 2010-03-09 20:20
> Message generated for change (Comment added) made by nobody
> You can respond by visiting:
> https://sourceforge.net/tracker/?func=detail&atid=722306&aid=2966922&group_id=131809
>
> Please note that this message will contain a full copy of the comment thread,
> including the initial issue submission, for this request,
> not just the latest update.
> Category: CQPweb
> Group: None
> Status: Open
> Priority: 1
> Private: No
> Submitted By: Andrew Hardie (andrewhardie)
> Assigned to: Andrew Hardie (andrewhardie)
> Summary: CQPweb: Annotate query
>
> Initial Comment:
> CQPweb: Annotate query
>
> These are basic design notes for a proposed Annotate query function which will extend (and possibly ultimately subsume) the existing Categorise query functionality. As such, it will be one of the rare CQPweb features that is not a
>
> Comments on the proposal on this sourceforge thread are welcome, although it's not currently a high priority.
>
> Current situation: Categorise query allows you to define a set of categories and then annotate each line of a given concordance by assigning one of those categories to it. Categories are effectively values of a single attribute, where the attribute values are limited to a set. But potentially, we might want to annotate free values. Below is an example of why.
>
>
> Problem:
>
> Say you are doing a Gries-style collostruction analysis of the BE GOING TO + VERB construction, and you want to know what collexemes are in the VERB slot.
>
> So a search for _VVGK is the starting point.
>
> Then you need to annotate your head verb (inconsistent position). You need its lemma. Can't be done automatically. You want to assign a label (the lemma) to each conc line. But non-finite set of labels.
>
> Current solution: download and analyse in Excel. Unsatisfactory - innovation in tools is a necessity, also avoids stagnation of widely-used methodologies.
>
> We want the tools to do it for us - no download, we might want to reupload (which is currently possible, but
>
> Solution:
>
> (1)     Add extra menu option, "Label query" << cos I will use the word "annotate" for something else later.
> (2)     Like categorise query, you get to name the query. But you don't specify values. Instead, you get an empty text box to type whatever you want.
> (3)     You can save this, just like a categorise query. Realised as a database, like Categorise Query.
> (4)     Limited to \w and 0x20, for safety. Use a regex filter.
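
A small sketch of the filter in step (4), assuming that "\w and 0x20" means ASCII word characters plus the space character, and that an empty label is allowed. The function name is hypothetical:

```python
import re

# Word characters and the space character (0x20) only. re.ASCII restricts
# \w to [A-Za-z0-9_]; drop the flag to allow Unicode letters as well.
_LABEL_RE = re.compile(r"[\w\x20]*", re.ASCII)


def is_valid_label(label: str) -> bool:
    """True if the label contains nothing but word characters and spaces."""
    return _LABEL_RE.fullmatch(label) is not None
```

Using fullmatch rather than a `^...$` search avoids the case where `$` would accept a trailing newline.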
>
> Then, you can search the label column to extract subsets. It can't be a straight split like with categorise or else you'd get too many subsets. Instead, specify a regular expression: any instance which matches that regex goes into a new query.
>
> Alternatively, you can get a frequency breakdown of the contents of that
>
> This would be the data you'd need for the collostruction analysis I gave as an example: a list of lemma labels, with frequencies, in the verb-slot of that construction.
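
The two operations on the label column (regex subset and frequency breakdown) could be sketched as below. This assumes the labelled query is available as a list of (corpus position, label) pairs and that the regex must match the whole label; both function names are hypothetical:

```python
import re
from collections import Counter
from typing import List, Tuple

LabelledHit = Tuple[int, str]  # (corpus position of the hit, label)


def subset_by_label(hits: List[LabelledHit], pattern: str) -> List[int]:
    """Corpus positions of all hits whose label matches the regex in full."""
    rx = re.compile(pattern)
    return [pos for pos, label in hits if rx.fullmatch(label)]


def label_frequencies(hits: List[LabelledHit]) -> List[Tuple[str, int]]:
    """Labels with their frequencies, most frequent first."""
    return Counter(label for _, label in hits).most_common()
```

The output of label_frequencies is the list of lemma labels with frequencies that the collostruction example calls for.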
>
> =====
> Of course, this raises the question: shouldn't we go further and allow multiple annotation fields?
>
> EG for Gries/Divjak style behavioural profiles: every example is annotated with multiple attribute-value pairs; the results are then the input to exploratory statistics (hierarchical cluster analysis in this case). One attribute would identify the groups you are trying to cluster (e.g. senses of one word, or which of two near-synonyms it is). The other attributes would need to be what G/D call ID Tags.
>
> Why shouldn't it be possible to have multiple manually-adjustable attributes in CQPweb? Why should people have to download, annotate, reupload?
>
> In this case, the procedure would probably be as follows:
>
> 1)      you can define an annotation SCHEME. The scheme specifies a list of attributes, and whether they are labels or a closed-list. If it is a closed list all possible values are listed too. (saves redefining multiple lists of attributes and values at the time of creating your annotated query)
> 2)      A separate table for this, something like manual_annot_schemes
> 3)      You have an Annotate query option which allows you to link your query to one of the schemes you have defined
> 4)      Schemes can be public across a corpus or installation - allowing, for example, teachers to set up the categories that they then give to students to apply to a concordance.
> 5)      Query + scheme = shape of database. A record in saved_manual_annots keeps track of it; the actual data is stored in a separate table of corpus positions for the hit plus as many fields as necessary.
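
As an illustration of step 1), a scheme with a mix of free-text labels and closed lists might be modelled roughly as follows. This is only a sketch: the class names and the validate method are hypothetical, and nothing here reflects how CQPweb stores or would store schemes:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Attribute:
    name: str
    # None means a free-text label; a list means a closed list of values.
    allowed: Optional[List[str]] = None


@dataclass
class Scheme:
    name: str
    attributes: List[Attribute]

    def validate(self, values: Dict[str, str]) -> List[str]:
        """Return a list of problems; an empty list means the record fits."""
        known = {attr.name: attr for attr in self.attributes}
        problems = []
        for key, value in values.items():
            attr = known.get(key)
            if attr is None:
                problems.append(f"unknown attribute: {key}")
            elif attr.allowed is not None and value not in attr.allowed:
                problems.append(f"{key}: {value!r} not in closed list")
        return problems
```

Under this model, a Categorise query is a scheme with one closed-list attribute and a Label query is a scheme with one free-text attribute.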
>
> It might well be possible to have R in the background so that the cluster analysis, or other exploratory statistics, could be applied automatically.
>
> Or to compare the results of applying a scheme to one query to the results of applying the scheme to another query.
>
> (There would of course need to be a sophisticated web interface to managing all of this and manipulating the results of annotating a query).
>
> Now note. The categorise query and label query functions become special cases (single-column annotation schemes) of annotate query.
>
> They should probably be kept for compatibility however. (on the fly automatic creation of annotation schemes when you define categories relevant to a specific query).
>
> Now, the questions: would this be a useful feature? Should it work as described, or otherwise?
>
>
> ----------------------------------------------------------------------
>
> Comment By: Andrew Hardie (andrewhardie)
> Date: 2010-03-11 10:47
>
> Message:
> Other things:
> -- make columns interconvertible between labels and closed-list (ie
> "levels" and free-text)
>
> ----------------------------------------------------------------------
>
> You can respond by visiting:
> https://sourceforge.net/tracker/?func=detail&atid=722306&aid=2966922&group_id=131809
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>