[CWB] web-interface with aligned corpora and WebCqp::Persistent
Stefan Evert
stefan.evert at uos.de
Wed Feb 21 00:09:57 CET 2007
Dear Jörg,
thanks for your interest, and sorry that we kept you waiting a little
for an answer (turnaround times tend to be a little on the slow side
on this mailing list :o).
On 19 Feb 2007, at 17:56, Joerg Tiedemann wrote:
> I would like to use WebCqp::Persistent for a web interface with
> aligned corpora. I started looking at the CQPDemo scripts and I would
> like to have a similar interface only with additional columns in the
> resulting KWIC lists for aligned corpora. Would that be easy to add?
> Did somebody work on this already? I just wanted to ask before I
> start looking further into the source code.
I would be surprised if anyone had tried this (but perhaps someone
has a surprise in store for me? -- I'd be delighted), since the
CQPdemo source code is rather messy and ad-hoc, and not really
designed to be flexible and extensible.
So much for the bad news -- the good news is that I'm currently
working on the CWB release on sf.net, at last, and also on a new
version of the CWB/Perl interface (mostly renaming and clean-up, but
also a new standardised simplified query language dubbed CEQL, and
I'll make sure that it's fully consistent with the CWB source code
that will be released). I've already been thinking about the
possibility of re-engineering the CQPdemo so that we could use it as
a simple but convenient Web interface for our own corpora. The
problem here is how to make it flexible enough to support a larger
set of corpora (the current demo is tuned for a specific corpus, with
quite a bit of hard-coded formatting choices) without increasing the
complexity substantially (because it's about as complex as you can
sensibly get without a full-fledged database and user-management
system in the background).
If that is what you have in mind, I'd be delighted to cooperate on an
improved version of the CQPdemo Web interface! Alignment display
wasn't on my agenda, but it would be nice to have since we sometimes
work with parallel corpora, too. However, this can only be seen as a
short-term solution. In the long term, we really need a completely new and
more professional system with user and session management, as well as
a good plugin architecture for extending the range of display options.
> I have some further questions about handling aligned regions in CWB:
>
> * how do I get aligned regions for a given corpus position in the
>   source language? is that easy to do and fast to retrieve?
It's easy and relatively fast if you access the corpus directly
through the CL library (preferably encapsulated in the CL.pm Perl
module). Here's the relevant part from the (very sketchy) CL.pm
manpage:
    # CL::AttAlign objects

    # returns CL::AttAlign object (alignment attribute)
    $french = $corpus->attribute("hansard-fr", 'a');

    # alignment block numbers are 0 .. $nr_of_alignments-1
    $nr_of_alignments = $french->max_alg;

    # extended alignment allows gaps & crossing alignments
    $extended = $french->has_extended_alignment;

    # returns undef if no alignment was found
    $alg = $french->cpos2alg($cpos);

    # returns empty list on error
    ($src_start, $src_end, $target_start, $target_end)
        = $french->alg2cpos($alg);

If you're fluent in Perl, this should be more or less self-explanatory.
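
For readers less fluent in Perl, the semantics of the two lookup calls can be modelled in a few lines of Python. This is only a sketch and not the CL library: the class name AlignmentModel and the sample corpus positions are made up for illustration, and it assumes a plain (non-extended) alignment whose source regions do not overlap.

```python
from bisect import bisect_right


class AlignmentModel:
    """Toy stand-in for an alignment attribute (NOT the CL library)."""

    def __init__(self, beads):
        # Each bead is (src_start, src_end, target_start, target_end),
        # all inclusive corpus positions; source regions must not overlap.
        self.beads = sorted(beads)
        self._starts = [bead[0] for bead in self.beads]

    def max_alg(self):
        # Number of beads; valid bead numbers are 0 .. max_alg() - 1.
        return len(self.beads)

    def cpos2alg(self, cpos):
        # Binary search for the last bead starting at or before cpos.
        i = bisect_right(self._starts, cpos) - 1
        if i >= 0 and self.beads[i][0] <= cpos <= self.beads[i][1]:
            return i
        return None  # plays the role of undef: cpos is not aligned

    def alg2cpos(self, alg):
        if 0 <= alg < len(self.beads):
            return self.beads[alg]
        return ()  # empty tuple on error, like the empty list in Perl


# Hypothetical beads, purely for illustration.
french = AlignmentModel([(0, 9, 0, 11), (10, 24, 12, 20), (30, 41, 21, 35)])
alg = french.cpos2alg(17)        # bead number containing cpos 17
region = french.alg2cpos(alg)    # source and target boundaries of that bead
unaligned = french.cpos2alg(27)  # cpos 27 falls into a gap between beads
```

The point of the two-step lookup is that the bead number is a cheap handle: once you have it, the source and target boundaries come back together, and a missing alignment is detected before any target positions are touched.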
In CQP, there's a secret trick that allows you to display alignment
regions as context in the source language (rather than e.g.
sentences), so that you get actual alignment beads in combination
with the standard alignment attribute display. You just have to set
the display context to the name of the relevant alignment attribute
(e.g. "set context hansard-fr"), as if it were an s-attribute.
No guarantees that it'll work and not crash CQP, though, and it may
no longer be supported by future CWB versions.
> * can I restrict queries to get only results for which there are
>   aligned regions? I have to deal with partially aligned corpora and
>   I don't want to see matches with (no alignment found). Now I do
>   post filtering but that isn't very efficient. I just tried to add
>   dummy queries for the target languages (".*") and that actually
>   seems to work. But maybe there is a better way of doing this?!
CQP doesn't offer a more straightforward solution than this
post-filtering approach, and I'm glad to hear that there doesn't seem to
be a major speed penalty according to Lars (I expected that it would
be slow, based on previous experience with the rather flaky
implementation of aligned queries). However, if your corpus comes in
a pre-aligned form (so you don't want to run an alignment program on
the encoded corpus) and you need to translate the alignment
information to CWB format anyway, there is a general and very
convenient solution.
a) Add s-attributes (i.e. XML tags) to each corpus that carry unique
identifiers for (bilingual) alignment beads. Note that the
identifiers have to be exactly the same in both corpora of each pair
of aligned languages. If alignment beads for different language
pairings may have different sizes or overlap (within a single language
corpus), then you'll have to define a separate s-attribute for every
alignment. E.g. if you have parallel texts in DE, EN and FR with
the following sentence alignments: D1+D2--E1+E2, D3--E3 and D1--F1,
NULL--F2, D3--F3, the corpora would look like this:
DE:
<a_de_en id="de_en_1">
<a_de_fr id="de_fr_1">
[sentence 1]
</a_de_fr>
[sentence 2]
</a_de_en>
<a_de_en id="de_en_2">
<a_de_fr id="de_fr_3">
[sentence 3]
</a_de_fr>
</a_de_en>
EN:
<a_de_en id="de_en_1">
[sentence 1]
[sentence 2]
</a_de_en>
<a_de_en id="de_en_2">
[sentence 3]
</a_de_en>
FR:
<a_de_fr id="de_fr_1">
[sentence 1]
</a_de_fr>
[sentence 2]
<a_de_fr id="de_fr_3">
[sentence 3]
</a_de_fr>
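
If the alignment information is available as lists of sentence ranges, markup of this kind can be generated mechanically. The following Python sketch is one possible way to do it, not part of CWB: the function name mark_up and the bead tuple layout are assumptions made for illustration, the sentences are placeholders (real cwb-encode input would be one token per line), and regions of different attributes are assumed to nest properly rather than cross.

```python
def mark_up(sentences, beads):
    """Wrap sentences in alignment s-attribute tags.

    beads: list of (attribute, bead_id, first, last), where first and last
    are inclusive sentence indices. Regions must nest properly.
    """
    opens, closes = {}, {}
    for n, (att, bead_id, first, last) in enumerate(beads):
        length = last - first
        # Longest region opens first; ties keep the input order.
        opens.setdefault(first, []).append((-length, n, att, bead_id))
        # Shortest region closes first; ties close in reverse input order.
        closes.setdefault(last, []).append((length, -n, att))

    lines = []
    for i, sentence in enumerate(sentences):
        for _, _, att, bead_id in sorted(opens.get(i, [])):
            lines.append(f'<{att} id="{bead_id}">')
        lines.append(sentence)
        for _, _, att in sorted(closes.get(i, [])):
            lines.append(f"</{att}>")
    return lines


# The German side of the example: D1+D2--E1+E2, D3--E3, D1--F1, D3--F3.
german = mark_up(
    ["[sentence 1]", "[sentence 2]", "[sentence 3]"],
    [
        ("a_de_en", "de_en_1", 0, 1),
        ("a_de_fr", "de_fr_1", 0, 0),
        ("a_de_en", "de_en_2", 2, 2),
        ("a_de_fr", "de_fr_3", 2, 2),
    ],
)
print("\n".join(german))
```

Running the same function over the EN and FR sentence lists with their own bead ranges, and reusing the identifiers, gives the matching markup for the other two corpora. Unaligned sentences such as F2 simply appear outside any region.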
b) Encode all 3 corpora, then use the cwb-align sentence aligner to
perform a dummy alignment on the <a_de_en> and <a_de_fr> regions,
respectively. This generates a text file with the corpus positions of
the alignment beads, which can then be encoded into CWB format with
cwb-align-encode.
c) Now you can use the standard alignment attributes for aligned
queries and display of aligned sentences, but you can also use the
additional s-attributes to test whether each sentence is aligned to a
given language, to display alignment regions as context, or to find
the corpus positions of an alignment region in the source language.
E.g. to restrict a query to German sentences that are aligned to the
English corpus:
... query ... within a_de_en;
Hope this helps!
Stefan
--
"We killed Linux support in the CWB"
"You bastards!"
-- "CWBdev Park", December Fool's episode