[CWB] web-interface with aligned corpora and WebCqp::Persistent

Wed Feb 21 00:09:57 CET 2007

Dear Jörg,

thanks for your interest, and sorry that we kept you waiting a little  
for an answer (turnaround times tend to be a little on the slow side  
on this mailing list :o).

On 19 Feb 2007, at 17:56, Joerg Tiedemann wrote:

> I would like to use WebCqp::Persistent for a web interface with aligned
> corpora. I started looking at the CQPDemo scripts and I would like to
> have a similar interface, only with additional columns in the resulting
> KWIC lists for aligned corpora. Would that be easy to add? Did somebody
> work on this already? I just wanted to ask before I start looking further
> into the source code.

I would be surprised if anyone had tried this (but perhaps someone  
has a surprise in store for me? -- I'd be delighted), since the  
CQPdemo source code is rather messy and ad-hoc, and not really  
designed to be flexible and extensible.

So much for the bad news.  The good news is that I'm currently
working on the CWB release on sf.net, at last, and also on a new
version of the CWB/Perl interface (mostly renaming and clean-up, but
also a new standardised simplified query language dubbed CEQL); I'll
make sure that it's fully consistent with the CWB source code that
will be released.  I've already been thinking about re-engineering
the CQPdemo so that we could use it as a simple but convenient Web
interface for our own corpora.  The problem is how to make it
flexible enough to support a larger set of corpora (the current demo
is tuned for a specific corpus, with quite a few hard-coded
formatting choices) without substantially increasing its complexity
(it's already about as complex as you can sensibly get without a
full-fledged database and user-management system in the background).

If that is what you have in mind, I'd be delighted to cooperate on an
improved version of the CQPdemo Web interface!  Alignment display
wasn't on my agenda, but it would be nice to have since we sometimes
work with parallel corpora, too.  Even an improved CQPdemo can only
be a short-term solution, though.  In the long term, we really need a
completely new and more professional system with user and session
management, as well as a good plugin architecture for extending the
range of display options.

> I have some further questions about handling aligned regions in CWB:
>
> * how do I get aligned regions for a given corpus position in the source
>   language? is that easy to do and fast to retrieve?

It's easy and relatively fast if you access the corpus directly  
through the CL library (preferably encapsulated in the CL.pm Perl  
module).  Here's the relevant part from the (very sketchy) CL.pm  
manpage:

          # CL::AttAlign objects
          $french = $corpus->attribute("hansard-fr", 'a'); # returns CL::AttAlign object (alignment attribute)
          $nr_of_alignments = $french->max_alg;          # alignment block numbers are 0 .. $nr_of_alignments-1
          $extended = $french->has_extended_alignment;   # extended alignment allows gaps & crossing alignments

          $alg = $french->cpos2alg($cpos);               # returns undef if no alignment was found
          ($src_start, $src_end, $target_start, $target_end)
               = $french->alg2cpos($alg);                # returns empty list on error

If you're fluent in Perl, this should be more or less self-explanatory.
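
For readers less fluent in Perl, the two lookup calls can be pictured
with a small toy model.  This is only a sketch in plain Python, not
the CL library and not how CL stores alignments internally: the bead
table and its corpus positions are invented, and the function names
merely mirror the CL.pm method names for readability.

```python
from bisect import bisect_right

# Hypothetical alignment table, one tuple per bead:
# (src_start, src_end, tgt_start, tgt_end), sorted by src_start.
BEADS = [
    (0, 11, 0, 13),
    (12, 20, 14, 25),
    (30, 41, 26, 40),  # source positions 21..29 are left unaligned
]
SRC_STARTS = [bead[0] for bead in BEADS]


def cpos2alg(cpos):
    """Return the number of the bead containing cpos, or None if unaligned."""
    i = bisect_right(SRC_STARTS, cpos) - 1
    if i >= 0 and BEADS[i][0] <= cpos <= BEADS[i][1]:
        return i
    return None


def alg2cpos(alg):
    """Return (src_start, src_end, tgt_start, tgt_end), or () on error."""
    if alg is None or not 0 <= alg < len(BEADS):
        return ()
    return BEADS[alg]


for cpos in (5, 15, 25, 35):
    alg = cpos2alg(cpos)
    print(cpos, alg, alg2cpos(alg))
```

The binary search is just one plausible way to make the lookup fast;
the point is that a corpus position maps to at most one bead number,
and a bead number maps back to a pair of source and target ranges.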

In CQP, there's a secret trick that allows you to display alignment  
regions as context in the source language (rather than e.g.  
sentences), so that you get actual alignment beads in combination  
with the standard alignment attribute display.  You just have to set  
the display context to the name of the relevant alignment attribute  
(e.g. "set context hansard-fr"), as if it were an s-attribute.

No guarantees that it'll work and not crash CQP, though, and it may  
no longer be supported by future CWB versions.

> * can I restrict queries to get only results for which there are aligned
>   regions? I have to deal with partially aligned corpora and I don't want
>   to see matches with (no alignment found). Now I do post filtering, but
>   that isn't very efficient. I just tried to add dummy queries for the
>   target languages (".*") and that actually seems to work. But maybe there
>   is a better way of doing this?!

CQP doesn't offer a more straightforward solution than this post- 
filtering approach, and I'm glad to hear that there doesn't seem to  
be a major speed penalty according to Lars (I expected that it would  
be slow, based on previous experience with the rather flaky  
implementation of aligned queries).  However, if your corpus comes in  
a pre-aligned form (so you don't want to run an alignment program on  
the encoded corpus) and you need to translate the alignment  
information to CWB format anyway, there is a general and very  
convenient solution.

a) Add s-attributes (i.e. XML tags) to each corpus that carry unique
identifiers for (bilingual) alignment beads.  Note that the
identifiers have to be exactly the same in both corpora of each
aligned language pair.  If alignment beads for different language
pairings may have different sizes or overlap (within a single
language corpus), then you'll have to define a separate s-attribute
for every alignment.  E.g. if you have parallel texts in DE, EN and
FR with the following sentence alignments:

   DE-EN:  D1+D2--E1+E2  D3--E3
   DE-FR:  D1--F1  NULL--F2  D3--F3

the corpora would look like this:

DE:
<a_de_en id="de_en_1">
<a_de_fr id="de_fr_1">
[sentence 1]
</a_de_fr>
[sentence 2]
</a_de_en>
<a_de_en id="de_en_2">
<a_de_fr id="de_fr_3">
[sentence 3]
</a_de_fr>
</a_de_en>

EN:
<a_de_en id="de_en_1">
[sentence 1]
[sentence 2]
</a_de_en>
<a_de_en id="de_en_2">
[sentence 3]
</a_de_en>

FR:
<a_de_fr id="de_fr_1">
[sentence 1]
</a_de_fr>
[sentence 2]
<a_de_fr id="de_fr_3">
[sentence 3]
</a_de_fr>

b) Encode all three corpora, then use the cwb-align sentence aligner
to perform a dummy alignment on the <a_de_en> and <a_de_fr> regions,
respectively.  This generates a text file with the corpus positions
of the alignment beads, which can then be encoded into CWB format
with cwb-align-encode.
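
Conceptually, the dummy alignment just pairs up regions that carry the
same identifier in both corpora.  The following Python sketch
illustrates that idea only: it does not call cwb-align or reproduce
its file format, and the function name and corpus positions are made
up for the example.

```python
def pair_beads(src_regions, tgt_regions):
    """Pair regions with identical ids.

    Both arguments map a bead id to a (start, end) corpus position range.
    Returns sorted (src_start, src_end, tgt_start, tgt_end) tuples; ids
    that occur in only one corpus are skipped.
    """
    beads = []
    for bead_id, (s_start, s_end) in src_regions.items():
        if bead_id in tgt_regions:
            t_start, t_end = tgt_regions[bead_id]
            beads.append((s_start, s_end, t_start, t_end))
    return sorted(beads)


# Invented corpus positions for the <a_de_en> regions of the DE and EN corpora.
de = {"de_en_1": (0, 17), "de_en_2": (18, 26)}
en = {"de_en_1": (0, 19), "de_en_2": (20, 27)}

for bead in pair_beads(de, en):
    print("\t".join(str(pos) for pos in bead))
```

This also shows why the identifiers must match exactly across the two
corpora: a region whose id has no counterpart simply yields no bead.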

c) Now you can use the standard alignment attributes for aligned  
queries and display of aligned sentences, but you can also use the  
additional s-attributes to test whether each sentence is aligned to a  
given language, to display alignment regions as context, or to find  
the corpus positions of an alignment region in the source language.   
E.g. to restrict a query to German sentences that are aligned to the  
English corpus:

   ... query ...  within a_de_en;
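
The effect of such a restriction can be modelled as a containment
test: a match is kept only if it lies completely inside one aligned
region.  The Python sketch below is an illustration with invented
positions and hypothetical function names, not CQP's implementation.

```python
def within_region(match, regions):
    """True if the (start, end) match lies entirely inside one region."""
    start, end = match
    return any(r_start <= start and end <= r_end for r_start, r_end in regions)


def restrict_to_aligned(matches, regions):
    """Keep only the matches that fall within an aligned region."""
    return [m for m in matches if within_region(m, regions)]


# Invented <a_de_en> regions and query matches in the German corpus.
a_de_en = [(0, 17), (18, 26)]
matches = [(3, 4), (17, 18), (30, 31)]

# (17, 18) straddles two regions and (30, 31) is unaligned, so both are dropped.
print(restrict_to_aligned(matches, a_de_en))
```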

Hope this helps!
Stefan

--
"We killed Linux support in the CWB"
   "You bastards!"
                             -- "CWBdev Park", December Fool's episode