[CWB] multiple commands with WebCqp::Query

Stefan Evert stefan.evert at uos.de
Fri Aug 25 16:57:14 CEST 2006


> Speaking of the ToDo list: I'm sure many people have things they  
> want fixed or changed in CWB. My main use of CWB is as a backend  
> for a web application for exploring corpora (a bit like BNCWeb),  
> and for my part, there are basically just two important things  
> remaining on my wish list (a lot of it turned out to be possible  
> to do already, I just didn't know how):

Have you entered them in the sf.net bug tracker / feature request
system, just to make sure that we don't forget about them? (I think
at least the "cut" issue is already up there ... :o)

> * the "'cut' applies too early" bug. I have to maintain a rather  
> ugly piece of code to simulate this feature (it's not as simple as  
> it might sound ...)
>

Yes, I think that problem (it happens for aligned queries, doesn't  
it?) is rather difficult to solve because of the complicated way in  
which CQP queries are evaluated - it'll probably need rewrites of  
major parts of CQP.

How problematic is this issue? Would it be very expensive to run the  
full query (or at least with a much higher "cut" value), and then  
just take the first <n> matches (or even better, <n> randomly  
selected matches courtesy of "reduce")?
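
A minimal sketch of that workaround as a CQP session, assuming the
Europarl corpora named further down in this message. The query itself
and the values 5000 and 10 are purely illustrative. It relies on the
query-level "cut" modifier and the "reduce" command; whether the
generous cut value leaves enough matches after the alignment constraint
has been applied depends on the data.

```
EUROPARL-DE;
# hypothetical aligned query; the cut value is deliberately much
# larger than the number of matches that will finally be displayed
A = "Haus" :EUROPARL-EN "house" cut 5000;
# keep 10 randomly selected matches out of whatever survived
reduce A to 10;
cat A;
```

To keep the first 10 matches rather than a random sample, "cut A 10;"
could be used in place of the "reduce" line.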

>
> * the "get position of matching phrase in aligned region" feature.
> People that use parallel corpora spend a lot of time looking for
> the "corresponding phrase" in aligned regions. It would be
> super-neat to be able to highlight it. I'm considering hacking
> something together (I guess I would have to run a separate search
> on the aligned corpus, and then interpolate the position
> information), but again there are some problems with that approach.

This is definitely going to be difficult.  Perhaps aligned queries  
should be split into a two-stage process, where first you run a query  
and then you filter the results with an alignment constraint.  I've  
been thinking about a "translate" command for some time, which would  
be much easier to implement than major modifications of the aligned  
query mechanism, and which would take a query result and return the  
corresponding regions in an aligned corpus.  With "translate" and  
some other helper commands, it should be possible to find a  
workaround for your problem, along these lines:

 > EUROPARL-DE;
 > German = < ... query ... >;
 > English-Regions = translate German to EUROPARL-EN;
 > English-Regions;
 > English = < ... aligned query ...>;
 > German-Regions = translate English to EUROPARL-DE;
 > EUROPARL-DE;
 > # invented new command to keep only matches within German-Regions
 > German = trim German German-Regions;
 > # first 10 matches of query on EUROPARL-DE with alignment
 > # constraint applied
 > cat German 1 10;
 > # corresponding licensing matches in the aligned corpus EUROPARL-EN
 > cat English 1 10;

Can you figure out what I've got in mind from this example?

Best,
Stefan


