[CWB] multiple commands with WebCqp::Query

Lars Nygaard lars.nygaard at iln.uio.no
Fri Aug 25 17:32:23 CEST 2006



Stefan Evert wrote:

>> * the "'cut' applies too early" bug. I have to maintain a rather ugly
>> piece of code to simulate this feature (it's not as simple as it
>> might sound ...)
>>
> 
> Yes, I think that problem (it happens for aligned queries, doesn't it?)
> is rather difficult to solve because of the complicated way in which
> CQP queries are evaluated - it'll probably need rewrites of major parts
> of CQP.
> 
> How problematic is this issue? Would it be very expensive to run the
> full query (or at least with a much higher "cut" value), and then just
> take the first <n> matches (or even better, <n> randomly selected
> matches courtesy of "reduce")?

Well, that is how it works now, i.e.:
	- if the user wants randomized results, there is no problem (the 
problem only occurs with "cut")
	- if the user does not want randomized results, I do run the full query 
(setting a high "cut" value will not work in all cases, since the 
alignment constraint may hold for only a tiny fraction of the matches).
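
To illustrate the difference, here is a minimal Python sketch (plain Python, 
not CQP or WebCqp::Query code; the function names, the predicate and the toy 
data are all hypothetical) that contrasts a cut applied before the alignment 
constraint with one applied after it:

```python
from itertools import islice

def cut_then_filter(matches, has_aligned_hit, n):
    """Mimics the bug: truncate to n matches first, then apply the constraint."""
    return [m for m in islice(matches, n) if has_aligned_hit(m)]

def filter_then_cut(matches, has_aligned_hit, n):
    """Intended behaviour: apply the constraint, then keep the first n survivors."""
    return list(islice((m for m in matches if has_aligned_hit(m)), n))

# Toy data (assumption): 1000 match positions, of which only every 100th
# satisfies the hypothetical alignment constraint.
matches = range(1000)
has_aligned_hit = lambda m: m % 100 == 99

early = cut_then_filter(matches, has_aligned_hit, 10)  # cut applied too early
late = filter_then_cut(matches, has_aligned_hit, 10)   # cut applied last
```

With this toy data the early cut returns nothing at all, while the late cut 
returns the requested 10 matches. The late cut stops as soon as it has found 
enough matches, but when the constraint rarely holds it still has to scan 
most of the result, which is where the speed issue comes from.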

Looking at my code again, I see that the issue is in fact quite simple 
(the ugly code was for a combination of this and other issues, and 
can now be discarded), but the speed issue still remains. The fact 
that I'm working with a CGI application makes speed a more important 
factor, since using CQP through the Perl modules slows everything down 
quite a bit; so if people search for common words in large corpora, this 
might be a problem (since some users really dislike the randomize 
function: they want the same results as the last time they ran the query 
...).

> 
>>
>> * the "get position of matching phrase in aligned region" feature.
>> People that use parallel corpora spend a lot of time looking for the
>> "corresponding phrase" in aligned regions. It would be super-neat to
>> be able to highlight it. I'm considering hacking something together
>> (I guess I would have to run a separate search on the aligned
>> corpus, and then interpolate the position information), but again
>> there are some problems with that approach.
> 
> 
> This is definitely going to be difficult. Perhaps aligned queries
> should be split into a two-stage process, where first you run a query
> and then you filter the results with an alignment constraint. I've
> been thinking about a "translate" command for some time, which would be
> much easier to implement than major modifications of the aligned query
> mechanism, and which would take a query result and return the
> corresponding regions in an aligned corpus. With "translate" and some
> other helper commands, it should be possible to find a workaround for
> your problem, along these lines:
> 
>  > EUROPARL-DE;
>  > German = < ... query ... >;
>  > English-Regions = translate German to EUROPARL-EN;
>  > English-Regions;
>  > English = < ... aligned query ...>;
>  > German-Regions = translate English to EUROPARL-DE;
>  > EUROPARL-DE;
>  > German = trim German German-Regions; # invented new command to keep only matches within German-Regions
>  > cat German 1 10;  # first 10 matches of query on EUROPARL-DE with alignment constraint applied
>  > cat English 1 10; # corresponding licensing matches in the aligned corpus EUROPARL-EN
> 
> Can you figure out what I've got in mind from this example?

Yes, this seems very sensible.
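
As I read it, the proposed commands would behave roughly like the following 
Python sketch. This is only an illustration under my own assumptions: 
"translate" and "trim" do not exist in CQP, matches and regions are modelled 
as (start, end) corpus positions, the alignment is a plain list of region 
pairs, and all the data is made up.

```python
def translate(matches, alignment):
    """Return the target regions aligned to the source regions containing each match."""
    regions = []
    for start, end in matches:
        for (s_lo, s_hi), target in alignment:
            if s_lo <= start and end <= s_hi and target not in regions:
                regions.append(target)
    return regions

def trim(matches, regions):
    """Keep only the matches that lie entirely inside one of the given regions."""
    return [(a, b) for a, b in matches
            if any(lo <= a and b <= hi for lo, hi in regions)]

# Hypothetical alignment: (German region, English region) pairs.
de_en = [((0, 9), (0, 7)), ((10, 19), (8, 20)), ((20, 29), (21, 30))]
en_de = [(en, de) for de, en in de_en]

german = [(3, 4), (12, 13), (25, 25)]             # matches of the German query
english_regions = translate(german, de_en)        # aligned English regions
english = trim([(9, 10), (22, 23)], english_regions)  # English hits inside them
german_regions = translate(english, en_de)        # back to the German side
german = trim(german, german_regions)             # alignment constraint applied
```

Here the German match at (3, 4) is dropped because its aligned English region 
contains no hit, and the two remaining matches can be paired with the English 
hits for highlighting.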

cheers,
lars

