[CWB] multiple commands with WebCqp::Query
Lars Nygaard
lars.nygaard at iln.uio.no
Fri Aug 25 17:32:23 CEST 2006
Stefan Evert wrote:
>> * the "'cut' applies too early" bug. I have to maintain a rather ugly
>> piece of code to simulate this feature (it's not as simple as it
>> might sound ...)
>>
>
> Yes, I think that problem (it happens for aligned queries, doesn't it?)
> is rather difficult to solve because of the complicated way in which
> CQP queries are evaluated - it'll probably need rewrites of major parts
> of CQP.
>
> How problematic is this issue? Would it be very expensive to run the
> full query (or at least with a much higher "cut" value), and then just
> take the first <n> matches (or even better, <n> randomly selected
> matches courtesy of "reduce")?
Well, that is how it works now, i.e.:
- if the user wants randomized results, there is no problem (the
problem only occurs with "cut")
- if the user does not want randomized results, I do run the full query
(setting a high "cut" value will not work in all cases, since the
alignment constraint may hold for only a tiny fraction of the matches).
Looking at my code again, I see that the issue is in fact quite simple
(the ugly code was for a combination of this and other issues, and
can now be discarded), but the speed issue still remains. The fact
that I'm working with a CGI application makes speed a more important
factor, since using CQP through the Perl modules slows everything down
quite a bit; so if people search for common words in large corpora, this
might be a problem (some users really dislike the randomize
function: they want the same results as the last time they ran the query
...).
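To make the difference concrete, here is a toy sketch in Python (not CQP and
not the Perl modules; the data and function names are made up for
illustration) of why a cut applied before the alignment constraint can return
too few hits, and what the run-everything-then-truncate workaround does
instead:

```python
from itertools import islice

def cut_then_filter(matches, constraint, n):
    """Simulates the bug: truncate to n matches first, then apply the constraint."""
    return [m for m in matches[:n] if constraint(m)]

def filter_then_cut(matches, constraint, n):
    """Workaround: apply the constraint to all matches, keep the first n survivors."""
    return list(islice((m for m in matches if constraint(m)), n))

# Hypothetical data: 1000 match positions; the (made-up) alignment
# constraint holds for only one match in a hundred.
matches = list(range(1000))
constraint = lambda m: m % 100 == 99

early = cut_then_filter(matches, constraint, 10)
late = filter_then_cut(matches, constraint, 10)
print(len(early), len(late))
```

The second variant is deterministic (same results every time), but it has to
look at every match, which is where the speed problem comes from.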
>
>>
>> * the "get position of matching phrase in aligned region" feature.
>> People who use parallel corpora spend a lot of time looking for the
>> "corresponding phrase" in aligned regions. It would be super-neat to
>> be able to highlight it. I'm considering hacking something together
>> (I guess I would have to run a separate search on the aligned
>> corpus, and then interpolate the position information), but again
>> there are some problems with that approach.
>
>
> This is definitely going to be difficult. Perhaps aligned queries
> should be split into a two-stage process, where first you run a query
> and then you filter the results with an alignment constraint. I've
> been thinking about a "translate" command for some time, which would be
> much easier to implement than major modifications of the aligned query
> mechanism, and which would take a query result and return the
> corresponding regions in an aligned corpus. With "translate" and some
> other helper commands, it should be possible to find a workaround for
> your problem, along these lines:
>
> > EUROPARL-DE;
> > German = < ... query ... >;
> > English-Regions = translate German to EUROPARL-EN;
> > English-Regions;
> > English = < ... aligned query ...>;
> > German-Regions = translate English to EUROPARL-DE;
> > EUROPARL-DE;
> > German = trim German German-Regions; # invented new command to keep
> >   # only matches within German-Regions
> > cat German 1 10; # first 10 matches of query on EUROPARL-DE with
> >   # alignment constraint applied
> > cat English 1 10; # corresponding licensing matches in the aligned
> >   # corpus EUROPARL-EN
>
> Can you figure out what I've got in mind from this example?
Yes, this seems very sensible.
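If I read the example correctly, the two helper commands would behave roughly
like the toy Python sketch below. Note that "translate" and "trim" do not
exist in CQP; the alignment table, the match spans and all function names
here are assumptions made up for illustration, and running the English query
over the whole corpus and then trimming it stands in for running it inside
English-Regions.

```python
# Toy alignment: pairs of (German region, English region), each an
# inclusive (start, end) span of corpus positions. Hypothetical data.
ALIGN_DE_EN = [((0, 9), (0, 11)), ((10, 19), (12, 20)), ((20, 29), (21, 33))]

def inside(span, region):
    """True if the match span lies completely within the region."""
    return region[0] <= span[0] and span[1] <= region[1]

def translate(matches, alignment):
    """Regions on the other side aligned to regions that contain a match."""
    return [tgt for src, tgt in alignment
            if any(inside(m, src) for m in matches)]

def trim(matches, regions):
    """Keep only the matches that fall within one of the regions."""
    return [m for m in matches if any(inside(m, r) for r in regions)]

def invert(alignment):
    """Swap direction, e.g. DE->EN becomes EN->DE."""
    return [(tgt, src) for src, tgt in alignment]

german = [(2, 3), (12, 12), (25, 26)]             # German = < ... query ... >
english_regions = translate(german, ALIGN_DE_EN)  # translate German to EN
english_hits = [(5, 6), (30, 31)]                 # < ... aligned query ... >
english = trim(english_hits, english_regions)
german_regions = translate(english, invert(ALIGN_DE_EN))
german = trim(german, german_regions)             # constraint now applied
print(german, english)
```

The German match at (12, 12) is dropped because its aligned English region
contains no hit, and the surviving spans in "english" are the positions one
would want to highlight.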
cheers,
lars