[CWB] multiple commands with WebCqp::Query
Stefan Evert
stefan.evert at uos.de
Fri Aug 25 16:57:14 CEST 2006
> Speaking of the ToDo list: I'm sure many people have things they
> want fixed or changed in CWB. My main use of CWB is as a backend
> for a web application for exploring corpora (a bit like BNCWeb),
> and for my part, there are basically just two important things
> remaining on my wish list (a lot of it turned out to be possible
> to do already, I just didn't know how):
Have you entered them in the sf.net bug tracker / feature request
system? (I think at least the "cut" issue is up there.) Just to make
sure that we don't forget about them ... :o)
> * the "'cut' applies too early" bug. I have to maintain a rather
> ugly piece of code to simulate this feature (it's not as simple as
> it might sound ...)
>
Yes, I think that problem (it happens for aligned queries, doesn't
it?) is rather difficult to solve because of the complicated way in
which CQP queries are evaluated - it'll probably need rewrites of
major parts of CQP.
How problematic is this issue? Would it be very expensive to run the
full query (or at least with a much higher "cut" value), and then
just take the first <n> matches (or even better, <n> randomly
selected matches courtesy of "reduce")?
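
To make that workaround concrete, here is a minimal Python sketch. It
models a query result as a list of (match, matchend) corpus positions
and applies the limit only after the full result is available. The
function names and the data are made up for illustration; this is not
how CQP implements "cut" or "reduce".

```python
import random

def first_n(matches, n):
    """Keep the first n matches, like a "cut" applied after the full query."""
    return matches[:n]

def random_n(matches, n, seed=None):
    """Keep n randomly chosen matches, returned in corpus order."""
    if n >= len(matches):
        return list(matches)
    rng = random.Random(seed)
    return sorted(rng.sample(matches, n))

# hypothetical result of the full query, as (match, matchend) pairs
full_result = [(10, 12), (40, 41), (77, 80), (120, 121), (300, 305)]

print(first_n(full_result, 3))
print(random_n(full_result, 3, seed=1))
```

Whether this is affordable depends on how large the full result gets,
which is exactly the question above.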
>
> * the "get position of matching phrase in aligned region" feature.
> People that use parallel corpora spend a lot of time looking for
> the "corresponding phrase" in aligned regions. It would be
> super-neat to be able to highlight it. I'm considering hacking
> something together (I guess I would have to run a separate search
> on the aligned corpus, and then interpolate the position
> information), but again there are some problems with that approach.
This is definitely going to be difficult. Perhaps aligned queries
should be split into a two-stage process, where first you run a query
and then you filter the results with an alignment constraint. I've
been thinking about a "translate" command for some time, which would
be much easier to implement than major modifications of the aligned
query mechanism, and which would take a query result and return the
corresponding regions in an aligned corpus. With "translate" and
some other helper commands, it should be possible to find a
workaround for your problem, along these lines:
> EUROPARL-DE;
> German = < ... query ... >;
> English-Regions = translate German to EUROPARL-EN;
> English-Regions;
> English = < ... aligned query ...>;
> German-Regions = translate English to EUROPARL-DE;
> EUROPARL-DE;
> German = trim German German-Regions; # invented new command to keep only matches within German-Regions
> cat German 1 10; # first 10 matches of query on EUROPARL-DE with alignment constraint applied
> cat English 1 10; # corresponding licensing matches in the aligned corpus EUROPARL-EN
Can you figure out what I've got in mind from this example?
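
In case the CQP session is too terse, the same idea can be sketched
in Python on toy data. Everything here is hypothetical: "translate"
and "trim" do not exist in CQP, the alignment is modelled as a plain
list of (source region, target region) pairs, and the positions are
invented. It only illustrates the intended semantics.

```python
def contains(region, match):
    """True if the match lies completely inside the region."""
    return region[0] <= match[0] and match[1] <= region[1]

def translate(matches, beads):
    """Target regions of all alignment beads whose source region holds a match."""
    return [tgt for (src, tgt) in beads
            if any(contains(src, m) for m in matches)]

def trim(matches, regions):
    """Keep only the matches that lie inside one of the regions."""
    return [m for m in matches if any(contains(r, m) for r in regions)]

def flip(beads):
    """Reverse the alignment direction."""
    return [(tgt, src) for (src, tgt) in beads]

# invented alignment: (German region, English region)
beads = [((0, 9), (0, 11)), ((10, 19), (12, 20)),
         ((20, 29), (21, 33)), ((30, 39), (34, 45))]

german = [(3, 4), (12, 12), (25, 27)]        # German query result
english_raw = [(5, 6), (22, 23), (40, 41)]   # English query, unrestricted

english_regions = translate(german, beads)
english = trim(english_raw, english_regions)  # query within English-Regions
german_regions = translate(english, flip(beads))
german = trim(german, german_regions)         # alignment constraint applied

print(german)
print(english)
```

The German match at position 12 is dropped because its aligned region
contains no English match, and the English match at 40 is dropped
because its region is not aligned to any German match.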
Best,
Stefan