[CWB] [ cwb-Feature Requests-2898799 ] a slow CQP query...

SourceForge.net noreply at sourceforge.net
Mon Aug 1 01:21:43 CEST 2011


Feature Requests item #2898799, was opened at 2009-11-17 00:39
Message generated for change (Settings changed) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722306&aid=2898799&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQP
>Group: TODO-4.0
Status: Open
Priority: 1
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Nobody/Anonymous (nobody)
Summary: a slow CQP query...

Initial Comment:
The following query seemed to run rather slowly (on the BNC: someone wanted instances of "done" used for past tense.)

[hw!="be" & hw!="have"] [pos="PNP"] [word="done"]

My guess is that the slowness is because of the gargantuan number of matches for the first element in the string. (Stefan might correct me on this!) Which leads me to wonder, is there any way of optimising this kind of thing and speeding it up? OR does this happen already?


----------------------------------------------------------------------

Comment By: Andrew Hardie (andrewhardie)
Date: 2009-11-17 18:26

Message:
A thought occurs: could we possibly get quite a lot of improvement with
some fairly simple rules of thumb moving away from left-to-right matching?

For instance: it could be assumed that a regex with no wildcards of any
sort (ie just a literal string) will have relatively few matches (not
always true,but just as a rule of thumb). Then, when doing a multi-match
query, each token could be analysed and rated along lines like this:

one or more conditions of != with a string with no wildcards: HARD
one or more conditions of = with a string with no wildcards: EASY
everything else, or multiple conditions of different kinds: MIDDLING
(probably other criteria could be added)

If CQP then started matching with the leftmost EASY token (if any), or
failing that the leftmost MIDDLING token (if any), and only last resorted
to HARD tokens, this kind of slowdown could be avoided. Perhaps this could
be implemented by internally (within CQP) doing something akin to the
multi-step procedure with the first step still being a left-to-right
matching query.

Sorry if this is all a silly idea, I'm still only starting to get to grips
with CQP's internals and pondering about things as they occur to me.

----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2009-11-17 18:00

Message:
Exactly.  CQP uses a strict left-to-right matching strategy where index
lookups are only performed for the first token.  Unfortunately, CQP doesn't
have a good built-in query optimiser, and it isn't at all fun to develop
one in C. :-(

In command-line CQP, you can use a multi-step query for such cases, which
runs much faster (although [pos="PNP"] still has a lot of matches):

  A = [pos="PNP"] [word="done"];
  set A target nearest [hw = "be|have"] within left 1 word;
  delete A without target;

We thought about having special-case optimisations for this construction
in BNCweb's simple query language, but handling the multi-step queries was
just too complex to be feasible.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722306&aid=2898799&group_id=131809


More information about the CWB mailing list