[CWB] query optimization

Wed Aug 16 16:45:37 CEST 2006

As I already mentioned in my previous e-mail ...

> OK. But perhaps it is possible to reverse it? I.e. if the last  
> token is less frequent, run the query execution backwards (in  
> therms of the sequence in the corpus). Of course, that might be  
> utter nonsense for someone who knows the internals.

... you're completely right.  In principle, it would be easy to  
reverse the FSA generated from a regular expression query and  
evaluate the query "backwards" if you find that the last term is very  
infrequent.  However, this would just catch one more special case -  
think about queries like the following:

   "the" "blithe" [pos="NN.*"]

Here, both the first and the last term match very many tokens. In  
order to evaluate this query efficiently using a positional index,  
one would have to start evaluation at "blithe" and run the FSA in  
both directions from there.  Slightly more complex queries may even  
require multiple start positions (think disjunctions, optionality,  
Kleene star), which might turn out to belong to the same overall  
match or different matches.  I guess it's obvious now that this  
approach becomes very complicated, and might make a good topic for a  
master's thesis ...

I suspect that a better approach towards optimization would be to  
translate the regexp-like CQP query into a conjunction of constraints  
(anybody know TIGERSearch?) and then try to optimize evaluation of  
those constraints as you would optimize a SQL query.  Still very  
challenging, though, unless you just want to optimize the simple  
queries that 95% of Google users type in.

Perhaps more on ideas for query optimization in a future tech note -  
and I hope that you'll chime in with your ideas when I've explained  
the query evaluation mechanism at last.

All the best,
Stefan