[CWB] query optimization
Stefan Evert
stefan.evert at uos.de
Wed Aug 16 16:45:37 CEST 2006
As I already mentioned in my previous e-mail ...
> OK. But perhaps it is possible to reverse it? I.e. if the last
> token is less frequent, run the query execution backwards (in
> therms of the sequence in the corpus). Of course, that might be
> utter nonsense for someone who knows the internals.
... you're completely right. In principle, it would be easy to
reverse the FSA generated from a regular expression query and
evaluate the query "backwards" if you find that the last term is very
infrequent. However, this would just catch one more special case -
think about queries like the following:
"the" "blithe" [pos="NN.*"]
Here, both the first and the last term match very many tokens. In
order to evaluate this query efficiently using a positional index,
one would have to start evaluation at "blithe" and run the FSA in
both directions from there. Slightly more complex queries may even
require multiple start positions (think disjunctions, optionality,
Kleene star), which might turn out to belong to the same overall
match or different matches. I guess it's obvious now that this
approach becomes very complicated, and might make a good topic for a
master's thesis ...
I suspect that a better approach towards optimization would be to
translate the regexp-like CQP query into a conjunction of constraints
(anybody know TIGERSearch?) and then try to optimize evaluation of
those constraints as you would optimize a SQL query. Still very
challenging, though, unless you just want to optimize the simple
queries that 95% of Google users type in.
Perhaps more on ideas for query optimization in a future tech note -
and I hope that you'll chime in with your ideas when I've explained
the query evaluation mechanism at last.
All the best,
Stefan
More information about the CWB
mailing list