[CWB] query optimization
Stefan Evert
stefan.evert at uos.de
Wed Aug 16 16:37:19 CEST 2006
Thanks for answering this while I was unavailable, Serge!
On 9 Aug 2006, at 16:31, Serge Sharoff wrote:
> Not sure if Stefan is available, of course, he has much better
> knowledge of CWB intricacies. But from what I learned during our
> May meeting in Forli, there's no possibility to correct this in the
> nearest future, as the query is essentially sequential, with a
> query converted into an FSA.
Yes, that's absolutely correct. It isn't easy to optimize FSA
execution with the help of a positional index in the first place.
While it would be a straightforward operation to reverse a FSA and
start from the last word of a match, as Lars suggested, several
infelicitous decisions in the implementation of CQP's query
evaluation (and the query language itself, for which I am not
entirely unresponsible, I'm afraid to say) make it almost impossible
to execute a CQP query "backwards". So there's little chance of such
optimizations in the near future.
I'll try to write a first "tech note" on the CWB within the next few
days, which explains the query evaluation strategies of CQP and
should shed some light on what is feasible and what isn't. But
perhaps I should first write a description of the data formats and
indexing methods used by the CWB, to give you the necessary technical
background for understanding the query evaluation stuff? (I'll post
all tech notes to the CWBdev mailing list. Perhaps we can later put
them up on the CWB homepage, after some discussion and corrections.)
> At the same because of some bug/feature in CQP the index of all
> occurrences of the first word in the query is computed twice,
> making "the" "world" a monster interms of its processing time.
I'm not sure what you're referring to here. Is that something you
(mis)understood from me during the Forlì meeting? In that case, I was
probably trying to explain that CQP rather stupidly makes a list of
all possible start positions for matches (in the example, that's the
frequency of "the" times 4 bytes for each integer start position)
before executing a query, and then even more stupidly adds a second
list of the same size as placeholders for the end positions of
matches (which will only be needed for successful matches, of
course). For this reason, the amount of temporary memory wasted to
execute the query "the" "world" is the frequency of "the" times 8
bytes per entry.
> A way to optimise this is by using the MU syntax:
> MU(meet "the" "world" 0 1)
> (32 vs. just 2.3 sec on the BNC)
I was also going to suggest that. Even though a backward evaluation
strategy might be even more efficient (MU looks up both "the" and
"world" in the index), this Boolean query implementation is very
efficient for such simple queries. The only drawback is that the
matches of your query will span only a single token (the leftmost
token specified in the query, i.e. occurrences of "the" in this
case). If you want to find occurrences of "world" preceded by a
"the", you could rewrite the query as MU(meet "world" "the" -1 -1).
You could also some "set target" magic to expand the matches to
include both words (depending on your query this might also work for
variable distances): set Last match nearest "the" within left 1 word
from match;
One of my plans for the future of CQP is to extend the MU query
mechanism considerably (and e.g. allow to set targets on individual
terms in the query), so that it can be used for many basic phrase and
proximity queries, where it will be drastically faster than regular
CQP queries.
My impression is that Manatee manages to degrade (more or less)
gracefully from fast Boolean queries to increasingly complex regular
expressions (which are probably evaluated by backtracking or FSA
simulation when they reach a certain complexity).
That much for now, hope to write first tech notes soon,
Stefan
More information about the CWB
mailing list