[CWB] query optimization

Wed Aug 16 16:37:19 CEST 2006

Thanks for answering this while I was unavailable, Serge!

On 9 Aug 2006, at 16:31, Serge Sharoff wrote:

> Not sure if Stefan is available, of course, he has much better  
> knowledge of CWB intricacies.  But from what I learned during our  
> May meeting in Forli, there's no possibility to correct this in the  
> nearest future, as the query is essentially sequential, with a  
> query converted into an FSA.

Yes, that's absolutely correct. It isn't easy to optimize FSA  
execution with the help of a positional index in the first place.   
While it would be a straightforward operation to reverse a FSA and  
start from the last word of a match, as Lars suggested, several  
infelicitous decisions in the implementation of CQP's query  
evaluation (and the query language itself, for which I am not  
entirely unresponsible, I'm afraid to say) make it almost impossible  
to execute a CQP query "backwards". So there's little chance of such  
optimizations in the near future.

I'll try to write a first "tech note" on the CWB within the next few  
days, which explains the query evaluation strategies of CQP and  
should shed some light on what is feasible and what isn't.  But  
perhaps I should first write a description of the data formats and  
indexing methods used by the CWB, to give you the necessary technical  
background for understanding the query evaluation stuff?  (I'll post  
all tech notes to the CWBdev mailing list. Perhaps we can later put  
them up on the CWB homepage, after some discussion and corrections.)

>   At the same because of some bug/feature in CQP the index of all  
> occurrences of the first word in the query is computed twice,  
> making "the" "world" a monster interms of its processing time.

I'm not sure what you're referring to here. Is that something you  
(mis)understood from me during the Forlì meeting? In that case, I was  
probably trying to explain that CQP rather stupidly makes a list of  
all possible start positions for matches (in the example, that's the  
frequency of "the" times 4 bytes for each integer start position)  
before executing a query, and then even more stupidly adds a second  
list of the same size as placeholders for the end positions of  
matches (which will only be needed for successful matches, of  
course).  For this reason, the amount of temporary memory wasted to  
execute the query "the" "world" is the frequency of "the" times 8  
bytes per entry.

>   A way to optimise this is by using the MU syntax:
> MU(meet "the" "world" 0 1)
> (32 vs. just 2.3 sec on the BNC)

I was also going to suggest that. Even though a backward evaluation  
strategy might be even more efficient (MU looks up both "the" and  
"world" in the index), this Boolean query implementation is very  
efficient for such simple queries.  The only drawback is that the  
matches of your query will span only a single token (the leftmost  
token specified in the query, i.e. occurrences of "the" in this  
case).  If you want to find occurrences of "world" preceded by a  
"the", you could rewrite the query as MU(meet "world" "the" -1 -1).   
You could also some "set target" magic to expand the matches to  
include both words (depending on your query this might also work for  
variable distances): set Last match nearest "the" within left 1 word  
from match;

One of my plans for the future of CQP is to extend the MU query  
mechanism considerably (and e.g. allow to set targets on individual  
terms in the query), so that it can be used for many basic phrase and  
proximity queries, where it will be drastically faster than regular  
CQP queries.

My impression is that Manatee manages to degrade (more or less)  
gracefully from fast Boolean queries to increasingly complex regular  
expressions (which are probably evaluated by backtracking or FSA  
simulation when they reach a certain complexity).

That much for now, hope to write first tech notes soon,
Stefan