[CWB] Manatee

Stefan Evert stefan.evert at uos.de
Fri Feb 2 16:38:46 CET 2007


Hi Serge & all!

On 12 Jan 2007, at 11:17, Serge HEIDEN wrote:

> I would be very interested to hear from you about the Manatee package.
> It seems to be a VERY close clone of cqp and xkwic altogether.
> I just discovered that it is now Open Source (GPL license) and
> downloadable at http://www.textforge.cz/download.html

It has actually been available for a few months now.  As far as I
know, Manatee is a direct reimplementation of CQP, with a
client-server Tcl interface as an xkwic replacement.  I suppose you're
all aware that a (presumably improved) version of Manatee is used for
corpus indexing and search in the SketchEngine?

> Has any of you had experience with it?

I've only had a very quick look at the source code.  I didn't think  
that something could have less useful documentation than the CWB, but  
Manatee proved me wrong ...  If you dig a little deeper, you'll find  
that the query language manual of Manatee is a link to a very  
outdated version of the CQP Users Manual at the IMS.

I haven't tried compiling it and encoding a corpus, since there isn't
even a short readme that would explain how to do this.  If anyone has
managed to get it running and has played around with it, I'd be very
interested to hear their impressions.

> Or have information about its corpus size limits, performance, etc.
> It is written in C++, seems to have Unicode support, and uses 64-bit
> file descriptors, on Linux and Solaris.

Yes, as far as I know this is all correct.  It's definitely a much  
better piece of software engineering than the CWB and has fewer built- 
in limitations (unless you count having to install a suitable version  
of ICU in order to compile it as a liability).

The SketchEngine seems to be all Unicode and to support corpora of at
least 2 billion words (I don't know if it can deal with larger
corpora, but I would be surprised if more than 4 billion words were
possible).  The demos I have seen had very impressive performance
for simple queries (looking up a word, a word with wildcards or a short
phrase), though I don't know to what extent this may have been due to
result caching.  It would also seem that performance degrades more
gracefully than in CQP as queries become more complex (in CQP,
non-trivial queries are often much slower than word lookup).  I also
don't know whether the open-source version of Manatee is identical to
what the SketchEngine uses, so performance etc. might differ.
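
To make the 2 and 4 billion figures concrete: they sit close to the
limits of a 32-bit position index.  This is only an assumption for
illustration (nothing here is known about how Manatee or the
SketchEngine actually store corpus positions), and the function names
below are made up.  A minimal Python sketch of the arithmetic:

```python
# ASSUMPTION: corpus positions (token offsets 0..N-1) are stored as
# fixed-width integers. This illustrates the arithmetic only; it does
# not describe Manatee's or the SketchEngine's real data structures.

def max_tokens(bits: int = 32, signed: bool = True) -> int:
    """Number of distinct non-negative positions an index of this width can address."""
    # A signed integer loses one bit to the sign, halving the usable range.
    return 2 ** (bits - 1) if signed else 2 ** bits


def fits(corpus_size: int, bits: int = 32, signed: bool = True) -> bool:
    """True if a corpus of `corpus_size` tokens can be indexed at this width."""
    return corpus_size <= max_tokens(bits, signed)


if __name__ == "__main__":
    print(max_tokens(32, signed=True))    # roughly 2.1 billion positions
    print(max_tokens(32, signed=False))   # roughly 4.3 billion positions
    print(fits(3_000_000_000, signed=True))
    print(fits(3_000_000_000, signed=False))
```

Under that assumption, a signed 32-bit index tops out just above 2
billion tokens and an unsigned one just above 4 billion; anything
larger would need a wider (e.g. 64-bit) position type.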

My two or three cents ...
:o)
Stefan

