[CWB] Manatee
Stefan Evert
stefan.evert at uos.de
Fri Feb 2 16:38:46 CET 2007
Hi Serge & all!
On 12 Jan 2007, at 11:17, Serge HEIDEN wrote:
> I would be very interested to hear from you about the Manatee package.
> It seems to be a VERY close clone of cqp and xkwic alltogether.
> I just discovered that it is now Open Source (GPL license) and
> downloadable at http://www.textforge.cz/download.html
It's actually been available for a few months now. As far as I
know, Manatee is a direct reimplementation of CQP, with a client-
server based Tcl interface as an xkwic replacement. I suppose you're
all aware that a (presumably improved) version of Manatee is used for
corpus indexing and search in the SketchEngine?
> Has any of you experienced with it ?
I've only had a very quick look at the source code. I didn't think
that something could have less useful documentation than the CWB, but
Manatee proved me wrong ... If you dig a little deeper, you'll find
that the query language manual of Manatee is a link to a very
outdated version of the CQP Users Manual at the IMS.
I haven't tried compiling it and encoding a corpus, since there isn't
even a short readme that would explain how to do this. If anyone has
managed to get it to run and play around with it, I'd be very
interested to hear about their impressions.
> Or have infos about its corpus volumetry limits, performance, etc.
> It is written in C++, seems to have Unicode support, use 64bit file
> descriptors, on Linux and Solaris.
Yes, as far as I know this is all correct. It's definitely a much
better piece of software engineering than the CWB and has fewer built-
in limitations (unless you count having to install a suitable version
of ICU in order to compile it as a liability).
The SketchEngine seems to be all Unicode and to support corpora of up
to 2 billion words at least (I don't know if it can deal with larger
corpora, but I would be surprised if more than 4 billion words were
possible), and the demos I have seen had very impressive performance
for simple queries (looking up a word, word with wildcards or a short
phrase), though I don't know to what extent this may have been due to
result caching. It would also seem that, as queries become more
complex, performance degrades more gracefully than in CQP (where
non-trivial queries are often much slower than word lookup). I also
don't know whether the open-source version of Manatee is identical to
what the SketchEngine uses, so performance etc. might differ.
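
The 2 and 4 billion figures above happen to line up with the range of
32-bit integers. Whether Manatee or the SketchEngine actually store
corpus positions that way is an assumption, not something confirmed
here; the sketch below only spells out the arithmetic, and the function
name is made up for illustration.

```python
# Assumption (not confirmed for Manatee or the SketchEngine): each corpus
# position (token offset) is stored in a fixed-width integer.

def addressable_positions(bits: int, signed: bool) -> int:
    """Number of distinct non-negative positions a fixed-width integer can index."""
    # A signed integer loses one bit to the sign, halving the usable range.
    return 2 ** (bits - 1) if signed else 2 ** bits

if __name__ == "__main__":
    # Signed 32-bit: roughly the "2 billion words" figure.
    print(f"signed 32-bit:   {addressable_positions(32, True):,}")
    # Unsigned 32-bit: roughly the "4 billion words" ceiling.
    print(f"unsigned 32-bit: {addressable_positions(32, False):,}")
```

Under that assumption, going beyond about 4.29 billion tokens would
need wider position values.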
My two or three cents ...
:o)
Stefan