[CWB] suffix arrays based on CWB indexes

Paul Meurer Paul.Meurer at uni.no
Thu Jan 6 15:43:36 CET 2011


I have implemented a corpus engine, with an input format similar to CWB's, that uses suffix arrays, but only for string regexp matching. It should not be too complicated to use suffix arrays at token granularity as well, depending on the size of your corpus. I use a version of sary (http://sary.sourceforge.net/) which I reimplemented in Common Lisp with 64-bit support. (The original comes with 32-bit support only, which is too little for larger lexica.)
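
To illustrate what token granularity means here, the following is a minimal sketch in Python. It is not taken from Korpuskel, sary, or CWB; the function names are hypothetical, the tokens are plain strings (a real index would more likely store integer lexicon IDs), and the naive sort-based construction is only suitable for small examples. The idea is that the array holds token positions, sorted by the token sequence starting at each position, so all occurrences of an n-gram form one contiguous range that binary search can find.

```python
def build_token_suffix_array(tokens):
    """Return token positions sorted by the token sequence starting at each one.

    Naive construction for illustration only: it materialises every suffix,
    which is far too slow and memory-hungry for a real corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    """Binary search for the first array index whose n-token prefix is
    >= ngram (upper=False) or > ngram (upper=True)."""
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_ngram(tokens, sa, ngram):
    """Return the corpus positions (in corpus order) where ngram occurs."""
    ngram = list(ngram)
    start = _bound(tokens, sa, ngram, upper=False)
    end = _bound(tokens, sa, ngram, upper=True)
    return sorted(sa[start:end])


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat slept".split()
    sa = build_token_suffix_array(corpus)
    print(find_ngram(corpus, sa, ["the", "cat"]))
```

The frequency of an n-gram is simply the width of the matching range, so counts for n-grams of any length come from the same array without a separate index per n.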

In any case, if you are interested, I can send you an article draft describing my architecture.

The system, which is called Korpuskel, is still in beta (although it is used in a couple of projects in Bergen). I hope to put some corpora using it online soon.

> Is someone aware of any implementation of suffix arrays algorithms
> based on CWB indexes ?
> We plan to develop token (versus character) based n-grams of any
> length in the TXM context (http://textometrie.ens-lyon.fr/?lang=en)
> which is based on CWB.
> Milos Jakubicek said at PACLIC24 that Manatee (which could have
> some similarity of architecture with CWB) integrates suffix
> arrays, has anyone experience of that ?


Paul Meurer

Uni Computing
Allégt. 27, N-5007 Bergen, Norway
Phone +47 55 58 97 94
