[CWB] suffix arrays based on CWB indexes

Serge Heiden slh at ens-lyon.fr
Fri Jan 7 13:56:09 CET 2011


Paul,

Thank you very much for the interesting reference.
I would be very interested to read the draft describing your architecture.

In the CWB context, I thought that the native (binary)
indexes, being geared toward FSM resolution (if I remember
correctly, CQL expressions are compiled into FSMs), might
also help with regexp resolution on n-grams through
suffix arrays, or perhaps not. It must be the case that
specific binary indexes have to be built anyway. We will
look at Sary, thank you, and at your Common Lisp code if
it is available and written in CLOS (we will code in
Java ;-)
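
To make the token-granularity idea concrete, here is a minimal Java
sketch of a suffix array built over token ids rather than characters.
It assumes the corpus is already available as an int array with one
lexicon id per corpus position (loading it from CWB index files is not
shown), and the class and method names are hypothetical. The naive
comparison sort is only meant for illustration; a large corpus would
need a dedicated construction algorithm and 64-bit offsets.

```java
import java.util.Arrays;

// Illustrative sketch only: names are hypothetical, not part of CWB or TXM.
final class TokenSuffixArray {
    private final int[] corpus;  // token ids, one per corpus position
    private final Integer[] sa;  // start positions, sorted by the token sequence that follows

    TokenSuffixArray(int[] corpus) {
        this.corpus = corpus.clone();
        this.sa = new Integer[corpus.length];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive construction: comparison sort over suffixes.
        Arrays.sort(sa, this::compareSuffixes);
    }

    // Lexicographic order on token ids; a shorter suffix sorts before its extensions.
    private int compareSuffixes(int a, int b) {
        while (a < corpus.length && b < corpus.length) {
            if (corpus[a] != corpus[b]) return Integer.compare(corpus[a], corpus[b]);
            a++;
            b++;
        }
        return Integer.compare(b, a);
    }

    // Returns 0 when the n-gram is a prefix of the suffix starting at pos.
    private int compareToNgram(int pos, int[] ngram) {
        for (int k = 0; k < ngram.length; k++) {
            if (pos + k >= corpus.length) return -1; // suffix ends first
            int c = Integer.compare(corpus[pos + k], ngram[k]);
            if (c != 0) return c;
        }
        return 0;
    }

    // Binary search for the first index whose suffix is >= (or > if upper) the n-gram.
    private int bound(int[] ngram, boolean upper) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareToNgram(sa[mid], ngram);
            if (c < 0 || (upper && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Number of occurrences of the token n-gram in the corpus.
    int count(int[] ngram) {
        return bound(ngram, true) - bound(ngram, false);
    }

    // Corpus positions where the n-gram starts, in ascending order.
    int[] positions(int[] ngram) {
        int from = bound(ngram, false), to = bound(ngram, true);
        int[] out = new int[to - from];
        for (int i = from; i < to; i++) out[i - from] = sa[i];
        Arrays.sort(out);
        return out;
    }
}
```

For example, with the ids 0 1 2 3 0 1 standing for "the cat sat on the
cat", count(new int[]{0, 1}) finds the two occurrences of "the cat",
and positions(new int[]{0, 1}) gives their start positions. All
n-grams of any length sharing a prefix sit in one contiguous range of
the array, which is what makes the structure attractive for n-gram
queries.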

Best,
Serge

On 06/01/2011 15:43, Paul Meurer wrote:
> Serge,
>
> I have done an implementation of a corpus engine with an input format
> similar to CWB's that uses suffix arrays, but only for string regexp
> matching. It should not be too complicated to use suffix arrays also
> with token granularity, depending on the size of your corpus. I use a
> version of sary (http://sary.sourceforge.net/) which I reimplemented in
> Common Lisp with 64-bit support. (It comes only with 32-bit support, which
> is too little for larger lexica.)
>
> In any case, if you are interested, I can send you an article draft
> describing my architecture.
>
> The system, which is called Korpuskel, is still in beta stage (although
> it is used in a couple of projects in Bergen). I hope to put some corpora
> online soon using it.
>
>> Is anyone aware of an implementation of suffix array algorithms
>> based on CWB indexes?
>> We plan to develop token-based (versus character-based) n-grams of
>> any length in the TXM context (http://textometrie.ens-lyon.fr/?lang=en),
>> which is based on CWB.
>> Milos Jakubicek said at PACLIC24 that Manatee (which may have
>> some architectural similarity with CWB) integrates suffix
>> arrays. Does anyone have experience with that?
>
> Best,
> Paul
>
> --
> Paul Meurer
>
> Uni Computing
> Allégt. 27, N-5007 Bergen, Norway
> Phone +47 55 58 97 94
> http://uni.no <http://uni.no/>/digital

-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883