[CWB] suffix arrays based on CWB indexes

Fri Jan 7 18:48:08 CET 2011

Serge,

> Thank you very much for the interesting reference.
> I would be very interested to read your architecture.

Thank you for your interest. You can download my article from here:

	http://maximos.aksis.uib.no/~paul/articles/Korpuskel-draft.pdf

> In the CWB context, I thought that the native (binary)
> indexes being geared toward FSM resolution - if I remember
> correctly CQL expressions are compiled in FSMs - could
> also maybe help for regexp resolution on ngrams through
> suffix arrays, or not. It must be the case that specific
> binary indexes have to be built anyway.

Yes, a specific index has to be built: the suffix array. The CWB binary files won't help you, as far as I can see. But you can use the CWB vertical file directly as the file you generate the suffix array for, provided it contains the word tokens as its only attribute. If you then index the beginning of every line, you will be able to look up arbitrary n-grams very easily.
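
The line-start indexing described above could be sketched roughly as follows. This is only an illustration in Python, not Sary's API and not my Lisp code: the function names are made up, the suffix array is built by naive sorting (fine for a toy example, far too slow for a real corpus), and offsets are character offsets in a string rather than byte offsets in a file. It assumes the vertical file has one token per line and ends with a newline.

```python
def line_start_offsets(text):
    """Offsets of the first character of every line (one token per line)."""
    offsets = [0] if text else []
    for i, ch in enumerate(text):
        if ch == "\n" and i + 1 < len(text):
            offsets.append(i + 1)
    return offsets


def build_suffix_array(text, offsets):
    """Sort the line-start offsets by the text that follows them (naive)."""
    return sorted(offsets, key=lambda off: text[off:])


def lookup_ngram(text, sa, tokens):
    """Return the sorted offsets at which the token n-gram starts."""
    # An n-gram is a run of lines, so it is searched as "tok1\ntok2\n...".
    pattern = "".join(tok + "\n" for tok in tokens)
    n = len(pattern)

    def bound(after_equal):
        # Binary search over suffixes truncated to the pattern length.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + n]
            if prefix < pattern or (after_equal and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return sorted(sa[bound(False):bound(True)])


# Hypothetical toy corpus in vertical format.
corpus = "the\ncat\nsat\non\nthe\nmat\nthe\ncat\n"
sa = build_suffix_array(corpus, line_start_offsets(corpus))
hits = lookup_ngram(corpus, sa, ["the", "cat"])
```

Because every token is terminated by a newline in the search pattern, "the" does not match "then", and an n-gram of any length is just a longer pattern over the same index.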

> We will look
> at Sary thank you,

Sary is very easy to use, and the code is well documented. But as I said, Sary only supports 32-bit integers, so your vertical file cannot be bigger than 2^32 bytes ≈ 4.3 GB. This is quite a lot, though, perhaps enough for a 500-million-word corpus. It shouldn't be a big deal to make Sary support 64-bit integers; I tried a bit, but not too hard, and did not succeed.
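
To make the size estimate concrete, here is a back-of-the-envelope calculation in Python. It takes the 2^32-byte limit stated above at face value; the average token length of 7 bytes is purely an assumption for illustration and will differ by language and encoding.

```python
MAX_FILE_BYTES = 2 ** 32  # largest file addressable with 32-bit offsets


def max_tokens(avg_token_bytes):
    """Tokens that fit in the vertical file; each line adds one newline byte."""
    return MAX_FILE_BYTES // (avg_token_bytes + 1)


# Assumed average of 7 bytes per token (hypothetical figure).
estimate = max_tokens(7)
```

With that assumption the limit works out to a little over 500 million tokens; longer average tokens or multi-byte characters lower it accordingly.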

> and your Common Lisp if it is available - and written in CLOS

My code is less well documented. It is only a partial implementation of Sary (as much as I needed), with regexp support for suffix arrays added. Anyway, here it is, although I doubt it will be very useful for you, since ...

> (we will code in Java ;-)

	http://maximos.aksis.uib.no/~paul/code/sary.zip

Best,
Paul

> 
> Best,
> Serge
> 
> On 06/01/2011 at 15:43, Paul Meurer wrote:
>> Serge,
>> 
>> I have done an implementation of a corpus engine with an input format
>> similar to CWB's that uses suffix arrays, but only for string regexp
>> matching. It should not be too complicated to use suffix arrays also
>> with token granularity, depending on the size of your corpus. I use a
>> version of sary (http://sary.sourceforge.net/) which I reimplemented in
>> Common Lisp with 64bit support. (It comes only with 32bit support, which
>> is too little for larger lexica.)
>> 
>> In any case, if you are interested, I can send you an article draft
>> describing my architecture.
>> 
>> The system is still in beta stage (although it is used in a couple of
>> projects in Bergen). Hopefully soon, I will put online some corpora
>> using this system, which is called Korpuskel.
>> 
>>> Is anyone aware of any implementation of suffix array algorithms
>>> based on CWB indexes?
>>> We plan to develop token (versus character) based n-grams of any
>>> length in the TXM context (http://textometrie.ens-lyon.fr/?lang=en)
>>> which is based on CWB.
>>> Milos Jakubicek said at PACLIC24 that Manatee (which could have
>>> some similarity of architecture with CWB) integrates suffix
>>> arrays. Does anyone have experience with that?
>> 
>> Best,
>> Paul
>> 
>> --
>> Paul Meurer
>> 
>> Uni Computing
>> Allégt. 27, N-5007 Bergen, Norway
>> Phone +47 55 58 97 94
>> http://uni.no/digital
>> 
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> 
> -- 
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
> ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

-- 
Paul