[CWB] suffix arrays based on CWB indexes

Stefan Evert stefanML at collocations.de
Tue Jan 11 11:54:16 CET 2011


> Thank you for your interest. You can download my article from here:
> 
> 	http://maximos.aksis.uib.no/~paul/articles/Korpuskel-draft.pdf

Cool, this is very interesting! Thanks.

> Yes, a specific index has to be built - the suffix array. The CWB binary files won't help you, as far as I can see. But you can use the CWB vertical file right away as the file you generate the suffix array for, if it contains the word tokens as only attribute. If you then index the beginning of every line, you will be able to look up arbitrary n-grams very easily. 

I think your best bet will be to find (or implement) a suffix array library that accepts 32-bit integers as symbols and then run it on the internal numeric ID representation of CWB attributes -- this should be the most efficient way to generate a suffix array over token sequences.
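
To make the idea concrete, here is a minimal sketch of a suffix array built over integer token IDs rather than bytes. It does not touch any real CWB or Sary API: the ID list is made up, the function names `build_suffix_array` and `find_ngram` are hypothetical, and the naive sort-based construction only suits small inputs (a real library would use a linear-time or near-linear-time algorithm).

```python
from typing import List, Sequence


def build_suffix_array(ids: Sequence[int]) -> List[int]:
    """Return token positions sorted by the ID sequence starting at each one.

    Naive construction for illustration only: it compares whole suffixes.
    """
    return sorted(range(len(ids)), key=lambda i: tuple(ids[i:]))


def find_ngram(ids: Sequence[int], sa: Sequence[int], ngram: Sequence[int]) -> List[int]:
    """Return all token positions where `ngram` (a sequence of IDs) occurs."""
    n = len(ngram)
    target = tuple(ngram)

    def prefix(k: int) -> tuple:
        # First n IDs of the k-th suffix in sorted order.
        return tuple(ids[sa[k]:sa[k] + n])

    # Binary search for the first suffix whose prefix is >= target.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Binary search for the first suffix whose prefix is > target.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= target:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


# Made-up ID stream standing in for an exported attribute (e.g. word forms).
example_ids = [0, 1, 2, 0, 1, 3, 0, 1, 2]
example_sa = build_suffix_array(example_ids)
bigram_hits = find_ngram(example_ids, example_sa, [0, 1])
trigram_hits = find_ngram(example_ids, example_sa, [0, 1, 2])
```

Because every entry in the array is a token position, each hit is a corpus position that can be used directly, with no conversion from byte offsets.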

While in principle it might be possible to interface a suffix array library written in C/C++ directly with the CWB data structures, you can easily export the ID stream, e.g. through the Perl/CWB API or using the CQi interface.

> Sary is very easy to use, and the code is well-documented. But as I said, Sary only supports 32bit integers, so your vertical file cannot be bigger than 2^32 byte = 4.3GB. This is quite a lot, though, perhaps enough for a 500 million word corpus.

If I understand you correctly, this would imply that Sary is run on the CWB data file as a byte stream, which doesn't seem to make much sense because (a) this only works with uncompressed data files (that do not store tokens in Huffman coding) and (b) it would include many spurious suffixes that do not start on an integer boundary (so each such misaligned 4-byte sequence would show up as a random nonsense value in the array).
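
Point (b) can be illustrated with a small sketch. It assumes that an uncompressed data file is simply a sequence of 4-byte big-endian integers, one per token; that layout is an assumption made for the example, and the IDs and the helper names `pack_ids` and `read_int_at` are hypothetical.

```python
import struct
from typing import List, Sequence


def pack_ids(ids: Sequence[int]) -> bytes:
    """Serialise token IDs as 4-byte big-endian integers (assumed layout)."""
    return struct.pack(f">{len(ids)}i", *ids)


def read_int_at(data: bytes, offset: int) -> int:
    """Interpret the 4 bytes starting at `offset` as a big-endian integer."""
    return struct.unpack_from(">i", data, offset)[0]


ids = [5, 258, 5, 258]
data = pack_ids(ids)

# A byte-level suffix array has one entry per byte offset ...
byte_offsets: List[int] = list(range(len(data)))
# ... but only every fourth offset starts on a token boundary.
aligned_offsets = [p for p in byte_offsets if p % 4 == 0]

# Reading at an aligned offset yields a real token ID; reading one byte
# further mixes the tail of one ID with the head of the next.
aligned_value = read_int_at(data, 0)
misaligned_value = read_int_at(data, 1)
```

Three out of four entries in a byte-level suffix array would therefore point into the middle of a token ID, and they would have to be filtered out after every lookup.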

If you can convince Sary to work on full 32-bit integers as symbols, then 32-bit offsets should at least be able to handle a corpus of 2^31 (about 2.1 billion) tokens -- which is the design limit of CWB corpora anyway.
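
The size limits discussed here can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes 4 bytes per token ID and 4 bytes per suffix array entry, and the variable names are made up for the example.

```python
BYTES_PER_ID = 4       # assumed: one 32-bit integer per token
BYTES_PER_OFFSET = 4   # assumed: one 32-bit offset per suffix array entry

# Byte-stream indexing with unsigned 32-bit offsets: at most 2^32 bytes of input.
max_input_bytes = 2**32
# If that input is a stream of 4-byte IDs, it holds this many tokens.
tokens_in_byte_stream = max_input_bytes // BYTES_PER_ID

# Indexing integer symbols directly: even signed 32-bit offsets address
# 2^31 tokens.
max_tokens_as_symbols = 2**31

# Storage needed at that limit, before any working memory for sorting.
id_stream_bytes = max_tokens_as_symbols * BYTES_PER_ID
suffix_array_bytes = max_tokens_as_symbols * BYTES_PER_OFFSET

print(f"byte-stream input limit : {max_input_bytes:,} bytes")
print(f"tokens in that stream   : {tokens_in_byte_stream:,}")
print(f"tokens as int symbols   : {max_tokens_as_symbols:,}")
print(f"ID stream at the limit  : {id_stream_bytes / 2**30:.0f} GiB")
print(f"suffix array at limit   : {suffix_array_bytes / 2**30:.0f} GiB")
```

Under these assumptions, treating tokens as symbols doubles the number of addressable tokens compared with byte offsets into a 4-byte ID stream, and all of them are valid suffix starts.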


Best wishes,
Stefan



