[CWB] suffix arrays based on CWB indexes

Tue Jan 11 22:45:40 CET 2011

Hi Stefan,

>> Thank you for your interest. You can download my article from here:
>> 
>> 	http://maximos.aksis.uib.no/~paul/articles/Korpuskel-draft.pdf
> 
> Cool, this is very interesting! Thanks.

Nice to hear. Comments are welcome! (You remember that I contacted you briefly more than a year ago; this article is the outcome of what I had intended back then.)

>> Yes, a specific index has to be built - the suffix array. The CWB binary files won't help you, as far as I can see. But you can use the CWB vertical file right away as the file you generate the suffix array for, if it contains the word tokens as only attribute. If you then index the beginning of every line, you will be able to look up arbitrary n-grams very easily. 
> 
> I think your best bet will be to find (or implement) a suffix array library that accepts 32-bit integers as symbols and then run it on the internal numeric ID representation of CWB attributes -- this should be the most efficient way to generate a suffix array over token sequences.

So you would work with an alphabet of lexicon size -- interesting idea, haven't thought of that. That should work; sary supports UTF-8, my port does UTF-8 and UCS-2, and 32-bit integers is not far from that.

> While in principle it might be possible to interface a suffix array library written in C/C++ with directly with the CWB data structures, you can easily export the ID stream e.g. through the Perl/CWB API or using the CQi interface.
> 
>> Sary is very easy to use, and the code is well-documented. But as I said, Sary only supports 32bit integers, so your vertical file cannot be bigger than 2^32 byte = 4.3GB. This is quite a lot, though, perhaps enough for a 500 million word corpus.
> 
> If I understand your correctly, this would imply that Sary is run on the CWB data file as a byte stream, which doesn't seem to make much sense because (a) this only works with uncompressed data files (that do not store tokens in Huffman coding) and (b) it would include many spurious suffixes that do not start on an integer boundary (so each 4-byte sequence would produce a random nonsense value in this array).

In sary, you can set the index points, e.g. to every 4th byte. So that should work. My idea was to use the vertical file and put an index point at each line start, but your idea is clearly better.

> If you can convince Sary to work on full 32-bit integers as symbols, then 32-bit offsets should at least be able to handle a corpus of 2^31 = 2 billion tokens -- which is the design limit of CWB corpora anyway.

Makes perfect sense! I think I'll try that out.

Viele Grüße,
Paul