[CWB] Huffman compression for very large corpora (ca. 2 billion words)

Stefan Evert stefanML at collocations.de
Sat Jan 14 14:18:17 CET 2012


Dear all, but Andrew in particular,

as you may have seen from the bugtracker updates, I've just done something very naughty: I added a patch to cwb-huffcode that changes the way optimal Huffman codes are computed, in order to avoid failures that occasionally happen for very large corpora (in particular, my slightly truncated version of UKWAC, which contains 2.147 billion tokens and falls only 1289 tokens short of CWB's design limit :).

The underlying cause was that cwb-huffcode would sometimes generate code words with a length of 32 bits under these circumstances, but the CWB internals _and_ the binary index file format limit code words to a maximum of 31 bits.  My solution is to add one to all frequency counts before the Huffman codes are computed, which reduces the theoretical optimal code word length for hapax legomena from 31 bits to 30 bits and thus should ensure that oversized code words are never generated.
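To illustrate the effect (just a toy sketch, NOT the actual cwb-huffcode implementation -- all names below are made up for the demo): Fibonacci-like frequency distributions are the worst case for Huffman code length, since they force a maximally skewed tree, and adding one to every count flattens exactly those pathological distributions.  The little C program below computes Huffman code lengths by the usual merge-the-two-lightest procedure and prints the maximum length before and after the add-one adjustment.

#include <stdio.h>

#define MAXSYM 64

/* Toy Huffman-length computation: repeatedly merge the two lightest
 * subtrees; every leaf inside a merged subtree gains one code bit.
 * O(n^2) selection is fine for a demo of this size. */
static int max_code_length(const long long *freq, int n) {
    long long w[MAXSYM];   /* subtree weights; -1 = merged away */
    int len[MAXSYM];       /* code length accumulated per leaf */
    int owner[MAXSYM];     /* subtree that each leaf currently belongs to */

    for (int i = 0; i < n; i++) {
        w[i] = freq[i];
        len[i] = 0;
        owner[i] = i;
    }
    for (int alive = n; alive > 1; alive--) {
        int a = -1, b = -1;  /* indices of the two lightest subtrees */
        for (int i = 0; i < n; i++) {
            if (w[i] < 0)
                continue;
            if (a < 0 || w[i] < w[a]) { b = a; a = i; }
            else if (b < 0 || w[i] < w[b]) { b = i; }
        }
        for (int i = 0; i < n; i++)  /* leaves of both subtrees gain a bit */
            if (owner[i] == a || owner[i] == b) { len[i]++; owner[i] = a; }
        w[a] += w[b];
        w[b] = -1;
    }
    int max = 0;
    for (int i = 0; i < n; i++)
        if (len[i] > max) max = len[i];
    return max;
}

int main(void) {
    /* Fibonacci-like counts are how over-long code words arise */
    long long f[] = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55};
    int n = sizeof f / sizeof f[0];
    long long g[MAXSYM];

    for (int i = 0; i < n; i++)
        g[i] = f[i] + 1;   /* the add-one adjustment from the patch */

    printf("raw counts:     max code length = %d\n", max_code_length(f, n));
    printf("add-one counts: max code length = %d\n", max_code_length(g, n));
    return 0;
}

On this toy distribution the maximum code length drops from 9 bits to 6 bits; for realistic corpus sizes the adjustment only shaves a bit or so off the worst case, which is consistent with the 31-to-30-bit figure above.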

While this changes the Huffman codes computed by cwb-huffcode, it does _not_ affect the file formats, and indexed CWB corpora should be fully compatible between older and newer versions.  Optimality of the encoding should also not be affected substantially.

However, I don't understand the algorithm well enough to be entirely sure that my change doesn't have any unexpected side effects -- so please, please, please give the new cwb-huffcode a thorough round of testing if you're working with such large corpora!

I've checked the patch into the sf.net SVN in branches/3.0 and trunk/, bumping the version number of the latter to 3.4.2 in order to facilitate compatibility testing.

Have a nice weekend everyone!
Stefan


