[CWB] cwb-huffcode error

Stefan Evert stefanML at COLLOCATIONS.DE
Wed Oct 20 08:11:17 CEST 2010


Hi!

I'm afraid this may be related to a known problem for which I haven't found a solution yet.

> I have a problem with cwb-huffcode. It works fine until I try to compress an attribute which has the same value for all tokens, when I get the following error:
> 
> $ cwb-huffcode -v -P text minisuc
> 
> COMPRESSING TOKEN STREAM of minisuc.text
> Allocated heap with 2 cells for 1 items
> 
> Minimal code length: 100
> Maximal code length:   0
> Compressed code len:          0 bits,          0 (+1) bytes
> 
> Problem: No output generated -- no items?

I've run into a complementary problem before with some very large corpora.  The cwb-huffode implementation limits Huffman codes to 31 bits (or so).  While this should in theory be enough even for a type that occurs just once in a 2-billion-word corpus (with perfect entropy coding), the Huffman algorithm in cwb-huffcode sometimes produces longer codes, causing the compression to fail.

I tried to fix this, but couldn't make enough sense of the implementation to see where to make such a change and to be confident there are no side effects (such as breaking compatibility with the current file format).

If your problem is related to this, it will be difficult to fix.  But perhaps it's just an edge case that isn't handled properly.  One possibility: the main loop of the Huffman algorithm is skipped because no merges have to be made, so it fails to generate a sensible code.  The minimal length = 100 and maximal length = 0 may be init values, indicating that the main loop didn't run.

On the other hand, if you just have a single type in the corpus, the optimal code _should_ use log2(1) = 0 bits ...

Does anybody have time to take a closer look at the source code and figure out what the Huffman algorithm does _exactly_?  Documentation of the binary file formats would also be appreciated -- currently, the only complete "documentation" is the implementation in the source code (very much like Google's WebM :)

Best,
Stefan





More information about the CWB mailing list