[CWB] cwb-huffcode error

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Oct 20 09:06:34 CEST 2010


After a quick shufty at the code....

>>>> The minimal length = 100 and maximal length = 0 may be init values,
>>>> indicating that the main loop didn't run.

That's exactly the case.

362	  hc->min_codelen = 100;
363	  hc->max_codelen = 0;

and then later

567	  if ((hc->max_codelen == 0) && (hc->min_codelen == 100)) {
568	
569	    fprintf(stderr, "Problem: No output generated -- no items?\n");
570	    nr_codes = 0;
571	
572	  }

And the actual compression of the attribute happens in the else branch of that if.

The loop in which they're set away from their init values has this
condition:

for (i = 0; i < hc->size; i++)

hc->size in this case is 1. But there is a continue; condition mid-loop
which, if met on that single iteration, skips the rest of the loop body;
hc->min_codelen and hc->max_codelen are then never updated from their
init values, which triggers the empty-attribute detector further down
the line.

>>>> Does anybody have time to take a closer look at the source code and
>>>> figure out what the Huffman algorithm does _exactly_?

I would do, but binary compression scares me.

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
On Behalf Of Stefan Evert
Sent: 20 October 2010 07:11
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] cwb-huffcode error

Hi!

I'm afraid this may be related to a known problem for which I haven't
found a solution yet.

> I have a problem with cwb-huffcode. It works fine until I try to
> compress an attribute which has the same value for all tokens, when I
> get the following error:
> 
> $ cwb-huffcode -v -P text minisuc
> 
> COMPRESSING TOKEN STREAM of minisuc.text
> Allocated heap with 2 cells for 1 items
> 
> Minimal code length: 100
> Maximal code length:   0
> Compressed code len:          0 bits,          0 (+1) bytes
> 
> Problem: No output generated -- no items?

I've run into a complementary problem before with some very large
corpora.  The cwb-huffcode implementation limits Huffman codes to 31 bits
(or so).  While this should in theory be enough even for a type that
occurs just once in a 2-billion-word corpus (with perfect entropy
coding), the Huffman algorithm in cwb-huffcode sometimes produces longer
codes, causing the compression to fail.

I tried to fix this, but couldn't make enough sense of the
implementation to see where to make such a change and to be confident
there are no side effects (such as breaking compatibility with the
current file format).

If your problem is related to this, it will be difficult to fix.  But
perhaps it's just an edge case that isn't handled properly.  One
possibility: the main loop of the Huffman algorithm is skipped because
no merges have to be made, so it fails to generate a sensible code.  The
minimal length = 100 and maximal length = 0 may be init values,
indicating that the main loop didn't run.

On the other hand, if you just have a single type in the corpus, the
optimal code _should_ use log2(1) = 0 bits ...

Does anybody have time to take a closer look at the source code and
figure out what the Huffman algorithm does _exactly_?  Documentation of
the binary file formats would also be appreciated -- currently, the only
complete "documentation" is the implementation in the source code (very
much like Google's WebM :)

Best,
Stefan



_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb
