[CWB] [ cwb-Bugs-2929062 ] cwb-huffcode may fail because code length limit is exceeded

Sat Jan 14 12:12:22 CET 2012

Bugs item #2929062, was opened at 2010-01-09 17:40
Message generated for change (Comment added) made by schtepf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=2929062&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Command-line utilities
>Group: TODO-3.5
Status: Open
Resolution: None
>Priority: 6
Private: No
Submitted By: Stefan Evert (schtepf)
Assigned to: Stefan Evert (schtepf)
Summary: cwb-huffcode may fail because code length limit is exceeded

Initial Comment:
Huffman codes may exceed maximal allowed code length of 31 bits in some extreme cases, which was not checked in cwb-huffcode prior to January 2010 and could lead to buffer overflows and segmentation faults.  The program now aborts with an error message, but the underlying problem has not been fixed yet.

It is NOT POSSIBLE just to increase MAXCODELEN<cl/attributes.h>, as this changes the index file format and breaks compatibility with previously encoded corpora.  The only good solution is to patch the Huffman code generation so that it computes a suboptimal code in this cases that stays within the allowed limit, but for this somebody needs to go through and understand the entire code in compute_code_lengths()<utils/cwb-huffcode.c>.

This bug does not have a very high priority, as it seems to happen only under very extreme circumstances.  So far, it has been noticed only once for a 2.1 billion word corpus (very close to the CWB size limit!) and an attribute with highly skewed distribution (dependency offset pointers).

----------------------------------------------------------------------

>Comment By: Stefan Evert (schtepf)
Date: 2012-01-14 03:12

Message:
Update: problem also occurred with a lexical attribute in the same 2.1
billion word corpus. The problem seems to be that hapax legomena in such a
corpus have a theoretical optimal code length of approx. 31 bits, but the
Huffman tree building algorithm will occasionally end up with a few 32 bit
codes.

A short-term patch that does not break compatibility with the binary file
format is therefore needed if we want to have full support for corpus sizes
up to the 2G limit in release 3.5. 

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=2929062&group_id=131809