[CWB] [ cwb-Bugs-2929062 ] cwb-huffcode may fail because code length limit is exceeded

SourceForge.net noreply at sourceforge.net
Mon Aug 1 01:03:39 CEST 2011


Bugs item #2929062, was opened at 2010-01-10 01:40
Message generated for change (Settings changed) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=2929062&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Command-line utilities
>Group: TODO-4.0
Status: Open
Resolution: None
Priority: 4
Private: No
Submitted By: Stefan Evert (schtepf)
>Assigned to: Stefan Evert (schtepf)
Summary: cwb-huffcode may fail because code length limit is exceeded

Initial Comment:
Huffman codes may exceed the maximum allowed code length of 31 bits in some extreme cases. This was not checked in cwb-huffcode prior to January 2010 and could lead to buffer overflows and segmentation faults.  The program now aborts with an error message, but the underlying problem has not been fixed yet.

It is NOT POSSIBLE simply to increase MAXCODELEN <cl/attributes.h>, as this changes the index file format and breaks compatibility with previously encoded corpora.  The only good solution is to patch the Huffman code generation so that in these cases it computes a suboptimal code that stays within the allowed limit, but for this somebody needs to go through and understand the entire code in compute_code_lengths() <utils/cwb-huffcode.c>.
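
To illustrate what such a "suboptimal but bounded" code could look like, here is a minimal sketch of one simple heuristic. It is not taken from cwb-huffcode.c and is not an optimal length-limited construction. The function name limit_code_lengths() is hypothetical, and the sketch assumes that the code lengths have already been computed and that len[] is ordered from most frequent to least frequent symbol. It clamps over-long codes to the limit and then repairs the Kraft inequality, so that a prefix code with the resulting lengths still exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of CWB: force every code length in
 * len[0..n-1] to be <= max_len while keeping the Kraft inequality
 * satisfied.  Assumes len[] is sorted from most to least frequent symbol.
 * Returns 0 on success, -1 if n symbols cannot fit into max_len bits. */
int limit_code_lengths(unsigned *len, size_t n, unsigned max_len)
{
  uint64_t budget, kraft = 0;
  size_t i;

  assert(max_len >= 1 && max_len <= 31);
  budget = (uint64_t)1 << max_len;   /* Kraft sum 1.0 in units of 2^-max_len */
  if (n > budget)
    return -1;

  /* Step 1: clamp over-long codes and compute the scaled Kraft sum. */
  for (i = 0; i < n; i++) {
    if (len[i] > max_len)
      len[i] = max_len;
    kraft += (uint64_t)1 << (max_len - len[i]);
  }

  /* Step 2: while the code space is over-subscribed, lengthen the longest
   * code that is still shorter than max_len (the smallest correction);
   * ties go to the highest index, i.e. the least frequent symbol. */
  while (kraft > budget) {
    size_t best = n;
    for (i = 0; i < n; i++)
      if (len[i] < max_len && (best == n || len[i] >= len[best]))
        best = i;
    /* best < n here: if all lengths were max_len, kraft == n <= budget */
    kraft -= (uint64_t)1 << (max_len - len[best] - 1);
    len[best]++;
  }

  /* Step 3: hand any leftover code space back, most frequent symbol first. */
  for (i = 0; i < n; i++) {
    while (len[i] > 1 &&
           kraft + ((uint64_t)1 << (max_len - len[i])) <= budget) {
      kraft += (uint64_t)1 << (max_len - len[i]);
      len[i]--;
    }
  }
  return 0;
}
```

For example, with a limit of 4 bits the lengths {1, 2, 3, 4, 5, 6, 6} become {1, 3, 3, 4, 4, 4, 4}, which again fill the code space exactly. Step 2 rescans the array on every iteration, so this is only meant to show the idea; whether it could be fitted into compute_code_lengths() without affecting the index file format would still have to be checked against the actual code.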

This bug does not have a very high priority, as it seems to happen only under very extreme circumstances.  So far, it has been noticed only once, for a 2.1 billion word corpus (very close to the CWB size limit!) and an attribute with a highly skewed distribution (dependency offset pointers).
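
To see why skew matters, the following sketch (illustrative only, not the CWB implementation) computes the length of the longest Huffman code for a given frequency table using naive pairwise merging. The function names max_huffman_code_length() and fill_fibonacci() are hypothetical. Fibonacci-like frequencies are used as a stand-in for a highly skewed distribution, because they force the Huffman tree to degenerate into a chain.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: fill freq[0..n-1] with 1, 1, 2, 3, 5, 8, ... */
void fill_fibonacci(uint64_t *freq, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    freq[i] = (i < 2) ? 1 : freq[i - 1] + freq[i - 2];
}

/* Illustrative only: length of the longest Huffman code for the given
 * frequencies, via naive O(n^2) merging of the two lightest subtrees.
 * Returns 0 if n < 2 or if memory allocation fails. */
unsigned max_huffman_code_length(const uint64_t *freq, size_t n)
{
  uint64_t *weight;
  unsigned *depth, result;
  size_t i, live = n;

  if (n < 2)
    return 0;
  weight = malloc(n * sizeof *weight);
  depth = malloc(n * sizeof *depth);
  if (!weight || !depth) {
    free(weight);
    free(depth);
    return 0;
  }
  for (i = 0; i < n; i++) {
    weight[i] = freq[i];
    depth[i] = 0;                     /* a single leaf has height 0 */
  }

  while (live > 1) {
    size_t a = 0, b = 1;              /* lightest and second-lightest subtree */
    if (weight[b] < weight[a]) { a = 1; b = 0; }
    for (i = 2; i < live; i++) {
      if (weight[i] < weight[a]) { b = a; a = i; }
      else if (weight[i] < weight[b]) { b = i; }
    }
    /* Merge b into a: the new subtree is one level taller than its
     * taller child. */
    weight[a] += weight[b];
    depth[a] = (depth[a] > depth[b] ? depth[a] : depth[b]) + 1;
    /* Drop slot b by moving the last live entry into it. */
    live--;
    weight[b] = weight[live];
    depth[b] = depth[live];
  }

  result = depth[0];                  /* height of the root = longest code */
  free(weight);
  free(depth);
  return result;
}
```

With fill_fibonacci(freq, 33), max_huffman_code_length(freq, 33) yields 32, one bit more than the 31-bit limit, whereas a uniform table of 4 symbols yields 2. Real attributes will not follow this pattern exactly; the point is only that the depth of the code tree is driven by how skewed the distribution is.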

----------------------------------------------------------------------
