[CWB] Empty fields

Stefan Evert stefanML at collocations.de
Tue Aug 6 12:47:07 CEST 2013


Hi Maarten,

no one seems to have picked up on this report so far -- it's a known problem in the cwb-huffcode algorithm, which breaks if there's only a single type (i.e. a single distinct value) in an attribute.

It should indeed be possible to fix this, probably by inserting a special case for Huffman trees with just a single entry, but someone would have to go carefully through the code in utils/cwb-huffcode.c and understand all the details of the implementation.  Moreover, thorough tests are required to make sure this doesn't break the Huffman encoding for normal attributes.
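
To illustrate the kind of special case meant here, below is a minimal Python sketch of Huffman code-length assignment. It is not taken from utils/cwb-huffcode.c and does not mirror its data structures; the function name huffman_code_lengths and the choice of a 1-bit code for a lone symbol are assumptions made for illustration only. The point is that the usual merge loop never runs when there is just one entry, so without the extra branch the single type would end up with a zero-length code.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a {symbol: frequency} mapping.

    Hypothetical helper for illustration; not the CWB implementation.
    """
    if not freqs:
        return {}
    if len(freqs) == 1:
        # Special case: with a single type the merge loop below never runs,
        # so the lone symbol would get a 0-bit code. Assign 1 bit instead
        # (an assumption; a real fix might choose a different convention).
        (symbol,) = freqs
        return {symbol: 1}

    tiebreak = count()  # unique counter so tied frequencies never compare lists
    heap = [(f, next(tiebreak), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)

    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol below the new
        # node moves one level deeper, i.e. its code grows by one bit.
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths
```

For an attribute whose only value is "_", this returns a 1-bit code for "_", while attributes with two or more types go through the ordinary merge loop unchanged.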

Any takers?  This is a fairly self-contained task, so it should be possible to do this even without knowing much about CWB internals.  Documentation of the algorithm would also be highly appreciated.

Cheers,
Stefan



On 23 Jan 2013, at 13:24, Maarten Janssen <maartenpt at gmail.com> wrote:

> Hi all,
> 
> There is a "bug" in CQP 3.0.0 that probably hardly ever comes up in the typical use of CQP: when you have a column in your .vrt file for which all lines have an empty value (_, as in the fourth column of the example below), cwb-encode runs fine, but cwb-make quits after complaining about the empty column. To be precise, CWB::Indexer stops because cwb-huffcode fails (Problem: No output generated -- no items?), after which cwb-make chokes on the missing .huf file.
> 
> Normally there is no reason to have empty columns, and if there are any it is usually easy to remove them. But in the context in which I am using it, these files are built on the fly, and it is difficult to predict beforehand which columns are used and which are not. Does anyone know if there is a way to use a flag to ignore a column, like you do with -p for the word attribute? And/or has this quit-on-empty-column been solved in later versions of CQP? It would be a relatively easy change to make (the missing .huf file should not be crucial), but I'd rather not get into it if it has already been dealt with somehow.
> 
> kind regards,
> 
> Maarten Janssen
> 
> This	this	PRON	_	1
> is	be	VRB	_	2
> a	a	DET	_	3
> little	little	ADJ	_	4
> test	test	NOUN	_	5


