[CWB] Huffman code error

Stefan Evert stefanML at collocations.de
Wed Oct 10 15:35:56 CEST 2012


> I have the feeling this bug has come up before

It has, but AFAIR this was in the context of very large corpora (> 1.5 billion words), where it has to do with a deficiency in the CWB binary file format and therefore cannot be fixed in a backward-compatible way.

> – can you check your version? (cqp -v)

The path indicates CWB 3.4.1, which seems to be rather ancient and will contain a lot of bugs that have been fixed in the meantime.

For what it's worth, I tried the sample input file included in the e-mail with CWB 3.4.3 and 3.4.5 on my Mac and wasn't able to reproduce the error.

Two observations, though:

1) The sample file in the e-mail has only 35 tokens, not 40 tokens as claimed.  So perhaps this is a cut-down version that doesn't trigger the error?

2) When copying & pasting from the e-mail, I end up with 4 blanks as column separators rather than the required TABs; I replaced them with TABs before encoding, of course.  If I leave the blanks in, cwb-huffcode fails because the attributes "lema" and "pos" are empty.  However, this produces a different error message from the one reported.
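
In case it helps with checking the original input file, both observations can be tested with a few lines of Python.  This is only a rough sketch, not part of CWB: the function names check_vrt and blanks_to_tabs are made up for illustration, and it assumes three TAB-separated columns (word, "lema", "pos") and that every line starting with "<" is an XML tag rather than a token.

```python
import re

def check_vrt(lines, n_columns=3):
    """Count token lines and list lines with the wrong number of TAB columns.

    Assumptions (not taken from CWB itself): blank lines are ignored,
    lines starting with '<' are XML tags, and each token line should
    have exactly n_columns TAB-separated fields.
    """
    n_tokens = 0
    bad_lines = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue              # skip blank lines
        if line.startswith("<"):
            continue              # assumed to be an XML tag, not a token
        n_tokens += 1
        if len(line.split("\t")) != n_columns:
            bad_lines.append(lineno)
    return n_tokens, bad_lines

def blanks_to_tabs(line, min_blanks=2):
    """Replace each run of at least min_blanks spaces with a single TAB."""
    return re.sub(" {%d,}" % min_blanks, "\t", line)

if __name__ == "__main__":
    # Tiny made-up sample: the first token line uses 4 blanks, the second TABs.
    sample = ["<s>\n", "El    el    DET\n", "gato\tgato\tNOUN\n", "</s>\n"]
    print(check_vrt(sample))                               # (2, [2])
    print(check_vrt([blanks_to_tabs(l) for l in sample]))  # (2, [])
```

Running check_vrt on the real file should show whether it has the claimed 40 tokens and whether any line lost its TABs along the way.  Note that blanks_to_tabs would also break up tokens that legitimately contain several consecutive blanks, so it is only safe on data like the sample here.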

Best
Stefan



