[CWB] latin file that gives an error

Thu Oct 11 17:43:43 CEST 2012

2012/10/11 Stefan Evert <stefanML at collocations.de>

> Hi Eva,
>
> Can you tell us exactly what operating system and version you're using,
> and how you have obtained and installed CWB?  If you're using a
> pre-compiled binary, please tell us which version you've downloaded.
>
>
I'm using a server running Debian GNU/Linux 6.0.
We downloaded the beta version (cwb-3.4.1).
I have created several corpora in several languages and never had this
problem.

> We also need to know exactly which commands you entered to index and
> compress the corpus, plus the output from each of these commands.  Perhaps
> this will allow us to make a guess at the source of the error.
>
>
The command I use is:
 cat $SOURCEFILE | /usr/local/cwb-3.4.1/bin/cwb-encode -c utf8 -d $DATADIR \
   -R $REGDIR/$CORPUSNAME -xsB -P lema -P pos -V s \
   -S doc:0+type+title -S not:0+text

This is the output (after correcting the errors you mentioned):

=== Makeall: processing corpus LATIN ===
Registry directory: /B_NFS_P/resources/corpora/written/registry/
ATTRIBUTE word
 - lexicon      OK
 - frequencies  OK
 - token stream OK (COMPRESSED)
 - index        OK (COMPRESSED)
ATTRIBUTE lema
 - lexicon      OK
 - frequencies  OK
 - token stream OK (COMPRESSED)
 - index        OK (COMPRESSED)
ATTRIBUTE pos
 - lexicon      OK
 - frequencies  OK
 - token stream OK (COMPRESSED)
 - index        OK (COMPRESSED)
========================================
COMPRESSING TOKEN STREAM of LATIN.word
- writing code descriptor block to /B_NFS_P/resources/corpora/written/data/latin/word.hcd
- writing compressed item sequence to /B_NFS_P/resources/corpora/written/data/latin/word.huf
- writing sync (every 128 tokens) to /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
VALIDATING LATIN.word
- reading code descriptor block from /B_NFS_P/resources/corpora/written/data/latin/word.hcd
- reading compressed item sequence from /B_NFS_P/resources/corpora/written/data/latin/word.huf
- reading sync (mod 128) from /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
!! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus> now.
COMPRESSING TOKEN STREAM of LATIN.lema
Error: Huffman codes too long (33 bits, current maximum is 31 bits).
       Please contact the CWB development team for assistance.
COMPRESSING INDEX of LATIN.word
- writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/word.crc
- writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/word.crx
VALIDATING LATIN.word
- reading compressed index from /B_NFS_P/resources/corpora/written/data/latin/word.crc
- reading compressed index offsets from /B_NFS_P/resources/corpora/written/data/latin/word.crx
!! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rev> now.
!! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rdx> now.
COMPRESSING INDEX of LATIN.lema
- writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/lema.crc
- writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/lema.crx
CL: index is out of range: (aborting) token frequency == 0

=== Makeall: processing corpus LATIN ===
Registry directory: /B_NFS_P/resources/corpora/written/registry/
ATTRIBUTE word
 - lexicon      OK
 - frequencies  OK
 - token stream OK (COMPRESSED)
 - index        OK (COMPRESSED)
ATTRIBUTE lema
 - lexicon      OK
 - frequencies  OK
 - token stream OK (COMPRESSED)
 - index        OK (COMPRESSED)
ATTRIBUTE pos
 - lexicon      OK
 - frequencies  OK
 - token stream OK (COMPRESSED)
 - index        OK (COMPRESSED)
========================================

Thanks

Eva
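
Note on the "Huffman codes too long" message in the output above: the
length of a Huffman code depends on how skewed the frequency distribution
of the attribute is. The sketch below is only an illustration in Python,
not CWB's own code; the function names huffman_code_lengths and fibonacci
are hypothetical. It builds a textbook Huffman tree and reports the longest
code. With Fibonacci-like frequencies (an assumed worst case, not data from
the LATIN corpus) 34 distinct values are already enough to reach 33 bits.
Whether this is what happened for LATIN.lema cannot be told from the log
alone.

```python
import heapq


def huffman_code_lengths(freqs):
    """Return the Huffman code length in bits for each frequency in freqs."""
    if len(freqs) == 1:
        return [1]
    # Heap entries: (subtree weight, tie breaker, symbol indices in subtree).
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    tie = len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Every symbol below the new parent node gets one more bit.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


def fibonacci(n):
    """Return the first n Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    out, a, b = [], 1, 1
    for _ in range(n):
        out.append(a)
        a, b = b, a + b
    return out


if __name__ == "__main__":
    freqs = fibonacci(34)  # assumed, maximally skewed frequency list
    lengths = huffman_code_lengths(freqs)
    print("types:", len(freqs), "tokens:", sum(freqs))
    print("longest code:", max(lengths), "bits")
```

With evenly distributed frequencies the codes stay short (8 equally
frequent values need 3 bits each), so the message points at a very uneven
or unexpected frequency list for that attribute.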