[CWB] latin file that gives an error

Fri Oct 12 01:38:50 CEST 2012

On 11 Oct 2012, at 17:43, BOFÍAS ALBERCH, EVA wrote:

> I'm using a server: Debian GNU/Linux 6.0
> We downloaded the beta version (cwb-3.4.1 )

In a previous mail you stated that you're using CWB 3.0.2 -- is it possible that you've mixed up two different versions? However, file formats should be fully compatible between 3.0.x and 3.4.x, so this is unlikely to be the cause of your problems.

> We also need to know exactly which commands you entered to index and compress the corpus, plus the output from each of these commands.  Perhaps this will allow us to make a guess at the source of the error.
> 
> the command I use is:
>  cat $SOURCEFILE | /usr/local/cwb-3.4.1/bin/cwb-encode -c utf8 -d $DATADIR -R $REGDIR/$CORPUSNAME -xsB -P lema -P pos -V s  -S doc:0+type+title -S not:0+text

That can't be all you're doing.

For one thing, you need to define the shell variables SOURCEFILE, DATADIR, etc. for this command to do anything sensible.

More importantly, this command only runs cwb-encode, which is the first step of the indexing process.  You still need to run cwb-makeall (to build the actual index structures) and cwb-huffcode and cwb-compress-rdx (to compress the index files, which is where your error occurs).
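
For reference, here is a rough sketch of the steps that follow cwb-encode. It only prints the commands (a dry run) rather than executing them. The function name cwb_index_steps is made up for illustration, the default binary directory is assumed from the cwb-encode path in the quoted command, and the registry path and corpus ID are taken from the quoted output, so adjust all of them to your setup.

```shell
#!/bin/sh
# Dry-run sketch: print the commands that follow cwb-encode for one corpus.
# cwb_index_steps is a hypothetical helper, not part of CWB.

cwb_index_steps() {
    regdir=$1                               # registry directory
    corpus=$2                               # corpus ID in upper case, e.g. LATIN
    bindir=${3:-/usr/local/cwb-3.4.1/bin}   # assumed install location

    # 1. build the index structures and validate them
    echo "$bindir/cwb-makeall -r $regdir -V $corpus"
    # 2. compress the token stream of all p-attributes
    echo "$bindir/cwb-huffcode -r $regdir -A $corpus"
    # 3. compress the index of all p-attributes
    echo "$bindir/cwb-compress-rdx -r $regdir -A $corpus"
}

# Registry path and corpus ID as they appear in the quoted output.
cwb_index_steps /B_NFS_P/resources/corpora/written/registry LATIN
```

Once the printed commands look right, they can be run one by one, or piped into "sh -e" so that the sequence stops at the first failing step (this assumes the paths contain no spaces).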

The output you sent us (as shown below) stems from these programs, so you must be running those additional commands in some way!  

There are two strange things about the output:

1) You seem to run cwb-makeall twice, once before compressing and once after.  There's no need to run cwb-makeall a second time -- why do you do that?

2) The output from the first cwb-makeall run indicates that the index structures have already been created _and_ compressed (it just says "OK" rather than "creating ..."). Those might be stale, damaged files from a previous encoding run. Did you forget to clean the data directory /B_NFS_P/resources/corpora/written/data/latin/ before re-running cwb-encode? It's quite possible that your error is due to damaged index files still lying around ...
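
A minimal sketch of such a clean-up step, to be run before cwb-encode. The helper name clean_datadir is made up for illustration, and the sketch assumes that the data directory holds nothing but the files of this one corpus, since it removes every regular file directly inside it. It relies on "find -maxdepth", which is available in GNU find on Debian.

```shell
#!/bin/sh
# Sketch: empty a corpus data directory before re-encoding.
# clean_datadir is a hypothetical helper, not part of CWB.
# Assumption: the directory contains only files belonging to this corpus.

clean_datadir() {
    datadir=$1

    # Refuse to act on an empty argument or on something that is not a directory.
    if [ -z "$datadir" ] || [ ! -d "$datadir" ]; then
        echo "clean_datadir: not a directory: $datadir" >&2
        return 1
    fi

    # Remove regular files directly inside the data directory;
    # subdirectories and their contents are left untouched.
    find "$datadir" -maxdepth 1 -type f -exec rm -f {} +
}

# Example call (commented out on purpose, because it deletes files):
# clean_datadir /B_NFS_P/resources/corpora/written/data/latin
```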

By the way, this is a good reason why you should use cwb-make from the CWB/Perl modules rather than calling cwb-makeall etc. directly.  cwb-make would recognise that the index files are out of date and automatically delete and rebuild them.

Best,
Stefan

> 
> This is the output (after correcting the errors you mentioned):
> 
> === Makeall: processing corpus LATIN ===
> Registry directory: /B_NFS_P/resources/corpora/written/registry/
> ATTRIBUTE word
>  - lexicon      OK
>  - frequencies  OK
>  - token stream OK (COMPRESSED)
>  - index        OK (COMPRESSED)
> ATTRIBUTE lema
>  - lexicon      OK
>  - frequencies  OK
>  - token stream OK (COMPRESSED)
>  - index        OK (COMPRESSED)
> ATTRIBUTE pos
>  - lexicon      OK
>  - frequencies  OK
>  - token stream OK (COMPRESSED)
>  - index        OK (COMPRESSED)
> ========================================
> COMPRESSING TOKEN STREAM of LATIN.word
> - writing code descriptor block to /B_NFS_P/resources/corpora/written/data/latin/word.hcd
> - writing compressed item sequence to /B_NFS_P/resources/corpora/written/data/latin/word.huf
> - writing sync (every 128 tokens) to /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
> VALIDATING LATIN.word
> - reading code descriptor block from /B_NFS_P/resources/corpora/written/data/latin/word.hcd
> - reading compressed item sequence from /B_NFS_P/resources/corpora/written/data/latin/word.huf
> - reading sync (mod 128) from /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
> !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus> now.
> COMPRESSING TOKEN STREAM of LATIN.lema
> Error: Huffman codes too long (33 bits, current maximum is 31 bits).
>        Please contact the CWB development team for assistance.
> COMPRESSING INDEX of LATIN.word
> - writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/word.crc
> - writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/word.crx
> VALIDATING LATIN.word
> - reading compressed index from /B_NFS_P/resources/corpora/written/data/latin/word.crc
> - reading compressed index offsets from /B_NFS_P/resources/corpora/written/data/latin/word.crx
> !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rev> now.
> !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rdx> now.
> COMPRESSING INDEX of LATIN.lema
> - writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/lema.crc
> - writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/lema.crx
> CL: index is out of range: (aborting) token frequency == 0
> 
> === Makeall: processing corpus LATIN ===
> Registry directory: /B_NFS_P/resources/corpora/written/registry/
> ATTRIBUTE word
>  - lexicon      OK
>  - frequencies  OK
>  - token stream OK (COMPRESSED)
>  - index        OK (COMPRESSED)
> ATTRIBUTE lema
>  - lexicon      OK
>  - frequencies  OK
>  - token stream OK (COMPRESSED)
>  - index        OK (COMPRESSED)
> ATTRIBUTE pos
>  - lexicon      OK
>  - frequencies  OK
>  - token stream OK (COMPRESSED)
>  - index        OK (COMPRESSED)
> ========================================
> 
> Thanks 
> 
> Eva
>