[CWB] latin file that gives an error
BOFÍAS ALBERCH, EVA
eva.bofias at upf.edu
Thu Oct 11 17:43:43 CEST 2012
2012/10/11 Stefan Evert <stefanML at collocations.de>
> Hi Eva,
>
> Can you tell us exactly what operating system and version you're using,
> and how you have obtained and installed CWB? If you're using a
> pre-compiled binary, please tell us which version you've downloaded.
>
>
I'm using a server running Debian GNU/Linux 6.0.
We downloaded the beta version (cwb-3.4.1).
I have created several corpora in several languages, and I never had this
problem before.
> We also need to know exactly which commands you entered to index and
> compress the corpus, plus the output from each of these commands. Perhaps
> this will allow us to make a guess at the source of the error.
>
>
The command I use is:
cat $SOURCEFILE | /usr/local/cwb-3.4.1/bin/cwb-encode -c utf8 -d $DATADIR \
    -R $REGDIR/$CORPUSNAME -xsB -P lema -P pos -V s -S doc:0+type+title \
    -S not:0+text
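
Since the output of each command was asked for separately, one way to collect it is to wrap every step in a small logging helper. This is only a sketch: run_logged and LOGDIR are hypothetical names that are not part of CWB, and the indexing and compression invocations in the comments are assumed, since only the cwb-encode call is given above.

```shell
# Hypothetical helper (not part of CWB): run one command and save its
# combined stdout/stderr, plus the exit status, to $LOGDIR/<step>.log.
run_logged() {
    step="$1"; shift
    logfile="${LOGDIR:-.}/${step}.log"
    # Record the command line first so each log is self-describing.
    printf '$ %s\n' "$*" > "$logfile"
    "$@" >> "$logfile" 2>&1
    status=$?
    printf 'exit status: %s\n' "$status" >> "$logfile"
    return "$status"
}

# Intended use (assumed invocations, shown as comments only):
#   run_logged encode /usr/local/cwb-3.4.1/bin/cwb-encode -c utf8 -d "$DATADIR" \
#       -R "$REGDIR/$CORPUSNAME" -xsB -P lema -P pos -V s \
#       -S doc:0+type+title -S not:0+text < "$SOURCEFILE"
#   run_logged makeall  cwb-makeall      -r "$REGDIR" -V LATIN
#   run_logged huffcode cwb-huffcode     -r "$REGDIR" -A LATIN
#   run_logged rdx      cwb-compress-rdx -r "$REGDIR" -A LATIN
```

Each step then leaves its own log file, so the output of the failing command can be posted on its own.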
This is the output (after correcting the errors you mentioned):
=== Makeall: processing corpus LATIN ===
Registry directory: /B_NFS_P/resources/corpora/written/registry/
ATTRIBUTE word
- lexicon OK
- frequencies OK
- token stream OK (COMPRESSED)
- index OK (COMPRESSED)
ATTRIBUTE lema
- lexicon OK
- frequencies OK
- token stream OK (COMPRESSED)
- index OK (COMPRESSED)
ATTRIBUTE pos
- lexicon OK
- frequencies OK
- token stream OK (COMPRESSED)
- index OK (COMPRESSED)
========================================
COMPRESSING TOKEN STREAM of LATIN.word
- writing code descriptor block to
/B_NFS_P/resources/corpora/written/data/latin/word.hcd
- writing compressed item sequence to
/B_NFS_P/resources/corpora/written/data/latin/word.huf
- writing sync (every 128 tokens) to
/B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
VALIDATING LATIN.word
- reading code descriptor block from
/B_NFS_P/resources/corpora/written/data/latin/word.hcd
- reading compressed item sequence from
/B_NFS_P/resources/corpora/written/data/latin/word.huf
- reading sync (mod 128) from
/B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
!! You can delete the file
</B_NFS_P/resources/corpora/written/data/latin/word.corpus> now.
COMPRESSING TOKEN STREAM of LATIN.lema
Error: Huffman codes too long (33 bits, current maximum is 31 bits).
Please contact the CWB development team for assistance.
COMPRESSING INDEX of LATIN.word
- writing compressed index to
/B_NFS_P/resources/corpora/written/data/latin/word.crc
- writing compressed index offsets to
/B_NFS_P/resources/corpora/written/data/latin/word.crx
VALIDATING LATIN.word
- reading compressed index from
/B_NFS_P/resources/corpora/written/data/latin/word.crc
- reading compressed index offsets from
/B_NFS_P/resources/corpora/written/data/latin/word.crx
!! You can delete the file
</B_NFS_P/resources/corpora/written/data/latin/word.corpus.rev> now.
!! You can delete the file
</B_NFS_P/resources/corpora/written/data/latin/word.corpus.rdx> now.
COMPRESSING INDEX of LATIN.lema
- writing compressed index to
/B_NFS_P/resources/corpora/written/data/latin/lema.crc
- writing compressed index offsets to
/B_NFS_P/resources/corpora/written/data/latin/lema.crx
CL: index is out of range: (aborting) token frequency == 0
=== Makeall: processing corpus LATIN ===
Registry directory: /B_NFS_P/resources/corpora/written/registry/
ATTRIBUTE word
- lexicon OK
- frequencies OK
- token stream OK (COMPRESSED)
- index OK (COMPRESSED)
ATTRIBUTE lema
- lexicon OK
- frequencies OK
- token stream OK (COMPRESSED)
- index OK (COMPRESSED)
ATTRIBUTE pos
- lexicon OK
- frequencies OK
- token stream OK (COMPRESSED)
- index OK (COMPRESSED)
========================================
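
Both failures in the output above concern the lema attribute, so a quick look at that column of the input file may help narrow things down. The sketch below is only a diagnostic under assumptions: it supposes the source file has one token per line with tab-separated word, lema and pos columns (as the -P lema -P pos options suggest), treats every line starting with "<" as an XML tag, and uses a hypothetical function name, lemma_stats. It does not identify the cause of the error.

```shell
# Hypothetical diagnostic: summarise the second (lema) column of a
# tab-separated, one-token-per-line input file.
lemma_stats() {
    awk -F '\t' '
        /^</    { next }                  # assumed to be XML tag lines
        NF == 0 { next }                  # skip blank lines
                { total++; count[$2]++ }  # tally every lema value
        NF < 3  { short++ }               # token lines with missing columns
        $2 == "" { empty++ }              # token lines with an empty lema
        END {
            types = 0
            for (k in count) types++
            printf "tokens=%d types=%d empty_lemma=%d short_lines=%d\n",
                   total, types, empty + 0, short + 0
        }
    ' "$1"
}

# Example call: lemma_stats "$SOURCEFILE"
```

Non-zero counts for empty_lemma or short_lines would point to malformed token lines in the input.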
Thanks
Eva