[CWB] latin file that gives an error

BOFÍAS ALBERCH, EVA eva.bofias at upf.edu
Tue Oct 16 12:03:54 CEST 2012


Hi Stefan,
I had confused the versions on the server and on my computer.
I was using an old script that called cwb-makeall, cwb-huffcode and
cwb-compress-rdx (plus a final cwb-makeall, which I think was there to
check that everything was OK).
I switched to cwb-make and now it works, so the error must have been caused
by old files, as you pointed out.

Many thanks for your help

Eva Bofias

2012/10/12 Stefan Evert <stefanML at collocations.de>

>
> On 11 Oct 2012, at 17:43, BOFÍAS ALBERCH, EVA wrote:
>
> > I'm using a server: Debian GNU/Linux 6.0
> > We downloaded the beta version (cwb-3.4.1 )
>
> In a previous mail you stated that you're using CWB 3.0.2 -- is it
> possible that you've mixed up two different versions? However, file formats
> should be fully compatible between 3.0.x and 3.4.x, so this is unlikely to
> be the cause of your problems.
>
>



> > We also need to know exactly which commands you entered to index and
> > compress the corpus, plus the output from each of these commands.  Perhaps
> > this will allow us to make a guess at the source of the error.
> >
> > the command I use is:
> >  cat $SOURCEFILE | /usr/local/cwb-3.4.1/bin/cwb-encode -c utf8 -d $DATADIR -R $REGDIR/$CORPUSNAME -xsB -P lema -P pos -V s -S doc:0+type+title -S not:0+text
>
> That can't be all you're doing.
>
> For one thing, you need to define the shell variables SOURCEFILE, DATADIR,
> etc. for this command to do anything sensible.
>
> More importantly, this command only runs cwb-encode, which is the first
> step of the indexing process.  You still need to run cwb-makeall (to build
> the actual index structures) and cwb-huffcode and cwb-compress-rdx (to
> compress the index files, which is where your error occurs).
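The remaining steps described above can be sketched roughly as follows. This is a dry run that only prints the commands in order, so nothing is executed. The function name print_index_steps and the registry path are hypothetical, and the flags (-r for the registry directory, -V to validate, -A for all attributes) are given as I understand them from the CWB command-line tools; check each tool's -h output for your version.

```shell
#!/bin/sh
# Print the manual post-encoding steps in order (dry run; nothing is executed).
# Arguments: registry directory, corpus ID in upper case (e.g. LATIN).
print_index_steps() {
    regdir=$1
    corpus=$2
    # 1. build lexicon, frequency and index files; -V validates them
    echo "cwb-makeall -r $regdir -V $corpus"
    # 2. compress the token stream of all positional attributes (-A)
    echo "cwb-huffcode -r $regdir -A $corpus"
    # 3. compress the index of all positional attributes (-A)
    echo "cwb-compress-rdx -r $regdir -A $corpus"
}

# Hypothetical registry path, for illustration only.
print_index_steps /path/to/registry LATIN
```

Note that cwb-makeall appears only once here; a second run after compression is not part of this sequence.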
>
> The output you sent us (as shown below) stems from these programs, so you
> must be running those additional commands in some way!
>
> There are two strange things about the output:
>
> 1) You seem to run cwb-makeall twice, once before compressing and once
> after.  There's no need to run cwb-makeall a second time -- why do you do
> that?
>
> 2) The output from the first cwb-makeall run indicates that the index
> structures have already been created _and_ compressed (it just says "OK"
> rather than "creating ...").  Those might be stale, damaged files from a
> previous encoding run.  Did you forget to clean the data directory
> /B_NFS_P/resources/corpora/written/data/latin/ before re-running
> cwb-encode?  It's quite possible that your error is due to damaged index
> files still lying around ...
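One way to avoid stale files is to empty the data directory before every cwb-encode run. The helper below is a minimal sketch under that assumption; clean_data_dir is a hypothetical name, not a CWB tool. It deletes only regular files directly inside the given directory and refuses to run on an empty or invalid argument.

```shell
#!/bin/sh
# Remove leftover files from a corpus data directory before re-encoding.
# Only regular, non-hidden files directly inside the directory are deleted;
# subdirectories are left alone.
clean_data_dir() {
    datadir=$1
    # refuse to run on an empty argument or something that is not a directory
    if [ -z "$datadir" ] || [ ! -d "$datadir" ]; then
        echo "clean_data_dir: not a directory: $datadir" >&2
        return 1
    fi
    for f in "$datadir"/*; do
        if [ -f "$f" ]; then
            rm -f -- "$f"
        fi
    done
    return 0
}

# Usage (before running cwb-encode):
#   clean_data_dir "$DATADIR"
```

Double-check the value of DATADIR before calling this, since the deletion cannot be undone.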
>
> By the way, this is a good reason why you should use cwb-make from the
> CWB/Perl modules rather than calling cwb-makeall etc. directly.  cwb-make
> would recognise that the index files are out of date and automatically
> delete and rebuild them.
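A small wrapper around cwb-make might look like the sketch below. The function name rebuild_with_cwb_make is hypothetical, and the options (-r for the registry directory, -V to validate the newly created files) are given as I understand them from the CWB/Perl documentation; check the cwb-make help text on your installation. If cwb-make is not on the PATH, the wrapper only reports the command it would have run.

```shell
#!/bin/sh
# Index and compress a corpus with cwb-make, if it is installed.
# Arguments: registry directory, corpus ID in upper case (e.g. LATIN).
rebuild_with_cwb_make() {
    regdir=$1
    corpus=$2
    if command -v cwb-make >/dev/null 2>&1; then
        # -r: registry directory, -V: validate the newly created files
        cwb-make -r "$regdir" -V "$corpus"
    else
        echo "cwb-make not found; would run: cwb-make -r $regdir -V $corpus" >&2
        return 127
    fi
}

# Usage (after cwb-encode has finished):
#   rebuild_with_cwb_make "$REGDIR" LATIN
```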
>
> Best,
> Stefan
>
>
> >
> > This is the output (after correcting the errors you mentioned):
> >
> > === Makeall: processing corpus LATIN ===
> > Registry directory: /B_NFS_P/resources/corpora/written/registry/
> > ATTRIBUTE word
> >  - lexicon      OK
> >  - frequencies  OK
> >  - token stream OK (COMPRESSED)
> >  - index        OK (COMPRESSED)
> > ATTRIBUTE lema
> >  - lexicon      OK
> >  - frequencies  OK
> >  - token stream OK (COMPRESSED)
> >  - index        OK (COMPRESSED)
> > ATTRIBUTE pos
> >  - lexicon      OK
> >  - frequencies  OK
> >  - token stream OK (COMPRESSED)
> >  - index        OK (COMPRESSED)
> > ========================================
> > COMPRESSING TOKEN STREAM of LATIN.word
> > - writing code descriptor block to /B_NFS_P/resources/corpora/written/data/latin/word.hcd
> > - writing compressed item sequence to /B_NFS_P/resources/corpora/written/data/latin/word.huf
> > - writing sync (every 128 tokens) to /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
> > VALIDATING LATIN.word
> > - reading code descriptor block from /B_NFS_P/resources/corpora/written/data/latin/word.hcd
> > - reading compressed item sequence from /B_NFS_P/resources/corpora/written/data/latin/word.huf
> > - reading sync (mod 128) from /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn
> > !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus> now.
> > COMPRESSING TOKEN STREAM of LATIN.lema
> > Error: Huffman codes too long (33 bits, current maximum is 31 bits).
> >        Please contact the CWB development team for assistance.
> > COMPRESSING INDEX of LATIN.word
> > - writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/word.crc
> > - writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/word.crx
> > VALIDATING LATIN.word
> > - reading compressed index from /B_NFS_P/resources/corpora/written/data/latin/word.crc
> > - reading compressed index offsets from /B_NFS_P/resources/corpora/written/data/latin/word.crx
> > !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rev> now.
> > !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rdx> now.
> > COMPRESSING INDEX of LATIN.lema
> > - writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/lema.crc
> > - writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/lema.crx
> > CL: index is out of range: (aborting) token frequency == 0
> >
> > === Makeall: processing corpus LATIN ===
> > Registry directory: /B_NFS_P/resources/corpora/written/registry/
> > ATTRIBUTE word
> >  - lexicon      OK
> >  - frequencies  OK
> >  - token stream OK (COMPRESSED)
> >  - index        OK (COMPRESSED)
> > ATTRIBUTE lema
> >  - lexicon      OK
> >  - frequencies  OK
> >  - token stream OK (COMPRESSED)
> >  - index        OK (COMPRESSED)
> > ATTRIBUTE pos
> >  - lexicon      OK
> >  - frequencies  OK
> >  - token stream OK (COMPRESSED)
> >  - index        OK (COMPRESSED)
> > ========================================
> >
> > Thanks
> >
> > Eva
> >
>
>