[Sigwac] bootcats / large crawls
Marco Baroni
baroni at sslmit.unibo.it
Mon Aug 28 13:08:08 CEST 2006
> The default OOB settings won't let you get much beyond 100Gb before
> becoming inefficient, there are *lots* of tuning points to make Heritrix
> scale properly, including the underlying Java engine. IMHO, this tuning
> for scale requirement isn't unique to Heritrix by any means.
The mystery is that last year, with Heritrix 1.4, we were able to crawl about
350Gb of data (text only, so we actually visited much more than that) in a
few weeks, whereas now, with 1.6 and 1.8, we get far less data
(from the very beginning, well before reaching the 100Gb mark). Do you have
any concrete suggestions for settings that would improve efficiency?
Thanks.
Regards,
Marco