[Sigwac] bootcats / large crawls

Marco Baroni baroni at sslmit.unibo.it
Mon Aug 28 13:08:08 CEST 2006


> The default OOB settings won't let you get much beyond 100Gb before 
> becoming inefficient, there are *lots* of tuning points to make Heritrix 
> scale properly, including the underlying Java engine. IMHO, this tuning 
> for scale requirement isn't unique to Heritrix by any means.

The mystery is that last year, with 1.4, we were able to crawl about 
350Gb of data (just text, so actually visiting much more than that) in a 
few weeks, whereas right now, with 1.6 and 1.8, we get a lot less data 
(from the very beginning, way before passing the 100Gb mark). Do you have 
any concrete suggestions for good settings to improve efficiency?

Thanks.

Regards,

Marco

