[Sigwac] bootcats / large crawls

Baden Hughes badenh at csse.unimelb.edu.au
Mon Aug 28 03:08:30 CEST 2006


> Somewhat related -- or not: has anybody on this list experience with large 
> crawls of the Web (in the order of  hundreds of GB of data)? Which crawler do 
> you use? Last year, I did some happy crawling with heritrix, but recent 
> versions have become mysteriously slow, so I started wondering if there are 
> other open source alternatives...

We're pulling hundreds of GB up to 1-2 TB of web data with Heritrix (1.6 in 
production, and we're working on migrating to 1.10, the bleeding-edge 
source repository version). We're using the Apache Hadoop platform to 
distribute crawls across a number of agents (i.e. CPUs on the same 
machine and on different machines).
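For the curious, the underlying idea of spreading a crawl over agents is
just to partition the seed/host space so each agent works a disjoint
slice. A minimal sketch in plain Java (hypothetical class and file names,
not our actual Heritrix/Hadoop wiring) looks something like this:

  import java.io.*;
  import java.net.URL;

  // Sketch: split a seed list into N per-agent seed files by hashing the
  // host name, so each crawl agent gets a disjoint slice of the host space.
  public class SeedPartitioner {
      public static void main(String[] args) throws IOException {
          int agents = Integer.parseInt(args[0]);           // number of crawl agents
          BufferedReader in = new BufferedReader(new FileReader(args[1])); // seed list
          PrintWriter[] out = new PrintWriter[agents];
          for (int i = 0; i < agents; i++) {
              out[i] = new PrintWriter(new FileWriter("seeds-agent-" + i + ".txt"));
          }
          String line;
          while ((line = in.readLine()) != null) {
              line = line.trim();
              if (line.length() == 0) continue;
              String host = new URL(line).getHost();
              int slot = Math.abs(host.hashCode()) % agents; // same host -> same agent
              out[slot].println(line);
          }
          for (int i = 0; i < agents; i++) out[i].close();
          in.close();
      }
  }

Each agent then crawls from its own seeds-agent-N.txt; in our setup the
actual distribution is handled through Hadoop, but the partitioning
principle is the same.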
The default out-of-the-box settings won't let you get much beyond 100 GB 
before the crawl becomes inefficient; there are *lots* of tuning points 
to make Heritrix scale properly, including the underlying Java engine. 
IMHO, this need to tune for scale isn't unique to Heritrix by any means.
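As a concrete (if generic) example of the Java-engine side of it: give the
crawler JVM a much larger heap and a server-class collector than the
defaults, e.g. something along these lines (the flag values here are
purely illustrative, and how you pass them depends on your launch
script's environment variable):

  JAVA_OPTS="-server -Xms512m -Xmx2048m -XX:+UseConcMarkSweepGC"

Check your own launch script for the exact variable it reads; the point
is simply that the stock heap size is nowhere near enough for a crawl of
this scale.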

Other people I know are using UbiCrawler, but likewise, the 
out-of-the-box experience only gets you so far before more extensive, 
use-case-specific tuning is required.

Baden
