[Sigwac] bootcats / large crawls
Baden Hughes
badenh at csse.unimelb.edu.au
Mon Aug 28 03:08:30 CEST 2006
> Somewhat related -- or not: has anybody on this list experience with large
> crawls of the Web (in the order of hundreds of GB of data)? Which crawler do
> you use? Last year, I did some happy crawling with heritrix, but recent
> versions have become mysteriously slow, so I started wondering if there are
> other open source alternatives...
We're pulling hundreds of GB to 1-2 TB of web data with Heritrix (1.6 in
production, but we're migrating to 1.10, the bleeding-edge source
repository version). We're using the Apache Hadoop
platform to distribute crawls across a number of agents (CPUs on the
same machine and on different machines); a sketch of the pattern follows.
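For the curious, here is a minimal sketch of that distribution pattern.
It is not our actual setup, and it's written against a later Hadoop
MapReduce API purely for illustration: a map-only job shards a
one-URL-per-line seed list across agents, and a simple HEAD request
stands in for handing each URL to a local crawler instance.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DistributedFetch {

        // Each map task gets a slice of the seed file (one URL per line),
        // so the fetch load spreads across however many agents Hadoop runs.
        public static class FetchMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String url = line.toString().trim();
                if (url.isEmpty()) {
                    return;
                }
                String status;
                try {
                    // Placeholder for the real crawler work: a HEAD request
                    // stands in for handing the URL to a local agent.
                    HttpURLConnection conn =
                            (HttpURLConnection) new URL(url).openConnection();
                    conn.setRequestMethod("HEAD");
                    conn.setConnectTimeout(5000);
                    conn.setReadTimeout(5000);
                    status = Integer.toString(conn.getResponseCode());
                } catch (IOException e) {
                    status = "FAILED: " + e.getMessage();
                }
                ctx.write(new Text(url), new Text(status));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "distributed-fetch");
            job.setJarByClass(DistributedFetch.class);
            job.setMapperClass(FetchMapper.class);
            job.setNumReduceTasks(0);  // map-only: fetch and log, no reduce
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // seed list
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // status log
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would run it with something like "hadoop jar fetch.jar
DistributedFetch seeds/ crawl-log/" (jar name and paths hypothetical).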
The default out-of-the-box settings won't let you get much beyond 100 GB
before the crawl becomes inefficient; there are
*lots* of tuning points to make Heritrix scale properly, including the
underlying Java engine. IMHO, this tuning-for-scale requirement isn't
unique to Heritrix by any means.
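As one small example on the Java engine side (placeholder values, and
assuming your launch script passes JAVA_OPTS through to the JVM; check
yours), you might give Heritrix a larger heap and a concurrent collector:

    export JAVA_OPTS="-server -Xmx2048m -XX:+UseConcMarkSweepGC"
    ./bin/heritrix

The right numbers depend entirely on your hardware and crawl profile.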
Other people I know are using UbiCrawler, but likewise, the
out-of-the-box experience only gets you so far before more extensive
use-case tuning is required.
Baden