[Sigwac] Re: bootcats / large crawls

Andy Roberts andyr at comp.leeds.ac.uk
Sun Aug 27 23:52:30 CEST 2006


On Sun, 27 Aug 2006, Marco Baroni wrote:

> My impression is that Serge's Perl version, at the moment, has some more 
> advanced features (such as heuristic boilerplate stripping), but Andy's 
> version is definitely more user-friendly, so I would use Serge's version for 
> my own research, but Andy's version with the students...
>
> There is also a Web BootCaT, developed by Jan Pomikalek, that is or will 
> become part of the services offered by the Word Sketch Engine, I think.
>

I think one of the key issues is widening the choice available to
people in the WAC domain. I'm the first to admit that jBootCat is the
least functional version available at the moment, although I hope that
will improve in the near future. Even so, I, like Marco (I guess), am
quite fond of the low-level, more technical approach to many tasks -
I'm just drawn to *making* front-ends!

Whether it be a set of Perl/Python/<insert your language here> scripts,
a WWW online version or a desktop GUI, each has its pros and cons, and
each will therefore appeal to different users/environments.

> Somewhat related -- or not: has anybody on this list experience with large 
> crawls of the Web (in the order of hundreds of GB of data)? Which crawler do 
> you use? Last year, I did some happy crawling with heritrix, but recent 
> versions have become mysteriously slow, so I started wondering if there are 
> other open source alternatives...
>

You could try experimenting with Nutch: http://lucene.apache.org/nutch.
It's a full-blown web search engine, of which the crawler is (obviously)
one component. I haven't tried it personally - yet!

Regards,
Andy

