[Sigwac] bootcats / large crawls

Marco Baroni baroni at sslmit.unibo.it
Sun Aug 27 23:19:21 CEST 2006


My impression is that, at the moment, Serge's Perl version has some more 
advanced features (such as heuristic boilerplate stripping), but Andy's 
version is definitely more user-friendly. So I would use Serge's version 
for my own research, but Andy's version with the students...

There is also a Web BootCaT, developed by Jan Pomikalek, that is or will 
become part of the services offered by the Word Sketch Engine, I think.

> rather than yet another comment on the low volume of discussion, here's 
> something to discuss: should we be using original Perl BootCaT,
> or ports to Python (compatible with NLTK) or Java (compatible with
> lots of other stuff, eg aConCorde concordancer) ???

Somewhat related -- or not: does anybody on this list have experience with 
large crawls of the Web (on the order of hundreds of GB of data)? Which 
crawler do you use? Last year, I did some happy crawling with Heritrix, but 
recent versions have become mysteriously slow, so I started wondering if 
there are other open source alternatives...

Regards,

Marco


