[Sigwac] bootcats / large crawls
Marco Baroni
baroni at sslmit.unibo.it
Sun Aug 27 23:19:21 CEST 2006
My impression is that Serge's Perl version, at the moment, has some more
advanced features (such as heuristic boilerplate stripping), but Andy's
version is definitely more user-friendly, so I would use Serge's version
for my own research, but Andy's version with the students...
There is also a Web BootCaT, developed by Jan Pomikalek, that is or will
become part of the services offered by the Word Sketch Engine, I think.
> rather than yet another comment on the low volume of discussion, here's
> something to discuss: should we be using original Perl BootCat,
> or ports to Python (compatible with NLTK) or Java (compatible with
> lots of other stuff, eg aConCorde concordancer) ???
Somewhat related -- or not: does anybody on this list have experience with
large crawls of the Web (on the order of hundreds of GB of data)? Which crawler
do you use? Last year, I did some happy crawling with Heritrix, but recent
versions have become mysteriously slow, so I started wondering if there are
other open source alternatives...
Regards,
Marco