[Sigwac] Spiderling crawler

Adam Kilgarriff adam at lexmasterclass.com
Tue Jun 5 08:32:05 CEST 2012


All,

Following the presentations of the new Spiderling linguistic crawler (work
by Jan Pomikalek and Vit Suchomel, presented by Pavel Rychly and me) at
SIGWAC-7 in Lyon and at LREC in Istanbul, several people asked about its
availability.

There is quite a bit more work to do to package it up so that it works
'out of the box'. We are hoping to get it ready for distribution (probably
via Google Code) in December.

I claimed "a billion words a day": sometimes we are able to build a
billion words of cleaned, deduplicated text suitable for linguistic
research in a day. This depends on factors including:

* Language: it works much faster for big languages than for small ones (and
for very small languages, BootCaT-style methods may do better)
* Internet connection: we have been running it on the Czech academic network
with a 2.5 Gb/s connection to the USA and several slower links to Europe.
* CPU: the crawler is able to fully utilize ca. 12 CPU cores (for big
languages), mainly for embedded post-processing
* Memory: the crawler is not optimized to save memory, as we did not need
it to be. We have used up to 300 GB of RAM during large crawls.

Adam Kilgarriff and Vit Suchomel

-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================





