[Sigwac] Legal issues with crawling

Andy Roberts andyr at comp.leeds.ac.uk
Thu Aug 31 14:52:09 CEST 2006


Dear readers,

Does anyone know of, or has perhaps conducted themselves, any studies
regarding the legal issues of web crawling?

It seems to me that many sites not only have copyright notices, but also
terms and conditions relating to fair usage of a given site.

There is the Robots Exclusion Protocol, whereby savvy web administrators
place a robots.txt file in the domain root that provides rules about what
a robot can and cannot do. But this is clearly an opt-in approach, and
the absence of such a file does not mean that a site is granting crawlers
free access to the content within.
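A minimal sketch of such a check (assuming Python's standard
urllib.robotparser module; the user-agent string is made up) might look
something like this:

    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "WACBot/0.1"  # hypothetical crawler user-agent

    def is_allowed(url):
        # Build the robots.txt URL at the domain root
        parts = urlparse(url)
        robots_url = "%s://%s/robots.txt" % (parts.scheme, parts.netloc)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        # True if the rules permit this user-agent to fetch the URL
        return rp.can_fetch(USER_AGENT, url)

    print(is_allowed("http://example.com/some/page.html"))

Of course, this only answers what robots.txt permits, not what the site's
T&Cs permit.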

Some sites may be happy to permit crawlers for online search engines, as
they are often beneficial for generating traffic. But the same sites may
not agree to crawling for offline purposes. Such conditions could be
expressed in the T&Cs, even though robots.txt leaves the site wide open
for crawlers to index.

If one wishes to create a private corpus by crawling the web, how does
one do it whilst guaranteeing conformance to T&Cs? It seems that one
needs to compile a directory of sites whose T&Cs are permissive enough to
allow this type of crawling, and limit WAC crawlers to extracting data
only from those sites.
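As a rough sketch of that kind of restriction (the allow-list and URLs
here are purely hypothetical, and the list itself would have to be
compiled by reading each site's T&Cs by hand):

    from urllib.parse import urlparse

    # Hypothetical allow-list of domains whose T&Cs have been vetted
    ALLOWED_DOMAINS = {"example.ac.uk", "example.org"}

    def in_allowed_domain(url):
        # Keep only URLs whose host is (a subdomain of) a vetted domain
        host = urlparse(url).netloc.lower()
        return any(host == d or host.endswith("." + d)
                   for d in ALLOWED_DOMAINS)

    # The crawler's frontier would be filtered before any fetching
    frontier = ["http://example.org/articles/1.html",
                "http://unknown-site.com/page.html"]
    print([u for u in frontier if in_allowed_domain(u)])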

Can anyone offer any advice on this issue?

Many thanks,
Andy

