[Sigwac] Legal issues with crawling
Andy Roberts
andyr at comp.leeds.ac.uk
Thu Aug 31 14:52:09 CEST 2006
Dear readers,
Does any one know of, or perhaps conducted themselves, any studies
regarding the legal issues of web crawling?
It seems to me that many sites not only have copyright notices, but also
terms and conditions relating to fair usage of a given site.
There is the Robots Exclusion Protocol, whereby savvy web administrators
place a robots.txt file in the domain root that provides rules about what
a robot can and cannot do. But this is clearly an opt-in approach, and
the absence of such a file does not imply that crawlers have free access
to the content within.
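For what it's worth, checking robots.txt programmatically is straightforward; here is a minimal sketch using Python's standard-library parser (the robots.txt content, user-agent name, and URLs are hypothetical examples, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # in practice, rp.set_url(...) + rp.read() fetches the live file

# A well-behaved crawler consults the parser before each fetch:
rp.can_fetch("MyCorpusBot", "http://www.example.com/private/page.html")  # False
rp.can_fetch("MyCorpusBot", "http://www.example.com/index.html")         # True
```

Of course, this only enforces what robots.txt says; as noted above, it says nothing about T&Cs that live elsewhere on the site.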
Some sites may be happy to permit crawlers for online search engines, as
they are often beneficial for generating Internet traffic. But the same
sites may not agree to crawling for off-line purposes. Such conditions
could be expressed by T&Cs, even though robots.txt leaves the site
wide-open for crawlers to index.
If one wishes to create a private corpus by crawling the web, how do you
do it whilst guaranteeing conformance to T&Cs? It seems that one needs
to compile a directory of sites with T&Cs permissive enough to allow
this type of crawling, and limit WAC crawlers to only extract data from
those sites.
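Mechanically, restricting a crawler to such a directory is simple; the hard part is the legal vetting that builds the list. A minimal sketch (the domain names and function are hypothetical, purely to illustrate the allowlist idea):

```python
from urllib.parse import urlparse

# Hand-vetted domains whose T&Cs are known to permit offline corpus use
# (hypothetical examples).
PERMISSIVE_DOMAINS = {"example.org", "opencontent.example.net"}

def may_crawl(url):
    """Return True only if the URL's host is on the vetted allowlist
    (or is a subdomain of an allowlisted domain)."""
    host = urlparse(url).hostname or ""
    return (host in PERMISSIVE_DOMAINS
            or host.endswith(tuple("." + d for d in PERMISSIVE_DOMAINS)))

may_crawl("http://www.example.org/texts/novel.html")  # True
may_crawl("http://random-site.com/page.html")         # False
```

The crawler would call such a check before enqueueing any URL, so out-of-list links discovered during the crawl are simply dropped.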
Can anyone offer any advice on this issue?
Many thanks,
Andy