[Sigwac] Legal issues with crawling

Eric Atwell eric at comp.leeds.ac.uk
Sat Sep 2 21:33:05 CEST 2006


Andy,

Debbie Elliott collected a corpus of French computing-related texts
which explicitly included statements (in French) granting permission to copy
the text. Her forthcoming PhD thesis shows how she went about
filtering web-crawl results to include only "permitted" texts.

(or talk to her direct to find out the details :-)

eric

On Thu, 31 Aug 2006, Andy Roberts wrote:

> Dear readers,
>
> Does anyone know of, or has anyone perhaps conducted, any studies
> regarding the legal issues of web crawling?
>
> It seems to me that many sites not only have copyright notices, but also
> terms and conditions relating to fair usage of a given site.
>
> There is the Robots Exclusion Protocol whereby savvy web administrators
> place a robots.txt file in the domain root that provides rules about what
> a robot can and cannot do. But this is clearly an opt-in approach and
> doesn't mean that sites without such a file are implying that crawlers
> can have free access to the content within.
>
> Some sites may be happy to permit crawlers for online search engines, as
> they are often beneficial for creating Internet traffic. But the same
> sites may not agree to crawling for off-line purposes. Such conditions
> could be expressed by T&Cs, even though robots.txt leaves the site
> wide-open for crawlers to index.
>
> If one wishes to create a private corpus by crawling the web, how does one
> do it whilst guaranteeing conformance to T&Cs? It seems that one needs
> to compile a directory of sites with T&Cs permissive enough to allow
> this type of crawling, and limit WAC crawlers to only extract data from
> those sites.
>
> Can anyone offer any advice on this issue?
>
> Many thanks,
> Andy
> _______________________________________________
> Sigwac mailing list
> Sigwac at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac
>
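
As a rough illustration of the approach Andy outlines (a directory of sites
whose T&Cs permit corpus crawling, checked in addition to robots.txt), here is
a minimal Python sketch. It is not taken from any existing WAC tool. The host
list, the robots.txt content, the may_crawl helper and the "wac-crawler" user
agent are all hypothetical, and deciding whether a site's T&Cs are permissive
enough is assumed to have been done by hand beforehand.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical directory of hosts whose T&Cs were read by hand and judged
# to allow crawling for an off-line corpus.
PERMITTED_HOSTS = {"example.org"}

# Hypothetical robots.txt content; a real crawler would fetch this file
# from the root of each host instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(url, robots_txt, user_agent="wac-crawler"):
    """Return True only if the host is in the directory AND robots.txt allows the URL."""
    host = urlsplit(url).hostname
    if host not in PERMITTED_HOSTS:
        # Not being listed means no known permission, whatever robots.txt says.
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/docs/a.html",
        "http://example.org/private/a.html",
        "http://unlisted.example.net/docs/a.html",
    ):
        print(url, may_crawl(url, ROBOTS_TXT))
```

The directory check comes first so that a missing or wide-open robots.txt is
never read as permission; robots.txt can only narrow what the directory
already allows.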

-- 
Eric Atwell, Senior Lecturer, Language research group, School of Computing,
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-3435430  FAX: +44-113-3435468  http://www.comp.leeds.ac.uk/eric

