[Sigwac] How to retrieve and parse the results of google

Desilets, Alain Alain.Desilets at nrc-cnrc.gc.ca
Tue Apr 6 15:10:14 CEST 2010


Hi Adam,

I am in the process of evaluating the terms of use of different search APIs, and I have to say that all of them (including Yahoo) seem to be very restrictive. Can you elaborate on your understanding of the TOU of Yahoo and Bing vs Google?

 
Looking at the terms of Yahoo, I see this:

---
You are permitted to use the Services only for the purpose of incorporating and displaying Web Search Results from such Services as part of a Search Product deployed on your Web site ("Your Offering"). A "Search Product" means a service which provides a response to a request in the form of a search query, keyword, term or phrase (each request, a "Query") served from an index or indexes of data related to Web pages generated, in whole or in part, by the application of an algorithmic search engine.
---

Which seems pretty restrictive. I suspect most folks on this list use this API in a way that strays from this particular use case.

I forget the terms of the Google API, but they seemed pretty similar. The reason we chose Yahoo over Google was that the Perl API for Yahoo seemed easier to use than the one for Google.



Alain

-----Original Message-----
From: sigwac-bounces at sslmit.unibo.it [mailto:sigwac-bounces at sslmit.unibo.it] On Behalf Of Adam Kilgarriff
Sent: March-25-10 6:02 AM
To: sigwac at sslmit.unibo.it
Cc: hustwangce at googlemail.com
Subject: Re: [Sigwac] How to retrieve and parse the results of google

And one more hint: our experience is that Yahoo and Bing both have more
helpful terms of use for their APIs, even if their indexes are smaller, so
you might want to use them instead (we do)

Adam

On 25 March 2010 09:53, Eros Zanchetta <eros at sslmit.unibo.it> wrote:

> Hi there,
>
> if I understand correctly what you're trying to do, it looks like the
> BootCaT tools might help you (http://bootcat.sslmit.unibo.it/).
>
> Regards,
> Eros Zanchetta
>
>
> On 25/03/2010 09:33, Ya Wang wrote:
>
>> Hello,
>>
>> Currently we are working in the project collecting documents from the
>> Internet. A query is sent to google and the highly ranked pages need
>> to be downloaded and saved to our corpus. I have searched the Internet
>> for this information for some days but I can't find a tool for this.
>>
>> For the first step, there are some possible tools with which I can
>> send a query to google and get the response URL list. However, there
>> is no way to parse the HTML pages and it's even harder to remove the
>> noise from the web pages (i.e., advertisements). I don't know if
>> anyone here has this kind of experience.
>>
>> Thanks a lot in advance.
>>
>> Best regards,
>> Ce
>> _______________________________________________
>> Sigwac mailing list
>> Sigwac at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac
>>
>>
>>
>
>



-- 
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================