[Sigwac] How to retrieve and parse the results of google

Achim Ruopp achimru at gmail.com
Wed Apr 7 17:37:45 CEST 2010


For those of you in the academic community, there is Google's
University Research Program for Google Search:
http://research.google.com/university/search/

Achim

On Tue, Apr 6, 2010 at 9:10 AM, Desilets, Alain
<Alain.Desilets at nrc-cnrc.gc.ca> wrote:
> Hi Adam,
>
> I am in the process of evaluating the terms of use (TOU) of different search APIs, and I have to say that all of them (including Yahoo's) seem to be very restrictive. Can you elaborate on your understanding of the TOU of Yahoo and Bing vs. Google?
>
>
> Looking at the terms of Yahoo, I see this:
>
> ---
> You are permitted to use the Services only for the purpose of incorporating and displaying Web Search Results from such Services as part of a Search Product deployed on your Web site ("Your Offering"). A "Search Product" means a service which provides a response to a request in the form of a search query, keyword, term or phrase (each request, a "Query") served from an index or indexes of data related to Web pages generated, in whole or in part, by the application of an algorithmic search engine.
> ---
>
> Which seems pretty restrictive. I suspect most folks on this list use this API in a way that strays from this particular use case.
>
> I forget the terms of the Google API, but they seemed pretty similar. The reason we chose Yahoo over Google was that the Perl API for Yahoo seemed easier to use than the one for Google.
>
>
>
> Alain
>
> -----Original Message-----
> From: sigwac-bounces at sslmit.unibo.it [mailto:sigwac-bounces at sslmit.unibo.it] On Behalf Of Adam Kilgarriff
> Sent: March-25-10 6:02 AM
> To: sigwac at sslmit.unibo.it
> Cc: hustwangce at googlemail.com
> Subject: Re: [Sigwac] How to retrieve and parse the results of google
>
> And one more hint: our experience is that Yahoo and Bing both have more
> helpful terms of use for their APIs, even if their indexes are smaller, so
> you might want to use them instead (we do)
>
> Adam
>
> On 25 March 2010 09:53, Eros Zanchetta <eros at sslmit.unibo.it> wrote:
>
>> Hi there,
>>
>> if I understand correctly what you're trying to do, it looks like the
>> BootCaT tools might help you (http://bootcat.sslmit.unibo.it/).
>>
>> Regards,
>> Eros Zanchetta
>>
>>
>> On 25/03/2010 09:33, Ya Wang wrote:
>>
>>> Hello,
>>>
>>> Currently we are working on a project that collects documents from the
>>> Internet. A query is sent to Google and the highly ranked pages need
>>> to be downloaded and saved to our corpus. I have searched the Internet
>>> for this information for some days but I can't find a tool for this.
>>>
>>> For the first step, there are some possible tools with which I can
>>> send a query to Google and get the response URL list. However, there
>>> is no way to parse the HTML pages and it's even harder to remove the
>>> noise from the web pages (e.g., advertisements). I don't know if
>>> anyone here has this kind of experience.
>>>
>>> Thanks a lot in advance.
>>>
>>> Best regards,
>>> Ce
>>> _______________________________________________
>>> Sigwac mailing list
>>> Sigwac at sslmit.unibo.it
>>> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac
>>>
>>>
>>>
>>
>>
>>
>
>
>
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================
>
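On the original question in this thread (parsing downloaded HTML pages and removing noise such as advertisements), here is a minimal sketch in Python using only the standard library. It is not how BootCaT or any other tool mentioned above works. The names TextExtractor, html_to_text and keep_long_lines, the list of tags treated as noise, and the five-word threshold are all assumptions made for illustration. Fetching the pages and querying a search API are left out, since those depend on the terms of use discussed above.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is treated as noise.
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer", "aside", "form"}
# Assumption: these tags mark line boundaries in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Normalize whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def keep_long_lines(text, min_words=5):
    """Crude noise filter: keep only lines with at least min_words words."""
    return "\n".join(l for l in text.splitlines() if len(l.split()) >= min_words)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style><script>var x=1;</script></head>"
        "<body><nav>Home | About</nav>"
        "<p>This is the main paragraph of the page.</p>"
        "<p>Buy now</p></body></html>"
    )
    print(keep_long_lines(html_to_text(page)))
```

The line-length filter is deliberately crude: short lines are often menus, buttons or ad captions, but they can also be headings or list items, so the threshold needs tuning per corpus. Badly nested HTML (for example an unclosed nav element) can make this sketch drop more text than intended.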

