[Sigwac] How to retrieve and parse the results of google

Adi Eyal adi at burgercom.co.za
Thu Mar 25 13:05:36 CET 2010


You shouldn't scrape Google because you'll find yourself on their
blacklist very quickly.

Try using their search API instead:


http://code.google.com/apis/ajaxsearch/documentation/

e.g.
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=Paris%20Hilton
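
A minimal Python sketch of how the request above could be built and its
reply turned into a URL list. It assumes the service answers with JSON
shaped like {"responseData": {"results": [{"url": ...}, ...]}} and that
the endpoint still responds as documented. The helper names
(build_search_url, extract_urls, search) are made up for illustration
and are not part of the API.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL above.
BASE_URL = "http://ajax.googleapis.com/ajax/services/search/web"


def build_search_url(query, version="1.0"):
    """Return the request URL for a query (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, matching the example URL.
    params = urlencode({"v": version, "q": query}, quote_via=quote)
    return BASE_URL + "?" + params


def extract_urls(response_text):
    """Pull result URLs out of a JSON response body.

    Assumes the layout {"responseData": {"results": [{"url": ...}]}}.
    Returns an empty list when responseData is missing or null.
    """
    data = json.loads(response_text)
    response_data = data.get("responseData") or {}
    results = response_data.get("results", [])
    return [r["url"] for r in results if "url" in r]


def search(query):
    """Send the query and return the result URLs (needs network access)."""
    with urlopen(build_search_url(query)) as resp:
        return extract_urls(resp.read().decode("utf-8"))
```

Calling search("Paris Hilton") would request the example URL and return
whatever URLs come back, which can then be downloaded one by one. Keeping
the parsing in its own function means it can be checked against a saved
response without sending any queries.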

Adi Eyal

On 25 March 2010 10:33, Ya Wang <hustwangce at googlemail.com> wrote:
> Hello,
>
> We are currently working on a project that collects documents from the
> Internet. A query is sent to Google, and the highly ranked pages need
> to be downloaded and saved to our corpus. I have searched the Internet
> for several days but can't find a tool for this.
>
> For the first step, there are some tools with which I can
> send a query to Google and get the list of result URLs. However, I have
> found no way to parse the HTML pages, and it's even harder to remove the
> noise from the web pages (e.g., advertisements). I don't know if
> anyone here has this kind of experience.
>
> Thanks a lot in advance.
>
> Best regards,
> Ce
> _______________________________________________
> Sigwac mailing list
> Sigwac at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac
>

