[Sigwac] How to retrieve and parse the results of google

Fri Mar 26 06:10:36 CET 2010

Thanks a lot for all the useful suggestions!

Best regards,
Wang

2010/3/26 Desilets, Alain <Alain.Desilets at nrc-cnrc.gc.ca>:
> We use the Yahoo API. It is available in various languages. We use the Perl version:
>
> http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm
>
> It works like a charm, and you don't need to do any parsing.
>
> Alain
>
> -----Original Message-----
> From: sigwac-bounces at sslmit.unibo.it [mailto:sigwac-bounces at sslmit.unibo.it] On Behalf Of Ya Wang
> Sent: March-25-10 4:33 AM
> To: sigwac at sslmit.unibo.it
> Subject: [Sigwac] How to retrieve and parse the results of google
>
> Hello,
>
> We are currently working on a project that collects documents from the
> Internet. A query is sent to Google, and the highly ranked pages need
> to be downloaded and saved to our corpus. I have searched the Internet
> for several days but have not found a tool that does this.
>
> For the first step, there are some tools with which I can send a
> query to Google and get the list of result URLs. However, I have not
> found a way to parse the downloaded HTML pages, and it is even harder
> to remove the noise from them (e.g., advertisements). Does anyone
> here have experience with this?
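
For the parsing step described above, a minimal sketch using only Python's standard-library `html.parser` might look like the following. It is not something proposed in this thread. The names `TextExtractor` and `extract_text` and the `min_words` threshold are assumptions made for illustration, and dropping short text blocks is only a crude stand-in for real advertisement and boilerplate removal.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, one string per block element."""

    # Elements whose content is never visible text.
    SKIP = {"script", "style", "noscript"}
    # Elements treated as block boundaries (an illustrative, incomplete set).
    BLOCK = {"p", "div", "li", "td", "th", "br", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a SKIP element
        self._buf = []         # text fragments of the current block
        self.blocks = []       # finished text blocks

    def _flush(self):
        # Join the fragments, collapse whitespace, and keep non-empty blocks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_text(html, min_words=5):
    """Return text blocks with at least min_words words (assumed heuristic)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if len(b.split()) >= min_words]
```

Calling `extract_text(page_source)` on a downloaded page returns a list of text blocks. Short blocks such as menu entries or "Buy now" links are dropped by the word-count filter, which would need tuning on real pages.
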
>
> Thanks a lot in advance.
>
> Best regards,
> Ce
> _______________________________________________
> Sigwac mailing list
> Sigwac at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac