[Sigwac] How to retrieve and parse the results of google

Desilets, Alain Alain.Desilets at nrc-cnrc.gc.ca
Thu Mar 25 18:46:08 CET 2010


We use the Yahoo API. It is available for various programming languages; we use the Perl version:

http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm

It works like a charm, and you don't need to do any parsing.

Alain

-----Original Message-----
From: sigwac-bounces at sslmit.unibo.it [mailto:sigwac-bounces at sslmit.unibo.it] On Behalf Of Ya Wang
Sent: March-25-10 4:33 AM
To: sigwac at sslmit.unibo.it
Subject: [Sigwac] How to retrieve and parse the results of google

Hello,

We are currently working on a project that collects documents from the
Internet. A query is sent to Google, and the highly ranked pages need
to be downloaded and saved to our corpus. I have been searching the
Internet for a tool that does this for several days, but I can't find one.

For the first step, there are some tools with which I can
send a query to Google and get the list of result URLs. However, I
have found no way to parse the HTML pages, and it's even harder to
remove the noise from the web pages (e.g., advertisements). Does
anyone here have experience with this?
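
To make the parsing step concrete, here is a minimal sketch in Python
(standard library only) of pulling the visible text out of a downloaded
page. It is only an illustration under assumptions: the names
SKIP_TAGS, TextExtractor and extract_text are made up for this example,
and the list of tags treated as noise is a guess, not a real
advertisement filter.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as noise in this sketch.
# This list is an assumption; real pages usually need more careful cleaning.
SKIP_TAGS = {"script", "style", "noscript", "iframe", "nav", "aside", "footer"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # how many SKIP_TAGS elements we are currently inside
        self.chunks = []      # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling extract_text() on the HTML of each downloaded page gives plain
text that can be saved to the corpus. Advertisements that sit in
ordinary <div> elements will still get through, so this only covers the
easy part of the noise.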

Thanks a lot in advance.

Best regards,
Ce