[Sigwac] How to retrieve and parse the results of Google

Ya Wang hustwangce at googlemail.com
Thu Mar 25 09:33:00 CET 2010


Hello,

We are currently working on a project that collects documents from the
Internet. A query is sent to Google, and the highly ranked pages need
to be downloaded and saved to our corpus. I have searched the Internet
for several days but cannot find a tool that does this.

For the first step, there are some tools with which I can
send a query to Google and get the list of result URLs. However, I have
not found a way to parse the downloaded HTML pages, and it is even
harder to remove the noise from them (e.g., advertisements). Does
anyone here have experience with this?
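
To make the second step concrete, here is a rough sketch of the naive
approach we could write ourselves, using only Python's standard
html.parser module. The names TextExtractor, extract_text and min_words
are my own and purely illustrative, and the "drop short chunks" rule is
an assumed heuristic for noise such as ad links, not a tested method.
Downloading the pages is left out.

```python
from html.parser import HTMLParser

# Tags whose content is never visible page text.
SKIP_TAGS = {"script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace; keep only text outside skipped tags.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html, min_words=5):
    """Return text chunks with at least min_words words.

    The word-count threshold is a crude, assumed noise filter.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return [c for c in parser.chunks if len(c.split()) >= min_words]
```

This splits text at every tag, so a sentence containing inline markup
(bold text, links) is broken into short pieces and may be dropped by the
threshold. That is part of why I am looking for a proper tool.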

Thanks a lot in advance.

Best regards,
Ce

