[Sigwac] How to retrieve and parse the results of google

Eros Zanchetta eros at sslmit.unibo.it
Thu Mar 25 10:53:01 CET 2010


Hi there,

If I understand correctly what you're trying to do, the BootCaT tools
might help you (http://bootcat.sslmit.unibo.it/).
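
For the HTML parsing step itself, a minimal Python sketch might look like
the following. It uses only the standard library's html.parser module; the
names TextExtractor and html_to_text are made up for illustration, and this
is not meant to show how BootCaT works internally. It only drops script and
style content, so real noise such as advertisements would still need a
separate boilerplate-removal step.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping elements that never hold page content."""

    # Hypothetical choice of tags to ignore; extend as needed.
    SKIP = {"script", "style", "noscript", "iframe"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = "<html><body><p>Hello</p><script>var x = 1;</script></body></html>"
    print(html_to_text(sample))
```

Each downloaded page would be passed to html_to_text as a string before
being saved to the corpus.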

Regards,
Eros Zanchetta

On 25/03/2010 09:33, Ya Wang wrote:
> Hello,
>
> We are currently working on a project that collects documents from the
> Internet. A query is sent to Google, and the highly ranked pages need
> to be downloaded and saved to our corpus. I have searched the Internet
> for several days but cannot find a tool that does this.
>
> For the first step, there are some tools with which I can send a
> query to Google and get back the list of result URLs. However, I have
> found no way to parse the HTML pages, and it is even harder to remove
> the noise from the web pages (e.g., advertisements). Does anyone here
> have experience with this?
>
> Thanks a lot in advance.
>
> Best regards,
> Ce