[Sigwac] How to retrieve and parse the results of google

Adam Kilgarriff adam at lexmasterclass.com
Thu Mar 25 11:02:28 CET 2010


And one more hint: in our experience, Yahoo and Bing both have more
helpful terms of use for their APIs, even if their indexes are smaller, so
you might want to use them instead (we do).
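
On the parsing and noise-removal question in the original message quoted
below, here is a minimal sketch using only Python's standard library. It is
not how BootCaT or any search API does it. The tag lists, the names
TextExtractor and extract_text, and the min_words threshold are all
illustrative assumptions: the idea is simply to drop elements that rarely
hold running prose and to discard very short text blocks such as menu items
or ad captions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text per block element, skipping non-prose elements."""

    # Assumed lists: tune them for the pages you actually crawl.
    SKIP = {"script", "style", "noscript", "nav", "header", "footer", "aside"}
    BLOCK = {"p", "div", "li", "td", "th", "tr", "br", "pre", "blockquote",
             "h1", "h2", "h3", "h4", "h5", "h6", "section", "article"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.current = []     # text pieces of the block being read
        self.blocks = []      # finished text blocks

    def _flush(self):
        # Join the pieces and normalise whitespace before storing the block.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_text(html, min_words=5):
    """Return text blocks with at least min_words words (a crude noise filter)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if len(b.split()) >= min_words]
```

Calling extract_text(page_source) on a downloaded page returns a list of
text blocks. A word-count threshold is a blunt filter and will also drop
short legitimate lines such as headings, so treat it as a starting point
only.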

Adam

On 25 March 2010 09:53, Eros Zanchetta <eros at sslmit.unibo.it> wrote:

> Hi there,
>
> if I understand correctly what you're trying to do, it looks like the
> BootCaT tools might help you (http://bootcat.sslmit.unibo.it/).
>
> Regards,
> Eros Zanchetta
>
>
> On 25/03/2010 09:33, Ya Wang wrote:
>
>> Hello,
>>
>> We are currently working on a project that collects documents from the
>> Internet. A query is sent to Google, and the highly ranked pages need
>> to be downloaded and saved to our corpus. I have searched the Internet
>> for some days but have not been able to find a tool for this.
>>
>> For the first step, there are some possible tools with which I can
>> send a query to Google and get back the list of result URLs. However,
>> I have found no way to parse the HTML pages, and it is even harder to
>> remove the noise from the web pages (e.g., advertisements). I don't
>> know if anyone here has experience with this.
>>
>> Thanks a lot in advance.
>>
>> Best regards,
>> Ce
>> _______________________________________________
>> Sigwac mailing list
>> Sigwac at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac



-- 
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================

