[Sigwac] Microsoft Web N-gram dataset

Bill Fletcher whfletcher at verizon.net
Wed May 12 15:59:05 CEST 2010


Hi all,

I first saw this announcement on the SIGIR list several weeks ago.  I have tried it, with varying success.  The two papers at

http://research.microsoft.com/en-us/collaboration/focus/cs/bingiton.aspx

show interesting applications of the data.

The description of the service reads more like a blueprint for future development than an account of what it currently offers, e.g.

1. still based on a year-old dataset (rather than constantly refreshed as promised)

2. largest value of n for body text is 3 (versus 4 for titles and anchors)

3. to me the difference between "probability" and "conditional probability" is unclear (can anyone explain it to me? I'm no mathematician)

4. service is intermittent -- I have had it return 500 internal server error on evenings and weekends, even with their sample code (C#)

5. only documents classified by Bing as en-us are included, but these do contain text in other languages -- see the German and Chinese examples in the papers
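
On point 3, a minimal sketch of the usual distinction in n-gram language modelling, assuming the service uses the two terms in their standard sense (its documentation would have to confirm that). The joint probability scores the whole word sequence; the conditional probability scores only the last word, given the words before it. The counts below are invented toy numbers, not Web N-gram data, and the estimates are plain relative frequencies with no smoothing.

```python
# Hypothetical toy counts, invented for illustration (not Web N-gram data).
TOTAL_TOKENS = 1_000_000
UNIGRAM_COUNTS = {"new": 2_000, "york": 500}
BIGRAM_COUNTS = {("new", "york"): 400}


def conditional_probability(prev_word, word):
    """P(word | prev_word): how likely `word` is, given that `prev_word` came just before."""
    return BIGRAM_COUNTS[(prev_word, word)] / UNIGRAM_COUNTS[prev_word]


def joint_probability(prev_word, word):
    """P(prev_word, word) via the chain rule: P(prev_word) * P(word | prev_word)."""
    p_prev = UNIGRAM_COUNTS[prev_word] / TOTAL_TOKENS
    return p_prev * conditional_probability(prev_word, word)


p_cond = conditional_probability("new", "york")  # 400 / 2000
p_joint = joint_probability("new", "york")       # (2000 / 1000000) * (400 / 2000)

print(f"P(york | new) = {p_cond:.4f}")
print(f"P(new york)   = {p_joint:.4f}")
```

With these toy numbers the conditional probability is far larger than the joint one, because the joint probability also has to account for how rare the first word is.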

We're in the middle of semester exams and I haven't had a chance to work with it enough to get a paper together by the deadline.

I agree wholeheartedly with Adam that it's great to see the biggies competing for our attention -- just hope they make a long-term commitment rather than getting us hooked and pulling the rug out from under us with no warning!

I have written some PHP code to query it.  I'll polish and post it when I get a chance. For some reason the PHP SOAP extensions and XML parsers don't agree with MS' formats -- the request confuses the server, and the response won't load in PHP.  I had to run their C# sample code and examine the successful conversations with a packet sniffer, then write my own workarounds in PHP.  I'd like to add a query interface to webascorpus.org when the system is more reliable.
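
The workaround described above (building the request by hand and parsing the reply directly, instead of going through a SOAP library) might look roughly like this, sketched in Python rather than PHP. Everything service-specific here is a hypothetical placeholder: the operation name `GetProbability`, the `phrase` parameter, the result element, and the `SERVICE_NS` namespace are not the service's real names, which would have to be read off a successful exchange. Only the SOAP 1.1 envelope namespace is standard. No network call is made; a canned reply stands in for the server.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/ngram"  # HYPOTHETICAL placeholder, not the real service namespace


def build_envelope(operation, params):
    """Build a SOAP 1.1 request by hand, so its exact shape is under our control."""
    fields = "".join(
        f"<{name}>{escape(str(value))}</{name}>" for name, value in params.items()
    )
    return (
        '<?xml version="1.0" encoding="utf-8"?>'
        f'<s:Envelope xmlns:s="{SOAP_NS}"><s:Body>'
        f'<{operation} xmlns="{SERVICE_NS}">{fields}</{operation}>'
        "</s:Body></s:Envelope>"
    )


def extract_result(response_xml, result_tag):
    """Return the text of one result element in a SOAP reply, or None if it is absent."""
    root = ET.fromstring(response_xml)
    node = root.find(f".//{{{SERVICE_NS}}}{result_tag}")
    return None if node is None else node.text


# Canned reply in the same HYPOTHETICAL shape, standing in for a real server response.
sample_response = (
    f'<s:Envelope xmlns:s="{SOAP_NS}"><s:Body>'
    f'<GetProbabilityResponse xmlns="{SERVICE_NS}">'
    "<GetProbabilityResult>-3.4</GetProbabilityResult>"
    "</GetProbabilityResponse></s:Body></s:Envelope>"
)

request_xml = build_envelope("GetProbability", {"phrase": "new york"})
result = extract_result(sample_response, "GetProbabilityResult")
print(request_xml)
print(result)
```

Writing the envelope as a plain string makes it easy to compare byte for byte against a request known to work, which is the point of the packet-sniffer approach.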

Bill

  - - - -

Date: Wed, 12 May 2010 07:04:30 +0100
From: Adam Kilgarriff <adam at lexmasterclass.com>
Subject: [Sigwac] Microsoft Web N-gram dataset
To: sigwac at sslmit.unibo.it

See
http://research.microsoft.com/webngram

Looks interesting, and it's nice to
have the biggies (Google and Microsoft) competing to give us the nicest
resource!

Has anyone tried it yet, and/or will SIGWAC be represented at the workshop
(which is part of SIGIR)?  (I'll attend the paper at NAACL)

adam


