[Sigwac] Microsoft Web N-gram dataset

Adam Kilgarriff adam at lexmasterclass.com
Wed May 12 08:04:30 CEST 2010


See
http://research.microsoft.com/webngram

Looks interesting, and it's nice to
have the biggies (Google and Microsoft) competing to give us the nicest
resource!

Has anyone tried it yet, and/or will SIGWAC be represented at the workshop
(which is part of SIGIR)?  (I'll attend the paper at NAACL)

adam

---------- Forwarded message ----------
From: Evelyne Viegas <evelynev at microsoft.com>
Date: 11 May 2010 19:55
Subject: [Sigwac] CFP - SIGIR 2010 Web N-gram Workshop - Submission deadline
June 11, 2010
To: "sigwac at sslmit.unibo.it" <sigwac at sslmit.unibo.it>


Web N-gram Workshop
Call for Papers
July 23, 2010 - Geneva, Switzerland
http://research.microsoft.com/webngram
Submission Deadline: June 11, 2010

This workshop will bring together leaders in information retrieval and
language modeling to discuss the challenges in information retrieval and how
language modeling approaches may help address some of these challenges. We
will focus on the use of n-gram models to further research in areas such as
document representation and content analysis, query analysis, retrieval
models and ranking, and spelling, as well as the access to n-grams as an
enabler of experimental design.
Workshop Aims
The aim of the workshop is to bring together a group of leaders in
information retrieval and language modeling to discuss the challenges in
information retrieval and how language modeling approaches may help address
some of these challenges. At the workshop, we will focus on the use of
n-gram models to further research in areas such as document representation
and content analysis (e.g., clustering, classification, information
extraction), query analysis (e.g., query suggestion, query reformulation),
retrieval models and ranking, and spelling as well as the access to n-grams
as an enabler of experimental design.
Often discussed in the research community is the lack of large-scale datasets
and benchmarks to run experiments. This workshop will address this issue by
bringing together the community of researchers who use the n-grams already made
available by Yahoo and Google/LDC, along with a new Web N-gram service
through which Microsoft Research, in partnership with Microsoft Bing, is
providing the research community access to petabytes of Web N-gram data via a
cloud-based platform.
The Web N-gram services directly address the data need by enabling the
community of researchers to create data benchmarks for repeatable
experiments, and by enabling the research community to be at the forefront
of inventions based on real-world, large-scale data.
The Microsoft Web N-gram services, currently in Beta
<http://research.microsoft.com/web-ngram>, will be made available to
participants upon request.
Previous efforts to deliver n-grams to the research community adopted a
data-release approach with a cutoff on the n-gram counts that obscures
long-tail effects; the service-based approach makes those effects available
for further study. Moreover, previous efforts focused on just the
document body, whereas the Web N-gram service includes richer types of
textual content that can engage researchers in new directions.
Another notable difference is the scale: the Web N-gram service provides
access to petabytes of data via services, up to two orders of magnitude
greater than currently available offerings. Finally, by providing regular
data refresh, the Web N-gram service can open up new research directions in
fields where lack of dynamic data has locked academic researchers into
conducting research over static and stale data sets.
Topics
We are now requesting paper submissions for the Web N-gram Workshop.
We encourage researchers to use the Microsoft Web N-gram services to explore
novel applications of language models (e.g., long tail effects) and use of
these data for enhancing the search experience (e.g., use of anchor text as
a proxy to queries). We will also consider other applications such as
machine translation and speech.
If you would like to use the Microsoft Web N-gram services in preparation of
your paper, send an e-mail message to webngram at microsoft.com
to request access.
We also encourage research and experiments using or comparing different
n-gram data sets to ultimately help create, at the workshop, a useful
n-gram baseline for the research community, in terms of n-gram attributes
such as size, access, content, and model types needed for researchers.
For more information, see Submissions
<http://research.microsoft.com/en-us/events/webngram/submissions.aspx>.
Planned Activities
As part of the workshop, experiment results will be presented via talks
(average of 15 minutes per talk, plus 5 minutes of questions and answers)
and with posters and/or demo sessions. In addition, there will be a panel
discussion on providing access to data, with a focus on academic needs,
challenges, and opportunities for industries to provide such data.

Important Dates

 *   Paper submissions due: June 11, 2010
 *   Notifications sent to authors: June 28, 2010
 *   Camera-ready papers due: July 9, 2010
 *   Full-day workshop: July 23, 2010

Organizing Committee
*    Chengxiang Zhai, University of Illinois at Urbana-Champaign
*    David Yarowsky, Johns Hopkins University
*    Evelyne Viegas, Microsoft Research
*    Kuansan Wang, Microsoft Research
*    Stephan Vogel, Carnegie Mellon University
Programme Committee
*    Eytan Adar, University of Michigan
*    Eugene Agichtein, Emory University
*    Thorsten Brants, Google Research
*    Jamie Callan, Carnegie Mellon University
*    Kevin Chang, University of Illinois at Urbana-Champaign
*    Ken Church, Johns Hopkins University
*    Charlie Clarke, University of Waterloo
*    Bruce Croft, University of Massachusetts Amherst
*    Nick Craswell, Microsoft
*    Brian Davison, Lehigh University
*    Bill Dolan, Microsoft Research
*    Georges Dupret, Yahoo! Research
*    Efthimis N. Efthimiadis, University of Washington
*    Michael Gamon, Microsoft Research
*    Alistair Moffat, University of Melbourne
*    Emmanuel Prochasson, Hong Kong University of Science & Technology
*    Jian-Tao Sun, Microsoft Research Asia
*    Amanda Spink, Loughborough University
*    Jurgen Van Gael, University of Cambridge
*    Evelyne Viegas, Microsoft Research
*    Stephan Vogel, Carnegie Mellon University
*    Peng Xu, Google Research
*    David Yarowsky, Johns Hopkins University
*    Hongyuan Zha, Georgia Institute of Technology
*    Chengxiang Zhai, University of Illinois at Urbana-Champaign






-- 
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================

