[Sigwac] CfP: 12th Web as Corpus Workshop @ LREC 2020 in Marseille, France

Sat Jan 18 22:04:42 CET 2020

WAC-XII is endorsed by the Special Interest Group of the ACL on Web as
Corpus (SIGWAC)

Workshop website: https://www.sigwac.org.uk/wiki/WAC-XII

*Description*

For almost fifteen years, the ACL SIGWAC, and most notably the Web as
Corpus (WAC) workshops, have served as a platform for researchers
interested in the compilation, processing and use of web-derived corpora
as well as computer-mediated communication. Past workshops were
co-located with major conferences on corpus linguistics and/or
computational linguistics (such as ACL, EACL, Corpus Linguistics, LREC,
NAACL, WWW).

In corpus/theoretical linguistics, the World Wide Web has become
increasingly popular as a source of linguistic evidence, especially in
the face of data sparseness or the lack of variation in traditional
corpora of written language. In lexicography, web data have become a
major and well-established resource with dedicated research data and
specialised tools. In other areas of theoretical linguistics, the
adoption rate of web corpora has been slower but steady. Furthermore,
some completely new areas of linguistic research dealing exclusively
with web (or similar) data have emerged, such as the construction and
utilisation of corpora based on short messages. Another example is the
(manual or automatic) classification of web texts by genre, register, or
– more generally speaking – “text type”, as well as topic area. In
computational linguistics, web corpora have become an established source
of data for the creation of language models and word embeddings, and for
all types of machine learning.

The twelfth Web as Corpus workshop (WAC-XII) looks at the past, present,
and future of web corpora given the fact that large web corpora are
nowadays provided mostly by a few major initiatives and/or companies,
and the diversity of the early years appears to have faded slightly.
We also acknowledge that alternative sources of data (such as data from
Twitter and similar platforms) have emerged. Some of them, such as
linguistic data from social media and other forms of the deep web, are
available only to large companies and their affiliates. At the same
time, gathering interesting and/or relevant web data (web crawling) is
becoming an ever more intricate task as the nature of the data offered
on the web changes (for example, the death of forums in favour of more
closed platforms).

We intend WAC-XII to be a platform for the discussion of some
fundamental issues in current web corpus construction. Some of the key
issues that we see for the future of web corpora are:

- Can the requirements of all of the aforementioned groups of users
(theoretical linguists, lexicographers, computational linguists, etc.)
be met by the same type of web corpora, or should web corpora be
tailored to the specific needs of different groups of users?
- How has the composition of the web (and subsequently that of web
corpora) changed? Are web data still as relevant and interesting as they
were fifteen years ago?
- What is the impact of changes in web data production (e.g., CMS and
microtexts published on more restricted platforms), and how can it be
addressed in the data collection process?
- Is there still an interest in fundamental research on the linguistic
nature and composition of the web?
- What is the level of quality of web data relative to the
above-mentioned tasks for which they are used?

*Call for papers*

The twelfth Web as Corpus workshop (WAC-XII) aims to unite (web) corpus
creators and all types of (web) corpus users from corpus/theoretical
linguistics, computational linguistics, cognitive science, etc. We
invite papers dealing with the fundamental questions mentioned above. In
addition, we invite papers dealing with the whole range of applied and
fundamental topics from both corpus/theoretical linguistics and
computational linguistics which have characterised WAC workshops,
including but not limited to:

- Data selection and collection (discovery and/or crawling)
- Linguistic post-processing of web data
- Analysis of web corpora (assessment of the distribution of genres,
registers, topics, etc.)
- Comparison of web corpus data with other types of corpus data
(traditional corpora, linguistic data from social media, etc.)
- Case studies in corpus/theoretical or computational linguistics where
web data have been used
- Case studies in digital lexicography, for example using SketchEngine
- Research specifically related to the validity of web data in
corpus/theoretical and computational linguistics
- Web data in psycholinguistic research and cognitive modelling
- Web corpora for language models and word embeddings

*Format and submission*

Like LREC 2020, WAC-XII asks for full papers of 4 to 8 pages (plus
additional pages for references if needed), which must strictly follow
the LREC stylesheet available on the LREC 2020 website. No distinction
between long and short papers will be made, but papers should have an
appropriate length given their content. Appropriate time slots for oral
presentations will be allocated according to the length of each paper.
Papers must be submitted through START [URL tba] and will undergo blind
peer review. All papers will be published in the LREC 2020 proceedings.

*Important dates*

Submission deadline: Sunday, 16 February 2020 at 24:00 GMT-12
Notification of acceptance: Friday, 13 March 2020 at 22:00 GMT+1
Camera-ready manuscript due date: Friday, 27 March 2020 at 24:00 GMT-12
Workshop date: afternoon session of Saturday, 16 May 2020

*Organizers*

Adrien Barbaresi (BBAW Berlin)
Felix Bildhauer (IDS Mannheim)
Roland Schäfer (Humboldt-Universität zu Berlin, SFB 1412)
Egon Stemle (Eurac Research)