[Sigwac] 2nd Call for Participation: EmpiriST Shared Task

Stefan Evert stefanML at collocations.de
Mon Feb 1 23:47:34 CET 2016


2nd Call for Participation: EmpiriST Shared Task on Processing German CMC/Social Media & Web Data

	https://sites.google.com/site/empirist2015/


UPDATED SCHEDULE

20.12.2015        Release of the training data
*14.02.2016*      Extended deadline for team registration
15.02.2016        Release of the evaluation data for the tokenization subtask
19.02.2016        Submission deadline for the tokenization subtask
22.02.2016        Release of the evaluation data for the POS-tagging subtask
26.02.2016        Submission deadline for the POS-tagging subtask
*08.05.2016*      Submission of system description papers (4 pages + references)
12.08.2016        Presentation of systems and task results at WAC-X workshop (ACL 2016, Berlin)

Note that a postponed schedule for the evaluation period was temporarily
shown on the task Web site by mistake. The correct schedule is as shown
above, with evaluation taking place from Feb 15th to Feb 26th.

REGISTRATION

In order to register as a competitor for EmpiriST 2015, please send a message to empirist at collocations.de containing the following information:

 - Team name (will be used to identify submissions)
 - Name(s) of team member(s)
 - Affiliation(s)
 - Subtasks you plan to participate in (CMC Tok, CMC PoS, Web Tok, Web PoS)
 - Contact person and e-mail address

Task participants should also join our Google group at https://groups.google.com/d/forum/empirist2015

DETAILS

The EmpiriST 2015 shared task aims to encourage the developers of NLP
applications to adapt their tools and resources for the processing of
written German discourse in genres of computer-mediated communication
(CMC) – such as chats, forums, wiki talk pages, tweets, blog comments,
social networks, SMS and WhatsApp dialogues – as well as monological
web pages – such as personal or professional blogs, Wikipedia
articles, academic sites, etc.

The shared task is divided into two subtasks (A: tokenization, B: POS
tagging) and two different data sets (CMC subset, web corpora subset).
While our main goal is to foster the development of robust tools that
work well on a wide range of CMC & web genres, teams are allowed to
focus on one subtask or one subset only. The complete manually annotated
training data are now available on the EmpiriST homepage, comprising
approx. 5000 tokens for each subset.

Results and system descriptions will be presented at the WAC-X
workshop co-located with ACL 2016 in Berlin, Germany (11 or 12 August
2016).

For more information, including detailed annotation guidelines and
instructions for participation, see the EmpiriST homepage at

       https://sites.google.com/site/empirist2015/

and join our Google group for updates, questions and discussion:

       https://groups.google.com/d/forum/empirist2015

While EmpiriST is focussed on the annotation of German-language data,
familiarity with German is not essential for participating in the
task. There are sufficient amounts of training data for general
machine learning, domain adaptation and optimization approaches. We
also provide an English summary of the POS tagset and annotation
guidelines.

TASK FORCE

CMC data set:
* Michael Beißwenger (Technische Universität Dortmund)
* Kay-Michael Würzner (Berlin-Brandenburgische Akademie der Wissenschaften)

Web corpora data set:
* Sabine Bartsch (Technische Universität Darmstadt)
* Stefan Evert (Universität Erlangen-Nürnberg)

Contact address:
 empirist at collocations.de

