[Sigwac] ATA-AMTA Workshop on Users and Uses for Parallel Corpora

Rockwood, Trent R trockwood at mitre.org
Mon Jun 14 20:54:28 CEST 2010


Call for Papers:
Workshop:  Uses and Users for Parallel Corpora in the Translation Process
Association for Machine Translation in the Americas (AMTA)
November 4-5, 2010 Denver, Colorado
(in conjunction with the American Translators Association Conference)

The purpose of this workshop is to explore how the translation community currently uses, and will use, parallel corpora. A parallel corpus generally refers to a large collection of translated text. These texts are often aligned at the sentence or phrase level and annotated with a specific task in mind, which motivates the choice of markup schema. Bilingual parallel texts are referred to as bitext, whereas parallel corpora can be multilingual (e.g., the many translations of the Bible).

Submissions will address and explore the many reasons why people create corpora, what corpora they would like to see created, how translators are making use of corpora, how translation systems are utilizing corpora according to type and structure, and which privacy and copyright issues accompany these uses, by both machines and people.

Collections of parallel corpora abound, whereas definitions and structuring of corpora seem to vary across sites[1]. The differences involve source text markup, transliteration, target text markup, methods of associating source and target, and alignment.

The processing needed for different applications varies widely according to context and function. For example, how granular do associations between source and target need to be? How much tagging needs to occur (morphological, syntactic, semantic)? What types of alignment are needed for which purposes? How much of the markup is manual and how much is automatic? Furthermore, given the wide range of preprocessing needs, what is the quality check process as part of the overall workflow?

When using corpora to aid in human translation, especially in conjunction with Translation Memory software, which representation standards are being applied, or should be applied (for example TMX, TBX, SRX, xml:tm), and what compatibility issues are encountered?

Finally, what are the standard existing uses for various kinds of parallel corpora, and what are some of the nascent needs that can only be explored once massive amounts of data are collected? Some of these uses and users may simply need smaller amounts of data, but still require backup corpora for validation and extension of data. What can translators expect from parallel corpora? Of what use are these resources for others in the translation industry, whether in government, industry, or academia?

Two of the issues addressed only gingerly in the translation community are privacy and permissions for copyrighted text, particularly when dealing with limited extraction of, say, technical terms and their translations that could in no way be used to reconstruct the sources. A liberal interpretation might claim that such extraction neither constitutes an invasion of privacy (in corpora that consist of logs, chats, emails, etc.) nor infringes copyright. A more conservative interpretation of privacy or infringement might claim that this use does constitute misuse. Most people either overlook these issues or, taking the more cautious approach, find their progress blocked.

The types of questions that this workshop will address include:

* How do you create parallel corpora as part of your workflow?

* In what ways do you use parallel corpora?

* What techniques do you use to evaluate usefulness (for people or systems)?

* How are your corpora processed (alignment, markup, standards)?

* What kinds of quality ratings do you use?

* What are your lessons learned?


Proposers are encouraged to participate by sharing their experiences, projects, needs, and findings as a single contributor or as a member of a panel. Of interest is the workflow of creating or finding, processing, standardizing, and using parallel or comparable corpora to improve the language training of humans and machines. Furthermore, if participants have developed a financial return-on-investment scenario for using parallel corpora, those insights and justifications are also welcome as presentation topics.

Organizers:

Judith L. Klavans - U.S. Government and University of Maryland
Elizabeth McGrath - MITRE Corporation
Trent Rockwood - MITRE Corporation

Important Dates and Schedule:

June 10, 2010 - send out call for papers
July 20, 2010 - papers due
August 20, 2010 - send reviews back to submitters
September 10, 2010 - revisions due back to AMTA for printing
November 4-5, 2010 - workshop dates

Format:
6-page maximum, 11pt minimum, 2-column ACM format:
http://www.acm.org/sigs/publications/proceedings-templates

Submissions and questions to: Trent Rockwood, trockwood at mitre.org



________________________________

[1] A few examples of major collections include the Linguistic Data Consortium (www.ldc.upenn.edu), the British National Corpus (bnc.org), the JRC-Acquis corpora (http://wt.jrc.it/lt/Acquis/), the ELRA MLCC Multilingual and Parallel Corpora (http://catalog.elra.info), Japanese-Chinese corpora (www.nict.go.jp/), and many others.

