On Organizing Information: June 2014

I attended the Language Resources and Evaluation Conference (LREC 2014) in Reykjavik, Iceland last week. Here are some interesting papers/posters I came across.

LREC is a good conference for getting exposure to the tools and datasets available across many areas of NLP research. I personally found useful tools/datasets for machine translation, crowdsourcing and grammar correction.

The conference also emphasizes multilinguality, and hence there were a lot of papers/posters showcasing resource development in many languages. A lot of these resources are made available as open source software/open datasets.

Along with the oral sessions, there were many poster sessions. I found the poster sessions more interesting and interactive, and browsing the posters made it possible to cover much more material.

The following is a small set of papers/posters I found interesting - primarily in the areas of SMT, grammar correction and crowdsourcing, plus some really cool ideas. You may want to look through the proceedings for literature relevant to your work:

Machine Translation

Aligning Parallel Texts with InterText: Pavel Vondřička

A tool for automatic alignment of parallel text with post-alignment manual correction. The server version can manage projects and teams. The automatic alignment is based on the 'hunalign' algorithm. Both the server and desktop versions are open source. We could explore this tool for our corpus alignment/cleaning activities.

Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus: Raivis Skadiņš and Jörg Tiedemann and Roberts Rozis and Daiga Deksne

Describes the construction of a new parallel corpus between various European languages from the EU Bookshop service available online. The paper describes the use of various tools and techniques, which is quite informative. The use of a language model to correctly extract text from PDF is interesting. A comparison of hunalign, Microsoft Bilingual Aligner (MBA) and Vanilla shows that MBA outperforms the rest.

The AMARA Corpus: Building Parallel Language Resources for the Educational Domain: Ahmed Abdelali, Francisco Guzman, Hassan Sajjad and Stephan Vogel

The paper describes the construction of a parallel corpus in the educational domain using subtitles gathered from various sources like Khan Academy, TED, Udacity, Coursera, etc. The translations were obtained via AMARA, a collaborative platform for subtitle translation that many of these projects use. The corpus also contains parallel data between Hindi and many foreign languages (a few thousand sentences each). This could be useful to study translation between Indian and foreign languages using bridge languages.
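As a rough illustration of the bridge-language idea, here is a minimal sketch that composes two toy translation tables through a pivot language by summing over pivot phrases, p(t|s) = Σ_p p(p|s) · p(t|p). The tables, the probabilities and the function name `triangulate` are all made up for illustration and do not come from the AMARA paper; a real system would work on full phrase tables rather than hand-written dictionaries.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Compose two translation tables through a pivot (bridge) language.

    Each table maps a phrase to {translation: probability}.
    Returns p(tgt | src) = sum over pivot of p(pivot | src) * p(tgt | pivot).
    """
    src_to_tgt = {}
    for src, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for pivot, p_pivot in pivots.items():
            # Pivot phrases with no known target translation contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                scores[tgt] += p_pivot * p_tgt
        if scores:
            src_to_tgt[src] = dict(scores)
    return src_to_tgt


# Toy, hand-made tables (hypothetical values): Hindi -> English -> French.
# "kitab" is a romanized stand-in for the Hindi word for "book".
hi_en = {"kitab": {"book": 0.9, "volume": 0.1}}
en_fr = {
    "book": {"livre": 0.8, "bouquin": 0.2},
    "volume": {"volume": 0.6, "tome": 0.4},
}

hi_fr = triangulate(hi_en, en_fr)
best = max(hi_fr["kitab"], key=hi_fr["kitab"].get)
print(best, round(hi_fr["kitab"][best], 2))
```

With only a few thousand sentences per language pair, the direct Hindi-foreign tables would be sparse, which is why going through a better-resourced bridge such as English looks attractive here.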

Machine Translationness: Machine-likeness in Machine Translation Evaluation: Joaquim Moré and Salvador Climent

Machine Translation for Subtitling: A Large-Scale Evaluation -- Thierry Etchegoyhen, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou, Jie Jiang, Gerard Van Loenhout, Arantza Del Pozo, Mirjam Sepesy Maucec, Anja Turner and Martin Volk

Describes an evaluation of the use of SMT for automatic subtitling. The evaluation involves human ratings, automatic metrics and measures of productivity improvement for post-editing. On all counts, the subtitling shows good quality.

On the Origin of Errors: a Fine-Grained Analysis of MT and PE Errors and their Relationship: Joke Daems, Lieve Macken and Sonia Vandepitte

Translation Errors from English to Portuguese: an Annotated Corpus: Angela Costa, Tiago Luís and Luísa Coheur

A classification of errors according to a taxonomy. The errors come from English-to-Portuguese translation. Output from Moses and Google Translate has been annotated.

English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling: Sharid Loaiciga, Thomas Meyer and Andrei Popescu-Belis

The taraXÜ Corpus of Human-Annotated Machine Translations - Eleftherios Avramidis, Aljoscha Burchardt, Sabine Hunsicker, Maja Popović, Cindy Tscherwinka, David Vilar and Hans Uszkoreit

CFT13: a Resource for Research into the Post-editing Process - Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao

HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation - Ondrej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Stranak, Vit Suchomel, Aleš Tamchyna and Daniel Zeman

A Corpus of Machine Translation Errors Extracted from Translation Students Exercises - Guillaume Wisniewski, Natalie Kübler and François Yvon

Innovations in Parallel Corpus Search Tools - Martin Volk, Johannes Graën and Elena Callegaro

SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer - Timur Gilmanov, Olga Scrivner and Sandra Kübler

Grammar Correction

The MERLIN corpus: Learner Language and the CEFR: Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová and Chiara Vettori

The MERLIN corpus is a corpus under development for the study of second language learning of European languages. The languages under consideration are Czech, German and Italian. It is a learner corpus annotated with information of various kinds:

Metadata about the author and the test
Test ratings according to the CEFR framework
Error annotations
Annotations to encourage second language acquisition research

Data-oriented research in second language learning has focused on English as L2, but now we are seeing corpora for other languages being developed.

KoKo: an L1 Learner Corpus for German: Andrea Abel, Aivars Glaznieks, Lionel Nicolas and Egon Stemle

A corpus of learners of German as a first language. Most learners are native German speakers. The learners have done one year of secondary education. The corpus is under development.

Building a Reference Lexicon for Countability in English: Tibor Kiss, Francis Jeffry Pelletier and Tobias Stadtfeld

The present paper describes the construction of a resource to determine the lexical preference class of a large number of English noun-senses (approximately 14,000) with respect to the distinction between mass and count interpretations. In constructing the lexicon, we have employed a questionnaire-based approach.

Large Scale Arabic Error Annotation: Guidelines and Framework: Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani and Kemal Oflazer

Learner corpora for Arabic as L2

A Comparison of MT Errors and ESL Errors - Homa B. Hashemi and Rebecca Hwa

Crowdsourcing

Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines - Marta Sabou, Kalina Bontcheva, Leon Derczynski and Arno Scharl

Design and Development of an Online Computational Framework to Facilitate Language Comprehension Research on Indian Languages - Manjira Sinha, Tirthankar Dasgupta and Anupam Basu

Collaboration in the Production of a Massively Multilingual Lexicon - Martin Benjamin

Online Experiments with the Percy Software Framework - Experiences and some Early Results - Christoph Draxler

sloWCrowd: a Crowdsourcing Tool for Lexicographic Tasks - Darja Fišer, Aleš Tavčar and Tomaž Erjavec

Some interesting papers

A Database for Measuring Linguistic Information Content - Richard Sproat, Bruno Cartoni, Hyunjeong Choe, David Huynh, Linne Ha, Ravindran Rajakumar and Evelyn Wenzel-Grondie
Developing Politeness Annotated Corpus of Hindi Blogs - Ritesh Kumar
Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus - Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, Brendan Callahan and Ann Sawyer
The Ellogon Pattern Engine: Context-free Grammars over Annotations - Georgios Petasis
Etymological WordNet: Tracing the History of Words - Gerard de Melo
Distributed Distributional Similarities of Google Books over the Centuries - Martin Riedl, Richard Steuer and Chris Biemann
Hot Topics and Schisms in NLP: Community and Trend Analysis with Saffron on ACL and LREC Proceedings - Paul Buitelaar, Georgeta Bordea and Barry Coughlan
Linguistic Landscaping of South Asia using Digital Language Resources: Genetic vs. Areal Linguistics - Lars Borin, Anju Saxena, Taraka Rama and Bernard Comrie
Indian Subcontinent Language Vitalization - András Kornai and Pushpak Bhattacharyya

Wednesday, June 4, 2014

LREC 2014 - Some papers/posters