I attended the Language Resources and Evaluation Conference (LREC 2014) in Reykjavik, Iceland last week. Just sharing some interesting papers/posters I came across.
LREC is a rich conference for getting exposure to the many tools and datasets available across areas of NLP research. I personally found useful tools/datasets for machine translation, crowdsourcing and grammar correction.
In addition, the conference also emphasizes multilinguality and hence there were a lot of papers/posters showcasing resource development in many languages. A lot of these resources are made available as open source software/open datasets.
Along with the oral sessions, there were many poster sessions. I found the poster sessions more interesting and interactive and it was possible to cover much more material browsing the posters.
The following is a small set of papers/posters I found interesting - primarily in the area of SMT, grammar correction, crowdsourcing plus some really cool ideas. You may want to look through the proceedings for literature relevant to your work:
Machine Translation
Aligning Parallel Texts with InterText: Pavel Vondřička
A tool for automatic alignment of parallel text with post-alignment manual correction. The server version can manage projects and teams. The automatic alignment is based on the 'hunalign' algorithm. Both the server and desktop versions are open source. We could explore this tool for our corpus alignment/cleaning activities.
Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus: Raivis Skadiņš and Jörg Tiedemann and Roberts Rozis and Daiga Deksne
Describes the construction of a new parallel corpus covering various European languages, built from the EU Bookshop service available online. The paper describes the various tools and techniques used, which is quite informative. The use of a language model for correct extraction of text from PDF is interesting. A comparison of hunalign, Microsoft Bilingual Aligner (MBA) and Vanilla shows that MBA outperforms the rest.
The AMARA Corpus: Building Parallel Language Resources for the Educational Domain: Ahmed Abdelali, Francisco Guzman, Hassan Sajjad and Stephan Vogel
The paper describes the construction of a parallel corpus in the educational domain using subtitles gathered from various sources like Khan Academy, TED, Udacity, Coursera, etc. The translations were obtained via AMARA, a collaborative platform for subtitle translation that many of these projects use. The corpus also contains parallel data pairing Hindi with many foreign languages (a few thousand sentences each). This could be useful for studying translation between Indian and foreign languages using bridge languages.
Machine Translationness: Machine-likeness in Machine Translation Evaluation: Joaquim Moré and Salvador Climent
Machine Translation for Subtitling: A Large-Scale Evaluation -- Thierry Etchegoyhen, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou, Jie Jiang, Gerard Van Loenhout, Arantza Del Pozo, Mirjam Sepesy Maucec, Anja Turner and Martin Volk
Describes an evaluation of the use of SMT for automatic subtitling. The evaluation combines human ratings, automatic metrics and measures of productivity improvement for post-editing. On all counts, the subtitling shows good quality.
On the Origin of Errors: a Fine-Grained Analysis of MT and PE Errors and their Relationship: Joke Daems, Lieve Macken and Sonia Vandepitte
Translation Errors from English to Portuguese: an Annotated Corpus: Angela Costa, Tiago Luís and Luísa Coheur
A classification of translation errors according to a taxonomy. The errors come from English-to-Portuguese translation; output from Moses and Google Translate has been annotated.
English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling: Sharid Loaiciga, Thomas Meyer and Andrei Popescu-Belis
The taraXÜ Corpus of Human-Annotated Machine Translations - Eleftherios Avramidis, Aljoscha Burchardt, Sabine Hunsicker, Maja Popović, Cindy Tscherwinka, David Vilar and Hans Uszkoreit
CFT13: a Resource for Research into the Post-editing Process - Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao
HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation - Ondrej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Stranak, Vit Suchomel, Aleš Tamchyna and Daniel Zeman
A Corpus of Machine Translation Errors Extracted from Translation Students Exercises - Guillaume Wisniewski, Natalie Kübler and François Yvon
Innovations in Parallel Corpus Search Tools - Martin Volk, Johannes Graën and Elena Callegaro
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer - Timur Gilmanov, Olga Scrivner and Sandra Kübler
Grammar Correction
The MERLIN corpus: Learner Language and the CEFR: Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová and Chiara Vettori
The MERLIN corpus is a corpus under development for the study of second language learning of European languages. The languages under consideration are Czech, German and Italian. It is a learner corpus annotated with information of various kinds:
- Metadata about the author and the test
- Test ratings according to the CEFR framework
- Error annotations
- Annotations to encourage second language acquisition research
Data-oriented research in Second Language Learning has been focussed towards English as L2, but now we are seeing corpora for other languages being developed.
KoKo: an L1 Learner Corpus for German: Andrea Abel, Aivars Glaznieks, Lionel Nicolas and Egon Stemle
A learner corpus of German as a first language. Most learners are native German speakers. The learners have done one year of secondary education. The corpus is under development.
Building a Reference Lexicon for Countability in English: Tibor Kiss, Francis Jeffry Pelletier and Tobias Stadtfeld
The present paper describes the construction of a resource to determine the lexical preference class of a large number of English noun-senses (≈ 14,000) with respect to the distinction between mass and count interpretations. In constructing the lexicon, we have employed a questionnaire-based approach.
Large Scale Arabic Error Annotation: Guidelines and Framework: Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani and Kemal Oflazer
Learner corpora for Arabic as L2
A Comparison of MT Errors and ESL Errors - Homa B. Hashemi and Rebecca Hwa
Crowdsourcing
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines - Marta Sabou, Kalina Bontcheva, Leon Derczynski and Arno Scharl
Design and Development of an Online Computational Framework to Facilitate Language Comprehension Research on Indian Languages - Manjira Sinha, Tirthankar Dasgupta and Anupam Basu
Online Experiments with the Percy Software Framework - Experiences and some Early Results - Christoph Draxler
sloWCrowd: a Crowdsourcing Tool for Lexicographic Tasks - Darja Fišer, Aleš Tavčar and Tomaž Erjavec
Some interesting papers
- A Database for Measuring Linguistic Information Content - Richard Sproat, Bruno Cartoni, Hyunjeong Choe, David Huynh, Linne Ha, Ravindran Rajakumar and Evelyn Wenzel-Grondie
- Developing Politeness Annotated Corpus of Hindi Blogs - Ritesh Kumar
- Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus - Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, Brendan Callahan and Ann Sawyer
- The Ellogon Pattern Engine: Context-free Grammars over Annotations - Georgios Petasis
- Etymological WordNet: Tracing the History of Words - Gerard de Melo
- Distributed Distributional Similarities of Google Books over the Centuries - Martin Riedl, Richard Steuer and Chris Biemann
- Hot Topics and Schisms in NLP: Community and Trend Analysis with Saffron on ACL and LREC Proceedings - Paul Buitelaar, Georgeta Bordea and Barry Coughlan
- Linguistic Landscaping of South Asia using Digital Language Resources: Genetic vs. Areal Linguistics - Lars Borin, Anju Saxena, Taraka Rama and Bernard Comrie
- Indian Subcontinent Language Vitalization - András Kornai and Pushpak Bhattacharyya