Saturday, November 8, 2014

Statistical Machine Translation: Resources for Indian languages

At the Center for Indian Language Technology, IIT Bombay, we have hosted Shata-Anuvaadak (100 Translators), a broad-coverage Statistical Machine Translation system for Indian languages. It currently supports translation among 11 languages:

  •     Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani
  •     Dravidian languages: Tamil, Telugu, Malayalam
  •     English

It is a Phrase-Based MT system with pre-processing and post-processing extensions. The pre-processing includes source-side reordering for English to Indian language translation. The post-processing includes transliteration between Indian languages for OOV words. The system can be accessed at:
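As a toy illustration of the OOV post-processing idea (not the actual Shata-Anuvaadak code): tokens the decoder could not translate pass through in the source script, so they can be detected by Unicode block and handed to a transliterator. All function names below are hypothetical.

```python
# Toy sketch of OOV post-processing for, e.g., a Hindi -> Tamil system:
# untranslated tokens remain in Devanagari, so detect them by Unicode
# block and hand them to a transliteration function.

def is_devanagari(token):
    """True if the token contains any Devanagari code point (U+0900-U+097F)."""
    return any(0x0900 <= ord(ch) <= 0x097F for ch in token)

def postprocess_oov(tokens, transliterate):
    """Replace source-script OOV tokens using the given transliterator."""
    return [transliterate(t) if is_devanagari(t) else t for t in tokens]

# Example with a dummy transliterator that just tags the OOV token:
out = postprocess_oov(["vanakkam", "मुंबई"], lambda t: "<TR:" + t + ">")
```

In the real system the transliterator would be the Moses-based transliteration model mentioned below, trained between the script pairs involved.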

For more details, see the following publication:

Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya. 2014. Shata-Anuvadak: Tackling Multiway Translation of Indian Languages. Language Resources and Evaluation Conference (LREC 2014).

We are also making available software and resources developed at the Center for this system and for ongoing research. These are available under an open-source license for research use. They include:


  • Indian Language NLP Tools: common NLP tools for Indian languages that are useful for machine translation: Unicode normalizers, tokenizers, morphology analysers and a transliteration system
  • Source-side reordering system
  • A simple experiment management system for Moses
  • Translation models for phrase-based SMT systems for all language pairs in Shata-Anuvaadak
  • Language models for all languages in Shata-Anuvaadak
  • Transliteration models for some language pairs (Moses-based)

You can access these resources at:
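To give a flavour of what the Unicode normalizers deal with: several Devanagari consonant+nukta letters have both precomposed and decomposed encodings, and a normalizer maps canonically equivalent sequences to one form. A minimal sketch using only Python's standard `unicodedata` (not the Center's actual tool):

```python
import unicodedata

def normalize_indic(text):
    # NFC maps canonically equivalent sequences to a single form.
    # Devanagari nukta letters such as U+0958 (QA) are composition
    # exclusions, so NFC produces the decomposed KA (U+0915) +
    # NUKTA (U+093C) sequence for both spellings.
    return unicodedata.normalize("NFC", text)

precomposed = "\u0958"        # क़ as a single code point
decomposed = "\u0915\u093C"   # क followed by nukta
# Both spellings normalize to the same sequence:
assert normalize_indic(precomposed) == normalize_indic(decomposed)
```

Without such normalization, the two spellings would be treated as different words by tokenizers and by the SMT training pipeline.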

Wednesday, June 4, 2014

LREC 2014 - Some papers/posters

I attended the Language Resources and Evaluation Conference (LREC 2014) in Reykjavik, Iceland last week. Just sharing some interesting papers/posters I came across. 

LREC is a rich conference for getting exposure to the many tools and datasets available across different areas of NLP research. I personally found useful tools/datasets for machine translation, crowdsourcing and grammar correction.

In addition, the conference also emphasizes multilinguality and hence there were a lot of papers/posters showcasing resource development in many languages. A lot of these resources are made available as open source software/open datasets. 

Along with the oral sessions, there were many poster sessions. I found the poster sessions more interesting and interactive, and it was possible to cover much more material by browsing the posters.

The following is a small set of papers/posters I found interesting - primarily in the areas of SMT, grammar correction and crowdsourcing, plus some really cool ideas. You may want to look through the proceedings for literature relevant to your work:

Machine Translation

Aligning Parallel Texts with InterText: Pavel Vondřička
A tool for automatic alignment of parallel text with post-alignment manual correction. The server version can manage projects and teams. The automatic alignment is based on the 'hunalign' algorithm. Both the server and desktop versions are open source. We could explore this tool for our corpus alignment/cleaning activities.
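For context, hunalign combines a bilingual dictionary with Gale-Church-style sentence-length statistics. A toy version of the length component (not hunalign's actual scoring) might look like:

```python
def length_score(src, tgt):
    # Toy length-similarity score in [0, 1]: sentence pairs of similar
    # character length are more likely to be mutual translations -- the
    # intuition behind the Gale-Church length component hunalign uses
    # (alongside dictionary evidence, which this sketch omits).
    a, b = len(src), len(tgt)
    if max(a, b) == 0:
        return 1.0
    return min(a, b) / max(a, b)

# A long sentence matches a long candidate better than a very short one:
src = "The committee approved the proposal after a lengthy debate."
good = length_score(src, "Der Ausschuss billigte den Vorschlag nach langer Debatte.")
bad = length_score(src, "Ja.")
```

A real aligner would turn such scores into probabilities and find the best monotone 1-1/1-2/2-1 alignment with dynamic programming.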

Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus: Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis and Daiga Deksne
Describes the construction of a new parallel corpus between various European languages from the EU Bookshop service available online. The paper describes the use of various tools and techniques, which is quite informative. The use of a language model for correct extraction of text from PDF is interesting. A comparison of hunalign, the Microsoft Bilingual Aligner and Vanilla shows that the MBA outperforms the rest.
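The idea of using a language model to clean PDF-extracted text can be sketched with a toy dehyphenation step: when a word is split across a line break, consult word frequencies (standing in here for a full language model) to decide whether to join the halves. This illustrates the general idea only, not the paper's actual method.

```python
def dehyphenate(first, second, word_freq):
    # A PDF line break split a word as "first-" / "second". If the joined
    # form is more frequent than the hyphenated form, treat the hyphen as
    # a line-break artifact and remove it; otherwise keep it (it may be a
    # genuine compound). word_freq stands in for a real LM's scores.
    joined = first + second
    hyphenated = first + "-" + second
    if word_freq.get(joined, 0) > word_freq.get(hyphenated, 0):
        return joined
    return hyphenated

freq = {"multilingual": 42, "e-mail": 7}
assert dehyphenate("multi", "lingual", freq) == "multilingual"
assert dehyphenate("e", "mail", freq) == "e-mail"
```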

The AMARA Corpus: Building Parallel Language Resources for the Educational Domain: Ahmed Abdelali, Francisco Guzman, Hassan Sajjad and Stephan Vogel
The paper describes the construction of a parallel corpus in the educational domain using subtitles gathered from sources like Khan Academy, TED, Udacity, Coursera, etc. The translations were obtained via AMARA, a collaborative platform for subtitle translation that many of these projects use. The corpus also contains parallel corpora of Hindi with many foreign languages (a few thousand sentences each). This could be useful to study translation between Indian and foreign languages using bridge languages.

Machine Translationness: Machine-likeness in Machine Translation Evaluation: Joaquim Moré and Salvador Climent

Describes an evaluation of the use of SMT for automatic subtitling. The metrics involve human ratings, automatic metrics and measures of productivity improvement for post-editing. On all counts, the subtitling shows good quality.

On the Origin of Errors: a Fine-Grained Analysis of MT and PE Errors and their Relationship: Joke Daems, Lieve Macken and Sonia Vandepitte

A classification of errors according to a taxonomy. The errors are for English to Portuguese translation. Moses and Google Translate outputs have been annotated.

English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling: Sharid Loaiciga, Thomas Meyer and Andrei Popescu-Belis

The taraXÜ Corpus of Human-Annotated Machine Translations - Eleftherios Avramidis, Aljoscha Burchardt, Sabine Hunsicker, Maja Popović, Cindy Tscherwinka, David Vilar and Hans Uszkoreit

CFT13: a Resource for Research into the Post-editing Process - Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao

Innovations in Parallel Corpus Search Tools - Martin Volk, Johannes Graën and Elena Callegaro


The MERLIN corpus is a corpus under development for the study of second language learning of European languages. The languages under consideration are Czech, German and Italian. It is a learner corpus annotated with information of various kinds: 
  • Metadata about the author and the test
  • Test ratings according to the CEFR framework
  • Error annotations
  • Annotations to encourage second language acquisition research
Data-oriented research in second language learning has been focused on English as L2, but now we are seeing corpora for other languages being developed. 

KoKo: an L1 Learner Corpus for German: Andrea Abel, Aivars Glaznieks, Lionel Nicolas and Egon Stemle
A corpus of learners of German as a first language; most learners are native German speakers who have completed one year of secondary education. The corpus is under development. 

Building a Reference Lexicon for Countability in English: Tibor Kiss, Francis Jeffry Pelletier and Tobias Stadtfeld
The paper describes the construction of a resource to determine the lexical preference class of a large number of English noun senses (≈14,000) with respect to the distinction between mass and count interpretations. A questionnaire-based approach was employed in constructing the lexicon.

Large Scale Arabic Error Annotation: Guidelines and Framework: Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani and Kemal Oflazer
A learner corpus for Arabic as L2.

A Comparison of MT Errors and ESL Errors - Homa B. Hashemi and Rebecca Hwa


Online Experiments with the Percy Software Framework - Experiences and some Early Results - Christoph Draxler

sloWCrowd: a Crowdsourcing Tool for Lexicographic Tasks - Marta Sabou, Kalina Bontcheva, Leon Derczynski and Arno Scharl
