Saturday, November 8, 2014

Statistical Machine Translation: Resources for Indian languages

At the Center for Indian Language Technology, IIT Bombay, we host Shata-Anuvaadak (100 Translators), a broad-coverage Statistical Machine Translation system for Indian languages. It currently supports translation between 11 languages:

  • Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani
  • Dravidian languages: Tamil, Telugu, Malayalam
  • English

It is a phrase-based MT system with pre-processing and post-processing extensions. Pre-processing applies source-side reordering for English to Indian language translation. Post-processing transliterates out-of-vocabulary (OOV) words when translating between Indian languages. The system can be accessed at:

        http://www.cfilt.iitb.ac.in/indic-translator
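To make the three stages concrete, here is a minimal toy sketch in Python of how such a pipeline fits together. It is not the Shata-Anuvaadak implementation: the function names, the three-word reordering rule, the word-by-word lookup standing in for phrase-based decoding, and the code-point-offset transliteration are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Tokens = List[str]


def reorder_svo_to_sov(tokens: Tokens) -> Tokens:
    """Toy source-side reordering: move the verb of a 3-word SVO sentence to the end."""
    if len(tokens) == 3:
        subject, verb, obj = tokens
        return [subject, obj, verb]
    return list(tokens)


def devanagari_to_gujarati(word: str) -> str:
    """Toy transliteration: shift Devanagari code points into the Gujarati block.

    This works only roughly, because not every Devanagari code point has a
    counterpart at the same offset in the Gujarati block.
    """
    offset = 0x0A80 - 0x0900
    return "".join(
        chr(ord(ch) + offset) if 0x0900 <= ord(ch) <= 0x097F else ch
        for ch in word
    )


def translate(
    tokens: Tokens,
    table: Dict[str, str],
    reorder: Optional[Callable[[Tokens], Tokens]] = None,
    transliterate: Optional[Callable[[str], str]] = None,
) -> Tokens:
    """Hypothetical pipeline: optional reordering, lookup, optional OOV transliteration."""
    # Pre-processing: reorder the source sentence towards target word order.
    if reorder is not None:
        tokens = reorder(tokens)
    output = []
    for token in tokens:
        if token in table:
            # Stand-in for phrase-based decoding: a plain word lookup.
            output.append(table[token])
        elif transliterate is not None:
            # Post-processing: transliterate words the table does not cover.
            output.append(transliterate(token))
        else:
            # Without a transliterator, the unknown word is copied through.
            output.append(token)
    return output


if __name__ == "__main__":
    en_hi = {"Ram": "राम", "eats": "खाता है", "mango": "आम"}
    print(translate(["Ram", "eats", "mango"], en_hi, reorder=reorder_svo_to_sov))
    print(translate(["राम"], {}, transliterate=devanagari_to_gujarati))
```

The two optional callables mirror the description above: reordering applies only when the source is English, and transliteration only when both languages are Indian, so a given language pair would typically enable one or the other.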

For more details, see the following publication:

Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya. 2014. Shata-Anuvadak: Tackling Multiway Translation of Indian Languages. Language Resources and Evaluation Conference (LREC 2014).

We are also making available the software and resources developed at the Center for this system and for ongoing research. They are released under an open source license for research use and include:

Software

  • Indian language NLP tools: Common NLP tools for Indian languages that are useful for machine translation, including Unicode normalizers, tokenizers, morphological analysers and a transliteration system.
  • Source-side reordering system
  • A simple experiment management system for Moses

Resources

  • Translation models for phrase-based SMT systems for all language pairs in Shata-Anuvaadak
  • Language models for all languages in Shata-Anuvaadak
  • Transliteration models for some language pairs (Moses-based)

You can access these resources at:

    http://www.cfilt.iitb.ac.in/static/download.html