Monday, January 12, 2015

No Roman Hindi please

Chetan Bhagat wrote an opinion piece calling for the replacement of the Devanagari script with the Roman script, describing it as an essential step for saving the Hindi language.

This is fundamentally a bad idea since:
  • The Roman script is clearly inferior to the Devanagari script. For instance, it is ambiguous in representing sounds: c can either be च (as in 'touch') or क (as in 'cut'). Why would you want to throw away a script designed on scientific principles of sound organization for one which is fairly arbitrary?
  • While having language specific hardware keyboards never took off, in the era of touch keyboards designing language specific keyboards is no barrier at all, and all smart tricks done for English keyboards (word completion, swipe, etc.) can be easily replicated for Devanagari. In fact, we can have innovative designs to make input easier. We can go further and have handwriting recognition systems.
  • In fact, even if we have to use the Roman keyboard on physical keyboards, there is no reason to adopt the Roman script for the language. Transliteration systems have become quite good at handling a wide variety of ambiguous mappings from Roman to Devanagari.
  • If there is a need for a common national script, then Devanagari should be the natural choice since it can be representative of all major Indian scripts and follows the same principles. In fact, languages in India which don't have much of a written history should be based on extensions of Devanagari. That will surely be a political hot potato, so we could maybe revive the Brahmi script with suitable extensions to accommodate all scripts in India, since they are but variants of the Brahmi script.

What is needed is that free, open source input solutions be developed for these core input methods so that they are widely and easily available and can become building blocks for language technologies. 

Chetan Bhagat was recently advocating non-jugaad solutions to Uber for the new-generation transportation solutions. I wonder why he proposes such jugadu solutions in this case? In fact, he is bent on destroying a 2000-year-old, well-engineered solution.

Technical reasons apart, I don't know if there is a reason for this unwarranted alarm since the language seems to be thriving. While I don't follow Hindi literature, at least in popular culture (news, TV, Internet, etc.) the availability of Hindi content has only increased. The Union Government and the state governments in Hindi speaking states use Hindi for their official activities. In any case, how will the change of script help to preserve the language? Bhagat does not put forth any reasons. I agree that the use of English as the language of power and intellectual discourse may put regional languages at risk in the future, but the solution would be to enable people to access content and communicate in their native languages as has been done in Europe. With the rapid development of language technologies in recent times, that is clearly possible. Making people use foreign scripts will only result in a sense of inferiority and cut them off from the vast literature which is written in the Devanagari script. Instead of rejuvenating the language, it may just hasten its death.

Saturday, November 8, 2014

Statistical Machine Translation: Resources for Indian languages

At the Center For Indian Language Technology, IIT Bombay, we have hosted Shata-Anuvaadak (100 Translators), a broad coverage Statistical Machine Translation system for Indian languages. It currently supports translation between 11 Indian languages:

  •     Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani
  •     Dravidian languages: Tamil, Telugu, Malayalam
  •     English

It is a Phrase-Based MT system with pre-processing and post-processing extensions. The pre-processing includes source-side reordering for English to Indian language translation. The post-processing includes transliteration between Indian languages for OOV words. The system can be accessed at:

For more details, see the following publication:

Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya. 2014. Shata-Anuvadak: Tackling Multiway Translation of Indian Languages. Language Resources and Evaluation Conference (LREC 2014).

We are also making available software and resources developed in the Center for the system and for ongoing research. These are available under an open source license for research use. These include:


  • Indian Language NLP tools: Common NLP tools for Indian languages that are useful for machine translation: Unicode normalizers, tokenizers, morphological analysers and a transliteration system.
  • Source Side Reordering system
  • A simple experiment management system for Moses
  • Translation Models for Phrase based SMT systems for all language pairs in Shata-anuvaadak
  • Language Models for all languages in Shata-anuvaadak
  • Transliteration models for some language pairs (Moses-based)

You can access these resources at:

Wednesday, June 4, 2014

LREC 2014 - Some paper/posters

I attended the Language Resources and Evaluation Conference (LREC 2014) in Reykjavik, Iceland last week. Just sharing some interesting papers/posters I came across. 

LREC is a rich conference for getting exposure to a number of tools and datasets available across many areas of NLP research. I personally found useful tools/datasets for machine translation, crowdsourcing and grammar correction.

In addition, the conference also emphasizes multilinguality and hence there were a lot of papers/posters showcasing resource development in many languages. A lot of these resources are made available as open source software/open datasets. 

Along with the oral sessions, there were many poster sessions. I found the poster sessions more interesting and interactive and it was possible to cover much more material browsing the posters. 

The following is a small set of papers/posters I found interesting - primarily in the area of SMT, grammar correction, crowdsourcing plus some really cool ideas. You may want to look through the proceedings for literature relevant to your work: 

Machine Translation

Aligning Parallel Texts with InterText: Pavel Vondřička
A tool for automatic alignment of parallel text with post-alignment manual correction. The server version can manage projects and teams. The automatic alignment is based on the 'hunalign' algorithm. Both the server and desktop versions are open source. We could explore this tool for our corpus alignment/cleaning activities.

Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus: Raivis Skadiņš and Jörg Tiedemann and Roberts Rozis and Daiga Deksne
Describes the construction of a new parallel corpus between various European languages from the EU Bookshop service available online. The paper describes the use of various tools and techniques, which is quite informative. The use of a language model for correct extraction of text from PDF is interesting. A comparison of hunalign, Microsoft Bilingual Aligner (MBA) and Vanilla shows that MBA outperforms the rest.

The AMARA Corpus: Building Parallel Language Resources for the Educational Domain: Ahmed Abdelali, Francisco Guzman, Hassan Sajjad and Stephan Vogel
The paper describes the construction of a parallel corpus in the educational domain using subtitles gathered from various sources like Khan Academy, TED, Udacity, Coursera, etc. The translations were obtained via AMARA, a collaborative platform for subtitle translation that many of these projects use. The corpus also contains parallel corpora of Hindi with many foreign languages (a few thousand sentences each). This could be useful to study translation between Indian and foreign languages using bridge languages.

Machine Translationness: Machine-likeness in Machine Translation Evaluation: Joaquim Moré and Salvador Climent

Describes evaluation of use of SMT for automatic subtitling. The metrics involve human rating, automatic metrics and measures of productivity improvement for post-editing. On all counts, the subtitling shows good quality. 

On the Origin of Errors: a Fine-Grained Analysis of MT and PE Errors and their Relationship: Joke Daems, Lieve Macken and Sonia Vandepitte

A classification of errors according to a taxonomy. The errors are from English to Portuguese translation. Moses and Google Translate outputs have been annotated.

English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling: Sharid Loaiciga, Thomas Meyer and Andrei Popescu-Belis

The taraXÜ Corpus of Human-Annotated Machine Translations - Eleftherios Avramidis, Aljoscha Burchardt, Sabine Hunsicker, Maja Popović, Cindy Tscherwinka, David Vilar and Hans Uszkoreit

CFT13: a Resource for Research into the Post-editing Process - Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao

Innovations in Parallel Corpus Search Tools - Martin Volk, Johannes Graën and Elena Callegaro


The MERLIN corpus is a corpus under development for the study of second language learning of European languages. The languages under consideration include Czech, German and Italian. It is a learner corpus annotated with information of various kinds:
  • Metadata about the author and the test
  • Test ratings according to the CEFR framework
  • Error annotations
  • Annotations to encourage second language acquisition research
Data-oriented research in Second Language Learning has been focussed towards English as L2, but now we are seeing corpora for other languages being developed. 

KoKo: an L1 Learner Corpus for German: Andrea Abel, Aivars Glaznieks, Lionel Nicolas and Egon Stemle
A corpus of learners of German as a first language. Most learners are native German speakers. The learners have done one year of secondary education. The corpus is under development.

Building a Reference Lexicon for Countability in English: Tibor Kiss, Francis Jeffry Pelletier and Tobias Stadtfeld
The present paper describes the construction of a resource to determine the lexical preference class of a large number of English noun-senses (approximately 14,000) with respect to the distinction between mass and count interpretations. In constructing the lexicon, we have employed a questionnaire-based approach.

Large Scale Arabic Error Annotation: Guidelines and Framework: Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani and Kemal Oflazer
Learner corpora for Arabic as L2

A Comparison of MT Errors and ESL Errors - Homa B. Hashemi and Rebecca Hwa


Online Experiments with the Percy Software Framework - Experiences and some Early Results - Christoph Draxler

sloWCrowd: a Crowdsourcing Tool for Lexicographic Tasks - Marta Sabou, Kalina Bontcheva, Leon Derczynski and Arno Scharl

Some interesting papers

Saturday, November 23, 2013

A Systematic Exploration of Diversity in Machine Translation - Paper Summary

An interesting paper regarding generating top-k translation outputs. 

Gimpel, K., Batra, D., Dyer, C., & Shakhnarovich, G. (2013). A Systematic Exploration of Diversity in Machine Translation. EMNLP 2013.

This paper discusses:

1) Methods for generating most diverse MT outputs for a SMT system based on a linear decoding model.
2) Applying the top-k diverse outputs to various tasks: (1) system recombination (2) re-ranking top-k lists (3) human post-editing

The motivation for the work is that top-k lists are commonly used in many NLP tasks, including MT, to look at a large set of candidates before making decisions.
The general strategy is to take the k best outputs. However, the entries in such lists are often very similar to each other, and using them has therefore shown mixed results. Hence the search for a method to get top-k diverse translations.

This is achieved by having a decoding procedure which iteratively generates the best translations, one at a time. The decoding objective function adds a dissimilarity term which penalizes similarity with previously generated translations. In this work, the dissimilarity function is simply a language model over the sentences already output in previous iterations (however, the score for n-grams seen in those sentences is negative, so that they are penalized). This makes it possible to use the same decoding algorithm as for a standard linear model. The method increases the decoding time, since a decoding pass has to be performed for each candidate in the top-k diverse list. The parameters n (the n-gram order of the dissimilarity function) and λ (its weight) are tuned on a held-out set.
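
To make the iterative procedure concrete, here is a minimal sketch in Python. It is not the authors' implementation: instead of re-running a decoder, it greedily re-scores a fixed pool of candidate translations, which is a simplification. The function names, the candidate pool and the toy scores are all hypothetical; only the idea of subtracting λ times the number of n-grams shared with earlier outputs comes from the description above.

```python
from typing import Callable, List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[NGram]:
    """Return all n-grams (with repeats) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def diverse_top_k(candidates: Sequence[str],
                  model_score: Callable[[str], float],
                  k: int, n: int = 2, lam: float = 1.0) -> List[str]:
    """Pick k candidates one at a time, penalising n-gram overlap with
    the translations already picked (a stand-in for re-decoding)."""
    seen: Set[NGram] = set()          # n-grams of translations output so far
    remaining = list(candidates)
    chosen: List[str] = []
    for _ in range(min(k, len(remaining))):
        def penalised(c: str) -> float:
            overlap = sum(1 for g in ngrams(c.split(), n) if g in seen)
            return model_score(c) - lam * overlap
        best = max(remaining, key=penalised)
        chosen.append(best)
        remaining.remove(best)
        seen.update(ngrams(best.split(), n))
    return chosen


# Toy model scores (hypothetical log-scores, higher is better).
scores = {
    "the cat sat": -1.0,
    "the cat sat down": -1.2,
    "a feline was sitting": -2.0,
}
print(diverse_top_k(list(scores), scores.get, k=2, lam=0.0))
print(diverse_top_k(list(scores), scores.get, k=2, lam=1.0))
```

With lam=0.0 this reduces to an ordinary k-best list; a positive lam trades model score for diversity.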

Using the top-k diverse outputs provides better results than using top-k best lists. This difference is higher for smaller values of k. Also, an interesting analysis provided is which sentences benefit the most from top-k diverse lists. It turns out that sentences with lower BLEU scores (presumably difficult to translate) benefit from using the diverse lists, whereas sentences with high BLEU scores benefit from top-k best lists. 

A point worth mentioning: While doing top-k re-ranking, one of the features the authors use is a LM score over word classes and this provides very good results. Brown clustering was used to learn the word classes. 

With help of confidence scores, a decision can be dynamically made about which of the lists (diverse or best) should be used. There is scope for investigation into more similarity functions.

Tuesday, November 12, 2013

Large Data sources for NLP from Google

Google has made available two large and rich sources for NLP research:
  • Google N-gram corpus
  • Google Syntactic N-gram corpus
These have been described in the following papers:
  • Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science. 2011.
  • Yoav Goldberg and Jon Orwant. A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. *SEM 2013. 2013.

These resources have been created from the Google Books corpus, which is an outcome of Google's efforts to scan all the world's books. I will just highlight the important points from these papers in this post.

Google N-gram corpus
This is a traditional n-gram corpus, where frequency counts are provided for 1 to 5 gram strings. However, there are a couple of additional features:
  • One is the temporal aspect of the n-grams i.e. for each n-gram, the frequency counts are given for each year since the medieval times. For English, the counts are available from the 16th century onwards.
  • Frequency counts are available for extended n-grams also. The extension is in terms of the POS tags. All the data has been POS tagged with a tagset of 12 basic tags. This makes possible queries of the following form: 
    • the burnt_NOUN car (combination of POS tag and token queries)
    • _DET_ _NOUN_ (queries involving POS tags only)
          There are some restrictions on the 4 and 5-grams available, in order to prevent combinatorial explosion.
  • Information on head-modifier relations is also available, though the relation type is not specified.
You can use the Google N-gram viewer to query this resource in an interactive way. The corpus has been used for studying the evolution of culture over time, and can be used for a variety of such temporal studies, e.g. in economics, language, etc.
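
For offline use, the per-year counts can be aggregated from the raw files. The sketch below assumes the tab-separated layout of the version 2 release (n-gram, year, match count, volume count on each line); that layout is an assumption here, so check it against the files you download. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def yearly_counts(lines: Iterable[str], query: str) -> Dict[int, int]:
    """Sum the match counts per year for one (possibly POS-tagged) n-gram.

    Assumed line layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
    """
    counts: Dict[int, int] = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if ngram == query:
            counts[int(year)] += int(match_count)
    return dict(counts)


# Illustrative rows only, not real corpus counts.
sample = [
    "burnt_ADJ car\t1990\t12\t9\n",
    "burnt_ADJ car\t1991\t15\t11\n",
    "burnt_VERB car\t1991\t2\t2\n",
]
print(yearly_counts(sample, "burnt_ADJ car"))
```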

Google Syntactic N-gram corpus
While traditional n-grams contain words which are sequential, a syntactic n-gram is defined to be a set of words involved in a dependency relationship. Further, an order-n syntactic n-gram means an n-gram containing n content words. The Google Books syntactic n-gram corpus contains dependency tree fragments of size 1-5 viz. nodes, arcs, biarcs, triarcs and quadarcs. There is a restriction on the types of quadarcs available in the corpus. Each fragment contains the surface form of the words, their POS tags, the head-modifier relationships and the relative order of the words in the n-gram. It does not contain information about the linear distance between the words in the dependency or the existence of gaps between words in the n-gram. The counts of all the syntactic n-grams are provided. A few noteworthy points:
  • As with the Books n-gram corpus, temporal information on the syntactic n-grams is available.
  • Additional information for dependency trees involving conjunctions and prepositions is made available. Here, the dependency tree fragments are extended to provide information about the conjunctions and prepositions, though they are function words. This information is part of the extended component of the corpus (extended-arcs, extended-biarcs, etc.)
  • verbargs-unlex and nounargs-unlex are unlexicalized versions of the syntactic n-grams where only the head word and the top-1000 words in the language are lexicalized.
The syntactic n-gram corpus can be very useful for studying lexical semantics, sub-categorization, etc.
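
As a rough illustration of what a fragment carries, the sketch below parses one syntactic n-gram into its head-modifier arcs. It assumes that each token is encoded as word/POS/dependency-label/head-index, with tokens in surface order and head index 0 marking the root; treat that encoding, the helper names and the sample fragment as assumptions rather than a specification of the released files.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    pos: str
    label: str
    head: int  # 1-based index of the head token; 0 marks the root (assumed)


def parse_fragment(fragment: str) -> List[Token]:
    """Split 'word/POS/label/head' items; rsplit keeps '/' inside words."""
    tokens = []
    for item in fragment.split(" "):
        word, pos, label, head = item.rsplit("/", 3)
        tokens.append(Token(word, pos, label, int(head)))
    return tokens


def head_modifier_arcs(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head word, label, modifier word) for every non-root token."""
    return [(tokens[t.head - 1].word, t.label, t.word)
            for t in tokens if t.head > 0]


# Made-up fragment for illustration.
fragment = "dog/NN/nsubj/2 barked/VBD/ROOT/0 loudly/RB/advmod/2"
print(head_modifier_arcs(parse_fragment(fragment)))
```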

Saturday, October 19, 2013

Hierarchical Phrase Based models

I read David Chiang's ACL'05 paper on hierarchical phrase based models today. A quick summary:

Design Principles:
  • Formal, but not linguistic i.e. a synchronous CFG is used, however the grammar learnt may not correspond to a linguistic ('human'?) grammar.
  • Leverage the strengths of phrase-based system while moving to syntax based 

Basic Motivation:

Basic idea is to handle long distance reorderings that a phrase based model can't handle.
This is done by introducing a single non-terminal 'X' and having rules of the form:

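  X -> a X_1 b X_2 c | d X_2 e X_1 f

where the two sides of '|' are the source and target right-hand sides, and the subscripts link the corresponding non-terminals across them, indicating their relative positions.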

In theory, the number of non-terminals on the RHS is not constrained. However, a limitation of this approach is that reorderings that happen at higher levels of a constituent parse tree may not be captured. The rules learnt by this system are more like lexicalized reordering templates.
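
A small sketch may help show why such rules act as lexicalized reordering templates. Each side of a rule is represented as a list in which strings are terminals and integers stand for the linked non-terminals X_1, X_2; filling both sides with already-translated sub-phrases yields the reordered output. The representation, the `expand` helper and the example words are illustrative assumptions, not the paper's data structures.

```python
from typing import Dict, List, Sequence, Union

# A str is a terminal word; an int i stands for the linked non-terminal X_i.
Symbol = Union[str, int]


def expand(side: Sequence[Symbol], fillers: Dict[int, List[str]]) -> List[str]:
    """Replace each non-terminal index with the words that fill it."""
    out: List[str] = []
    for sym in side:
        if isinstance(sym, int):
            out.extend(fillers[sym])
        else:
            out.append(sym)
    return out


# Illustrative rule: X -> yu X_1 you X_2 | have X_2 with X_1
source_side: List[Symbol] = ["yu", 1, "you", 2]
target_side: List[Symbol] = ["have", 2, "with", 1]

# Sub-phrases already translated lower in the derivation.
source_fill = {1: ["Beihan"], 2: ["bangjiao"]}
target_fill = {1: ["North", "Korea"], 2: ["diplomatic", "relations"]}

print(" ".join(expand(source_side, source_fill)))
print(" ".join(expand(target_side, target_fill)))
```

The two non-terminals swap places on the target side, which is the long distance reordering that a flat phrase pair cannot express without memorizing the whole span.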

Special types of rules used:
  •  Glue rules: top level rule
  •  Entity rules: for translating dates, numbers, etc.

Learning rules

The starting point is the phrases learnt by a phrase based system, called 'initial phrase pairs'. From each initial phrase pair, rules are extracted. In order to avoid too many rules and reduce spurious derivations, some heuristics are used. One noteworthy heuristic is that rules are constructed from as small initial phrase pairs as possible. Another is that each rule can have at most two non-terminals on the RHS. This is done for decoding efficiency, probably because the CYK algorithm expects a grammar in CNF, where a rule has at most two non-terminals on the RHS.
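
The extraction step can be sketched as "subtracting" smaller phrase pairs from a larger one. The code below is a simplification under stated assumptions: it works on plain token lists, ignores word alignments and the other heuristics, and only enforces the limit of two non-terminals. The helper names and the example phrase pairs are hypothetical.

```python
from typing import List, Sequence, Tuple

PhrasePair = Tuple[List[str], List[str]]


def replace_span(phrase: List[str], sub: List[str], label: str) -> List[str]:
    """Replace the first occurrence of `sub` inside `phrase` with `label`."""
    n = len(sub)
    for i in range(len(phrase) - n + 1):
        if phrase[i:i + n] == sub:
            return phrase[:i] + [label] + phrase[i + n:]
    raise ValueError(f"{sub} does not occur in {phrase}")


def extract_rule(initial: PhrasePair, sub_pairs: Sequence[PhrasePair],
                 max_nonterminals: int = 2) -> PhrasePair:
    """Turn an initial phrase pair into a rule by replacing each contained
    sub-phrase pair with a linked non-terminal X_i on both sides."""
    if len(sub_pairs) > max_nonterminals:
        raise ValueError("too many non-terminals for one rule")
    src, tgt = initial
    for i, (sub_src, sub_tgt) in enumerate(sub_pairs, start=1):
        src = replace_span(src, sub_src, f"X_{i}")
        tgt = replace_span(tgt, sub_tgt, f"X_{i}")
    return src, tgt


initial = (["yu", "Beihan", "you", "bangjiao"],
           ["have", "diplomatic", "relations", "with", "North", "Korea"])
subs = [(["Beihan"], ["North", "Korea"]),
        (["bangjiao"], ["diplomatic", "relations"])]
print(extract_rule(initial, subs))
```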

The model

The model is very similar to the phrase based model, a log-linear model with the same features, except that the phrase translation probabilities are replaced by the rule translation probabilities. The probabilities are learnt in a similar way.


Decoding is done via a CYK variant. The differences from a standard CYK parsing are:
- Parsing is done only for the source language sentence. So far so good.
- There is only one non-terminal. You would expect this to make the parsing easier. However, there is a catch.
- The language model of the target language has to be integrated into the decoder. The paper says, "the language model is integrated by intersecting with the target side CFG", which I take to mean that the LM score of the sub-string spanned by a cell in the chart parsing is multiplied along with the rule weights. This means each cell has to keep track of the rule along with all the target strings that the rule can generate in that span. Each is like a virtual non-terminal, and hence the effective number of non-terminals can be really large, especially for larger spans.
   What I have described here is naive, and the journal paper describes different strategies for integrating the language model. I will read up on and summarize that later. 
- The grammar is not CNF, though every rule still has only two non-terminals. I guess it is converted to CNF before decoding.

Another interesting problem is how to find the top-k parses. The journal article describes this in detail too.

Optimizations to decoding

- Limiting the number of entries in a cell of the chart
- Pruning entries in the cell with very low scores as compared to the highest scoring rule in the cell
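
The two pruning steps above can be sketched as follows. This is a generic beam-pruning routine, not the decoder's actual code: the function name, the use of log-scores and the two thresholds (`max_entries`, `beam`) are assumptions made for illustration.

```python
from typing import List, Tuple

Entry = Tuple[float, str]  # (log-score, label of the chart item)


def prune_cell(entries: List[Entry], max_entries: int, beam: float) -> List[Entry]:
    """Keep entries whose log-score is within `beam` of the best entry,
    then cap the cell at `max_entries` items."""
    if not entries:
        return []
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)
    best_score = ranked[0][0]
    within_beam = [e for e in ranked if e[0] >= best_score - beam]
    return within_beam[:max_entries]


cell = [(-1.0, "a"), (-1.5, "b"), (-4.0, "c"), (-1.2, "d")]
print(prune_cell(cell, max_entries=2, beam=1.0))
```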

  • Chiang, David. "A hierarchical phrase-based model for statistical machine translation." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.
  • Chiang, David. "Hierarchical phrase-based translation." Computational Linguistics 33.2 (2007): 201-228.

Tuesday, August 28, 2012

N-gram features for text classification

Traditionally, text classification has relied on bag-of-words count features. For some experiments, I was wondering if using n-gram counts could make for a good feature set. Once I generated the features, I knew I was in trouble. For the WSJ corpus, I got about 20 million features for a trigram model. I checked the literature and found this paper, which reports that n-gram features don't help much:

A Study Using n-gram Features for Text Categorization, Johannes Furnkranz

Bigram and trigram features may give modest gains, but feature selection is obviously required. Feature selection based on document frequency or term frequency would be a simple approach.
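
As a minimal sketch of that simple approach, the code below builds n-gram count features and keeps only those that occur in at least `min_df` documents. It assumes whitespace tokenization and lowercasing; the function names, the threshold and the toy documents are illustrative, not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngram_counts(tokens: Sequence[str], max_n: int) -> Counter:
    """Count all n-grams of order 1..max_n in one document."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def select_by_document_frequency(docs: Sequence[str], max_n: int = 3,
                                 min_df: int = 2) -> Tuple[List[str], List[List[int]]]:
    """Keep n-grams seen in at least `min_df` documents and return the
    vocabulary together with one count vector per document."""
    per_doc = [ngram_counts(d.lower().split(), max_n) for d in docs]
    df: Counter = Counter()
    for counts in per_doc:
        df.update(counts.keys())  # each document contributes at most 1
    vocab = sorted(f for f, n in df.items() if n >= min_df)
    vectors = [[counts[f] for f in vocab] for counts in per_doc]
    return vocab, vectors


docs = ["stocks fell sharply", "stocks fell again", "bonds rose"]
vocab, vectors = select_by_document_frequency(docs, max_n=2, min_df=2)
print(vocab)
print(vectors)
```

On a real corpus, raising `min_df` is the quickest way to bring a multi-million feature space down to something a classifier can handle.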