Saturday, November 23, 2013

A Systematic Exploration of Diversity in Machine Translation - Paper Summary

An interesting paper on generating diverse top-k translation outputs.

Gimpel, K., Batra, D., Dyer, C., & Shakhnarovich, G. (2013). A Systematic Exploration of Diversity in Machine Translation. EMNLP 2013.

This paper discusses:

1) Methods for generating the most diverse MT outputs for an SMT system based on a linear decoding model.
2) Applying the top-k diverse outputs to various tasks: (1) system recombination, (2) re-ranking top-k lists, and (3) human post-editing.

The motivation for the work is that top-k lists are commonly used in many NLP tasks, including MT, to look at a larger set of hypotheses before making decisions.
The usual strategy is simply to take the k best outputs. However, the entries in such lists are often very similar to each other, and top-k best lists have therefore shown mixed results. Hence the search for a method to generate top-k diverse translations.

This is achieved with a decoding procedure that iteratively generates the best translations, one at a time. The decoding objective function adds a dissimilarity term that penalizes similarity with previously generated translations. In this work, the dissimilarity function is simply a language model over the sentences output in previous iterations, given a negative weight so that sentences scored highly by this LM are penalized. This choice allows the same decoding algorithm to be used as for a standard linear decoding objective. The method increases decoding time, since one decoding pass has to be performed for each candidate in the top-k diverse list. The parameters n (the LM order) and λ (the penalty weight) are tuned on a held-out set.
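The iterative procedure can be sketched as below. This is an illustrative toy, not the paper's code: `decode` stands in for a full linear-model decoder, and the dissimilarity term is a simple n-gram-overlap penalty rather than the paper's negatively weighted language model.

```python
def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def diverse_decode(source, decode, k, n=2, lam=0.5):
    """Generate k outputs one at a time, penalizing overlap with earlier ones.

    `decode(source, penalty)` is assumed to return the hypothesis maximizing
    (model score + penalty); here we only model the penalty side.
    """
    outputs = []
    seen = set()  # n-grams appearing in previously generated translations
    for _ in range(k):
        # Dissimilarity term: negative score for every already-seen n-gram.
        penalty = lambda hyp: -lam * sum(g in seen for g in ngrams(hyp, n))
        best = decode(source, penalty)
        outputs.append(best)
        seen.update(ngrams(best, n))
    return outputs
```

With a toy decoder that just ranks a fixed candidate pool by the penalty, the second iteration already prefers a candidate sharing no bigrams with the first output.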

Using the top-k diverse outputs provides better results than using top-k best lists. This difference is larger for smaller values of k. An interesting analysis is also provided of which sentences benefit the most from top-k diverse lists. It turns out that sentences with lower BLEU scores (presumably difficult to translate) benefit from using the diverse lists, whereas sentences with high BLEU scores benefit from top-k best lists.

A point worth mentioning: while doing top-k re-ranking, one of the features the authors use is an LM score over word classes, and this provides very good results. Brown clustering was used to learn the word classes.
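A hedged sketch of such a word-class LM feature, assuming a precomputed word-to-cluster map (the paper uses Brown clusters; the bigram model and add-alpha smoothing here are my illustrative choices, not the authors'):

```python
import math
from collections import Counter

def class_bigram_lm(train_sents, word2class, alpha=1.0):
    """Return a log-prob scorer over class sequences (add-alpha smoothed).

    Each word is replaced by its cluster id before counting, so the model
    generalizes across words that share a class.
    """
    unigrams, bigrams = Counter(), Counter()
    vocab = set(word2class.values()) | {"<s>", "</s>", "<unk>"}

    def to_classes(sent):
        return ["<s>"] + [word2class.get(w, "<unk>") for w in sent] + ["</s>"]

    for sent in train_sents:
        cs = to_classes(sent)
        unigrams.update(cs[:-1])          # bigram history counts
        bigrams.update(zip(cs, cs[1:]))
    V = len(vocab)

    def score(sent):
        cs = to_classes(sent)
        return sum(math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * V))
                   for a, b in zip(cs, cs[1:]))
    return score
```

A hypothesis whose class sequence matches the training pattern (e.g. determiner class followed by noun class) scores higher than one that violates it, which is the signal the re-ranker exploits.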

With the help of confidence scores, a decision can be made dynamically about which of the lists (diverse or best) should be used. There is also scope for investigating more dissimilarity functions.

Tuesday, November 12, 2013

Large Data sources for NLP from Google

Google has made available two large and rich data sources for NLP research, described in the following papers:
  •  Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books.  Science. 2011.
  • Goldberg, Yoav, and Jon Orwant. "A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books." *SEM 2013.

These resources have been created from the Google Books corpus, which is an outcome of Google's efforts to scan all the world's books. I will just highlight the important points from these papers in this post.

Google N-gram corpus
This is a traditional n-gram corpus, where frequency counts are provided for 1- to 5-gram strings. However, there are a couple of additional features:
  • One is the temporal aspect of the n-grams, i.e. for each n-gram, the frequency counts are given per year, going back centuries. For English, the counts are available from the 16th century onwards.
  • Frequency counts are also available for extended n-grams, where the extension is in terms of POS tags. All the data has been POS-tagged with a tagset of 12 basic tags. This makes possible queries of the following form:
    • the burnt_NOUN car (combination of POS tag and token queries)
    • _DET_ _NOUN_ (queries involving determiners only)
    There are some restrictions on the 4- and 5-grams available, in order to prevent combinatorial explosion.
  • Information on head-modifier relations is also available, though the relation type is not specified.
You can use the Google N-gram Viewer to query this resource interactively. The corpus has been used for studying the evolution of culture over time, and can be applied to a variety of such temporal studies, e.g. in economics and language.
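For working with the raw files rather than the viewer, a small reader can be sketched as follows. The tab-separated layout assumed here (`ngram<TAB>year<TAB>match_count<TAB>volume_count`, one year per line) follows the published export format, but verify it against the README of the version you download.

```python
from collections import defaultdict

def counts_by_year(lines):
    """Aggregate match counts per (ngram, year) from raw export lines."""
    table = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        table[(ngram, int(year))] += int(match_count)
    return table
```

This is enough to plot a frequency-over-time curve for one n-gram, which is essentially what the N-gram Viewer does.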

Google Syntactic N-gram corpus
While traditional n-grams contain words that are sequential, a syntactic n-gram is defined as a set of words involved in a dependency relationship. Further, an order-n syntactic n-gram is an n-gram containing n content words. The Google Books syntactic n-gram corpus contains dependency tree fragments of size 1-5, viz. nodes, arcs, biarcs, triarcs and quadarcs. There is a restriction on the types of quadarcs available in the corpus. Each fragment contains the surface form of the words, their POS tags, the head-modifier relationships and the relative order of the words. It does not contain information about the linear distance between the words in a dependency or the existence of gaps between the words of an n-gram. Counts for all the syntactic n-grams are provided. A few noteworthy points:
  • As with the Books n-gram corpus, temporal information on the syntactic n-grams is available.
  • Additional information for dependency trees involving conjunctions and prepositions is made available. Here, the dependency tree fragments are extended to include the conjunction and preposition tokens, even though they are function words. This information forms the extended component of the corpus (extended-arcs, extended-biarcs, etc.).
  • verbargs-unlex and nounargs-unlex are unlexicalized versions of the syntactic n-grams, in which only the head word and the top-1000 words in the language are lexicalized.
The syntactic n-gram corpus can be very useful for studying lexical semantics, sub-categorization, etc.
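To make the record structure concrete, here is an illustrative parser for one syntactic-ngram line. The layout assumed here (`head_word<TAB>ngram<TAB>total_count<TAB>year,count<TAB>...`, with each token written as `word/POS/dep-label/head-index`) follows the paper's description of the release; check the README of the files you actually use.

```python
def parse_record(line):
    """Parse one syntactic-ngram record into (head_word, tokens, total, years)."""
    fields = line.rstrip("\n").split("\t")
    head_word, ngram, total = fields[0], fields[1], int(fields[2])
    tokens = []
    for tok in ngram.split(" "):
        # rsplit guards against words that themselves contain '/'.
        word, pos, label, head = tok.rsplit("/", 3)
        tokens.append({"word": word, "pos": pos, "label": label,
                       "head": int(head)})   # head index 0 = root of fragment
    years = dict((int(y), int(c)) for y, c in
                 (pair.split(",") for pair in fields[3:]))
    return head_word, tokens, total, years
```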

Saturday, October 19, 2013

Hierarchical Phrase Based models

I read David Chiang's ACL 2005 paper on hierarchical phrase-based models today. A quick summary:

Design Principles:
  • Formal, but not linguistic, i.e. a synchronous CFG is used; however, the grammar learnt may not correspond to a linguistic ('human'?) grammar.
  • Leverage the strengths of phrase-based systems while moving to syntax-based models.

Basic Motivation:

The basic idea is to handle long-distance reorderings that a phrase-based model can't handle.
This is done by introducing a single non-terminal 'X' and having rules of the form:

  X → a X_1 b X_2 c | d X_2 e X_1 f

where the two sides are the source and target sides of the synchronous rule, and the subscripts link corresponding non-terminals, indicating their relative positions after reordering.

In theory, the number of non-terminals on the RHS is not constrained. However, a limitation is that reorderings happening at higher levels of a constituent parse tree may not be captured. The rules learnt by this system are more like lexicalized reordering templates.
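A toy illustration (not Chiang's system) of how one synchronous rule expresses reordering: co-indexed non-terminals must receive the same sub-translations on both sides, so swapping their positions on the target side moves whole phrases. The example rule and fillers below are hypothetical.

```python
def apply_rule(src_side, tgt_side, fillers):
    """Substitute sub-translations for co-indexed non-terminals on both sides.

    `fillers` maps a non-terminal name to a (source_phrase, target_phrase) pair.
    """
    def fill(side, idx):
        return " ".join(fillers[tok][idx] if tok in fillers else tok
                        for tok in side.split())
    return fill(src_side, 0), fill(tgt_side, 1)

# Hypothetical rule X -> <X1 de X2, X2 of X1>: the X1/X2 swap on the target
# side reorders the two phrases around the function word.
fillers = {"X1": ("aozhou", "Australia"),
           "X2": ("shaoshu guojia", "few countries")}
src, tgt = apply_rule("X1 de X2", "X2 of X1", fillers)
```

Because the reordering lives in the rule itself, it applies regardless of how long the filler phrases are, which is exactly what distance-limited phrase-based reordering cannot do.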

Special types of rules used:
  •  Glue rules: top level rule
  •  Entity rules: for translating dates, numbers, etc.

Learning rules

The starting point is the set of phrases learnt by a phrase-based system, called 'initial phrase pairs'. From each initial phrase pair, rules are extracted. To avoid too many rules and to reduce spurious derivations, some heuristics are used. One noteworthy heuristic is that rules are constructed from the smallest possible initial phrase pairs. Another is that each rule can have at most two non-terminals on the RHS. This is done for decoding efficiency, probably because the CYK algorithm expects a grammar in CNF, where every rule has at most two non-terminals.
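The core subtraction step can be sketched as follows: given an initial phrase pair and a smaller phrase pair nested inside it, replace the inner pair with a linked non-terminal on both sides. Spans are (start, end) token indices; this illustrates the idea only, not the paper's full heuristic set.

```python
def subtract(src, tgt, inner_src, inner_tgt, label="X1"):
    """Replace an inner phrase pair with a co-indexed non-terminal.

    src, tgt: token lists of the initial phrase pair.
    inner_src, inner_tgt: (start, end) spans of the nested phrase pair.
    """
    (ss, se), (ts, te) = inner_src, inner_tgt
    src_side = src[:ss] + [label] + src[se:]
    tgt_side = tgt[:ts] + [label] + tgt[te:]
    return src_side, tgt_side
```

When the inner spans sit at different relative positions in the two languages, the extracted rule encodes the reordering directly.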

The model

The model is very similar to the phrase-based model: a log-linear model with the same features, except that the phrase translation probabilities are replaced by rule translation probabilities. The probabilities are learnt in a similar way.


Decoding is done via a CYK variant. The differences from a standard CYK parsing are:
- Parsing is done only for the source language sentence. So far so good.
- There is only one non-terminal. You would expect this to make the parsing easier. However, there is a catch.
- The language model of the target language has to be integrated into the decoder. The paper says "the language model is integrated by intersecting with the target side CFG", which I take to mean that the LM score of the target sub-string generated for a cell in the chart is multiplied in along with the rule weights. This means each cell has to keep track of the rule along with all the target strings that the rule can generate in that span. Each such combination is like a virtual non-terminal, and hence the effective number of non-terminals can be really large, especially for larger spans.
   What I have described here is naive; the journal paper describes different strategies for integrating the language model. I will read up on that and summarize it later.
- The grammar is not in CNF, though every rule still has at most two non-terminals. I guess it is converted to CNF before decoding.
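The plain source-side chart recurrence (the "so far so good" part, with the language model deliberately ignored) can be sketched as below. This is my simplification, with only lexical rules and a single glue-style combination rule, purely to show the CYK structure.

```python
import math

def cyk_best(words, lex_rules, glue_weight=1.0):
    """Best log-score derivation over the source sentence.

    lex_rules: {word: weight} for rules of the form X -> word.
    Adjacent spans are combined by a single X -> X X rule with `glue_weight`.
    """
    n = len(words)
    chart = {}
    for i, w in enumerate(words):                       # width-1 spans
        chart[(i, i + 1)] = math.log(lex_rules.get(w, 1e-9))
    for width in range(2, n + 1):                       # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            chart[(i, j)] = max(chart[(i, m)] + chart[(m, j)]
                                + math.log(glue_weight)
                                for m in range(i + 1, j))
    return chart[(0, n)]
```

The "catch" described above is that once the LM is added, each cell entry must additionally carry its target-side boundary words, multiplying the number of distinct entries per span.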

Another interesting problem is how to find the top-k parses. The journal article describes this in detail too.

Optimizations to decoding

- Limiting the number of entries in a cell of the chart (histogram pruning)
- Pruning entries in a cell whose scores are very low compared to the highest-scoring entry in the cell (threshold pruning)
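The two pruning strategies can be sketched together in a few lines; the parameter values below are illustrative, not Chiang's settings.

```python
def prune_cell(entries, beam_size=30, threshold=5.0):
    """Apply threshold then histogram pruning to one chart cell.

    entries: list of (item, log_score) pairs.
    """
    if not entries:
        return []
    entries = sorted(entries, key=lambda e: e[1], reverse=True)
    best = entries[0][1]
    # Threshold pruning: drop entries far below the best score in the cell.
    kept = [e for e in entries if best - e[1] <= threshold]
    # Histogram pruning: cap the number of surviving entries.
    return kept[:beam_size]
```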

  • Chiang, David. "A hierarchical phrase-based model for statistical machine translation." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.
  • Chiang, David. "Hierarchical phrase-based translation." Computational Linguistics 33.2 (2007): 201-228.