Saturday, November 23, 2013

A Systematic Exploration of Diversity in Machine Translation - Paper Summary



An interesting paper on generating top-k translation outputs.

Gimpel, K., Batra, D., Dyer, C., & Shakhnarovich, G. (2013). A Systematic Exploration of Diversity in Machine Translation. EMNLP 2013.

This paper discusses:

1) Methods for generating the most diverse MT outputs for an SMT system based on a linear decoding model.
2) Applying the top-k diverse outputs to various tasks: (1) system recombination, (2) re-ranking of top-k lists, (3) human post-editing.

The motivation for this work is that top-k lists are commonly used in many NLP tasks, including MT, to consider a large set of candidates before making a decision. The usual strategy is to take the k best outputs of the system. However, the entries in a top-k best list are often very similar to each other, and such lists have therefore shown mixed results; hence the search for a method to obtain k diverse translations.

This is achieved with a decoding procedure that iteratively generates the best translations, one at a time. The decoding objective adds a dissimilarity term that penalizes similarity with previously generated translations. In this work, the dissimilarity function is simply a language model over the sentences output in previous iterations, with its score negated so that hypotheses resembling earlier outputs are penalized. This allows the same decoding algorithm as for a standard linear model to be reused. The method increases decoding time, since one decoding pass is needed for each candidate in the top-k diverse list. The parameters n and λ are tuned on a held-out set.
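To make the procedure concrete, here is a minimal Python sketch. The decode(source, dissim, lam) hook is hypothetical (standing in for the underlying linear-model decoder), and the dissimilarity term uses simple n-gram counts over previous outputs rather than the paper's exact LM formulation:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of orders 1..n from a token list."""
    return [tuple(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

def make_dissimilarity(previous_outputs, n):
    """Dissimilarity term built from earlier outputs: a hypothesis is penalized
    in proportion to how many of its n-grams were already produced (a crude,
    count-based stand-in for the paper's LM-style formulation)."""
    seen = Counter()
    for sent in previous_outputs:
        seen.update(ngrams(sent.split(), n))
    return lambda hyp_tokens: sum(seen[g] for g in ngrams(hyp_tokens, n))

def diverse_decode(source, decode, k, n=2, lam=0.5):
    """Generate k diverse translations, one decoder call per output.
    `decode(source, dissim, lam)` is a hypothetical hook that returns the best
    string under the usual linear model minus lam * dissim(hypothesis tokens)."""
    outputs = []
    for _ in range(k):
        dissim = make_dissimilarity(outputs, n)
        outputs.append(decode(source, dissim, lam))
    return outputs
```

Note that each call to diverse_decode costs k full decoder runs, which is exactly why the method increases decoding time compared to extracting a k-best list from a single pass.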

Using the top-k diverse lists gives better results than using top-k best lists, and the difference is larger for smaller values of k. An interesting analysis is also provided of which sentences benefit the most from the diverse lists: it turns out that sentences with lower BLEU scores (presumably harder to translate) benefit from the diverse lists, whereas sentences with high BLEU scores benefit from top-k best lists.

A point worth mentioning: for top-k re-ranking, one of the features the authors use is an LM score over word classes, and it provides very good results. Brown clustering was used to learn the word classes.
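As a rough illustration (not the authors' implementation), such a feature can be computed by mapping each word to its Brown cluster and scoring the resulting class sequence with a class-level LM. The cluster-file format and the class_bigram_logprob table below are assumptions:

```python
import math

def load_brown_clusters(path):
    """Read a word -> cluster-id map from a Liang-style clusters file
    (tab-separated: bit-string cluster, word, count). The path and exact
    format are assumptions, not something specified in the paper."""
    word2class = {}
    with open(path) as f:
        for line in f:
            cluster, word, _count = line.rstrip("\n").split("\t")
            word2class[word] = cluster
    return word2class

def class_lm_feature(tokens, word2class, class_bigram_logprob, unk="<unk>"):
    """Re-ranking feature: log-probability of the word-class sequence under a
    class bigram LM. `class_bigram_logprob` is assumed to map
    (previous_class, class) -> log-probability, estimated elsewhere."""
    classes = ["<s>"] + [word2class.get(w, unk) for w in tokens] + ["</s>"]
    return sum(class_bigram_logprob.get(bigram, math.log(1e-10))
               for bigram in zip(classes, classes[1:]))
```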

With the help of confidence scores, a decision could be made dynamically about which of the two lists (diverse or best) should be used. There is also scope for investigating more dissimilarity functions.

Tuesday, November 12, 2013

Large Data sources for NLP from Google

Google has made available two large and rich data sources for NLP research. These have been described in the following papers:
  • Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2011.
  • Yoav Goldberg and Jon Orwant. A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. *SEM 2013.

These resources have been created from the Google Books corpus, which is an outcome of Google's efforts to scan all the world's books. I will just highlight the important points from these papers in this post.

Google N-gram corpus
This is a traditional n-gram corpus, in which frequency counts are provided for 1-gram to 5-gram strings. However, there are a couple of additional features:
  • One is the temporal aspect of the n-grams, i.e. for each n-gram, the frequency counts are given per year. For English, the counts are available from the 16th century onwards.
  • Frequency counts are also available for extended n-grams, where the extension is in terms of POS tags. All the data has been POS-tagged with a tagset of 12 basic tags. This makes queries of the following forms possible (a small filtering sketch follows this list):
    • the burnt_NOUN car (combination of POS tag and token queries)
    • _DET_ _NOUN_ (queries involving determiners only)
          There are some restrictions on the 4-grams and 5-grams available, in order to prevent a combinatorial explosion.
  • Information on head-modifier relations is also available, though the relation type is not specified.
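As an aside, the raw data files can be filtered directly for such POS-annotated queries. The sketch below assumes the tab-separated layout ngram, year, match_count, volume_count and a hypothetical shard file name; both should be checked against the actual downloads:

```python
from collections import defaultdict

def counts_by_year(path, pattern):
    """Sum yearly match counts for n-grams containing `pattern`, e.g. 'burnt_NOUN'.
    Assumes each line is tab-separated: ngram, year, match_count, volume_count
    (verify this layout against the actual data files)."""
    totals = defaultdict(int)
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            ngram, year, match_count = fields[0], fields[1], fields[2]
            if pattern in ngram.split():
                totals[int(year)] += int(match_count)
    return dict(totals)

# Hypothetical usage: yearly counts of "burnt" tagged as a noun in a 3-gram shard
# counts = counts_by_year("eng-all-3gram-shard.txt", "burnt_NOUN")
```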
You can use the Google N-gram viewer to query this resource interactively. The corpus has been used to study the evolution of culture over time, and can be used for a variety of such temporal studies, e.g. in economics, language change, etc.

Google Syntactic N-gram corpus
While traditional n-grams consist of sequentially adjacent words, a syntactic n-gram is defined as a set of words involved in dependency relationships, and an order-n syntactic n-gram contains n content words. The Google Books syntactic n-gram corpus contains dependency tree fragments of sizes 1-5, viz. nodes, arcs, biarcs, triarcs and quadarcs. There is a restriction on the types of quadarcs available in the corpus. Each fragment contains the surface forms of the words, their POS tags, the head-modifier relations and the relative order of the words; it does not contain information about the linear distance between the words in a dependency or the existence of gaps between words in the n-gram. Counts for all the syntactic n-grams are provided (a parsing sketch is given at the end of this post). A few noteworthy points:
  • As with the Books n-gram corpus, temporal information on the syntactic n-grams is available.
  • Additional information is made available for dependency trees involving conjunctions and prepositions: the dependency tree fragments are extended to include the conjunctions and prepositions, even though they are function words. This information forms the extended component of the corpus (extended-arcs, extended-biarcs, etc.).
  • verbargs-unlex and nounargs-unlex are unlexicalized versions of the syntactic n-grams, in which only the head word and the top-1000 words in the language are kept lexicalized.
The syntactic n-gram corpus can be very useful for studying lexical semantics, sub-categorization, etc.
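For anyone who wants to work with the raw files, here is a hedged parsing sketch. It assumes the line layout described in the Goldberg and Orwant paper (head word, the n-gram itself, a total count, and then year,count pairs, with each token encoded as word/pos/dep-label/head-index); verify this against the released files before relying on it:

```python
def parse_syntactic_ngram(line):
    """Parse one line of the arcs/biarcs/... files into a dictionary.
    Assumed layout (per Goldberg & Orwant 2013; verify against the files):
    head_word <TAB> ngram <TAB> total_count <TAB> year,count <TAB> year,count ...
    where each ngram token is encoded as word/pos/dep-label/head-index."""
    fields = line.rstrip("\n").split("\t")
    head_word, ngram, total_count = fields[0], fields[1], int(fields[2])
    tokens = []
    for tok in ngram.split(" "):
        word, pos, dep, head = tok.rsplit("/", 3)
        tokens.append({"word": word, "pos": pos, "dep": dep, "head": int(head)})
    counts = {int(year): int(count)
              for year, count in (pair.split(",") for pair in fields[3:])}
    return {"head_word": head_word, "tokens": tokens,
            "total_count": total_count, "counts_by_year": counts}
```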