An interesting paper regarding generating top-k translation outputs.
Gimpel, K., Batra, D., Dyer, C., & Shakhnarovich, G. (2013). A Systematic Exploration of Diversity in Machine Translation.EMNLP 2013
This paper discusses:
1) Methods for generating most diverse MT outputs for a SMT system based on a linear decoding model.
2) Applying the top-k diverse outputs to various tasks: (1) system recombination (2) re-ranking top-k lists (3) human post-editing
The motivation for the work is the top-k lists are commonly used in many NLP tasks, including MT for looking at a large set of inputs before making decisions.
The general strategy to get these top-k lists is to get the top-k best outputs. However, often the top-k lists are very similar to each other and therefore have shown mixed results. Hence, the search for a method to get top-k diverse translations.
This is achieved by having a decoding procedure which iteratively generates best translations, one at a time. The decoding objective function adds a term for dissimilarity function which penalizes for similarity with previously generated translations. In this work, the dissimilarity function is simply an language model over sentences already output in previous iterations (however, for sentences in LM the score is negative to penalize). This helps to use the same decoding algorithm as a standard linear decoding function. This method increases the decoding time since a decoding has to be performed for each candidate in the top-k diverse list. The parameters n and λ are tuned with a held-out set.
Using the top-k diverse outputs provides better results than using top-k best lists. This difference is higher for smaller values of k. Also, an interesting analysis provided is which sentences benefit the most from top-k diverse lists. It turns out that sentences with lower BLEU scores (presumably difficult to translate) benefit from using the diverse lists, whereas sentences with high BLEU scores benefit from top-k best lists.
1) Methods for generating most diverse MT outputs for a SMT system based on a linear decoding model.
2) Applying the top-k diverse outputs to various tasks: (1) system recombination (2) re-ranking top-k lists (3) human post-editing
The motivation for the work is the top-k lists are commonly used in many NLP tasks, including MT for looking at a large set of inputs before making decisions.
The general strategy to get these top-k lists is to get the top-k best outputs. However, often the top-k lists are very similar to each other and therefore have shown mixed results. Hence, the search for a method to get top-k diverse translations.
This is achieved by having a decoding procedure which iteratively generates best translations, one at a time. The decoding objective function adds a term for dissimilarity function which penalizes for similarity with previously generated translations. In this work, the dissimilarity function is simply an language model over sentences already output in previous iterations (however, for sentences in LM the score is negative to penalize). This helps to use the same decoding algorithm as a standard linear decoding function. This method increases the decoding time since a decoding has to be performed for each candidate in the top-k diverse list. The parameters n and λ are tuned with a held-out set.
Using the top-k diverse outputs provides better results than using top-k best lists. This difference is higher for smaller values of k. Also, an interesting analysis provided is which sentences benefit the most from top-k diverse lists. It turns out that sentences with lower BLEU scores (presumably difficult to translate) benefit from using the diverse lists, whereas sentences with high BLEU scores benefit from top-k best lists.
A point worth mentioning: While doing top-k re-ranking, one of the features the authors use is a LM score over word classes and this provides very good results. Brown clustering was used to learn the word classes.
With help of confidence scores, a decision can be dynamically made about which of the lists (diverse or best) should be used. There is scope for investigation into more similarity functions.