Friday, September 23, 2011

Incorporating Linguistic Information into SMT Models

(Summary of the chapter 'Integrating Linguistic Information' in Philipp Koehn's textbook 'Statistical Machine Translation')

Traditional phrase-based Statistical Machine Translation (SMT) relies only on the surface form of words, but this can carry you only so far. Without modeling any linguistic phenomena, no generalization is possible and the SMT system ends up being little more than a translation memory. Various kinds of linguistic information need to be incorporated into the SMT process, such as: 

  • Name transliteration and number script conversion
  • Morphological phenomena - inflections, compounding, segmentation - which, if not handled, lead to data sparsity problems
  • Syntactic phenomena like constituent structure, attachment, and head-modifier re-ordering. Vanilla SMT is designed to handle local re-orderings, but long-range dependencies are not handled well. 

One way to handle these phenomena is to pre-process the parallel corpus before training and then run the standard SMT tools. Pre-processing could include:

  • Incorporating transliteration and back-transliteration models. An important sub-problem is identifying the named entities in the first place.
  • Splitting words of a morphologically rich input language. Compounding and segmentation can be handled similarly. 
  • Re-ordering problems can be handled by re-ordering the input-language sentences in a pre-processing step before feeding them to the SMT system. The re-ordering rules can be either handcrafted or learnt from data, and can be shallow (POS-tag-based rules) or full-fledged parse-based. 
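As a toy illustration of the shallow, POS-tag-based variant, here is a hypothetical pre-processing rule (my own simplification, not from the chapter) that re-orders a simple English SVO clause into SOV order for a target language like Hindi:

```python
# Hypothetical POS-based reordering rule (an illustration, not from the
# chapter): for an SOV target language like Hindi, move the first verb of
# a simple English SVO clause to the end, after its object.

def reorder_svo_to_sov(tagged):
    """tagged: list of (word, POS) pairs for one simple clause."""
    words = [w for w, _ in tagged]
    verbs = [i for i, (_, pos) in enumerate(tagged) if pos.startswith("VB")]
    if not verbs:
        return words  # no verb found: leave the clause untouched
    v = verbs[0]
    # subject ... (verb) object  ->  subject object verb
    return words[:v] + words[v + 1:] + [words[v]]

print(reorder_svo_to_sov(
    [("Ram", "NNP"), ("ate", "VBD"), ("the", "DT"), ("mango", "NN")]))
```

So "Ram ate the mango" becomes "Ram the mango ate", matching the Hindi word order राम ने आम खाया. A real system would of course need many such rules, or would learn them from data.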

Similarly, some work may be done on the post-processing side: 

  • If the output language is morphologically complex, then morphological generation can take place in a post-processing step after SMT. This assumes that the SMT system has produced enough information to generate the output morphology from.
  • Alternatively, to ensure the grammaticality of the output sentences, we can re-rank the candidate translations on the output side based on syntactic features like agreement and parse correctness. Note that a distinction has been made between parse quality as defined for parsing and parse quality as required for MT systems. 

The problem with such pre-processing and post-processing components is that they are themselves prone to error. The system does not handle errors in all components in an integrated framework, and it necessitates hard decisions at component boundaries. A probabilistic approach that incorporates all these pre- and post-processing components would be cleaner and more elegant. That is the motivation behind the factored translation model. In this model, the factors are annotations on the input and output words (e.g. morphology and POS factors). Translation and generation functions are defined over the factors, and these are integrated using a log-linear model. This provides a structured way to test a diverse set of features. Of course, the phrase translation table will now grow, but this can be handled by using pre-compiled data structures. Decoding could also blow up, but pruning can be used to cut the search space.
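As a concrete illustration of the glue holding the factored components together, here is a minimal sketch of a log-linear model combining translation, generation, and language-model feature functions. All feature names, values, and weights below are invented for illustration:

```python
import math

def loglinear_score(features, weights):
    """Log-linear model: score = sum_i lambda_i * h_i."""
    return sum(weights[name] * value for name, value in features.items())

# Invented feature values for one translation hypothesis: each factor
# (surface, POS, morphology) contributes its own log-probability feature.
hypothesis = {
    "lex_translation":  math.log(0.4),  # surface phrase translation
    "pos_translation":  math.log(0.7),  # POS-factor translation
    "morph_generation": math.log(0.9),  # generation: lemma + POS -> surface
    "language_model":   math.log(0.2),  # target-side LM
}
weights = {"lex_translation": 1.0, "pos_translation": 0.5,
           "morph_generation": 0.5, "language_model": 1.2}

print(loglinear_score(hypothesis, weights))
```

The decoder then searches for the hypothesis maximizing this score; new factors simply become new feature functions with their own weights, which is what makes the framework easy to extend.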

Language Divergence between English and Hindi

Comparing two languages is interesting, especially for an application like machine translation. Languages exhibit so many differences that it is mind-boggling to realize that we navigate between them with ease. The paper 'Interlingua-based English–Hindi Machine Translation and Language Divergence' summarizes the major differences between Hindi and English.

I have tried to summarize the observations in the paper below, to make a handy reference:

  • Word order: English is Subject-Verb-Object ("Ram ate the mango"); Hindi is Subject-Object-Verb (राम ने आम खाया).
  • Modifiers: English uses post-modifiers ("The Prime Minister of India"); Hindi uses pre-modifiers (भारत का प्रधान मंत्री, and अच्छे से खेलेंगे for "play well").
  • X-positions: English has prepositions ("of India"); Hindi has postpositions (भारत का). The same English preposition can also play different roles, as in "John ate rice with curd" vs. "John ate rice with a spoon".
  • Compound verbs: not prevalent in English; very common in Hindi (वह गाने लगे, रुक जाओ).
  • Conjunct verbs: not prevalent in English; very common in Hindi.
  • Respect: English has no special words; Hindi has words indicating respect (आप, हम) and uses the second person for the third person: "He obtained his degree" becomes आपने अम्रीका से डिग्री प्राप्त की.
  • Gender: English has masculine, feminine, and neuter; Hindi has only masculine and feminine.
  • Gender-specific pronouns: English has them ("he", "she"); Hindi lacks them (वह serves for both).
  • Morphology: English is morphologically poor; Hindi is rich.
  • Null-subject divergence: the subject is dropped in Hindi under certain conditions: "There was a king" becomes एक राजा था; "I am going" becomes जा रहा हूँ.
  • Pleonastic divergence: the pleonastic subject is dropped in Hindi: "It is raining" becomes बारिश हो रही है.
  • Conflational divergence: Hindi has no appropriate single word: "Brutus stabbed Caesar" becomes ब्रूटस ने सीसर को छुरे से मारा.
  • Categorical divergence: the POS category changes: "They are competing" becomes वे मुकाबला कर रहे है.
  • Head swapping: the head and modifier are exchanged: "The play is on" becomes खेल चल रहा है.

Wednesday, September 21, 2011

Aligning Sentences to build a parallel corpus

This is a really old paper, from Gale & Church, on building a sentence-aligned parallel corpus from a misaligned corpus. A dynamic programming formulation with a novel distance measure is used to align the sentences. For a method as naive as this, the reported results on the Hansards corpus are impressive. Note, though, that the input corpus is assumed to be paragraph-aligned. 

The basic premise is simple: sentences containing fewer characters in one language contain fewer characters in the other language, and correspondingly for longer sentences. Based on this idea, the distance between two sentences is defined via a random variable X: the number of characters in language L2 per character of language L1. 
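A minimal sketch of the resulting match cost: the deviation of the observed length pair from the expected ratio is converted into a cost via a normal approximation. The default parameter values below are illustrative (s2 = 6.8 is the paper's variance estimate; c = 1.0 is a round stand-in for ratios like the 0.99 measured below):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gale_church_cost(l1, l2, c=1.0, s2=6.8):
    """Cost of aligning a sentence of l1 characters (language L1)
    with one of l2 characters (language L2).
    c  : expected L2 characters per L1 character
    s2 : variance of that ratio (6.8 is the paper's estimate)"""
    if l1 == 0 and l2 == 0:
        return 0.0
    l1 = max(l1, 1)  # guard; the full algorithm handles 0-1 beads via priors
    delta = (l2 - l1 * c) / math.sqrt(l1 * s2)
    # two-tailed probability of a deviation at least this large under N(0, 1)
    p = 2.0 * (1.0 - norm_cdf(abs(delta)))
    return -math.log(p) if p > 0 else float("inf")

# similar lengths are cheap to align; very different lengths are expensive
print(gale_church_cost(100, 99), gale_church_cost(100, 180))
```

The full algorithm combines these costs with prior probabilities for 1-1, 1-0, 0-1, 2-1, 1-2, and 2-2 sentence beads, and finds the minimum-cost alignment by dynamic programming.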

I tried to see the behavior of this variable for the English-Hindi language pair. On a 14000 sentence parallel corpus, here are the results: 

mean(X): 0.99, i.e. almost one Hindi character per English character, in agreement with the paper's claims. Interestingly, if whitespace is not counted, the mean drops to 0.96. 
variance(X): 0.01979136, which is very low, so the mean is very reliable. A linear fit can't get much better than this: 

NLTK provides an implementation of the Gale-Church alignment algorithm. I tried running it on a perfectly parallel corpus, but the algorithm ends up misaligning the sentences. Reducing mean(X) to 0.9 did not help either. I wonder what's going on. 

Wednesday, August 31, 2011

Watson - The Quiz Champion

You must have heard of IBM's Watson system. It is, of course, the computer that won the Jeopardy competition against the show's previous champions. Jeopardy is a popular quiz show in which the competitors are provided clues and have to give questions that satisfy those clues. For example, a clue like 'This computer beat the reigning world chess champion' would elicit the question 'Who is Deep Blue?'. Since the questions given by the competitors are simple ones of the form 'What is' or 'Who is', the Jeopardy question-answer format can be considered like that of any other quiz show. The clues, however, are complex, covering a wide array of topics, and can include puns, puzzles, and maths. The competitors also place bets on each question. Competing at 'Jeopardy' thus requires the right combination of 'natural language understanding, broad knowledge, confidence and strategy'.

Watson's victory thus represents a major milestone for natural language processing, and particularly for the sub-area known as 'Question Answering'. Question-answering systems have great practical use in building expert systems, customer support systems, decision-making tools, and enterprise search systems.

Watch Watson's winning performance here:

This paper, Building Watson: An Overview of the DeepQA Project, from IBM provides an overview of Watson and the DeepQA architecture that underlies it. The DeepQA architecture defines a framework for developing QA systems in an extensible and modular manner, allowing different components to be customized and robust QA systems to be built that can be ported across domains. Figure 1 shows a high-level diagram of Watson's major components and how queries are routed through them.

  1. Query Analysis: This is the first stage, where the input clue is analyzed to determine the question category (puzzle, pun, mathematical, numeric, logical, etc.) and the answer type (person, location, organization, etc.). Complex clues are also decomposed into simpler clues.

  2. Hypothesis Generation: Watson has at its disposal many sources of information: encyclopedias, books, lists of things like people and countries, and so on. Watson does not attempt to get the correct answer straightaway. Instead, it first focuses on generating as many candidate answers, called 'hypotheses', as possible. This ensures that good answers are not missed in the pursuit of the perfect answer; the aim at this stage is to maximize recall.

  3. Soft Filtering: Watson may generate hundreds or thousands of hypotheses, which would then have to be analyzed in detail to find the correct answer. To limit this deep analysis to only the most promising candidates, Watson filters out bad candidates using lightweight techniques, such as checking for a mismatch between the expected and candidate answer types.

  4. Hypothesis and Evidence Scoring: Now Watson does a deep analysis of the candidate answers, employing sophisticated linguistic and statistical techniques to gather evidence for each hypothesis. This is one of the most critical parts of Watson, since the evidence collected determines how good an answer is and how confident Watson can be about it.

  5. Merging and Ranking: Once the evidence is collected, confidence scores are generated for each candidate and the candidates are ranked. Then, looking at the top answer's confidence level, Watson decides whether or not it should answer the question.
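The five stages above can be sketched as a simple pipeline. This is purely a toy illustration (every function name and the scoring scheme below are invented, not IBM's code), but it shows how an abstention threshold on the final confidence fits in:

```python
# Toy sketch of the five-stage DeepQA flow; all names and logic invented.

def analyze(clue):
    # 1. Query analysis: a toy answer-type detector
    return "computer" if "computer" in clue.lower() else "person"

def generate(clue, sources):
    # 2. Hypothesis generation: cast a wide net, favouring recall
    return list(sources)

def soft_filter(hypothesis, answer_type):
    # 3. Soft filtering: cheap check on the expected answer type
    return hypothesis["type"] == answer_type

def score_evidence(hypothesis, clue):
    # 4. Evidence scoring: toy overlap between clue words and the
    #    candidate's supporting text
    clue_words = set(clue.lower().split())
    text_words = set(hypothesis["text"].lower().split())
    return len(clue_words & text_words) / len(clue_words)

def answer(clue, sources, threshold=0.5):
    # 5. Merging and ranking, with abstention below the threshold
    answer_type = analyze(clue)
    candidates = [h for h in generate(clue, sources)
                  if soft_filter(h, answer_type)]
    scored = sorted(((h, score_evidence(h, clue)) for h in candidates),
                    key=lambda pair: pair[1], reverse=True)
    if scored and scored[0][1] >= threshold:
        return scored[0]
    return None  # abstain rather than risk a wrong answer

sources = [
    {"name": "Deep Blue", "type": "computer",
     "text": "Deep Blue computer beat chess champion Garry Kasparov"},
    {"name": "Garry Kasparov", "type": "person",
     "text": "Garry Kasparov reigning world chess champion"},
]
best = answer("This computer beat the reigning world chess champion", sources)
print(best[0]["name"])
```

The abstention step matters because Jeopardy penalizes wrong answers: a low-confidence answer is worse than no answer at all.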

Figure 1: DeepQA Architecture (Source: The IBM paper)

The flexibility in the DeepQA architecture is achieved through the use of the UIMA text analysis framework. At one point in the trials, Watson was taking about two hours to generate an answer. The solution was to parallelize Watson with UIMA-AS, which brought the response time down to the quiz show's average of 2 to 5 seconds. The improvement in accuracy is even more startling. When the IBM team started working on Watson, the gap between the show's participants and early prototypes of Watson was huge. Figure 2 depicts the evolution of Watson's performance: it started from a baseline where precision and recall were nowhere near the cloud of points corresponding to actual human competitors, but gradually reached human-level performance.

Figure 2: Watson's accuracy over time (Source: The IBM paper)

What enabled Watson to reach this level of performance? Many of the underlying analysis algorithms aren't new; they have been around in the research community for a long time. More than groundbreaking original research, it is pragmatic engineering that lies at the core of Watson's success. The following are the salient contributing factors:

  • Building an end-to-end system: Very early on, the team built a baseline end-to-end system and then kept iterating on and improving it. They defined end-to-end evaluation metrics that captured the performance of the system as a whole, rather than focusing only on individual component accuracies in the initial stages. This helped them make the correct trade-offs.

  • Pervasive confidence estimation: Every component in Watson gives a confidence estimate along with its response. This is critical: the confidence scores can be aggregated to obtain the final confidence in an answer, and they allow easy integration of components of varying accuracy. No component is assumed to be perfect; instead, each makes available its confidence estimate for its answers.

  • Many experts: There may be competing algorithms for the same task. Rather than using only the best one, the system uses multiple algorithms so as to get diverse results and evidence. The confidence estimates help to blend the diverse results.

  • Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.

  • Massive parallelism: As mentioned, exploiting massive parallelism allows looking through a large number of hypotheses.

(PS: Cross-posted from my Peepaal blog post)

Wednesday, July 20, 2011

Statistical Machine Translation - IBM Models

At CFILT, a few of us have been working on understanding the IBM Models thoroughly. The IBM paper on SMT is a classic and seminal paper in the history of Machine Translation, and a must-read for anybody wanting to work in this area. It's not an easy read, and we spent quite a lot of time figuring out how the estimation results are derived. Some notes grew out of this discussion, working out in detail the steps missing in the original paper. Hopefully they will be useful. The scanned notes of the estimation for Model 1 and Model 2 can be found here. They are not a replacement for the original paper, but are meant to supplement it. Thanks to Mitesh for helping out with the key steps in the derivation.
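For readers who want to see the estimation in action, here is a minimal EM sketch for the IBM Model 1 lexical translation probabilities t(f|e), ignoring the NULL word and using a toy corpus (my own illustration to accompany the notes, not code from the paper):

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (source_sentence, target_sentence) token lists.
    Returns lexical translation probabilities t[(f, e)] = t(f|e)."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # uniform initialization over the target vocabulary
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # normalizer per source word e
        for es, fs in corpus:
            for f in fs:
                # E-step: distribute f's count over all alignments to es
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate t(f|e) from the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

corpus = [(["the", "house"], ["das", "haus"]),
          (["the", "book"], ["das", "buch"]),
          (["a", "book"], ["ein", "buch"])]
t = train_model1(corpus)
print(round(t[("das", "the")], 3))  # 'das' emerges as the translation of 'the'
```

Even on this three-sentence corpus, EM pigeonholes the co-occurrences so that t(das|the) dominates t(haus|the) and t(buch|the), which is the key behaviour the derivation in the notes explains.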

You can find the notes here

Update: Finally, I have created a PDF of the notes for the Model 1 derivation. You can find it here. A few slides introducing SMT can be found here.