Friday, September 23, 2011

Incorporating Linguistic Information into SMT Models

(Summary of the chapter 'Integrating Linguistic Information' in Philipp Koehn's textbook 'Statistical Machine Translation')


Traditional phrase-based Statistical Machine Translation (SMT) relies only on the surface form of words, but this can carry you only so far. Without considering any linguistic phenomena, no generalization is possible and the SMT system ends up being little more than a translation memory. Various kinds of linguistic information need to be incorporated into the SMT process, for example:

  • Name transliteration and number/script conversion
  • Morphological phenomena - inflection, compounding, segmentation - which, if not handled, lead to data sparsity
  • Syntactic phenomena like constituent structure, attachment and head-modifier re-ordering. Vanilla SMT is designed to handle local re-ordering, but long-range dependencies are not handled well.

One way to handle them is to pre-process the parallel corpus before training and then run the SMT tools. Pre-processing could include:

  • Transliteration and back-transliteration models need to be incorporated. An important sub-problem is identifying the named entities in the first place.
  • Splitting words of a morphologically rich input language. Compounding and segmentation can be handled similarly.
  • Re-ordering problems can be handled by re-ordering the input-language sentences in a pre-processing step before feeding them to the SMT system. The re-ordering rules can be either handcrafted or learnt from data, and can be shallow (based on POS tags) or full-fledged parse-based; a toy rule is sketched right after this list.
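To make the shallow, rule-based variant concrete, here is a toy pre-ordering rule of my own (not from the book): for an SVO-to-SOV pair like English-Hindi, move the first verb of the clause to the end. The input is assumed to be already POS-tagged; a real system would use many more rules, handcrafted or learnt.

```python
def reorder_svo_to_sov(tagged):
    """Toy pre-ordering rule: move the first verb to the end of the clause.

    tagged: list of (word, POS) pairs, e.g. the output of a POS tagger.
    """
    words = [w for w, _ in tagged]
    verb_positions = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")]
    if not verb_positions:
        return words
    i = verb_positions[0]
    # "Ram ate the mango" -> "Ram the mango ate", mimicking the Hindi word order
    return words[:i] + words[i + 1:] + [words[i]]

print(reorder_svo_to_sov([("Ram", "NNP"), ("ate", "VBD"), ("the", "DT"), ("mango", "NN")]))
# ['Ram', 'the', 'mango', 'ate']
```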

Similarly, some work may be done on the post-processing side:

  • If the output language is morphologically complex, then morphological generation can take place in a post-processing step after SMT. This assumes that the SMT system has produced enough information for the output morphology to be generated.
  • Alternatively, in order to ensure grammaticality of the output, we can re-rank the candidate translations on the output side based on syntactic features like agreement and parse correctness (a toy re-ranker is sketched after this list). Note that a distinction has to be made between parse quality as defined for parsing per se and the notion of correctness required for MT systems.
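To make the re-ranking idea concrete, here is a minimal sketch of my own (the feature and the weights are made up): each candidate in an n-best list gets a combined score of the baseline model score and a syntactic feature, and the highest-scoring candidate wins. The agreement_score function is only a placeholder for a real agreement or parse-quality checker.

```python
def agreement_score(candidate):
    """Placeholder syntactic feature; a real implementation would check
    subject-verb agreement or parse quality of the candidate translation."""
    return 0.0

def rerank(nbest, w_model=1.0, w_syntax=0.5):
    """nbest: list of (candidate_translation, model_score) pairs."""
    return max(nbest, key=lambda c: w_model * c[1] + w_syntax * agreement_score(c[0]))

print(rerank([("candidate one", -3.2), ("candidate two", -2.9)]))
# ('candidate two', -2.9): with a constant placeholder feature, the model score decides
```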

The problem with such pre-processing and post-processing components is that they are themselves prone to error. The errors of the different components are not handled in one integrated framework, and hard decisions have to be made at each boundary. A probabilistic approach which incorporates all these pre- and post-processing components would be a cleaner and more elegant solution. That is the motivation behind the factored translation model. In this model, the factors are annotations on the input and output words (e.g. morphology or POS factors). Translation and generation functions are defined over the factors, and these are integrated using a log-linear model. This provides a principled way to test a diverse set of features in a structured manner. Of course, the phrase translation table now grows in size, but this can be handled by using pre-compiled data structures. Decoding could also blow up, but pruning can be used to cut down the search space.
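For reference, the log-linear combination mentioned above has the standard form used throughout the book: a set of feature functions h_i (translation, generation and language model components defined over the factors) is combined with weights λ_i:

$$
p(\mathbf{e} \mid \mathbf{f}) \;=\; \frac{\exp\Big(\sum_{i=1}^{n} \lambda_i \, h_i(\mathbf{e}, \mathbf{f})\Big)}{\sum_{\mathbf{e}'} \exp\Big(\sum_{i=1}^{n} \lambda_i \, h_i(\mathbf{e}', \mathbf{f})\Big)}
$$

The denominator is only needed for normalization; during decoding it is enough to maximize the numerator.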

Language Divergence between English and Hindi

Comparing two languages is interesting, especially for an application like machine translation. Languages exhibit so many differences that it is mind-boggling to realize how easily we navigate between them. This paper, 'Interlingua-based English–Hindi Machine Translation and Language Divergence', summarizes the major differences between Hindi and English.

I have tried to summarize the observations in the paper below, to make a handy reference:


  • Word order: English is Subject-Verb-Object ("Ram ate the mango"); Hindi is Subject-Object-Verb ("राम ने आम खाया").
  • Modifiers: English uses post-modifiers, Hindi uses pre-modifiers: "The Prime Minister of India" vs. "भारत का प्रधान मंत्री"; "play well" vs. "अच्छे से खेलेंगे".
  • X-positions: English has prepositions ("of India"), Hindi has postpositions ("भारत का"). English prepositions are also overloaded: "with" marks accompaniment in "John ate rice with curd" but an instrument in "John ate rice with a spoon".
  • Compound and conjunct verbs: not prevalent in English, very common in Hindi (e.g. "वह गाने लगे", "रुक जाओ").
  • Respect: English has no special words for indicating respect; Hindi does (e.g. "आप", "हम").
  • Person: Hindi uses the second person (honorific) for the third person: "He obtained his degree" vs. "आपने अम्रीका से डिग्री प्राप्त की".
  • Gender: English has masculine, feminine and neuter; Hindi has only masculine and feminine.
  • Gender-specific possessive pronouns: English has them ("he", "she"); Hindi lacks them ("वह").
  • Morphology: English is morphologically poor; Hindi is morphologically rich.
  • Null subject divergence: Hindi drops the subject under certain conditions: "There was a king" vs. "एक राजा था"; "I am going" vs. "जा रहा हूँ".
  • Pleonastic divergence: the pleonastic subject is dropped in Hindi: "It is raining" vs. "बारिश हो रही है".
  • Conflational divergence: no single appropriate word exists in the other language: "Brutus stabbed Caesar" vs. "ब्रूटस ने सीसर को छुरे से मारा".
  • Categorical divergence: the POS category changes: "They are competing" vs. "वे मुकाबला कर रहे है".
  • Head swapping: the head and its modifier are exchanged: "The play is on" vs. "खेल चल रहा है".

Wednesday, September 21, 2011

Aligning Sentences to build a parallel corpus

This is a really old paper, from Gale & Church, on building a sentence-aligned parallel corpus from a corpus that is not sentence-aligned. A dynamic programming formulation with a novel distance measure is used to align the sentences. For a method this simple, the reported results on the Hansards corpus are impressive. Of course, the input corpus is already paragraph-aligned.

The basic premise is simple: sentences with fewer characters in one language correspond to sentences with fewer characters in the other language, and likewise for longer sentences. Based on this idea, the distance between two sentences is defined via a random variable X: the number of characters in language L2 per character of language L1.
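To make the distance concrete, here is a minimal sketch of the cost for the one-to-one match case, following the paper's normal model of the length difference. The constants below (c ≈ 1, a variance parameter of about 6.8, and the prior for a 1-1 match) are the values reported in the paper; treat the exact numbers as assumptions to be re-estimated per language pair.

```python
import math

def gale_church_cost(len1, len2, c=1.0, s2=6.8, prior_1to1=0.89):
    """-log probability that sentences of len1 and len2 characters form a 1-1 match.

    The length difference len2 - c*len1 is modelled as normal with variance s2*len1.
    """
    delta = (len2 - c * len1) / math.sqrt(s2 * max(len1, 1))
    # two-sided tail probability of a standard normal
    tail = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    tail = max(tail, 1e-300)  # guard against log(0)
    return -math.log(prior_1to1) - math.log(tail)

# similar lengths give a low cost; very different lengths give a high cost
print(gale_church_cost(50, 52), gale_church_cost(50, 120))
```

The full algorithm then finds the minimum-cost alignment over these pairwise costs (plus costs for 1-0, 2-1, 1-2 and 2-2 matches) with dynamic programming.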

I tried to see the behavior of this variable for the English-Hindi language pair. On a 14,000-sentence parallel corpus, here are the results:

mean(X): 0.99, i.e. almost one Hindi character per English character, which is in agreement with the paper's claims. Interestingly, if whitespace is not counted, the mean drops to 0.96.
variance(X): 0.01979136 - very low, so the mean is very reliable. A linear fit can't get better than this: 
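These statistics can be reproduced with a few lines of Python; the file names below are placeholders for a sentence-aligned English-Hindi corpus, one sentence per line.

```python
ratios = []
with open("corpus.en", encoding="utf-8") as en, open("corpus.hi", encoding="utf-8") as hi:
    for e, h in zip(en, hi):
        e, h = e.strip(), h.strip()
        if e:                                # skip empty English lines
            ratios.append(len(h) / len(e))   # Hindi characters per English character

mean = sum(ratios) / len(ratios)
variance = sum((r - mean) ** 2 for r in ratios) / len(ratios)
print(mean, variance)
```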



NLTK provides an implementation of the Gale-Church alignment algorithm. I tried running it on a perfectly parallel corpus, but the algorithm ends up misaligning the sentences. Reducing mean(X) to 0.9 also did not help. Wonder what's going on?
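For anyone who wants to poke at the same behaviour: the relevant module is nltk.translate.gale_church, and the snippet below is my best guess at how it is meant to be called. The function name align_blocks and the convention of passing sentence lengths in characters are assumptions based on my reading of the module, so double-check against your NLTK version.

```python
from nltk.translate import gale_church

# Character lengths of the sentences in one source paragraph and one target paragraph.
src_lengths = [len(s) for s in ["Ram ate the mango.", "I am going."]]
tgt_lengths = [len(s) for s in ["राम ने आम खाया।", "जा रहा हूँ।"]]

# Expected to return (source_index, target_index) pairs of the proposed alignment.
print(gale_church.align_blocks(src_lengths, tgt_lengths))
```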