Friday, September 23, 2011

Incorporating Linguistic Information into SMT Models

(Summary of the chapter 'Integrating Linguistic Information' in Philipp Koehn's textbook 'Statistical Machine Translation')


Traditional phrase-based Statistical Machine Translation (SMT) relies only on the surface form of words, but this can carry you only so far. Without considering any linguistic phenomena, no generalization is possible and the SMT system ends up being little more than a translation memory. Various kinds of linguistic information need to be incorporated into the SMT process, for example:

  • Name transliteration and number/script conversion
  • Morphological phenomena - inflection, compounding, segmentation - which, if not handled, lead to data sparsity
  • Syntactic phenomena like constituent structure, attachment and head-modifier re-ordering. Vanilla SMT is designed to handle local re-ordering, but long-range dependencies are not handled well.

One way to handle them is to pre-process the parallel corpus before training and then run the SMT tools. Pre-processing could include:

  • Transliteration and back-transliteration models need to be incorporated. An important sub-problem is identifying the named entities in the first place.
  • Splitting words of a morphologically rich input language. Compounding and segmentation can be handled similarly.
  • Re-ordering problems can be handled by re-ordering the input-language sentences in a pre-processing step before feeding them to the SMT system. The re-ordering rules can be either handcrafted or learnt from data, and can be shallow (based on POS tags) or full-fledged parse-based; a toy rule is sketched right after this list.
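To make the shallow, rule-based variant concrete, here is a toy pre-ordering rule of my own (not from the book): for an SVO-to-SOV pair like English-Hindi, move the first verb of the clause to the end. The input is assumed to be already POS-tagged; a real system would use many more rules, handcrafted or learnt.

```python
def reorder_svo_to_sov(tagged):
    """Toy pre-ordering rule: move the first verb to the end of the clause.

    tagged: list of (word, POS) pairs, e.g. the output of a POS tagger.
    """
    words = [w for w, _ in tagged]
    verb_positions = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")]
    if not verb_positions:
        return words
    i = verb_positions[0]
    # "Ram ate the mango" -> "Ram the mango ate", mimicking the Hindi word order
    return words[:i] + words[i + 1:] + [words[i]]

print(reorder_svo_to_sov([("Ram", "NNP"), ("ate", "VBD"), ("the", "DT"), ("mango", "NN")]))
# ['Ram', 'the', 'mango', 'ate']
```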

Similarly, some work may be done on the post-processing side:

  • If the output language is morphologically complex, then morphological generation can take place in a post-processing step after SMT. This assumes that the SMT system has produced enough information for the output morphology to be generated.
  • Alternatively, in order to ensure grammaticality of the output, we can re-rank the candidate translations on the output side based on syntactic features like agreement and parse correctness (a toy re-ranker is sketched after this list). Note that a distinction has to be made between parse quality as defined for parsing per se and the notion of correctness required for MT systems.
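To make the re-ranking idea concrete, here is a minimal sketch of my own (the feature and the weights are made up): each candidate in an n-best list gets a combined score of the baseline model score and a syntactic feature, and the highest-scoring candidate wins. The agreement_score function is only a placeholder for a real agreement or parse-quality checker.

```python
def agreement_score(candidate):
    """Placeholder syntactic feature; a real implementation would check
    subject-verb agreement or parse quality of the candidate translation."""
    return 0.0

def rerank(nbest, w_model=1.0, w_syntax=0.5):
    """nbest: list of (candidate_translation, model_score) pairs."""
    return max(nbest, key=lambda c: w_model * c[1] + w_syntax * agreement_score(c[0]))

print(rerank([("candidate one", -3.2), ("candidate two", -2.9)]))
# ('candidate two', -2.9): with a constant placeholder feature, the model score decides
```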

The problem with such pre-processing and post-processing components is that they are themselves prone to error. The errors of the different components are not handled in one integrated framework, and hard decisions have to be made at each boundary. A probabilistic approach which incorporates all these pre- and post-processing components would be a cleaner and more elegant solution. That is the motivation behind the factored translation model. In this model, the factors are annotations on the input and output words (e.g. morphology or POS factors). Translation and generation functions are defined over the factors, and these are integrated using a log-linear model. This provides a principled way to test a diverse set of features in a structured manner. Of course, the phrase translation table now grows in size, but this can be handled by using pre-compiled data structures. Decoding could also blow up, but pruning can be used to cut down the search space.
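For reference, the log-linear combination mentioned above has the standard form used throughout the book: a set of feature functions h_i (translation, generation and language model components defined over the factors) is combined with weights λ_i:

$$
p(\mathbf{e} \mid \mathbf{f}) \;=\; \frac{\exp\Big(\sum_{i=1}^{n} \lambda_i \, h_i(\mathbf{e}, \mathbf{f})\Big)}{\sum_{\mathbf{e}'} \exp\Big(\sum_{i=1}^{n} \lambda_i \, h_i(\mathbf{e}', \mathbf{f})\Big)}
$$

The denominator is only needed for normalization; during decoding it is enough to maximize the numerator.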

Language Divergence between English and Hindi

Comparing two languages is interesting, especially for an application like machine translation. Languages exhibit so many differences that it is mind-boggling to realize how easily we navigate between them. This paper, 'Interlingua-based English–Hindi Machine Translation and Language Divergence', summarizes the major differences between Hindi and English.

I have tried to summarize the observations in the paper below, to make a handy reference:


  • Word order: English is Subject-Verb-Object ("Ram ate the mango"); Hindi is Subject-Object-Verb ("राम ने आम खाया").
  • Modifiers: English uses post-modifiers, Hindi uses pre-modifiers: "The Prime Minister of India" vs. "भारत का प्रधान मंत्री"; "play well" vs. "अच्छे से खेलेंगे".
  • X-positions: English has prepositions ("of India"), Hindi has postpositions ("भारत का"). English prepositions are also overloaded: "with" marks accompaniment in "John ate rice with curd" but an instrument in "John ate rice with a spoon".
  • Compound and conjunct verbs: not prevalent in English, very common in Hindi (e.g. "वह गाने लगे", "रुक जाओ").
  • Respect: English has no special words for indicating respect; Hindi does (e.g. "आप", "हम").
  • Person: Hindi uses the second person (honorific) for the third person: "He obtained his degree" vs. "आपने अम्रीका से डिग्री प्राप्त की".
  • Gender: English has masculine, feminine and neuter; Hindi has only masculine and feminine.
  • Gender-specific possessive pronouns: English has them ("he", "she"); Hindi lacks them ("वह").
  • Morphology: English is morphologically poor; Hindi is morphologically rich.
  • Null subject divergence: Hindi drops the subject under certain conditions: "There was a king" vs. "एक राजा था"; "I am going" vs. "जा रहा हूँ".
  • Pleonastic divergence: the pleonastic subject is dropped in Hindi: "It is raining" vs. "बारिश हो रही है".
  • Conflational divergence: no single appropriate word exists in the other language: "Brutus stabbed Caesar" vs. "ब्रूटस ने सीसर को छुरे से मारा".
  • Categorical divergence: the POS category changes: "They are competing" vs. "वे मुकाबला कर रहे है".
  • Head swapping: the head and its modifier are exchanged: "The play is on" vs. "खेल चल रहा है".

Wednesday, September 21, 2011

Aligning Sentences to build a parallel corpus

This is a really old paper, from Gale & Church, on building a sentence-aligned parallel corpus from a corpus that is not sentence-aligned. A dynamic programming formulation with a novel distance measure is used to align the sentences. For a method this simple, the reported results on the Hansards corpus are impressive. Of course, the input corpus is already paragraph-aligned.

The basic premise is simple: sentences with fewer characters in one language correspond to sentences with fewer characters in the other language, and likewise for longer sentences. Based on this idea, the distance between two sentences is defined via a random variable X: the number of characters in language L2 per character of language L1.
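To make the distance concrete, here is a minimal sketch of the cost for the one-to-one match case, following the paper's normal model of the length difference. The constants below (c ≈ 1, a variance parameter of about 6.8, and the prior for a 1-1 match) are the values reported in the paper; treat the exact numbers as assumptions to be re-estimated per language pair.

```python
import math

def gale_church_cost(len1, len2, c=1.0, s2=6.8, prior_1to1=0.89):
    """-log probability that sentences of len1 and len2 characters form a 1-1 match.

    The length difference len2 - c*len1 is modelled as normal with variance s2*len1.
    """
    delta = (len2 - c * len1) / math.sqrt(s2 * max(len1, 1))
    # two-sided tail probability of a standard normal
    tail = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    tail = max(tail, 1e-300)  # guard against log(0)
    return -math.log(prior_1to1) - math.log(tail)

# similar lengths give a low cost; very different lengths give a high cost
print(gale_church_cost(50, 52), gale_church_cost(50, 120))
```

The full algorithm then finds the minimum-cost alignment over these pairwise costs (plus costs for 1-0, 2-1, 1-2 and 2-2 matches) with dynamic programming.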

I tried to see the behavior of this variable for the English-Hindi language pair. On a 14,000-sentence parallel corpus, here are the results:

mean(X): 0.99, i.e. almost one Hindi character per English character, which is in agreement with the paper's claims. Interestingly, if whitespace is not counted, the mean drops to 0.96.
variance(X): 0.01979136 - very low, so the mean is very reliable. A linear fit can't get better than this: 
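These statistics can be reproduced with a few lines of Python; the file names below are placeholders for a sentence-aligned English-Hindi corpus, one sentence per line.

```python
ratios = []
with open("corpus.en", encoding="utf-8") as en, open("corpus.hi", encoding="utf-8") as hi:
    for e, h in zip(en, hi):
        e, h = e.strip(), h.strip()
        if e:                                # skip empty English lines
            ratios.append(len(h) / len(e))   # Hindi characters per English character

mean = sum(ratios) / len(ratios)
variance = sum((r - mean) ** 2 for r in ratios) / len(ratios)
print(mean, variance)
```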



NLTK provides an implementation of the Gale-Church alignment algorithm. I tried running it on a perfectly parallel corpus, but the algorithm ends up misaligning the sentences. Reducing mean(X) to 0.9 also did not help. Wonder what's going on?
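For anyone who wants to poke at the same behaviour: the relevant module is nltk.translate.gale_church, and the snippet below is my best guess at how it is meant to be called. The function name align_blocks and the convention of passing sentence lengths in characters are assumptions based on my reading of the module, so double-check against your NLTK version.

```python
from nltk.translate import gale_church

# Character lengths of the sentences in one source paragraph and one target paragraph.
src_lengths = [len(s) for s in ["Ram ate the mango.", "I am going."]]
tgt_lengths = [len(s) for s in ["राम ने आम खाया।", "जा रहा हूँ।"]]

# Expected to return (source_index, target_index) pairs of the proposed alignment.
print(gale_church.align_blocks(src_lengths, tgt_lengths))
```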