This is a really old paper by Gale & Church on building a sentence-aligned parallel corpus from a misaligned corpus. The sentences are aligned with a dynamic programming formulation that uses a novel distance measure. For a method this naive, the results reported on the Hansards corpus are impressive. Of course, the input corpus is already paragraph aligned.
The basic premise is simple: sentences with fewer characters in one language tend to have fewer characters in the other language, and correspondingly for longer sentences. Based on this idea, the distance between two sentences is defined in terms of a random variable X: the number of characters in language L2 per character of language L1.
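To make the distance measure concrete, here is a minimal sketch of how a length-based cost can be derived from X. It follows the form I recall from the paper: the length difference is standardised as delta = (l2 - l1*c) / sqrt(l1 * s2) and turned into a two-sided tail probability. The function names are hypothetical, and the defaults c = 1.0 and s2 = 6.8 are the values I remember the paper using, so treat them as assumptions rather than gospel.

```python
import math


def length_delta(len_l1, len_l2, c=1.0, s2=6.8):
    """Standardised length difference between two sentences.

    c  : assumed mean of X (characters of L2 per character of L1)
    s2 : assumed variance parameter for X
    """
    # Guard against an empty L1 sentence so the square root stays positive.
    denom = math.sqrt(max(len_l1, 1) * s2)
    return (len_l2 - len_l1 * c) / denom


def match_cost(len_l1, len_l2, c=1.0, s2=6.8):
    """Negative log of the two-sided tail probability 2 * (1 - Phi(|delta|))."""
    delta = abs(length_delta(len_l1, len_l2, c, s2))
    # erfc(d / sqrt(2)) equals 2 * (1 - Phi(d)) for the standard normal CDF Phi.
    tail = math.erfc(delta / math.sqrt(2))
    # Clamp to avoid log(0) for extremely mismatched lengths.
    return -math.log(max(tail, 1e-300))
```

Sentences of matching length get a cost of zero, and the cost grows as the lengths drift apart, which is what the dynamic programme then minimises.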
I tried to see how this variable behaves for the English-Hindi language pair. On a 14,000-sentence parallel corpus, here are the results:
mean(X): 0.99, i.e. almost one Hindi character per English character, which agrees with the paper's claims. Interestingly, if whitespace is not counted, the mean drops to 0.96.
variance(X): 0.01979136, which is very low, so the mean is very reliable. A linear fit can't get better than this.
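For anyone who wants to repeat the measurement on another language pair, this is roughly how such statistics could be computed. It is only a sketch under a few assumptions: X is taken as the per-sentence ratio of target length to source length, the variance is the population variance, and `len()` counts Unicode code points (so Devanagari vowel signs count as separate characters). The function name and the toy sentence pairs are made up for illustration.

```python
from statistics import mean, pvariance


def char_ratio_stats(pairs, count_whitespace=True):
    """Return (mean, variance) of X over (source, target) sentence pairs."""
    ratios = []
    for src, tgt in pairs:
        if not count_whitespace:
            # Drop every whitespace character before counting.
            src = "".join(src.split())
            tgt = "".join(tgt.split())
        if len(src) == 0:
            continue  # skip empty source sentences to avoid dividing by zero
        ratios.append(len(tgt) / len(src))
    return mean(ratios), pvariance(ratios)


# Toy data only; a real run would read the aligned corpus from disk.
toy_pairs = [("abcd", "abcd"), ("ab cd", "abc")]
m, v = char_ratio_stats(toy_pairs)
m_no_ws, _ = char_ratio_stats(toy_pairs, count_whitespace=False)
```

Switching `count_whitespace` off is the same experiment as the one that moved the mean from 0.99 to 0.96.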
NLTK provides an implementation of the Gale-Church alignment algorithm. I tried running it on a perfectly parallel corpus, but the algorithm ends up misaligning the sentences. Reducing mean(X) to 0.9 did not help either. I wonder what's going on?
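One way to narrow this down is to compare against a small stand-alone version of the dynamic programme. The sketch below is not NLTK's code; it is my own minimal reading of the method, working on sentence lengths only. The bead priors and the defaults c = 1.0 and s2 = 6.8 are the values I recall from the paper and should be treated as assumptions, and all names are hypothetical.

```python
import math

# Assumed prior probability of each bead type: (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def bead_cost(src_len, tgt_len, prior, c=1.0, s2=6.8):
    """Cost of one bead: -log(prior) plus -log of the length tail probability."""
    delta = abs(tgt_len - src_len * c) / math.sqrt(max(src_len, 1) * s2)
    tail = math.erfc(delta / math.sqrt(2))  # 2 * (1 - Phi(delta))
    return -math.log(prior) - math.log(max(tail, 1e-300))


def align(src_lens, tgt_lens):
    """Return beads as (source indices, target indices) with minimum total cost."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj:
                    continue  # bead would reach before the start of the text
                total = cost[i - di][j - dj] + bead_cost(
                    sum(src_lens[i - di:i]), sum(tgt_lens[j - dj:j]), prior)
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = (di, dj)
    # Walk the back-pointers from the end to recover the chosen beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


one_to_one = align([50, 120, 30], [52, 118, 31])
merged = align([50, 60], [112])
```

On the toy lengths above, the first call pairs the sentences one to one and the second merges two source sentences into a single target sentence. If a sketch like this aligns a perfectly parallel corpus correctly where the library call does not, the difference is more likely in how the input is prepared than in the length model itself.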