(Summary of the chapter 'Integrating Linguistic Information' in Philip Koehn's textbook 'Statistical Machine Translation')
Traditional phrase-based Statistical Machine Translation (SMT) relies only on the surface form of words, but this can carry you only so far. Without modelling any linguistic phenomena, no generalization is possible, and the SMT system ends up being little more than a translation memory. Various kinds of linguistic information need to be incorporated into the SMT process, such as:
- Name transliteration and number script conversion.
- Morphological phenomena: inflection, compounding, segmentation. If these are not handled, they lead to data sparsity.
- Syntactic phenomena such as constituent structure, attachment, and head-modifier re-ordering. Vanilla SMT is designed to handle local re-orderings, but long-range dependencies are not handled well.
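Number script conversion, the simplest of these, can be sketched in a few lines. The example below (my own illustration, not from the book) maps Devanagari digits to Western Arabic digits as a pre-processing step:

```python
# Minimal sketch of number script conversion as a pre-processing step:
# map Devanagari digits to Western Arabic digits, leaving all other
# characters untouched.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")

def convert_digits(text: str) -> str:
    """Replace Devanagari digits with their ASCII equivalents."""
    return text.translate(DEVANAGARI_TO_ASCII)

print(convert_digits("सन् १९४७ में"))  # → "सन् 1947 में"
```

Name transliteration is harder, since it requires first identifying which tokens are named entities and then applying a learnt character-level mapping rather than a fixed table.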
One way to handle these phenomena is to pre-process the parallel corpus before training and then run the standard SMT tools. Pre-processing could include:
- Incorporating transliteration and back-transliteration models. An important sub-problem is identifying the named entities in the first place.
- Splitting words when the input language is morphologically rich. Compounding and segmentation can be handled similarly.
- Re-ordering problems can be addressed by re-ordering the input sentences in a pre-processing step before feeding them to the SMT system. The re-ordering can be done either with hand-crafted rules or learnt from data, and can be shallow (POS-tag-based re-ordering rules) or based on a full syntactic parse.
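A shallow, POS-tag-based re-ordering rule can be sketched as follows. The rule here is a toy illustration of my own (not from the book): swap a noun followed by an adjective, so that French-style "maison bleue" matches English "blue house" word order before training:

```python
# Toy sketch of shallow, POS-based pre-ordering over (word, POS) pairs.
# The single hand-crafted rule below is hypothetical: swap NOUN + ADJ so
# the source sentence better matches target-language word order.
def preorder(tagged):
    """Apply a hand-crafted NOUN/ADJ swap rule to a tagged sentence."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "NOUN" and out[i + 1][1] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out

tagged = [("la", "DET"), ("maison", "NOUN"), ("bleue", "ADJ")]
print([w for w, _ in preorder(tagged)])  # → ['la', 'bleue', 'maison']
```

A data-driven variant would learn such swap patterns from word-aligned parallel text instead of writing them by hand.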
Similarly, some work may be done on the post-processing side:
- If the output language is morphologically complex, morphological generation can take place in a post-processing step after SMT. This assumes that the SMT system has produced enough information to generate the output morphology.
- Alternatively, to ensure grammaticality of the output, we can re-rank the candidate translations based on syntactic features such as agreement and parse correctness. Note that a distinction has been made between syntactic parse quality as defined for parsing and as required for MT systems.
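The re-ranking step amounts to re-scoring an n-best list with a weighted combination of the base model score and syntactic features. The feature names and weights below are illustrative assumptions of mine, not values from the book:

```python
# Hypothetical sketch of n-best re-ranking: each candidate translation
# carries its base model score plus syntactic feature values (an agreement
# check, a parser's log-probability); a weighted linear combination of
# these features decides the new ranking.
WEIGHTS = {"model": 1.0, "agreement_ok": 0.5, "parse_logprob": 0.2}

def rerank(candidates):
    """Sort candidates by the weighted sum of their feature values."""
    def score(cand):
        return sum(WEIGHTS[f] * v for f, v in cand["features"].items())
    return sorted(candidates, key=score, reverse=True)

nbest = [
    {"text": "he go home",
     "features": {"model": -2.0, "agreement_ok": 0.0, "parse_logprob": -6.0}},
    {"text": "he goes home",
     "features": {"model": -2.3, "agreement_ok": 1.0, "parse_logprob": -4.0}},
]
print(rerank(nbest)[0]["text"])  # → "he goes home"
```

Here the slightly lower-scoring candidate under the base model wins once agreement and parse features are added, which is exactly the intended effect of syntactic re-ranking.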
The problem with such pre-processing and post-processing components is that they are themselves prone to error. The errors of the various components are not handled in one integrated framework, and hard decisions have to be made at each stage. A probabilistic approach that incorporates all these pre- and post-processing components would be cleaner and more elegant, and that is the motivation behind the factored translation model. In this model, the factors are annotations on the input and output words (e.g. morphology and POS factors). Translation and generation functions are defined over the factors, and these are integrated in a log-linear model. This provides a structured way to test a diverse set of features. Of course, the phrase translation table now grows in size, but this can be handled by using pre-compiled data structures. Decoding could also blow up, but pruning can be used to cut down the search space.
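The factored decomposition can be sketched as a translation step over lemma and POS factors followed by a generation step back to surface forms. The tables below are invented toy examples to show the shape of the idea, not real model output:

```python
# Toy sketch of the factored-model idea: words are bundles of factors
# (lemma, POS, morphological features). Translation is defined on
# lemma + POS; a separate generation step maps the output factors back
# to a surface form. All entries here are invented for illustration.
TRANSLATE = {("haus", "NN"): ("house", "NN")}      # source factors -> target factors
GENERATE = {("house", "NN", "plural"): "houses",   # target factors -> surface form
            ("house", "NN", "singular"): "house"}

def translate_word(lemma, pos, number):
    """Translate on lemma+POS factors, then generate the surface form."""
    tgt_lemma, tgt_pos = TRANSLATE[(lemma, pos)]
    return GENERATE[(tgt_lemma, tgt_pos, number)]

print(translate_word("haus", "NN", "plural"))  # → "houses"
```

The benefit over surface-form translation is generalization: one lemma-level entry covers all inflected forms, and the morphology is reapplied on the output side, which directly attacks the data sparsity problem mentioned earlier.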