Saturday, October 19, 2013

Hierarchical Phrase Based models

I read David Chiang's ACL'05 paper on hierarchical phrase-based models today. A quick summary:

Design Principles:
  • Formal, but not linguistic, i.e. a synchronous CFG is used; however, the grammar learnt may not correspond to a linguistic ('human'?) grammar.
  • Leverage the strengths of a phrase-based system while moving to a syntax-based one

Basic Motivation:

The basic idea is to handle long-distance reorderings that a phrase-based model cannot.
This is done by introducing a single non-terminal 'X' and having rules of the form:

  X -> a X_1 b X_2 c | d X_2 e X_1 f

where the part before the '|' is the source side, the part after it is the target side, and the subscripts link the corresponding non-terminals across the two sides (and hence specify the reordering).
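
To make the reordering concrete, here is a minimal Python sketch, with made-up terminals and a toy rule representation of my own (not Chiang's data structures), showing how the target side of such a rule rearranges the translations of X_1 and X_2:

    rule = {
        "src": ("a", 1, "b", 2, "c"),   # X -> a X_1 b X_2 c
        "tgt": ("d", 2, "e", 1, "f"),   # ... | d X_2 e X_1 f
    }

    def apply_rule(rule, fillers):
        """Substitute the sub-translations into the target side; the
        rule's target side dictates the reordering."""
        return " ".join(fillers[s] if isinstance(s, int) else s
                        for s in rule["tgt"])

    # X_1 and X_2 are translated independently, then swapped by the rule:
    print(apply_rule(rule, {1: "uno", 2: "dos"}))   # -> d dos e uno f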

In theory, the number of non-terminals on the RHS is not constrained. The limitation of this approach, however, is that reorderings that happen at higher levels of a constituent parse tree may not be captured. The rules learnt by this system are more like lexicalized reordering templates.

Special types of rules used:
  •  Glue rules: the top-level rules (shown below)
  •  Entity rules: for translating dates, numbers, etc.
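
For reference, the glue rules in the paper just concatenate translated chunks left to right; in the same notation as above they are:

  S -> S_1 X_2 | S_1 X_2
  S -> X_1 | X_1

so the top level is a monotone sequence of X's rather than one giant reordering rule.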

Learning rules

The starting point is the phrases learnt by a phrase-based system, called 'initial phrase pairs'. From each initial phrase pair, rules are extracted. In order to avoid too many rules and to reduce spurious derivations, some heuristics are used. One noteworthy heuristic is that rules are constructed from the smallest possible initial phrase pairs. Another is that each rule can have at most two non-terminals on the RHS. This is done for decoding efficiency, probably because the CYK algorithm expects a grammar in Chomsky Normal Form (CNF), where every rule has at most two non-terminals on the RHS.
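
Here is a toy sketch of the subtraction step in rule extraction. The function names are mine, and I assume the initial phrase pair and its aligned sub-phrase pairs are already identified; real extraction works over word-aligned sentence pairs:

    def replace_span(tokens, span, symbol):
        """Replace the first occurrence of `span` in `tokens` with `symbol`."""
        n = len(span)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == tuple(span):
                return tokens[:i] + [symbol] + tokens[i + n:]
        return tokens

    def extract_rule(init_src, init_tgt, sub_pairs):
        """Turn an initial phrase pair into a rule by replacing each
        aligned sub-phrase pair with a co-indexed non-terminal X_k."""
        src, tgt = list(init_src), list(init_tgt)
        for k, (sub_src, sub_tgt) in enumerate(sub_pairs, start=1):
            src = replace_span(src, sub_src, "X_%d" % k)
            tgt = replace_span(tgt, sub_tgt, "X_%d" % k)
        return " ".join(src), " ".join(tgt)

    # "ne va pas" / "does not go": subtracting the pair (va, go) leaves
    # a rule with a verb-shaped hole:
    print(extract_rule(["ne", "va", "pas"], ["does", "not", "go"],
                       [(["va"], ["go"])]))
    # -> ('ne X_1 pas', 'does not X_1')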

The model

The model is very similar to the phrase-based model: a log-linear model with the same features, except that the phrase translation probabilities are replaced by rule translation probabilities. The probabilities are learnt in a similar way.
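
As a reminder of what that means, here is a tiny sketch of scoring one derivation D, with hypothetical feature values and weights (the actual feature set follows the phrase-based one: translation probabilities in both directions, lexical weights, LM, word and phrase penalties):

    import math

    def derivation_score(features, weights):
        """score(D) = sum_i lambda_i * log phi_i(D), i.e. a weighted
        product of the feature values, computed in log space."""
        return sum(weights[f] * math.log(v) for f, v in features.items())

    features = {"p(e|f)": 0.2, "p(f|e)": 0.1, "lm": 1e-5}   # made-up values
    weights = {"p(e|f)": 1.0, "p(f|e)": 1.0, "lm": 0.8}     # made-up weights
    print(derivation_score(features, weights))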

Decoding

Decoding is done via a CYK variant. The differences from standard CYK parsing are:
- Parsing is done only for the source language sentence. So far so good.
- There is only one non-terminal. You would expect this to make the parsing easier. However, there is a catch.
- The language model of the target language has to be integrated into the decoder. The paper says, "the language model is integrated by intersecting with the target side CFG", which I take to mean that the LM score of the sub-string spanned by a cell in the chart parsing is multiplied along with the rule weights. This means each cell has to keep track of the rule along with all the target strings that the rule can generate in that span. Each such combination is like a virtual non-terminal, and hence the effective number of non-terminals can be really large, especially for larger spans (see the sketch after this list).
   What I have described here is naive; the journal paper describes different strategies for integrating the language model. I will read up on that and summarize it later.
- The grammar is not in CNF, though every rule still has at most two non-terminals. I guess it is converted to CNF before decoding.
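
Here is the sketch promised above, assuming a bigram LM and made-up scores: a chart item is identified not just by its span but also by the boundary words of its partial translation, since those are all a bigram LM needs to see from the outside. Each distinct LM state behaves like its own non-terminal:

    from collections import defaultdict

    # chart[(i, j)] maps an LM state -- (first word, last word) of the
    # partial target string -- to the best score found for that state.
    chart = defaultdict(dict)

    def add_item(i, j, first_word, last_word, score):
        """Keep only the best-scoring item per (span, LM state)."""
        cell, state = chart[(i, j)], (first_word, last_word)
        if state not in cell or score > cell[state]:
            cell[state] = score

    # Two items over the same source span but with different boundary
    # words must both be kept -- they will interact differently with
    # the LM later, so neither can prune the other:
    add_item(0, 2, "does", "go", -4.2)
    add_item(0, 2, "is", "going", -4.5)
    print(chart[(0, 2)])   # {('does', 'go'): -4.2, ('is', 'going'): -4.5}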

Another interesting problem is how to find the top-k parses. The journal article describes this in detail too.
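
As a taste of the sub-problem involved, here is a sketch of lazily enumerating the k best combined scores when a rule combines two sub-derivations with known k-best lists. As far as I understand, this frontier exploration is the core idea behind the k-best algorithms the journal paper uses (from Huang and Chiang's k-best parsing work); the scores below are made up:

    import heapq

    def k_best_pairs(a, b, k):
        """a, b: k-best lists of log scores, sorted descending; return
        the k best sums a[i] + b[j] without forming all |a| x |b| pairs."""
        out, seen = [], {(0, 0)}
        heap = [(-(a[0] + b[0]), 0, 0)]
        while heap and len(out) < k:
            neg, i, j = heapq.heappop(heap)
            out.append(-neg)
            for ni, nj in ((i + 1, j), (i, j + 1)):   # expand the frontier
                if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                    seen.add((ni, nj))
                    heapq.heappush(heap, (-(a[ni] + b[nj]), ni, nj))
        return out

    print(k_best_pairs([-1.0, -2.0, -5.0], [-0.5, -3.0], k=3))
    # -> [-1.5, -2.5, -4.0]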

Optimizations to decoding

- Limiting the number of entries in a cell of the chart
- Pruning entries in a cell whose scores are very low compared to the highest-scoring entry in that cell (both sketched below)
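
Together these amount to beam pruning per cell. A small sketch with made-up scores and a hypothetical prune_cell helper of my own:

    def prune_cell(cell, beam_size=30, threshold=-10.0):
        """Keep at most `beam_size` items, and drop any item whose log
        score falls more than `threshold` below the best in the cell."""
        cell = sorted(cell, reverse=True)[:beam_size]   # cap cell size
        if cell:
            best = cell[0][0]
            cell = [(s, x) for s, x in cell if s >= best + threshold]
        return cell

    cell = [(-1.0, "A"), (-3.5, "B"), (-15.0, "C")]
    print(prune_cell(cell, beam_size=2, threshold=-10.0))
    # -> [(-1.0, 'A'), (-3.5, 'B')]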

References
  • Chiang, David. "A hierarchical phrase-based model for statistical machine translation." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2005.
  • Chiang, David. "Hierarchical phrase-based translation." Computational Linguistics 33.2 (2007): 201-228.