
Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)




  1. Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005) Chris Quirk, Arul Menezes and Colin Cherry

  2. Outline • Limitations of SMT and previous work • Modeling and training • Decoding • Experiments • Conclusion

  3. Limitations of string-based phrasal SMT • It allows only limited phrase reordering. • Ex: max jump, max skip • It cannot express linguistic generalizations: • Ex: it cannot express “SOV → SVO” • Source and target phrases have to be contiguous: • Ex: it cannot handle “ne … pas”

  4. Previous work on syntactic SMT: Simultaneous parsing • Inversion Transduction Grammars (Wu, 1997) • Using simplifying assumptions: X → AB • Head transducer (Alshawi et al., 2000) • Simultaneous induction of src and tgt dependency trees

  5. Previous work on syntactic SMT: parsing + transfer • Tree-to-string (Yamada and Knight, 2001) • Parse tgt sentence, and convert the tgt tree to a src string • Path-based transfer model (Lin, 2004) • Translate paths in src dependency trees • LF-level transfer (Menezes and Richardson, 2001) • Parse both src and tgt.

  6. Previous work on syntactic SMT: pre- or post-processing • Post-processing (JHU 2003): re-ranking the n-best list of SMT output using syntactic models. • Parse MT output • No improvement, even when n=16,000 • Pre-processing (Xia & McCord, 2004; Collins et al., 2005; …): • Reorder src sents before SMT • Some improvement

  7. Outline • Limitations of SMT and previous work • Modeling and training • Decoding • Experiments • Conclusion

  8. What’s new? • The unit of translation: a treelet pair. • A treelet is an arbitrary connected subgraph (not necessarily a subtree) of a dependency tree. • In comparison: • Src n-grams: “phrase”-based SMT • Path: (Lin, 2004) • Context-free rules: many transfer-based MT systems • ⇒ Decoding is more complicated.

  9. Required modules • Source dependency parser • Target word segmenter / tokenizer • Word aligner: GIZA++

  10. Major steps for training • Align src and tgt words • Parse source side • Project dependency trees • Extract treelet translation pairs • Train an order model • Train other models

  11. Step 1: Word alignment • Use GIZA++ to get alignments in both directions, and combine the results with heuristics. • One constraint: for n-to-1 alignments, the n src words have to be adjacent in the src dependency tree.
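The slide does not spell out the combination heuristics, so the following is a minimal sketch of the usual recipe (start from the high-precision intersection, then grow with union links), with the slide's adjacency constraint made explicit. The function names and the growth order are assumptions, not the paper's implementation.

```python
from collections import defaultdict

def connected(nodes, heads):
    """True if `nodes` induce a connected subgraph of the dependency
    tree given as heads[child] = parent (root maps to None)."""
    if len(nodes) <= 1:
        return True
    adj = defaultdict(set)
    for n in nodes:
        h = heads.get(n)
        if h in nodes:
            adj[n].add(h)
            adj[h].add(n)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen == set(nodes)

def combine_alignments(forward, backward, src_heads):
    """forward/backward: sets of (src_i, tgt_j) links from the two
    GIZA++ runs. Start from the intersection, then add union links,
    enforcing the slide's constraint: all src words aligned to one
    tgt word must stay adjacent in the src dependency tree."""
    alignment = set(forward & backward)
    for link in sorted((forward | backward) - alignment):
        groups = defaultdict(set)
        for i, j in alignment | {link}:
            groups[j].add(i)
        if all(connected(g, src_heads) for g in groups.values()):
            alignment.add(link)
    return alignment
```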

  12. Heuristics used to accept alignments from the union ⇒ It does not accept m-to-n alignments

  13. Step 2: parsing source side • It requires a source dependency parser that • produces unlabeled, ordered dependency trees, and • annotates each src word with a POS tag • Their system does not allow crossing dependencies: • h(i)=k ⇒ for any j between i and k, h(j) is also between i and k.
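A direct transcription of the slide's no-crossing condition, assuming a simple head-index representation:

```python
def is_projective(heads):
    """Slide 13's condition: if h(i) = k, then every j strictly between
    i and k must have its head inside the span [min(i,k), max(i,k)].
    `heads` maps each word index to its head index (root -> None)."""
    for i, k in heads.items():
        if k is None:
            continue
        lo, hi = min(i, k), max(i, k)
        for j in range(lo + 1, hi):
            h = heads.get(j)
            if h is None or not (lo <= h <= hi):
                return False
    return True

# Example: a three-word tree headed at position 1 is projective.
assert is_projective({0: 1, 1: None, 2: 1})
```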

  14. Step 3: Projecting dependency trees • Add links in the tgt dependency tree according to word alignment types: • 1-to-1: trivial • n-to-1: trivial • 1-to-n: use heuristics • Unaligned tgt words: use heuristics • Unaligned src words: ignore them

  15. 1-to-1 and n-to-1 alignments [figure: source nodes sk, sl, sl’ and their projected target nodes ti, tj]

  16. 1-to-n alignment [figure: source nodes a, b projecting to target nodes a’, b1’, b2’] The n tgt words should move as a unit: treat the rightmost one as the head; all other words depend on it.

  17. Unaligned target words Given an unaligned tgt word at position j, find the closest positions (i, k) s.t. j is between i and k and ti depends on tk (or vice versa). [figure: ti … tj … tk] Such (i, k) might not exist. Because no crossing is allowed, if (i, k) exists, it is unique. (Sketched below.)
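A sketch of just this search. The slide guarantees uniqueness of (i, k) when it exists; which endpoint tj is attached to is not stated here, so attaching to the lower (dependent) endpoint is an assumption.

```python
def attach_unaligned(j, tgt_heads, n_tgt):
    """Find the closest pair (i, k), i < j < k, where one endpoint
    depends on the other, scanning narrower spans first. Returns the
    dependent endpoint as the attachment site (an assumption), or
    None when no such pair exists."""
    for width in range(2, n_tgt):
        for i in range(max(0, j - width + 1), j):
            k = i + width
            if k >= n_tgt:
                continue
            if tgt_heads.get(i) == k:
                return i   # t_i depends on t_k
            if tgt_heads.get(k) == i:
                return k   # t_k depends on t_i
    return None
```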

  18. An example • src: “startup properties and options” • tgt: “propriétés et options de démarrage”

  19. The reattachment pass to ensure phrasal cohesion [figure: the projected tree for “propriétés et options de démarrage” before and after reattachment]

  20. Reattachment pass • “For each node in the wrong order (relative to its siblings), we reattach it to the lowest of its ancestors s.t. it is in the correct place relative to its siblings and parent”. • Question: how does the reattachment work? • In what order are tree nodes checked? • Once a node is moved, can it be moved again? • How many levels do we have to check to decide where to attach a node?

  21. An example [figure: a dependency tree over word positions 1–15 illustrating the reattachment pass]

  22. Step 3: Projecting dependency trees (Recap) • Before reattachment, the src and tgt dependency trees are almost isomorphic: • n-to-1: treat “many” src words as one node • 1-to-n: treat “many” tgt words as one node • Unaligned tgt words: attached with the closest-pair heuristic (slide 17) • Unaligned src words: ignored • After reattachment, the two trees can look very different.

  23. Step 4: Extracting treelet translation pairs • “We extract all pairs of aligned src and tgt treelets along with word-level alignment linkages, up to a configurable max size.” • Due to the reattachment step, a src treelet might not align to a tgt treelet.

  24. Extraction algorithm • Enumerate all possible source treelets. • Look at the union of the target nodes aligned to source nodes. If it is a treelet, keep the treelet pair. • Allow treelets with wildcard roots. • Ex: doesn’t * → ne * pas • Max size of treelets: in practice, up to 4 src words. • Question: how many source treelets are there? (see the sketch below)
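A sketch of the extraction loop under these rules, reusing the `connected` helper from the alignment sketch above. The closure check (no alignment links leaking outside the pair) and the omission of wildcard roots are assumptions. On the slide's open question: enumerating connected subgraphs of size at most k costs O(n^k) node combinations in this naive form, which is why the configurable max size matters.

```python
from itertools import combinations

def extract_treelet_pairs(src_heads, tgt_heads, alignment, max_size=4):
    """Enumerate connected src subgraphs up to max_size, map each
    through the alignment, and keep the pair when its tgt image is
    itself connected. Wildcard-root treelets are omitted."""
    pairs = []
    for size in range(1, max_size + 1):
        for combo in combinations(list(src_heads), size):
            nodes = set(combo)
            if not connected(nodes, src_heads):
                continue                      # not a src treelet
            image = {j for i, j in alignment if i in nodes}
            covered = {i for i, j in alignment if j in image}
            if not image or not covered <= nodes:
                continue                      # links leak outside the pair
            if connected(image, tgt_heads):
                pairs.append((frozenset(nodes), frozenset(image)))
    return pairs
```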

  25. An example • src: “startup properties and options” • tgt: “propriétés et options de démarrage”

  26. Step 5: training an order model

  27. Another representation

  28. Learning a dependent’s position w.r.t. its head P(pos(m,t) | S, T): • S: src dependency tree • T: unordered tgt dependency tree • t (a.k.a. “h”): a node in T • m: a child of t ⇒ Use a decision tree to decide pos(m)

  29. The prob of the order of tgt tree • c(t) is the set of nodes modifying t (i.e., the children of t in the dependency tree). • Assumption: the position of each child can be modeled independently in terms of head-relative position.
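The formula itself was on a figure that did not survive the transcript. From the two bullets above, it is presumably the fully factored form; this reconstruction follows the slide's description rather than copying the original:

```latex
P(\mathrm{order}(T) \mid S, T) \approx \prod_{t \in T} \; \prod_{m \in c(t)} P\big(\mathrm{pos}(m, t) \mid S, T\big)
```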

  30. The order model (cont) • Comment: this model is both straightforward and somewhat counter-intuitive, since treelets are subgraphs.

  31. Step 6: train other models • (si, ti) is a treelet pair. • It assumes a uniform distribution over all possible decompositions of a tree into treelets. • Two models: • MLE • IBM Model 1
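The two formulas were likewise lost with the figure. Standard forms consistent with the bullets would be a relative-frequency estimate over extracted treelet pairs and a word-level IBM Model 1 score; the exact normalization (e.g., NULL alignment, 1/(|s|+1) vs. 1/|s|) is an assumption:

```latex
P_{\mathrm{MLE}}(t_i \mid s_i) = \frac{\mathrm{count}(s_i, t_i)}{\mathrm{count}(s_i)}
\qquad
P_{\mathrm{M1}}(t_i \mid s_i) = \prod_{t_j \in t_i} \frac{1}{|s_i|} \sum_{s_k \in s_i} P(t_j \mid s_k)
```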

  32. Step 6: train other models (cont) • Target LM: n-gram LM • Other features: • Target word number: word penalty • The number of “phrases” used. • ….

  33. Treelet vs. string-based SMT • Similarities: • Use the log-linear framework. • Similar features: LM, word penalty, … • Differences: • Use treelet TM, instead of string-based TM. • The order model is w.r.t. dependency trees.

  34. Outline • Limitations of SMT and previous work • Modeling and training • Decoding • Experiments • Conclusion

  35. Challenges • Traditional left-to-right decoding approach is inapplicable. • Treelets must also be handled when they are discontiguous or overlapping.

  36. Ordering strategies • Exhaustive search • Greedy ordering • No ordering

  37. Exhaustive search • For each input node s, find the set of all treelet pairs that match S and are “rooted” at s. • Move bottom up through the src dependency tree, computing a list of possible tgt trees for each src subtree. • When attaching one subtree to another, try all possible permutations of children of root node.

  38. Definitions

  39. Exhaustive decoding algorithm
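The definitions and algorithm figures (slides 38 and 39) did not survive the transcript. Below is a heavily simplified sketch of the bottom-up search described on slide 37, under several assumptions: treelets[node] lists (covered_nodes, tgt_string, score) candidates rooted at node, scores simply add, and attached subtrees are placed before the treelet's own words rather than interleaved at every position as the real decoder must.

```python
import itertools

def order_score(hyp):
    return 0.0   # stand-in for the dependency order model

def decode(node, src_children, treelets, nbest):
    """Bottom-up exhaustive decoding sketch: at each src node, combine
    every matching treelet pair with n-best translations of the
    uncovered subtrees, trying all permutations of those subtrees."""
    candidates = []
    for covered, tgt, score in treelets[node]:
        uncovered = [c for c in src_children[node] if c not in covered]
        child_lists = [decode(c, src_children, treelets, nbest)
                       for c in uncovered]
        for choice in itertools.product(*child_lists):
            base = score + sum(s for _, s in choice)
            pieces = [t for t, _ in choice]
            for perm in itertools.permutations(pieces):
                hyp = " ".join(list(perm) + [tgt])  # placement simplified
                candidates.append((hyp, base + order_score(hyp)))
    candidates.sort(key=lambda c: -c[1])
    return candidates[:nbest]   # n-best pruning per node (slide 45)
```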

  40. Greedy ordering • Too many permutations to consider in exhaustive search. • In the greedy ordering: • Given a fixed pre- and post-modifier count, we choose the best modifier for each position.

  41. Greedy ordering algorithm
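The algorithm figure is missing here as well; the following sketch matches slide 40's description, with score(m, pos) standing in for the decision-tree order model. Everything else (names, data shapes) is assumed.

```python
def greedy_order(head, modifiers, n_pre, score):
    """With the pre-/post-modifier counts fixed, fill each
    head-relative position in turn with the best-scoring remaining
    modifier instead of trying all permutations."""
    n_post = len(modifiers) - n_pre
    positions = list(range(-n_pre, 0)) + list(range(1, n_post + 1))
    remaining, placed = list(modifiers), []
    for pos in positions:
        best = max(remaining, key=lambda m: score(m, pos))
        remaining.remove(best)
        placed.append((pos, best))
    pre = [m for p, m in placed if p < 0]
    post = [m for p, m in placed if p > 0]
    return pre + [head] + post
```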

  42. Numbers of candidates considered at each node • c: # of children specified in the treelet pair; r: # of subtrees that need to be attached. • Exhaustive search: (c+r+1)! / (c+1)! • Greedy search: (c+r)·r²
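A quick worked instance of these counts: with c = 1 child already fixed by the treelet pair and r = 3 subtrees to attach,

```latex
\frac{(c+r+1)!}{(c+1)!} = \frac{5!}{2!} = 60
\qquad\text{vs.}\qquad
(c+r)\,r^2 = 4 \cdot 9 = 36,
```

and the gap widens rapidly as r grows, since one count is factorial and the other polynomial.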

  43. Dynamic Programming • In string-based SMT, hyps for the same covered src word vector keep: • The last two target words in the hyp: for LM • List size is O(V²) • In treelet translation, hyps for the same src subtree keep: • The head word: for the order model • The first two target words: for LM • The last two target words: for LM • List size is O(V⁵) • DP does not allow for great savings because of the context we have to keep.

  44. Duplicate elimination • To eliminate unnecessary ordering operations, they use a hash table to check whether an unordered T has appeared before.
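One way to realize such a check is to hash a canonical, order-insensitive key of the tree, so that two child orderings of the same unordered tree collide; the accessors below are assumptions, not the paper's data structures.

```python
def unordered_key(node, children, word):
    """Canonical key for an unordered tree: sort the children's keys
    so any two orderings of the same tree map to the same key."""
    child_keys = sorted(unordered_key(c, children, word)
                        for c in children(node))
    return (word(node), tuple(child_keys))

# Usage: keep a `seen` set of keys; skip ordering a subtree whose
# unordered_key has already been processed.
```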

  45. Pruning • Prune treelet pairs (before the search starts): • Keep pairs whose MLE prob > threshold • Given a src treelet, keep those whose prob is within a ratio r of the best pair. • N-best lists: • Keep the N best for each node in the src dep tree.
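A sketch of the two pre-search filters, assuming pairs carry an MLE probability and that a ratio r in (0, 1] means "keep pairs at least r times as probable as the best":

```python
from collections import defaultdict

def prune_treelet_pairs(pairs, threshold, ratio):
    """pairs: iterable of (src_treelet, tgt_treelet, prob). Apply the
    absolute threshold first, then the per-src-treelet ratio filter."""
    by_src = defaultdict(list)
    for src, tgt, prob in pairs:
        if prob > threshold:
            by_src[src].append((tgt, prob))
    kept = []
    for src, cands in by_src.items():
        best = max(p for _, p in cands)
        kept.extend((src, t, p) for t, p in cands if p >= best * ratio)
    return kept
```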

  46. Outline • Limitations of SMT and previous work • Modeling and training • Decoding • Experiments • Conclusion

  47. Setting • Eng-Fr corpus of Microsoft technical data • Eng parser: NLPWIN, a rule-based in-house parser

  48. Main results (max phrase size = 4) [results table lost in the transcript]

  49. Effect of max phrase size [chart lost in the transcript]
