Better MT Using Parallel Dependency Trees
Yuan Ding, University of Pennsylvania
2003 (c) University of Pennsylvania
Outline • Motivation • The alignment algorithm • Algorithm at a glance • The framework • Heuristics • Walking through an example • Evaluation • Conclusion
Motivation (1): Statistical MT Approaches
• Statistical MT approaches
  • Pioneered by (Brown et al., 1990, 1993)
  • Leverage large training corpora
  • Outperform traditional transfer-based approaches
• Major criticism
  • No internal representation of syntax or semantics
Motivation (2): Hybrid Approaches
• Hybrid approaches
  • (Wu, 1997), (Alshawi et al., 2000), (Yamada and Knight, 2001, 2002), (Gildea, 2003)
  • Applying statistical learning to structured data
• Problems with hybrid MT approaches
  • Structural divergence (Dorr, 1994)
  • Vagaries of loose translations in real corpora
Motivation (3)
• Holy grail: syntax-based MT that captures structural divergence
• Accomplished work
  • A new approach to the alignment of parallel dependency trees (paper published at MT Summit IX)
  • Allows non-isomorphism of the two dependency trees
We are here…
Outline • Motivation • The alignment algorithm • Algorithm at a glance • The framework • Heuristics • Walking through an example • Evaluation • Conclusion
Define the Alignment Problem
• In natural language: find word mappings between the English and foreign sentences
• In math: given an English sentence e_1 ... e_l and a foreign sentence f_1 ... f_m, find for each foreign word f_j (1 <= j <= m) a labeling a_j in {0, 1, ..., l}, where a_j = i means f_j maps to the English word e_i (and a_j = 0 maps f_j to the empty word)
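As a concrete instance of this labeling, take the sentence pair used in the walk-through later in the talk: [English] "I have been here since 1947." and [Chinese] "1947 nian yilai wo yizhi zhu zai zheli." The four links the algorithm fixes there correspond to a_1 = 6 (1947 to "1947"), a_3 = 5 (yilai to "since"), a_4 = 1 (wo to "I") and a_8 = 4 (zheli to "here"); the labels for the remaining Chinese words are not shown on the slides.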
The IBM Models
• The IBM way (a small training sketch follows below)
  • Model 1: word order does not matter, i.e. a "bag of words" model
  • Model 2: condition the probabilities on sentence length and word position
  • Models 3, 4, 5:
    a. generate the fertility of each English word
    b. generate the identity (translation) of each word
    c. generate the position of each word
  • Gradually add more positioning information
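Because Step 1 of the framework on the next slides reuses IBM Model 1, here is a minimal sketch of Model 1 EM training; the function name, the toy corpus and the iteration count are illustrative assumptions, not material from the talk:

# Minimal IBM Model 1 EM training sketch: learns only lexical translation
# probabilities t(f | e), ignoring word order entirely.
from collections import defaultdict

def train_model1(bitext, iterations=10):
    # bitext: list of (foreign_words, english_words) sentence pairs.
    # A NULL token on the English side lets foreign words align to nothing.
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in bitext:
            es = ["NULL"] + list(es)
            for f in fs:
                z = sum(t[(f, e)] for e in es)    # normaliser for this f
                for e in es:
                    c = t[(f, e)] / z             # expected count (E-step)
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():           # M-step: renormalise
            t[(f, e)] = c / total[e]
    return t

if __name__ == "__main__":
    toy = [(["wo", "zhu", "zheli"], ["I", "live", "here"]),
           (["wo", "zai", "zheli"], ["I", "am", "here"])]
    t = train_model1(toy)
    print(round(t[("wo", "I")], 3), round(t[("zheli", "here")], 3))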
Using Dependency Trees
• Positioning information can be acquired from parse trees
  • Parsers: (Collins, 1999), (Bikel, 2002)
• Problems with using parse trees directly
  • Two types of nodes; the unlexicalized non-terminals control the domain
• Using dependency trees instead (see the sketch below)
  • (Fox, 2002): best* phrasal cohesion properties
  • (Xia, 2001): constructing dependency trees from parse trees using the Tree Adjoining Grammar
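For concreteness, one possible representation of the dependency trees the algorithm operates on; the class name and the head choices in the example are assumptions, not the actual parser output used in the work:

# A sketch of a dependency tree node; a "treelet" is simply a connected
# piece of such a tree.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DepNode:
    word: str
    children: List["DepNode"] = field(default_factory=list)

    def words(self):
        # All words in the subtree rooted at this node, head first.
        result = [self.word]
        for child in self.children:
            result.extend(child.words())
        return result

# Illustrative tree for "I have been here since 1947" (head choices assumed).
root = DepNode("been", [DepNode("I"), DepNode("have"),
                        DepNode("here"), DepNode("since", [DepNode("1947")])])
print(root.words())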
The Framework (1)
• Step 1: train IBM Model 1 to obtain lexical mapping probabilities
• Step 2: find and fix high-confidence mappings according to a heuristic function h(f, e) (a small sketch of this step follows below)
• A pseudo-translation example:
  • "The girl kissed her kitty cat"
  • "The girl gave a kiss to her cat"
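A small sketch of Step 2, assuming the heuristic h(f, e) is available as a black box; the function names and the hand-written scores are illustrative, not the authors' code:

def fix_best_mapping(foreign_words, english_words, h, already_fixed):
    # Score every candidate pair whose words are still free and keep the
    # single pair the heuristic trusts most.
    candidates = [(h(f, e), f, e)
                  for f in foreign_words if f not in already_fixed
                  for e in english_words if e not in already_fixed]
    if not candidates:
        return None
    score, f, e = max(candidates)
    return f, e, score

# Toy usage on the pseudo-translation example from the slide, with a
# hand-written stand-in for h(f, e).
scores = {("kiss", "kissed"): 0.9, ("girl", "girl"): 0.95, ("cat", "cat"): 0.8}
h = lambda f, e: scores.get((f, e), 0.0)
print(fix_best_mapping("the girl gave a kiss to her cat".split(),
                       "the girl kissed her kitty cat".split(),
                       h, already_fixed=set()))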
The Framework (2)
• Step 3: partition the dependency trees on both sides with respect to the fixed mappings (a sketch of this step follows below)
  • Each fixed mapping creates one new "treelet"
  • The result is a new set of parallel dependency structures
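A sketch of the Step 3 partition on one side of the bitext, under the assumptions that a tree is a nested (word, children) tuple, that words are unique within a sentence, and that the exact placement of a fixed node's remaining dependents follows the slide only loosely:

def partition(tree, fixed):
    # Cut the tree at every fixed word: each fixed word heads a treelet of
    # its own, and the original root heads one more.
    treelets = []

    def copy_without_fixed_subtrees(node):
        # Rebuild this subtree, detaching children that are themselves fixed
        # (they will start treelets of their own).
        word, children = node
        kept = [copy_without_fixed_subtrees(c) for c in children
                if c[0] not in fixed]
        return (word, kept)

    def walk(node):
        word, children = node
        if node is tree or word in fixed:
            treelets.append(copy_without_fixed_subtrees(node))
        for c in children:
            walk(c)

    walk(tree)
    return treelets

# "I have been here since 1947" with an illustrative head structure; after
# fixing "I" (iteration 1 of the walk-through) the tree splits in two.
tree = ("been", [("I", []), ("have", []),
                 ("here", []), ("since", [("1947", [])])])
for t in partition(tree, {"I"}):
    print(t)

Applying the same cut at the corresponding fixed node of the Chinese tree would yield the parallel treelet pairs the slide refers to.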
The Framework (3)
• Step 4: go back to Step 1 until enough nodes are fixed
• Algorithm properties
  • An iterative algorithm
  • Time complexity O(n * T(h)), where T(h) is the running time of the heuristic function in Step 2
  • P(f | e) in IBM Model 1 has a unique global maximum
  • Convergence is guaranteed
  • Results depend only on the heuristic function h(f, e)
Heuristics
• Heuristic functions for Step 2
  • Objective: estimate the confidence of a mapping between a pair of words
• First heuristic: entropy (see the sketch below)
  • Intuition: model the shape of the probability distribution
• Second heuristic: inside-outside probability
  • Idea borrowed from PCFG parsing
• Fertility threshold: rule out unlikely fertility ratios (> 2.0)
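The exact heuristic formulas are given in the MT Summit IX paper and are not spelled out on the slide; the sketch below is only one plausible reading of the entropy idea (a sharply peaked Model 1 distribution over foreign words, for a fixed English word e, signals a confident mapping), with all names and numbers invented for illustration:

import math

def entropy_confidence(f, e, t, foreign_vocab):
    # t[(f, e)]: Model 1 lexical probability t(f | e).
    probs = [t.get((fp, e), 0.0) for fp in foreign_vocab]
    total = sum(probs)
    if total == 0.0:
        return 0.0
    entropy = -sum((p / total) * math.log(p / total)
                   for p in probs if p > 0.0)
    # Combine "is f a likely translation of e" with "is the distribution peaked".
    return t.get((f, e), 0.0) / (1.0 + entropy)

t = {("wo", "I"): 0.8, ("zhu", "I"): 0.1, ("zheli", "I"): 0.1,
     ("wo", "here"): 0.3, ("zhu", "here"): 0.3, ("zheli", "here"): 0.4}
vocab = ["wo", "zhu", "zheli"]
print(round(entropy_confidence("wo", "I", t, vocab), 3))      # peaked: high score
print(round(entropy_confidence("zheli", "here", t, vocab), 3))  # flat: low score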
Outline • Motivation • The alignment algorithm • Algorithm at a glance • The framework • Heuristics • Walking through an example • Evaluation • Conclusion
Walking through an Example (1)
• [English] I have been here since 1947.
• [Chinese] 1947 nian yilai wo yizhi zhu zai zheli.
• Iteration 1: one dependency tree pair; align "I" and "wo"
Walking through an Example (2)
• Iteration 2: partition and form two treelet pairs; align "since" and "yilai"
Walking through an Example (3)
• Iteration 3: partition and form three treelet pairs; align "1947" and "1947", and "here" and "zheli"
Outline • Motivation • The alignment algorithm • Algorithm at a glance • The framework • Heuristics • Walking through an example • Evaluation • Conclusion
Evaluation
• Training
  • LDC Xinhua newswire Chinese-English parallel corpus
  • Roughly 50% filtered; 60K+ sentence pairs used
  • The parser generated 53,130 parsed sentence pairs
• Evaluation
  • 500 sentence pairs provided by Microsoft Research Asia, word-aligned by hand
• F-score (the formula is sketched below)
  • A: the set of word pairs produced by the automatic alignment
  • G: the set of word pairs aligned in the gold file
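The slide names the two link sets A and G but does not spell out the formula; the standard balanced F-score over these sets, which is presumably what is intended, is sketched below with made-up toy links:

def f_score(A, G):
    # A: links produced by the automatic alignment, G: gold-standard links.
    overlap = len(A & G)
    precision = overlap / len(A)
    recall = overlap / len(G)
    return 2 * precision * recall / (precision + recall)

A = {("wo", "I"), ("zheli", "here"), ("yilai", "here")}        # 3 proposed links
G = {("wo", "I"), ("zheli", "here"), ("yilai", "since"), ("1947", "1947")}
print(round(f_score(A, G), 3))   # precision 2/3, recall 2/4, F-score 0.571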
Results (1)
• Results for IBM Model 1 through Model 4 (GIZA)
  • Bootstrapped from Model 1 up to Model 4
  • Signs of overfitting
  • Suspected cause: the difference in genre between the training and test data
Results (2)
• Results for our algorithm
  • Heuristic h1: entropy
  • Heuristic h2: inside-outside probability
  • Results are after one iteration; M1 = IBM Model 1
• Overfitting problem
  • Mainly caused by violation of the partition assumption in fine-grained dependency structures
Outline • Motivation • The alignment algorithm • Algorithm at a glance • The framework • Heuristics • Walking through an example • Evaluation • Conclusion
Conclusion
• A model based on partitioning sentences according to their dependency structure
  • Without the unrealistic isomorphism assumption
• Outperforms the unstructured IBM models on a large data set
• "Orthogonal" to the IBM models
  • Uses syntactic structure but no linear ordering information
Thank You!