
A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University


Presentation Transcript


  1. A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance. Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

  2. JOSHUA: a scalable open-source parsing-based MT decoder
  • Written in the Java language
  • Chart parsing
  • Beam and cube pruning
  • K-best extraction over a hypergraph
  • m-gram LM integration
  • Parallel decoding
  • Distributed LM (Zhang et al., 2006; Brants et al., 2007) (new)
  • Equivalent LM state maintenance (new)
  • We plan to add more functions soon
  Most of these features follow Chiang (2007); the distributed LM and the equivalent LM state maintenance are new.

  3. Chart-parsing
  • Grammar formalism: Synchronous Context-Free Grammar (SCFG)
  • Chart parsing: bottom-up parsing
  • It maintains a chart, which contains an array of cells or bins; a cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved (see the sketch below)
  • The hypotheses are stored in a hypergraph
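The chart bookkeeping described above can be pictured with a small sketch. This is illustrative only (it is not the JOSHUA implementation, and the class and field names are made up); the actual rule application is elided.

```java
import java.util.ArrayList;
import java.util.List;

// Toy chart for bottom-up parsing: cells[i][j] is the bin holding the items
// proved over the source span [i, j).
class ToyChart {
    static class Item {
        final int i, j;            // source span covered by this item
        final String nonterminal;  // left-hand-side symbol, e.g. "X"
        Item(int i, int j, String nt) { this.i = i; this.j = j; this.nonterminal = nt; }
        @Override public String toString() { return nonterminal + " | " + i + ", " + j; }
    }

    final List<List<List<Item>>> cells = new ArrayList<>();

    ToyChart(int n) {
        for (int i = 0; i <= n; i++) {
            List<List<Item>> row = new ArrayList<>();
            for (int j = 0; j <= n; j++) row.add(new ArrayList<>());
            cells.add(row);
        }
    }

    // Bottom-up order: narrow spans first (the axioms come from width-1 lexical
    // rules), wider spans later, until an item covering [0, n) is proved.
    void fill(int n) {
        for (int width = 1; width <= n; width++)
            for (int i = 0; i + width <= n; i++)
                // Placeholder: a real decoder would apply the matching grammar
                // rules here, combining items from smaller cells.
                cells.get(i).get(i + width).add(new Item(i, i + width, "X"));
    }
}
```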

  4. Hypergraph
  [Figure: a hypergraph for the source sentence 垫子0 上1 的2 猫3. Lexical rules X (垫子 上, the mat) and X (猫, a cat) prove the items X | 0, 2 | the mat | NA and X | 3, 4 | a cat | NA; hyperedges for the rules X (X0 的 X1, X1 of X0), X (X0 的 X1, X1 on X0), X (X0 的 X1, X0 X1), and X (X0 的 X1, X0 's X1) lead to items such as X | 0, 4 | a cat | the mat and X | 0, 4 | the mat | a cat, which connect to the goal item S.]

  5. Hypergraph and Trees
  [Figure: the four derivation trees packed in the hypergraph above, each using a different translation of the rule X (X0 的 X1, ...): "the mat a cat", "the mat 's a cat", "a cat of the mat", and "a cat on the mat".]

  6. How to Integrate an m-gram LM?
  [Figure: derivation of the source sentence 奥运会0 将1 在2 中国3 的4 北京5 举行。6 into "the olympic game will be held in beijing of china .", with items such as X | 5, 6 | beijing | NA, X | 3, 6 | beijing of | of china, and X | 1, 7 | will be | china . built from rules like X (奥运会, the olympic game), X (X0 的 X1, X1 of X0), and X (将 在 X0 举行。, will be held in X0 .).]
  • Three functions: accumulate probability, estimate future cost, state extraction
  • Example: combining the items produces the new 3-grams "will be held", "be held in", "held in beijing", and "in beijing of" (a lower item had already produced the new 3-gram "beijing of china")
  • Finalized probability accumulated from the new n-grams: 0.04 = 0.4 * 0.2 * 0.5
  • Future (estimated) probability of the boundary words: P(beijing of) = 0.01
  • Estimated total probability: 0.01 * 0.04 = 0.0004
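To make the arithmetic on this slide concrete, here is a tiny sketch of how the finalized probability of newly completed n-grams combines with the estimated future probability of the boundary words. The numbers are the illustrative values from the slide; which 3-gram carries which factor is not specified there, so the variable names are generic.

```java
public class LmCombineExample {
    public static void main(String[] args) {
        // Probabilities of the newly completed 3-grams (illustrative values from
        // the slide, not real LM scores).
        double p1 = 0.4, p2 = 0.2, p3 = 0.5;
        double finalized = p1 * p2 * p3;            // 0.04: can be scored exactly now

        // Estimated (future) probability of the boundary words "beijing of",
        // used only for pruning until more left context becomes available.
        double future = 0.01;

        double estimatedTotal = finalized * future; // 0.0004
        System.out.printf("finalized=%.2f future=%.2f total=%.4f%n",
                          finalized, future, estimatedTotal);
    }
}
```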

  7. Equivalent State Maintenance: overview
  [Figure: several items over the span [0, 3] that differ only in their LM state words, e.g. X | 0, 3 | below cat | some rat, X | 0, 3 | under cat | some rat, X | 0, 3 | below cats | many rat, produced by variants of the rule X (在 X0 的 X1 下, below/under (the) X1 of X0); they are merged into the single item X | 0, 3 | below * | * rat.]
  • In a straightforward implementation, different LM state words lead to different items
  • We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard
  • By merging items, we can explore a larger hypothesis space in less time
  • We only merge items when the length of the English span l ≥ m-1

  8. Back-off Parameterization of m-gram LMs
  • LM probability computation: if the m-gram is listed in the LM, use its probability directly; otherwise back off, multiplying the back-off weight β of the history by the probability of the shorter (m-1)-gram
  • Observations: a larger m leads to more back-off; the default back-off weight is 1; for an m-gram not listed, β(·) = 1
  Example bigram entries from the LM file (log10 probability, bigram, optional log10 back-off weight):
  -4.250922 party files
  -4.741889 party filled
  -4.250922 party finance   -0.1434139
  -4.741889 party financed
  -4.741889 party finances   -0.2361806
  -4.741889 party financially
  -3.33127  party financing  -0.1119054
  -3.277455 party finished   -0.4362795
  -4.012205 party fired
  -4.741889 party fires
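A hedged sketch of the back-off recursion summarized above, using a toy in-memory table whose entries mirror the excerpt (log10 probability, n-gram, optional log10 back-off weight). This is not JOSHUA code; it only illustrates the observation that an unlisted m-gram falls back to the (m-1)-gram with a default back-off weight of 1.

```java
import java.util.HashMap;
import java.util.Map;

// Toy back-off m-gram model: if "history word" is listed, use its probability;
// otherwise P(word | history) = beta(history) * P(word | shorter history),
// where beta(history) defaults to 1 when history has no back-off entry.
class ToyBackoffLm {
    static class Entry {
        final double logProb, logBackoff;   // base-10 logs, as in the excerpt above
        Entry(double p, double b) { logProb = p; logBackoff = b; }
    }

    private final Map<String, Entry> table = new HashMap<>();

    void add(String ngram, double logProb, double logBackoff) {
        table.put(ngram, new Entry(logProb, logBackoff));
    }

    // history and word are space-separated token strings; returns log10 P(word | history).
    double logProb(String history, String word) {
        String ngram = history.isEmpty() ? word : history + " " + word;
        Entry e = table.get(ngram);
        if (e != null) return e.logProb;                   // n-gram listed: use it directly
        if (history.isEmpty()) return -99.0;               // unseen unigram: floor value
        Entry h = table.get(history);
        double logBeta = (h == null) ? 0.0 : h.logBackoff; // default back-off weight is 1
        int cut = history.indexOf(' ');
        String shorter = (cut < 0) ? "" : history.substring(cut + 1);
        return logBeta + logProb(shorter, word);           // back off to the (m-1)-gram
    }
}
```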

  9. Equivalent State Maintenance: Right-side
  • For the case of a 4-gram LM: the right state words are el-2 el-1 el; the future words el+1 el+2 el+3 ... will be appended on the right
  • If el-2 el-1 el is not a prefix of any listed 4-gram, then P(el+1 | el-2 el-1 el) = P(el+1 | el-1 el) β(el-2 el-1 el) = P(el+1 | el-1 el): the back-off weight is one, so the result is independent of el-2
  • State, IS-A-PREFIX, equivalent state:
    el-2 el-1 el, IS-A-PREFIX(el-2 el-1 el) = no → * el-1 el
    * el-1 el, IS-A-PREFIX(el-1 el) = no → * * el
    * * el, IS-A-PREFIX(el) = no → * * *
  • Note: IS-A-PREFIX(el-1 el) = no implies IS-A-PREFIX(el-1 el el+1) = no
  • Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
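A minimal sketch of the right-side rule in the table above: scanning the right state from left to right, a state word is replaced by the wildcard as long as the remaining words are not a prefix of any longer n-gram listed in the LM. The isPrefix predicate is a hypothetical lookup, not a JOSHUA API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// rightState holds the last (m-1) target words of a hypothesis; isPrefix should
// answer whether a word sequence is a prefix of any longer n-gram listed in the LM.
class RightStateEquivalence {
    static final String WILDCARD = "*";

    static List<String> equivalentRightState(List<String> rightState,
                                             Predicate<List<String>> isPrefix) {
        List<String> state = new ArrayList<>(rightState);
        for (int start = 0; start < state.size(); start++) {
            // If the remaining words are a prefix of some longer listed n-gram,
            // a future right context may still need all of them: stop here.
            if (isPrefix.test(rightState.subList(start, rightState.size()))) break;
            // Otherwise every future lookup backs off past this word, and its
            // back-off weight defaults to 1, so it can be wildcarded.
            state.set(start, WILDCARD);
        }
        return state;
    }
}
```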

  10. Equivalent State Maintenance: Left-side
  • For the case of a 4-gram LM: the left state words are e1 e2 e3; the future words ... e-2 e-1 e0 will be prepended on the left
  • If e1 e2 e3 is not a suffix of any listed 4-gram: P(e3 | e0 e1 e2) = P(e3 | e1 e2) β(e0 e1 e2). The probability P(e3 | e1 e2) is finalized now, and the remaining back-off weight does not involve e3 (remember to factor in the back-off weights later), so the result is independent of e3
  • Likewise, if e1 e2 is not a suffix of any listed n-gram: P(e2 | e-1 e0 e1) = P(e2 | e1) β(e0 e1) β(e-1 e0 e1); and if e1 is not a suffix of any listed n-gram: P(e1 | e-2 e-1 e0) = P(e1) β(e0) β(e-1 e0) β(e-2 e-1 e0)
  • State, IS-A-SUFFIX, equivalent state:
    e1 e2 e3, IS-A-SUFFIX(e1 e2 e3) = no → e1 e2 *
    e1 e2 *, IS-A-SUFFIX(e1 e2) = no → e1 * *
    e1 * *, IS-A-SUFFIX(e1) = no → * * *
  • Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
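A matching sketch for the left-side rule: scanning the left state from right to left, the last remaining word is wildcarded as long as the kept prefix is not a suffix of any longer listed n-gram (its probability is finalized with a lower-order estimate, and back-off weights are factored in later). The isSuffix predicate is again hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// leftState holds the first (m-1) target words of a hypothesis; isSuffix should
// answer whether a word sequence is a suffix of any longer n-gram listed in the LM.
class LeftStateEquivalence {
    static final String WILDCARD = "*";

    static List<String> equivalentLeftState(List<String> leftState,
                                            Predicate<List<String>> isSuffix) {
        List<String> state = new ArrayList<>(leftState);
        for (int end = state.size(); end >= 1; end--) {
            // If the first `end` words are a suffix of some longer listed n-gram,
            // a future left context may still need them: stop wildcarding.
            if (isSuffix.test(leftState.subList(0, end))) break;
            // Otherwise the probability of word `end-1` given any future left
            // context reduces to an already-known lower-order probability (plus
            // back-off weights factored in later), so the word can be wildcarded.
            state.set(end - 1, WILDCARD);
        }
        return state;
    }
}
```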

  11. Equivalent State Maintenance: summary
  [Equations comparing the original and the modified cost functions, each broken into finalized probability, estimated probability, and state extraction.]

  12. Experimental Results: Decoding Speed
  • System training
  • Task: Chinese-to-English translation
  • Sub-sampling of a bitext of about 3M sentence pairs to obtain 570k sentence pairs
  • LM training data: Gigaword and the English side of the bitext
  • Decoding setup: number of rules: 3M; number of m-grams: 49M
  • Decoding speed: 38 times faster than the baseline!

  13. Experimental Results: Distributed LM
  • Distributed language model: eight 7-gram LMs
  • Decoding speed: 12.2 sec/sent

  14. Experimental Results: Equivalent LM States
  [Figure: search effort versus search quality, with and without equivalent LM state maintenance.]
  • Sparse LM: a 7-gram LM built on about 19M words
  • Dense LM: a 3-gram LM built on about 130M words
  • The equivalent LM state maintenance is slower than the regular method: back-off happens less frequently, and the suffix/prefix information lookup is inefficient

  15. Summary
  • We describe a scalable parsing-based MT decoder
  • The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task
  • We propose a method to maintain equivalent LM states
  • The decoder is available at http://www.cs.jhu.edu/~zfli/

  16. Acknowledgements
  • Thanks to Philip Resnik for letting me use the UMD Python decoder
  • Thanks to the UMD MT group members for very helpful discussions
  • Thanks to David Chiang for Hiero and his original implementation in Python

  17. Thank you!

  18. Grammar Formalism
  • Synchronous Context-Free Grammar (SCFG)
  • Ts: a set of source-language terminal symbols
  • Tt: a set of target-language terminal symbols
  • N: a shared set of nonterminal symbols
  • A set of rules, each rewriting a nonterminal into a pair of aligned right-hand sides, one over source terminals and nonterminals and one over target terminals and nonterminals, with co-indexed nonterminals
  • A typical rule looks like: X → (X0 的 X1, X1 of X0)
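As a small illustration of the formalism, here is a toy container for an SCFG rule holding the typical rule above; the class and field names are made up for exposition.

```java
import java.util.Arrays;
import java.util.List;

// An SCFG rule: a left-hand-side nonterminal and a pair of aligned right-hand
// sides, one over source terminals/nonterminals and one over target
// terminals/nonterminals (nonterminals shared and co-indexed).
class ScfgRule {
    final String lhs;
    final List<String> sourceRhs;
    final List<String> targetRhs;

    ScfgRule(String lhs, List<String> sourceRhs, List<String> targetRhs) {
        this.lhs = lhs;
        this.sourceRhs = sourceRhs;
        this.targetRhs = targetRhs;
    }

    public static void main(String[] args) {
        // The typical rule from the slides: X -> <X0 的 X1, X1 of X0>
        ScfgRule rule = new ScfgRule("X",
                Arrays.asList("X0", "的", "X1"),
                Arrays.asList("X1", "of", "X0"));
        System.out.println(rule.lhs + " -> <" + String.join(" ", rule.sourceRhs)
                + ", " + String.join(" ", rule.targetRhs) + ">");
    }
}
```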

  19. Chart-parsing
  • Grammar formalism: Synchronous Context-free Grammar (SCFG)
  • Decoding task: find the best derivation (and its target-side yield) of the source sentence under the grammar and the language model
  • Chart parsing: it maintains a chart, which contains an array of cells or bins; a cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a structure called a hypergraph

  20. m-gram LM Integration
  • Three functions: accumulate probability, estimate future cost, state extraction
  [Equations for the cost function: finalized probability, estimated probability, and state extraction.]
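The three functions can be sketched as an interface; this is a hedged outline of the roles listed above, not the actual JOSHUA API.

```java
import java.util.List;

// The three LM-related operations performed when a new item is built: score the
// m-grams that become complete, estimate a future cost for boundary words whose
// context is still unknown, and extract the LM state the new item must carry.
interface LmIntegration {

    // Log probability accumulated over the m-grams completed by this combination.
    double accumulate(List<String> targetWords);

    // Heuristic (future) cost for the boundary words, used only for pruning.
    double estimateFutureCost(List<String> boundaryWords);

    // The left and right LM state words (up to m-1 on each side) of the new item.
    LmState extractState(List<String> targetWords);

    class LmState {
        final List<String> leftWords, rightWords;
        LmState(List<String> leftWords, List<String> rightWords) {
            this.leftWords = leftWords;
            this.rightWords = rightWords;
        }
    }
}
```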

  21. Parallel and Distributed Decoding
  • Parallel decoding: divide the test set into multiple parts; each part is decoded by a separate thread; the threads share the language/translation models in memory
  • Distributed Language Model (DLM)
  • Training: divide the corpora into multiple parts; train an LM on each part; find the optimal weights among the LMs by maximizing the likelihood of a dev set
  • Decoding: load the LMs onto different servers; the decoder remotely calls the servers to obtain the probabilities and then interpolates them on the fly; to save communication overhead, a cache is maintained (see the sketch below)
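A sketch of the distributed-LM lookup just described: each n-gram probability is fetched from several LM servers, interpolated on the fly with the tuned weights, and cached to avoid repeated communication. The LmServer interface stands in for whatever remote-call mechanism is actually used; it is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// On-the-fly interpolation of several remotely hosted LMs, with a cache to
// avoid repeated network round trips for the same n-gram.
class DistributedLmClient {
    // Hypothetical remote handle: returns a (non-log) probability for an n-gram.
    interface LmServer { double lookup(String ngram); }

    private final LmServer[] servers;
    private final double[] weights;            // tuned to maximize dev-set likelihood
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLmClient(LmServer[] servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    double probability(String ngram) {
        Double cached = cache.get(ngram);
        if (cached != null) return cached;     // cache hit: no communication needed
        double p = 0.0;
        for (int k = 0; k < servers.length; k++)
            p += weights[k] * servers[k].lookup(ngram);  // linear interpolation
        cache.put(ngram, p);
        return p;
    }
}
```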

  22. Chart-parsing
  • Decoding task: find the best derivation of the source sentence under the grammar and the language model
  • Chart parsing: it maintains a chart, which contains an array of cells or bins; a cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a structure called a hypergraph
  • State of an item: source span, left-side nonterminal symbol, and left/right LM state
  • Decoding complexity
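Since the state of an item is what decides whether two hypotheses can be recombined, a natural sketch is a signature object whose equality and hashing cover exactly the fields listed above (source span, left-side nonterminal, left/right LM state). The class below is illustrative, not JOSHUA code.

```java
import java.util.List;
import java.util.Objects;

// Signature of a chart item: hypotheses with the same span, nonterminal, and
// LM state words are recombined into a single item.
class ItemSignature {
    final int spanStart, spanEnd;
    final String nonterminal;
    final List<String> leftLmState, rightLmState;

    ItemSignature(int spanStart, int spanEnd, String nonterminal,
                  List<String> leftLmState, List<String> rightLmState) {
        this.spanStart = spanStart;
        this.spanEnd = spanEnd;
        this.nonterminal = nonterminal;
        this.leftLmState = leftLmState;
        this.rightLmState = rightLmState;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof ItemSignature)) return false;
        ItemSignature s = (ItemSignature) o;
        return spanStart == s.spanStart && spanEnd == s.spanEnd
                && nonterminal.equals(s.nonterminal)
                && leftLmState.equals(s.leftLmState)
                && rightLmState.equals(s.rightLmState);
    }

    @Override public int hashCode() {
        return Objects.hash(spanStart, spanEnd, nonterminal, leftLmState, rightLmState);
    }
}
```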

  23. Hypergraph
  [Figure: the hypergraph for 垫子0 上1 的2 猫3 shown earlier, with the goal item S, items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat, and hyperedges for the rules X (垫子 上, the mat), X (猫, a cat), and the four X (X0 的 X1, ...) variants, yielding "a cat on the mat".]
  • A hypergraph consists of a set of nodes and hyperedges; in parsing, they correspond to items and deductive steps, respectively
  • Roughly, a hyperedge can be thought of as a rule with pointers
  • State of an item: source span, left-side nonterminal symbol, and left/right LM state
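A minimal illustration of the node/hyperedge view ("a hyperedge is a rule with pointers"); the names are made up.

```java
import java.util.List;

// A hypergraph: nodes correspond to items, hyperedges to deductive steps. Each
// hyperedge is essentially a grammar rule plus pointers to the antecedent nodes
// that instantiated its nonterminals.
class ToyHypergraph {
    static class Node {
        final String signature;          // e.g. "X | 0, 2 | the mat | NA"
        final List<HyperEdge> incoming;  // alternative ways of deriving this item
        Node(String signature, List<HyperEdge> incoming) {
            this.signature = signature;
            this.incoming = incoming;
        }
    }

    static class HyperEdge {
        final String rule;               // e.g. "X -> <X0 的 X1, X1 of X0>"
        final List<Node> antecedents;    // tail nodes the rule's nonterminals point to
        HyperEdge(String rule, List<Node> antecedents) {
            this.rule = rule;
            this.antecedents = antecedents;
        }
    }
}
```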
