
A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University


Presentation Transcript


  1. A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance. Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

  2. JOSHUA: a scalable open-source parsing-based MT decoder
  • Written in the Java language
  • Chart parsing
  • Beam and cube pruning
  • K-best extraction over a hypergraph
  • m-gram LM integration
  • Parallel decoding
  • Distributed LM (Zhang et al., 2006; Brants et al., 2007) (new)
  • Equivalent LM state maintenance (new)
  • We plan to add more functions soon
  Most of these features follow Chiang (2007); the distributed LM and the equivalent LM state maintenance are new.

  3. Chart-parsing
  • Grammar formalism: Synchronous Context-Free Grammar (SCFG)
  • Chart parsing: bottom-up parsing
  • It maintains a chart, which contains an array of cells or bins; a cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved (see the sketch below)
  • The hypotheses are stored in a hypergraph
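The chart bookkeeping described above can be pictured with a small sketch. This is illustrative only (it is not the JOSHUA implementation, and the class and field names are made up); the actual rule application is elided.

```java
import java.util.ArrayList;
import java.util.List;

// Toy chart for bottom-up parsing: cells[i][j] is the bin holding the items
// proved over the source span [i, j).
class ToyChart {
    static class Item {
        final int i, j;            // source span covered by this item
        final String nonterminal;  // left-hand-side symbol, e.g. "X"
        Item(int i, int j, String nt) { this.i = i; this.j = j; this.nonterminal = nt; }
        @Override public String toString() { return nonterminal + " | " + i + ", " + j; }
    }

    final List<List<List<Item>>> cells = new ArrayList<>();

    ToyChart(int n) {
        for (int i = 0; i <= n; i++) {
            List<List<Item>> row = new ArrayList<>();
            for (int j = 0; j <= n; j++) row.add(new ArrayList<>());
            cells.add(row);
        }
    }

    // Bottom-up order: narrow spans first (the axioms come from width-1 lexical
    // rules), wider spans later, until an item covering [0, n) is proved.
    void fill(int n) {
        for (int width = 1; width <= n; width++)
            for (int i = 0; i + width <= n; i++)
                // Placeholder: a real decoder would apply the matching grammar
                // rules here, combining items from smaller cells.
                cells.get(i).get(i + width).add(new Item(i, i + width, "X"));
    }
}
```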

  4. Hypergraph
  [Figure: a hypergraph for the source sentence 垫子0 上1 的2 猫3. Lexical rules X (垫子 上, the mat) and X (猫, a cat) prove the items X | 0, 2 | the mat | NA and X | 3, 4 | a cat | NA; hyperedges for the rules X (X0 的 X1, X1 of X0), X (X0 的 X1, X1 on X0), X (X0 的 X1, X0 X1), and X (X0 的 X1, X0 's X1) lead to items such as X | 0, 4 | a cat | the mat and X | 0, 4 | the mat | a cat, which connect to the goal item S.]

  5. Hypergraph and Trees
  [Figure: the four derivation trees packed in the hypergraph above, each using a different translation of the rule X (X0 的 X1, ...): "the mat a cat", "the mat 's a cat", "a cat of the mat", and "a cat on the mat".]

  6. How to Integrate an m-gram LM?
  [Figure: derivation of the source sentence 奥运会0 将1 在2 中国3 的4 北京5 举行。6 into "the olympic game will be held in beijing of china .", with items such as X | 5, 6 | beijing | NA, X | 3, 6 | beijing of | of china, and X | 1, 7 | will be | china . built from rules like X (奥运会, the olympic game), X (X0 的 X1, X1 of X0), and X (将 在 X0 举行。, will be held in X0 .).]
  • Three functions: accumulate probability, estimate future cost, state extraction
  • Example: combining the items produces the new 3-grams "will be held", "be held in", "held in beijing", and "in beijing of" (a lower item had already produced the new 3-gram "beijing of china")
  • Finalized probability accumulated from the new n-grams: 0.04 = 0.4 * 0.2 * 0.5
  • Future (estimated) probability of the boundary words: P(beijing of) = 0.01
  • Estimated total probability: 0.01 * 0.04 = 0.0004
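To make the arithmetic on this slide concrete, here is a tiny sketch of how the finalized probability of newly completed n-grams combines with the estimated future probability of the boundary words. The numbers are the illustrative values from the slide; which 3-gram carries which factor is not specified there, so the variable names are generic.

```java
public class LmCombineExample {
    public static void main(String[] args) {
        // Probabilities of the newly completed 3-grams (illustrative values from
        // the slide, not real LM scores).
        double p1 = 0.4, p2 = 0.2, p3 = 0.5;
        double finalized = p1 * p2 * p3;            // 0.04: can be scored exactly now

        // Estimated (future) probability of the boundary words "beijing of",
        // used only for pruning until more left context becomes available.
        double future = 0.01;

        double estimatedTotal = finalized * future; // 0.0004
        System.out.printf("finalized=%.2f future=%.2f total=%.4f%n",
                          finalized, future, estimatedTotal);
    }
}
```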

  7. Equivalent State Maintenance: overview
  [Figure: several items over the span [0, 3] that differ only in their LM state words, e.g. X | 0, 3 | below cat | some rat, X | 0, 3 | under cat | some rat, X | 0, 3 | below cats | many rat, produced by variants of the rule X (在 X0 的 X1 下, below/under (the) X1 of X0); they are merged into the single item X | 0, 3 | below * | * rat.]
  • In a straightforward implementation, different LM state words lead to different items
  • We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard
  • By merging items, we can explore a larger hypothesis space in less time
  • We only merge items when the length of the English span l ≥ m-1

  8. Back-off Parameterization of m-gram LMs
  • LM probability computation: if the m-gram is listed in the LM, use its probability directly; otherwise back off, multiplying the back-off weight β of the history by the probability of the shorter (m-1)-gram
  • Observations: a larger m leads to more back-off; the default back-off weight is 1; for an m-gram not listed, β(·) = 1
  Example bigram entries from the LM file (log10 probability, bigram, optional log10 back-off weight):
  -4.250922 party files
  -4.741889 party filled
  -4.250922 party finance   -0.1434139
  -4.741889 party financed
  -4.741889 party finances   -0.2361806
  -4.741889 party financially
  -3.33127  party financing  -0.1119054
  -3.277455 party finished   -0.4362795
  -4.012205 party fired
  -4.741889 party fires
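A hedged sketch of the back-off recursion summarized above, using a toy in-memory table whose entries mirror the excerpt (log10 probability, n-gram, optional log10 back-off weight). This is not JOSHUA code; it only illustrates the observation that an unlisted m-gram falls back to the (m-1)-gram with a default back-off weight of 1.

```java
import java.util.HashMap;
import java.util.Map;

// Toy back-off m-gram model: if "history word" is listed, use its probability;
// otherwise P(word | history) = beta(history) * P(word | shorter history),
// where beta(history) defaults to 1 when history has no back-off entry.
class ToyBackoffLm {
    static class Entry {
        final double logProb, logBackoff;   // base-10 logs, as in the excerpt above
        Entry(double p, double b) { logProb = p; logBackoff = b; }
    }

    private final Map<String, Entry> table = new HashMap<>();

    void add(String ngram, double logProb, double logBackoff) {
        table.put(ngram, new Entry(logProb, logBackoff));
    }

    // history and word are space-separated token strings; returns log10 P(word | history).
    double logProb(String history, String word) {
        String ngram = history.isEmpty() ? word : history + " " + word;
        Entry e = table.get(ngram);
        if (e != null) return e.logProb;                   // n-gram listed: use it directly
        if (history.isEmpty()) return -99.0;               // unseen unigram: floor value
        Entry h = table.get(history);
        double logBeta = (h == null) ? 0.0 : h.logBackoff; // default back-off weight is 1
        int cut = history.indexOf(' ');
        String shorter = (cut < 0) ? "" : history.substring(cut + 1);
        return logBeta + logProb(shorter, word);           // back off to the (m-1)-gram
    }
}
```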

  9. Equivalent State Maintenance: Right-side
  • For the case of a 4-gram LM: the right state words are el-2 el-1 el; the future words el+1 el+2 el+3 ... will be appended on the right
  • If el-2 el-1 el is not a prefix of any listed 4-gram, then P(el+1 | el-2 el-1 el) = P(el+1 | el-1 el) β(el-2 el-1 el) = P(el+1 | el-1 el): the back-off weight is one, so the result is independent of el-2
  • State, IS-A-PREFIX, equivalent state:
    el-2 el-1 el, IS-A-PREFIX(el-2 el-1 el) = no → * el-1 el
    * el-1 el, IS-A-PREFIX(el-1 el) = no → * * el
    * * el, IS-A-PREFIX(el) = no → * * *
  • Note: IS-A-PREFIX(el-1 el) = no implies IS-A-PREFIX(el-1 el el+1) = no
  • Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
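A minimal sketch of the right-side rule in the table above: scanning the right state from left to right, a state word is replaced by the wildcard as long as the remaining words are not a prefix of any longer n-gram listed in the LM. The isPrefix predicate is a hypothetical lookup, not a JOSHUA API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// rightState holds the last (m-1) target words of a hypothesis; isPrefix should
// answer whether a word sequence is a prefix of any longer n-gram listed in the LM.
class RightStateEquivalence {
    static final String WILDCARD = "*";

    static List<String> equivalentRightState(List<String> rightState,
                                             Predicate<List<String>> isPrefix) {
        List<String> state = new ArrayList<>(rightState);
        for (int start = 0; start < state.size(); start++) {
            // If the remaining words are a prefix of some longer listed n-gram,
            // a future right context may still need all of them: stop here.
            if (isPrefix.test(rightState.subList(start, rightState.size()))) break;
            // Otherwise every future lookup backs off past this word, and its
            // back-off weight defaults to 1, so it can be wildcarded.
            state.set(start, WILDCARD);
        }
        return state;
    }
}
```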

  10. Equivalent State Maintenance: Left-side
  • For the case of a 4-gram LM: the left state words are e1 e2 e3; the future words ... e-2 e-1 e0 will be prepended on the left
  • If e1 e2 e3 is not a suffix of any listed 4-gram: P(e3 | e0 e1 e2) = P(e3 | e1 e2) β(e0 e1 e2). The probability P(e3 | e1 e2) is finalized now, and the remaining back-off weight does not involve e3 (remember to factor in the back-off weights later), so the result is independent of e3
  • Likewise, if e1 e2 is not a suffix of any listed n-gram: P(e2 | e-1 e0 e1) = P(e2 | e1) β(e0 e1) β(e-1 e0 e1); and if e1 is not a suffix of any listed n-gram: P(e1 | e-2 e-1 e0) = P(e1) β(e0) β(e-1 e0) β(e-2 e-1 e0)
  • State, IS-A-SUFFIX, equivalent state:
    e1 e2 e3, IS-A-SUFFIX(e1 e2 e3) = no → e1 e2 *
    e1 e2 *, IS-A-SUFFIX(e1 e2) = no → e1 * *
    e1 * *, IS-A-SUFFIX(e1) = no → * * *
  • Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
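A matching sketch for the left-side rule: scanning the left state from right to left, the last remaining word is wildcarded as long as the kept prefix is not a suffix of any longer listed n-gram (its probability is finalized with a lower-order estimate, and back-off weights are factored in later). The isSuffix predicate is again hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// leftState holds the first (m-1) target words of a hypothesis; isSuffix should
// answer whether a word sequence is a suffix of any longer n-gram listed in the LM.
class LeftStateEquivalence {
    static final String WILDCARD = "*";

    static List<String> equivalentLeftState(List<String> leftState,
                                            Predicate<List<String>> isSuffix) {
        List<String> state = new ArrayList<>(leftState);
        for (int end = state.size(); end >= 1; end--) {
            // If the first `end` words are a suffix of some longer listed n-gram,
            // a future left context may still need them: stop wildcarding.
            if (isSuffix.test(leftState.subList(0, end))) break;
            // Otherwise the probability of word `end-1` given any future left
            // context reduces to an already-known lower-order probability (plus
            // back-off weights factored in later), so the word can be wildcarded.
            state.set(end - 1, WILDCARD);
        }
        return state;
    }
}
```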

  11. Equivalent State Maintenance: summary
  [Equations comparing the original and the modified cost functions, each broken into finalized probability, estimated probability, and state extraction.]

  12. Experimental Results: Decoding Speed
  • System training
  • Task: Chinese-to-English translation
  • Sub-sampling of a bitext of about 3M sentence pairs to obtain 570k sentence pairs
  • LM training data: Gigaword and the English side of the bitext
  • Decoding setup: number of rules: 3M; number of m-grams: 49M
  • Decoding speed: 38 times faster than the baseline!

  13. Experimental Results: Distributed LM
  • Distributed language model: eight 7-gram LMs
  • Decoding speed: 12.2 sec/sent

  14. Experimental Results: Equivalent LM States
  [Figure: search effort versus search quality, with and without equivalent LM state maintenance.]
  • Sparse LM: a 7-gram LM built on about 19M words
  • Dense LM: a 3-gram LM built on about 130M words
  • The equivalent LM state maintenance is slower than the regular method: back-off happens less frequently, and the suffix/prefix information lookup is inefficient

  15. Summary
  • We describe a scalable parsing-based MT decoder
  • The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task
  • We propose a method to maintain equivalent LM states
  • The decoder is available at http://www.cs.jhu.edu/~zfli/

  16. Acknowledgements
  • Thanks to Philip Resnik for letting me use the UMD Python decoder
  • Thanks to the UMD MT group members for very helpful discussions
  • Thanks to David Chiang for Hiero and his original implementation in Python

  17. Thank you!

  18. Grammar Formalism
  • Synchronous Context-Free Grammar (SCFG)
  • Ts: a set of source-language terminal symbols
  • Tt: a set of target-language terminal symbols
  • N: a shared set of nonterminal symbols
  • A set of rules, each rewriting a nonterminal into a pair of aligned right-hand sides, one over source terminals and nonterminals and one over target terminals and nonterminals, with co-indexed nonterminals
  • A typical rule looks like: X → (X0 的 X1, X1 of X0)
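As a small illustration of the formalism, here is a toy container for an SCFG rule holding the typical rule above; the class and field names are made up for exposition.

```java
import java.util.Arrays;
import java.util.List;

// An SCFG rule: a left-hand-side nonterminal and a pair of aligned right-hand
// sides, one over source terminals/nonterminals and one over target
// terminals/nonterminals (nonterminals shared and co-indexed).
class ScfgRule {
    final String lhs;
    final List<String> sourceRhs;
    final List<String> targetRhs;

    ScfgRule(String lhs, List<String> sourceRhs, List<String> targetRhs) {
        this.lhs = lhs;
        this.sourceRhs = sourceRhs;
        this.targetRhs = targetRhs;
    }

    public static void main(String[] args) {
        // The typical rule from the slides: X -> <X0 的 X1, X1 of X0>
        ScfgRule rule = new ScfgRule("X",
                Arrays.asList("X0", "的", "X1"),
                Arrays.asList("X1", "of", "X0"));
        System.out.println(rule.lhs + " -> <" + String.join(" ", rule.sourceRhs)
                + ", " + String.join(" ", rule.targetRhs) + ">");
    }
}
```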

  19. Chart-parsing
  • Grammar formalism: Synchronous Context-free Grammar (SCFG)
  • Decoding task: find the best derivation (and its target-side yield) of the source sentence under the grammar and the language model
  • Chart parsing: it maintains a chart, which contains an array of cells or bins; a cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a structure called a hypergraph

  20. m-gram LM Integration
  • Three functions: accumulate probability, estimate future cost, state extraction
  [Equations for the cost function: finalized probability, estimated probability, and state extraction.]
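The three functions can be sketched as an interface; this is a hedged outline of the roles listed above, not the actual JOSHUA API.

```java
import java.util.List;

// The three LM-related operations performed when a new item is built: score the
// m-grams that become complete, estimate a future cost for boundary words whose
// context is still unknown, and extract the LM state the new item must carry.
interface LmIntegration {

    // Log probability accumulated over the m-grams completed by this combination.
    double accumulate(List<String> targetWords);

    // Heuristic (future) cost for the boundary words, used only for pruning.
    double estimateFutureCost(List<String> boundaryWords);

    // The left and right LM state words (up to m-1 on each side) of the new item.
    LmState extractState(List<String> targetWords);

    class LmState {
        final List<String> leftWords, rightWords;
        LmState(List<String> leftWords, List<String> rightWords) {
            this.leftWords = leftWords;
            this.rightWords = rightWords;
        }
    }
}
```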

  21. Parallel and Distributed Decoding
  • Parallel decoding: divide the test set into multiple parts; each part is decoded by a separate thread; the threads share the language/translation models in memory
  • Distributed Language Model (DLM)
  • Training: divide the corpora into multiple parts; train an LM on each part; find the optimal weights among the LMs by maximizing the likelihood of a dev set
  • Decoding: load the LMs onto different servers; the decoder remotely calls the servers to obtain the probabilities and then interpolates them on the fly; to save communication overhead, a cache is maintained (see the sketch below)
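A sketch of the distributed-LM lookup just described: each n-gram probability is fetched from several LM servers, interpolated on the fly with the tuned weights, and cached to avoid repeated communication. The LmServer interface stands in for whatever remote-call mechanism is actually used; it is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// On-the-fly interpolation of several remotely hosted LMs, with a cache to
// avoid repeated network round trips for the same n-gram.
class DistributedLmClient {
    // Hypothetical remote handle: returns a (non-log) probability for an n-gram.
    interface LmServer { double lookup(String ngram); }

    private final LmServer[] servers;
    private final double[] weights;            // tuned to maximize dev-set likelihood
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLmClient(LmServer[] servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    double probability(String ngram) {
        Double cached = cache.get(ngram);
        if (cached != null) return cached;     // cache hit: no communication needed
        double p = 0.0;
        for (int k = 0; k < servers.length; k++)
            p += weights[k] * servers[k].lookup(ngram);  // linear interpolation
        cache.put(ngram, p);
        return p;
    }
}
```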

  22. Chart-parsing
  • Decoding task: find the best derivation of the source sentence under the grammar and the language model
  • Chart parsing: it maintains a chart, which contains an array of cells or bins; a cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a structure called a hypergraph
  • State of an item: source span, left-side nonterminal symbol, and left/right LM state
  • Decoding complexity
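Since the state of an item is what decides whether two hypotheses can be recombined, a natural sketch is a signature object whose equality and hashing cover exactly the fields listed above (source span, left-side nonterminal, left/right LM state). The class below is illustrative, not JOSHUA code.

```java
import java.util.List;
import java.util.Objects;

// Signature of a chart item: hypotheses with the same span, nonterminal, and
// LM state words are recombined into a single item.
class ItemSignature {
    final int spanStart, spanEnd;
    final String nonterminal;
    final List<String> leftLmState, rightLmState;

    ItemSignature(int spanStart, int spanEnd, String nonterminal,
                  List<String> leftLmState, List<String> rightLmState) {
        this.spanStart = spanStart;
        this.spanEnd = spanEnd;
        this.nonterminal = nonterminal;
        this.leftLmState = leftLmState;
        this.rightLmState = rightLmState;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof ItemSignature)) return false;
        ItemSignature s = (ItemSignature) o;
        return spanStart == s.spanStart && spanEnd == s.spanEnd
                && nonterminal.equals(s.nonterminal)
                && leftLmState.equals(s.leftLmState)
                && rightLmState.equals(s.rightLmState);
    }

    @Override public int hashCode() {
        return Objects.hash(spanStart, spanEnd, nonterminal, leftLmState, rightLmState);
    }
}
```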

  23. Hypergraph
  [Figure: the hypergraph for 垫子0 上1 的2 猫3 shown earlier, with the goal item S, items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat, and hyperedges for the rules X (垫子 上, the mat), X (猫, a cat), and the four X (X0 的 X1, ...) variants, yielding "a cat on the mat".]
  • A hypergraph consists of a set of nodes and hyperedges; in parsing, they correspond to items and deductive steps, respectively
  • Roughly, a hyperedge can be thought of as a rule with pointers
  • State of an item: source span, left-side nonterminal symbol, and left/right LM state
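A minimal illustration of the node/hyperedge view ("a hyperedge is a rule with pointers"); the names are made up.

```java
import java.util.List;

// A hypergraph: nodes correspond to items, hyperedges to deductive steps. Each
// hyperedge is essentially a grammar rule plus pointers to the antecedent nodes
// that instantiated its nonterminals.
class ToyHypergraph {
    static class Node {
        final String signature;          // e.g. "X | 0, 2 | the mat | NA"
        final List<HyperEdge> incoming;  // alternative ways of deriving this item
        Node(String signature, List<HyperEdge> incoming) {
            this.signature = signature;
            this.incoming = incoming;
        }
    }

    static class HyperEdge {
        final String rule;               // e.g. "X -> <X0 的 X1, X1 of X0>"
        final List<Node> antecedents;    // tail nodes the rule's nonterminals point to
        HyperEdge(String rule, List<Node> antecedents) {
            this.rule = rule;
            this.antecedents = antecedents;
        }
    }
}
```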
