Generation in the Context of MT

Presentation Transcript

  1. Generation in the Context of MT Final Report

  2. The Team • Senior members & affiliate members: • Jan Hajič, Charles Univ., Prague • Drago Radev, Univ. of Michigan • Gerald Penn, Univ. of Toronto • Jason Eisner, Johns Hopkins Univ. • Owen Rambow, Univ. of Pennsylvania • Dan Gildea, Univ. of Pennsylvania • Bonnie Dorr, Univ. of Maryland • Students: • Yuan Ding, Univ. of Pennsylvania • Martin Čmejrek, Charles Univ., Prague • Terry Koo, MIT • Kristen Parton, Stanford Univ. • Jan Cuřín, Charles University • Ivona Kučerová, Charles University • Pre-workshop work (Charles University): • Zdeněk Žabokrtský • Petr Pajas • Václav Honetschläger • Alena Böhmová • Vladislav Kuboň • Jiří Havelka

  3. The Goal • Generate English (linear surface form) • from syntactic-semantic sentence representation (so-called “tectogrammatical”, or TR) • Possible application setting: • machine translation • other uses: • Front-end for QA systems, summarization • Evaluate under various circumstances

  4. Tectogrammatical Representation • Example sentence: “According to his opinion UAL’s executives were misinformed about the financing of the original transaction”

  5. Tectogrammatical Representation • The same sentence, lemmatized: “According to he opinion UAL’s executive were misinform about the financing of the original transaction”

  6. TR in Machine Translation • Czech counterpart of the example sentence: “Vedení UAL bylo podle jeho názoru o financování původní transakce nesprávně informováno.” • NULL

  7. The MT Framework • Source language text (CZECH) → morphology/tagging → lemmatized, POS → “surface” syntax → deep syntax (tectogrammatics, TR): TR trees → transfer → TR trees • WS’02: deep syntax to surface syntax → word order → punctuation → lemmatized, POS → morphology (gen.) → Target language text (ENGLISH)

  8. The MT Framework • [Diagram: TR trees and AR trees, CZECH and ENGLISH]

  9. Tools and Data Resources • Tools: • WS98 Czech parser + other Czech tools (tagger) • GIZA (WS99) + ISI decoder • Data: • PTB (40k sentences) • PTB translation to Czech (11k sentences) • Prague Dependency Treebank 1.0 (90k sentences) • Prague Dependency Treebank 2.0 preliminary • 15k sentences manually annotated • Monolingual data

  10. The Evaluation Metric: BLEU • Plain English output (MT, Generation): • difficult and/or expensive to evaluate subjectively • BLEU (IBM): • automatic method, score 0..1 • relative scores correlate with subjective human evaluation • needs several reference “gold standards” • n-gram-based metric with a penalty for too-short output • Different “local” evaluations throughout, too
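
The n-gram precision and length penalty mentioned on this slide can be sketched as follows. This is a simplified, sentence-level illustration, not the workshop's evaluation script: the function names `ngrams` and `bleu` are made up here, tokenization is plain whitespace splitting, and no smoothing is applied (any n-gram order with zero matches gives a score of 0).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU against one or more references."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram by its highest count in any one reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate's.
    closest = min((len(r) for r in refs), key=lambda ln: (abs(ln - len(cand)), ln))
    bp = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to a reference scores 1; a candidate that is a correct but shorter prefix of the reference is penalized only through the brevity term.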

  11. Presentation Outline • The Systems and Their Inputs • Getting the data & tools ready • The Statistical Generation System • The channel model • Word order, Punctuation, Morphology • The Hybrid Approach • Evaluation Results • Student Project Proposals • Conclusions and Future Directions

  12. Where are we? • Pipeline: CZECH → Deep syntax (Czech) → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH

  13. The Systems and Their Inputs • Martin Čmejrek

  14. WS02GMT • System 1: statistical • System 2: hybrid • Output: English linear surface form • Input 1: automatically created English TR • Input 2: manually created English TR • Input 3: improved automatic English TR (PropBank) • Input 4: Czenglish TR (simple translation)

  15. Input 1: Automatic English TR • Penn Treebank v. 3 • + heads (Jason Eisner’s code + modifications) • + lemmatization • + word IDs • + rule-based transformation to English AR, TR (by Kučerová & Žabokrtský) • → English TR (I1), size: 40k sentences

  16. Input 2: Manual English TR • Penn Treebank v. 3 → Input 1 • + manual annotation (correction) (IK), including: deep word order, conversion of grammatical codes • → English TR (I2), size: 1.5k sentences

  17. Input 3: Enhanced Automatic English TR • Penn Treebank v. 3 → Input 1 • + PropBank • + additional sources • → English TR (I3), size: 40k sentences

  18. Input 4: Automatic Czenglish TR • Linear surface Czech • + Czech tagging & lemmatization • + parsed to Czech AR, Czech TR • + [simple] transfer (lemma translation): lexical replacement dictionary collected from web, MRDs, + trained on TR lemmas by GIZA • → “Czenglish” TR (I4), size: 11k sentences

  19. Dictionary Filtering • Input data sources: 4 Czech/English dictionary sources (WinGED, GNU/FDL, PCTrans, EuroWordNet); frequencies on English monolingual corpus (North American News Text, 365 M words); Czech/English parallel Penn TreeBank corpus • Tools: merging, pruning (Czech POS, English POS); GIZA++ training • Output data: Czech/English dictionary for transfer

  20. Word-by-word translation of TR lemmas • Word-by-word dictionary: 42,835 entries, 65,408 translations • Format: <e>tečka<t>N <tr>spot<trt>N<prob>0.353598 <tr>dot<trt>N<prob>0.28792 <tr>full @stop<trt>N<prob>0.28729 • 1-1 and 1-2 translations (2-1 translations not yet implemented) • Packed forest representation for multiple translation choices • Simplified version: choose the first best
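
The entry format shown on this slide can be read with a few regular expressions. The sketch below is an illustration only: `parse_entry` and `first_best` are hypothetical names, the meaning of the tags (`<e>` source lemma, `<t>` source POS, `<tr>` translation, `<trt>` translation POS, `<prob>` probability) is inferred from the single example, and "first best" is taken to mean the translation with the highest probability.

```python
import re

# Tag meanings are inferred from the one example entry on the slide.
ENTRY_RE = re.compile(r"<e>(?P<lemma>.*?)<t>(?P<pos>\S+)")
TRANS_RE = re.compile(
    r"<tr>(?P<lemma>.*?)<trt>(?P<pos>[^<\s]+)<prob>(?P<prob>[0-9.]+)"
)


def parse_entry(line):
    """Return (source lemma, source POS, [(translation, POS, prob), ...])."""
    head = ENTRY_RE.match(line)
    if head is None:
        raise ValueError("not a dictionary entry: %r" % line)
    translations = [
        (m["lemma"], m["pos"], float(m["prob"])) for m in TRANS_RE.finditer(line)
    ]
    return head["lemma"], head["pos"], translations


def first_best(translations):
    """Simplified transfer: keep only the most probable translation."""
    return max(translations, key=lambda t: t[2])


ENTRY = ("<e>tečka<t>N <tr>spot<trt>N<prob>0.353598 "
         "<tr>dot<trt>N<prob>0.28792 <tr>full @stop<trt>N<prob>0.28729")
lemma, pos, translations = parse_entry(ENTRY)
best = first_best(translations)
```

Keeping the full translation list, rather than only `best`, is what a packed-forest representation would need; the first-best shortcut discards the alternatives.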

  21. Where are we? (w/additional info) • Pipeline: CZECH → Deep syntax (Czech) → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH

  22. Automatically Annotating a Tectogrammatical Corpus • Owen Rambow

  23. Goal • Use PropBank annotations to • Improve automatic construction of English TRs • Allow generation from “generic” pred-arg structures

  24. Types of Corpus Annotation • Surface Syntax • Deep Syntax • Local Lexical Semantics • Global Lexical Semantics • Hybrid: Deep Syntactic/Global Semantic = Tectogrammatical level used here

  25. Surface Syntax (e.g., Penn Treebank) • “John loads hay into trucks”: loads [subj: John, obj: hay, prepobj: into [comp: trucks]] • “Hay is loaded into trucks by John”: loaded [subj: hay, is, prepobj: into [comp: trucks], prepobj: by [comp: John]]

  26. Deep Syntax (e.g., TAG) • “John loads hay into trucks” and “Hay is loaded into trucks by John” share one structure: load [subj: John, obj: hay, obj2: truck]

  27. Local Semantics: Penn PropBank (brand new) • “John loads hay into trucks” and “John loads trucks with hay” share one structure: load [arg0: John, arg1: hay, arg2: truck]

  28. Global Semantics: LCS (U. Md.) • “John loads hay into trucks”: load [agent: John, theme: hay, goal: truck] • “John throws hay into trucks”: throw [agent: John, theme: hay, goal: truck]

  29. Tectogrammatical Representation • First two syntactic arguments of verb: deep-syntactic • All other arguments: global semantic • “John loads trucks with hay”: load [act: John, pat: truck, acmp: hay] • “John loads hay into trucks”: load [act: John, pat: hay, dir3: truck] • “John throws hay into trucks”: throw [act: John, pat: hay, dir3: truck]
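
The contrast between the annotation levels on slides 26 to 29 can be made concrete with a small data sketch. The assignment of arc labels to dependents below is a reading of the flattened slide diagrams and should be treated as an assumption; `ANALYSES` and `same_analysis` are illustrative names, not part of any workshop tool.

```python
# Each analysis is (head lemma, {arc label: dependent lemma}).
# Label-to-dependent pairings are inferred from the slide diagrams.
ANALYSES = {
    "deep_syntax": {
        "John loads hay into trucks":
            ("load", {"subj": "John", "obj": "hay", "obj2": "truck"}),
        "Hay is loaded into trucks by John":
            ("load", {"subj": "John", "obj": "hay", "obj2": "truck"}),
    },
    "local_semantics": {
        "John loads hay into trucks":
            ("load", {"arg0": "John", "arg1": "hay", "arg2": "truck"}),
        "John loads trucks with hay":
            ("load", {"arg0": "John", "arg1": "hay", "arg2": "truck"}),
    },
    "tectogrammatical": {
        "John loads hay into trucks":
            ("load", {"act": "John", "pat": "hay", "dir3": "truck"}),
        "John loads trucks with hay":
            ("load", {"act": "John", "pat": "truck", "acmp": "hay"}),
        "John throws hay into trucks":
            ("throw", {"act": "John", "pat": "hay", "dir3": "truck"}),
    },
}


def same_analysis(level, sentence_a, sentence_b):
    """True if two sentences receive the identical analysis at a level."""
    return ANALYSES[level][sentence_a] == ANALYSES[level][sentence_b]


def arc_labels(level, sentence):
    """The set of arc labels used for a sentence at a level."""
    return set(ANALYSES[level][sentence][1])
```

Under this reading, deep syntax merges active and passive, PropBank-style local semantics also merges the "into"/"with" alternation, and TR keeps that alternation apart because its first two arguments are deep-syntactic, while "load ... into" and "throw ... into" share the same TR labels.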

  30. Why Use TR? Research Hypothesis: • Replacing function words by TR arc labels makes transfer easier • Choice of realization: target language-dependent • Deep-syntactic labels for first two arguments: realization more verb-specific • Global semantic labels on remaining arguments: realization just label-specific

  31. Available Resources for Input 3 • Surface syntax: PTB corpus (hand, checked) • Deep syntax: derived automatically from PTB (Chen01) • Local semantics: PropBank corpus and frame lexicon (hand, checked) • Global semantics: LCS lexicon (partially hand, partially checked) • TR: PTB subset corpus (hand), PropBank → TR dictionary (hand, not checked) (I. Kučerová)

  32. Experiment: Machine Learning of TR Labels Using Ripper • Ripper (Cohen 1996) = greedy symbolic rule learner, set- and bag-valued features • Features: • Surface, deep syntactic info • Local, global semantic info • Kučerová’s PropBank → TR dictionary (hand-crafted) • Input 1 (Automatic English TR)

  33. Results (TR Label Error Rates)

      Syntax \ Semantics    none    local   local-global   all     PB → TR dict
      none                  58.8%   25.9%   23.7%          22.6%   37.7%
      Input 1               19.5%   17.7%   16.3%          15.9%   17.1%
      surface-deep          16.5%   16.4%   17.1%          16.7%   16.2%
      surface-deep-Inp1     15.5%   15.9%   16.2%          16.1%   14.4%

      Averaged over 5-fold cross-validation (1326 data points)

  34. Conclusions • Machine learning can improve on hand-written conversion rules (= Input 1) • PropBank is useful • Best results: • All syntactic features + PropBank → TR dictionary • Future work: use PropBank → LCS dictionary (developed during workshop)

  35. Where are we? • Pipeline: CZECH → Deep syntax (Czech) → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH

  36. The MAGENTA System • Statistically based • The pipeline: • TR to AR by a channel model • Word order by reordering on dep. trees • Punctuation insertion • Morphology

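  37. Where are we? • Pipeline: CZECH → Deep syntax (Czech) → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH

  38. The Tree-to-Tree Transductions • Jason Eisner • [Figure: a tree with nodes a, b, c, d, e, f aligned to a tree with nodes A, B, C+D, E, F, plus unaligned prep and det nodes]

  39. Translating trees • [Figure: the aligned tree pair a, b, c, d, e, f / A, B, C+D, E, F] • “inform wrongly” vs. “misinform”: learn this 2:1 mapping (or in dictionary) • Also 1:2, 2:0, etc., & rearrangements ... • prep, det: 0:1 mapping

  40. Translating trees • [Figure: the aligned tree pair a, b, c, d, e, f / A, B, C+D, E, F with prep and det nodes]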

  41. Statistical: Need a model of tree pairs • Mainly interested in (TR,AR) pairs • But our techniques are quite general • E.g., example below is not a (TR,AR) pair • [Figure: tree for “the girl kissed her kitty cat” (Pred,kissed; Subj,girl; Obj,cat; Det,the; Det,her; kitty) paired with tree for “the girl gave a kiss to her cat” (S,gave; NP,girl; NP,kiss; PP,to; NP,cat; Det,the; Det,a; Det,her)]

  42. Training: Our team has many tree pairs • Should be nicer to model than string pairs - why we built them! • What Czech trees went with what English trees in training? ... • Learn parameters θ of a joint model P(T1,T2). • [Figure: the tree pair for “the girl kissed her kitty cat” and “the girl gave a kiss to her cat”]

  43. Decoding: Complete a tree pair • Training: given T1 and T2, find θ to maximize P(T1,T2) • Decoding: given T1 and θ, find T2 to maximize P(T1,T2) • Horrible sparse data problem - can’t just do tree lookup. • [Figure: the tree for “the girl kissed her kitty cat” paired with an unknown tree “??”]

  44. How should a model of tree pairs look? • Joint model P(T1,T2). • Wise to use noisy-channel form: P(T1 | T2) * P(T2) • P(T1 | T2): train on paired trees; could also take advantage of English-Czech dictionaries • P(T2): could be trained on zillions of individual English AR trees • But any joint model will do.
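
The noisy-channel factorization on this slide can be illustrated with a toy decoder. Everything concrete below is an assumption made for the example: trees are flattened to strings, the candidate set is given up front, and the probabilities in `CHANNEL` and `PRIOR` are invented numbers, not estimates from the workshop data.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
CHANNEL = {  # P(T1 | T2): how likely the observed tree T1 is, given T2
    ("kissed(girl, cat)", "gave(girl, kiss, to(cat))"): 0.4,
    ("kissed(girl, cat)", "kissed(girl, cat)"): 0.5,
}
PRIOR = {  # P(T2): how plausible T2 is as a tree on its own
    "gave(girl, kiss, to(cat))": 0.02,
    "kissed(girl, cat)": 0.01,
}


def score(t1, t2):
    """log P(T1 | T2) + log P(T2); -inf if either factor is unseen."""
    p_channel = CHANNEL.get((t1, t2), 0.0)
    p_prior = PRIOR.get(t2, 0.0)
    if p_channel == 0.0 or p_prior == 0.0:
        return float("-inf")
    return math.log(p_channel) + math.log(p_prior)


def decode(t1, candidates):
    """Pick the T2 that maximizes P(T1 | T2) * P(T2), i.e. P(T1, T2)."""
    return max(candidates, key=lambda t2: score(t1, t2))


best = decode("kissed(girl, cat)",
              ["kissed(girl, cat)", "gave(girl, kiss, to(cat))"])
```

With these made-up numbers the prior outweighs the slightly lower channel probability, so `best` is the "gave a kiss" tree. A real decoder cannot enumerate candidates like this, which is the sparse-data point made on the previous slide.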

  45. How should a model P(T1,T2) of tree pairs look? • Intuition: some kind of correspondence between words. • Try to learn correspondence using EM alignment (could seed with a dictionary). • [Figure: word alignment between the trees for “the girl kissed her kitty cat” and “the girl gave a kiss to her cat”]

  46. How should a model P(T1,T2) of tree pairs look? • Intuition: some kind of correspondence between words. • Try to learn correspondence using EM alignment (could seed with a dictionary). • [Figure: the same tree pair for “the girl kissed her kitty cat” and “the girl gave a kiss to her cat”, with a different, bad alignment!]

  47. How should a model P(T1,T2) of tree pairs look? • Intuition: some kind of correspondence between words. • Try to learn correspondence using EM alignment (could seed with a dictionary). • Correspondences between “the girl kissed her kitty cat” and “the girl gave a kiss to her cat”: kiss ↔ gave a kiss; kitty cat ↔ cat; ∅ ↔ to • So model must consider alignment: P(T1,T2,A) • Why A is complicated: • The correspondence isn’t 1 to 1 • Also need to model word order (indeed topology)

  48. Solution: Use the right grammar formalism • Grammars can assemble words or phrases into trees. • Let’s work up to the “right” formalism. • Model must consider alignment: P(T1,T2,A) • Why A is complicated: • The correspondence isn’t 1 to 1 • Also need to model word order (indeed topology) • Correspondences between “the girl kissed her kitty cat” and “the girl gave a kiss to her cat”: kiss ↔ gave a kiss; cat ↔ kitty cat; ∅ ↔ to

  49. Context-Free Grammar • Example: “the girl kissed her cat” • Rules: S → NP VP; NP → Det N; VP → V NP; Det → the; N → girl; etc.

  50. Augment CFG nonterminals with headwords • Example: “the girl kissed her cat” • Rules: S,kissed → NP,girl VP,kissed; NP,girl → Det,the N,girl; VP,kissed → V,kissed NP,cat; Det,the → the; N,girl → girl; etc.
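
Augmenting nonterminals with headwords amounts to passing each node's head word up from a designated head child. The sketch below assumes a tuple encoding of the parse tree and a minimal head table (`HEAD_CHILD`: S takes its head from VP, NP from N, VP from V); both are illustrative choices, and real head-finding rules are considerably richer.

```python
# Assumed head rules: which child supplies the headword of each phrase.
HEAD_CHILD = {"S": "VP", "NP": "N", "VP": "V"}

# A node is (label, word) for a preterminal or (label, [children]) otherwise.
TREE = ("S", [
    ("NP", [("Det", "the"), ("N", "girl")]),
    ("VP", [("V", "kissed"),
            ("NP", [("Det", "her"), ("N", "cat")])]),
])


def lexicalize(node):
    """Return (node relabeled as 'Label,headword', headword of the node)."""
    label, rest = node
    if isinstance(rest, str):  # preterminal: the word is its own head
        return (label + "," + rest, rest), rest
    results = [lexicalize(child) for child in rest]
    # The headword comes from the child whose label is the head category.
    head = next(h for (_, h), child in zip(results, rest)
                if child[0] == HEAD_CHILD[label])
    return (label + "," + head, [annotated for annotated, _ in results]), head


annotated, root_head = lexicalize(TREE)
```

For the example sentence this yields the labels shown on the slide, such as S,kissed, NP,girl, VP,kissed and NP,cat.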