
Learning Bilingual Lexicons from Monolingual Corpora

Learning Bilingual Lexicons from Monolingual Corpora. Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein. Computer Science Division, University of California, Berkeley.


Presentation Transcript


  1. Learning Bilingual Lexicons from Monolingual Corpora. Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein. Computer Science Division, University of California, Berkeley.

  2. Standard MT Approach • Need (lots of) parallel sentences • May not always be available • Need (lots of) sentences [Figure labels: Source Text, Target Text]

  3. MT from Monotext • This talk: translation without parallel text? • Koehn and Knight (2002) & Fung (1995) • Need (lots of) sentences [Figure labels: Source Text, Target Text]

  4. Task: Lexicon Induction • Source words s: estado, nombre, política, mundo • Target words t: state, world, name, nation • Matching m between source and target words [Figure labels: Source Text, Target Text]

  5. Data Representation • What are we generating? • Word: state • Orthographic features: #st 1.0, tat 1.0, te# 1.0 • Context features: world 20.0, politics 5.0, society 10.0 [Figure label: Source Text]

  6. Data Representation • What are we generating? • state: orthographic features #st 1.0, tat 1.0, te# 1.0; context features world 20.0, politics 5.0, society 10.0 • estado: orthographic features #es 1.0, sta 1.0, do# 1.0; context features mundo 17.0, politica 10.0, sociedad 6.0 [Figure labels: Source Text, Target Text]
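The two feature types on the data representation slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes orthographic features are counts of character trigrams over the word padded with `#` (which matches the `#st`, `tat`, `te#` entries shown for "state"), and that context features are counts of words within a small window. The function names and the `window` parameter are hypothetical.

```python
from collections import Counter

def orthographic_features(word, n=3):
    """Count character n-grams of `word`, padded with '#' boundary markers."""
    padded = f"#{word}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def context_features(target, sentences, window=2):
    """Count words occurring within `window` tokens of each `target` token."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Every token in the window except the target occurrence itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

# "#state#" yields the trigrams #st, sta, tat, ate, te#.
ortho = orthographic_features("state")
# Toy one-sentence corpus (made up for illustration).
ctx = context_features("state", [["the", "state", "of", "the", "world"]])
```

Each word is then represented by the concatenation of its two count vectors, computed from monolingual text only.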

  7. Canonical Correlation Analysis [Figure: Source Space and Target Space, each labeled PCA]

  8. Canonical Correlation Analysis [Figure: points 1, 2, 3 in the Source Space and in the Target Space, each labeled PCA]

  9. Canonical Correlation Analysis [Figure: points 1, 2, 3 in the Source Space and in the Target Space, each labeled CCA]

  10. Canonical Correlation Analysis [Figure: points 1, 2, 3 from the Source Space and Target Space shown in a Canonical Space]

  11. Canonical Correlation Analysis [Figure: point 2 from the Source Space and Target Space shown in the Canonical Space]
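For readers who want to see what CCA computes, here is a small NumPy sketch using the textbook whitening-plus-SVD formulation. It is an illustration under stated assumptions, not the implementation used in the talk: rows of `X` and `Y` are assumed to be feature vectors of already-paired source and target words, the ridge term `reg` is added only for numerical stability, and the function name `cca` is hypothetical.

```python
import numpy as np

def cca(X, Y, d, reg=1e-6):
    """Return projections (A, B) and the top-d canonical correlations.

    X: (n, p) source features, Y: (n, q) target features; row i of X is
    assumed to be paired with row i of Y.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse square root of a symmetric positive-definite matrix.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, corrs, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :d], Wy @ Vt.T[:, :d], corrs[:d]

# Synthetic data: both views are noisy linear images of a shared 2-D signal.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
Y = Z @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(200, 4))
A, B, corrs = cca(X, Y, d=2)
```

Projecting with `X @ A` and `Y @ B` places both views in a shared canonical space, where paired rows should land close together.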

  12. Generative Model • Source words s • Target words t • Matching m

  13. Generative Model [Figure: estado in the Source Space and state in the Target Space, linked through the Canonical Space]

  14. Generative Model • Source words s: estado, nombre, politica, mundo • Target words t: state, world, name, nation • Matching m

  15. Learning: EM? • E-Step: Obtain posterior over matching • M-Step: Maximize CCA parameters

  16. Learning: EM? • Getting expectations over matchings is #P-hard! • See John DeNero’s paper “The Complexity of Phrase Alignment Problems” [Figure: example matching weights]

  17. Inference: Hard EM • Hard E-Step: Find bipartite matching • M-Step: Solve CCA
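The hard E-step replaces the intractable expectation with a single best matching. The sketch below finds an exact maximum-weight one-to-one matching by brute force, which is only feasible for toy sizes; a practical system would use a polynomial-time assignment algorithm instead. The score matrix is made up for illustration, the function name is hypothetical, and the sketch assumes every source word must be matched, whereas the model in the talk also allows unmatched words.

```python
from itertools import permutations

def best_matching(scores):
    """Exact maximum-weight one-to-one matching by brute force.

    scores[i][j] is the score for pairing source word i with target word j.
    Assumes len(scores) <= len(scores[0]); exponential time, toy sizes only.
    """
    n_src, n_tgt = len(scores), len(scores[0])
    best, best_total = None, float("-inf")
    for perm in permutations(range(n_tgt), n_src):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = list(enumerate(perm)), total
    return best, best_total

source = ["estado", "nombre", "mundo"]
target = ["state", "world", "name"]
# Hypothetical similarity scores (rows: source words, columns: target words).
scores = [
    [0.9, 0.1, 0.2],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.6],
]
matching, total = best_matching(scores)
pairs = [(source[i], target[j]) for i, j in matching]
```

In a full hard-EM loop, the matched pairs would then be fed to the M-step to refit CCA, and the refitted projections would produce new scores for the next matching.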

  18. Experimental Setup • Nouns only (for now) • Seed lexicon – 100 translation pairs • Induce lexicon between top 2k source and target word-types • Evaluation: Precision and Recall against lexicon obtained from Wiktionary • Report p0.33, precision at recall 0.33
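The precision-at-recall metric on this slide can be sketched as follows. This is one plausible reading, stated as an assumption: proposed pairs are ranked by confidence, recall is measured against the size of the gold lexicon, and precision is reported at the first rank where recall reaches the target. The function name and the toy data are hypothetical.

```python
def precision_at_recall(ranked_pairs, gold, target_recall=0.33):
    """Precision at the first rank where recall reaches `target_recall`.

    ranked_pairs: proposed (source, target) pairs, most confident first.
    gold: non-empty set of correct (source, target) pairs.
    Returns None if the proposals never reach the target recall.
    """
    if not gold:
        raise ValueError("gold lexicon must be non-empty")
    correct = 0
    for k, pair in enumerate(ranked_pairs, start=1):
        if pair in gold:
            correct += 1
        if correct / len(gold) >= target_recall:
            return correct / k
    return None

# Toy gold lexicon and ranked proposals (made up for illustration).
gold = {("estado", "state"), ("mundo", "world"), ("nombre", "name")}
ranked = [("estado", "state"), ("nombre", "world"), ("mundo", "world")]
p33 = precision_at_recall(ranked, gold)
p60 = precision_at_recall(ranked, gold, target_recall=0.6)
```

Reporting precision at a fixed recall makes systems comparable even when they propose different numbers of translation pairs.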

  19. Feature Experiments • Baseline: Edit Distance [Chart: Precision; 4k EN-ES Wikipedia Articles]

  20. Feature Experiments • MCCA: Only orthographic features [Chart: Precision; 4k EN-ES Wikipedia Articles]

  21. Feature Experiments • MCCA: Only context features [Chart: Precision; 4k EN-ES Wikipedia Articles]

  22. Feature Experiments • MCCA: Orthographic and context features [Chart: Precision; 4k EN-ES Wikipedia Articles]

  23. Feature Experiments [Chart: Precision vs. Recall]

  24. Feature Experiments [Chart: Precision vs. Recall]

  25. Corpus Variation • Identical Corpora [Chart: Precision, value shown: 93.8; 100k EN-ES Europarl Sentences]

  26. Corpus Variation • Comparable Corpora [Chart: Precision; 4k EN-ES Wikipedia Articles]

  27. Corpus Variation • Unrelated Corpora (?) [Chart: Precision, values shown: 92, 89, 68; 100k English and Spanish Gigaword]

  28. Seed Lexicon Source • Automatic Seed • Use edit distance to induce seed lexicon as in Koehn & Knight (2002) [Chart: Precision, value shown: 92; 4k EN-ES Wikipedia Articles]
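As a rough illustration of an edit-distance seed, the sketch below pairs each source word with its closest target word under Levenshtein distance and keeps the pair only when the length-normalized distance is small. The threshold `max_norm_dist`, the normalization, and the function names are assumptions made for this example; the exact procedure of Koehn & Knight (2002) is not given on the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def seed_lexicon(source_words, target_words, max_norm_dist=0.4):
    """Keep (source, closest target) pairs whose normalized distance is small."""
    pairs = []
    for s in source_words:
        t = min(target_words, key=lambda w: edit_distance(s, w))
        if edit_distance(s, t) / max(len(s), len(t)) <= max_norm_dist:
            pairs.append((s, t))
    return pairs

# "politica" is one substitution away from "politics"; "mundo" has no close
# English neighbor in this toy vocabulary and is left out of the seed.
seed = seed_lexicon(["politica", "mundo"], ["politics", "world"])
```

A seed built this way only captures cognate-like pairs, which is why it serves as a starting point rather than a full lexicon.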

  29. Analysis

  30. Analysis Top Non-Cognates

  31. Analysis Interesting Mistakes

  32. Language Variation

  33. Language Variation

  34. Analysis Orthography Features Context Features

  35. Summary • Learned bilingual lexicon from monotext • Matching + CCA model • Possible even from unaligned corpora • Possible for non-related languages • High-precision, but much left to do!

  36. Thank you! http://nlp.cs.berkeley.edu

  37. Error Analysis • Top 100 errors • 21 were correct translations not in gold • 30 were semantically related • 15 were orthographically related (coast, costas) • 30 were seemingly random

  38. BLEU Experiment • English-French, with only 1k parallel sentences • Without lexicon, BLEU: 13.61 • With lexicon, BLEU: 15.22

  39. More Numbers

  40. Conclusion • Three cases of unsupervised learning in NLP • Unsupervised systems can be competitive with supervised systems • Future problems • Document summarization • Building MindNet-like resources • Discourse Analysis

  41. Generative Model: Generate Matched Words • Orthographic features: #st 1.0, tat 1.0, te# 1.0 • Context features: world 20.0, politics 5.0, society 10.0 [Figure labels: Latent Space, Source Space, Target Space; words: estado, state]

  42. Generative Model: Generate Matched Words • Orthographic features: #st 1.0, tat 1.0, te# 1.0 • Context features: world 20.0, politics 5.0, society 10.0 [Figure labels: Latent Space, Source Space, Target Space; words: state, state, estado]

  43. Translation Lexicon Induction • Source words s: estado, nombre, mundo • Target words t: state, world, name • Matching m [Figure labels: Source Text, Target Text]

  44. Generative Model • For each matched word pair: • For each unmatched source word: • For each unmatched target word:

  45. Results: Accuracy

  46. Corpus Variation • Disjoint Sentences

  47. Corpus Variation • Unrelated ?

  48. Machine Translation Source Text Target Text
