
Joint Models with Missing Data for Semi-Supervised Learning


Presentation Transcript


  1. Joint Models with Missing Data for Semi-Supervised Learning. Jason Eisner, NAACL Workshop Keynote – June 2009

  2. Outline • Why use joint models? • Making big joint models tractable: Approximate inference and training by loopy belief propagation • Open questions: Semi-supervised training of joint models

  3. The standard story. Task: a p(y|x) model mapping input x to output y. Semi-sup. learning: Train on many (x,?) and a few (x,y)

  4. Some running examples. Task: a p(y|x) model. Semi-sup. learning: Train on many (x,?) and a few (x,y). E.g., in low-resource languages: x = sentence, y = parse (with David A. Smith); x = lemma, y = morph. paradigm (with Markus Dreyer)

  5. Semi-supervised learning. Semi-sup. learning: Train on many (x,?) and a few (x,y). Why would knowing p(x) help you learn p(y|x)?? • Shared parameters via joint model • e.g., noisy channel: p(x,y) = p(y) * p(x|y) • Estimate p(x,y) to have the appropriate marginal p(x) • This affects the conditional distribution p(y|x)
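
A minimal sketch (mine, not the talk's) of what those bullets amount to as a training objective, assuming a toy noisy-channel model over small discrete alphabets: labeled pairs contribute log p(x,y), unlabeled inputs contribute log ∑y p(x,y), and both terms share the same parameters, so fitting the marginal p(x) also moves p(y|x). All names and numbers below are illustrative.

```python
import numpy as np

# Toy noisy-channel model p(x,y) = p(y) * p(x|y) over small discrete alphabets.
Y, X = 3, 4                                        # number of label / input values
rng = np.random.default_rng(0)
p_y = rng.dirichlet(np.ones(Y))                    # p(y)
p_x_given_y = rng.dirichlet(np.ones(X), size=Y)    # p(x|y), one row per y

def log_joint(x, y):
    """log p(x,y) = log p(y) + log p(x|y)."""
    return np.log(p_y[y]) + np.log(p_x_given_y[y, x])

def log_marginal(x):
    """log p(x) = log sum_y p(x,y) -- what the unlabeled data constrains."""
    return np.logaddexp.reduce([log_joint(x, y) for y in range(Y)])

def semi_supervised_log_likelihood(labeled, unlabeled):
    """Labeled pairs use the joint; unlabeled inputs use only the marginal."""
    return (sum(log_joint(x, y) for x, y in labeled) +
            sum(log_marginal(x) for x in unlabeled))

# A few labeled (x,y) pairs plus many unlabeled x's share the same parameters,
# so fitting the marginal p(x) also affects the conditional p(y|x).
print(semi_supervised_log_likelihood(labeled=[(0, 1), (2, 0)],
                                     unlabeled=[0, 1, 1, 3, 2]))
```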

  6. [Figure: a sample drawn from p(x)]

  7. E.g., if p(x,y) = ∑c p(x,y,c) = ∑c p(c) p(y|c) p(x|c) (a joint model with few params!), then for any x we can now recover the cluster c that probably generated it. A few supervised examples may let us predict y from c. [Figure: the sample of p(x) again]
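
To make that mixture concrete, here is a small sketch (not from the slides) of the "recover cluster c, then predict y from c" step, given already-estimated tables p(c), p(y|c), p(x|c). In practice p(c) and p(x|c) would be fit to the unlabeled sample of p(x) (e.g. by EM), and the few labeled examples would pin down p(y|c); the table entries below are arbitrary.

```python
import numpy as np

# Arbitrary toy tables for p(c), p(y|c), p(x|c): 2 clusters, 2 labels, 3 input values.
p_c = np.array([0.5, 0.5])                      # p(c)
p_y_given_c = np.array([[0.9, 0.1],             # p(y | c=0)
                        [0.2, 0.8]])            # p(y | c=1)
p_x_given_c = np.array([[0.7, 0.2, 0.1],        # p(x | c=0)
                        [0.1, 0.2, 0.7]])       # p(x | c=1)

def posterior_over_clusters(x):
    """p(c|x) ∝ p(c) p(x|c): recover the cluster that probably generated x."""
    unnorm = p_c * p_x_given_c[:, x]
    return unnorm / unnorm.sum()

def predict(x):
    """p(y|x) = sum_c p(c|x) p(y|c): the label prediction flows through c."""
    return posterior_over_clusters(x) @ p_y_given_c

print(posterior_over_clusters(2))   # x=2 was probably generated by cluster 1
print(predict(2))                   # ... so y=1 is the likely label
```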

  8. Semi-supervised learning. Semi-sup. learning: Train on many (x,?) and a few (x,y). Why would knowing p(x) help you learn p(y|x)?? • Shared parameters via joint model • e.g., noisy channel: p(x,y) = p(y) * p(x|y) • Estimate p(x,y) to have the appropriate marginal p(x) • This affects the conditional distribution p(y|x) • The picture is misleading: No need to assume a distance metric (as in TSVM, label propagation, etc.) • But we do need to choose a model family for p(x,y)

  9. NLP + ML = ??? Task: a p(y|x) model. x is a structured input (may be only partly observed, so infer x, too); y is a structured output (so we already need joint inference for decoding, e.g., dynamic programming). The model depends on features of <x,y> (sparse features?) or features of <x,z,y> where z are latent (so infer z, too); see the sketch below.
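
One way to read this slide, sketched under my own assumptions rather than the talk's notation: a log-linear model whose score uses sparse features of <x,z,y>, with the latent z summed out by brute force. The feature names, latent values, and weights are invented for illustration.

```python
import math
from collections import Counter

# score(x,z,y) = theta . f(x,z,y);  p(y|x) is proportional to sum_z exp(score).
theta = {("word=bank", "z=FIN", "tag=N"): 1.5,    # invented sparse weights
         ("word=bank", "z=RIV", "tag=N"): 0.5,
         ("z=FIN", "tag=V"): -1.0}

def features(x, z, y):
    """Sparse indicator features of <x, z, y>."""
    return Counter([(f"word={x}", f"z={z}", f"tag={y}"),
                    (f"z={z}", f"tag={y}")])

def conditional(x, tags=("N", "V"), latents=("FIN", "RIV")):
    """p(y|x) with the latent z marginalized out (brute force over tiny sets)."""
    scores = {y: sum(math.exp(sum(theta.get(feat, 0.0) * count
                                  for feat, count in features(x, z, y).items()))
                     for z in latents)
              for y in tags}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

print(conditional("bank"))    # the latent "sense" z is summed out, not observed
```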

  10. Each task in a vacuum? [Figure: four independent tasks, Task1 … Task4, each mapping its own input (x1 … x4) to its own output (y1 … y4)]

  11. Solved tasks help later ones? (e.g., a pipeline) [Figure: pipeline from input x through Task1, Task2, Task3 producing intermediate variables z1, z2, z3, then Task4 producing output y]

  12. Feedback? What if Task3 isn't solved yet and we have little <z2,z3> training data? [Figure: the same pipeline over x, z1, z2, z3, y]

  13. Feedback? What if Task3 isn't solved yet and we have little <z2,z3> training data? Impute <z2,z3> given x1 and y4! [Figure: the same pipeline]

  14. A later step benefits from many earlier ones? [Figure: the same pipeline over x, z1, z2, z3, y]

  15. A later step benefits from many earlier ones? And conversely? [Figure: the same pipeline]

  16. We end up with a Markov Random Field (MRF). [Figure: factor graph over x, z1, z2, z3, y with factors Φ1, Φ2, Φ3, Φ4]

  17. Variable-centric, not task-centric: p(x,z1,z2,z3,y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
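
A brute-force sketch of that factorization (my illustration, with binary variables and made-up factor tables): it normalizes the product of factors and imputes one hidden variable given the observed ones. A model of realistic size cannot be enumerated this way, which is why the talk turns to belief propagation.

```python
from itertools import product

# p(x,z1,z2,z3,y) ∝ Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
# Binary variables; factor values are arbitrary nonnegative numbers.
phi1 = lambda x, z1: [[2.0, 1.0], [0.5, 3.0]][x][z1]
phi2 = lambda z1, z2: [[1.0, 2.0], [2.0, 1.0]][z1][z2]
phi3 = lambda x, z1, z2, z3: 1.5 if z3 == (x ^ z1 ^ z2) else 0.5
phi4 = lambda z3, y: [[1.0, 0.2], [0.3, 2.0]][z3][y]
phi5 = lambda y: [1.0, 1.5][y]

def score(x, z1, z2, z3, y):
    """Unnormalized probability: product of all factor values."""
    return phi1(x, z1) * phi2(z1, z2) * phi3(x, z1, z2, z3) * phi4(z3, y) * phi5(y)

# Partition function Z: sum over all assignments (feasible only for tiny models).
Z = sum(score(*assign) for assign in product([0, 1], repeat=5))

def posterior_z2(x, y):
    """Impute z2 given observed x and y by summing out the other hidden variables."""
    unnorm = [sum(score(x, z1, z2, z3, y) for z1 in (0, 1) for z3 in (0, 1))
              for z2 in (0, 1)]
    return [u / sum(unnorm) for u in unnorm]

print("Z =", Z)
print("p(z2 | x=1, y=0) =", posterior_z2(1, 0))
```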

  18. Familiar MRF example. First, a familiar example: a Conditional Random Field (CRF) for POS tagging. Observed input sentence (shaded): "find preferred tags". A possible tagging (i.e., an assignment to the remaining variables): v v v

  19. Familiar MRF example: a CRF for POS tagging. Another possible tagging (i.e., assignment to the remaining variables): v a n

  20. Familiar MRF example: CRF. A "binary" factor measures the compatibility of 2 adjacent tags. The model reuses the same parameters at this position.

  21. Familiar MRF example: CRF. A "unary" factor evaluates this tag. Its values depend on the corresponding word (e.g., "can't be adj").

  22. Familiar MRF example: CRF. A "unary" factor evaluates this tag. Its values depend on the corresponding word (and could be made to depend on the entire observed sentence).

  23. Familiar MRF example: CRF. A "unary" factor evaluates this tag. There is a different unary factor at each position.

  24. Familiar MRF example: CRF. p(v a n) is proportional to the product of all factors' values on the tagging v a n.

  25. Familiar MRF example: CRF. p(v a n) is proportional to the product of all factors' values on v a n = … 1 * 3 * 0.3 * 0.1 * 0.2 … NOTE: This is not just a pipeline of single-tag prediction tasks (which might work OK in the well-trained supervised case …).
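
A worked version of "product of all factors' values" (my sketch; the unary and binary tables below are invented, so they do not reproduce the slide's 1 * 3 * 0.3 * 0.1 * 0.2, but the computation has the same shape): score a tagging by multiplying one unary factor per word and one binary factor per adjacent tag pair, then normalize by summing over all taggings.

```python
from itertools import product

tags = ("v", "a", "n")
words = ("find", "preferred", "tags")

# Invented unary factors: how well each tag fits each word.
unary = {"find":      {"v": 3.0, "a": 0.1, "n": 1.0},
         "preferred": {"v": 0.5, "a": 2.0, "n": 0.5},
         "tags":      {"v": 0.2, "a": 0.3, "n": 3.0}}

# Invented binary factor on adjacent tags (the same table is reused at each position).
binary = {("v", "a"): 2.0, ("a", "n"): 3.0, ("v", "n"): 1.0,
          ("v", "v"): 0.5, ("a", "a"): 0.4, ("n", "n"): 0.6,
          ("a", "v"): 0.2, ("n", "v"): 1.5, ("n", "a"): 0.3}

def unnormalized(tagging):
    """Product of all unary and binary factor values on this tagging."""
    score = 1.0
    for word, tag in zip(words, tagging):
        score *= unary[word][tag]
    for left, right in zip(tagging, tagging[1:]):
        score *= binary[(left, right)]
    return score

# p(tagging | sentence) = unnormalized(tagging) / Z, with Z summed over all taggings.
Z = sum(unnormalized(t) for t in product(tags, repeat=len(words)))
print(unnormalized(("v", "a", "n")) / Z)
```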

  26. Task-centered view of the world. [Figure: the pipeline of Task1 … Task4 over x, z1, z2, z3, y]

  27. Variable-centered view of the world: p(x,z1,z2,z3,y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)

  28. Variable-centric, not task-centric. Throw in any variables that might help! Model and exploit correlations.

  29. [Figure: many kinds of variables we might model and correlate: lexicon (word types), semantics, sentences, discourse context, resources, inflection, cognates, transliteration, abbreviation, neologism, language evolution, entailment, correlation, tokens, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation]

  30. Back to our (simpler!) running examples: sentence → parse (with David A. Smith); lemma → morph. paradigm (with Markus Dreyer)

  31. Parser projection (with David A. Smith). [Figure: variables sentence, parse, translation, and parse of translation; little direct training data for sentence → parse, much more training data on the translation side]

  32. Parser projection. "Auf diese Frage habe ich leider keine Antwort bekommen" / "I did not unfortunately receive an answer to this question"

  33. Parser projection. [Figure: as before, plus a word-to-word alignment variable between the sentence and its translation]

  34. Parser projection. [Figure: word-to-word alignment between "Auf diese Frage habe ich leider keine Antwort bekommen" and "I did not unfortunately receive an answer to this question", including a NULL token]

  35. Parser projection. [Figure: sentence → parse has little direct training data and needs an interesting model; the parse of the translation has much more training data; a word-to-word alignment connects the two sides]

  36. Parses are not entirely isomorphic. [Figure: aligned parses of the German/English sentence pair (with NULL), showing divergence types: monotonic, head-swapping, siblings, null]

  37. Dependency Relations + “none of the above”

  38. Parser projection. Typical test data (no translation observed). [Figure: the factor graph over sentence, parse, word-to-word alignment, translation, and parse of translation]

  39. Parser projection. Small supervised training set (treebank). [Figure: the same factor graph]

  40. Parser projection. Moderate treebank in the other language. [Figure: the same factor graph]

  41. Parser projection. Maybe a few gold alignments. [Figure: the same factor graph]

  42. Parser projection. Lots of raw bitext. [Figure: the same factor graph]

  43. Parser projection. Given bitext, … [Figure: the same factor graph]

  44. Parser projection. Given bitext, try to impute the other variables. [Figure: the same factor graph]

  45. Parser projection. Given bitext, try to impute the other variables: Now we have more constraints on the parse … [Figure: the same factor graph]

  46. Parser projection. Given bitext, try to impute the other variables: Now we have more constraints on the parse … which should help us train the parser. We'll see how belief propagation naturally handles this. [Figure: the same factor graph]
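
As a preview of that, here is a sketch under my own simplifications: sum-product message passing on a 3-variable chain, where it is exact. On a loopy graph like this parser-projection model, the same message updates are simply iterated until they (hopefully) converge, giving the approximate marginals used to impute the hidden variables. The factor tables are random placeholders.

```python
import numpy as np

# Sum-product on a 3-variable chain T1 -- T2 -- T3 with unary and pairwise factors.
K = 3                                   # number of values each variable can take
rng = np.random.default_rng(1)
unary = rng.random((3, K)) + 0.1        # unary[i, t]: factor on variable i, value t
pair = rng.random((K, K)) + 0.1         # pair[s, t]: factor on adjacent values (reused)

# Forward messages: alpha[i] summarizes everything to the left of variable i.
alpha = [unary[0]]
for i in range(1, 3):
    alpha.append(unary[i] * (alpha[-1] @ pair))

# Backward messages: beta[i] summarizes everything to the right of variable i.
beta = [np.ones(K)] * 3
for i in range(1, -1, -1):
    beta[i] = pair @ (unary[i + 1] * beta[i + 1])

# Belief at each variable = product of incoming messages, then normalize.
for i in range(3):
    belief = alpha[i] * beta[i]
    print(f"marginal of T{i + 1}:", belief / belief.sum())
```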

  47. English does help us impute the Chinese parse. Seeing the noisy output of an English WSJ parser fixes these Chinese links. [Figure: Chinese sentence 中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购, glossed word by word as "China / in / infrastructure / construction / area / , / has begun / to utilize / international / financial / organizations / 's / loans / to implement / international / competitive / bidding / procurement", with POS tags and dependency links; translation: "In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement." The corresponding bad links found without seeing the English parse include errors where complement verbs swap objects and the subject attaches to an intervening noun.]

  48. … which does help us train a monolingual Chinese parser.

  49. (Could add a 3rd language …) [Figure: the factor graph extended with translation′, parse of translation′, and additional alignment variables]

  50. (Could add world knowledge …) [Figure: the factor graph, now also connected to a "world" variable]
