
Joint Models with Missing Data for Semi-Supervised Learning


Presentation Transcript


  1. Joint Models with Missing Data for Semi-Supervised Learning. Jason Eisner, NAACL Workshop Keynote – June 2009

  2. Outline • Why use joint models? • Making big joint models tractable: Approximate inference and training by loopy belief propagation • Open questions: Semi-supervised training of joint models

  3. The standard story. Task: a p(y|x) model mapping input x to output y. Semi-sup. learning: Train on many (x,?) and a few (x,y)

  4. Some running examples. Task: a p(y|x) model. Semi-sup. learning: Train on many (x,?) and a few (x,y). E.g., in low-resource languages: x = sentence, y = parse (with David A. Smith); x = lemma, y = morph. paradigm (with Markus Dreyer)

  5. Semi-supervised learning. Semi-sup. learning: Train on many (x,?) and a few (x,y). Why would knowing p(x) help you learn p(y|x)?? • Shared parameters via joint model • e.g., noisy channel: p(x,y) = p(y) * p(x|y) • Estimate p(x,y) to have the appropriate marginal p(x) • This affects the conditional distribution p(y|x)
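
A minimal sketch (mine, not the talk's) of what those bullets amount to as a training objective, assuming a toy noisy-channel model over small discrete alphabets: labeled pairs contribute log p(x,y), unlabeled inputs contribute log ∑y p(x,y), and both terms share the same parameters, so fitting the marginal p(x) also moves p(y|x). All names and numbers below are illustrative.

```python
import numpy as np

# Toy noisy-channel model p(x,y) = p(y) * p(x|y) over small discrete alphabets.
Y, X = 3, 4                                        # number of label / input values
rng = np.random.default_rng(0)
p_y = rng.dirichlet(np.ones(Y))                    # p(y)
p_x_given_y = rng.dirichlet(np.ones(X), size=Y)    # p(x|y), one row per y

def log_joint(x, y):
    """log p(x,y) = log p(y) + log p(x|y)."""
    return np.log(p_y[y]) + np.log(p_x_given_y[y, x])

def log_marginal(x):
    """log p(x) = log sum_y p(x,y) -- what the unlabeled data constrains."""
    return np.logaddexp.reduce([log_joint(x, y) for y in range(Y)])

def semi_supervised_log_likelihood(labeled, unlabeled):
    """Labeled pairs use the joint; unlabeled inputs use only the marginal."""
    return (sum(log_joint(x, y) for x, y in labeled) +
            sum(log_marginal(x) for x in unlabeled))

# A few labeled (x,y) pairs plus many unlabeled x's share the same parameters,
# so fitting the marginal p(x) also affects the conditional p(y|x).
print(semi_supervised_log_likelihood(labeled=[(0, 1), (2, 0)],
                                     unlabeled=[0, 1, 1, 3, 2]))
```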

  6. [Figure: a sample drawn from p(x)]

  7. E.g., if p(x,y) = ∑c p(x,y,c) = ∑c p(c) p(y|c) p(x|c) (a joint model with few params!), then for any x we can now recover the cluster c that probably generated it. A few supervised examples may let us predict y from c. [Figure: the sample of p(x) again]
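
To make that mixture concrete, here is a small sketch (not from the slides) of the "recover cluster c, then predict y from c" step, given already-estimated tables p(c), p(y|c), p(x|c). In practice p(c) and p(x|c) would be fit to the unlabeled sample of p(x) (e.g. by EM), and the few labeled examples would pin down p(y|c); the table entries below are arbitrary.

```python
import numpy as np

# Arbitrary toy tables for p(c), p(y|c), p(x|c): 2 clusters, 2 labels, 3 input values.
p_c = np.array([0.5, 0.5])                      # p(c)
p_y_given_c = np.array([[0.9, 0.1],             # p(y | c=0)
                        [0.2, 0.8]])            # p(y | c=1)
p_x_given_c = np.array([[0.7, 0.2, 0.1],        # p(x | c=0)
                        [0.1, 0.2, 0.7]])       # p(x | c=1)

def posterior_over_clusters(x):
    """p(c|x) ∝ p(c) p(x|c): recover the cluster that probably generated x."""
    unnorm = p_c * p_x_given_c[:, x]
    return unnorm / unnorm.sum()

def predict(x):
    """p(y|x) = sum_c p(c|x) p(y|c): the label prediction flows through c."""
    return posterior_over_clusters(x) @ p_y_given_c

print(posterior_over_clusters(2))   # x=2 was probably generated by cluster 1
print(predict(2))                   # ... so y=1 is the likely label
```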

  8. Semi-supervised learning. Semi-sup. learning: Train on many (x,?) and a few (x,y). Why would knowing p(x) help you learn p(y|x)?? • Shared parameters via joint model • e.g., noisy channel: p(x,y) = p(y) * p(x|y) • Estimate p(x,y) to have the appropriate marginal p(x) • This affects the conditional distribution p(y|x) • The picture is misleading: No need to assume a distance metric (as in TSVM, label propagation, etc.) • But we do need to choose a model family for p(x,y)

  9. NLP + ML = ??? Task: a p(y|x) model. x is a structured input (may be only partly observed, so infer x, too); y is a structured output (so we already need joint inference for decoding, e.g., dynamic programming). The model depends on features of <x,y> (sparse features?) or features of <x,z,y> where z are latent (so infer z, too); see the sketch below.
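
One way to read this slide, sketched under my own assumptions rather than the talk's notation: a log-linear model whose score uses sparse features of <x,z,y>, with the latent z summed out by brute force. The feature names, latent values, and weights are invented for illustration.

```python
import math
from collections import Counter

# score(x,z,y) = theta . f(x,z,y);  p(y|x) is proportional to sum_z exp(score).
theta = {("word=bank", "z=FIN", "tag=N"): 1.5,    # invented sparse weights
         ("word=bank", "z=RIV", "tag=N"): 0.5,
         ("z=FIN", "tag=V"): -1.0}

def features(x, z, y):
    """Sparse indicator features of <x, z, y>."""
    return Counter([(f"word={x}", f"z={z}", f"tag={y}"),
                    (f"z={z}", f"tag={y}")])

def conditional(x, tags=("N", "V"), latents=("FIN", "RIV")):
    """p(y|x) with the latent z marginalized out (brute force over tiny sets)."""
    scores = {y: sum(math.exp(sum(theta.get(feat, 0.0) * count
                                  for feat, count in features(x, z, y).items()))
                     for z in latents)
              for y in tags}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

print(conditional("bank"))    # the latent "sense" z is summed out, not observed
```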

  10. Each task in a vacuum? [Figure: four independent tasks, Task1 … Task4, each mapping its own input (x1 … x4) to its own output (y1 … y4)]

  11. Solved tasks help later ones? (e.g., a pipeline) [Figure: pipeline from input x through Task1, Task2, Task3 producing intermediate variables z1, z2, z3, then Task4 producing output y]

  12. Feedback? What if Task3 isn't solved yet and we have little <z2,z3> training data? [Figure: the same pipeline over x, z1, z2, z3, y]

  13. Feedback? What if Task3 isn't solved yet and we have little <z2,z3> training data? Impute <z2,z3> given x1 and y4! [Figure: the same pipeline]

  14. A later step benefits from many earlier ones? [Figure: the same pipeline over x, z1, z2, z3, y]

  15. A later step benefits from many earlier ones? And conversely? [Figure: the same pipeline]

  16. We end up with a Markov Random Field (MRF). [Figure: factor graph over x, z1, z2, z3, y with factors Φ1, Φ2, Φ3, Φ4]

  17. Variable-centric, not task-centric: p(x,z1,z2,z3,y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
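
A brute-force sketch of that factorization (my illustration, with binary variables and made-up factor tables): it normalizes the product of factors and imputes one hidden variable given the observed ones. A model of realistic size cannot be enumerated this way, which is why the talk turns to belief propagation.

```python
from itertools import product

# p(x,z1,z2,z3,y) ∝ Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
# Binary variables; factor values are arbitrary nonnegative numbers.
phi1 = lambda x, z1: [[2.0, 1.0], [0.5, 3.0]][x][z1]
phi2 = lambda z1, z2: [[1.0, 2.0], [2.0, 1.0]][z1][z2]
phi3 = lambda x, z1, z2, z3: 1.5 if z3 == (x ^ z1 ^ z2) else 0.5
phi4 = lambda z3, y: [[1.0, 0.2], [0.3, 2.0]][z3][y]
phi5 = lambda y: [1.0, 1.5][y]

def score(x, z1, z2, z3, y):
    """Unnormalized probability: product of all factor values."""
    return phi1(x, z1) * phi2(z1, z2) * phi3(x, z1, z2, z3) * phi4(z3, y) * phi5(y)

# Partition function Z: sum over all assignments (feasible only for tiny models).
Z = sum(score(*assign) for assign in product([0, 1], repeat=5))

def posterior_z2(x, y):
    """Impute z2 given observed x and y by summing out the other hidden variables."""
    unnorm = [sum(score(x, z1, z2, z3, y) for z1 in (0, 1) for z3 in (0, 1))
              for z2 in (0, 1)]
    return [u / sum(unnorm) for u in unnorm]

print("Z =", Z)
print("p(z2 | x=1, y=0) =", posterior_z2(1, 0))
```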

  18. Familiar MRF example. First, a familiar example: a Conditional Random Field (CRF) for POS tagging. Observed input sentence (shaded): "find preferred tags". A possible tagging (i.e., an assignment to the remaining variables): v v v

  19. Familiar MRF example: a CRF for POS tagging. Another possible tagging (i.e., assignment to the remaining variables): v a n

  20. Familiar MRF example: CRF. A "binary" factor measures the compatibility of 2 adjacent tags. The model reuses the same parameters at this position.

  21. Familiar MRF example: CRF. A "unary" factor evaluates this tag. Its values depend on the corresponding word (e.g., "can't be adj").

  22. Familiar MRF example: CRF. A "unary" factor evaluates this tag. Its values depend on the corresponding word (and could be made to depend on the entire observed sentence).

  23. Familiar MRF example: CRF. A "unary" factor evaluates this tag. There is a different unary factor at each position.

  24. Familiar MRF example: CRF. p(v a n) is proportional to the product of all factors' values on the tagging v a n.

  25. Familiar MRF example: CRF. p(v a n) is proportional to the product of all factors' values on v a n = … 1 * 3 * 0.3 * 0.1 * 0.2 … NOTE: This is not just a pipeline of single-tag prediction tasks (which might work OK in the well-trained supervised case …).
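
A worked version of "product of all factors' values" (my sketch; the unary and binary tables below are invented, so they do not reproduce the slide's 1 * 3 * 0.3 * 0.1 * 0.2, but the computation has the same shape): score a tagging by multiplying one unary factor per word and one binary factor per adjacent tag pair, then normalize by summing over all taggings.

```python
from itertools import product

tags = ("v", "a", "n")
words = ("find", "preferred", "tags")

# Invented unary factors: how well each tag fits each word.
unary = {"find":      {"v": 3.0, "a": 0.1, "n": 1.0},
         "preferred": {"v": 0.5, "a": 2.0, "n": 0.5},
         "tags":      {"v": 0.2, "a": 0.3, "n": 3.0}}

# Invented binary factor on adjacent tags (the same table is reused at each position).
binary = {("v", "a"): 2.0, ("a", "n"): 3.0, ("v", "n"): 1.0,
          ("v", "v"): 0.5, ("a", "a"): 0.4, ("n", "n"): 0.6,
          ("a", "v"): 0.2, ("n", "v"): 1.5, ("n", "a"): 0.3}

def unnormalized(tagging):
    """Product of all unary and binary factor values on this tagging."""
    score = 1.0
    for word, tag in zip(words, tagging):
        score *= unary[word][tag]
    for left, right in zip(tagging, tagging[1:]):
        score *= binary[(left, right)]
    return score

# p(tagging | sentence) = unnormalized(tagging) / Z, with Z summed over all taggings.
Z = sum(unnormalized(t) for t in product(tags, repeat=len(words)))
print(unnormalized(("v", "a", "n")) / Z)
```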

  26. Task-centered view of the world. [Figure: the pipeline of Task1 … Task4 over x, z1, z2, z3, y]

  27. Variable-centered view of the world: p(x,z1,z2,z3,y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)

  28. Variable-centric, not task-centric. Throw in any variables that might help! Model and exploit correlations.

  29. [Figure: many kinds of variables we might model and correlate: lexicon (word types), semantics, sentences, discourse context, resources, inflection, cognates, transliteration, abbreviation, neologism, language evolution, entailment, correlation, tokens, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation]

  30. Back to our (simpler!) running examples: sentence → parse (with David A. Smith); lemma → morph. paradigm (with Markus Dreyer)

  31. Parser projection (with David A. Smith). [Figure: variables sentence, parse, translation, and parse of translation; little direct training data for sentence → parse, much more training data on the translation side]

  32. Parser projection. "Auf diese Frage habe ich leider keine Antwort bekommen" / "I did not unfortunately receive an answer to this question"

  33. Parser projection. [Figure: as before, plus a word-to-word alignment variable between the sentence and its translation]

  34. Parser projection. [Figure: word-to-word alignment between "Auf diese Frage habe ich leider keine Antwort bekommen" and "I did not unfortunately receive an answer to this question", including a NULL token]

  35. Parser projection. [Figure: sentence → parse has little direct training data and needs an interesting model; the parse of the translation has much more training data; a word-to-word alignment connects the two sides]

  36. Parses are not entirely isomorphic. [Figure: aligned parses of the German/English sentence pair (with NULL), showing divergence types: monotonic, head-swapping, siblings, null]

  37. Dependency Relations + “none of the above”

  38. Parser projection. Typical test data (no translation observed). [Figure: the factor graph over sentence, parse, word-to-word alignment, translation, and parse of translation]

  39. Parser projection. Small supervised training set (treebank). [Figure: the same factor graph]

  40. Parser projection. Moderate treebank in the other language. [Figure: the same factor graph]

  41. Parser projection. Maybe a few gold alignments. [Figure: the same factor graph]

  42. Parser projection. Lots of raw bitext. [Figure: the same factor graph]

  43. Parser projection. Given bitext, … [Figure: the same factor graph]

  44. Parser projection. Given bitext, try to impute the other variables. [Figure: the same factor graph]

  45. Parser projection. Given bitext, try to impute the other variables: Now we have more constraints on the parse … [Figure: the same factor graph]

  46. Parser projection. Given bitext, try to impute the other variables: Now we have more constraints on the parse … which should help us train the parser. We'll see how belief propagation naturally handles this. [Figure: the same factor graph]
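
As a preview of that, here is a sketch under my own simplifications: sum-product message passing on a 3-variable chain, where it is exact. On a loopy graph like this parser-projection model, the same message updates are simply iterated until they (hopefully) converge, giving the approximate marginals used to impute the hidden variables. The factor tables are random placeholders.

```python
import numpy as np

# Sum-product on a 3-variable chain T1 -- T2 -- T3 with unary and pairwise factors.
K = 3                                   # number of values each variable can take
rng = np.random.default_rng(1)
unary = rng.random((3, K)) + 0.1        # unary[i, t]: factor on variable i, value t
pair = rng.random((K, K)) + 0.1         # pair[s, t]: factor on adjacent values (reused)

# Forward messages: alpha[i] summarizes everything to the left of variable i.
alpha = [unary[0]]
for i in range(1, 3):
    alpha.append(unary[i] * (alpha[-1] @ pair))

# Backward messages: beta[i] summarizes everything to the right of variable i.
beta = [np.ones(K)] * 3
for i in range(1, -1, -1):
    beta[i] = pair @ (unary[i + 1] * beta[i + 1])

# Belief at each variable = product of incoming messages, then normalize.
for i in range(3):
    belief = alpha[i] * beta[i]
    print(f"marginal of T{i + 1}:", belief / belief.sum())
```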

  47. English does help us impute the Chinese parse. Seeing the noisy output of an English WSJ parser fixes these Chinese links. [Figure: Chinese sentence 中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购, glossed word by word as "China / in / infrastructure / construction / area / , / has begun / to utilize / international / financial / organizations / 's / loans / to implement / international / competitive / bidding / procurement", with POS tags and dependency links; translation: "In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement." The corresponding bad links found without seeing the English parse include errors where complement verbs swap objects and the subject attaches to an intervening noun.]

  48. … which does help us train a monolingual Chinese parser.

  49. (Could add a 3rd language …) [Figure: the factor graph extended with translation′, parse of translation′, and additional alignment variables]

  50. (Could add world knowledge …) [Figure: the factor graph, now also connected to a "world" variable]
