Unified Expectation Maximization

Unified Expectation Maximization • Rajhans Samdani • Joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth • University of Illinois at Urbana-Champaign • NAACL 2012, Montreal

Weakly Supervised Learning in NLP • Labeled data is scarce and difficult to obtain • A lot of work on learning with a small amount of labeled data • Expectation Maximization (EM) algorithm is the de facto standard • More recently: significant work on injecting weak supervision or domain knowledge via constraints into EM • Constraint-driven Learning (CoDL; Chang et al, 07) • Posterior regularization (PR; Ganchev et al, 10)

Weakly Supervised Learning: EM and …? • Several variants of EM exist in the literature, e.g. hard EM • Variants of constrained EM: CoDL and PR • Which version to use: EM (PR) vs hard EM (CoDL)? • Or is there something better out there? • OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM) • Includes existing EM algorithms • Picks the most suitable EM algorithm in a simple, adaptive, and principled way • Adapts to data, initialization, and constraints

Outline • Background: Expectation Maximization (EM) • EM with constraints • Unified Expectation Maximization (UEM) • Optimization Algorithm for the E-step • Experiments

Predicting Structures in NLP • Predict the output or dependent variable y from the space of allowed outputs $\mathcal{Y}$, given input variable x, using parameters or weight vector w • E.g. • predict POS tags given a sentence, • predict word alignments given sentences in two different languages, • predict the entity-relation structure from a document • Prediction expressed as $y^* = \arg\max_{y \in \mathcal{Y}} P(y \mid x; w)$

Learning Using EM: a Quick Primer • Given unlabeled data x, estimate w; hidden: y • for t = 1 … T do • E-step: estimate a posterior distribution, q, over y: $q_t(y) = P(y \mid x; w_t)$, the posterior (conditional) distribution of y given $w_t$; equivalently, $q_t(y) = \arg\min_q KL(q(y), P(y \mid x; w_t))$ (Neal and Hinton, 99) • M-step: estimate the parameters w w.r.t. q: $w_{t+1} = \arg\max_w E_q \log P(x, y; w)$
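
To make the E-step/M-step loop concrete, here is a minimal sketch on a toy problem that is not from the talk: two coins with unknown head probabilities, where the hidden variable y is which coin produced each batch of flips. The function name, the data, the starting values, and the uniform prior over coins are all illustrative assumptions.

```python
def em_two_coins(counts, theta_a=0.6, theta_b=0.5, iterations=20):
    """Soft EM for a two-coin mixture (toy example, assumed for illustration).

    counts: list of (heads, tails) pairs, one per batch of flips.
    The hidden variable y is the coin used for each batch; the prior over
    coins is assumed uniform and is not re-estimated.
    """
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for heads, tails in counts:
            # E-step: q(y) = P(y | x; w_t). The binomial coefficient is the
            # same for both coins and cancels in the normalization.
            like_a = theta_a ** heads * (1 - theta_a) ** tails
            like_b = theta_b ** heads * (1 - theta_b) ** tails
            q_a = like_a / (like_a + like_b)
            q_b = 1.0 - q_a
            # Accumulate expected counts under q.
            heads_a += q_a * heads
            tails_a += q_a * tails
            heads_b += q_b * heads
            tails_b += q_b * tails
        # M-step: argmax_w E_q log P(x, y; w) is the weighted maximum
        # likelihood estimate computed from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


toy_counts = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
theta_a, theta_b = em_two_coins(toy_counts)
print(theta_a, theta_b)
```

Replacing `q_a` with 1.0 or 0.0, depending on which likelihood is larger, would turn this loop into hard EM.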

Other Version of EM: Hard EM • Standard EM: • E-step: $\arg\min_q KL(q_t(y), P(y \mid x; w_t))$ • M-step: $\arg\max_w E_q \log P(x, y; w)$ • Hard EM: • E-step: $q(y) = \delta_{y = y^*}$, where $y^* = \arg\max_y P(y \mid x; w)$ • M-step: $\arg\max_w E_q \log P(x, y; w)$ • Not clear which version to use!

Constrained EM • Domain knowledge-based constraints can help a lot by guiding unsupervised learning • Constraint-driven Learning (Chang et al, 07), • Posterior Regularization (Ganchev et al, 10), • Generalized Expectation Criterion (Mann & McCallum, 08), • Learning from Measurements (Liang et al, 09) • Constraints are imposed on y (a structured object, $\{y_1, y_2, \dots, y_n\}$) to specify/restrict the set of allowed structures $\mathcal{Y}$

Entity-Relation Prediction: Type Constraints • Example sentence: Dole's wife, Elizabeth, is a resident of N.C. • [Figure: entities E1, E2, E3 and relations R12, R23 over the example sentence; labels shown: Per, Loc, lives-in] • Predict entity types: Per, Loc, Org, etc. • Predict relation types: lives-in, org-based-in, works-for, etc. • Entity-relation type constraints

Bilingual Word Alignment: Agreement Constraints • Align words from sentences in EN with sentences in FR • Agreement constraints: alignment from EN-FR should agree with the alignment from FR-EN (Ganchev et al, 10) Picture: courtesy Lacoste-Julien et al

Structured Prediction Constraints Representation • Assume a set of linear constraints: $\mathcal{Y} = \{y : Uy \le b\}$ • A universal representation (Roth and Yih, 07) • Can be relaxed into expectation constraints on posterior probabilities: $E_q[Uy] \le b$ • Focus on introducing constraints during the E-step
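
The difference between the hard constraint $Uy \le b$ and its relaxation $E_q[Uy] \le b$ can be seen on a tiny enumerated output space. The sketch below is illustrative only: the four binary structures, the single constraint $y_1 + y_2 \le 1$, the probabilities, and all function names are assumptions, and brute-force enumeration stands in for real structured inference.

```python
import numpy as np


def is_feasible(y, U, b):
    """Hard constraint: does structure y satisfy U y <= b?"""
    return bool(np.all(U @ y <= b))


def constrained_argmax(probs, Y, U, b):
    """Brute-force hard inference: index of the most probable feasible row of Y."""
    feasible = [i for i, y in enumerate(Y) if is_feasible(y, U, b)]
    return max(feasible, key=lambda i: probs[i])


def expectation_satisfied(q, Y, U, b, tol=1e-9):
    """Relaxed constraint: E_q[U y] <= b, with q a distribution over rows of Y."""
    expected_Uy = q @ (Y @ U.T)
    return bool(np.all(expected_Uy <= b + tol))


# Toy space (assumed): all structures y in {0, 1}^2, constraint y1 + y2 <= 1.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
U = np.array([[1, 1]])
b = np.array([1])
probs = np.array([0.1, 0.2, 0.3, 0.4])

print(is_feasible(Y[3], U, b))                         # [1, 1] violates the constraint
print(constrained_argmax(probs, Y, U, b))              # best structure among feasible ones
print(expectation_satisfied(probs, Y, U, b))           # E_q[Uy] = 1.3 exceeds b
print(expectation_satisfied(np.full(4, 0.25), Y, U, b))  # E_q[Uy] = 1.0 meets b
```

The uniform distribution in the last call still puts mass on the infeasible structure [1, 1], which shows that the expectation constraint is a relaxation of the hard one.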

Two Versions of Constrained EM • Posterior Regularization (Ganchev et al, 10): • E-step: $\arg\min_q KL(q_t(y), P(y \mid x; w_t))$ subject to $E_q[Uy] \le b$ • M-step: $\arg\max_w E_q \log P(x, y; w)$ • Constraint-driven Learning (Chang et al, 07): • E-step: $y^* = \arg\max_y P(y \mid x; w)$ subject to $Uy \le b$ • M-step: $\arg\max_w E_q \log P(x, y; w)$ • Not clear which version to use!

So how do we learn…? • EM (PR) vs hard EM (CoDL) • Unclear which version of EM to use (Spitkovsky et al, 10) • This is the starting point of our research • We present a family of EM algorithms, Unified Expectation Maximization (UEM), which includes these EM algorithms (and infinitely many new ones) • UEM lets us pick the best EM algorithm in a principled way

Outline • Notation and Expectation Maximization (EM) • Unified Expectation Maximization • Motivation • Formulation and mathematical intuition • Optimization Algorithm for the E-step • Experiments

Motivation: Unified Expectation Maximization (UEM) • [Figure labels: EM, Hard EM] • EM (PR) and hard EM (CoDL) differ mostly in the entropy of the posterior distribution • UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ

Unified EM (UEM) • EM (PR) minimizes the KL divergence $KL(q, P(y \mid x; w))$, where $KL(q, p) = \sum_y \left[ q(y) \log q(y) - q(y) \log p(y) \right]$ • UEM changes the E-step of standard EM and minimizes a modified KL divergence $KL(q, P(y \mid x; w); \gamma)$, where $KL(q, p; \gamma) = \sum_y \left[ \gamma\, q(y) \log q(y) - q(y) \log p(y) \right]$ • This changes the entropy of the posterior • Different γ values → different EM algorithms
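
Without constraints, minimizing the modified KL divergence over the probability simplex has a closed form for γ > 0: setting the derivative of the Lagrangian to zero gives $q(y) \propto p(y)^{1/\gamma}$, and the limit γ → 0 puts all mass on the argmax. The sketch below assumes an explicitly enumerated output space with strictly positive probabilities; the function name is hypothetical.

```python
import numpy as np


def uem_posterior(p, gamma):
    """Unconstrained UEM E-step over an enumerated output space (sketch).

    p: array of P(y | x; w), assumed strictly positive and summing to 1.
    gamma: entropy parameter; gamma = 1 recovers standard EM, gamma = 0
    recovers hard EM (a point mass on the most probable output).
    """
    p = np.asarray(p, dtype=float)
    if gamma < 0:
        raise ValueError("this sketch only covers gamma >= 0")
    if gamma == 0:
        q = np.zeros_like(p)
        q[np.argmax(p)] = 1.0
        return q
    # q(y) is proportional to p(y)^(1/gamma); work in log space for stability.
    logits = np.log(p) / gamma
    logits -= logits.max()
    q = np.exp(logits)
    return q / q.sum()


p = np.array([0.5, 0.3, 0.2])
for gamma in (1.0, 0.5, 0.1, 0.0):
    print(gamma, uem_posterior(p, gamma))
```

As γ decreases from 1 toward 0, the printed distributions become progressively more peaked, which is the entropy-tuning effect described on the slide.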

Effect of Changing γ • $KL(q, p; \gamma) = \sum_y \left[ \gamma\, q(y) \log q(y) - q(y) \log p(y) \right]$ • [Figure: the original distribution p and the resulting q for γ = ∞, γ = 1, γ = 0, and γ = -∞]

Unifying Existing EM Algorithms • $KL(q, p; \gamma) = \sum_y \left[ \gamma\, q(y) \log q(y) - q(y) \log p(y) \right]$ • Changing γ values results in different existing EM algorithms • [Figure: γ axis with ticks at -∞, 0, 1, ∞] • No constraints: Hard EM, EM (γ = 1) • With constraints: CoDL (γ = -∞), PR (γ = 1) • Deterministic Annealing (Smith and Eisner, 04; Hofmann, 99)

Range of γ • $KL(q, p; \gamma) = \sum_y \left[ \gamma\, q(y) \log q(y) - q(y) \log p(y) \right]$ • We focus on tuning γ in the range [0, 1] • Infinitely many new EM algorithms • No constraints: Hard EM (γ = 0), EM (γ = 1) • With constraints: LP approx to CoDL (new) (γ = 0), PR (γ = 1)

Tuning γ in practice • γ essentially tunes the entropy of the posterior to better adapt to data, initialization, constraints, etc. • We tune γ using a small amount of development data over the range 0, 0.1, 0.2, 0.3, …, 1 • UEM for arbitrary γ in our range is very easy to implement: existing EM/PR/hard EM/CoDL code can be easily extended to implement UEM
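
Tuning γ on development data amounts to a one-dimensional grid search. The sketch below assumes a hypothetical callable `train_and_score(gamma)` that runs UEM with the given γ and returns a development-set score where higher is better. The default grid of 0 to 1 in steps of 0.1 is read off the slide's axis ticks and is an assumption about the exact values used.

```python
def tune_gamma(train_and_score, gammas=None):
    """Pick the gamma with the best development score (illustrative sketch).

    train_and_score: hypothetical callable; trains with UEM at the given
        gamma and returns a development-set score (higher is better).
    gammas: candidate values; defaults to 0.0, 0.1, ..., 1.0.
    """
    if gammas is None:
        gammas = [i / 10 for i in range(11)]
    scores = {gamma: train_and_score(gamma) for gamma in gammas}
    best_gamma = max(scores, key=scores.get)
    return best_gamma, scores


# Stand-in scorer for demonstration only; a real one would train a model.
def fake_dev_score(gamma):
    return -(gamma - 0.7) ** 2


best_gamma, all_scores = tune_gamma(fake_dev_score)
print(best_gamma)
```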

Outline • Setting up the problem • Unified Expectation Maximization • Solving the constrained E-step • Lagrange dual-based algorithm • Unification of existing algorithms • Experiments

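The Constrained E-step • $\min_q KL(q, P(y \mid x; w); \gamma)$ (γ-parameterized KL divergence) • subject to $E_q[Uy] \le b$ (domain knowledge-based linear constraints) • and $q(y) \ge 0$, $\sum_y q(y) = 1$ (standard probability simplex constraints) • For γ ≥ 0 ⇒ convex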
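
The form of q for fixed dual variables follows from the Lagrangian of this problem. The derivation below is a reconstruction from the definitions on the preceding slides, not a copy of the slide's own equations; the multiplier names $\lambda$ and $\nu$ are assumptions.

```latex
\begin{aligned}
&\min_{q}\; \sum_y \big[\gamma\, q(y)\log q(y) - q(y)\log P(y \mid x; w)\big]
  \quad \text{s.t.}\quad E_q[Uy] \le b,\;\; \sum_y q(y) = 1,\;\; q(y) \ge 0 \\[4pt]
&L(q, \lambda, \nu) = \sum_y \big[\gamma\, q(y)\log q(y) - q(y)\log P(y \mid x; w)\big]
  + \lambda^{\top}\Big(\sum_y q(y)\, Uy - b\Big) + \nu\Big(\sum_y q(y) - 1\Big),
  \qquad \lambda \ge 0 \\[4pt]
&\frac{\partial L}{\partial q(y)} = \gamma\big(\log q(y) + 1\big) - \log P(y \mid x; w)
  + \lambda^{\top} Uy + \nu = 0 \\[4pt]
&\Rightarrow\; q(y) \;\propto\; P(y \mid x; w)^{1/\gamma}\,
  \exp\!\Big(-\tfrac{1}{\gamma}\,\lambda^{\top} Uy\Big) \qquad (\gamma > 0) \\[4pt]
&\gamma \to 0:\quad q \text{ concentrates on }\;
  y^{*} = \arg\max_y \big[\log P(y \mid x; w) - \lambda^{\top} Uy\big]
\end{aligned}
```

For γ ≥ 0 the term $\gamma\, q \log q$ is convex in q and the remaining terms are linear, which is why the problem is convex in that range.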

Solving the Constrained E-step for q(y) • Introduce dual variables λ for each constraint • Iterate until convergence: • Sub-gradient ascent on the dual variables with $\nabla\lambda \propto E_q[Uy] - b$ • Compute q for the given λ • For γ > 0, compute q for the given λ • With γ → 0, unconstrained MAP inference
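
The iteration above can be sketched for a small, explicitly enumerated output space. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers γ > 0 only, assumes that q for fixed λ has the form $q(y) \propto P(y \mid x; w)^{1/\gamma} \exp(-\lambda^{\top} Uy / \gamma)$ (obtained from the Lagrangian of the E-step), uses a fixed step size, and represents each output by a precomputed row `F[y] = U y`. All names are hypothetical.

```python
import numpy as np


def q_given_lambda(log_p, F, lam, gamma):
    """q(y) proportional to p(y)^(1/gamma) * exp(-lam . F[y] / gamma), gamma > 0."""
    logits = (log_p - F @ lam) / gamma
    logits -= logits.max()  # numerical stability
    q = np.exp(logits)
    return q / q.sum()


def constrained_e_step(p, F, b, gamma=1.0, step=0.5, iterations=500):
    """Dual projected sub-gradient ascent for the constrained E-step (sketch).

    p: P(y | x; w) over enumerated outputs, assumed strictly positive.
    F: array of shape (num_outputs, num_constraints) with F[y] = U y.
    b: right-hand side of the constraints E_q[U y] <= b.
    """
    log_p = np.log(np.asarray(p, dtype=float))
    lam = np.zeros(len(b))
    for _ in range(iterations):
        q = q_given_lambda(log_p, F, lam, gamma)
        gradient = q @ F - b  # E_q[U y] - b
        # Ascent step, then projection onto lam >= 0 (inequality constraints).
        lam = np.maximum(0.0, lam + step * gradient)
    return q_given_lambda(log_p, F, lam, gamma), lam


# Toy problem (assumed): cap the posterior mass of the first output at 0.5.
p = np.array([0.7, 0.2, 0.1])
F = np.array([[1.0], [0.0], [0.0]])
b = np.array([0.5])
q, lam = constrained_e_step(p, F, b)
print(q, lam)
```

In a real structured model the outputs cannot be enumerated, so computing q for a given λ would instead call the model's own inference routine (for example forward-backward in an HMM) with scores shifted by the dual variables.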

Some Properties of our E-step Optimization • We use a dual projected sub-gradient ascent algorithm (Bertsekas, 99) • Includes inequality constraints • For special instances where two (or more) “easy” problems are connected via constraints, reduces to dual decomposition • For γ > 0: convex dual decomposition over individual models (e.g. HMMs) connected via dual variables • γ = 1: dual decomposition in posterior regularization (Ganchev et al, 08) • For γ = 0: Lagrange relaxation/dual decomposition for hard ILP inference (Koo et al, 10; Rush et al, 11)

Outline • Setting up the problem • Introduction to Unified Expectation Maximization • Lagrange dual-based optimization Algorithm for the E-step • Experiments • POS tagging • Entity-Relation Extraction • Word Alignment

Experiments: exploring the role of γ • Test if tuning γ helps improve the performance over baselines • Study the relation between the quality of initialization and γ (or “hardness” of inference) • Compare against: • Posterior Regularization (PR), which corresponds to γ = 1.0 • Constraint-driven Learning (CoDL), which corresponds to γ = -∞

Unsupervised POS Tagging • Model as first order HMM • Try varying qualities of initialization: • Uniform initialization: initialize with equal probability for all states • Supervised initialization: initialize with parameters trained on varying amounts of labeled data • Test the “conventional wisdom” that hard EM does well with good initialization and EM does better with a weak initialization

Unsupervised POS tagging: Different EM instantiations • [Figure: performance relative to EM (y-axis) versus γ (x-axis), ranging from hard EM to EM; one curve per initialization: uniform initialization, and initialization with 5, 10, 20, and 40-80 examples]

Experiments: Entity-Relation Extraction • Example sentence: Dole's wife, Elizabeth, is a resident of N.C. • [Figure: entities E1, E2, E3 and relations R12, R23 over the example sentence] • Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities • Add constraints: • Type constraints between entity and relations • Expected count constraints to regularize the counts of ‘None’ relation • Semi-supervised learning with a small amount of labeled data

Result on Relations • UEM is statistically significantly better than PR • [Figure: macro-F1 scores (y-axis) versus % of labeled data (x-axis)]

Experiments: Word Alignment • Word alignment from a language S to a language T • We try En-Fr and En-Es pairs • We use an HMM-based model with agreement constraints for word alignment • PR with agreement constraints is known to give huge improvements over HMM (Ganchev et al, 08; Graca et al, 08) • Use our efficient algorithm to decompose the E-step into individual HMMs

Word Alignment: EN-FR with 10k Unlabeled Data • [Figure: alignment error rate]

Word Alignment: EN-FR • [Figure: alignment error rate]

Word Alignment: FR-EN • [Figure: alignment error rate]

Word Alignment: EN-ES • [Figure: alignment error rate]

Word Alignment: ES-EN • [Figure: alignment error rate]

Experiments Summary • In different settings, different baselines work better • Entity-Relation extraction: CoDL does better than PR • Word Alignment: PR does better than CoDL • Unsupervised POS tagging: depends on the initialization • UEM allows us to choose the best algorithm in all of these cases • Best version of EM: a new version with 0 < γ < 1

Unified EM: Summary • UEM generalizes existing variations of EM/constrained EM • UEM provides new EM algorithms parameterized by a single parameter γ • Efficient dual projected sub-gradient ascent technique to incorporate constraints into UEM • The best γ corresponds to neither EM (PR) nor hard EM (CoDL) and is found through the UEM framework • Tuning γ adaptively changes the entropy of the posterior • UEM is easy to implement: add a few lines of code to existing EM code • Questions?
