Advanced Statistical Methods in NLP Ling 572 March 6, 2012

Presentation Transcript


  1. EM Advanced Statistical Methods in NLP Ling 572 March 6, 2012 Slides based on F. Xia11

  2. Roadmap • Motivation: • Unsupervised learning • Maximum Likelihood Estimation • EM: • Basic concepts • Main ideas • Example: Forward-backward algorithm

  3. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model

  4. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States:

  5. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations:

  6. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities:

  7. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities:

  8. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get:

  9. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get: lots and lots of recorded audio • Hard to get:

  10. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get: lots and lots of recorded audio • Hard to get: Phonetic labeling of lots of recorded audio • Can we train our model without the ‘hard to get’ part?
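
A purely illustrative sketch of the model structure just described, in Python: three made-up phoneme states, a small set of discretized acoustic symbols, and invented transition/emission probabilities (none of this is an actual acoustic model).

```python
# Toy HMM for illustration only: phoneme states, discretized acoustic
# observations, and invented probabilities.
states = ["AH", "B", "T"]            # hidden states: phonemes
observations = ["o1", "o2", "o3"]    # discretized acoustic symbols

# Transition probabilities P(next phoneme | current phoneme); each row sums to 1.
transition = {
    "AH": {"AH": 0.6, "B": 0.2, "T": 0.2},
    "B":  {"AH": 0.3, "B": 0.4, "T": 0.3},
    "T":  {"AH": 0.3, "B": 0.3, "T": 0.4},
}

# Emission probabilities P(acoustic symbol | phoneme); each row sums to 1.
emission = {
    "AH": {"o1": 0.7, "o2": 0.2, "o3": 0.1},
    "B":  {"o1": 0.1, "o2": 0.8, "o3": 0.1},
    "T":  {"o1": 0.2, "o2": 0.2, "o3": 0.6},
}

# Supervised training would estimate these tables from phonetically labeled
# audio; EM (Forward-Backward) estimates them from the recorded audio alone.
```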

  11. Motivation • Task: Train a probabilistic context-free grammar • Model:

  12. Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get:

  13. Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get: lots and lots of text sentences • Hard to get:

  14. Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get: lots and lots of text sentences • Hard to get: parse trees on lots of text sentences • Can we train our model without the ‘hard to get’ part?
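
As in the HMM case, a toy illustration: production probabilities grouped by left-hand-side non-terminal, with each non-terminal's rewrite probabilities summing to 1. The grammar and the numbers are invented.

```python
# Toy PCFG for illustration only: P(rhs | lhs) for each production rule.
pcfg = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("DT", "NN"): 0.6, ("NN",): 0.4},
    "VP": {("VBD", "NP"): 0.7, ("VBD",): 0.3},
}

# For every non-terminal, the probabilities of its rewrites sum to 1.
for lhs, rules in pcfg.items():
    assert abs(sum(rules.values()) - 1.0) < 1e-9

# Supervised training would count rules in treebanked parses; EM
# (Inside-Outside) estimates them from raw sentences alone.
```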

  15. Approach • Unsupervised learning • EM approach: • Family of unsupervised parameter estimation techniques • General framework • Many specific algorithms implement it: • Forward-Backward, Inside-Outside, IBM MT models, etc.

  16. EM • Expectation-Maximization: • Two-step iterative procedure

  17. EM • Expectation-Maximization: • Two-step iterative procedure • General parameter estimation method: • Based on Maximum Likelihood Estimation

  18. EM • Expectation-Maximization: • Two-step iterative procedure • General parameter estimation method: • Based on Maximum Likelihood Estimation • General form provided by (Dempster, Laird, Rubin ’77) • Unified framework • Specific instantiations predate it

  19. Maximum Likelihood Estimation • MLE: • Given data: X = {X1,X2, …,Xn} • Parameters: Θ Based on F. Xia11

  20. Maximum Likelihood Estimation • MLE: • Given data: X = {X1,X2, …,Xn} • Parameters: Θ • Likelihood: P(X|Θ) • Log likelihood : L(Θ) = log P(X|Θ) Based on F. Xia11

  21. Maximum Likelihood Estimation • MLE: • Given data: X = {X1,X2, …,Xn} • Parameters: Θ • Likelihood: P(X|Θ) • Log likelihood : L(Θ) = log P(X|Θ) • Maximum likelihood: • ΘML = argmaxΘ log P(X|Θ) Based on F. Xia11

  22. MLE • Assume data X is independently identically distributed (i.i.d.): • Difficulty of computing max depends on form Based on F. Xia11
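
Spelled out, the i.i.d. assumption means the likelihood factors over the individual data points, so the log-likelihood becomes a sum of per-item terms:

```latex
P(X \mid \Theta) = \prod_{i=1}^{n} P(X_i \mid \Theta)
\qquad\Longrightarrow\qquad
L(\Theta) = \log P(X \mid \Theta) = \sum_{i=1}^{n} \log P(X_i \mid \Theta)
```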

  27. Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m are heads • Data X Based on F. Xia11

  28. Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m are heads • Data X: Coin flip sequence e.g. X={H,T,H} • Parameter(s) Θ Based on F. Xia11

  29. Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m are heads • Data X: Coin flip sequence e.g. X={H,T,H} • Parameter(s) Θ: p • What value of p maximizes probability of data? Based on F. Xia11

  30. Simple Example, Formally • L(Θ) = log P(X|Θ) Based on F. Xia11

  31. Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m * (1-p)^(N-m)] Based on F. Xia11

  32. Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m * (1-p)^(N-m)] • L(Θ) = log p^m + log (1-p)^(N-m) = Based on F. Xia11

  33. Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m * (1-p)^(N-m)] • L(Θ) = log p^m + log (1-p)^(N-m) = m log p + (N-m) log(1-p) Based on F. Xia11
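
Completing the derivation: L(Θ) is maximized where its derivative with respect to p is zero, which yields the familiar relative-frequency estimate.

```latex
\frac{\partial L}{\partial p} = \frac{m}{p} - \frac{N - m}{1 - p} = 0
\quad\Longrightarrow\quad
m(1 - p) = (N - m)\,p
\quad\Longrightarrow\quad
p_{ML} = \frac{m}{N}
```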

  37. EM • General setting: • Data X = {X1,X2,…,Xn} • Parameter vector θ Based on F. Xia11

  38. EM • General setting: • Data X = {X1,X2,…,Xn} • Parameter vector θ • EM provides method to compute: • θML = argmaxθ L(θ) = argmaxθ log P(X|θ) Based on F. Xia11

  39. EM • General setting: • Data X = {X1,X2,…,Xn} • Parameter vector θ • EM provides method to compute: • θML = argmaxθ L(θ) = argmaxθ log P(X|θ) • In many cases, computing P(X|θ) is hard • However, computing P(X,Y|θ) can be easier Based on F. Xia11
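
The two quantities are linked by marginalizing out Y: the hard observed-data likelihood is a sum of the easier complete-data likelihood over all settings of the hidden data.

```latex
P(X \mid \theta) = \sum_{Y} P(X, Y \mid \theta)
```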

  40. Terminology • Z = (X,Y) • Z is the ‘complete’/’augmented’ data • X is the ‘observed’/’incomplete’ data • Y is the ‘hidden’/’missing’ data

  41. Terminology • Z = (X,Y) • Z is the ‘complete’/’augmented’ data • X is the ‘observed’/’incomplete’ data • Y is the ‘hidden’/’missing’ data • Different articles mix these labels and terms

  42. Forms of EM Based on F. Xia11

  44. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set

  45. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set • Use current model to estimate hidden data Y

  46. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set • Use current model to estimate hidden data Y • Update the model parameters based on X,Y

  47. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set • Use current model to estimate hidden data Y • Update the model parameters based on X,Y • Iterate until convergence
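
A minimal sketch of this loop in Python. The two callbacks are placeholders that a concrete instantiation (e.g., Forward-Backward for an HMM) would supply; the function and parameter names here are invented for illustration.

```python
def em(observed_data, init_params, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM skeleton; e_step and m_step are problem-specific callbacks."""
    # Initial model: small random values or parameters trained on a small labeled set.
    params = init_params
    prev_ll = float("-inf")
    for _ in range(max_iters):
        # E-step: use the current model to estimate the hidden data Y
        # (expected counts / posterior over Y given X and current params).
        expected_counts, log_likelihood = e_step(observed_data, params)

        # M-step: update the model parameters based on X and the estimated Y.
        params = m_step(observed_data, expected_counts)

        # Iterate until the (non-decreasing) log-likelihood converges.
        if log_likelihood - prev_ll < tol:
            break
        prev_ll = log_likelihood
    return params
```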

  48. Key Features of EM • General framework for ‘hidden’ data problems • General iterative methodology • Must be specialized to particular problems: • Forward-Backward for HMMs • Inside-Outside for PCFGs • IBM models for MT

  49. Main Ideas in EM

  50. Maximum Likelihood • EM performs parameter estimation for maximum likelihood estimation: • ΘML = argmax L(Θ) • ΘML = argmax log P(X|Θ) • Introduces ‘hidden’ data Y to allow more tractable solution
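
In the standard formulation of (Dempster, Laird, Rubin ’77), the hidden data Y enters through the expected complete-data log-likelihood, which each iteration maximizes. A sketch of the two steps, with θ(t) denoting the current parameter estimate:

```latex
\text{E-step: } Q(\theta \mid \theta^{(t)}) = \sum_{Y} P(Y \mid X, \theta^{(t)}) \, \log P(X, Y \mid \theta)
\qquad
\text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})
```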
