
Lecture 9: Hidden Markov Models (HMMs) (Chapter 9 of Manning and Schutze)



  1. Lecture 9: Hidden Markov Models (HMMs)(Chapter 9 of Manning and Schutze) Dr. Mary P. Harper ECE, Purdue University harper@purdue.edu yara.ecn.purdue.edu/~harper EE669: Natural Language Processing

  2. Markov Assumptions • Often we want to consider a sequence of random variables that aren't independent; rather, the value of each variable depends on previous elements in the sequence. • Let X = (X1, .., XT) be a sequence of random variables taking values in some finite set S = {s1, …, sN}, the state space. If X possesses the following properties, then X is a Markov chain. • Limited Horizon: P(Xt+1 = sk | X1, .., Xt) = P(Xt+1 = sk | Xt), i.e., a word's state depends only on the previous state. • Time Invariant (Stationary): P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time. EE669: Natural Language Processing

  3. Definition of a Markov Chain • A is a stochastic N x N matrix of transition probabilities with: • aij = Pr{transition from si to sj} = Pr(Xt+1 = sj | Xt = si) • ∀i,j: aij ≥ 0, and ∀i: Σj=1,N aij = 1 • π is a vector of N elements representing the initial state probability distribution with: • πi = Pr{the initial state is si} = Pr(X1 = si) • ∀i: πi ≥ 0, and Σi=1,N πi = 1 • The initial distribution can be avoided by creating a special start state s0. EE669: Natural Language Processing
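As a quick illustration of the stochastic constraints on A and π above, the following sketch checks them for a toy two-state chain; the matrix and vector values are invented for the example, not taken from the lecture.

```python
# Hypothetical two-state Markov chain (A, pi); all numbers are illustrative.
A = [[0.7, 0.3],   # a_ij = Pr(X_{t+1} = s_j | X_t = s_i)
     [0.4, 0.6]]
pi = [0.9, 0.1]    # pi_i = Pr(X_1 = s_i)

def is_stochastic(A, pi, tol=1e-9):
    """Check a_ij >= 0 with each row of A summing to 1, and pi a distribution."""
    rows_ok = all(min(row) >= 0 and abs(sum(row) - 1.0) < tol for row in A)
    return rows_ok and min(pi) >= 0 and abs(sum(pi) - 1.0) < tol

print(is_stochastic(A, pi))  # → True
```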

  4. Markov Chain EE669: Natural Language Processing

  5. Markov Process & Language Models • Chain rule: P(W) = P(w1,w2,...,wT) = ∏i=1..T p(wi|w1,w2,...,wi-1) • n-gram language models: • Markov process (chain) of order n-1: P(W) = P(w1,w2,...,wT) = ∏i=1..T p(wi|wi-n+1,wi-n+2,...,wi-1) • Using just one distribution (Ex.: trigram model: p(wi|wi-2,wi-1)): Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Words: My car broke down , and within hours Bob 's car broke down , too . p(,|broke down) = p(w5|w3,w4) = p(w14|w12,w13) EE669: Natural Language Processing
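The point of the positions/words example above is that a trigram model uses one distribution wherever the same two-word history recurs. A minimal sketch, with an invented probability table (the value 0.6 is an assumption for illustration only):

```python
# Toy trigram model p(w_i | w_{i-2}, w_{i-1}); the table is hypothetical.
trigram = {("broke", "down"): {",": 0.6, ".": 0.4}}

def p(w, w2, w1):
    """Trigram probability; 0.0 for unseen histories or words."""
    return trigram.get((w2, w1), {}).get(w, 0.0)

# Positions 5 and 14 both use the history (broke, down), hence one distribution.
print(p(",", "broke", "down"))  # → 0.6
```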

  6. Example with a Markov Chain [diagram: states t, a, p, h, e, i with a designated Start state; arcs labeled with transition probabilities .6, .4, .3, and 1] EE669: Natural Language Processing

  7. Probability of a Sequence of States • P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12 • P(t, a, p, e) = ? EE669: Natural Language Processing
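The product above can be sketched in code. The transition entries below are assumptions chosen only to reproduce the worked product 1.0 × 0.3 × 0.4 × 1.0; the full chain of the diagram slide is not fully recoverable from the transcript.

```python
# Sequence probability P(X1..XT) = pi_{X1} * prod_t a_{Xt, Xt+1}.
# Hypothetical entries matching the worked example P(t, a, p, p) = 0.12.
pi = {"t": 1.0}
a = {("t", "a"): 0.3, ("a", "p"): 0.4, ("p", "p"): 1.0}

def seq_prob(states):
    prob = pi.get(states[0], 0.0)
    for prev, nxt in zip(states, states[1:]):
        prob *= a.get((prev, nxt), 0.0)   # unseen transitions get 0
    return prob

print(round(seq_prob(["t", "a", "p", "p"]), 2))  # → 0.12
```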

  8. Hidden Markov Models (HMMs) • Sometimes it is not possible to know precisely which states the model passes through; all we can do is observe some phenomenon that occurs in each state with some probability distribution. • An HMM is an appropriate model for cases when you don't know the state sequence that the model passes through, but only some probabilistic function of it. For example: • Word recognition (discrete utterance or within continuous speech) • Phoneme recognition • Part-of-Speech Tagging • Linear Interpolation EE669: Natural Language Processing

  9. Example HMM • Process: • According to π, pick a state qi. Select a ball from urn i (state is hidden). • Observe the color (observable). • Replace the ball. • According to A, transition to the next urn from which to pick a ball (state is hidden). • N states corresponding to urns: q1, q2, …, qN • M colors for balls found in urns: z1, z2, …, zM • A: N x N matrix; transition probability distribution (between urns) • B: N x M matrix; for each urn, a probability distribution over colors: • bij = Pr(COLOR = zj | URN = qi) • π: N vector; initial state probability distribution (where do we start?) EE669: Natural Language Processing

  10. Green circles are hidden states Dependent only on the previous state What is an HMM? EE669: Natural Language Processing

  11. Purple nodes are observations Dependent only on their corresponding hidden state (for a state emit model) What is an HMM? EE669: Natural Language Processing

  12. Elements of an HMM • HMM (the most general case): • five-tuple (S, K, Π, A, B), where: • S = {s1,s2,...,sN} is the set of states, • K = {k1,k2,...,kM} is the output alphabet, • Π = {πi}, i ∈ S, the initial state probabilities, • A = {aij}, i,j ∈ S, the transition probabilities, • B = {bijk}, i,j ∈ S, k ∈ K (arc emission), or B = {bik}, i ∈ S, k ∈ K (state emission), the emission probabilities • State Sequence: X = (x1, x2, ..., xT+1), xt ∈ S • Observation (output) Sequence: O = (o1, ..., oT), ot ∈ K EE669: Natural Language Processing

  13. HMM Formalism (S, K, Π, A, B) • S : {s1…sN} are the values for the hidden states • K : {k1…kM} are the values for the observations [diagram: a chain of hidden states S, each emitting an observation K] EE669: Natural Language Processing

  14. HMM Formalism (S, K, Π, A, B) • Π = {πi} are the initial state probabilities • A = {aij} are the state transition probabilities • B = {bik} are the observation probabilities [diagram: the same chain, with π at the first state, A on the transitions between states S, and B on the emissions to observations K] EE669: Natural Language Processing

  15. Example of an Arc Emit HMM [diagram: a four-state HMM entered at state 1; arcs labeled with transition probabilities 0.6, 0.4, 0.88, 0.12, and 1; each arc carries an output distribution over {t, o, e}, e.g. p(t)=.8, p(o)=.1, p(e)=.1] EE669: Natural Language Processing

  16. HMMs • N states in the model: a state has some measurable, distinctive properties. • At clock time t, you make a state transition (to a different state or back to the same state), based on a transition probability distribution that depends on the current state (i.e., the one you're in before making the transition). • After each transition, you emit an observation symbol according to a probability distribution that depends on the current state or the current arc. • Goal: From the observations, determine which model generated the output, e.g., in word recognition with a different HMM for each word, the goal is to choose the one that best fits the input word (based on state transitions and output observations). EE669: Natural Language Processing

  17. Example HMM • N states corresponding to urns: q1, q2, …, qN • M colors for balls found in urns: z1, z2, …, zM • A: N x N matrix; transition probability distribution (between urns) • B: N x M matrix; for each urn, a probability distribution over colors: • bij = Pr(COLOR = zj | URN = qi) • π: N vector; initial state probability distribution (where do we start?) EE669: Natural Language Processing

  18. Example HMM • Process: • According to π, pick a state qi. Select a ball from urn i (state is hidden). • Observe the color (observable). • Replace the ball. • According to A, transition to the next urn from which to pick a ball (state is hidden). • Design: Choose N and M. Specify a model λ = (A, B, π) from the training data. Adjust the model parameters (A, B, π) to maximize P(O|λ). • Use: Given O and λ = (A, B, π), what is P(O|λ)? If we compute P(O|λ) for all models λ, then we can determine which model most likely generated O. EE669: Natural Language Processing

  19. Simulating a Markov Process t := 1; Start in state si with probability πi (i.e., X1 = i) Forever do: Move from state si to state sj with probability aij (i.e., Xt+1 = j) Emit observation symbol ot = k with probability bik (or bijk for arc emission) t := t + 1 End EE669: Natural Language Processing
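The generation loop above can be sketched as runnable code. This is a minimal sketch assuming the state-emission variant (bik), with invented parameters and output symbols "x"/"y"; it is not the lecture's own implementation.

```python
import random

random.seed(0)  # reproducible runs

# Hypothetical two-state HMM; all numbers are illustrative.
pi = [1.0, 0.0]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]

def draw(weights):
    """Sample an index (list) or key (dict) according to a probability vector."""
    items = list(weights.items()) if isinstance(weights, dict) else list(enumerate(weights))
    r, acc = random.random(), 0.0
    for key, w in items:
        acc += w
        if r < acc:
            return key
    return items[-1][0]   # guard against floating-point rounding

def simulate(T):
    state = draw(pi)                          # X1 ~ pi
    observations = []
    for _ in range(T):
        observations.append(draw(B[state]))   # o_t ~ b_{state, k}
        state = draw(A[state])                # X_{t+1} ~ a_{state, j}
    return observations

print(simulate(5))
```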

  20. Why Use Hidden Markov Models? • HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: Part-of-Speech Tagging. • HMMs can be efficiently trained using the EM Algorithm. • Another example where HMMs are useful is in generating parameters for linear interpolation of n-gram models. • If we assume that some set of data was generated by an HMM, then the HMM is useful for calculating the probabilities of possible underlying state sequences. EE669: Natural Language Processing

  21. Fundamental Questions for HMMs • Given a model λ = (A, B, π), how do we efficiently compute how likely a certain observation sequence is, that is, P(O|λ)? • Given the observation sequence O and a model λ, how do we choose a state sequence (X1, …, XT+1) that best explains the observations? • Given an observation sequence O, and a space of possible models found by varying the model parameters λ = (A, B, π), how do we find the model that best explains the observed data? EE669: Natural Language Processing

  22. Probability of an Observation • Given the observation sequence O = (o1, …, oT) and a model λ = (A, B, π), we wish to know how to efficiently compute P(O|λ). • For any state sequence X = (X1, …, XT+1), we find: P(O|λ) = ΣX P(X|λ) P(O|X, λ) • This is simply the probability of an observation sequence given the model. • Direct evaluation of this expression is extremely inefficient; however, there are dynamic programming methods that compute it quite efficiently. EE669: Natural Language Processing
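To make the inefficiency concrete, the sum over X can be evaluated directly by enumerating all N^T state sequences. A sketch assuming a state-emission HMM with hypothetical toy parameters:

```python
from itertools import product

# Direct evaluation of P(O|lambda) = sum_X P(X|lambda) P(O|X, lambda).
# Toy state-emission parameters; treat all numbers as assumptions.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]   # b_ik = P(o = k | state i), k in {0, 1}

def direct_prob(O, pi, A, B):
    N, T = len(pi), len(O)
    total = 0.0
    for X in product(range(N), repeat=T):        # every possible state sequence
        p = pi[X[0]] * B[X[0]][O[0]]
        for t in range(1, T):
            p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
        total += p
    return total

print(direct_prob([0, 1, 0], pi, A, B))
```

The loop runs N^T times, which is why the forward procedure later in the lecture matters.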

  23. Probability of an Observation [trellis diagram: observations o1 … oT] Given an observation sequence and a model, compute the probability of the observation sequence. EE669: Natural Language Processing

  24. Probability of an Observation [trellis diagram: hidden states x1 … xT over observations o1 … oT, with emission probabilities b] EE669: Natural Language Processing

  25. Probability of an Observation [trellis diagram, with transition probabilities a and emission probabilities b] EE669: Natural Language Processing

  26. Probability of an Observation [trellis diagram] EE669: Natural Language Processing

  27. Probability of an Observation [trellis diagram] EE669: Natural Language Processing

  28. Probability of an Observation (State Emit) [trellis diagram] EE669: Natural Language Processing

  29. Probability of an Observation with an Arc Emit HMM [diagram: the four-state arc-emit HMM of slide 15] p(toe) = .6×.8×.88×.7×1×.6 + .4×.5×1×1×.88×.2 + .4×.5×1×1×.12×1 ≅ .237 EE669: Natural Language Processing
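The three-path sum on this slide can be checked numerically; each list below holds the factors of one path's product, copied from the slide.

```python
from math import prod

# The three paths through the arc-emit HMM that can generate "toe".
paths = [
    [0.6, 0.8, 0.88, 0.7, 1.0, 0.6],
    [0.4, 0.5, 1.0, 1.0, 0.88, 0.2],
    [0.4, 0.5, 1.0, 1.0, 0.12, 1.0],
]
p_toe = sum(prod(path) for path in paths)
print(round(p_toe, 3))  # → 0.237
```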

  30. Making Computation Efficient • To avoid the inefficiency of direct evaluation, use dynamic programming or memoization techniques, exploiting the trellis structure of the problem. • Use an array of states versus time to compute the probability of being at each state at time t+1 in terms of the probabilities for being in each state at time t. • A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths. • A forward variable, αi(t) = P(o1 o2 … ot-1, Xt = i | λ), is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t. EE669: Natural Language Processing

  31. The Trellis Structure EE669: Natural Language Processing

  32. Forward Procedure [trellis diagram: x1 … xT, o1 … oT] • The special structure gives us an efficient solution using dynamic programming. • Intuition: the probability of the first t observations is the same for all possible t+1-length state sequences (so don't recompute it!) • Define the forward variable αi(t) = P(o1 … ot-1, Xt = i | λ). EE669: Natural Language Processing

  33. Forward Procedure [trellis diagram] Note: we omit |λ from the formulas. EE669: Natural Language Processing

  34. Forward Procedure [trellis diagram] Note: P(A, B) = P(A|B) · P(B) EE669: Natural Language Processing

  35. Forward Procedure [trellis diagram] EE669: Natural Language Processing

  36. Forward Procedure [trellis diagram] EE669: Natural Language Processing

  37. Forward Procedure [trellis diagram] EE669: Natural Language Processing

  38. Forward Procedure [trellis diagram] EE669: Natural Language Processing

  39. Forward Procedure [trellis diagram] EE669: Natural Language Processing

  40. Forward Procedure [trellis diagram] EE669: Natural Language Processing

  41. The Forward Procedure • Forward variables are calculated as follows: • Initialization: αi(1) = πi, 1 ≤ i ≤ N • Induction: αj(t+1) = Σi=1,N αi(t) aij bijot, 1 ≤ t ≤ T, 1 ≤ j ≤ N • Total: P(O|λ) = Σi=1,N αi(T+1) • This algorithm requires 2N²T multiplications (much less than the direct method, which takes (2T+1)·N^(T+1)). EE669: Natural Language Processing
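The initialization/induction/total steps above can be sketched in code. This sketch uses the state-emission variant (αj(t+1) = Σi αi(t) aij bj(ot+1)) rather than the lecture's arc-emission bijot, and the toy parameters are assumptions for illustration.

```python
# Forward procedure sketch for a state-emission HMM; toy numbers.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def forward(O):
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]   # initialization
    for t in range(1, len(O)):                       # induction over time
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)                                # total: P(O|lambda)

print(forward([0, 1, 0]))
```

Each step costs O(N²), so the whole run is O(N²T) instead of the exponential direct sum.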

  42. The Backward Procedure • We could also compute these probabilities by working backward through time. • The backward procedure computes backward variables, which give the total probability of seeing the rest of the observation sequence given that we were in state si at time t. • βi(t) = P(ot … oT | Xt = i, λ) is a backward variable. • Backward variables are useful for the problem of parameter reestimation. EE669: Natural Language Processing

  43. Backward Procedure [trellis diagram: x1 … xT, o1 … oT] EE669: Natural Language Processing

  44. The Backward Procedure • Backward variables can be calculated working backward through the trellis as follows: • Initialization: βi(T+1) = 1, 1 ≤ i ≤ N • Induction: βi(t) = Σj=1,N aij bijot βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N • Total: P(O|λ) = Σi=1,N πi βi(1) • Backward variables can also be combined with forward variables: • P(O|λ) = Σi=1,N αi(t) βi(t), 1 ≤ t ≤ T+1 EE669: Natural Language Processing
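The backward recursion can be sketched the same way. As with the forward sketch, this assumes state emission, so the total becomes Σi πi bi(o1) βi(1) rather than the arc-emission Σi πi βi(1); parameters are the same hypothetical toy values.

```python
# Backward procedure sketch (state emission); toy numbers.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def backward(O):
    N, T = len(pi), len(O)
    beta = [1.0] * N                       # initialization: beta_i(T) = 1
    for t in range(T - 2, -1, -1):         # induction, backward through time
        beta = [sum(A[i][j] * B[j][O[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    # total, state-emission form: sum_i pi_i * b_i(o1) * beta_i(1)
    return sum(pi[i] * B[i][O[0]] * beta[i] for i in range(N))

print(backward([0, 1, 0]))
```

Run on the same observation sequence, forward and backward give the same P(O|λ), which is a handy sanity check.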


  46. Probability of an Observation [trellis diagram: x1 … xT, o1 … oT, showing the forward procedure, the backward procedure, and their combination] EE669: Natural Language Processing

  47. Finding the Best State Sequence • One method consists of optimizing on the states individually. • For each t, 1 ≤ t ≤ T+1, we would like to find the Xt that maximizes P(Xt|O, λ). • Let γi(t) = P(Xt = i | O, λ) = P(Xt = i, O|λ) / P(O|λ) = αi(t)βi(t) / Σj=1,N αj(t)βj(t) • The individually most likely state is X̂t = argmax1≤i≤N γi(t), 1 ≤ t ≤ T+1 • This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence. EE669: Natural Language Processing
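The per-state posterior γi(t) can be sketched by combining the forward and backward passes. Again this assumes state emission with hypothetical toy parameters; it returns both the posteriors and the individually most likely state at each time.

```python
# Sketch: gamma_i(t) = alpha_i(t) * beta_i(t) / P(O|lambda), state emission.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def posteriors(O):
    N, T = len(pi), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    beta  = [[1.0] * N for _ in range(T)]
    for i in range(N):                                   # forward pass
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
    for t in range(T - 2, -1, -1):                       # backward pass
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
    Z = sum(alpha[T-1])                                  # P(O|lambda)
    gamma = [[alpha[t][i] * beta[t][i] / Z for i in range(N)] for t in range(T)]
    best = [max(range(N), key=lambda i: g[i]) for g in gamma]
    return gamma, best

gamma, best = posteriors([0, 1, 0])
print(best)  # → [0, 1, 0]
```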

  48. Best State Sequence [diagram: observations o1 … oT] • Find the state sequence that best explains the observation sequence • Viterbi algorithm EE669: Natural Language Processing

  49. Finding the Best State Sequence: The Viterbi Algorithm • The Viterbi algorithm efficiently computes the most likely state sequence. • To find the most likely complete path, compute: argmaxX P(X|O,λ) • To do this, it is sufficient to maximize for a fixed O: argmaxX P(X,O|λ) • We define δj(t) = maxX1..Xt-1 P(X1…Xt-1, o1..ot-1, Xt = j|λ) • ψj(t) records the node of the incoming arc that led to this most probable path. EE669: Natural Language Processing
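The δ/ψ recursion can be sketched as follows, again for a state-emission HMM with the same hypothetical toy parameters: δ holds the best-path score ending in each state, and the backpointers play the role of ψ.

```python
# Viterbi sketch (state emission): max-product recursion plus backtracking.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def viterbi(O):
    N, T = len(pi), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]   # delta_i(1)
    psi = []                                         # backpointers per step
    for t in range(1, T):
        new_delta, back = [], []
        for j in range(N):
            i_best = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(i_best)                      # psi_j(t)
            new_delta.append(delta[i_best] * A[i_best][j] * B[j][O[t]])
        delta, psi = new_delta, psi + [back]
    state = max(range(N), key=lambda i: delta[i])    # best final state
    path = [state]
    for back in reversed(psi):                       # follow psi backward
        state = back[state]
        path.append(state)
    return list(reversed(path))

print(viterbi([0, 1, 0]))  # → [0, 1, 0]
```

Structurally this is the forward procedure with Σ replaced by max, plus backpointers, so it also runs in O(N²T).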

  50. Viterbi Algorithm [trellis diagram: x1 … xt-1, state j, over o1 … oT] δj(t) is the probability of the state sequence that maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t. EE669: Natural Language Processing
