
HMM - Part 2



  1. HMM - Part 2 • Review of the last lecture • The EM algorithm • Continuous density HMM

  2. Three Basic Problems for HMMs (P(up, up, up, up, up | λ)?) • Given an observation sequence O = (o1, o2, …, oT) and an HMM λ = (A, B, π) • Problem 1: How to compute P(O | λ) efficiently? The forward algorithm • Problem 2: How to choose an optimal state sequence Q = (q1, q2, …, qT) which best explains the observations? The Viterbi algorithm • Problem 3: How to adjust the model parameters λ = (A, B, π) to maximize P(O | λ)? The Baum-Welch (forward-backward) algorithm (cf. the segmental K-means algorithm maximizes P(O, Q* | λ))

  3. The Forward Algorithm • The forward variable: αt(i) = P(o1, o2, …, ot, qt = i | λ) • Probability of o1, o2, …, ot being observed and the state at time t being i, given model λ • The forward algorithm: initialization, induction, termination (see the sketch below)
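A minimal sketch of the forward recursion in Python/NumPy, assuming a discrete-observation HMM with transition matrix A (N×N), emission matrix B (N×M), initial distribution pi, and an integer-coded observation sequence obs; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Compute alpha[t, i] = P(o_1..o_t, q_t = i | lambda) and P(O | lambda)."""
    N = A.shape[0]                # number of states
    T = len(obs)                  # length of the observation sequence
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):                           # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                   # termination: P(O | lambda)
```

For long sequences these probabilities underflow, so practical implementations scale each alpha[t] or work in log space.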

  4. The Viterbi Algorithm • Initialization • Induction • Termination • Backtracking: the backtracked path Q* = (q1*, q2*, …, qT*) is the best state sequence (a sketch follows)
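A matching sketch of the Viterbi algorithm under the same assumed representation (A, B, pi, integer-coded obs); an illustrative sketch, not the slides' own code.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence Q* and its probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))            # best path score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # backpointers for backtracking
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    q = np.zeros(T, dtype=int)                        # termination + backtracking
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[-1].max()
```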

  5. The Segmental K-means Algorithm • Assume that we have a training set of observations and an initial estimate of the model parameters • Step 1: Segment the training data. The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm • Step 2: Re-estimate the model parameters • Step 3: Evaluate the model. If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return

  6. Segmental K-means vs. Baum-Welch: the former maximizes P(O, Q* | λ) along the single best state sequence, while the latter maximizes P(O | λ) over all state sequences

  7. cf. The Backward Algorithm • The backward variable: βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ) • Probability of ot+1, ot+2, …, oT being observed, given the state at time t being i and model λ • The backward algorithm (see the sketch below)
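A companion sketch of the backward recursion under the same assumed representation; illustrative only.

```python
import numpy as np

def backward(A, B, obs):
    """Compute beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                       # initialization
    for t in range(T - 2, -1, -1):                       # induction, backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])   # sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    return beta
```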

  8. The Forward-Backward Algorithm • Relation between the forward and backward variables (Huang et al., 2001)
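The relation referred to here is the standard one (supplied since the slide's formula image is not in the transcript):

$$\alpha_t(i)\,\beta_t(i) = P(O, q_t = i \mid \lambda), \qquad P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i) \quad \text{for any } t.$$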

  9. The Baum-Welch Algorithm (1/3) • Define two new variables: γt(i) = P(qt = i | O, λ) • Probability of being in state i at time t, given O and λ • ξt(i, j) = P(qt = i, qt+1 = j | O, λ) • Probability of being in state i at time t and state j at time t+1, given O and λ

  10. The Baum-Welch Algorithm (2/3) • γt(i) = P(qt = i | O, λ) and ξt(i, j) = P(qt = i, qt+1 = j | O, λ) are computed from the forward and backward variables, as shown below
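The standard expressions for these variables in terms of the forward and backward variables (supplied since the slide's formula images are not in the transcript) are:

$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{N}\alpha_t(j)\,\beta_t(j)}, \qquad \xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O \mid \lambda)}, \qquad \gamma_t(i) = \sum_{j=1}^{N}\xi_t(i,j).$$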

  11. The Baum-Welch Algorithm (3/3) • Re-estimation formulae for π, A, and B are given below • How do you know? (answered by the EM derivation in the following slides)
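The standard re-estimation formulae (supplied since the slide's formula images are not in the transcript) are:

$$\bar{\pi}_i = \gamma_1(i), \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}, \qquad \bar{b}_j(k) = \frac{\sum_{t:\,o_t = v_k}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}.$$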

  12. Maximum Likelihood Estimation for HMM • The goal is λ* = argmaxλ P(O | λ). However, we cannot find the solution directly. An alternative way is to find a sequence of models λ(0), λ(1), …, λ(k), … s.t. P(O | λ(k+1)) ≥ P(O | λ(k))

  13. The Q function (auxiliary function): it is solvable, and it can be proved that increasing it does not decrease P(O | λ) • Jensen's inequality: if f is a concave function and X is a r.v., then E[f(X)] ≤ f(E[X])
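In symbols, a standard statement of this guarantee (not shown explicitly in the extracted text):

$$\log P(O \mid \bar{\lambda}) - \log P(O \mid \lambda) \;\ge\; Q(\lambda, \bar{\lambda}) - Q(\lambda, \lambda),$$

so any $\bar{\lambda}$ with $Q(\lambda, \bar{\lambda}) \ge Q(\lambda, \lambda)$ also satisfies $P(O \mid \bar{\lambda}) \ge P(O \mid \lambda)$.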

  14. The EM Algorithm • EM: Expectation Maximization • Why EM? • Simple optimization algorithms for likelihood functions rely on intermediate variables, called latent data. For an HMM, the state sequence is the latent data • Direct access to the data necessary to estimate the parameters is impossible or difficult. For an HMM, it is almost impossible to estimate (A, B, π) without considering the state sequence • Two major steps: • E step: compute the expectation of the likelihood by including the latent variables as if they were observed • M step: compute the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step
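For the HMM case the two steps can be written compactly (a standard formulation, supplied for reference):

$$\text{E-step:}\quad Q(\lambda, \bar{\lambda}) = \sum_{Q} P(Q \mid O, \lambda)\,\log P(O, Q \mid \bar{\lambda}), \qquad \text{M-step:}\quad \bar{\lambda} = \arg\max_{\bar{\lambda}} Q(\lambda, \bar{\lambda}).$$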

  15. Three Steps for EM • Step 1. Draw a lower bound • Use Jensen's inequality • Step 2. Find the best lower bound (the auxiliary function) • Let the lower bound touch the objective function at the current guess • Step 3. Maximize the auxiliary function • Obtain the new guess • Go to Step 2 until convergence [Minka 1998]

  16. Form an Initial Guess of λ = (A, B, π) • Given the current guess λ(old), the goal is to find a new guess λ(new) such that P(O | λ(new)) ≥ P(O | λ(old)) (figure: the objective function with the current guess marked)

  17. Step 1. Draw a Lower Bound (figure: a lower bound function drawn beneath the objective function)

  18. Step 2. Find the Best Lower Bound (figure: among the lower bound functions, the auxiliary function is the one that touches the objective function at the current guess)

  19. Step 3. Maximize the Auxiliary Function (figure: the maximum of the auxiliary function gives the new guess on the objective function)

  20. Update the Model (figure: the new guess becomes the current guess on the objective function)

  21. Step 2. Find the Best Lower Bound (second iteration; figure: a new auxiliary function touches the objective function at the updated guess)

  22. Step 3. Maximize the Auxiliary Function (second iteration; figure: the maximum of the new auxiliary function gives the next guess)

  23. Step 1. Draw a Lower Bound (cont'd) • Objective function: the log-likelihood of the observations • Apply Jensen's inequality: if f is a concave function and X is a r.v., then E[f(X)] ≤ f(E[X]) • p(Q): an arbitrary probability distribution over the state sequences • This yields a lower bound function of the objective (see below)
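Written out, the bound obtained this way (a standard derivation, supplied since the slide's formulas are images) is

$$\log P(O \mid \lambda) = \log \sum_{Q} p(Q)\,\frac{P(O, Q \mid \lambda)}{p(Q)} \;\ge\; \sum_{Q} p(Q)\,\log \frac{P(O, Q \mid \lambda)}{p(Q)},$$

which holds for every distribution $p(Q)$ because $\log$ is concave.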

  24. Step 2. Find the Best Lower Bound (cont'd) • Find the distribution p(Q) that makes the lower bound function touch the objective function at the current guess

  25. Step 2. Find the Best Lower Bound (cont'd) • Take the derivative w.r.t. p(Q) and set it to zero
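Carrying this out with a Lagrange multiplier $\eta$ for the constraint $\sum_{Q} p(Q) = 1$ (a standard derivation, supplied for reference) gives

$$\frac{\partial}{\partial p(Q)}\left[\sum_{Q'} p(Q')\,\log\frac{P(O,Q'\mid\lambda)}{p(Q')} + \eta\Big(\sum_{Q'} p(Q') - 1\Big)\right] = 0 \;\Rightarrow\; p(Q) = \frac{P(O, Q \mid \lambda)}{P(O \mid \lambda)} = P(Q \mid O, \lambda),$$

i.e., the best lower bound is obtained with the posterior of the state sequence under the current model.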

  26. Step 2. Find the Best Lower Bound (cont'd) • Define the Q function (the auxiliary function) • We can check that, with p(Q) = P(Q | O, λ), the lower bound touches the objective function at the current guess

  27. EM for HMM Training • Basic idea • Assume we have λ and the probability that each Q occurred in the generation of O, i.e., we have in fact observed a complete data pair (O, Q) with frequency proportional to the probability P(O, Q | λ) • We then find a new λ̄ that maximizes this expectation • It can be guaranteed that P(O | λ̄) ≥ P(O | λ) • EM can discover parameters of model λ that maximize the log-likelihood of the incomplete data, log P(O | λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, Q | λ)

  28. Solution to Problem 3 - The EM Algorithm • The auxiliary function Q(λ, λ̄) = ΣQ P(Q | O, λ) log P(O, Q | λ̄), where P(O, Q | λ̄) = π̄q1 b̄q1(o1) Πt=2..T āq(t-1)q(t) b̄qt(ot), so log P(O, Q | λ̄) can be expressed as log π̄q1 + Σt log āq(t-1)q(t) + Σt log b̄qt(ot)

  29. Solution to Problem 3 - The EM Algorithm (cont'd) • The auxiliary function can be rewritten as a sum of terms of the form Σi wi log yi (example callouts in the slide: wi, yi; wj, yj; wk, yk)

  30. Solution to Problem 3 - The EM Algorithm (cont'd) • The auxiliary function is separated into three independent terms, each of which respectively corresponds to π, {aij}, and {bj(k)} • Maximization of Q(λ, λ̄) can be done by maximizing the individual terms separately, subject to probability constraints • All these terms have the form Σj wj log yj with Σj yj = 1

  31. Solution to Problem 3 - The EM Algorithm (cont'd) • Proof: apply a Lagrange multiplier with the constraint Σj yj = 1
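A compact version of the proof (standard, supplied since the slide's derivation is an image): maximize $\sum_j w_j \log y_j$ subject to $\sum_j y_j = 1$ using the Lagrangian $\sum_j w_j \log y_j + \eta(\sum_j y_j - 1)$. Setting the derivative w.r.t. $y_j$ to zero gives $w_j / y_j + \eta = 0$, so $y_j = -w_j/\eta$; the constraint then gives $\eta = -\sum_i w_i$ and hence

$$y_j = \frac{w_j}{\sum_{i} w_i}.$$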

  32. Solution to Problem 3 - The EM Algorithm (cont'd) (applying the result to the term with weights wi and variables yi)

  33. Solution to Problem 3 - The EM Algorithm (cont'd) (applying the result to the term with weights wj and variables yj)

  34. Solution to Problem 3 - The EM Algorithm (cont'd) (applying the result to the term with weights wk and variables yk)

  35. Solution to Problem 3 - The EM Algorithm (cont'd) • The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed in terms of γt(i) and ξt(i, j), giving the re-estimation formulae shown earlier (slide 11)

  36. Discrete vs. Continuous Density HMMs • Two major types of HMMs according to the observations • Discrete and finite observations: the observations that all distinct states generate are finite in number, i.e., V = {v1, v2, v3, …, vM}, vk ∈ RL • In this case, the observation probability distribution in state j, B = {bj(k)}, is defined as bj(k) = P(ot = vk | qt = j), 1 ≤ k ≤ M, 1 ≤ j ≤ N (ot: observation at time t, qt: state at time t); bj(k) consists of only M probability values • Continuous and infinite observations: the observations that all distinct states generate are infinite and continuous, i.e., V = {v | v ∈ RL} • In this case, the observation probability distribution in state j, B = {bj(v)}, is defined as bj(v) = f(ot = v | qt = j), 1 ≤ j ≤ N (ot: observation at time t, qt: state at time t); bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions

  37. Gaussian Distribution • A continuous random variable X is said to have a Gaussian distribution with mean μ and variance σ² (σ > 0) if X has a continuous pdf of the following form: f(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

  38. Multivariate Gaussian Distribution • If X = (X1, X2, X3, …, Xd) is a d-dimensional random vector with a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, then the pdf can be expressed as f(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)) • If X1, X2, X3, …, Xd are independent random variables, the covariance matrix reduces to a diagonal matrix, i.e., Σ = diag(σ1², σ2², …, σd²)

  39. Multivariate Mixture Gaussian Distribution • A d-dimensional random vector X = (X1, X2, X3, …, Xd) has a multivariate mixture Gaussian distribution if its pdf is a weighted sum of multivariate Gaussian pdfs • In a CDHMM, bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian distributions, bj(v) = Σk cjk N(v; μjk, Σjk), where v is the observation vector, μjk is the mean vector of the k-th mixture of the j-th state, Σjk is the covariance matrix of the k-th mixture of the j-th state, and the cjk are mixture weights
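A minimal sketch of such a mixture emission density in Python/NumPy, assuming diagonal covariance matrices; the names (c for mixture weights, mu for means, var for diagonal variances) are illustrative.

```python
import numpy as np

def gmm_emission(v, c, mu, var):
    """b_j(v) = sum_k c[k] * N(v; mu[k], diag(var[k])) for one state j.

    v:   observation vector, shape (d,)
    c:   mixture weights, shape (K,), summing to 1
    mu:  mixture means, shape (K, d)
    var: diagonal variances, shape (K, d), all positive
    """
    diff = v - mu                                                    # (K, d)
    log_norm = -0.5 * (np.log(2 * np.pi * var) + diff**2 / var).sum(axis=1)
    return float(np.sum(c * np.exp(log_norm)))                       # b_j(v)
```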

  40. Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM • Assume that we have a training set of observations and an initial estimate of the model parameters • Step 1: Segment the training data. The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm • Step 2: Re-estimate the model parameters • Step 3: Evaluate the model. If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return

  41. Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM (cont'd) • Example: 3 states and 4 Gaussian mixtures per state (figure: observations O1, O2, …, Ot are aligned to states s1, s2, s3 over time by Viterbi segmentation; within each state, K-means splits the assigned observations into clusters, starting from the global mean, giving per-mixture parameter sets {μ11, Σ11, c11}, {μ12, Σ12, c12}, {μ13, Σ13, c13}, {μ14, Σ14, c14}, and so on)

  42. Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM • Define a new variable γt(j, k) • Probability of being in state j at time t with the k-th mixture component accounting for ot, given O and λ (uses the observation-independence assumption)

  43. Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM (cont'd) • Re-estimation formulae for the mixture weights cjk, mean vectors μjk, and covariance matrices Σjk are given below
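The standard CDHMM re-estimation formulae (supplied since the slide's formula images are not in the transcript) are

$$\gamma_t(j,k) = \left[\frac{\alpha_t(j)\,\beta_t(j)}{\sum_{i}\alpha_t(i)\,\beta_t(i)}\right]\frac{c_{jk}\,N(o_t;\,\mu_{jk}, \Sigma_{jk})}{\sum_{m} c_{jm}\,N(o_t;\,\mu_{jm}, \Sigma_{jm})},$$

$$\bar{c}_{jk} = \frac{\sum_{t}\gamma_t(j,k)}{\sum_{t}\sum_{m}\gamma_t(j,m)}, \qquad \bar{\mu}_{jk} = \frac{\sum_{t}\gamma_t(j,k)\,o_t}{\sum_{t}\gamma_t(j,k)}, \qquad \bar{\Sigma}_{jk} = \frac{\sum_{t}\gamma_t(j,k)\,(o_t-\bar{\mu}_{jk})(o_t-\bar{\mu}_{jk})^{\top}}{\sum_{t}\gamma_t(j,k)}.$$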

  44. A Simple Example: The Forward/Backward Procedure (figure: a trellis with two states S1 and S2 over times 1, 2, 3 and observations o1, o2, o3)

  45. A Simple Example (cont'd) • With 2 states and 3 observations there are 2³ = 8 possible state sequences, e.g., q = (1, 1, 1), q = (1, 1, 2), … • P(O | λ) is the sum of P(O, q | λ) over all 8 paths
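A minimal sketch of the brute-force computation this example illustrates: summing P(O, Q | λ) over every state sequence, which matches the forward algorithm's P(O | λ) but costs O(T·Nᵀ) instead of O(N²T). It uses the same assumed (A, B, pi, obs) representation as the earlier sketches.

```python
import itertools
import numpy as np

def brute_force_likelihood(A, B, pi, obs):
    """Sum P(O, Q | lambda) over every possible state sequence Q."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):   # N**T paths (8 when N=2, T=3)
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total
```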

  46. A Simple Example (cont'd)
