
Hidden Markov Models


Presentation Transcript


  1. Hidden Markov Models M. Vijay Venkatesh

  2. Outline • Introduction • Graphical Model • Parameterization • Inference • Summary

  3. Introduction • The Hidden Markov Model (HMM) is a graphical model for modeling sequential data. • The states are no longer independent: the state at any given step depends on the state chosen at the previous step. • HMMs generalize mixture models by adding a transition matrix that links the states at neighboring steps.

  4. Introduction • Inference in an HMM takes the observed data as input and yields a probability distribution over the underlying states. • Since the states are dependent, it is somewhat more involved than inference for mixture models.

  5. Generation of data for the IID and HMM cases
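
To make the contrast concrete, here is a minimal sampling sketch (not from the slides). It assumes a discrete-output HMM with illustrative parameters pi, A and B; the IID mixture draws every state independently from pi, while the HMM draws each state from the row of A selected by the previous state.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters (assumed, not from the slides): M = 2 states, K = 3 symbols.
    pi = np.array([0.6, 0.4])              # initial state distribution
    A = np.array([[0.9, 0.1],              # state transition matrix
                  [0.2, 0.8]])
    B = np.array([[0.7, 0.2, 0.1],         # emission probabilities P(y_t | q_t)
                  [0.1, 0.3, 0.6]])

    def sample_iid_mixture(T):
        """Mixture model: each step draws its state independently from pi."""
        q = rng.choice(len(pi), size=T, p=pi)
        y = np.array([rng.choice(B.shape[1], p=B[s]) for s in q])
        return q, y

    def sample_hmm(T):
        """HMM: each state depends on the previous state through A."""
        q = np.empty(T, dtype=int)
        q[0] = rng.choice(len(pi), p=pi)
        for t in range(1, T):
            q[t] = rng.choice(len(pi), p=A[q[t - 1]])
        y = np.array([rng.choice(B.shape[1], p=B[s]) for s in q])
        return q, y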

  6. Graphical Model [Figure: a chain of state nodes $Q_0, Q_1, Q_2, \ldots, Q_T$ linked by the transition matrix A, with initial distribution π on $Q_0$; each $Q_t$ emits an observation $Y_t$.] The top node in each slice represents the multinomial state variable $Q_t$ and the bottom node represents the observable output variable $Y_t$.

  7. Graphical Model • Conditioning on the state $Q_t$ renders $Q_{t-1}$ and $Q_{t+1}$ independent. • More generally, $Q_s$ is independent of $Q_u$, given $Q_t$, for $s < t < u$. • The same holds for output nodes $Y_s$ and $Y_u$ when conditioned on the state node $Q_t$. • Conditioning on an output node does not yield any conditional independencies. • Indeed, conditioning on all of the output nodes fails to induce any independencies among the state nodes.

  8. Parameterization • State transition matrix A, where element $a_{ij}$ of A is defined as the transition probability $a_{ij} = P(q_{t+1} = j \mid q_t = i)$ • Each output node has a single state node as its parent, so we also require the emission probability $P(y_t \mid q_t)$ • For a particular configuration $(q, y)$, the joint probability is expressed as $P(q, y) = P(q_0) \prod_{t=0}^{T-1} P(q_{t+1} \mid q_t) \prod_{t=0}^{T} P(y_t \mid q_t)$

  9. Parameterization • To introduce the A and π parameters into the joint probability, we rewrite the transition probabilities and the unconditional initial node distribution as $a_{q_t, q_{t+1}} = P(q_{t+1} \mid q_t)$ and $\pi_{q_0} = P(q_0)$ • We get the joint probability $P(q, y) = \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)$
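
As an illustration of this factorization, the following sketch evaluates the log of the joint probability for integer-coded state and output sequences. The parameter names pi, A and B, and the assumption of discrete (multinomial) outputs, are mine; B matches the emission matrix introduced later on slide 20.

    import numpy as np

    def log_joint(q, y, pi, A, B):
        """log P(q, y) = log pi_{q_0} + sum_t log a_{q_t, q_{t+1}} + sum_t log P(y_t | q_t)."""
        lp = np.log(pi[q[0]]) + np.log(B[q[0], y[0]])
        for t in range(1, len(q)):
            lp += np.log(A[q[t - 1], q[t]]) + np.log(B[q[t], y[t]])
        return lp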

  10. Inference • The general inference problem is to compute the probability of the hidden states q given an observable output sequence y. • Marginal probability of a particular hidden state $q_t$ given the output sequence. • Probabilities conditioned on partial output: • Filtering • Prediction • Smoothing, where we calculate the posterior probability of a state using data up to and including a later time
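
For reference, the three quantities can be written as follows (the horizon indices h and u are my notation):

    \[
    \text{Filtering: } P(q_t \mid y_0, \ldots, y_t), \qquad
    \text{Prediction: } P(q_{t+h} \mid y_0, \ldots, y_t),\ h > 0, \qquad
    \text{Smoothing: } P(q_t \mid y_0, \ldots, y_u),\ u > t.
    \]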

  11. Inference • Let's calculate $P(q \mid y) = P(q, y) / P(y)$, where $y = (y_0, \ldots, y_T)$ is the entire observable output. • We can calculate the joint $P(q, y)$ directly from the parameterization. • But to calculate $P(y)$, we need to sum across all possible values of the hidden states: $P(y) = \sum_q P(q, y)$ • Each state can take M possible values and we have T state nodes, which implies that we must perform on the order of $M^T$ sums

  12. Inference • Each factor involves only one or two state variables. • It is therefore possible to move the sums inside the product and carry them out systematically, as sketched below. • Moving the sums inside and forming a recursion reduces the computation significantly.
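
A sketch of the rearrangement, using the parameterization from slide 9:

    \[
    P(y) = \sum_{q_0} \cdots \sum_{q_T} \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)
         = \sum_{q_T} P(y_T \mid q_T) \sum_{q_{T-1}} a_{q_{T-1}, q_T} P(y_{T-1} \mid q_{T-1}) \cdots \sum_{q_0} a_{q_0, q_1}\, \pi_{q_0}\, P(y_0 \mid q_0)
    \]

Each inner sum involves only two neighboring state variables, which is what makes the recursive computation of the following slides possible.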

  13. Inference • Rather than computing P(q|y) for the entire state sequence, we focus on a particular state node $q_t$ and calculate $P(q_t \mid y)$ • We take advantage of the conditional independencies and Bayes' rule: $P(q_t \mid y) \propto P(y_0, \ldots, y_t, q_t)\, P(y_{t+1}, \ldots, y_T \mid q_t)$ [Figure: a two-slice fragment of the model showing $Q_t$, $Q_{t+1}$, $Y_t$, $Y_{t+1}$ and the transition matrix A]

  14. Inference • $P(q_t \mid y) = \alpha(q_t)\,\beta(q_t)/P(y)$, where $\alpha(q_t) = P(y_0, \ldots, y_t, q_t)$ is the probability of emitting the partial sequence of outputs $y_0, \ldots, y_t$ and ending at state $q_t$ • and $\beta(q_t) = P(y_{t+1}, \ldots, y_T \mid q_t)$ is the probability of emitting the partial sequence of outputs $y_{t+1}, \ldots, y_T$ given that we start at state $q_t$

  15. Inference • The problem is reduced to finding α and β. • We obtain a recursive relation between $\alpha(q_t)$ and $\alpha(q_{t+1})$: $\alpha(q_{t+1}) = \sum_{q_t} \alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})$, initialized with $\alpha(q_0) = \pi_{q_0} P(y_0 \mid q_0)$ • The required time is $O(M^2 T)$ and the algorithm proceeds forward in time. • Similarly we obtain a recursive backward relation between $\beta(q_t)$ and $\beta(q_{t+1})$: $\beta(q_t) = \sum_{q_{t+1}} a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})\, \beta(q_{t+1})$, with $\beta(q_T) = 1$ • To compute posterior probabilities for all states $q_t$, we compute the alphas and betas for each step.
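
A minimal NumPy sketch of the two recursions, assuming a discrete-output HMM parameterized by pi, A and B as above. No rescaling or log-space handling is done, so it is only meant for short sequences.

    import numpy as np

    def forward_backward(y, pi, A, B):
        """Alpha-beta recursions for a discrete-output HMM (y is an integer-coded sequence)."""
        T, M = len(y), len(pi)
        alpha = np.zeros((T, M))
        beta = np.zeros((T, M))

        # Forward pass: alpha(q_t) = P(y_0, ..., y_t, q_t)
        alpha[0] = pi * B[:, y[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]

        # Backward pass: beta(q_t) = P(y_{t+1}, ..., y_T | q_t)
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])

        likelihood = alpha[-1].sum()           # P(y): sum of the alphas at the final step
        posterior = alpha * beta / likelihood  # P(q_t | y) for every t
        return alpha, beta, posterior, likelihood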

  16. Alternate inference algorithm • An alternative approach in which the backward phase is a recursion defined on the variable $\gamma(q_t) = P(q_t \mid y_0, \ldots, y_T)$ • The backward phase does not use the data $y_t$; only the forward phase does, so we can discard each observation once it has been filtered.

  17. Alternate inference algorithm • The recursion is $\gamma(q_t) = \sum_{q_{t+1}} \frac{\alpha(q_t)\, a_{q_t, q_{t+1}}}{\sum_{r_t} \alpha(r_t)\, a_{r_t, q_{t+1}}}\, \gamma(q_{t+1})$ • This recursion makes use of the α variables, and hence the α recursion must be run before the γ recursion. • The data $y_t$ are not used in the γ recursion; the α recursion has already absorbed all the necessary likelihoods.
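
A sketch of this alternative backward pass, reusing the alphas from the forward_backward sketch above; the observations are not needed here because the alphas have already absorbed the likelihood terms.

    import numpy as np

    def gamma_recursion(alpha, A):
        """Alpha-gamma backward pass: gamma[t] = P(q_t | y), computed from the alphas alone."""
        T, M = alpha.shape
        gamma = np.zeros((T, M))
        gamma[-1] = alpha[-1] / alpha[-1].sum()
        for t in range(T - 2, -1, -1):
            joint = alpha[t][:, None] * A      # alpha(q_t) * a_{q_t, q_{t+1}}
            cond = joint / joint.sum(axis=0)   # P(q_t | q_{t+1}, y_0..y_t)
            gamma[t] = cond @ gamma[t + 1]     # sum over q_{t+1}
        return gamma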

  18. Transition matrix • The α-β or α-γ algorithm provides us with the posterior probability of each state. • To estimate the state transition matrix A, we also need the matrix of co-occurrence probabilities $P(q_t, q_{t+1} \mid y)$ • We calculate $\xi(q_t, q_{t+1}) = P(q_t, q_{t+1} \mid y) = \alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})\, \beta(q_{t+1}) / P(y)$ based on the alphas and betas.
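
A sketch of the pairwise posteriors, reusing the outputs of the forward_backward sketch above; these are the co-occurrence statistics needed to re-estimate A.

    import numpy as np

    def pairwise_posteriors(y, A, B, alpha, beta, likelihood):
        """xi[t] = P(q_t, q_{t+1} | y) for a discrete-output HMM."""
        T, M = alpha.shape
        xi = np.zeros((T - 1, M, M))
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A *
                     (B[:, y[t + 1]] * beta[t + 1])[None, :]) / likelihood
        return xi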

  19. Junction tree connection • We can calculate all of the posterior probabilities for an HMM recursively. • Given an observed sequence y, we run the α recursion forward in time. • If we require the likelihood, we simply sum the alphas at the final time step. • If we require posterior probabilities of the states, we use either the β or the γ recursion.

  20. Junction tree connection • The HMM is represented by the multinomial state variable $Q_t$ and the observable output variable $y_t$ • The HMM is parameterized by the initial probability π of the first state node and, for each subsequent state node, a transition matrix A where $a_{ij} = P(q_{t+1} = j \mid q_t = i)$ • The output nodes are assigned the local conditional probability $P(y_t \mid q_t)$. We assume that $y_t$ is a multinomial node so that $P(y_t \mid q_t)$ can be viewed as a matrix B. • To convert the HMM to a junction tree, we moralize, triangulate and form the clique graph. Then we choose a maximal spanning tree, which forms our junction tree.

  21. Junction tree connection • The moralized and triangulated graph • The junction tree for the HMM with its potentials labeled

  22. Junction tree connection • The initial probability π as well as the conditional probability $P(y_0 \mid q_0)$ is assigned to the potential $\psi(q_0, y_0)$, which implies that this potential is initially set to $\pi_{q_0} P(y_0 \mid q_0)$ • The state-to-state potentials are given the assignment $\psi(q_t, q_{t+1}) = a_{q_t, q_{t+1}}$, the output probabilities are assigned to the potentials $\psi(q_t, y_t) = P(y_t \mid q_t)$, and the separator potentials are initialized to one.

  23. Unconditional inference • Let's do inference before any evidence is observed; we designate the clique $(q_{T-1}, q_T)$ as the root and collect to the root. • Consider the first operation of passing a message upward from an output clique $(q_t, y_t)$, $t \ge 1$. • The marginalization yields $\sum_{y_t} \psi(q_t, y_t) = \sum_{y_t} P(y_t \mid q_t) = 1$ • Thus the separator potential remains set at one. • This implies that the update factor is one and thus the clique potential remains unchanged. • In general, the messages passed upward from the leaves have no effect when no evidence is observed.

  24. Unconditional inference • Now consider the message from $(q_0, y_0)$ to $(q_0, q_1)$: the separator potential becomes $\phi^*(q_0) = \sum_{y_0} \pi_{q_0} P(y_0 \mid q_0) = \pi_{q_0}$ • This transformation propagates forward along the chain, changing the separator potentials on $q_t$ into marginals $P(q_t)$ and the clique potentials into marginals $P(q_t, q_{t+1})$ • A subsequent DistributeEvidence pass will have no effect on the potentials along the backbone of the chain, but will convert the separators attached to the output cliques into marginals $P(q_t)$ and the potentials $\psi(q_t, y_t)$ into marginals $P(q_t, y_t)$

  25. Unconditional inference • Thus all potentials throughout the junction tree become marginal probabilities. • This result helps to clarify the representation of the joint probability as the product of the clique potentials divided by the product of the separator potentials.

  26. Junction Tree Algorithm • Moralize if needed • Triangulate using any triangulation algorithm • Formulate the clique graph (clique nodes and separator nodes) • Compute the junction tree • Initialize all separator potentials to be 1 • Phase 1: Collect from children • Phase 2: Distribute to children. Message from child C: $\phi^*(X_S) = \sum_{X_{C \setminus S}} \psi(X_C)$. Update at parent P: $\psi^*(X_P) = \psi(X_P) \prod_S \phi^*(X_S)/\phi(X_S)$. Message from parent P: $\phi^{**}(X_S) = \sum_{X_{P \setminus S}} \psi^{**}(X_P)$. Update at child C: $\psi^*(X_C) = \psi(X_C)\, \phi^{**}(X_S)/\phi^*(X_S)$. (A small worked example follows below.)
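
A small worked example of the collect/distribute updates on a single parent-child pair, using NumPy; the two-variable cliques and the numeric tables are mine and purely illustrative.

    import numpy as np

    # Child clique C = (X_A, X_B), parent clique P = (X_B, X_C), separator S = {X_B}.
    psi_child = np.array([[0.3, 0.7],
                          [0.6, 0.4]])   # psi(X_A, X_B)
    psi_parent = np.array([[0.5, 0.5],
                           [0.2, 0.8]])  # psi(X_B, X_C)
    phi_sep = np.ones(2)                 # separator potential, initialized to one

    # Phase 1 (collect): message from the child, then update at the parent.
    phi_sep_star = psi_child.sum(axis=0)                             # sum out X_A
    psi_parent_star = psi_parent * (phi_sep_star / phi_sep)[:, None]

    # Phase 2 (distribute): message from the parent, then update at the child.
    phi_sep_2star = psi_parent_star.sum(axis=1)                      # sum out X_C
    psi_child_star = psi_child * (phi_sep_2star / phi_sep_star)[None, :]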

  27. Introducing evidence • We now suppose that the outputs y are observed and we wish to compute P(y) as well as marginal posterior probabilities such as $P(q_t \mid y)$ and $P(q_t, q_{t+1} \mid y)$. • Initialize the separator potentials to unity and recall that $\psi(q_t, y_t)$ can be viewed as a matrix B, with columns labeled by the possible values of $y_t$. • In practice, we would set the potential for an observed value $\bar{y}_t$ to the corresponding column of B, i.e. the vector $P(\bar{y}_t \mid q_t)$. • We designate $(Q_{T-1}, Q_T)$ as the root of the junction tree and collect to the root.

  28. Collecting to the root • Consider the update of the clique $(Q_t, Q_{t+1})$; we assume that $\phi^*(q_t)$ has already been updated and consider the computation of $\psi^*(q_t, q_{t+1})$ and $\phi^*(q_{t+1})$ • $\psi^*(q_t, q_{t+1}) = \psi(q_t, q_{t+1})\, \phi^*(q_t)\, \zeta^*(q_{t+1}) = a_{q_t, q_{t+1}}\, \phi^*(q_t)\, P(y_{t+1} \mid q_{t+1})$, where $\zeta^*(q_{t+1})$ is the message arriving from the output clique $(q_{t+1}, y_{t+1})$

  29. Collecting to the root • Proceeding forward along the chain, $\phi^*(q_{t+1}) = \sum_{q_t} \psi^*(q_t, q_{t+1}) = \sum_{q_t} a_{q_t, q_{t+1}}\, \phi^*(q_t)\, P(y_{t+1} \mid q_{t+1})$ • Defining $\alpha(q_t) = \phi^*(q_t)$, we have recovered the alpha algorithm. • The collect phase of the algorithm terminates with the update of $\psi(q_{T-1}, q_T)$. The updated potential equals $P(y_0, \ldots, y_T, q_{T-1}, q_T)$, and thus by marginalizing over $q_{T-1}$ and $q_T$ we get the likelihood $P(y)$.

  30. Collecting to the root • If, instead of designating $(q_{T-1}, q_T)$ as the root, we utilize $(q_0, q_1)$ as the root, we obtain the beta algorithm. • It is not necessary to change the root of the junction tree to derive the beta algorithm: it also arises during the DistributeEvidence pass when $(q_{T-1}, q_T)$ is the root.

  31. Distributing from the root • In the second phase we distribute evidence from the root $(q_{T-1}, q_T)$. • This phase proceeds backwards along the chain, updating the state-state as well as the state-output cliques.

  32. Distributing from the root • We suppose that the separator potential $\phi^{**}(q_{t+1})$ has already been updated and consider the update of $\psi^{**}(q_t, q_{t+1})$ and $\phi^{**}(q_t)$: $\psi^{**}(q_t, q_{t+1}) = \psi^*(q_t, q_{t+1})\, \phi^{**}(q_{t+1}) / \phi^*(q_{t+1})$ and $\phi^{**}(q_t) = \sum_{q_{t+1}} \psi^{**}(q_t, q_{t+1})$ • Simplifying, we obtain the gamma recursion for $\gamma(q_t) = P(q_t \mid y)$: $\gamma(q_t) = \sum_{q_{t+1}} \frac{\alpha(q_t)\, a_{q_t, q_{t+1}}}{\sum_{r_t} \alpha(r_t)\, a_{r_t, q_{t+1}}}\, \gamma(q_{t+1})$

  33. Distributing from the root • By rearranging and simplifying, we can also derive the relationship between the α-β and α-γ recursions: $\gamma(q_t) = \alpha(q_t)\, \beta(q_t) / P(y)$.
