
Hidden Markov Models

Hidden Markov Models. Hsin-Min Wang whm@iis.sinica.edu.tw. References: L. R. Rabiner and B. H. Juang, (1993) Fundamentals of Speech Recognition, Chapter 6; X. Huang et al., (2001) Spoken Language Processing, Chapter 8


Presentation Transcript


  1. Hidden Markov Models Hsin-Min Wang whm@iis.sinica.edu.tw References: L. R. Rabiner and B. H. Juang, (1993) Fundamentals of Speech Recognition, Chapter 6; X. Huang et al., (2001) Spoken Language Processing, Chapter 8; L. R. Rabiner, (1989) “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, vol. 77, no. 2, February 1989

  2. Hidden Markov Model (HMM) • History • Published in Baum’s papers in the late 1960s and early 1970s • Introduced to speech processing by Baker (CMU) and Jelinek (IBM) in the 1970s • Introduced to computational biology in the late 1980s • Lander and Green (1987) used HMMs in the construction of genetic linkage maps • Churchill (1989) employed HMMs to distinguish coding from noncoding regions in DNA

  3. Hidden Markov Model (HMM) • Assumption • Speech signal (DNA sequence) can be characterized as a parametric random process • Parameters can be estimated in a precise, well-defined manner • Three fundamental problems • Evaluation of probability (likelihood) of a sequence of observations given a specific HMM • Determination of a best sequence of model states • Adjustment of model parameters so as to best account for observed signal/sequence

  4. Hidden Markov Model (HMM) • Given an initial model as follows: a 3-state HMM with emission distributions S1:{A:.34, B:.33, C:.33}, S2:{A:.33, B:.34, C:.33}, S3:{A:.33, B:.33, C:.34}, and all transition probabilities roughly uniform (0.33–0.34) • We can train HMMs for the following two classes using their training data respectively. • Training set for class 1: 1. ABBCABCAABC 2. ABCABC 3. ABCA ABC 4. BBABCAB 5. BCAABCCAB 6. CACCABCA 7. CABCABCA 8. CABCA 9. CABCA • Training set for class 2: 1. BBBCCBC 2. CCBABB 3. AACCBBB 4. BBABBAC 5. CCAABBAB 6. BBBCCBAA 7. ABBBBABA 8. CCCCC 9. BBAAA • We can then decide which class the following testing sequences belong to: ABCABCCABAABABCCCCBBB

  5. Probability Theorem • Consider the simple scenario of rolling two dice, labeled die 1 and die 2. Define the following three events: • A: Die 1 lands on 3. • B: Die 2 lands on 1. • C: The dice sum to 8, i.e., C={(2,6), (3,5), (4,4), (5,3), (6,2)}. • Prior probability: P(A)=P(B)=1/6, P(C)=5/36. • Joint probability: A∩B={(3,1)}, so P(A,B) (or P(A∩B)) = 1/36; two events A and B are statistically independent if and only if P(A,B)=P(A)×P(B). • P(B,C)=0 since B∩C=Φ; two events B and C are mutually exclusive if and only if B∩C=Φ, i.e., P(B∩C)=0. • Conditional probability: P(B|A)=P(A,B)/P(A); here P(B|A)=P(B) and P(C|B)=0. • Bayes’ rule: P(A|B)=P(B|A)P(A)/P(B) gives the posterior probability; cf. the maximum likelihood principle.
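The dice probabilities above can be checked by brute-force enumeration; a minimal sketch (the event names A, B, C follow the slide):

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

A = {(d1, d2) for d1, d2 in outcomes if d1 == 3}       # die 1 lands on 3
B = {(d1, d2) for d1, d2 in outcomes if d2 == 1}       # die 2 lands on 1
C = {(d1, d2) for d1, d2 in outcomes if d1 + d2 == 8}  # the dice sum to 8

def P(event):
    return Fraction(len(event), len(outcomes))

print(P(A), P(B), P(C))   # 1/6 1/6 5/36
print(P(A & B))           # 1/36, and P(A)*P(B) = 1/36: A and B are independent
print(P(B & C))           # 0: B and C are mutually exclusive
print(P(A & B) / P(A))    # conditional probability P(B|A) = 1/6 = P(B)
```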

  6. The Markov Chain • A first-order Markov chain: the state at time t depends only on the state at time t-1, i.e., P(qt|q1,q2,…,qt-1)=P(qt|qt-1)

  7. Observable Markov Model (Rabiner 1989) • The parameters of a Markov chain, with N states labeled by {1,…,N} and the state at time t denoted as qt, can be described as aij=P(qt=j|qt-1=i), 1≤i,j≤N; πi=P(q1=i), 1≤i≤N • The output of the process is the set of states at each time instant t, where each state corresponds to an observable event Xi • There is a one-to-one correspondence between the observable sequence and the Markov chain state sequence

  8. The Markov Chain – Ex 1 • A 3-state Markov chain λ [state diagram; the transition probabilities consistent with the computation below and with each row summing to 1 are a11=0.6, a12=0.3, a13=0.1, a21=0.1, a22=0.7, a23=0.2, a31=0.3, a32=0.2, a33=0.5] • State 1 generates symbol A only, State 2 generates symbol B only, State 3 generates symbol C only • Given a sequence of observed symbols O={CABBCABC}, the only corresponding state sequence is Q={S3S1S2S2S3S1S2S3}, and the corresponding probability is P(O|λ)=P(CABBCABC|λ)=P(Q|λ)=P(S3S1S2S2S3S1S2S3|λ) =π(S3)·P(S1|S3)·P(S2|S1)·P(S2|S2)·P(S3|S2)·P(S1|S3)·P(S2|S1)·P(S3|S2) =0.1×0.3×0.3×0.7×0.2×0.3×0.3×0.2=0.00002268
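The chain-rule product above can be verified directly; a small sketch using only the probabilities that appear in the slide’s computation:

```python
# Observable Markov chain from Ex 1: state 1 emits A, state 2 emits B, state 3
# emits C, so the observation CABBCABC fixes the state sequence S3 S1 S2 S2 S3 S1 S2 S3.
pi = {3: 0.1}                                   # only the probabilities used below
a = {(3, 1): 0.3, (1, 2): 0.3, (2, 2): 0.7, (2, 3): 0.2}

path = [3, 1, 2, 2, 3, 1, 2, 3]                 # from CABBCABC
p = pi[path[0]]
for prev, cur in zip(path, path[1:]):
    p *= a[(prev, cur)]

print(p)  # 0.00002268 up to floating-point rounding
```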

  9. The Markov Chain – Ex 2 • A three-state Markov chain for the Dow Jones Industrial average (Huang et al., 2001) • The probability of 5 consecutive up days

  10. Extension to Hidden Markov Models • HMM: an extended version of the Observable Markov Model • The observation is a probabilistic function (discrete or continuous) of a state instead of a one-to-one correspondence with a state • The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden) • What is hidden? The state sequence! Given the observation sequence, we are not sure which state sequence generated it!

  11. Hidden Markov Models – Ex 1 • A 3-state discrete HMM λ • Initial model: emission distributions S1:{A:.3, B:.2, C:.5}, S2:{A:.7, B:.1, C:.2}, S3:{A:.3, B:.6, C:.1}, with the same transition structure as in the Markov chain example • Given an observation sequence O={ABC}, there are 27 possible corresponding state sequences, and therefore the probability P(O|λ) is a sum over all of them: P(O|λ)=ΣQ P(O,Q|λ)

  12. Hidden Markov Models – Ex 2 • Given a three-state Hidden Markov Model for the Dow Jones Industrial average as follows (Huang et al., 2001; cf. the Markov chain): • How to find the probability P(up, up, up, up, up|λ)? • How to find the optimal state sequence of the model which generates the observation sequence “up, up, up, up, up”? (3^5=243 state sequences can generate “up, up, up, up, up”.)

  13. Elements of an HMM • An HMM is characterized by the following: • N, the number of states in the model • M, the number of distinct observation symbols per state • The state transition probability distribution A={aij}, where aij=P[qt+1=j|qt=i], 1≤i,j≤N • The observation symbol probability distribution in state j, B={bj(vk)}, where bj(vk)=P[ot=vk|qt=j], 1≤j≤N, 1≤k≤M • The initial state distribution π={πi}, where πi=P[q1=i], 1≤i≤N • For convenience, we usually use the compact notation λ=(A,B,π) to indicate the complete parameter set of an HMM • Requires specification of two model parameters (N and M)
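As a sketch, λ=(A,B,π) can be written down as plain arrays. The numbers below are the three-state Dow Jones model used in later slides; note that the slides only give bj(up), so the “down”/“unchanged” emission columns here are assumed values:

```python
# A minimal HMM parameter set lambda = (A, B, pi), stored as nested lists.
A = [[0.6, 0.2, 0.2],   # A[i][j] = P(q_{t+1} = j+1 | q_t = i+1)
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],   # B[j][k] = P(o_t = v_k | q_t = j+1), v = (up, down, unchanged)
     [0.1, 0.6, 0.3],   # only the first column (up) is given in the slides;
     [0.3, 0.3, 0.4]]   # the other two columns are filled in as an assumption
pi = [0.5, 0.2, 0.3]    # pi[i] = P(q_1 = i+1)

# Sanity checks: every row of A and B, and pi itself, is a probability distribution.
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
print("N =", len(A), "M =", len(B[0]))  # N = 3 M = 3
```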

  14. Two Major Assumptions for HMM • First-order Markov assumption • The state transition depends only on the origin and destination states • The state transition probability is time invariant: aij=P(qt+1=j|qt=i), 1≤i,j≤N • Output-independent assumption • The observation depends only on the state that generates it, not on its neighboring observations

  15. Three Basic Problems for HMMs • Given an observation sequence O=(o1,o2,…,oT) and an HMM λ=(A,B,π) • Problem 1: How to compute P(O|λ) efficiently? (the Evaluation Problem, e.g., P(up, up, up, up, up|λ)) • Problem 2: How to choose an optimal state sequence Q=(q1,q2,…,qT) which best explains the observations? (the Decoding Problem) • Problem 3: How to adjust the model parameters λ=(A,B,π) to maximize P(O|λ)? (the Learning/Training Problem)

  16. Solution to Problem 1

  17. Solution to Problem 1 - Direct Evaluation • Given O and λ, find P(O|λ)=Pr{observing O given λ} • Evaluate all possible state sequences Q of length T that could generate the observation sequence O: P(O|λ)=ΣQ P(O,Q|λ)=ΣQ P(Q|λ)·P(O|Q,λ) • P(Q|λ): the probability of the path Q • By the first-order Markov assumption, P(Q|λ)=πq1·aq1q2·aq2q3·…·aqT-1qT • P(O|Q,λ): the joint output probability along the path Q • By the output-independent assumption, P(O|Q,λ)=bq1(o1)·bq2(o2)·…·bqT(oT)

  18. Solution to Problem 1 - Direct Evaluation (cont’d) [Trellis diagram: states S1–S3 on the vertical axis, times 1, 2, 3, …, T-1, T on the horizontal axis below observations o1, o2, o3, …, oT-1, oT; a node Sj at time t means bj(ot) has been computed, and a marked arc means the corresponding aij has been computed]

  19. Solution to Problem 1 - Direct Evaluation (cont’d) • A huge computation requirement: O(N^T) (there are N^T state sequences) • Exponential computational complexity • A more efficient algorithm can be used to evaluate P(O|λ) • The Forward Procedure/Algorithm
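Direct evaluation can be written as a literal sum over all N^T paths; a sketch for the Dow Jones model with O=(up, up), using the π, aij, and bj(up) values given in the later slides:

```python
from itertools import product

# Brute-force direct evaluation of P(O|lambda): sum P(Q|lambda) * P(O|Q,lambda)
# over all N^T state sequences. States are indexed 0..2 here.
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
pi = [0.5, 0.2, 0.3]
b_up = [0.7, 0.1, 0.3]          # b_j(up) for each state

T = 2                            # observations: "up up"
total = 0.0
for Q in product(range(3), repeat=T):        # all N^T paths
    p = pi[Q[0]] * b_up[Q[0]]                # pi_{q1} * b_{q1}(o1)
    for t in range(1, T):
        p *= A[Q[t - 1]][Q[t]] * b_up[Q[t]]  # a_{q_{t-1} q_t} * b_{q_t}(o_t)
    total += p

print(total)  # matches the forward-procedure result for P(up, up | lambda)
```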

  20. Solution to Problem 1 - The Forward Procedure • Based on the HMM assumptions, the calculation of P(Q|λ) and P(O|Q,λ) involves only qt-1, qt, and ot, so it is possible to compute the likelihood with recursion on t • Forward variable: αt(i)=P(o1,o2,…,ot,qt=i|λ) • The probability of the joint event that o1,o2,…,ot are observed and the state at time t is i, given the model λ

  21. Solution to Problem 1 - The Forward Procedure (cont’d) • αt+1(j)=P(o1,…,ot+1,qt+1=j|λ)=[Σi=1..N αt(i)·aij]·bj(ot+1) • Summing over the predecessor state i uses the first-order Markov assumption; factoring out bj(ot+1) uses the output-independent assumption

  22. Solution to Problem 1 - The Forward Procedure (cont’d) • α3(2)=P(o1,o2,o3,q3=2|λ)=[α2(1)·a12+α2(2)·a22+α2(3)·a32]·b2(o3) [Trellis diagram, state index vs. time index: the arcs a12, a22, a32 merge the time-2 cells α2(1), α2(2), α2(3) into the time-3 cell for state 2, which is then scaled by b2(o3); a node Sj at time t means bj(ot) has been computed, an arc means aij has been computed]

  23. Solution to Problem 1 - The Forward Procedure (cont’d) • Algorithm • Initialization: α1(i)=πi·bi(o1), 1≤i≤N • Induction: αt+1(j)=[Σi=1..N αt(i)·aij]·bj(ot+1), 1≤t≤T-1, 1≤j≤N • Termination: P(O|λ)=Σi=1..N αT(i) • Complexity: O(N²T), cf. O(N^T) for direct evaluation • Based on the lattice (trellis) structure • Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1 • All state sequences, regardless of how long previously, merge to N nodes (states) at each time instant t

  24. Solution to Problem 1 - The Forward Procedure (cont’d) • A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001), with π1=0.5, π2=0.2, π3=0.3; a11=0.6, a12=0.2, a13=0.2, a21=0.5, a22=0.3, a23=0.2, a31=0.4, a32=0.1, a33=0.5; b1(up)=0.7, b2(up)=0.1, b3(up)=0.3 • α1(1)=0.5×0.7, α1(2)=0.2×0.1, α1(3)=0.3×0.3 • α2(1)=(0.35×0.6+0.02×0.5+0.09×0.4)×0.7 • α2(2)=(0.35×0.2+0.02×0.3+0.09×0.1)×0.1 • α2(3)=(0.35×0.2+0.02×0.2+0.09×0.5)×0.3 • P(up, up|λ)=α2(1)+α2(2)+α2(3)
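The α recursion can be sketched in a few lines; the numbers reproduce the slide’s α1 and α2 values:

```python
# Forward procedure on the Dow Jones HMM from the slides, for O = (up, up).
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
pi = [0.5, 0.2, 0.3]
b_up = [0.7, 0.1, 0.3]          # emission probability of "up" in each state

def forward(obs_probs):
    """obs_probs[t][j] = b_j(o_t); returns the list of alpha vectors."""
    alphas = [[pi[i] * obs_probs[0][i] for i in range(3)]]   # initialization
    for t in range(1, len(obs_probs)):
        prev = alphas[-1]
        alphas.append([
            sum(prev[i] * A[i][j] for i in range(3)) * obs_probs[t][j]  # induction
            for j in range(3)
        ])
    return alphas

alphas = forward([b_up, b_up])
print(alphas[0])       # ≈ [0.35, 0.02, 0.09]
print(alphas[1])       # ≈ [0.1792, 0.0085, 0.0357]
print(sum(alphas[1]))  # termination: P(up, up | lambda) ≈ 0.2234
```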

  25. Solution to Problem 2

  26. Solution to Problem 2 - The Viterbi Algorithm • The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm • Instead of summing probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path • Find a single optimal state sequence Q* • The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

  27. Solution to Problem 2 - The Viterbi Algorithm (cont’d) [Trellis diagram: states S1–S3 vs. times 1, 2, 3, …, T-1, T with observations o1, o2, o3, …, oT-1, oT; only the single best path into each node is kept]

  28. Solution to Problem 2 - The Viterbi Algorithm (cont’d) • Initialization: δ1(i)=πi·bi(o1), ψ1(i)=0, 1≤i≤N • Induction: δt(j)=[max1≤i≤N δt-1(i)·aij]·bj(ot), ψt(j)=argmax1≤i≤N δt-1(i)·aij, 2≤t≤T, 1≤j≤N • Termination: P*=maxi δT(i), qT*=argmaxi δT(i) • Backtracking: qt*=ψt+1(qt+1*), t=T-1,…,1; Q*=(q1*,q2*,…,qT*) is the best state sequence • Complexity: O(N²T)

  29. Solution to Problem 2 - The Viterbi Algorithm (cont’d) • A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001) • δ1(1)=0.5×0.7=0.35, δ1(2)=0.2×0.1=0.02, δ1(3)=0.3×0.3=0.09 • δ2(1)=max(0.35×0.6, 0.02×0.5, 0.09×0.4)×0.7=0.35×0.6×0.7=0.147, ψ2(1)=1 • δ2(2)=max(0.35×0.2, 0.02×0.3, 0.09×0.1)×0.1=0.35×0.2×0.1=0.007, ψ2(2)=1 • δ2(3)=max(0.35×0.2, 0.02×0.2, 0.09×0.5)×0.3=0.35×0.2×0.3=0.021, ψ2(3)=1 • The most likely state sequence that generates “up up”: 1 1
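A corresponding Viterbi sketch for the same model and observation “up up”; the only change from the forward procedure is replacing the sum with a max plus a backpointer:

```python
# Viterbi decoding on the Dow Jones HMM from the slides, for O = (up, up).
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
pi = [0.5, 0.2, 0.3]
b_up = [0.7, 0.1, 0.3]

def viterbi(obs_probs):
    """obs_probs[t][j] = b_j(o_t); returns (best path, 1-based states; its probability)."""
    N = len(pi)
    delta = [pi[i] * obs_probs[0][i] for i in range(N)]      # initialization
    psi = []
    for t in range(1, len(obs_probs)):
        new_delta, back = [], []
        for j in range(N):
            # Keep only the best incoming path, instead of summing as in forward.
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            new_delta.append(delta[best_i] * A[best_i][j] * obs_probs[t][j])
            back.append(best_i)
        delta, psi = new_delta, psi + [back]
    q = max(range(N), key=lambda i: delta[i])                # termination
    path = [q]
    for back in reversed(psi):                               # backtracking
        q = back[q]
        path.append(q)
    path.reverse()
    return [s + 1 for s in path], max(delta)

path, p = viterbi([b_up, b_up])
print(path, p)   # [1, 1] and P* ≈ 0.147, matching the slide
```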

  30. Some Examples

  31. Isolated Digit Recognition [Trellis diagrams: one left-to-right HMM per digit (models for “1” and “0”, each with states S1–S3), evaluated over times 1, 2, 3, …, T-1, T with observations o1, o2, o3, …, oT-1, oT; the observation sequence is scored against each digit model]

  32. Continuous Digit Recognition [Trellis diagram: the models for “1” (states S4–S6) and “0” (states S1–S3) are stacked into one search space over times 1, 2, 3, …, T-1, T with observations o1, o2, o3, …, oT-1, oT, with transitions allowed between the digit models]

  33. Continuous Digit Recognition (cont’d) [Trellis diagram over times 1–9: the stacked digit models (“0”: S1–S3, “1”: S4–S6) decoded against o1…o9] • Best state sequence: S1 S1 S2 S3 S4 S5 S3 S5 S6

  34. CpG Islands Two Questions • Q1: Given a short sequence, does it come from a CpG island? • Q2: Given a long sequence, how would we find the CpG islands in it?

  35. CpG Islands • Answer to Q1: • Given sequence x, probabilistic model M1 of CpG islands, and probabilistic model M2 for non-CpG island regions • Compute p1=P(x|M1) and p2=P(x|M2) • If p1 > p2, then x comes from a CpG island (CpG+) • If p2 > p1, then x does not come from a CpG island (CpG-) [Diagram: two 4-state Markov models over S1:A, S2:C, S3:T, S4:G; the CpG-island model has a large C→G transition probability, the non-island model a small one]

  36. CpG Islands • Answer to Q2: use a two-state HMM with hidden states CpG+ and CpG- • Transition probabilities: p11=0.99999, p12=0.00001, p21=0.0001, p22=0.9999 • Emission distributions for the two states: {A: 0.3, C: 0.2, G: 0.2, T: 0.3} and {A: 0.2, C: 0.3, G: 0.3, T: 0.2} • Hidden: S1 S1 S1 S2 S2 S1 S2 S2 S1 … • Observable: A C T C G A G T A …
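A decoding sketch for Q2 with the slide’s two-state model. Since the slide layout does not fix which emission table belongs to which state, the assignment below (CpG+ as the C/G-rich state) and the uniform initial distribution are assumptions:

```python
from math import log

# Two-state CpG HMM: '+' = CpG island, '-' = background.
trans = {('+', '+'): 0.99999, ('+', '-'): 0.00001,
         ('-', '+'): 0.0001,  ('-', '-'): 0.9999}
emit = {'+': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},   # assumed C/G-rich state
        '-': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}
start = {'+': 0.5, '-': 0.5}   # the slide gives no initial probabilities; assumed uniform

def decode(seq):
    """Viterbi decoding in log space (avoids underflow on long sequences)."""
    states = ['+', '-']
    delta = {s: log(start[s]) + log(emit[s][seq[0]]) for s in states}
    back = []
    for ch in seq[1:]:
        new, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: delta[p] + log(trans[(p, s)]))
            new[s] = delta[best] + log(trans[(best, s)]) + log(emit[s][ch])
            ptr[s] = best
        delta, back = new, back + [ptr]
    s = max(states, key=lambda x: delta[x])
    path = [s]
    for ptr in reversed(back):   # backtracking
        s = ptr[s]
        path.append(s)
    return ''.join(reversed(path))

print(decode("ACTCGAGTA"))
```

With such sticky transitions, switching states costs far more than any per-symbol emission difference, so short sequences decode to a single state; island boundaries only appear in much longer sequences with long C/G-rich runs.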

  37. A Toy Example: 5’ Splice Site Recognition • 5’ splice site indicates the “switch” from an exon to an intron • Assumptions: • Uniform base composition on average in exons (25% each base) • Introns are A/T rich (40% A/T, and 10% C/G) • The 5’SS consensus nucleotide is almost always a G (say, 95% G and 5% A) From “What is a hidden Markov Model?”, by Sean R. Eddy

  38. A Toy Example: 5’ Splice Site Recognition

  39. Solution to Problem 3

  40. Solution to Problem 3 – Maximum Likelihood Estimation of Model Parameters • How to adjust (re-estimate) the model parameters λ=(A,B,π) to maximize P(O|λ)? • The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in closed form • The data is incomplete because of the hidden state sequence • The problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm • The EM (Expectation-Maximization) algorithm is perfectly suited to this problem • Alternatively, it can be solved by the iterative segmental K-means algorithm • The model parameters are adjusted to maximize P(O,Q*|λ), where Q* is the state sequence given by the Viterbi algorithm • Provides a good initialization for Baum-Welch training

  41. Solution to Problem 3 – The Segmental K-means Algorithm • Assume that we have a training set of observations and an initial estimate of model parameters • Step 1: Segment the training data: the set of training observation sequences is segmented into states, based on the current model, by the Viterbi Algorithm • Step 2: Re-estimate the model parameters from the segmented data • Step 3: Evaluate the model: if the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return

  42. Solution to Problem 3 – The Segmental K-means Algorithm (cont’d) • Example: 3 states and 2 codewords (A, B) [trellis: training observation sequences O1…O10 segmented into states s1–s3 over times 1–10] • Re-estimated parameters: π1=1, π2=π3=0 • a11=3/4, a12=1/4 • a22=2/3, a23=1/3 • a33=1 • b1(A)=3/4, b1(B)=1/4 • b2(A)=1/3, b2(B)=2/3 • b3(A)=2/3, b3(B)=1/3 • What if the training data is labeled?
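The re-estimation step is pure counting once every frame is labeled with a state. A sketch on a made-up segmentation (the slide’s actual Viterbi alignment is not recoverable from the transcript, so the frame labels below are hypothetical):

```python
from collections import Counter

# Hypothetical labeled training data: one (state, observed codeword) pair per frame.
frames = [(1, 'A'), (1, 'B'), (1, 'A'), (1, 'A'),
          (2, 'B'), (2, 'B'), (2, 'A'),
          (3, 'A'), (3, 'B'), (3, 'A')]

trans = Counter((frames[t][0], frames[t + 1][0]) for t in range(len(frames) - 1))
emit = Counter(frames)
state_total = Counter(s for s, _ in frames)

def a(i, j):
    """a_ij = count(i -> j) / count(all transitions out of i)."""
    out = sum(c for (p, q), c in trans.items() if p == i)
    return trans[(i, j)] / out

def b(j, v):
    """b_j(v) = count(state j emits v) / count(frames in state j)."""
    return emit[(j, v)] / state_total[j]

print(a(1, 1), a(1, 2))      # 0.75 0.25 (state 1 stays 3 times, leaves once)
print(b(2, 'A'), b(2, 'B'))  # state 2 emits B, B, A -> 1/3 and 2/3
```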

  43. Solution to Problem 3 – The Backward Procedure • Backward variable: βt(i)=P(ot+1,ot+2,…,oT|qt=i,λ) • The probability of the partial observation sequence ot+1,ot+2,…,oT, given state i at time t and the model λ • Example: β2(3)=P(o3,o4,…,oT|q2=3,λ)=a31·b1(o3)·β3(1)+a32·b2(o3)·β3(2)+a33·b3(o3)·β3(3) [Trellis diagram: the arcs a31, a32, a33 fan out from state 3 at time 2 to the time-3 cells, each weighted by bj(o3) and β3(j)]

  44. Solution to Problem 3 – The Backward Procedure (cont’d) • Algorithm • Initialization: βT(i)=1, 1≤i≤N • Induction: βt(i)=Σj=1..N aij·bj(ot+1)·βt+1(j), t=T-1,…,1, 1≤i≤N • cf. the forward procedure: P(O|λ)=Σi=1..N πi·bi(o1)·β1(i)
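A backward-procedure sketch for the same Dow Jones model and O=(up, up); computing P(O|λ) from the βs must reproduce the forward-procedure result:

```python
# Backward procedure on the Dow Jones HMM from the slides, for O = (up, up).
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
pi = [0.5, 0.2, 0.3]
b_up = [0.7, 0.1, 0.3]

def backward(obs_probs):
    """obs_probs[t][j] = b_j(o_t); returns the list of beta vectors."""
    T = len(obs_probs)
    betas = [[1.0, 1.0, 1.0]]                 # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        nxt = betas[0]
        betas.insert(0, [
            sum(A[i][j] * obs_probs[t + 1][j] * nxt[j] for j in range(3))  # induction
            for i in range(3)
        ])
    return betas

betas = backward([b_up, b_up])
likelihood = sum(pi[i] * b_up[i] * betas[0][i] for i in range(3))
print(likelihood)   # same value the forward procedure gives for P(up, up | lambda)
```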

  45. Solution to Problem 3 – The Forward-Backward Algorithm • Relation between the forward and backward variables: αt(i)·βt(i)=P(O,qt=i|λ), so P(O|λ)=Σi=1..N αt(i)·βt(i) for any t (Huang et al., 2001)

  46. Solution to Problem 3 – The Forward-Backward Algorithm (cont’d)

  47. Solution to Problem 3 – The Intuitive View • Define two new variables: • γt(i)=P(qt=i|O,λ) • Probability of being in state i at time t, given O and λ • ξt(i,j)=P(qt=i,qt+1=j|O,λ) • Probability of being in state i at time t and state j at time t+1, given O and λ

  48. Solution to Problem 3 – The Intuitive View (cont’d) • P(q3=1,O|λ)=α3(1)·β3(1) [Trellis diagram: the forward probability α3(1) covers o1…o3 up to state 1 at time 3, and the backward probability β3(1) covers o4…oT from it]

  49. Solution to Problem 3 – The Intuitive View (cont’d) • P(q3=1,q4=3,O|λ)=α3(1)·a13·b3(o4)·β4(3) [Trellis diagram: α3(1) up to state 1 at time 3, then the transition a13 with emission b3(o4), then β4(3) onward]

  50. Solution to Problem 3 – The Intuitive View (cont’d) • ξt(i,j)=P(qt=i,qt+1=j|O,λ)=αt(i)·aij·bj(ot+1)·βt+1(j)/P(O|λ) • γt(i)=P(qt=i|O,λ)=αt(i)·βt(i)/P(O|λ)=Σj=1..N ξt(i,j)
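Given one forward and one backward pass, γ and ξ follow directly from the formulas above; a sketch on the Dow Jones model with O=(up, up):

```python
# Computing gamma and xi from the forward and backward variables,
# on the Dow Jones HMM with O = (up, up).
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
pi = [0.5, 0.2, 0.3]
b_up = [0.7, 0.1, 0.3]
obs = [b_up, b_up]               # obs[t][j] = b_j(o_t)

# Forward pass (T = 2).
alpha = [[pi[i] * obs[0][i] for i in range(3)]]
alpha.append([sum(alpha[0][i] * A[i][j] for i in range(3)) * obs[1][j]
              for j in range(3)])
# Backward pass.
beta = [None, [1.0, 1.0, 1.0]]   # beta_T(i) = 1
beta[0] = [sum(A[i][j] * obs[1][j] * beta[1][j] for j in range(3)) for i in range(3)]

likelihood = sum(alpha[1])       # P(O | lambda)
# gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|lambda)
gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(3)] for t in range(2)]
# xi_1(i, j) = alpha_1(i) * a_ij * b_j(o_2) * beta_2(j) / P(O|lambda)
xi = [[alpha[0][i] * A[i][j] * obs[1][j] * beta[1][j] / likelihood for j in range(3)]
      for i in range(3)]

print(gamma[0])                      # a distribution over states: sums to 1
print(sum(sum(row) for row in xi))   # also sums to 1 over all (i, j)
```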
