
Hidden Markov Models


Presentation Transcript


  1. Hidden Markov Models BIOL337/STAT337/437 Spring Semester 2014

  2. An HMM [trellis diagram: states 1, 2, …, K at each of n stages; hidden state path π1, π2, π3, …, πn emitting the symbols x1, x2, x3, …, xn] • Theory of hidden Markov models (HMMs) • Probabilistic interpretation of sequence alignments using HMMs • Applications of HMMs to biological sequence modeling and discovery of features such as genes

  3. Example of an HMM The Dishonest Casino Do you want to play?

  4. The situation... • Casino has two dice, one fair (F) and one loaded (L) • Probabilities for the fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • Probabilities for the loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10; P(6) = ½ • Before each roll, the casino player switches from the fair die to the loaded die (or vice versa) with probability 1/20 The game... • You bet $1 • You roll (always with the fair die) • Casino player rolls (maybe with the fair die, maybe with the loaded die) • Player who rolls the highest number wins $2

  5. Dishonest Casino HMM [two-state diagram: FAIR and LOADED, self-transition probability 0.95 for each state, switch probability 0.05 in each direction] P(1 | F) = 1/6, P(2 | F) = 1/6, P(3 | F) = 1/6, P(4 | F) = 1/6, P(5 | F) = 1/6, P(6 | F) = 1/6; P(1 | L) = 1/10, P(2 | L) = 1/10, P(3 | L) = 1/10, P(4 | L) = 1/10, P(5 | L) = 1/10, P(6 | L) = 1/2
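To keep the later slides concrete, here is a minimal Python sketch of this model (the names states, start_p, trans_p, and emit_p are ours, not from the slides); the starting probabilities of ½ each are the ones slide 18 uses.

```python
# Dishonest-casino HMM as plain dictionaries.
states = ['F', 'L']                       # 'F' = fair die, 'L' = loaded die

start_p = {'F': 0.5, 'L': 0.5}            # starting probabilities a_0j (slide 18)

trans_p = {                               # transition probabilities a_ij
    'F': {'F': 0.95, 'L': 0.05},
    'L': {'F': 0.05, 'L': 0.95},
}

emit_p = {                                # emission probabilities e_k(b) for faces 1..6
    'F': {b: 1 / 6 for b in range(1, 7)},
    'L': {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}
```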

  6. Three Fundamental Questions About Any HMM • Evaluation: What is the probability of a sequence of outputs of an HMM? • Decoding: Given a sequence of outputs of an HMM, what is the most probable sequence of states that the HMM went through to produce the output? • Learning: Given a sequence of outputs of an HMM, how do we estimate the parameters of the model?

  7. Evaluation Question Suppose the casino player rolls the following sequence... 1 2 4 5 5 2 6 4 6 2 1 4 6 1 4 6 1 3 6 1 3 6 6 6 1 6 6 4 6 6 1 6 3 6 6 1 6 3 6 6 1 6 3 6 1 6 5 1 5 6 1 5 1 1 5 1 4 6 1 2 3 5 6 2 3 4 4 Probability = 1.3 × 10^-35 (Note: (1/6)^67 = 7.309139054 × 10^-53.) How likely is this sequence given our model of how the casino operates?

  8. Decoding Question 1 2 4 5 5 2 6 4 6 2 1 4 6 1 4 6 1 3 6 1 3 6 6 6 1 6 6 4 6 6 1 6 3 6 6 1 6 3 6 6 1 6 3 6 1 6 5 1 5 6 1 5 1 1 5 1 4 6 1 2 3 5 6 2 3 4 4 [annotation: some stretches labeled FAIR, others LOADED] What portion of the sequence was generated with the fair die, and what portion with the loaded die?

  9. Learning Question 1 2 4 5 5 2 6 4 6 2 1 4 6 1 4 6 1 3 6 1 3 6 6 6 1 6 6 4 6 6 1 6 3 6 6 1 6 3 6 6 1 6 3 6 1 6 5 1 5 6 1 5 1 1 5 1 4 6 1 2 3 5 6 2 3 4 4 [annotation: a LOADED stretch with P(6) = 0.64, a FAIR stretch] How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? That is, what are the parameters of the model?

  10. Ingredients of an HMM An HMM M has the following parts... • Alphabet Σ = {b1, b2, ..., bm} //these symbols are output by the model, for example, Σ = {1,2,3,4,5,6} in the dishonest casino model • Set of states Q = {1, 2, ..., K} //for example, ‘FAIR’ and ‘LOADED’ • Transition probabilities between states: ai,j = probability of making a transition from state i to state j, with Σ_{j=1..K} ai,j = 1 for each state i • Starting probabilities: a0j = probability of the model starting in state j, with Σ_{j=1..K} a0j = 1

  11. [Diagram: state k with outgoing transitions ak,1, ak,2, …, ak,K and self-transition ak,k] • Emission probabilities within state k: ek(b) = probability of seeing (emitting) the symbol b while in state k, that is, ek(b) = P(xi = b | πi = k); for example, ek(b1) = P(xi = b1 | πi = k), ek(b2) = P(xi = b2 | πi = k), ..., ek(bm) = P(xi = bm | πi = k)

  12. Some Notation • π = (π1, π2, ..., πt), where πi = state occupied after i steps • x1, x2, ..., xt, π1, π2, ..., πt, where πi = state occupied after i steps and xi = symbol emitted while in state πi

  13. [Trellis diagram illustrating a sequence x1,x2,...,xn,π1,π2,...,πn: states 1, 2, …, K at each of n stages, with the state path emitting x1, x2, ..., xn]

  14. ‘Forgetfulness’ property: if the current state is πt, the next state πt+1 depends only on πt and nothing else: P(πt+1 = k | ‘whatever happened so far’) = P(πt+1 = k | x1,x2,…,xt,π1,π2,…,πt) = P(πt+1 = k | πt) The ‘forgetfulness’ property is part of the definition of an HMM!

  15. What is the probability of x1,x2,...,xn,π1,π2,...,πn? Example: Take n = 2. P(x1,x2,π1,π2) = P(x1,π1,x2,π2) = P(x2,π2 | x1,π1)·P(x1,π1) (conditional probability) = P(x2 | π2)·P(π2 | x1,π1)·P(x1,π1) (conditional probability) = P(x2 | π2)·P(π2 | π1)·P(x1,π1) (‘forgetfulness’) = P(x2 | π2)·P(π2 | π1)·P(x1 | π1)·P(π1) (conditional probability) = eπ2(x2)·aπ1π2·eπ1(x1)·a0π1

  16. In general, for the sequence x1,x2,...,xn,π1,π2,...,πn, we have that P(x1,...,xn,π1,...,πn) = Π_{k=1..n} eπk(xk)·aπk-1,πk (where aπ0,π1 denotes the starting probability a0,π1). [Diagram: a sample path through states 2 → 1 → K → … → 2 emitting x1, x2, x3, …, xn with probabilities e2(x1), e1(x2), eK(x3), …, e2(xn)]
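To check this formula numerically on the casino model, here is a small sketch (the function name joint_prob is ours) that multiplies the starting, transition, and emission probabilities along a given state path; it reuses the dictionaries defined after slide 5.

```python
def joint_prob(x, path, start_p, trans_p, emit_p):
    """P(x, path) = a_{0,pi1}·e_{pi1}(x1) · product over k >= 2 of a_{pi(k-1),pik}·e_{pik}(xk)."""
    p = start_p[path[0]] * emit_p[path[0]][x[0]]
    for k in range(1, len(x)):
        p *= trans_p[path[k - 1]][path[k]] * emit_p[path[k]][x[k]]
    return p
```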

  17. The Dishonest Casino (cont.) x = 1 2 1 5 6 2 1 5 2 4 π = F F F F F F F F F F Do you want to play?

  18. Suppose the starting probabilities are as follows: a0F = ½, a0L = ½ P(x,π) = a0F·eF(1)·aFF·eF(2)·aFF·eF(1)·aFF·eF(5)·aFF·eF(6)·aFF·eF(2)·aFF·eF(1)·aFF·eF(5)·aFF·eF(2)·aFF·eF(4) = ½·(1/6)^10·(0.95)^9 = .00000000521158647211 ≈ 5.21 × 10^-9 Now suppose that π = L L L L L L L L L L P(x,π) = a0L·eL(1)·aLL·eL(2)·aLL·eL(1)·aLL·eL(5)·aLL·eL(6)·aLL·eL(2)·aLL·eL(1)·aLL·eL(5)·aLL·eL(2)·aLL·eL(4) = ½·(1/2)^1·(1/10)^9·(0.95)^9 = .00000000015756235243 ≈ 0.16 × 10^-9

  19. x = 1 6 6 5 6 2 6 6 3 6 π = F F F F F F F F F F P(x,π) = a0F·eF(1)·aFF·eF(6)·aFF·eF(6)·aFF·eF(5)·aFF·eF(6)·aFF·eF(2)·aFF·eF(6)·aFF·eF(6)·aFF·eF(3)·aFF·eF(6) = ½·(1/6)^10·(0.95)^9 = .00000000521158647211 ≈ 5.21 × 10^-9 Now suppose that π = L L L L L L L L L L P(x,π) = a0L·eL(1)·aLL·eL(6)·aLL·eL(6)·aLL·eL(5)·aLL·eL(6)·aLL·eL(2)·aLL·eL(6)·aLL·eL(6)·aLL·eL(3)·aLL·eL(6) = ½·(1/2)^6·(1/10)^4·(0.95)^9 = .00000049238235134735 ≈ 492 × 10^-9 (about 100 times more likely!)
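The joint_prob sketch above should reproduce both of these slide values (the numbers in the comments are the slides' own results, rounded):

```python
x = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
print(joint_prob(x, ['F'] * 10, start_p, trans_p, emit_p))  # ~5.21e-09 (all fair)
print(joint_prob(x, ['L'] * 10, start_p, trans_p, emit_p))  # ~4.92e-07 (all loaded, ~100x larger)
```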

  20. Three Fundamental Problems • Evaluation Problem (solved by Forward-Backward algorithm) GIVEN: an HMM M and a sequence x FIND: P(x | M) • Decoding Problem (solved by Viterbi algorithm) GIVEN: an HMM M and a sequence x FIND: sequence π of states that maximizes P(x,π | M) • Learning Problem (solved by Baum-Welch algorithm) GIVEN: an HMM M, with unspecified transition/emission probabilities, and a sequence x FIND: parameter vector θ = (ei(·), aij) that maximizes P(x | θ)

  21. Let’s not get confused by notation... (lots of different ones!) P(x | M): probability that x is generated by the model M (the model M consists of the transition and emission probabilities and the architecture of the HMM, that is, the underlying directed graph) P(x | θ): probability that x is generated by the model M, where θ is the vector of parameter values, that is, the transition and emission probabilities (note that P(x | M) is equivalent to P(x | θ)) P(x): same as P(x | M) and P(x | θ) P(x,π | M), P(x,π | θ) and P(x,π): probability that x is generated by the model and π is the sequence of states that produced x

  22. Decoding Problem for HMMs [Trellis diagram: state path π = π1,π2,…,πn through states 1..K, emitting x1, x2, x3, …, xn] GIVEN: Sequence x = x1x2…xn generated by the model M FIND: Path π = π1π2…πn that maximizes P(x,π)

  23. Recall that P(x1,...,xn,π1,...,πn) = Π_{k=1..n} eπk(xk)·aπk-1,πk. Formally, the Decoding Problem for HMMs is to find the following, given a sequence x = x1x2…xn generated by the model M: π* = argmax_π {P(x,π | M)} and P* = max_π {P(x,π | M)} = P(x,π* | M)

  24. Let Vk(i) denote the probability of the maximum-probability path from stage 1 to stage i that ends in state k and generates xi in state k. Can we write an equation (hopefully recursive) for Vk(i)? Vk(i) = max_{π1,...,πi-1} {P(x1,...,xi-1,π1,...,πi-1,xi,πi = k | M)} = ek(xi)·max_j {ajk·Vj(i-1)} (proof using properties of conditional probabilities...) Recursive equation... so... dynamic programming!

  25. [Diagram: at stage i, state k receives the stage-(i – 1) values V1(i-1), V2(i-1), …, VK(i-1) via the transitions a1k, a2k, …, aKk, and emits xi with probability ek(xi)]

  26. How do we start the algorithm? [Diagram: a ‘dummy’ 0th stage before stages 1, 2, 3, …, n, with V0(0) = 1 and V1(0) = V2(0) = … = VK(0) = 0, connected to stage 1 by the starting probabilities a01, a02, …, a0K] Initialization step is needed to start the algorithm! (‘dummy’ 0th stage)

  27. Viterbi Algorithm for Decoding an HMM
1. Initialization Step. //initialize matrix
   V0(0) = 1; Vj(0) = 0 for all j > 0
2. Main Iteration. //fill in table
   for each i = 1 to n
     for each k = 1 to K
       Vk(i) = ek(xi)·max_j {ajk·Vj(i-1)}
       Ptrk(i) = argmax_j {ajk·Vj(i-1)}
3. Termination. //recover optimal probability and path
   P* = max_π {P(x,π | M)} = max_j {Vj(n)}
   πn* = argmax_j {Vj(n)}
   for each i = n – 1 downto 1
     πi* = Ptr_{π(i+1)*}(i+1)
Time: O(K²n) Space: O(Kn)
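The pseudocode above maps directly onto Python. Here is a minimal sketch (the function name viterbi and its return convention are ours), reusing the casino dictionaries defined earlier; it folds the dummy 0th stage into the initialization of the first column.

```python
def viterbi(x, states, start_p, trans_p, emit_p):
    """Return (P*, most probable state path) for the observation sequence x."""
    # V[i][k] = probability of the best path that ends in state k after emitting x[0..i]
    V = [{k: start_p[k] * emit_p[k][x[0]] for k in states}]
    ptr = [{}]                      # ptr[i][k] = best predecessor of state k at stage i

    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for k in states:
            best_j = max(states, key=lambda j: trans_p[j][k] * V[i - 1][j])
            ptr[i][k] = best_j
            V[i][k] = emit_p[k][x[i]] * trans_p[best_j][k] * V[i - 1][best_j]

    # Termination: pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return V[-1][last], list(reversed(path))
```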

  28. [Diagram of the K × n dynamic-programming matrix, rows = states 1..K, columns = symbols x1, x2, x3, …, xn: the first column holds V1(1) = e1(x1)·a01, V2(1) = e2(x1)·a02, …, VK(1) = eK(x1)·a0K; the second column is filled using Vk(2) = ek(x2)·max_j {ajk·Vj(1)}, and so on]

  29. Computational Example of the Viterbi Algorithm [Dishonest Casino HMM as on slide 5: states FAIR and LOADED, self-transitions 0.95, switches 0.05; P(1 | F) = … = P(6 | F) = 1/6, P(1 | L) = … = P(5 | L) = 1/10, P(6 | L) = 1/2] x = 2 6 6

  30. x = 2 6 6 [Viterbi dynamic-programming table with rows F, L and columns 2, 6, 6] P* = 0.01128125, π* = LLL P(266,LLL) = (1/2)^3·(1/10)·(95/100)^2 = 361/32000 = 0.01128125
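Running the viterbi sketch on this example should reproduce the slide's answer:

```python
p_star, path = viterbi([2, 6, 6], states, start_p, trans_p, emit_p)
print(p_star, path)   # expected: ~0.01128125, ['L', 'L', 'L']
```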

  31. How well does Viterbi perform? On 300 rolls by the casino, Viterbi is correct 91% of the time!

  32. Problem of Underflow in the Viterbi Algorithm Vk(i) = max_{π1,...,πi-1} {P(x1,...,xi-1,π1,...,πi-1,xi,πi = k | M)} = ek(xi)·max_j {ajk·Vj(i-1)} Numbers become very small since probabilities are being multiplied together! Compute with the logarithms of the probabilities to reduce the occurrence of underflow! Vk(i) = log(ek(xi)) + max_j {log(ajk) + Vj(i-1)}
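Applied to the earlier viterbi sketch, this fix only changes the scoring; the back-pointers are recovered exactly as before. A sketch assuming all model probabilities are nonzero, as they are in the casino example:

```python
import math

def viterbi_log(x, states, start_p, trans_p, emit_p):
    """Viterbi in log space: returns (log P*, most probable state path)."""
    logV = [{k: math.log(start_p[k]) + math.log(emit_p[k][x[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        logV.append({})
        ptr.append({})
        for k in states:
            best_j = max(states, key=lambda j: math.log(trans_p[j][k]) + logV[i - 1][j])
            ptr[i][k] = best_j
            logV[i][k] = (math.log(emit_p[k][x[i]])
                          + math.log(trans_p[best_j][k]) + logV[i - 1][best_j])
    last = max(states, key=lambda k: logV[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return logV[-1][last], list(reversed(path))
```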

  33. Three Fundamental Problems • Evaluation Problem (solved by Forward-Backward algorithm) GIVEN: an HMM M and a sequence x FIND: P(x | M) • Decoding Problem (solved by Viterbi algorithm) GIVEN: an HMM M and a sequence x FIND: sequence π of states that maximizes P(x,π | M) • Learning Problem (solved by Baum-Welch algorithm) GIVEN: an HMM M, with unspecified transition/emission probabilities, and a sequence x FIND: parameter vector θ = (ei(·), aij) that maximizes P(x | θ) Done!

  34. Evaluation Problem for HMMs [Trellis diagram: state path π1, π2, …, πn through states 1..K, emitting x1, x2, x3, …, xn] GIVEN: Sequence x = x1x2…xn generated by the model M FIND: P(x), the probability of x given the model M

  35. Formally, the Evaluation Problem for HMMs is to find the following given a sequence x = x1x2…xn generated by the model M: P(x) = Σ_π P(x,π | M) = Σ_π P(x | π)·P(π) Exponential number of paths π! Since there are an exponential number of paths, specifically K^n, the probability P(x) cannot be computed efficiently by summing over paths directly. So... dynamic programming again!

  36. Let fk(i) denote the probability of emitting the prefix x1,x2,...,xi of x and ending in state k at step i, that is, with πi = k. The quantity fk(i) is called the forward probability. Can we write an equation (hopefully recursive) for fk(i)? fk(i) = Σ_{π1,...,πi-1} P(x1,...,xi-1,π1,...,πi-1,xi,πi = k | M) = ek(xi)·Σ_j {ajk·fj(i-1)} (proof using properties of conditional probabilities...) Recursive equation suitable for dynamic programming

  37. [Diagram: at stage i, state k receives the stage-(i – 1) values f1(i-1), f2(i-1), …, fK(i-1) via the transitions a1k, a2k, …, aKk, and emits xi with probability ek(xi)]

  38. How do we start the algorithm? [Diagram: a ‘dummy’ 0th stage before stages 1, 2, 3, …, n, with f0(0) = 1 and f1(0) = f2(0) = … = fK(0) = 0, connected to stage 1 by the starting probabilities a01, a02, …, a0K] Initialization step is needed to start the algorithm! (‘dummy’ 0th stage)

  39. Forward Algorithm for Evaluation
1. Initialization Step. //initialize matrix
   f0(0) = 1; fj(0) = 0 for all j > 0
2. Main Iteration. //fill in table
   for each i = 1 to n
     for each k = 1 to K
       fk(i) = ek(xi)·Σ_j ajk·fj(i-1)
3. Termination. //recover probability of x
   P(x) = Σ_j fj(n)
Time: O(K²n) Space: O(Kn)
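A minimal Python counterpart of this pseudocode (the function name forward is ours), with the dummy 0th stage again folded into the first column as in the Viterbi sketch:

```python
def forward(x, states, start_p, trans_p, emit_p):
    """Return (P(x), table of forward probabilities f[i][k])."""
    f = [{k: start_p[k] * emit_p[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({
            k: emit_p[k][x[i]] * sum(trans_p[j][k] * f[i - 1][j] for j in states)
            for k in states
        })
    return sum(f[-1].values()), f
```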

  40. [Diagram of the K × n dynamic-programming matrix, rows = states 1..K, columns = symbols x1, x2, x3, …, xn: the first column holds f1(1) = e1(x1)·a01, f2(1) = e2(x1)·a02, …, fK(1) = eK(x1)·a0K; the second column is filled using fk(2) = ek(x2)·Σ_j {ajk·fj(1)}, and so on]

  41. Viterbi vs. Forward
Viterbi:
1. Initialization Step. //initialize matrix
   V0(0) = 1; Vj(0) = 0 for all j > 0
2. Main Iteration. //fill in table
   for each i = 1 to n
     for each k = 1 to K
       Vk(i) = ek(xi)·max_j {ajk·Vj(i-1)}
       Ptrk(i) = argmax_j {ajk·Vj(i-1)}
3. Termination. //recover optimal probability and path
   P* = max_π {P(x,π | M)} = max_j {Vj(n)}
   πn* = argmax_j {Vj(n)}
   for each i = n – 1 downto 1
     πi* = Ptr_{π(i+1)*}(i+1)
Forward:
1. Initialization Step. //initialize matrix
   f0(0) = 1; fj(0) = 0 for all j > 0
2. Main Iteration. //fill in table
   for each i = 1 to n
     for each k = 1 to K
       fk(i) = ek(xi)·Σ_j ajk·fj(i-1)
3. Termination. //recover probability
   P(x) = Σ_j fj(n)

  42. x = 2 6 6 [Forward dynamic-programming table with rows F, L and columns 2, 6, 6] P(x) = P(266) = 8/3375 + 227/18000 = 809/54000 = 0.01498148148

  43. Checking, we see that for x = 266, P(x) = Σ_π P(x,π | M) =
(1/2·1/10 · (95/100)·(1/2) · (95/100)·(1/2))   [π = LLL]
+ (1/2·1/10 · (95/100)·(1/2) · (1/20)·(1/6))   [π = LLF]
+ (1/2·1/10 · (1/20)·(1/6) · (1/20)·(1/2))     [π = LFL]
+ (1/2·1/10 · (1/20)·(1/6) · (95/100)·(1/6))   [π = LFF]
+ (1/2·1/6 · (1/20)·(1/2) · (95/100)·(1/2))    [π = FLL]
+ (1/2·1/6 · (1/20)·(1/2) · (1/20)·(1/6))      [π = FLF]
+ (1/2·1/6 · (95/100)·(1/6) · (1/20)·(1/2))    [π = FFL]
+ (1/2·1/6 · (95/100)·(1/6) · (95/100)·(1/6))  [π = FFF]
= 0.01498148148
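The same check can be automated by enumerating all K^n paths with the joint_prob sketch from earlier; this brute-force sum should agree with the forward sketch (the name evaluate_brute_force and the itertools usage are ours):

```python
from itertools import product

def evaluate_brute_force(x, states, start_p, trans_p, emit_p):
    """P(x) = sum of P(x, path) over all K^n state paths (feasible only for tiny n)."""
    return sum(joint_prob(x, path, start_p, trans_p, emit_p)
               for path in product(states, repeat=len(x)))

print(evaluate_brute_force([2, 6, 6], states, start_p, trans_p, emit_p))  # ~0.01498148
print(forward([2, 6, 6], states, start_p, trans_p, emit_p)[0])            # same value
```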

  44. Backward Algorithm: Motivation Suppose we wish to compute the probability that the ith state is k given the observed sequence of outputs x. (Notice that we would then know the distribution of the random variable πi.) That is, we must compute P(πi = k | x) = P(πi = k, x) / P(x) We start by computing P(πi = k, x) = P(x1,...,xi,πi = k,xi+1,...,xn) = P(x1,...,xi,πi = k) · P(xi+1,...,xn | x1,...,xi,πi = k), where the first factor is the forward probability fk(i) and the second factor is the new quantity bk(i)

  45. So then, we have the following equation. P(πi = k | x) = P(πi = k , x) / P(x) = fk(i)·bk(i) / P(x) The quantity bk(i) is called the backward probability and is defined by bk(i) = P(xi+1,...,xn | πi = k)

  46. Can we write an equation (hopefully recursive) for bk(i)? bk(i) = P(xi+1,...,xn | πi = k) = Σ_j ej(xi+1)·akj·bj(i+1) (proof using properties of conditional probabilities...) Recursive equation suitable for dynamic programming

  47. Backward Algorithm for Evaluation
1. Initialization Step. //initialize matrix
   bj(n) = 1 for all j > 0
2. Main Iteration. //fill in table
   for each i = n - 1 downto 1
     for each k = 1 to K
       bk(i) = Σ_j ej(xi+1)·akj·bj(i+1)
3. Termination. //recover probability of x
   P(x) = Σ_j ej(x1)·bj(1)·a0j
Time: O(K²n) Space: O(Kn)
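And a matching Python sketch for the backward pass (the function name backward is ours); it returns P(x) computed from the termination step along with the full table:

```python
def backward(x, states, start_p, trans_p, emit_p):
    """Return (P(x), table of backward probabilities b[i][k])."""
    n = len(x)
    b = [dict() for _ in range(n)]
    b[n - 1] = {k: 1.0 for k in states}                 # initialization: b_j(n) = 1
    for i in range(n - 2, -1, -1):
        b[i] = {
            k: sum(emit_p[j][x[i + 1]] * trans_p[k][j] * b[i + 1][j] for j in states)
            for k in states
        }
    px = sum(emit_p[j][x[0]] * b[0][j] * start_p[j] for j in states)  # termination
    return px, b
```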

  48. [Diagram of the backward dynamic-programming matrix, rows = states 1..K, columns = symbols x1, x2, x3, …, xn: the last column holds b1(n) = 1, b2(n) = 1, …, bK(n) = 1; the next-to-last column is filled using bk(n - 1) = Σ_j ej(xn)·akj·bj(n), and so on, right to left]

  49. x = 2 6 6 [Backward dynamic-programming table with rows F, L and columns 2, 6, 6] P(x) = P(266) = (1/6)·(37/900)·(1/2) + (1/10)·(52/225)·(1/2) = 809/54000 = 0.01498148148

  50. Posterior Decoding There are two types of decoding for an HMM: Viterbi decoding and posterior decoding. P(πi = k | x) = P(πi = k, x) / P(x) = fk(i)·bk(i) / P(x) Now we can ask what is the most probable state at stage i. Let π̂i denote this state. Clearly, we have π̂i = argmax_k {P(πi = k | x)} Therefore, (π̂1, π̂2,..., π̂n) is the sequence of the most probable states. Notice that the above sequence is not (necessarily) the most probable path that the HMM went through to produce x and may not even be a valid path!
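Putting the forward and backward sketches together gives a small posterior-decoding routine (the name posterior_decode is ours); since P(x) is the same for every state, it can be dropped from the argmax:

```python
def posterior_decode(x, states, start_p, trans_p, emit_p):
    """Return the sequence of individually most probable states at each stage."""
    _, f = forward(x, states, start_p, trans_p, emit_p)
    _, b = backward(x, states, start_p, trans_p, emit_p)
    # P(pi_i = k | x) is proportional to f_k(i) * b_k(i)
    return [max(states, key=lambda k: f[i][k] * b[i][k]) for i in range(len(x))]

print(posterior_decode([2, 6, 6], states, start_p, trans_p, emit_p))  # ['L', 'L', 'L'] for this example
```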
