Download Presentation
## Hidden Markov Model

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Example: CpG Island**We consider two questions (and some variants): Question 1: Given a short stretch of genomic data, does it come from a CpG island ? Question 2: Given a long piece of genomic data, does it contain CpG islands in it, where, what length ? We “solve” the first question by modeling strings with and without CpG islands as Markov Chains over the same states {A,C,G,T} but different transition probabilities:**Question 2: Finding CpG Islands**Given a long genomic str with possible CpG Islands, we define a Markov Chain over 8 states, all interconnected (hence it is ergodic): The problem is that we don’t know the sequence of states which are traversed, but just the sequence of letters. A+ C+ G+ T+ A- C- G- T- Therefore we use here Hidden Markov Model**H1**H2 HL-1 HL Hi X1 X2 XL-1 XL Xi Hidden Markov Models - HMM Hidden variables Observed data**M**M M M S1 S2 SL-1 SL T T T T x1 x2 XL-1 xL Hidden Markov Model A Markov chain (s1,…,sL): and for each state s and a symbol x we have p(Xi=x|Si=s) Application in communication: message sent is (s1,…,sm) but we receive (x1,…,xm) . Compute what is the most likely message sent ? Application in speech recognition: word said is (s1,…,sm) but we recorded (x1,…,xm) . Compute what is the most likely word said ?**M**M M M S1 S2 SL-1 SL T T T T x1 x2 XL-1 xL Hidden Markov Model Notations: Markov Chain transition probabilities: p(Si+1= t|Si= s) = ast Emission probabilities: p(Xi = b| Si = s) = es(b)**Example: The Dishonest Casino**A casino has two dice: • Fair die P(1) = P(2) = P(3) = P(5) = P(6) = 1/6 • Loaded die P(1) = P(2) = P(3) = P(5) = 1/10 P(6) = 1/2 Casino player switches back-&-forth between fair and loaded die once every 20 turns Game: • You bet $1 • You roll (always with a fair die) • Casino player rolls (maybe with fair die, maybe with loaded die) • Highest number wins $2**Question # 1 – Decoding**GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs**Question # 2 – Evaluation**GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs**Question # 3 – Learning**GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs**The dishonest casino model**0.05 0.95 0.95 FAIR LOADED P(1|F) = 1/6 P(2|F) = 1/6 P(3|F) = 1/6 P(4|F) = 1/6 P(5|F) = 1/6 P(6|F) = 1/6 P(1|L) = 1/10 P(2|L) = 1/10 P(3|L) = 1/10 P(4|L) = 1/10 P(5|L) = 1/10 P(6|L) = 1/2 0.05**1**1 1 1 … 2 2 2 2 … … … … … K K K K … A parse of a sequence Given a sequence x = x1……xN, A parse of x is a sequence of states = 1, ……, N 1 2 2 K x1 x2 x3 xK**1**1 1 1 … 2 2 2 2 … … … … … K K K K … Likelihood of a parse Given a sequence x = x1……xN and a parse = 1, ……, N, To find how likely is the parse: (given our HMM) P(x, ) = P(x1, …, xN, 1, ……, N) = P(xN, N | x1…xN-1, 1, ……, N-1) P(x1…xN-1, 1, ……, N-1) = P(xN, N | N-1) P(x1…xN-1, 1, ……, N-1) = … = P(xN, N | N-1) P(xN-1, N-1 | N-2)……P(x2, 2 | 1) P(x1, 1) = P(xN | N)P(N | N-1) ……P(x2 | 2)P(2 | 1)P(x1 | 1)P(1) = a01 a12……aN-1Ne1(x1)……eN(xN) 1 2 2 K x1 x2 x3 xK**Example: the dishonest casino**Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4 Then, what is the likelihood of = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say initial probs a0Fair = ½, a0Loaded = ½) ½ P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ (1/6)10 (0.95)9 = .00000000521158647211 = 0.5 10-9**Example: the dishonest casino**So, the likelihood the die is fair in all this run is just 0.521 10-9 OK, but what is the likelihood of = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded? ½ P(1 | Loaded) P(Loaded, Loaded) … P(4 | Loaded) = ½ (1/10)8 (1/2)2 (0.95)9 = .00000000078781176215 = 0.79 10-9 Therefore, it somewhat more likely that the die is fair all the way, than that it is loaded all the way**Example: the dishonest casino**Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6 Now, what is the likelihood = F, F, …, F? ½ (1/6)10 (0.95)9 = 0.5 10-9, same as before What is the likelihood = L, L, …, L? ½ (1/10)4 (1/2)6 (0.95)9 = .00000049238235134735 = 0.5 10-7 So, it is 100 times more likely the die is loaded**H1**H2 HL-1 HL Hi X1 X2 XL-1 XL Xi C-G island? A/C/G/T C-G Islands Example G A change C T G A C T**S1**S2 SL-1 SL X1 X2 XL-1 XL The query of interest: K = * * ( s ,..., s ) max arg p ( s ,..., s | x , , x ) 1 1 1 L L L (s ,..., ) s 1 L Hidden Markov Model for CpG Islands Domain(Si)={+, -} (2 values) The states: In this representation P(xi| si) = 0 or 1 depending on whether xi is consistent with si . E.g. xi= G is consistent with si=(+,G) and with si=(-,G) but not with any other state of si.**M**M M M S1 S2 SL-1 SL T T T T x1 x2 XL-1 xL 1. Most Probable state path = decoding First Question: Given an output sequence x = (x1,…,xL), Amost probablepaths*= (s*1,…,s*L)is one which maximizes p(s|x).**1**2 2 1 1 1 1 … 2 2 2 2 … K … … … … x1 K K K K … x2 x3 xK Decoding GIVEN x = x1x2……xN We want to find = 1, ……, N, such that P[ x, ] is maximized * = argmax P[ x, ] We can use dynamic programming! Let Vk(i) = max{1,…,i-1} P[x1…xi-1, 1, …, i-1, xi, i = k] = Probability of most likely sequence of states ending at state i = k**Decoding – main idea**Given that for all states k, and for a fixed position i, Vk(i) = max{1,…,i-1} P[x1…xi-1, 1, …, i-1, xi, i = k] What is Vj(i+1)? From definition, Vj(i+1) = max{1,…,i}P[ x1…xi, 1, …, i, xi+1, i+1 = j ] = max{1,…,i}P(xi+1, i+1 = j | x1…xi,1,…, i) P[x1…xi, 1,…, i] = max{1,…,i}P(xi+1, i+1 = j | i ) P[x1…xi-1, 1, …, i-1, xi, i] = maxk [P(xi+1, i+1 = j | i = k) max{1,…,i-1}P[x1…xi-1,1,…,i-1, xi,i=k]] = ej(xi+1)maxk akjVk(i)**The Viterbi Algorithm**Input: x = x1……xN Initialization: V0(0) = 1 (0 is the imaginary first position) Vk(0) = 0, for all k > 0 Iteration: Vj(i) = ej(xi) maxk akj Vk(i-1) Ptrj(i) = argmaxk akj Vk(i-1) Termination: P(x, *) = maxk Vk(N) Traceback: N* = argmaxk Vk(N) i-1* = Ptri (i)**The Viterbi Algorithm**Similar to “aligning” a set of states to a sequence Time: O(K2N) Space: O(KN) x1 x2 x3 ………………………………………..xN State 1 2 Vj(i) K**Viterbi Algorithm – a practical detail**Underflows are a significant problem P[ x1,…., xi, 1, …, i ] = a01 a12……ai e1(x1)……ei(xi) These numbers become extremely small – underflow Solution: Take the logs of all values Vl(i) = logek(xi) + maxk [ Vk(i-1) + log akl ]**Example**Let x be a sequence with a portion of ~ 1/6 6’s, followed by a portion of ~ ½ 6’s… x = 123456123456…123456626364656…1626364656 Then, it is not hard to show that optimal parse is : FFF…………………...FLLL………………………...L 6 characters “123456” parsed as F, contribute .956(1/6)6 = 1.610-5 parsed as L, contribute .956(1/2)1(1/10)5 = 0.410-5 “162636” parsed as F, contribute .956(1/6)6 = 1.610-5 parsed as L, contribute .956(1/2)3(1/10)3 = 9.010-5**M**M M M S1 S2 SL-1 SL T T T T x1 x2 XL-1 xL 2. Computing p(x) = evaluation Given an output sequence x = (x1,…,xL), Compute the probability that this sequence was generated: The summation taken over all state-paths s generating x.**Evaluation**We will develop algorithms that allow us to compute: P(x) Probability of x given the model P(xi…xj) Probability of a substring of x given the model P(i = k | x) Probability that the ith state is k, given x**The Forward Algorithm**We want to calculate P(x) = probability of x, given the HMM Sum over all possible ways of generating x: P(x) = P(x, ) = P(x | ) P() To avoid summing over an exponential number of paths , define fk(i) = P(x1…xi, i = k) (the forward probability)**The Forward Algorithm – derivation**Define the forward probability: fk(i) = P(x1…xi, i = k) = 1…i-1 P(x1…xi-1, 1,…, i-1, i = k) ek(xi) = j 1…i-2 P(x1…xi-1, 1,…, i-2, i-1 = j) ajk ek(xi) = ek(xi) j fj(i-1) ajk**The Forward Algorithm**We can compute fk(i) for all k, i, using dynamic programming! Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fk(i) = ek(xi) j fj(i-1) ajk Termination: P(x) = k fk(N) ak0 Where, ak0 is the probability that the terminating state is k (usually = a0k)**Relation between Forward and Viterbi**VITERBI Initialization: V0(0) = 1 Vk(0) = 0, for all k > 0 Iteration: Vj(i) = ej(xi) maxkVk(i-1) akj Termination: P(x, *) = maxkVk(N) FORWARD Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fj(i) = ej(xi) k fk(i-1) akj Termination: P(x) = k fk(N) ak0**Motivation for the Backward Algorithm**We want to compute P(i = k | x), the probability distribution on the ith position, given x We start by computing P(i = k, x) = P(x1…xi, i = k, xi+1…xN) = P(x1…xi, i = k) P(xi+1…xN | x1…xi, i = k) = P(x1…xi, i = k) P(xi+1…xN | i = k) Then, P(i = k | x) = P(i = k, x) / P(x) Forward, fk(i) Backward, bk(i)**The Backward Algorithm – derivation**Define the backward probability: bk(i) = P(xi+1…xN | i = k) = i+1…N P(xi+1,xi+2, …, xN, i+1, …, N | i = k) = j i+1…N P(xi+1,xi+2, …, xN, i+1 = j, i+2, …, N | i = k) = j ej(xi+1) akj i+2…N P(xi+2, …, xN, i+2, …, N | i+1 = j) = j ej(xi+1) akj bj(i+1)**The Backward Algorithm**We can compute bk(i) for all k, i, using dynamic programming Initialization: bk(N) = ak0, for all k Iteration: bk(i) = j ej(xi+1) akj bj(i+1) Termination: P(x) = j a0j ej(x1) bj(1)**Computational Complexity**What is the running time, and space required, for Forward, and Backward? Time: O(K2N) Space: O(KN) Useful implementation technique to avoid underflows Viterbi: sum of logs Forward/Backward: rescaling at each position by multiplying by a constant**Viterbi, Forward, Backward**VITERBI Initialization: V0(0) = 1 Vk(0) = 0, for all k > 0 Iteration: Vj(i) = ej(xi) maxkVk(i-1) akj Termination: P(x, *) = maxkVk(N) FORWARD Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fj(i) = ej(xi) k fk(i-1) akj Termination: P(x) = k fk(N) ak0 BACKWARD Initialization: bk(N) = ak0, for all k Iteration: bj(i) = k ej(xi+1) akj bk(i+1) Termination: P(x) = k a0k ek(x1) bk(1)**Posterior Decoding**We can now calculate fk(i) bk(i) P(i = k | x) = ––––––– P(x) Then, we can ask What is the most likely state at position i of sequence x: Define ^ by Posterior Decoding: ^i = argmaxkP(i = k | x)**Posterior Decoding**• For each state, • Posterior Decoding gives us a curve of likelihood of state for each position • That is sometimes more informative than Viterbi path * • Posterior Decoding may give an invalid sequence of states • Why?**Posterior Decoding**• P(i = k | x) = P( | x) 1(i = k) = {:[i] = k}P( | x) x1 x2 x3 …………………………………………… xN State 1 P(i=l|x) l k**Posterior Decoding**• Example: How do we compute P(i = l, ji = l’ | x)? fl(i) bl(j) P(i = l, iI = l’ | x) = ––––––– P(x) x1 x2 x3 …………………………………………… xN State 1 P(i=l|x) l P(j=l’|x) k**Applications of the model**Given a DNA region x, The Viterbi algorithm predicts locations of CpG islands Given a nucleotide xi, (say xi = A) The Viterbi parse tells whether xi is in a CpG island in the most likely general scenario The Forward/Backward algorithms can calculate P(xi is in CpG island) = P(i = A+ | x) Posterior Decodingcan assign locally optimal predictions of CpG islands ^i = argmaxkP(i = k | x)**A model of CpG Islands – (2) Transitions**• What about transitions between (+) and (-) states? • They affect • Avg. length of CpG island • Avg. separation between two CpG islands 1-p Length distribution of region X: P[lX = 1] = 1-p P[lX = 2] = p(1-p) … P[lX= k] = pk(1-p) E[lX] = 1/(1-p) Geometric distribution, with mean 1/(1-p) X Y p q 1-q