
Hidden Markov Model Lecture #6



  1. Hidden Markov Model, Lecture #6. Background Readings: Chapters 3.1, 3.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.

  2. Use of Markov Chains in Genome Search: Modeling CpG Islands. In human genomes the pair CG often transforms to (methyl-C)G, which in turn often transforms to TG. Hence the pair CG appears less frequently than expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the start regions of many genes. These areas are called CpG islands (C-phosphate-G).

  3. Example: CpG Island (cont.) We consider two questions (and some variants). Question 1: Given a short stretch of genomic data, does it come from a CpG island? We solved Question 1 by using two models for DNA strings: a “+” model and a “–” model, for strings inside and outside CpG islands. Each model is a Markov chain over the states {A,C,G,T} with appropriate transition probabilities.
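
As an illustration of this two-model comparison, the following sketch scores a stretch by the log-odds of the “+” chain against the “–” chain, summed over consecutive letter pairs; a positive score favours the CpG-island model. The transition probabilities below are rough placeholders (not the estimated values from the textbook), and the function name log_odds is ours.

    import math

    # Illustrative transition probabilities p(next | current) for the "+" (CpG island)
    # and "-" (background) Markov chains; real values would be estimated from data.
    PLUS = {
        'A': {'A': 0.18, 'C': 0.27, 'G': 0.43, 'T': 0.12},
        'C': {'A': 0.17, 'C': 0.37, 'G': 0.27, 'T': 0.19},
        'G': {'A': 0.16, 'C': 0.34, 'G': 0.38, 'T': 0.12},
        'T': {'A': 0.08, 'C': 0.36, 'G': 0.38, 'T': 0.18},
    }
    MINUS = {
        'A': {'A': 0.30, 'C': 0.21, 'G': 0.28, 'T': 0.21},
        'C': {'A': 0.32, 'C': 0.30, 'G': 0.08, 'T': 0.30},
        'G': {'A': 0.25, 'C': 0.24, 'G': 0.30, 'T': 0.21},
        'T': {'A': 0.18, 'C': 0.24, 'G': 0.29, 'T': 0.29},
    }

    def log_odds(seq):
        """Sum of log[ p+(b|a) / p-(b|a) ] over consecutive pairs ab; positive favours a CpG island."""
        return sum(math.log(PLUS[a][b] / MINUS[a][b]) for a, b in zip(seq, seq[1:]))

    print(log_odds("CGCGCGCG"))   # clearly positive: CpG-island-like
    print(log_odds("ATATATAT"))   # clearly negative: background-like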

  4. CpG Island: Question 2. Question 2: Given a long piece of genomic data, does it contain CpG islands, and where? To solve this question, we need to decide which parts of a given long sequence of letters are more likely to come from the “+” model, and which parts are more likely to come from the “–” model.

  5. Model for Question 2. Given a long genomic string with possible CpG islands, we define a Markov chain over 8 states, A+, C+, G+, T+, A–, C–, G–, T–, all interconnected (hence it is ergodic). The problem is that we do not know the sequence of states that is traversed, only the sequence of letters. Therefore we use a Hidden Markov Model, which we define and study next.

  6. Hidden Markov Model. [Figure: a chain of states S1, S2, …, SL-1, SL with transitions M between consecutive states and an emitted letter xi below each state Si.] An HMM consists of a Markov chain over a set of states, and for each state s and symbol x, an emission probability p(Xi=x | Si=s). Notations: Markov chain transition probabilities: p(Si+1=t | Si=s) = mst. Emission probabilities: p(Xi=b | Si=s) = es(b).
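
The two ingredients can be made concrete with a small generative sketch: walk the Markov chain according to the transition probabilities mst and emit a letter according to es(b) at each visited state. The two states (a “+”-like state and a “–”-like background state), the alphabet, and all numbers below are made up for illustration.

    import random

    # Hypothetical two-state HMM ("+" = CpG-like, "-" = background) with made-up numbers.
    STATES = ['+', '-']
    ALPHABET = ['A', 'C', 'G', 'T']
    M = {'+': {'+': 0.90, '-': 0.10},                         # m_st = p(S_{i+1}=t | S_i=s)
         '-': {'+': 0.05, '-': 0.95}}
    E = {'+': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},   # e_s(b) = p(X_i=b | S_i=s)
         '-': {'A': 0.30, 'C': 0.20, 'G': 0.20, 'T': 0.30}}
    INIT = {'+': 0.5, '-': 0.5}                               # distribution of the first state S_1

    def sample(length):
        """Draw a state path and an emitted letter sequence of the given length."""
        states, letters = [], []
        s = random.choices(STATES, weights=[INIT[t] for t in STATES])[0]
        for _ in range(length):
            states.append(s)
            letters.append(random.choices(ALPHABET, weights=[E[s][b] for b in ALPHABET])[0])
            s = random.choices(STATES, weights=[M[s][t] for t in STATES])[0]
        return states, letters

    path, seq = sample(20)
    print("".join(path))
    print("".join(seq))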

  7. Hidden Markov Model. The probability of the chain S and the emitted letters X is: p(s1,…,sL, x1,…,xL) = ∏i=1..L msi-1,si · esi(xi), with the convention that ms0,s1 denotes the probability of the initial state s1.
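
Given concrete matrices, the joint probability of a particular state path and emitted string is just the product of the corresponding factors. A minimal sketch, assuming a hypothetical two-state model with an explicit initial distribution for S1:

    # Hypothetical two-state parameters of the same shape as in the previous sketch.
    M = {'+': {'+': 0.90, '-': 0.10}, '-': {'+': 0.05, '-': 0.95}}        # m_st
    E = {'+': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},
         '-': {'A': 0.30, 'C': 0.20, 'G': 0.20, 'T': 0.30}}               # e_s(b)
    INIT = {'+': 0.5, '-': 0.5}                                           # p(S_1 = s)

    def joint_prob(states, letters):
        """p(s_1..s_L, x_1..x_L) = p(s_1) e_{s_1}(x_1) * prod_{i>1} m_{s_{i-1},s_i} e_{s_i}(x_i)."""
        p = INIT[states[0]] * E[states[0]][letters[0]]
        for prev, cur, x in zip(states, states[1:], letters[1:]):
            p *= M[prev][cur] * E[cur][x]
        return p

    print(joint_prob(['+', '+', '-'], ['C', 'G', 'A']))   # 0.5*0.35 * 0.9*0.35 * 0.1*0.3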

  8. Probability distribution defined by HMM. Claim: Let M = (mst) and Es = (es(b)) be given stochastic matrices. Then for each fixed L > 0, the function p defined by p(s1,…,sL, x1,…,xL) = ∏i=1..L msi-1,si · esi(xi) is a probability distribution over all state and output sequences of length L. That is: Σs1,…,sL Σx1,…,xL p(s1,…,sL, x1,…,xL) = 1.

  9. Probability distribution defined by HMM. Proof by induction on L. For L = 1: Σs1 Σx1 ms0,s1 · es1(x1) = Σs1 ms0,s1 (Σx1 es1(x1)) = Σs1 ms0,s1 = 1, since the emission matrix Es1 and the transition matrix M are stochastic (each row sums to 1).

  10. Probability distribution defined by HMM. Induction step: assume correctness for L and prove for L+1: Σs1..sL+1 Σx1..xL+1 p(s1,…,sL+1, x1,…,xL+1) = Σs1..sL Σx1..xL p(s1,…,sL, x1,…,xL) · ΣsL+1 msL,sL+1 (ΣxL+1 esL+1(xL+1)). The two rightmost sums each equal 1 because the matrices are stochastic, and the remaining sum equals 1 by the induction hypothesis.

  11. Independence properties in HMM. We would like the HMM to satisfy certain independence properties, e.g.: 1. The distribution of the state Sk is completely determined by the identity of the preceding state sk-1. 2. The distribution of the emitted letter Xk is completely determined by the emitting state sk. In the next slides we formally prove that property 2 is implied by the probability distribution of the HMM which we just defined.

  12. Independence of emission probabilities. Claim: the following equality holds: p(Xk=xk | x1,..,xk-1, xk+1,..,xL, s1,..,sk,..,sL) = esk(xk).

  13. Independence of emission probabilities. Proof of claim: we use the definition of conditional probability, P(A|B) = P(A,B)/P(B), where p(A,B) denotes p(A∩B). A is the event Xk=xk (the k-th output is xk). B is the event which specifies the entire sequence except Xk: (X1=x1,.., Xk-1=xk-1, Xk+1=xk+1,.., XL=xL, S1=s1,.., SL=sL). (A,B) is the event (X1=x1,.., XL=xL, S1=s1,.., SL=sL).

  14. Independence of emission probabilities. Proof (cont.): p(A,B) = p(x1,…,xL, s1,…,sL) is a product of transition and emission factors in which esk(xk) appears exactly once, so we can write p(A,B) = esk(xk)·C, where C does not depend on the value of xk. Summing over all possible values b of Xk: p(B) = Σb p(x1,.., xk-1, b, xk+1,.., xL, s1,.., sL) = C · Σb esk(b) = C.

  15. Independence of emission probabilities. Proof (end): from the previous equalities we have p(B) = C = p(A,B)/esk(xk). Thus we conclude: P(A|B) = P(A,B)/P(B) = esk(xk). QED

  16. Independence of emission probabilities. Exercise: using the definition of conditional probability, P(A|B) = P(A,B)/P(B), prove formally that for any set of constraints B ⊆ {X1=x1,.., Xi-1=xi-1, Xi+1=xi+1,.., XL=xL, S1=s1,.., Si=si,.., SL=sL} such that “Si=si” ∈ B, it holds that p(Xi=xi | B) = esi(xi). Hint: express the probabilities as sums of p(S,X) over all possible S and X.

  17. Hidden Markov Model: three questions of interest. Given the “visible” sequence x = (x1,…,xL), find: 1. A most probable (hidden) path. 2. The probability of x. 3. For each i = 1,..,L and for each state k, the probability that si = k.

  18. 1. Most probable state path. Given an output sequence x = (x1,…,xL), a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x).

  19. Most probable path (cont.) Since p(s|x) = p(s,x)/p(x) and p(x) does not depend on s, we need to find an s which maximizes p(s,x). We use a dynamic programming (DP) algorithm, called Viterbi’s algorithm, which we present next.

  20. Viterbi’s algorithm for a most probable path. The task: compute a most probable path s* and its probability maxs p(s1,…,sL; x1,…,xL). Let the states be {1,…,m}. Idea: for i = 1,…,L and for each state l, compute vl(i) = the probability p(s1,..,si; x1,..,xi | si=l) of a most probable path up to position i which ends in state l.

  21. Viterbi’s algorithm for a most probable path. vl(i) = the probability p(s1,..,si; x1,..,xi | si=l) of a most probable path up to position i which ends in state l. For i = 1,…,L and for each state l we have the recursion: vl(i) = el(xi) · maxk { vk(i-1) mkl }.

  22. Viterbi’s algorithm. We add the special initial state 0. Initialization: v0(0) = 1, vk(0) = 0 for k > 0. For i = 1 to L, for each state l: vl(i) = el(xi) · maxk { vk(i-1) mkl }, and ptri(l) = argmaxk { vk(i-1) mkl } [storing the previous state for reconstructing the path]. Termination and result: p(s1*,…,sL*; x1,…,xL) = maxk { vk(L) }; the path s* is read off by following the pointers ptr back from the maximizing final state.
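
A runnable sketch of this algorithm for a hypothetical two-state HMM (the parameters are made up; log probabilities are used only to avoid numerical underflow, and the traceback follows the stored pointers to recover the path s*):

    import math

    # Hypothetical two-state HMM; state 0 of the slide plays the role of M0 below.
    STATES = ['+', '-']
    M0 = {'+': 0.5, '-': 0.5}                              # m_{0,l}: transition out of the initial state
    M = {'+': {'+': 0.90, '-': 0.10}, '-': {'+': 0.05, '-': 0.95}}
    E = {'+': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},
         '-': {'A': 0.30, 'C': 0.20, 'G': 0.20, 'T': 0.30}}

    def viterbi(x):
        """Return (max_s p(s, x), a most probable state path s*)."""
        # v[l] holds log v_l(i) for the current position i.
        v = {l: math.log(M0[l] * E[l][x[0]]) for l in STATES}
        ptrs = []                                          # ptrs[i-1][l] = argmax predecessor of l
        for xi in x[1:]:
            ptr, nxt = {}, {}
            for l in STATES:
                best = max(STATES, key=lambda k: v[k] + math.log(M[k][l]))
                ptr[l] = best
                nxt[l] = v[best] + math.log(M[best][l]) + math.log(E[l][xi])
            ptrs.append(ptr)
            v = nxt
        # Termination: best final state, then follow the pointers backwards.
        last = max(STATES, key=lambda l: v[l])
        path = [last]
        for ptr in reversed(ptrs):
            path.append(ptr[path[-1]])
        return math.exp(v[last]), path[::-1]

    prob, path = viterbi("CGCGAATT")
    print(prob, "".join(path))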

  23. 2. Computing p(x). Given an output sequence x = (x1,…,xL), compute the probability that this sequence was generated by the given HMM: p(x) = Σs p(s,x), the summation taken over all state-paths s generating x.

  24. Forward algorithm for computing p(x). The task: compute p(x) = Σs p(s,x). Idea: for i = 1,…,L and for each state l, compute Fl(i) = p(x1,…,xi; si=l), the probability of all the paths which emit (x1,..,xi) and end in state si=l. Use the recursive formula: Fl(i) = el(xi) · Σk Fk(i-1) mkl.

  25. Forward algorithm for computing p(x). Similar to Viterbi’s algorithm (use a sum instead of a maximum). Initialization: F0(0) := 1, Fk(0) := 0 for k > 0. For i = 1 to L, for each state l: Fl(i) = el(xi) · Σk Fk(i-1) mkl. Result: p(x1,…,xL) = Σk Fk(L).
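
The corresponding sketch for the forward algorithm, with the same hypothetical parameters; it mirrors the Viterbi sketch with the maximum replaced by a sum (a production implementation would use log-space or scaling to avoid underflow on long sequences):

    # Hypothetical two-state HMM parameters, as in the earlier sketches.
    STATES = ['+', '-']
    M0 = {'+': 0.5, '-': 0.5}                              # transition out of the initial state 0
    M = {'+': {'+': 0.90, '-': 0.10}, '-': {'+': 0.05, '-': 0.95}}
    E = {'+': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},
         '-': {'A': 0.30, 'C': 0.20, 'G': 0.20, 'T': 0.30}}

    def forward(x):
        """Return (p(x), table F where F[i-1][l] = F_l(i) = p(x_1..x_i, S_i=l))."""
        F = [{l: M0[l] * E[l][x[0]] for l in STATES}]                       # F_l(1)
        for xi in x[1:]:
            F.append({l: E[l][xi] * sum(F[-1][k] * M[k][l] for k in STATES)
                      for l in STATES})                                     # F_l(i) recursion
        return sum(F[-1].values()), F                                       # p(x) = sum_k F_k(L)

    px, F = forward("CGCGAATT")
    print(px)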

  26. 3. The distribution of Si, given x. Given an output sequence x = (x1,…,xL), compute, for each i = 1,…,L and for each state k, the probability that si = k. This helps to answer queries like: what is the probability that position i is in a CpG island, etc.

  27. Solution in two stages. 1. For a fixed i and each state k, an algorithm to compute p(si=k | x1,…,xL). 2. An algorithm which performs this task for every i = 1,..,L, without repeating the first task L times.

  28. Computing for a fixed i: by the definition of conditional probability, p(Si=l | x1,…,xL) = p(x1,…,xL, Si=l) / p(x1,…,xL), so it suffices to compute the joint probability p(x1,…,xL, Si=l) for each state l (p(x1,…,xL) is given by the forward algorithm).

  29. Computing for a fixed i (cont.): p(x1,…,xL, Si=l) = p(x1,…,xi, Si=l) · p(xi+1,…,xL | x1,…,xi, Si=l) (by the equality p(A,B) = p(A)·p(B|A)). p(x1,…,xi, Si=l) = Fl(i), which is computed by the forward algorithm.

  30. B(si): the Backward algorithm. Recall: p(x1,…,xL, Si=l) = p(x1,…,xi, Si=l) · p(xi+1,…,xL | x1,…,xi, Si=l). We are left with the task of computing, by the backward algorithm, Bl(i) ≡ p(xi+1,…,xL | x1,…,xi, Si=l), which gives the desired result: p(x1,…,xL, Si=l) = p(x1,…,xi, Si=l) · p(xi+1,…,xL | Si=l) ≡ Fl(i)·Bl(i).

  31. B(si): the Backward algorithm. From the probability distribution of the Hidden Markov Model and the definition of conditional probability: Bl(i) = p(xi+1,…,xL | x1,…,xi, Si=l) = p(xi+1,…,xL | Si=l) = Σk mlk · ek(xi+1) · p(xi+2,…,xL | Si+1=k).

  32. B(si): the backward algorithm (cont.) Thus we compute Bl(i) from the values of Bk(i+1) for all states k, using the backward recursion: Bl(i) = Σk mlk · ek(xi+1) · Bk(i+1).

  33. B(si): the backward algorithm (end). First step, step L-1: compute Bl(L-1) for each possible state l: Bl(L-1) = Σk mlk · ek(xL). Then, for i = L-2 down to 1, for each possible state l, compute Bl(i) from the values of Bk(i+1): Bl(i) = Σk mlk · ek(xi+1) · Bk(i+1).
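
A matching sketch of the backward recursion with the same hypothetical parameters; it is written with the equivalent convention Bl(L) = 1, so that the step computing Bl(L-1) = Σk mlk · ek(xL) is just the first iteration of the loop:

    # Hypothetical two-state HMM parameters, as in the earlier sketches.
    STATES = ['+', '-']
    M = {'+': {'+': 0.90, '-': 0.10}, '-': {'+': 0.05, '-': 0.95}}
    E = {'+': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},
         '-': {'A': 0.30, 'C': 0.20, 'G': 0.20, 'T': 0.30}}

    def backward(x):
        """Return table B where B[i-1][l] = B_l(i) = p(x_{i+1}..x_L | S_i=l)."""
        L = len(x)
        B = [dict.fromkeys(STATES, 1.0)]                   # B_l(L) = 1 (empty remaining suffix)
        for i in range(L - 1, 0, -1):                      # fill B_l(i) for i = L-1 down to 1
            B.insert(0, {l: sum(M[l][k] * E[k][x[i]] * B[0][k] for k in STATES)
                         for l in STATES})
        return B

    B = backward("CGCGAATT")
    print(B[0])                                            # B_l(1) for each state l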

  34. The combined answer. 1. To compute the probability that Si=l for all states l, given x = (x1,…,xL): run the forward algorithm to compute Fl(i) = P(x1,…,xi, Si=l) and the backward algorithm to compute Bl(i) = P(xi+1,…,xL | Si=l); the product Fl(i)·Bl(i) = p(x1,…,xL, Si=l), and dividing it by p(x) gives p(Si=l | x). 2. To compute these probabilities for every i, simply run the forward and backward algorithms once, storing Fl(i) and Bl(i) for all i and l, and compute Fl(i)·Bl(i) for all i and l.
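
Putting the pieces together, one forward pass and one backward pass give the posterior p(Si=l | x) = Fl(i)·Bl(i) / p(x) for every i and l at once. A self-contained sketch with the same hypothetical two-state parameters:

    # Hypothetical two-state HMM parameters, as in the earlier sketches.
    STATES = ['+', '-']
    M0 = {'+': 0.5, '-': 0.5}
    M = {'+': {'+': 0.90, '-': 0.10}, '-': {'+': 0.05, '-': 0.95}}
    E = {'+': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},
         '-': {'A': 0.30, 'C': 0.20, 'G': 0.20, 'T': 0.30}}

    def posteriors(x):
        """p(S_i=l | x) for every position i and state l, from one forward and one backward pass."""
        L = len(x)
        F = [{l: M0[l] * E[l][x[0]] for l in STATES}]                       # forward table F_l(i)
        for xi in x[1:]:
            F.append({l: E[l][xi] * sum(F[-1][k] * M[k][l] for k in STATES) for l in STATES})
        B = [dict.fromkeys(STATES, 1.0)]                                    # backward table, B_l(L) = 1
        for i in range(L - 1, 0, -1):
            B.insert(0, {l: sum(M[l][k] * E[k][x[i]] * B[0][k] for k in STATES) for l in STATES})
        px = sum(F[-1].values())                                            # p(x)
        return [{l: F[i][l] * B[i][l] / px for l in STATES} for i in range(L)]

    for i, dist in enumerate(posteriors("CGCGAATT"), start=1):
        print(i, dist)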

  35. Time and space complexity of the forward/backward algorithms. Time complexity is O(m²L), where m is the number of states: it is linear in the length of the chain, provided the number of states is a constant. Space complexity is also O(m²L).
