1 / 31

Hidden Markov Models

Hidden Markov Models. 1. 2. 2. 1. 1. 1. 1. …. 2. 2. 2. 2. …. K. …. …. …. …. x 1. K. K. K. K. …. x 2. x 3. x K. Decoding. GIVEN x = x 1 x 2 ……x N We want to find  =  1 , ……,  N , such that P[ x,  ] is maximized  * = argmax  P[ x,  ]

bonita
Download Presentation

Hidden Markov Models

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hidden Markov Models

  2. 1 2 2 1 1 1 1 … 2 2 2 2 … K … … … … x1 K K K K … x2 x3 xK Decoding GIVEN x = x1x2……xN We want to find  = 1, ……, N, such that P[ x,  ] is maximized * = argmax P[ x,  ] We can use dynamic programming! Let Vk(i) = max{1,…,i-1} P[x1…xi-1, 1, …, i-1, xi, i = k] = Probability of most likely sequence of states ending at state i = k

  3. The Viterbi Algorithm x1 x2 x3 ………………………………………..xN Similar to “aligning” a set of states to a sequence Time: O(K2N) Space: O(KN) State 1 2 Vj(i) K

  4. Evaluation We demonstrated algorithms that allow us to compute: P(x) Probability of x given the model P(xi…xj) Probability of a substring of x given the model P(I = k | x) Probability that the ith state is k, given x A more refined measure of which states x may be in

  5. Viterbi, Forward, Backward VITERBI Initialization: V0(0) = 1 Vk(0) = 0, for all k > 0 Iteration: Vj(i) = ej(xi) maxkVk(i-1) akj Termination: P(x, *) = maxkVk(N) FORWARD Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fl(i) = el(xi) k fk(i-1) akl Termination: P(x) = k fk(N) ak0 BACKWARD Initialization: bk(N) = ak0, for all k Iteration: bk(i) = l el(xi+1) akl bl(i+1) Termination: P(x) = l a0l el(x1) bl(1)

  6. Posterior Decoding We can now calculate fk(i) bk(i) P(i = k | x) = ––––––– P(x) Then, we can ask What is the most likely state at position i of sequence x: Define ^ by Posterior Decoding: ^i = argmaxkP(i = k | x)

  7. Implementation Techniques Viterbi: Sum-of-logs Vl(i) = logek(xi) + maxk [ Vk(i-1) + log akl ] Forward/Backward: Scaling by c(i) One way to perform scaling: fl(i) = c(i)  [el(xi) k fk(i-1) akl]where c(i) = 1/( k fk(i)) bl(i): use the same factors c(i) Details in Rabiner’s Tutorial on HMMs, 1989

  8. A+ C+ G+ T+ A- C- G- T- A modeling Example CpG islands in DNA sequences

  9. Example: CpG Islands CpG nucleotides in the genome are frequently methylated (Write CpG not to confuse with CG base pair) C  methyl-C  T Methylation often suppressed around genes, promoters  CpG islands

  10. Example: CpG Islands • In CpG islands, • CG is more frequent • Other pairs (AA, AG, AT…) have different frequencies Question: Detect CpG islands computationally

  11. A model of CpG Islands – (1) Architecture A+ C+ G+ T+ CpG Island A- C- G- T- Not CpG Island

  12. A model of CpG Islands – (2) Transitions How do we estimate the parameters of the model? Emission probabilities: 1/0 • Transition probabilities within CpG islands Established from many known (experimentally verified) CpG islands (Training Set) • Transition probabilities within other regions Established from many known non-CpG islands

  13. Parenthesis – log likelihoods Another way to see effects of transitions: Log likelihoods L(u, v) = log[ P(uv | + ) / P(uv | -) ] Given a region x = x1…xN A quick-&-dirty way to decide whether entire x is CpG P(x is CpG) > P(x is not CpG)  i L(xi, xi+1) > 0

  14. A model of CpG Islands – (2) Transitions • What about transitions between (+) and (-) states? • They affect • Avg. length of CpG island • Avg. separation between two CpG islands 1-p Length distribution of region X: P[lX = 1] = 1-p P[lX = 2] = p(1-p) … P[lX= k] = pk(1-p) E[lX] = 1/(1-p) Geometric distribution, with mean 1/(1-p) X Y p q 1-q

  15. A model of CpG Islands – (2) Transitions No reason to favor exiting/entering (+) and (-) regions at a particular nucleotide To determine transition probabilities between (+) and (-) states • Estimate average length of a CpG island: lCPG = 1/(1-p)  p = 1 – 1/lCPG • For each pair of (+) states k, l, let akl p × akl • For each (+) state k, (-) state l, let akl = (1-p)/4 (better: take frequency of l in the (-) regions into account) • Do the same for (-) states A problem with this model: CpG islands don’t have exponential length distribution This is a defect of HMMs – compensated with ease of analysis & computation

  16. Applications of the model Given a DNA region x, The Viterbi algorithm predicts locations of CpG islands Given a nucleotide xi, (say xi = A) The Viterbi parse tells whether xi is in a CpG island in the most likely general scenario The Forward/Backward algorithms can calculate P(xi is in CpG island) = P(i = A+ | x) Posterior Decodingcan assign locally optimal predictions of CpG islands ^i = argmaxkP(i = k | x)

  17. What if a new genome comes? • We just sequenced the porcupine genome • We know CpG islands play the same role in this genome • However, we have no known CpG islands for porcupines • We suspect the frequency and characteristics of CpG islands are quite different in porcupines How do we adjust the parameters in our model? LEARNING

  18. Problem 3: Learning Re-estimate the parameters of the model based on training data

  19. Two learning scenarios • Estimation when the “right answer” is known Examples: GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls • Estimation when the “right answer” is unknown Examples: GIVEN: the porcupine genome; we don’t know how frequent are the CpG islands there, neither do we know their composition GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice QUESTION: Update the parameters  of the model to maximize P(x|)

  20. 1. When the right answer is known Given x = x1…xN for which the true  = 1…N is known, Define: Akl = # times kl transition occurs in  Ek(b) = # times state k in  emits b in x We can show that the maximum likelihood parameters  are: Akl Ek(b) akl = ––––– ek(b) = ––––––– i Akic Ek(c)

  21. 1. When the right answer is known Intuition: When we know the underlying states, Best estimate is the average frequency of transitions & emissions that occur in the training data Drawback: Given little data, there may be overfitting: P(x|) is maximized, but  is unreasonable 0 probabilities – VERY BAD Example: Given 10 casino rolls, we observe x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3  = F, F, F, F, F, F, F, F, F, F Then: aFF = 1; aFL = 0 eF(1) = eF(3) = .2; eF(2) = .3; eF(4) = 0; eF(5) = eF(6) = .1

  22. Pseudocounts Solution for small training sets: Add pseudocounts Akl = # times kl transition occurs in  + rkl Ek(b) = # times state k in  emits b in x + rk(b) rkl, rk(b) are pseudocounts representing our prior belief Larger pseudocounts  Strong priof belief Small pseudocounts ( < 1): just to avoid 0 probabilities

  23. Pseudocounts Example: dishonest casino We will observe player for one day, 500 rolls Reasonable pseudocounts: r0F = r0L = rF0 = rL0 = 1; rFL = rLF = rFF = rLL = 1; rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair) rF(1) = rF(2) = … = rF(6) = 5 (wait and see for loaded) Above #s pretty arbitrary – assigning priors is an art

  24. 2. When the right answer is unknown We don’t know the true Akl, Ek(b) Idea: • We estimate our “best guess” on what Akl, Ek(b) are • We update the parameters of the model, based on our guess • We repeat

  25. 2. When the right answer is unknown Starting with our best guess of a model M, parameters : Given x = x1…xN for which the true  = 1…N is unknown, We can get to a provably more likely parameter set  Principle: EXPECTATION MAXIMIZATION • Estimate Akl, Ek(b) in the training data • Update  according to Akl, Ek(b) • Repeat 1 & 2, until convergence

  26. Estimating new parameters To estimate Akl: At each position i of sequence x, Find probability transition kl is used: P(i = k, i+1 = l | x) = [1/P(x)]  P(i = k, i+1 = l, x1…xN) = Q/P(x) where Q = P(x1…xi, i = k, i+1 = l, xi+1…xN) = = P(i+1 = l, xi+1…xN | i = k) P(x1…xi, i = k) = = P(i+1 = l, xi+1xi+2…xN | i = k) fk(i) = = P(xi+2…xN | i+1 = l) P(xi+1 | i+1 = l) P(i+1 = l | i = k) fk(i) = = bl(i+1) el(xi+1) akl fk(i) fk(i) akl el(xi+1) bl(i+1) So: P(i = k, i+1 = l | x, ) = –––––––––––––––––– P(x | )

  27. Estimating new parameters So, fk(i) akl el(xi+1) bl(i+1) Akl = i P(i = k, i+1 = l | x, ) = i ––––––––––––––––– P(x | ) Similarly, Ek(b) = [1/P(x)]{i | xi = b} fk(i) bk(i)

  28. Estimating new parameters If we have several training sequences, x1, …, xM, each of length N, fk(i) akl el(xi+1) bl(i+1) Akl = x i P(i = k, i+1 = l | x, ) = x i –––––––––––––––– P(x | ) Similarly, Ek(b) = x (1/P(x)){i | xi = b} fk(i) bk(i)

  29. The Baum-Welch Algorithm Initialization: Pick the best-guess for model parameters (or arbitrary) Iteration: • Forward • Backward • Calculate Akl, Ek(b) • Calculate new model parameters akl, ek(b) • Calculate new log-likelihood P(x | ) GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION Until P(x | ) does not change much

  30. The Baum-Welch Algorithm – comments Time Complexity: # iterations  O(K2N) • Guaranteed to increase the log likelihood of the model P( | x) = P(x, ) / P(x) = P(x | ) / ( P(x) P() ) • Not guaranteed to find globally best parameters Converges to local optimum, depending on initial conditions • Too many parameters / too large model: Overtraining

  31. Alternative: Viterbi Training Initialization: Same Iteration: • Perform Viterbi, to find * • Calculate Akl, Ek(b) according to * + pseudocounts • Calculate the new parameters akl, ek(b) Until convergence Notes: • Convergence is guaranteed – Why? • Does not maximize P(x | ) • In general, worse performance than Baum-Welch

More Related