CS5263 Bioinformatics


Presentation Transcript


  1. CS5263 Bioinformatics Lecture 12-14: Markov Chain and Hidden Markov Models

  2. Outline • Background on probability • Hidden Markov models • Algorithms • Applications

  3. Probability Basics • Definition (informal) • Experiment: e.g. toss a coin 10 times or sequence a genome • Outcome: a possible result of an experiment, e.g. HHHTTTHTTH or ACGTTAAACGTA • The sample space S of a random experiment is the set of all possible outcomes, e.g. {H, T}^10 • Event: any subset of the sample space, e.g. "> 4 heads" • Probabilities are numbers assigned to events that indicate how likely the event is to occur when a random experiment is performed • A probability law for a random experiment is a rule that assigns probabilities to the events of the experiment

  4. Example • For any event Ai: 0 ≤ P(Ai) ≤ 1 • P(S) = 1

  5. Probabilistic Calculus • P(A ∪ B) = P(A) + P(B) – P(A ∩ B) ("either A or B" vs. "both A and B") • If A, B are mutually exclusive: P(A ∩ B) = 0, so P(A ∪ B) = P(A) + P(B) • A and not(A) are mutually exclusive • Thus: P(not(A)) = P(Ac) = 1 – P(A) [Venn diagram of events A and B within the sample space S]

  6. Joint and conditional probability • The joint probability of two events A and B, P(A∩B) or simply P(A, B), is the probability that events A and B occur at the same time. • The conditional probability P(A|B) is the probability that A occurs given that B occurred. P(A | B) = P(A ∩ B) / P(B), so P(A ∩ B) = P(A | B) * P(B)

  7. Example • Roll a die • If I tell you the number is less than 4 • What is the probability that the number is even? P(d = even | d < 4) = P(d = even ∩ d < 4) / P(d < 4) = P(d = 2) / P(d = 1, 2, or 3) = (1/6) / (3/6) = 1/3
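A quick way to sanity-check this kind of conditional probability is to enumerate the outcomes directly. Below is a minimal Python sketch (not from the original slides); the event definitions are the ones used in the example.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
outcomes = range(1, 7)

def prob(event):
    """P(event) under a uniform distribution over the six faces."""
    return Fraction(sum(1 for d in outcomes if event(d)), len(outcomes))

p_even_and_lt4 = prob(lambda d: d % 2 == 0 and d < 4)  # P(even ∩ d<4) = 1/6
p_lt4 = prob(lambda d: d < 4)                          # P(d<4) = 3/6
print(p_even_and_lt4 / p_lt4)                          # P(even | d<4) = 1/3
```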

  8. Independence • P(A | B) = P(A ∩ B) / P(B) => P(A ∩ B) = P(B) * P(A | B) • A, B are independent iff • P(A ∩ B) = P(A) * P(B) • That is, P(A) = P(A | B) • Also implies that P(B) = P(B | A) • P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

  9. Examples • Are P(d = even) and P(d < 4) independent? No: P(d = even and d < 4) = 1/6 ≠ P(d = even) * P(d < 4) = 1/4, or equivalently P(d = even) = 1/2 ≠ P(d = even | d < 4) = 1/3 • If the die has 8 faces, will P(d = even) and P(d < 5) be independent?

  10. Theorem of total probability • Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S. • An event A can be represented as: A = (A ∩ B1) ∪ (A ∩ B2) ∪ … ∪ (A ∩ BN) • Since B1, B2, …, BN are mutually exclusive, P(A) = P(A∩B1) + P(A∩B2) + … + P(A∩BN) • And therefore P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN) = Σi P(A | Bi) * P(Bi) (marginalization / exhaustive conditionalization)

  11. Example • A loaded die: • P(6) = 0.5 • P(1) = … = P(5) = 0.1 • Prob of even number? P(even) = P(even | d < 6) * P (d<6) + P(even | d = 6) * P (d=6) = 2/5 * 0.5 + 1 * 0.5 = 0.7

  12. Another example • A box of dice: • 99% fair • 1% loaded • P(6) = 0.5. • P(1) = … = P(5) = 0.1 • Randomly pick a die and roll, P(6)? • P(6) = P(6 | F) * P(F) + P(6 | L) * P(L) • 1/6 * 0.99 + 0.5 * 0.01 = 0.17
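The calculation above is the theorem of total probability from slide 10 applied to the fair/loaded partition. A minimal numeric sketch (the variable names are mine, not from the slides):

```python
# Theorem of total probability: P(6) = P(6|F)*P(F) + P(6|L)*P(L)
p_fair, p_loaded = 0.99, 0.01
p6_given_fair, p6_given_loaded = 1 / 6, 0.5

p6 = p6_given_fair * p_fair + p6_given_loaded * p_loaded
print(round(p6, 3))  # -> 0.17
```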

  13. Chain rule • P(x1, x2, x3) = [P(x1, x2, x3) / P(x2, x3)] * [P(x2, x3) / P(x3)] * P(x3) = P(x1 | x2, x3) P(x2 | x3) P(x3) [graphical model: x3 → x2 → x1]

  14. Bayes theorem • P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A) => P(B | A) = P(A | B) * P(B) / P(A), where P(A | B) is the conditional probability (likelihood), P(B) is the prior of B, P(A) is the prior of A (normalizing constant), and P(B | A) is the posterior probability • This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics • Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

  15. Bayes theorem (cont'd) • Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj? • P(Bj | A) = P(A | Bj) * P(Bj) / P(A) = P(A | Bj) * P(Bj) / Σj P(A | Bj) * P(Bj) (by the theorem of total probability) • P(Bj | A): posterior probability; P(Bj): prior of Bj; P(A | Bj): likelihood; P(A): normalizing constant • Bj: different models. Having observed A, should you choose the model that maximizes P(Bj | A) or P(A | Bj)? It depends on how much you know about Bj!

  16. Example • Prosecutor's fallacy • Some crime happened • The criminal left no evidence except hair • The police got his DNA from his hair • An expert matched the DNA with someone's DNA in a database • The expert said both the false-positive and false-negative rates are 10^-6 • Can this be used as evidence of guilt against the suspect?

  17. Prosecutor's fallacy • False pos: P(match | innocent) = 10^-6 • False neg: P(no match | guilty) = 10^-6 • P(match | guilty) = 1 - 10^-6 ≈ 1 • P(no match | innocent) = 1 - 10^-6 ≈ 1 • P(guilty | match) = ?

  18. Prosecutor's fallacy P(g | m) = P(m | g) * P(g) / P(m) ≈ P(g) / P(m) • P(g): the prior probability for someone to be guilty with no DNA evidence • P(m): the probability of a DNA match • How do we get these two numbers? • We don't really care about P(m) • We want to compare two models: P(g | m) and P(i | m)

  19. Prosecutor's fallacy • P(i | m) = P(m | i) * P(i) / P(m) • P(g | m) = P(m | g) * P(g) / P(m) • Therefore P(i | m) / P(g | m) = P(m | i) / P(m | g) * P(i) / P(g) = 10^-6 * P(i) / P(g) • P(i) + P(g) = 1 • It is clear, therefore, that whether we can conclude the suspect is guilty depends on the prior probability P(g)

  20. Prosecutor's fallacy • How do you get P(g)? • It depends on what other information you have on the suspect • Say the suspect has no other connection with the crime, and the overall crime rate is 10^-7 • That's a reasonable prior for P(g): P(g) = 10^-7, P(i) ≈ 1 • P(i | m) / P(g | m) = 10^-6 * P(i) / P(g) = 10^-6 / 10^-7 = 10 • Or: P(i | m) = 0.91 and P(g | m) = 0.09 • The suspect is more likely to be innocent than guilty, given only the DNA samples
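The posterior odds on this slide are easy to reproduce with Bayes' rule. A hedged sketch, using the prior P(g) = 10^-7 assumed above:

```python
# Posterior odds of innocence vs. guilt given a DNA match
p_match_given_innocent = 1e-6    # false-positive rate
p_match_given_guilty = 1 - 1e-6  # ~ 1
p_guilty = 1e-7                  # prior from the overall crime rate (assumed above)
p_innocent = 1 - p_guilty        # ~ 1

odds = (p_match_given_innocent * p_innocent) / (p_match_given_guilty * p_guilty)
print(round(odds, 1))               # ~ 10: innocence is 10x more likely than guilt
print(round(odds / (1 + odds), 2))  # P(innocent | match) ~ 0.91
print(round(1 / (1 + odds), 2))     # P(guilty | match)   ~ 0.09
```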

  21. Another example • A test for a rare disease claims that it will report positive for 99.5% of people with the disease, and negative 99.9% of the time for those without. • The disease is present in the population at 1 in 100,000 • What is P(disease | positive test)? • P(D|P) / P(H|P) ≈ 0.01 • What is P(disease | negative test)? • P(D|N) / P(H|N) ≈ 5e-8
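The same Bayes-odds pattern reproduces the numbers on this slide. A minimal sketch, writing D for disease, H for healthy, and using the rates stated above:

```python
p_pos_given_d = 0.995  # sensitivity: P(positive | disease)
p_neg_given_h = 0.999  # specificity: P(negative | healthy)
p_d = 1e-5             # prevalence: 1 in 100,000
p_h = 1 - p_d

# Odds of disease vs. healthy given a positive test
odds_pos = (p_pos_given_d * p_d) / ((1 - p_neg_given_h) * p_h)
print(odds_pos)  # ~ 0.01: a positive test still leaves the disease 100x less likely

# Odds of disease vs. healthy given a negative test
odds_neg = ((1 - p_pos_given_d) * p_d) / (p_neg_given_h * p_h)
print(odds_neg)  # ~ 5e-8
```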

  22. Yet another example • We've talked about the casino's box of dice: 99% fair, 1% loaded (50% chance of a six) • We said that if we randomly pick a die and roll it, we have a 17% chance of getting a six • If we get three sixes in a row, what's the chance that the die is loaded? • How about five sixes in a row?

  23. P(loaded | 666) = P(666 | loaded) * P(loaded) / P(666) = 0.5^3 * 0.01 / (0.5^3 * 0.01 + (1/6)^3 * 0.99) = 0.21 • P(loaded | 66666) = P(66666 | loaded) * P(loaded) / P(66666) = 0.5^5 * 0.01 / (0.5^5 * 0.01 + (1/6)^5 * 0.99) = 0.71
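These posteriors follow from Bayes' rule with the 99% fair / 1% loaded prior. A minimal sketch that generalizes to any number of consecutive sixes:

```python
def p_loaded_given_sixes(n):
    """Posterior P(loaded | n sixes in a row) under the 99% fair / 1% loaded prior."""
    p_fair, p_loaded = 0.99, 0.01
    lik_loaded = 0.5 ** n    # loaded die: P(6) = 1/2
    lik_fair = (1 / 6) ** n  # fair die:   P(6) = 1/6
    return lik_loaded * p_loaded / (lik_loaded * p_loaded + lik_fair * p_fair)

print(round(p_loaded_given_sixes(3), 2))  # -> 0.21
print(round(p_loaded_given_sixes(5), 2))  # -> 0.71
```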

  24. Simple probabilistic models for DNA sequences • Assume nature generates a type of DNA sequence as follows: (1) take a box of dice, each with four faces {A, C, G, T}; (2) select a die suitable for the type of DNA; (3) roll it and append the symbol to a string; (4) repeat step 3 until all symbols have been generated. • Given a string, say X = "GATTCCAA…", and two dice • M1 has the distribution pA = pC = pG = pT = 0.25 • M2 has the distribution pA = pT = 0.20, pC = pG = 0.30 • What is the probability of the sequence being generated by M1 or M2?

  25. Model selection by maximum likelihood criterion • X = GATTCCAA • P(X | M1) = P(x1, x2, …, xn | M1) = ∏i=1..n P(xi | M1) = 0.25^8 • P(X | M2) = P(x1, x2, …, xn | M2) = ∏i=1..n P(xi | M2) = 0.2^5 * 0.3^3 • P(X | M1) / P(X | M2) = ∏i P(xi | M1) / P(xi | M2) = (0.25/0.2)^5 * (0.25/0.3)^3 • Log likelihood ratio (LLR) = Σi log(P(xi | M1) / P(xi | M2)) = nA SA + nC SC + nG SG + nT ST = 5 * log(1.25) + 3 * log(0.833) = 0.57, where Si = log(P(i | M1) / P(i | M2)) for i = A, C, G, T and ni is the count of nucleotide i in X
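A small sketch of this log-likelihood-ratio test (natural log, as on the slide); the two dictionaries are just the M1/M2 base frequencies given on the previous slide:

```python
import math

m1 = {b: 0.25 for b in "ACGT"}                 # uniform model
m2 = {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}  # GC-rich model

def llr(x, p1, p2):
    """Sum of per-base log-likelihood ratios log(P(xi|M1) / P(xi|M2))."""
    return sum(math.log(p1[b] / p2[b]) for b in x)

print(round(llr("GATTCCAA", m1, m2), 2))  # -> 0.57, slightly favoring M1
```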

  26. Model selection by maximum a posteriori probability criterion • Take the prior probabilities of M1 and M2 into consideration if known: log(P(M1|X) / P(M2|X)) = LLR + log(P(M1)) – log(P(M2)) = nA SA + nC SC + nG SG + nT ST + log(P(M1)) – log(P(M2)) • If P(M1) ≈ P(M2), results will be similar to the LLR test

  27. Markov models for DNA sequences We have assumed independence of nucleotides in different positions - unrealistic in biology

  28. Example: CpG islands • CpG = 2 adjacent nucleotides on the same strand (not a base pair; the "p" stands for the phosphodiester bond of the DNA backbone) • In mammalian promoter regions, CpG is more frequent than in other regions of the genome • CpG islands often mark gene-rich regions

  29. CpG islands • CpG Islands • More CpG than elsewhere • More C & G than elsewhere, too • Typical length: a few hundred to a few thousand bp • Questions • Is a short sequence (say, 200 bp) a CpG island or not? • Given a long sequence (say, 10-100 kb), where are the CpG islands?

  30. Markov models • A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values • First order (k=1): P(xi | x1, …, xi-1) = P(xi | xi-1) • Second order (k=2): P(xi | x1, …, xi-1) = P(xi | xi-2, xi-1) • 0th order: P(xi | x1, …, xi-1) = P(xi) (independence)

  31. First order Markov model

  32. A 1st order Markov model for CpG islands • Essentially a finite state automaton (FSA) • Transitions are probabilistic (instead of deterministic) • 4 states: A, C, G, T • 16 transitions: ast = P(xi = t | xi-1 = s) • Begin/End states

  33. Probability of emitting sequence x: P(x) = P(x1) P(x2 | x1) … P(xL | xL-1) = aBx1 ax1x2 … axL-1xL

  34. Probability of a sequence • What’s the probability of ACGGCTA in this model? P(A) * P(C|A) * P(G|C) … P(A|T) = aBA aAC aCG …aTA • Equivalent: follow the path in the automaton, and multiply the transition probabilities on the path
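In code, evaluating a sequence under a first-order Markov chain is a product of transition probabilities along this path (summed in log space to avoid underflow). A minimal sketch; the uniform transition table here is a placeholder, not the actual CpG table:

```python
import math

def log_prob_markov(x, a, begin="B"):
    """log P(x) under a first-order Markov chain with transition dict a[s][t];
    a[begin][t] plays the role of the start probability for the first symbol."""
    logp, prev = 0.0, begin
    for t in x:
        logp += math.log(a[prev][t])
        prev = t
    return logp

# Placeholder example: a uniform chain, so every base contributes log(1/4)
uniform = {s: {t: 0.25 for t in "ACGT"} for s in list("ACGT") + ["B"]}
print(log_prob_markov("ACGGCTA", uniform))  # = 7 * log(0.25)
```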

  35. Training • Estimate the parameters of the model • CpG+ model: count the transition frequencies from known CpG islands • CpG- model: likewise, count the transition frequencies from sequences without CpG islands • ast = #(s→t) / #(s→·), which gives two transition tables, a+st and a-st
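Maximum-likelihood training of either chain is just counting and normalizing transitions in labeled training sequences. A sketch, assuming the training data is a list of plain DNA strings; the Laplace pseudocount guards against zero counts (cf. the pseudo-counts mentioned on slide 45):

```python
from collections import defaultdict

def train_markov(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate a_st = #(s->t) / #(s->.) from a list of sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1.0
    a = {}
    for s in alphabet:
        row = {t: counts[s][t] + pseudocount for t in alphabet}
        total = sum(row.values())
        a[s] = {t: row[t] / total for t in alphabet}
    return a

# e.g. a_plus = train_markov(cpg_island_seqs); a_minus = train_markov(background_seqs)
```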

  36. Discrimination / Classification • Given a sequence, is it a CpG island or not? • Log likelihood ratio (LLR): βst = log2(a+st / a-st), e.g. βCG = log2(0.274/0.078) = 1.812 and βBA = log2(0.591/1.047) = -0.825

  37. Example • X = ACGGCGACGTCG • S(X) = βBA + βAC + βCG + βGG + βGC + βCG + βGA + βAC + βCG + βGT + βTC + βCG = βBA + 2βAC + 4βCG + βGG + βGC + βGA + βGT + βTC = -0.825 + 2*0.419 + 4*1.812 + 0.313 + 0.461 - 0.624 - 0.730 + 0.573 = 7.25
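The score is just a sum of per-transition β values. A sketch reproducing the arithmetic above; the dictionary contains only the eight β values quoted on this slide, so it is not the full table from the Durbin book:

```python
# beta[(s, t)] = log2(a+_st / a-_st); only the values quoted on this slide
beta = {
    ("B", "A"): -0.825, ("A", "C"): 0.419, ("C", "G"): 1.812, ("G", "G"): 0.313,
    ("G", "C"): 0.461, ("G", "A"): -0.624, ("G", "T"): -0.730, ("T", "C"): 0.573,
}

def cpg_score(x):
    """S(X) = sum of beta over the transitions B->x1, x1->x2, ..., x(n-1)->xn."""
    path = ["B"] + list(x)
    return sum(beta[(s, t)] for s, t in zip(path, path[1:]))

print(round(cpg_score("ACGGCGACGTCG"), 2))  # -> 7.25
```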

  38. CpG island scores Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

  39. Questions • Q1: given a short sequence, is it more likely from CpG+ model or CpG- model? • Q2: Given a long sequence, where are the CpG islands (if any)? • Approach 1: score (e.g.) 100 bp windows • Pro: simple • Con: arbitrary, fixed length, inflexible • Approach 2: combine +/- models.

  40. Combined model • Given a long sequence, predict which state each position is in. (states are hidden: Hidden Markov model)

  41. Hidden Markov Model (HMM) • Introduced in the 70's for speech recognition • Has been shown to be a good model for biosequences • Alignment • Gene prediction • Protein domain analysis • … • The observed data is a sequence that can be modeled by a Markov chain • The state path is unknown • Model parameters known or unknown • Observed data: emission sequence X = (x1x2…xn) • Hidden data: state sequence Π = (π1π2…πn)

  42. Hidden Markov model (HMM) Definition: A hidden Markov model (HMM) is a five-tuple • Alphabet = { b1, b2, …, bM } • Set of states Q = { 1, ..., K } • Transition probabilities between any two states: aij = transition prob from state i to state j; ai1 + … + aiK = 1, for all states i = 1…K • Start probabilities a0i: a01 + … + a0K = 1 • Emission probabilities within each state: ek(b) = P(xi = b | πi = k); ek(b1) + … + ek(bM) = 1, for all states k = 1…K
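One way to make the five-tuple concrete is a small container class. This is a sketch of a possible representation, not code from the lecture:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HMM:
    states: List[str]                   # Q = {1, ..., K}
    alphabet: List[str]                 # {b1, ..., bM}
    start: Dict[str, float]             # a_0i, sums to 1
    trans: Dict[str, Dict[str, float]]  # a_ij, each row sums to 1
    emit: Dict[str, Dict[str, float]]   # e_k(b), each row sums to 1

    def check(self, tol=1e-9):
        """Verify the normalization constraints in the definition above."""
        assert abs(sum(self.start.values()) - 1) < tol
        for k in self.states:
            assert abs(sum(self.trans[k].values()) - 1) < tol
            assert abs(sum(self.emit[k].values()) - 1) < tol
```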

  43. HMM for the Dishonest Casino A casino has two dice: • Fair die P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • Loaded die P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 P(6) = 1/2 Casino player switches back and forth between fair and loaded die once in a while

  44. The dishonest casino model • Transition probabilities: aFF = 0.95, aFL = 0.05, aLF = 0.05, aLL = 0.95 • Emission probabilities, Fair die: eF(1) = eF(2) = eF(3) = eF(4) = eF(5) = eF(6) = 1/6 • Emission probabilities, Loaded die: eL(1) = eL(2) = eL(3) = eL(4) = eL(5) = 1/10, eL(6) = 1/2

  45. Simple scenario • You don't know the probabilities • The casino player lets you observe which die he/she uses for every roll • The "state" of each roll is known • Training (parameter estimation): how often does the casino player switch dice? How "loaded" is the loaded die? • Simply count the frequency with which each face appeared and the frequency of die switching • May add pseudo-counts if the number of observations is small

  46. More complex scenarios • The “state” of each roll is unknown: • You are given the results of a series of rolls • You don’t know which number is generated by which die • You may or may not know the parameters • How “loaded” is the loaded die • How frequently the casino player switches dice

  47. The three main questions on HMMs • Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P(x, π | M) • Evaluation: GIVEN an HMM M and a sequence x, FIND P(x | M) [or P(x) for simplicity] • Learning: GIVEN an HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (ei(.), aij) that maximize P(x | θ) • P(x, π | M) is sometimes written as P(x, π) for simplicity.

  48. Question # 1 – Decoding GIVEN an HMM with parameters, and a sequence of rolls by the casino player: 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

  49. A parse of a sequence • Given a sequence x = x1……xN and an HMM with K states, a parse of x is a sequence of states π = π1, ……, πN [trellis diagram: each of the K states at each of the N positions, with one path highlighted]

  50. Probability of a parse • Given a sequence x = x1……xN and a parse π = π1, ……, πN, how likely is the parse (given our HMM)? P(x, π) = P(x1, …, xN, π1, ……, πN) = P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) …… P(x2, π2 | π1) P(x1, π1) = P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1) = a0π1 aπ1π2 …… aπN-1πN eπ1(x1) …… eπN(xN)
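This factorization translates directly into code: one start probability, then a transition and an emission term per position (in log space to avoid underflow). A sketch using the dishonest-casino parameters from slide 44; the uniform start probabilities are an assumption, since the slide does not give a0F and a0L:

```python
import math

# Dishonest casino parameters (slide 44); start probabilities assumed uniform
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def log_joint(x, pi):
    """log P(x, pi) = log a_0pi1 + sum_i log a_pi(i-1)pi(i) + sum_i log e_pi(i)(xi)."""
    logp = math.log(start[pi[0]]) + math.log(emit[pi[0]][x[0]])
    for i in range(1, len(x)):
        logp += math.log(trans[pi[i - 1]][pi[i]]) + math.log(emit[pi[i]][x[i]])
    return logp

print(log_joint("6646", "LLLF"))  # joint log-probability of one candidate parse
```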
