CS5263 Bioinformatics


Presentation Transcript


  1. CS5263 Bioinformatics Lecture 12-14: Markov Chain and Hidden Markov Models

  2. Outline • Background on probability • Hidden Markov models • Algorithms • Applications

  3. Probability Basics • Definition (informal) • Experiment: e.g. toss a coin 10 times or sequence a genome • Outcome: a possible result of an experiment, e.g. HHHTTTHTTH or ACGTTAAACGTA • The sample space S of a random experiment is the set of all possible outcomes, e.g. {H, T}^10 • Event: any subset of the sample space, e.g. "> 4 heads" • Probabilities are numbers assigned to events that indicate how likely the event is to occur when a random experiment is performed • A probability law for a random experiment is a rule that assigns probabilities to the events of the experiment

  4. Example • For any event Ai: 0 ≤ P(Ai) ≤ 1 • P(S) = 1

  5. Probabilistic Calculus • P(A ∪ B) = P(A) + P(B) – P(A ∩ B) ("either A or B" vs. "both A and B") • If A, B are mutually exclusive: P(A ∩ B) = 0, so P(A ∪ B) = P(A) + P(B) • A and not(A) are mutually exclusive • Thus: P(not(A)) = P(Ac) = 1 – P(A) [Venn diagram of events A and B within the sample space S]

  6. Joint and conditional probability • The joint probability of two events A and B, P(A∩B) or simply P(A, B), is the probability that events A and B occur at the same time. • The conditional probability P(A|B) is the probability that A occurs given that B occurred. P(A | B) = P(A ∩ B) / P(B), so P(A ∩ B) = P(A | B) * P(B)

  7. Example • Roll a die • If I tell you the number is less than 4 • What is the probability that the number is even? P(d = even | d < 4) = P(d = even ∩ d < 4) / P(d < 4) = P(d = 2) / P(d = 1, 2, or 3) = (1/6) / (3/6) = 1/3
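A quick way to sanity-check this kind of conditional probability is to enumerate the outcomes directly. Below is a minimal Python sketch (not from the original slides); the event definitions are the ones used in the example.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
outcomes = range(1, 7)

def prob(event):
    """P(event) under a uniform distribution over the six faces."""
    return Fraction(sum(1 for d in outcomes if event(d)), len(outcomes))

p_even_and_lt4 = prob(lambda d: d % 2 == 0 and d < 4)  # P(even ∩ d<4) = 1/6
p_lt4 = prob(lambda d: d < 4)                          # P(d<4) = 3/6
print(p_even_and_lt4 / p_lt4)                          # P(even | d<4) = 1/3
```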

  8. Independence • P(A | B) = P(A ∩ B) / P(B) => P(A ∩ B) = P(B) * P(A | B) • A, B are independent iff • P(A ∩ B) = P(A) * P(B) • That is, P(A) = P(A | B) • Also implies that P(B) = P(B | A) • P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

  9. Examples • Are P(d = even) and P(d < 4) independent? No: P(d = even and d < 4) = 1/6 ≠ P(d = even) * P(d < 4) = 1/4, or equivalently P(d = even) = 1/2 ≠ P(d = even | d < 4) = 1/3 • If the die has 8 faces, will P(d = even) and P(d < 5) be independent?

  10. Theorem of total probability • Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S. • An event A can be represented as: A = (A ∩ B1) ∪ (A ∩ B2) ∪ … ∪ (A ∩ BN) • Since B1, B2, …, BN are mutually exclusive, P(A) = P(A∩B1) + P(A∩B2) + … + P(A∩BN) • And therefore P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN) = Σi P(A | Bi) * P(Bi) (marginalization / exhaustive conditionalization)

  11. Example • A loaded die: • P(6) = 0.5 • P(1) = … = P(5) = 0.1 • Prob of even number? P(even) = P(even | d < 6) * P (d<6) + P(even | d = 6) * P (d=6) = 2/5 * 0.5 + 1 * 0.5 = 0.7

  12. Another example • A box of dice: • 99% fair • 1% loaded • P(6) = 0.5. • P(1) = … = P(5) = 0.1 • Randomly pick a die and roll, P(6)? • P(6) = P(6 | F) * P(F) + P(6 | L) * P(L) • 1/6 * 0.99 + 0.5 * 0.01 = 0.17
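The calculation above is the theorem of total probability from slide 10 applied to the fair/loaded partition. A minimal numeric sketch (the variable names are mine, not from the slides):

```python
# Theorem of total probability: P(6) = P(6|F)*P(F) + P(6|L)*P(L)
p_fair, p_loaded = 0.99, 0.01
p6_given_fair, p6_given_loaded = 1 / 6, 0.5

p6 = p6_given_fair * p_fair + p6_given_loaded * p_loaded
print(round(p6, 3))  # -> 0.17
```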

  13. Chain rule • P(x1, x2, x3) = [P(x1, x2, x3) / P(x2, x3)] * [P(x2, x3) / P(x3)] * P(x3) = P(x1 | x2, x3) P(x2 | x3) P(x3) [graphical model: x3 → x2 → x1]

  14. Bayes theorem • P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A) => P(B | A) = P(A | B) * P(B) / P(A), where P(A | B) is the conditional probability (likelihood), P(B) is the prior of B, P(A) is the prior of A (normalizing constant), and P(B | A) is the posterior probability • This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics • Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

  15. Bayes theorem (cont'd) • Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj? • P(Bj | A) = P(A | Bj) * P(Bj) / P(A) = P(A | Bj) * P(Bj) / Σj P(A | Bj) * P(Bj) (by the theorem of total probability) • P(Bj | A): posterior probability; P(Bj): prior of Bj; P(A | Bj): likelihood; P(A): normalizing constant • Bj: different models. Having observed A, should you choose the model that maximizes P(Bj | A) or P(A | Bj)? It depends on how much you know about Bj!

  16. Example • Prosecutor's fallacy • Some crime happened • The criminal left no evidence except hair • The police got his DNA from his hair • An expert matched the DNA with someone's DNA in a database • The expert said both the false-positive and false-negative rates are 10^-6 • Can this be used as evidence of guilt against the suspect?

  17. Prosecutor's fallacy • False pos: P(match | innocent) = 10^-6 • False neg: P(no match | guilty) = 10^-6 • P(match | guilty) = 1 - 10^-6 ≈ 1 • P(no match | innocent) = 1 - 10^-6 ≈ 1 • P(guilty | match) = ?

  18. Prosecutor's fallacy P(g | m) = P(m | g) * P(g) / P(m) ≈ P(g) / P(m) • P(g): the prior probability for someone to be guilty with no DNA evidence • P(m): the probability of a DNA match • How do we get these two numbers? • We don't really care about P(m) • We want to compare two models: P(g | m) and P(i | m)

  19. Prosecutor's fallacy • P(i | m) = P(m | i) * P(i) / P(m) • P(g | m) = P(m | g) * P(g) / P(m) • Therefore P(i | m) / P(g | m) = P(m | i) / P(m | g) * P(i) / P(g) = 10^-6 * P(i) / P(g) • P(i) + P(g) = 1 • It is clear, therefore, that whether we can conclude the suspect is guilty depends on the prior probability P(g)

  20. Prosecutor's fallacy • How do you get P(g)? • It depends on what other information you have on the suspect • Say the suspect has no other connection with the crime, and the overall crime rate is 10^-7 • That's a reasonable prior for P(g): P(g) = 10^-7, P(i) ≈ 1 • P(i | m) / P(g | m) = 10^-6 * P(i) / P(g) = 10^-6 / 10^-7 = 10 • Or: P(i | m) = 0.91 and P(g | m) = 0.09 • The suspect is more likely to be innocent than guilty, given only the DNA samples
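The posterior odds on this slide are easy to reproduce with Bayes' rule. A hedged sketch, using the prior P(g) = 10^-7 assumed above:

```python
# Posterior odds of innocence vs. guilt given a DNA match
p_match_given_innocent = 1e-6    # false-positive rate
p_match_given_guilty = 1 - 1e-6  # ~ 1
p_guilty = 1e-7                  # prior from the overall crime rate (assumed above)
p_innocent = 1 - p_guilty        # ~ 1

odds = (p_match_given_innocent * p_innocent) / (p_match_given_guilty * p_guilty)
print(round(odds, 1))               # ~ 10: innocence is 10x more likely than guilt
print(round(odds / (1 + odds), 2))  # P(innocent | match) ~ 0.91
print(round(1 / (1 + odds), 2))     # P(guilty | match)   ~ 0.09
```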

  21. Another example • A test for a rare disease claims that it will report positive for 99.5% of people with the disease, and negative 99.9% of the time for those without. • The disease is present in the population at 1 in 100,000 • What is P(disease | positive test)? • P(D|P) / P(H|P) ≈ 0.01 • What is P(disease | negative test)? • P(D|N) / P(H|N) ≈ 5e-8
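The same Bayes-odds pattern reproduces the numbers on this slide. A minimal sketch, writing D for disease, H for healthy, and using the rates stated above:

```python
p_pos_given_d = 0.995  # sensitivity: P(positive | disease)
p_neg_given_h = 0.999  # specificity: P(negative | healthy)
p_d = 1e-5             # prevalence: 1 in 100,000
p_h = 1 - p_d

# Odds of disease vs. healthy given a positive test
odds_pos = (p_pos_given_d * p_d) / ((1 - p_neg_given_h) * p_h)
print(odds_pos)  # ~ 0.01: a positive test still leaves the disease 100x less likely

# Odds of disease vs. healthy given a negative test
odds_neg = ((1 - p_pos_given_d) * p_d) / (p_neg_given_h * p_h)
print(odds_neg)  # ~ 5e-8
```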

  22. Yet another example • We've talked about the casino's box of dice: 99% fair, 1% loaded (50% chance of a six) • We said that if we randomly pick a die and roll it, we have a 17% chance of getting a six • If we get three sixes in a row, what's the chance that the die is loaded? • How about five sixes in a row?

  23. P(loaded | 666) = P(666 | loaded) * P(loaded) / P(666) = 0.5^3 * 0.01 / (0.5^3 * 0.01 + (1/6)^3 * 0.99) = 0.21 • P(loaded | 66666) = P(66666 | loaded) * P(loaded) / P(66666) = 0.5^5 * 0.01 / (0.5^5 * 0.01 + (1/6)^5 * 0.99) = 0.71
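These posteriors follow from Bayes' rule with the 99% fair / 1% loaded prior. A minimal sketch that generalizes to any number of consecutive sixes:

```python
def p_loaded_given_sixes(n):
    """Posterior P(loaded | n sixes in a row) under the 99% fair / 1% loaded prior."""
    p_fair, p_loaded = 0.99, 0.01
    lik_loaded = 0.5 ** n    # loaded die: P(6) = 1/2
    lik_fair = (1 / 6) ** n  # fair die:   P(6) = 1/6
    return lik_loaded * p_loaded / (lik_loaded * p_loaded + lik_fair * p_fair)

print(round(p_loaded_given_sixes(3), 2))  # -> 0.21
print(round(p_loaded_given_sixes(5), 2))  # -> 0.71
```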

  24. Simple probabilistic models for DNA sequences • Assume nature generates a type of DNA sequence as follows: (1) take a box of dice, each with four faces {A, C, G, T}; (2) select a die suitable for the type of DNA; (3) roll it and append the symbol to a string; (4) repeat step 3 until all symbols have been generated. • Given a string, say X = "GATTCCAA…", and two dice • M1 has the distribution pA = pC = pG = pT = 0.25 • M2 has the distribution pA = pT = 0.20, pC = pG = 0.30 • What is the probability of the sequence being generated by M1 or M2?

  25. Model selection by maximum likelihood criterion • X = GATTCCAA • P(X | M1) = P(x1, x2, …, xn | M1) = ∏i=1..n P(xi | M1) = 0.25^8 • P(X | M2) = P(x1, x2, …, xn | M2) = ∏i=1..n P(xi | M2) = 0.2^5 * 0.3^3 • P(X | M1) / P(X | M2) = ∏i P(xi | M1) / P(xi | M2) = (0.25/0.2)^5 * (0.25/0.3)^3 • Log likelihood ratio (LLR) = Σi log(P(xi | M1) / P(xi | M2)) = nA SA + nC SC + nG SG + nT ST = 5 * log(1.25) + 3 * log(0.833) = 0.57, where Si = log(P(i | M1) / P(i | M2)) for i = A, C, G, T and ni is the count of nucleotide i in X
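A small sketch of this log-likelihood-ratio test (natural log, as on the slide); the two dictionaries are just the M1/M2 base frequencies given on the previous slide:

```python
import math

m1 = {b: 0.25 for b in "ACGT"}                 # uniform model
m2 = {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}  # GC-rich model

def llr(x, p1, p2):
    """Sum of per-base log-likelihood ratios log(P(xi|M1) / P(xi|M2))."""
    return sum(math.log(p1[b] / p2[b]) for b in x)

print(round(llr("GATTCCAA", m1, m2), 2))  # -> 0.57, slightly favoring M1
```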

  26. Model selection by maximum a posteriori probability criterion • Take the prior probabilities of M1 and M2 into consideration if known: log(P(M1|X) / P(M2|X)) = LLR + log(P(M1)) – log(P(M2)) = nA SA + nC SC + nG SG + nT ST + log(P(M1)) – log(P(M2)) • If P(M1) ≈ P(M2), results will be similar to the LLR test

  27. Markov models for DNA sequences We have assumed independence of nucleotides in different positions - unrealistic in biology

  28. Example: CpG islands • CpG = 2 adjacent nucleotides on the same strand (not a base pair; the "p" stands for the phosphodiester bond of the DNA backbone) • In mammalian promoter regions, CpG is more frequent than in other regions of the genome • CpG islands often mark gene-rich regions

  29. CpG islands • CpG Islands • More CpG than elsewhere • More C & G than elsewhere, too • Typical length: a few hundred to a few thousand bp • Questions • Is a short sequence (say, 200 bp) a CpG island or not? • Given a long sequence (say, 10-100 kb), where are the CpG islands?

  30. Markov models • A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values • First order (k=1): P(xi | x1, …, xi-1) = P(xi | xi-1) • Second order (k=2): P(xi | x1, …, xi-1) = P(xi | xi-2, xi-1) • 0th order: P(xi | x1, …, xi-1) = P(xi) (independence)

  31. First order Markov model

  32. A 1st order Markov model for CpG islands • Essentially a finite state automaton (FSA) • Transitions are probabilistic (instead of deterministic) • 4 states: A, C, G, T • 16 transitions: ast = P(xi = t | xi-1 = s) • Begin/End states

  33. Probability of emitting sequence x: P(x) = P(x1) P(x2 | x1) … P(xL | xL-1) = aBx1 ax1x2 … axL-1xL

  34. Probability of a sequence • What’s the probability of ACGGCTA in this model? P(A) * P(C|A) * P(G|C) … P(A|T) = aBA aAC aCG …aTA • Equivalent: follow the path in the automaton, and multiply the transition probabilities on the path
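In code, evaluating a sequence under a first-order Markov chain is a product of transition probabilities along this path (summed in log space to avoid underflow). A minimal sketch; the uniform transition table here is a placeholder, not the actual CpG table:

```python
import math

def log_prob_markov(x, a, begin="B"):
    """log P(x) under a first-order Markov chain with transition dict a[s][t];
    a[begin][t] plays the role of the start probability for the first symbol."""
    logp, prev = 0.0, begin
    for t in x:
        logp += math.log(a[prev][t])
        prev = t
    return logp

# Placeholder example: a uniform chain, so every base contributes log(1/4)
uniform = {s: {t: 0.25 for t in "ACGT"} for s in list("ACGT") + ["B"]}
print(log_prob_markov("ACGGCTA", uniform))  # = 7 * log(0.25)
```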

  35. Training • Estimate the parameters of the model • CpG+ model: count the transition frequencies from known CpG islands • CpG- model: likewise, count the transition frequencies from sequences without CpG islands • ast = #(s→t) / #(s→·), which gives two transition tables, a+st and a-st
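Maximum-likelihood training of either chain is just counting and normalizing transitions in labeled training sequences. A sketch, assuming the training data is a list of plain DNA strings; the Laplace pseudocount guards against zero counts (cf. the pseudo-counts mentioned on slide 45):

```python
from collections import defaultdict

def train_markov(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate a_st = #(s->t) / #(s->.) from a list of sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1.0
    a = {}
    for s in alphabet:
        row = {t: counts[s][t] + pseudocount for t in alphabet}
        total = sum(row.values())
        a[s] = {t: row[t] / total for t in alphabet}
    return a

# e.g. a_plus = train_markov(cpg_island_seqs); a_minus = train_markov(background_seqs)
```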

  36. Discrimination / Classification • Given a sequence, is it a CpG island or not? • Log likelihood ratio (LLR): βst = log2(a+st / a-st), e.g. βCG = log2(0.274/0.078) = 1.812 and βBA = log2(0.591/1.047) = -0.825

  37. Example • X = ACGGCGACGTCG • S(X) = βBA + βAC + βCG + βGG + βGC + βCG + βGA + βAC + βCG + βGT + βTC + βCG = βBA + 2βAC + 4βCG + βGG + βGC + βGA + βGT + βTC = -0.825 + 2*0.419 + 4*1.812 + 0.313 + 0.461 - 0.624 - 0.730 + 0.573 = 7.25
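The score is just a sum of per-transition β values. A sketch reproducing the arithmetic above; the dictionary contains only the eight β values quoted on this slide, so it is not the full table from the Durbin book:

```python
# beta[(s, t)] = log2(a+_st / a-_st); only the values quoted on this slide
beta = {
    ("B", "A"): -0.825, ("A", "C"): 0.419, ("C", "G"): 1.812, ("G", "G"): 0.313,
    ("G", "C"): 0.461, ("G", "A"): -0.624, ("G", "T"): -0.730, ("T", "C"): 0.573,
}

def cpg_score(x):
    """S(X) = sum of beta over the transitions B->x1, x1->x2, ..., x(n-1)->xn."""
    path = ["B"] + list(x)
    return sum(beta[(s, t)] for s, t in zip(path, path[1:]))

print(round(cpg_score("ACGGCGACGTCG"), 2))  # -> 7.25
```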

  38. CpG island scores Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

  39. Questions • Q1: given a short sequence, is it more likely from CpG+ model or CpG- model? • Q2: Given a long sequence, where are the CpG islands (if any)? • Approach 1: score (e.g.) 100 bp windows • Pro: simple • Con: arbitrary, fixed length, inflexible • Approach 2: combine +/- models.

  40. Combined model • Given a long sequence, predict which state each position is in. (states are hidden: Hidden Markov model)

  41. Hidden Markov Model (HMM) • Introduced in the 70's for speech recognition • Has been shown to be a good model for biosequences • Alignment • Gene prediction • Protein domain analysis • … • The observed data is a sequence that can be modeled by a Markov chain • The state path is unknown • Model parameters known or unknown • Observed data: emission sequence X = (x1x2…xn) • Hidden data: state sequence Π = (π1π2…πn)

  42. Hidden Markov model (HMM) Definition: A hidden Markov model (HMM) is a five-tuple • Alphabet = { b1, b2, …, bM } • Set of states Q = { 1, ..., K } • Transition probabilities between any two states: aij = transition prob from state i to state j; ai1 + … + aiK = 1, for all states i = 1…K • Start probabilities a0i: a01 + … + a0K = 1 • Emission probabilities within each state: ek(b) = P(xi = b | πi = k); ek(b1) + … + ek(bM) = 1, for all states k = 1…K
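One way to make the five-tuple concrete is a small container class. This is a sketch of a possible representation, not code from the lecture:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HMM:
    states: List[str]                   # Q = {1, ..., K}
    alphabet: List[str]                 # {b1, ..., bM}
    start: Dict[str, float]             # a_0i, sums to 1
    trans: Dict[str, Dict[str, float]]  # a_ij, each row sums to 1
    emit: Dict[str, Dict[str, float]]   # e_k(b), each row sums to 1

    def check(self, tol=1e-9):
        """Verify the normalization constraints in the definition above."""
        assert abs(sum(self.start.values()) - 1) < tol
        for k in self.states:
            assert abs(sum(self.trans[k].values()) - 1) < tol
            assert abs(sum(self.emit[k].values()) - 1) < tol
```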

  43. HMM for the Dishonest Casino A casino has two dice: • Fair die P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • Loaded die P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 P(6) = 1/2 Casino player switches back and forth between fair and loaded die once in a while

  44. The dishonest casino model • Transition probabilities: aFF = 0.95, aFL = 0.05, aLF = 0.05, aLL = 0.95 • Emission probabilities, Fair die: eF(1) = eF(2) = eF(3) = eF(4) = eF(5) = eF(6) = 1/6 • Emission probabilities, Loaded die: eL(1) = eL(2) = eL(3) = eL(4) = eL(5) = 1/10, eL(6) = 1/2

  45. Simple scenario • You don't know the probabilities • The casino player lets you observe which die he/she uses for every roll • The "state" of each roll is known • Training (parameter estimation): how often does the casino player switch dice? How "loaded" is the loaded die? • Simply count the frequency with which each face appeared and the frequency of die switching • May add pseudo-counts if the number of observations is small

  46. More complex scenarios • The “state” of each roll is unknown: • You are given the results of a series of rolls • You don’t know which number is generated by which die • You may or may not know the parameters • How “loaded” is the loaded die • How frequently the casino player switches dice

  47. The three main questions on HMMs • Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P(x, π | M) • Evaluation: GIVEN an HMM M and a sequence x, FIND P(x | M) [or P(x) for simplicity] • Learning: GIVEN an HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (ei(.), aij) that maximize P(x | θ) • P(x, π | M) is sometimes written as P(x, π) for simplicity.

  48. Question # 1 – Decoding GIVEN an HMM with parameters, and a sequence of rolls by the casino player: 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

  49. A parse of a sequence • Given a sequence x = x1……xN and an HMM with K states, a parse of x is a sequence of states π = π1, ……, πN [trellis diagram: each of the K states at each of the N positions, with one path highlighted]

  50. Probability of a parse • Given a sequence x = x1……xN and a parse π = π1, ……, πN, how likely is the parse (given our HMM)? P(x, π) = P(x1, …, xN, π1, ……, πN) = P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) …… P(x2, π2 | π1) P(x1, π1) = P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1) = a0π1 aπ1π2 …… aπN-1πN eπ1(x1) …… eπN(xN)
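This factorization translates directly into code: one start probability, then a transition and an emission term per position (in log space to avoid underflow). A sketch using the dishonest-casino parameters from slide 44; the uniform start probabilities are an assumption, since the slide does not give a0F and a0L:

```python
import math

# Dishonest casino parameters (slide 44); start probabilities assumed uniform
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def log_joint(x, pi):
    """log P(x, pi) = log a_0pi1 + sum_i log a_pi(i-1)pi(i) + sum_i log e_pi(i)(xi)."""
    logp = math.log(start[pi[0]]) + math.log(emit[pi[0]][x[0]])
    for i in range(1, len(x)):
        logp += math.log(trans[pi[i - 1]][pi[i]]) + math.log(emit[pi[i]][x[i]])
    return logp

print(log_joint("6646", "LLLF"))  # joint log-probability of one candidate parse
```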
