CS5263 Bioinformatics

CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

In the next few lectures • Hidden Markov Models • The theory • Probabilistic treatment of sequence alignments using HMMs • Application of HMMs to biological sequence modeling and discovery of features such as genes

Markov models • A sequence of random variables is a k-th order Markov chain if, for all i, ithvalue is independent of all but the previous k values: • First order: • Second order: • 0th order: (independence)

Probability to generate a sequence with a 1st Markov model

CpG islands • For biological reasons, CpG (C followed by G) is very rare in mammal genomes • But relatively more common in promoter regions • Detecting CpG islands help predict genes • Model: consider all 16 conditional probabilities • P(G | C), P(T | C), P(A | C), P(C | C), P(G | T) …

A 1st order Markov model for CpG islands • Essentially a finite state automaton • Transitions are probabilistic • 4 states: A, C, G, T • 16 transition: ast = P(xi = t | xi-1 = s) • Begin/End states (for convenience)

Probability of a sequence • What’s the probability of ACGGCTA? • For 0-th order Markov model, simply P(A)P(C)… • For 1st order Markov model like the one above P(A) * P(C|A) * P(G|C) … P(A|T) = aBA aAC aCG …aTA

Probability of a sequence • What’s the probability of ACGGCTA? P(A) * P(C|A) * P(G|C) … P(A|T) = aBA aAC aCG …aTA • Equivalent: follow the path in the automaton, and multiply the probability of each arc on the path • Given a sequence, there is only one path you can walk.

Training • Estimate the parameters of the model • Training sequences: sequences with labels (CpG islands or not CpG islands) • Test sequences: sequences for test (labels removed) • Two models • CpG model learned from known CpG islands • Background model learned from known non-CpG islands • What to learn? • Transition frequencies • ast = #(s→t) / #(s →)

Training • Parameters learned from known CpG islands and non-CpG islands

Discrimination / Classification • Given a sequence, is it CpG island or not? • Log likelihood ratio • X = ACGGCTA • Compute log (P(X | CpG+) / P(X | CpG-)) • or log(P(CpG+ | X) / P(CpG- | X)) given priors • Q: how to get a+BA and a-BA? (B: begin state) P(X|CpG+) = a+BA a+AC a+CG …a+TA P(X|CpG-) = a-BA a-AC a-CG …a-TA

CpG island scores Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

Questions • Q1: given a short sequence, is it more likely from feature model or background model? • Q2: Given a long sequence, where are the features in it (if any)? • Approach 1: score 100 bp (e.g.) windows • Pro: simple • Con: arbitrary, fixed length, inflexible • Approach 2: combine +/- models.

Combined model Now given a sequence, we cannot direct tell which state a symbol is in. We have two states for each symbol. (hidden Markov model => hidden states) ?

Decoding • Give you a sequence, ACGTACGTATA… • What’s the most probable path?

Most probable path Probability of a path is the product of all probabilities for the arcs on the path A C G T A C G T A T A+ C+ G+ T+ A+ C+ G+ T+ A+ T+ B A- C- G- T- A- C- G- T- A- T- Find a path with the following objective: Maximize the product of probabilities for the arcs on the path  Maximize the sum of log probabilities for the arcs on the path

Most probable path V(+, i+1) = max { V(+, i) + w(x+i, x+i+1) V(-, i) + w(x-i, x+i+1) } V(-, i+1) = max { V(+, i) + w(x+i, x-i+1) V(-, i) + w(x-i, x-i+1) } x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 A+ C+ G+ T+ A+ C+ G+ T+ A+ T+ B A- C- G- T- A- C- G- T- A- T- w(s, t) = log (ast)

Simpler but more general case 0.05 0.95 • We can go to any state at any time. Doesn’t depend on the input • Each state can emit any symbol with some probabilities (possibly zero) 0.95 Fair LOADED P(1|F) = 1/6 P(2|F) = 1/6 P(3|F) = 1/6 P(4|F) = 1/6 P(5|F) = 1/6 P(6|F) = 1/6 P(1|L) = 1/10 P(2|L) = 1/10 P(3|L) = 1/10 P(4|L) = 1/10 P(5|L) = 1/10 P(6|L) = 1/2 0.05 Probability of a path is the product of all probabilities for the arcs on the path, and the possibilities of emitting the sequence of symbols

P(s) = 1/6, for s in [1..6] • Two types of rewards • Mileage: for using an arc (fixed) • Bonus: for visiting a node (depending on the token you show) 2 5 2 1 3 6 2 6 1 6 F F F F F F F F F F B L L L L L L L L L L P(s) = 1/10, for s in [1..5]P(6) = 1/2

P(s) = 1/6, for s in [1..6] V(F, i+1) = Max { V(F,i) + w(F,F) + r(F, xi+1), V(L,i) + w(L,F) + r(F, xi+1)} V(L, i+1) = Max { V(F,i) + w(F,F) + r(L, xi+1), V(L,i) + w(L,F) + r(L, xi+1)} x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 F F F F F F F F F F B L L L L L L L L L L P(s) = 1/10, for s in [1..5]P(6) = 1/2 w(F, L) = log (aFL) r(F, x) = log (eF(x))

FSA interpretation 0.95 0.95 2.3 2.3 0.05 -0.7 Fair LOADED Fair LOADED 0.05 -0.7 P(1|F) = 1/6 P(2|F) = 1/6 P(3|F) = 1/6 P(4|F) = 1/6 P(5|F) = 1/6 P(6|F) = 1/6 P(1|L) = 1/10 P(2|L) = 1/10 P(3|L) = 1/10 P(4|L) = 1/10 P(5|L) = 1/10 P(6|L) = 1/2 r(F,1) = 0.5 r(F,2) = 0.5 r(F,3) = 0.5 r(F,4) = 0.5 r(F,5) = 0.5 r(F,6) = 0.5 r(L,1) = 0 r(L,2) = 0 r(L,3) = 0 r(L,4) = 0 r(L,5) = 0 r(L,6) = 1.6 r = log (10 * p) V(L, i+1) = max { V(L, i) + W(L, L) + r(L, xi+1), V(F, i) + W(F, L) + r(L, xi+1) } V(F, i+1) = max { V(L, i) + W(L, F) + r(F, xi+1), V(F, i) + W(F, F) + r(F, xi+1) }

More general cases x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 B 3 3 3 3 3 3 3 3 3 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . k k k k k k k k k k

V(1,i) + w(1,1) + r(1, xi+1), V(2,i) + w(2,1) + r(1, xi+1), V(1, i+1) = max V(3,i) + w(3,1) + r(1, xi+1), …… V(k,i) + w(k,1) + r(1, xi+1) V(1,i) + w(1,2) + r(2, xi+1), V(2,i) + w(2,2) + r(2, xi+1), V(2, i+1) = max V(3,i) + w(3,2) + r(2, xi+1), …… V(k,i) + w(k,2) + r(2, xi+1) A FSA with k states, fully connected ......

Generalize V(1,i) + w(1, j) + r(j, xi+1), V(2,i) + w(2, j) + r(j, xi+1), V(j, i+1) = max V(3,i) + w(3, j) + r(j, xi+1), …… V(k,i) + w(k, j) + r(j, xi+1) Or simply: V(j, i+1) = Maxl {V(l,i) + w(l, j) + r(j, xi+1)}

Previous equation assumes completely connected structure. In practice may not be the case. 0 0.9 0.9 0.05 0.05 Load_1 Fair Load_2 0.9 0.1 0.1 0 P(1|F) = 1/6 P(2|F) = 1/6 P(3|F) = 1/6 P(4|F) = 1/6 P(5|F) = 1/6 P(6|F) = 1/6 P(1|L1) = 1/10 P(2|L1) = 1/10 P(3|L1) = 1/10 P(4|L1) = 1/10 P(5|L1) = 1/10 P(6|L1) = 1/2 P(1|L2) = 1/2 P(2|L2) = 1/10 P(3|L2) = 1/10 P(4|L2) = 1/10 P(5|L2) = 1/10 P(6|L2) = 1/10

Take advantage of sparse structure 0.9 0.9 0.05 0.05 Load_1 Fair Load_2 0.9 0.1 0.1 V(L1, i) + w(L1, F) + r(F, xi+1), V(F, i+1) = max V(L2, i) + w(L2, F) + r(F, xi+1), V(F, i) + w(F, F) + r(F, xi+1) V(L1, i) + w(L1, L1) + r(L1, xi+1), V(L1, i+1) = max V(F, i) + w(F, L1) + r(L1, xi+1) V(L2, i) + w(L2, L2) + r(L2, xi+1), V(L2, i+1) = max V(F, i) + w(F, L2) + r(L2, xi+1)

The Viterbi Algorithm Input: x = x1……xN Initialization: P0(0) = 1 (zero in subscript is the start state.) Pk(0) = 0, for all k > 0 (0 in parenthesis is the imaginary first position) Iteration: for each j = 1..N for each i = 1..k Pj(i) = ej(xi)  maxk akj Pk(i-1) Ptrj(i) = argmaxk akj Pk(i-1) end end Termination: Prob(x, *) = maxk Pk(N) Traceback: N* = argmaxk Pk(N) i-1* = Ptri (i)

The Viterbi Algorithm Input: x = x1……xN Initialization: V0(0) = 0 (zero in subscript is the start state.) Vk(0) = -inf, for all k > 0 (0 in parenthesis is the imaginary first position) Iteration: for each j for each i Vj(i) = rj(xi)+maxk (wkj+ Vk(i-1)) rj(xi) = log(ej(xi)), wkj = log(akj) Ptrj(i) = argmaxk (wkj+ Vk(i-1)) end end Termination: Prob(x, *) = exp{maxk Vk(N)} Traceback: N* = argmaxk Vk(N) i-1* = Ptri (i)

The Viterbi Algorithm x1 x2 x3 ………………………………………..xN Similar to “aligning” a set of states to a sequence Time: O(K2N) Space: O(KN) State 1 2 Vj(i) K

So far so good • but

Problem • Is the most probable path necessarily the best one? • Single optimal vs multiple sub-optimal • What if there are many sub-optimal paths with slightly lower probabilities? • Global optimal vs local optimal • What’s best globally may not be the best for each individual

For example • Probability for path 1-2-5-7 is 0.4 • Probability for path 1-3-6-7 is 0.3 • Probability for path 1-4-6-7 is 0.3 • What’s the most probable state at step 2? • 0.4 goes through 5 • 0.6 goes through 6 • Viterbi may not be the only interesting answer

Another example • The dishonest casino • Say x = 12341623162616364616234161221341 • Most probable path:  = FF……F • However: marked letters more likely to be L than unmarked letters • Another way to interpret the problem • With Viterbi, every pos is assigned a label • Confidence level for each assignment?

Posterior decoding • Viterbi finds the path with the highest probability • The probability is usually very tiny anyway • We want to know • k = 1

Probability of a sequence • The probability that a sequence is generated by a model. P(X | M) • Sometimes simply written as P(X) • Sometimes written as P(X | M, θ) or P(X | θ) to emphasize that we are looking for θ to optimize the likelihood • Not equal to the probability of a path P(X, ) • There are many possible paths that can generate X • Each with a different probability • P(X) = P(X, ) = P(X | ) P() • Why do we need this? • How to compute without summing over all paths (exponential)?

The forward algorithm • Define fk(i) the probability to get to state k at step i • Sum over all possible previous paths • Remember the way we counted # alignments?

The forward algorithm

The forward algorithm We can compute fk(i) for all k, i, using dynamic programming! Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fk(i) = ek(xi) jfj(i-1) ajk Termination: Prob(x) = kfk(N)

Relation between Forward and Viterbi VITERBI Initialization: P0(0) = 1 Pk(0) = 0, for all k > 0 Iteration: Pk(i) = ek(xi) maxjPj(i-1) ajk Termination: Prob(x, *) = maxkPk(N) FORWARD Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fk(i) = ek(xi) j fj(i-1) ajk Termination: Prob(x) = k fk(N)

The backward algorithm • Define bk(i) as the probability for paths starting from position i within state k, to the end of the sequence • Sum over all possible paths from i to N

1 This does not include the emission probability of xi

fk(i): prob to get to pos i in state k and emit xi • bk(i): prob from i to end, given i is in state k • What is fk(i) bk(i)?

Is your guess correct?

What we need is • But: • We have P(x) already in the forward algorithm • Can verify: k = 1

The forward-backward algorithm • Compute fk(i), for each state k and pos i • Compute bk(i), for each state k and pos I • Compute P(x) = kfk(N) • Compute P(i=k | x) = fk(i) * bk(i) / P(x)

The prob of x, with the constraint that it has to pass state k on step i Sequence state + Forward probabilities Backward probabilities / P(X) Space: O(KN) Time: O(K2N) P(i=k | x)

Relation to another F-B algorithm • We’ve learned a forward-backward algorithm in linear-space sequence alignment • Hirscheberg’s algorithm • Also known as forward-backward alignment algorithm y x

What’s P(i=k | x) good for? • For each position, you can assign a probability (in [0, 1]) to the states that the system might be in at that point – confidence level • Assign each symbol to the most-likely state according to this probability rather than the state on the most-probable path – posterior decoding ^i = argmaxkP(i = k | x)

For example If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die

CpG islands again • Data: 41 human sequences, totaling 60kbp, including 48 CpG islands of about 1kbp each • Viterbi: Post-process: • Found 46 of 48 46/48 • plus 121 “false positives” 67 false pos • Posterior Decoding: • same 2 false negatives 46/48 • plus 236 false positives 83 false pos Post-process: merge within 500; discard < 500

CS5263 Bioinformatics