
CS5263 Bioinformatics

Lecture 11: Markov Chain and Hidden Markov Models


In the next few lectures

  • Hidden Markov Models

    • The theory

    • Probabilistic treatment of sequence alignments using HMMs

    • Application of HMMs to biological sequence modeling and discovery of features such as genes


Markov models

  • A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values:

    P(xi | x1, …, xi-1) = P(xi | xi-k, …, xi-1)

  • First order: P(xi | x1, …, xi-1) = P(xi | xi-1)

  • Second order: P(xi | x1, …, xi-1) = P(xi | xi-2, xi-1)

  • 0th order: P(xi | x1, …, xi-1) = P(xi) (independence)



CpG islands

  • For biological reasons, CpG (C followed by G) is very rare in mammalian genomes

  • But relatively more common in promoter regions

  • Detecting CpG islands helps predict genes

    • Model: consider all 16 conditional probabilities

      • P(G | C), P(T | C), P(A | C), P(C | C), P(G | T) …


A 1st order Markov model for CpG islands

  • Essentially a finite state automaton

  • Transitions are probabilistic

  • 4 states: A, C, G, T

  • 16 transitions: ast = P(xi = t | xi-1 = s)

  • Begin/End states (for convenience)


Probability of a sequence

  • What’s the probability of ACGGCTA?

    • For 0-th order Markov model, simply P(A)P(C)…

    • For 1st order Markov model like the one above

      P(A) * P(C|A) * P(G|C) … P(A|T)

      = aBA aAC aCG …aTA



  • Equivalent: follow the path in the automaton, and multiply the probability of each arc on the path

  • Given a sequence, there is only one path you can walk.

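To make the calculation concrete, here is a minimal Python sketch of scoring a sequence with a 1st-order Markov chain. The transition table below is made up for illustration; it is not the trained CpG parameters from the lecture.

import math

# Hypothetical 1st-order transition probabilities a_st = P(xi = t | xi-1 = s).
# "B" is the begin state; the numbers are illustrative, not trained values.
trans = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.30, "G": 0.10, "T": 0.35},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.20, "C": 0.25, "G": 0.30, "T": 0.25},
}

def sequence_probability(seq, trans):
    # P(seq) = a_B,x1 * a_x1,x2 * ... : one multiplication per arc on the path.
    prob = 1.0
    prev = "B"
    for ch in seq:
        prob *= trans[prev][ch]
        prev = ch
    return prob

p = sequence_probability("ACGGCTA", trans)
print(p, math.log(p))   # logs are usually preferred to avoid underflow on long sequences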

Training

  • Estimate the parameters of the model

    • Training sequences: sequences with labels (CpG islands or not CpG islands)

    • Test sequences: sequences for test (labels removed)

  • Two models

    • CpG model learned from known CpG islands

    • Background model learned from known non-CpG islands

  • What to learn?

    • Transition frequencies

    • ast = #(s→t) / #(s→·), i.e., the number of s→t transitions divided by the total number of transitions out of s

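A minimal sketch of this counting step in Python, assuming the training data is simply a list of labeled DNA strings; the toy training sequences and the pseudocount choice are illustrative assumptions, not data from the lecture.

from collections import defaultdict

def train_transitions(sequences, pseudocount=0.0):
    # Estimate a_st = #(s -> t) / #(s -> anything) from labeled training sequences.
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):          # consecutive pairs s -> t
            counts[s][t] += 1.0
    trans = {}
    for s, row in counts.items():
        total = sum(row.values()) + 4 * pseudocount
        trans[s] = {t: (row.get(t, 0.0) + pseudocount) / total for t in "ACGT"}
    return trans

# One model from known CpG islands, one from known non-CpG background (toy data).
cpg_plus = train_transitions(["CGCGCGTACG", "GCGCGCGCGA"], pseudocount=1.0)
cpg_minus = train_transitions(["ATTTAAGCAT", "TTATAATGCA"], pseudocount=1.0)
print(cpg_plus["C"]["G"], cpg_minus["C"]["G"])   # C->G should be much higher in CpG+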

Training

  • Parameters learned from known CpG islands and non-CpG islands


Discrimination / Classification

  • Given a sequence, is it a CpG island or not?

  • Log likelihood ratio

  • X = ACGGCTA

  • Compute log (P(X | CpG+) / P(X | CpG-))

  • or log(P(CpG+ | X) / P(CpG- | X)) given priors

  • Q: how to get a+BA and a-BA? (B: begin state)

P(X|CpG+) = a+BA a+AC a+CG …a+TA

P(X|CpG-) = a-BA a-AC a-CG …a-TA

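A sketch of this classification step, assuming cpg_plus and cpg_minus are transition tables like those estimated above and that each includes a begin-state row "B"; the function names are illustrative.

import math

def log_likelihood(seq, trans, begin="B"):
    # log P(seq | model) under a 1st-order Markov chain with a begin state.
    score = 0.0
    prev = begin
    for ch in seq:
        score += math.log(trans[prev][ch])
        prev = ch
    return score

def log_likelihood_ratio(seq, cpg_plus, cpg_minus):
    # log [ P(seq | CpG+) / P(seq | CpG-) ]; positive values favour the CpG+ model.
    return log_likelihood(seq, cpg_plus) - log_likelihood(seq, cpg_minus)

# Usage, assuming both tables have rows for "B", "A", "C", "G", "T":
# print(log_likelihood_ratio("ACGGCTA", cpg_plus, cpg_minus))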

CpG island scores

Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.


Questions

  • Q1: Given a short sequence, is it more likely to have come from the feature model or the background model?

  • Q2: Given a long sequence, where are the features in it (if any)?

    • Approach 1: score fixed-length windows (e.g., 100 bp)

      • Pro: simple

      • Con: arbitrary, fixed length, inflexible

    • Approach 2: combine +/- models.


Combined model

Now, given a sequence, we cannot directly tell which state a symbol is in: we have two states for each symbol (hidden Markov model => hidden states).



Decoding

  • Given a sequence, ACGTACGTATA…

  • What’s the most probable path?


Most probable path

Probability of a path is the product of all probabilities for the arcs on the path

[Trellis diagram: the sequence A C G T A C G T A T across the top; from the begin state B, each position can be in one of the + states (A+, C+, G+, T+) or one of the - states (A-, C-, G-, T-).]

Find a path with the following objective:

Maximize the product of probabilities for the arcs on the path

⇔ Maximize the sum of log probabilities for the arcs on the path


Most probable path

V(+, i+1) = max { V(+, i) + w(x+i, x+i+1),  V(-, i) + w(x-i, x+i+1) }

V(-, i+1) = max { V(+, i) + w(x+i, x-i+1),  V(-, i) + w(x-i, x-i+1) }

[Trellis diagram over positions x1 … x10: from the begin state B, each position can be in a + state (A+, C+, G+, T+) or a - state (A-, C-, G-, T-).]

w(s, t) = log (ast)


Simpler but more general case

  • We can go to any state at any time; the transition doesn’t depend on the input

  • Each state can emit any symbol with some probability (possibly zero)

[Figure: the dishonest casino HMM. Two states, Fair and LOADED. Transitions: Fair→Fair = LOADED→LOADED = 0.95, Fair→LOADED = LOADED→Fair = 0.05. Emissions: P(s|F) = 1/6 for s = 1..6; P(s|L) = 1/10 for s = 1..5, P(6|L) = 1/2.]

The probability of a path is the product of all probabilities for the arcs on the path, times the probabilities of emitting the sequence of symbols.

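A small generative sketch of this model in Python, using the transition and emission values from the figure above; the start state and the random seed are arbitrary choices made for illustration.

import random

random.seed(0)

# Dishonest casino HMM from the figure above.
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "L": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}

def sample(n, state="F"):
    # Generate n rolls together with the hidden state path that produced them.
    states, rolls = [], []
    for _ in range(n):
        states.append(state)
        faces, weights = zip(*emit[state].items())
        rolls.append(random.choices(faces, weights)[0])
        nxt, weights = zip(*trans[state].items())
        state = random.choices(nxt, weights)[0]
    return rolls, states

rolls, states = sample(20)
print(rolls)
print("".join(states))   # hidden from an observer who only sees the rolls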

  • Two types of rewards

    • Mileage: for using an arc (fixed)

    • Bonus: for visiting a node (depending on the token you show)

[Trellis diagram for the roll sequence 2 5 2 1 3 6 2 6 1 6: from the begin state B, each position can be in state F (P(s) = 1/6 for s = 1..6) or state L (P(s) = 1/10 for s = 1..5, P(6) = 1/2).]


V(F, i+1) = max { V(F, i) + w(F, F) + r(F, xi+1),  V(L, i) + w(L, F) + r(F, xi+1) }

V(L, i+1) = max { V(F, i) + w(F, L) + r(L, xi+1),  V(L, i) + w(L, L) + r(L, xi+1) }

[Trellis diagram over positions x1 … x10, as above: from B, each position can be in state F (P(s) = 1/6 for s = 1..6) or state L (P(s) = 1/10 for s = 1..5, P(6) = 1/2).]

w(F, L) = log (aFL)

r(F, x) = log (eF(x))


FSA interpretation

[Figure: the same two-state FSA shown twice, once with probabilities (transitions 0.95 and 0.05; emissions P(s|F) = 1/6, P(s|L) = 1/10 for s = 1..5, P(6|L) = 1/2) and once with additive scores obtained via r = log(10·p): transition scores 2.3 and -0.7; emission scores r(F, s) = 0.5 for all s, r(L, s) = 0 for s = 1..5, r(L, 6) = 1.6.]

r = log (10 * p)

V(L, i+1) = max { V(L, i) + w(L, L) + r(L, xi+1),  V(F, i) + w(F, L) + r(L, xi+1) }

V(F, i+1) = max { V(L, i) + w(L, F) + r(F, xi+1),  V(F, i) + w(F, F) + r(F, xi+1) }


More general cases

[Trellis diagram over positions x1 … x10: a fully connected FSA with k states (1, 2, 3, …, k), plus the begin state B.]


V(1, i+1) = max { V(1, i) + w(1, 1) + r(1, xi+1),
                  V(2, i) + w(2, 1) + r(1, xi+1),
                  V(3, i) + w(3, 1) + r(1, xi+1),
                  ……
                  V(k, i) + w(k, 1) + r(1, xi+1) }

V(2, i+1) = max { V(1, i) + w(1, 2) + r(2, xi+1),
                  V(2, i) + w(2, 2) + r(2, xi+1),
                  V(3, i) + w(3, 2) + r(2, xi+1),
                  ……
                  V(k, i) + w(k, 2) + r(2, xi+1) }

……

A FSA with k states, fully connected


Generalize

V(j, i+1) = max { V(1, i) + w(1, j) + r(j, xi+1),
                  V(2, i) + w(2, j) + r(j, xi+1),
                  V(3, i) + w(3, j) + r(j, xi+1),
                  ……
                  V(k, i) + w(k, j) + r(j, xi+1) }

Or simply:

V(j, i+1) = maxl { V(l, i) + w(l, j) + r(j, xi+1) }


[Figure: a three-state casino HMM with states Fair, Load_1, and Load_2. Transitions: Fair→Fair = Load_1→Load_1 = Load_2→Load_2 = 0.9; Fair→Load_1 = Fair→Load_2 = 0.05; Load_1→Fair = Load_2→Fair = 0.1; Load_1→Load_2 = Load_2→Load_1 = 0. Emissions: P(s|F) = 1/6 for s = 1..6; P(s|L1) = 1/10 for s = 1..5, P(6|L1) = 1/2; P(1|L2) = 1/2, P(s|L2) = 1/10 for s = 2..6.]


Take advantage of sparse structure

[Figure: the same three-state model as above; there is no direct transition between Load_1 and Load_2.]

V(F, i+1) = max { V(L1, i) + w(L1, F) + r(F, xi+1),
                  V(L2, i) + w(L2, F) + r(F, xi+1),
                  V(F, i)  + w(F, F)  + r(F, xi+1) }

V(L1, i+1) = max { V(L1, i) + w(L1, L1) + r(L1, xi+1),
                   V(F, i)  + w(F, L1)  + r(L1, xi+1) }

V(L2, i+1) = max { V(L2, i) + w(L2, L2) + r(L2, xi+1),
                   V(F, i)  + w(F, L2)  + r(L2, xi+1) }


The Viterbi Algorithm

Input: x = x1……xN

Initialization:

P0(0) = 1 (0 in the subscript is the start state)

Pk(0) = 0, for all k > 0 (0 in the parentheses is the imaginary first position)

Iteration:

for each position i = 1..N
  for each state j = 1..K
    Pj(i) = ej(xi) · maxk akj Pk(i-1)
    Ptrj(i) = argmaxk akj Pk(i-1)
  end
end

Termination:

Prob(x, π*) = maxk Pk(N)

Traceback:

πN* = argmaxk Pk(N)

πi-1* = Ptrπi*(i)


The Viterbi Algorithm (in log space)

Input: x = x1……xN

Initialization:

V0(0) = 0 (0 in the subscript is the start state)

Vk(0) = -∞, for all k > 0 (0 in the parentheses is the imaginary first position)

Iteration:

for each position i = 1..N
  for each state j = 1..K
    Vj(i) = rj(xi) + maxk (wkj + Vk(i-1)), where rj(xi) = log(ej(xi)) and wkj = log(akj)
    Ptrj(i) = argmaxk (wkj + Vk(i-1))
  end
end

Termination:

Prob(x, π*) = exp{maxk Vk(N)}

Traceback:

πN* = argmaxk Vk(N)

πi-1* = Ptrπi*(i)

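The log-space recurrence above translates fairly directly into Python. This is a minimal sketch using the dishonest casino model from earlier slides; the uniform start distribution and the example rolls are assumptions made for illustration.

import math

def viterbi(obs, states, start, trans, emit):
    # Most probable state path, computed with log scores to avoid underflow.
    V = [{k: math.log(start[k]) + math.log(emit[k][obs[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, len(obs)):
        V.append({})
        ptr.append({})
        for j in states:
            best = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][j]))
            V[i][j] = V[i - 1][best] + math.log(trans[best][j]) + math.log(emit[j][obs[i]])
            ptr[i][j] = best
    last = max(states, key=lambda k: V[-1][k])          # best final state
    path = [last]
    for i in range(len(obs) - 1, 0, -1):                # follow the pointers backwards
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), V[-1][last]

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}                            # assumed uniform start
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "L": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}

path, logp = viterbi([1, 2, 6, 6, 6, 3, 6, 6, 6, 4], states, start, trans, emit)
print("".join(path), logp)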

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence

Time: O(K²N)

Space: O(KN)

[Figure: the K × N dynamic programming matrix of values Vj(i), with states 1..K as rows and positions x1 x2 x3 ……….. xN as columns.]



Problem

  • Is the most probable path necessarily the best one?

    • Single optimal vs multiple sub-optimal

      • What if there are many sub-optimal paths with slightly lower probabilities?

    • Global optimal vs local optimal

      • What’s best globally may not be the best for each individual


For example

  • Probability for path 1-2-5-7 is 0.4

  • Probability for path 1-3-6-7 is 0.3

  • Probability for path 1-4-6-7 is 0.3

  • What’s the most probable state at step 2?

    • 0.4 goes through 5

    • 0.6 goes through 6

    • Viterbi may not be the only interesting answer


Another example

  • The dishonest casino

  • Say x = 12341623162616364616234161221341

  • Most probable path: π = FF……F

  • However: the marked letters are more likely to be L than the unmarked letters

  • Another way to interpret the problem

    • With Viterbi, every pos is assigned a label

    • Confidence level for each assignment?


Posterior decoding

  • Viterbi finds the path with the highest probability

    • The probability is usually very tiny anyway

  • We want to know P(πi = k | x), the probability that position i is in state k given the entire observed sequence x


Probability of a sequence

  • The probability that a sequence is generated by a model. P(X | M)

  • Sometimes simply written as P(X)

  • Sometimes written as P(X | M, θ) or P(X | θ) to emphasize that we are looking for θ to optimize the likelihood

  • Not equal to the probability of a path P(X, π)

    • There are many possible paths that can generate X

    • Each with a different probability

    • P(X) = Σπ P(X, π) = Σπ P(X | π) P(π)

    • Why do we need this?

    • How to compute without summing over all paths (exponential)?


The forward algorithm

  • Define fk(i) as the probability of getting to state k at step i, having emitted x1…xi

    • Sum over all possible previous paths

    • Remember the way we counted # alignments?




The forward algorithm

We can compute fk(i) for all k, i, using dynamic programming!

Initialization:

f0(0) = 1

fk(0) = 0, for all k > 0

Iteration:

fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:

Prob(x) = Σk fk(N)

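A minimal Python sketch of this recurrence, kept in plain probability space (long sequences would need scaling or log-space summation). The uniform start distribution stands in for the begin-state transitions a0k, and the casino model from earlier slides is reused for illustration.

def forward(obs, states, start, trans, emit):
    # f_k(i) = e_k(x_i) * sum_j f_j(i-1) * a_jk; start[] plays the role of a_0k.
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({k: emit[k][x] * sum(prev[j] * trans[j][k] for j in states)
                  for k in states})
    return sum(f[-1].values()), f            # P(x) = sum_k f_k(N), plus the full table

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "L": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}

p_x, f = forward([1, 2, 6, 6, 6, 3], states, start, trans, emit)
print(p_x)   # probability of the roll sequence, summed over all state paths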

Relation between Forward and Viterbi

VITERBI

Initialization:

P0(0) = 1

Pk(0) = 0, for all k > 0

Iteration:

Pk(i) = ek(xi) maxj Pj(i-1) ajk

Termination:

Prob(x, *) = maxkPk(N)

FORWARD

Initialization:

f0(0) = 1

fk(0) = 0, for all k > 0

Iteration:

fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:

Prob(x) = Σk fk(N)


The backward algorithm

  • Define bk(i) as the probability for the paths from position i (in state k) to the end of the sequence, i.e., the probability of emitting xi+1…xN given that position i is in state k

    • Sum over all possible paths from i to N


bk(N) = 1 for all k; bk(i) = Σl akl el(xi+1) bl(i+1)

This does not include the emission probability of xi


  • fk(i): prob to get to pos i in state k and emit xi

  • bk(i): prob from i to end, given i is in state k

  • What is fk(i) bk(i)?


Is your guess correct?


  • What we need is P(πi = k | x)

  • But: fk(i) bk(i) = P(x, πi = k), not P(πi = k | x)

  • We have P(x) already in the forward algorithm, so P(πi = k | x) = fk(i) bk(i) / P(x)

  • Can verify: Σk P(πi = k | x) = 1


The forward-backward algorithm

  • Compute fk(i), for each state k and position i

  • Compute bk(i), for each state k and position i

  • Compute P(x) = Σk fk(N)

  • Compute P(πi = k | x) = fk(i) · bk(i) / P(x)


fk(i) · bk(i) is the prob of x, with the constraint that the path has to pass state k on step i

[Figure: combine the forward probabilities and the backward probabilities at each position of the sequence, then divide by P(X) to get P(πi = k | x).]

Space: O(KN)

Time: O(K²N)

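A minimal sketch of posterior decoding in Python, combining the forward and backward tables exactly as above; the casino parameters, uniform start, and example rolls are again only illustrative.

def forward(obs, states, start, trans, emit):
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({k: emit[k][x] * sum(prev[j] * trans[j][k] for j in states)
                  for k in states})
    return f

def backward(obs, states, trans, emit):
    b = [{k: 1.0 for k in states}]                       # b_k(N) = 1
    for x in reversed(obs[1:]):                          # fill earlier positions right to left
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][l] * emit[l][x] * nxt[l] for l in states)
                     for k in states})
    return b

def posterior(obs, states, start, trans, emit):
    # P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) for every position i.
    f = forward(obs, states, start, trans, emit)
    b = backward(obs, states, trans, emit)
    p_x = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(obs))]

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "L": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}

for i, dist in enumerate(posterior([1, 6, 6, 6, 2, 3], states, start, trans, emit)):
    print(i, max(dist, key=dist.get), round(dist["L"], 3))   # posterior decoding per position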

Relation to another F-B algorithm

  • We’ve learned a forward-backward algorithm in linear-space sequence alignment

    • Hirschberg’s algorithm

    • Also known as forward-backward alignment algorithm



What’s P(πi = k | x) good for?

  • For each position, you can assign a probability (in [0, 1]) to the states that the system might be in at that point – confidence level

  • Assign each symbol to the most-likely state according to this probability rather than the state on the most-probable path – posterior decoding

    π̂i = argmaxk P(πi = k | x)


For example

If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die


CpG islands again

  • Data: 41 human sequences, totaling 60kbp, including 48 CpG islands of about 1kbp each

  • Viterbi: found 46 of 48 islands, plus 121 “false positives”; after post-processing: 46/48, 67 false positives

  • Posterior decoding: same 2 false negatives (46/48), plus 236 false positives; after post-processing: 46/48, 83 false positives

  • Post-process: merge predictions within 500 bp of each other; discard predicted islands shorter than 500 bp


What if a new genome comes?

We just sequenced the porcupine genome

We know CpG islands play the same role in this genome

However, we have no known CpG islands for porcupines

We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model?

- LEARNING


Learning

  • When the states are known

    • We’ve already done that

    • Estimate parameters from labeled data (known CpG or non-CpG)

    • “supervised” learning

  • When the states are unknown

    • Estimate parameters without labeled data

    • “unsupervised” learning


Basic idea

  1. We estimate our “best guess” on the model parameters θ

  2. We use θ to predict the unknown labels

  3. We re-estimate a new set of θ

  4. Repeat steps 2 & 3

Two ways


Viterbi Training: given θ, estimate π; then re-estimate θ

  1. Make initial estimates of parameters θ

  2. Find Viterbi path π for each training sequence

  3. Count transitions/emissions on those paths, getting new θ

  4. Repeat steps 2 & 3

  • Not rigorously optimizing desired likelihood, but still useful & commonly used.

  • (Arguably good if you’re doing Viterbi decoding.)

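A sketch of the re-estimation step (step 3), assuming the Viterbi paths for the training sequences have already been computed under the current θ; the toy sequences, paths, and pseudocounts are illustrative assumptions.

from collections import defaultdict

def reestimate(sequences, paths, states, alphabet, pseudo=1.0):
    # One Viterbi-training update: count transitions and emissions along the
    # decoded paths, then turn the counts into new parameters (with pseudocounts).
    t_counts = {s: defaultdict(float) for s in states}
    e_counts = {s: defaultdict(float) for s in states}
    for seq, path in zip(sequences, paths):
        for s, t in zip(path, path[1:]):
            t_counts[s][t] += 1.0
        for s, x in zip(path, seq):
            e_counts[s][x] += 1.0
    trans = {s: {t: (t_counts[s].get(t, 0.0) + pseudo) /
                    (sum(t_counts[s].values()) + pseudo * len(states))
                 for t in states} for s in states}
    emit = {s: {x: (e_counts[s].get(x, 0.0) + pseudo) /
                   (sum(e_counts[s].values()) + pseudo * len(alphabet))
                for x in alphabet} for s in states}
    return trans, emit

# In the full loop, `paths` would come from Viterbi decoding of each training
# sequence under the current parameters; here they are simply made up.
seqs = [[1, 6, 6, 2, 3], [6, 6, 6, 1, 2]]
paths = [["F", "L", "L", "F", "F"], ["L", "L", "L", "F", "F"]]
trans, emit = reestimate(seqs, paths, ("F", "L"), tuple(range(1, 7)))
print(trans["F"]["L"], emit["L"][6])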

Baum-Welch Training: given θ, estimate the π ensemble; then re-estimate θ

  • Instead of estimating the new θ from the most probable path 

  • We can re-estimate θ from all possible paths

  • For example, according to Viterbi, pos i is in state k and pos (i+1) is in state l

    • This contributes 1 count towards the frequency that the transition k→l is used

    • In Baum-Welch, this transition is counted only partially, according to the probability that this arc is taken by some path


Estimated # of k→l transitions

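A minimal sketch of this partial counting in plain probability space, reusing the same forward and backward functions as the earlier sketches: each position contributes a fractional count to every k→l arc, weighted by the probability that some path takes that arc there. The casino parameters and rolls are again only illustrative.

def forward(obs, states, start, trans, emit):
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({k: emit[k][x] * sum(prev[j] * trans[j][k] for j in states)
                  for k in states})
    return f

def backward(obs, states, trans, emit):
    b = [{k: 1.0 for k in states}]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][l] * emit[l][x] * nxt[l] for l in states)
                     for k in states})
    return b

def expected_transitions(obs, states, start, trans, emit):
    # E-step count: A_kl = sum_i f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1) / P(x)
    f = forward(obs, states, start, trans, emit)
    b = backward(obs, states, trans, emit)
    p_x = sum(f[-1].values())
    A = {k: {l: 0.0 for l in states} for k in states}
    for i in range(len(obs) - 1):
        for k in states:
            for l in states:
                A[k][l] += f[i][k] * trans[k][l] * emit[l][obs[i + 1]] * b[i + 1][l] / p_x
    return A

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "L": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}

A = expected_transitions([1, 6, 6, 6, 2, 3], states, start, trans, emit)
print(A)   # fractional counts; the M-step sets the new a_kl = A_kl / sum over l' of A_kl'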

Why is this working?

  • Proof is complicated (ch 11 in Durbin book)

  • But basically,

    • The forward-backward algorithm computes the likelihood of the data given the HMM

    • When we re-estimate θ, we maximize P(sequence | θ).

    • Effect: in each iteration, the likelihood of the sequence will be improved

    • Therefore, guaranteed to converge (not necessarily to a global optimal)


Expectation-Maximization (EM)

  • Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data

  • E-step: compute the expectation (e.g. prob for each pos to be in a certain state)

  • M-step: maximum-likelihood parameter estimation

  • We’ll see EM and similar techniques again in motif finding

  • k-means clustering has some similar flavor


HMM summary

  • Viterbi – best single path

  • Forward – sum over all paths

  • Backward – similar

  • Baum-Welch – training via EM and forward-backward

  • Viterbi training – another “EM”-like procedure, but Viterbi-based

