CSE182-L10

CSE182-L10 HMM applications

Probability of being in specific states
• What is the probability that the automaton was in state k at step i, given that it emitted x?
• Pr[πi = k | x] = Pr[all paths that pass through state k at step i and emit x] / Pr[all paths that emit x]

The Forward Algorithm
• Recall v[i,j]: the probability of the most likely path by which the automaton emits x1…xi and ends up in state j.
• Define f[i,j]: the probability that the automaton, starting from state 1, emits x1…xi and ends up in state j.
• What is the difference?

Most Likely Path versus Probability of Arrival
• There are multiple paths from state 1 to state j along which the automaton can output x1…xi.
• In computing the Viterbi path, we choose the most likely of these paths π:
• V[i,j] = maxπ Pr[x1…xi, π]
• The probability of emitting x1…xi and ending up in state j instead sums over all of these paths:
• F[i,j] = ∑π Pr[x1…xi, π]

The Forward Algorithm
• Recall that
• v(i,j) = max l∈Q { v(i-1,l) · A[l,j] } · ej(xi)
• Instead
• F(i,j) = ∑ l∈Q ( F(i-1,l) · A[l,j] ) · ej(xi)
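The two recurrences differ only in replacing the max with a sum. The following is a minimal Python sketch of both, not code from the lecture. It assumes a hypothetical three-state toy HMM whose matrices `A` and `E` are invented for illustration; state 0 is a silent start state standing in for the slides' state 1, and indices are 0-based.

```python
# Hypothetical toy HMM: state 0 is a silent start state, states 1 and 2 emit H or T.
A = [
    [0.0, 0.5, 0.5],  # transitions out of the start state
    [0.0, 0.9, 0.1],  # transitions out of state 1
    [0.0, 0.2, 0.8],  # transitions out of state 2
]
E = [
    {},                    # the start state emits nothing
    {"H": 0.5, "T": 0.5},  # emission probabilities of state 1
    {"H": 0.9, "T": 0.1},  # emission probabilities of state 2
]


def forward(x, A, e, start=0):
    """F[i][j] = probability of emitting x[:i] and being in state j after i steps."""
    n_states = len(A)
    F = [[0.0] * n_states for _ in range(len(x) + 1)]
    F[0][start] = 1.0  # before any emission the automaton sits in the start state
    for i in range(1, len(x) + 1):
        for j in range(n_states):
            # Sum over every predecessor state l.
            arrive = sum(F[i - 1][l] * A[l][j] for l in range(n_states))
            F[i][j] = arrive * e[j].get(x[i - 1], 0.0)
    return F


def viterbi(x, A, e, start=0):
    """V[i][j] = probability of the single best path emitting x[:i] and ending in j."""
    n_states = len(A)
    V = [[0.0] * n_states for _ in range(len(x) + 1)]
    V[0][start] = 1.0
    for i in range(1, len(x) + 1):
        for j in range(n_states):
            # Same recurrence as forward(), with max in place of sum.
            best = max(V[i - 1][l] * A[l][j] for l in range(n_states))
            V[i][j] = best * e[j].get(x[i - 1], 0.0)
    return V


F = forward("HH", A, E)
V = viterbi("HH", A, E)
print(F[2], V[2])
```

Every entry of `V` is at most the matching entry of `F`, since the best path is one of the paths being summed.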

The Backward Algorithm
• Define b[i,j]: the probability that the automaton, starting from state j at step i, emits xi+1…xn and ends up in the final state.

Forward Backward Scoring
• F(i,j) = ∑ l∈Q ( F(i-1,l) · A[l,j] ) · ej(xi)
• B(i,j) = ∑ l∈Q ( A[j,l] · el(xi+1) · B(i+1,l) )
• Pr[x, πi=k] = F(i,k) · B(i,k)
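Dividing F(i,k) · B(i,k) by Pr[x] gives the posterior probability of being in state k at step i. The sketch below illustrates this under stated assumptions and is not the lecture's code. The toy matrices `A` and `E` are invented, state 0 is a silent start state, and, as a simplification of the slides' explicit final state, the automaton is assumed to be allowed to stop in any state, so B(n,j) = 1.

```python
# Hypothetical toy HMM: state 0 is a silent start state, states 1 and 2 emit H or T.
A = [
    [0.0, 0.5, 0.5],
    [0.0, 0.9, 0.1],
    [0.0, 0.2, 0.8],
]
E = [
    {},
    {"H": 0.5, "T": 0.5},
    {"H": 0.9, "T": 0.1},
]


def forward(x, A, e, start=0):
    """F[i][j] = probability of emitting x[:i] and being in state j after i steps."""
    m = len(A)
    F = [[0.0] * m for _ in range(len(x) + 1)]
    F[0][start] = 1.0
    for i in range(1, len(x) + 1):
        for j in range(m):
            arrive = sum(F[i - 1][l] * A[l][j] for l in range(m))
            F[i][j] = arrive * e[j].get(x[i - 1], 0.0)
    return F


def backward(x, A, e):
    """B[i][j] = probability of emitting x[i:] given the automaton is in state j at step i."""
    n, m = len(x), len(A)
    B = [[0.0] * m for _ in range(n + 1)]
    B[n] = [1.0] * m  # assumption: the automaton may stop in any state
    for i in range(n - 1, -1, -1):
        for j in range(m):
            # x[i] is the (i+1)-th symbol because Python strings are 0-indexed.
            B[i][j] = sum(A[j][l] * e[l].get(x[i], 0.0) * B[i + 1][l] for l in range(m))
    return B


def posterior(x, A, e, start=0):
    """post[i][k] = Pr[state k at step i | x] = F[i][k] * B[i][k] / Pr[x]."""
    F = forward(x, A, e, start)
    B = backward(x, A, e)
    px = sum(F[len(x)])  # Pr[x] under the stop-anywhere assumption
    return [[F[i][k] * B[i][k] / px for k in range(len(A))] for i in range(len(x) + 1)]


for row in posterior("HH", A, E):
    print(row)
```

Each row of the result sums to 1, because at every step the automaton is in exactly one state.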

Application of HMMs
     1    2    3    4    5    6    7    8
A  0.9  0.4  0.3  0.6  0.1  0.0  0.2  1.0
C  0.0  0.2  0.7  0.0  0.3  0.0  0.0  0.0
G  0.1  0.2  0.0  0.0  0.3  1.0  0.3  0.0
T  0.0  0.2  0.0  0.4  0.3  0.0  0.5  0.0
• How do we modify this to handle indels?
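Without insertion or deletion states, a sequence can only be scored position by position against the eight columns. A minimal Python sketch of that gap-free scoring follows. The numbers are the profile values from the slide; the names `PROFILE` and `profile_probability` are hypothetical and chosen only for this illustration.

```python
# Profile from the slide: PROFILE[base][k] is the probability of base at position k+1.
PROFILE = {
    "A": [0.9, 0.4, 0.3, 0.6, 0.1, 0.0, 0.2, 1.0],
    "C": [0.0, 0.2, 0.7, 0.0, 0.3, 0.0, 0.0, 0.0],
    "G": [0.1, 0.2, 0.0, 0.0, 0.3, 1.0, 0.3, 0.0],
    "T": [0.0, 0.2, 0.0, 0.4, 0.3, 0.0, 0.5, 0.0],
}


def profile_probability(seq, profile=PROFILE):
    """Probability that the gap-free profile emits seq, one base per column."""
    width = len(profile["A"])
    if len(seq) != width:
        # With no insert or delete states, only sequences of exactly this length fit.
        raise ValueError(f"sequence must have length {width}")
    p = 1.0
    for col, base in enumerate(seq):
        p *= profile[base][col]
    return p


print(profile_probability("ACAACGTA"))
print(profile_probability("ACACTGTA"))
```

The second call returns 0.0 because C has probability 0.0 at position 4. A model with insertion and deletion states could still assign such a sequence a non-zero score.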

Applications of the HMM paradigm
• Modifying Profile HMMs to handle indels
• States Ii: insertion states
• States Di: deletion states

Profile HMMs
• An assignment of states implies insertion, match, or deletion.
• EX: ACACTGTA
[Figure: the same 8-position profile, with the bases C A A C T G T A shown alongside it.]

Viterbi Algorithm revisited
• Define vMj(i) as the log-likelihood score of the best path for matching x1..xi to the profile HMM, ending with xi emitted by the state Mj.
• vIj(i) and vDj(i) are defined similarly.

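Viterbi Equations for Profile HMMs
\[
v_{M_j}(i) = \log e_{M_j}(x_i) + \max \begin{cases}
v_{M_{j-1}}(i-1) + \log A[M_{j-1}, M_j] \\
v_{I_{j-1}}(i-1) + \log A[I_{j-1}, M_j] \\
v_{D_{j-1}}(i-1) + \log A[D_{j-1}, M_j]
\end{cases}
\]
\[
v_{I_j}(i) = \log e_{I_j}(x_i) + \max \begin{cases}
v_{M_j}(i-1) + \log A[M_j, I_j] \\
v_{I_j}(i-1) + \log A[I_j, I_j] \\
v_{D_j}(i-1) + \log A[D_j, I_j]
\end{cases}
\]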
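The slide gives recurrences only for the match and insert states. For completeness, here is the delete-state recurrence as it appears in the standard profile-HMM formulation. It is an addition, not taken from the slide. A delete state emits nothing, so the recurrence has no emission term and the sequence index i does not advance.

```latex
v_{D_j}(i) = \max \begin{cases}
v_{M_{j-1}}(i) + \log A[M_{j-1}, D_j] \\
v_{I_{j-1}}(i) + \log A[I_{j-1}, D_j] \\
v_{D_{j-1}}(i) + \log A[D_{j-1}, D_j]
\end{cases}
```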

Compositional Signals
• CpG islands: in genomic sequence, the CG dinucleotide is rarely seen.
• In a CG dinucleotide the C tends to be methylated, and the methylated C subsequently mutates to T.
• In regions around a gene, methylation is suppressed, and therefore CG is more common.
• CpG islands: islands of CG on the genome.
• How can you detect CpG islands?

An HMM for Genomic regions
• Node A emits A with probability 1, and every other base with probability 0.
• The start and end nodes do not emit any symbol.
• All outgoing edges from a node are equiprobable, except for the ones coming out of C.
[Figure: HMM with nodes start, A, C, G, T, end; edge labels shown: .25, 0.1, 0.4, .25.]

An HMM for CpG islands
• Node A emits A with probability 1, and every other base with probability 0.
• The start and end nodes do not emit any symbol.
• All outgoing edges from a node are equiprobable, except for the ones coming out of C.
[Figure: HMM with nodes start, A, C, G, T, end; edge labels shown: 0.25, 0.25, 0.25.]

HMM for detecting CpG Islands
[Figure: two HMMs, labeled A and B, each with nodes start, A, C, G, T, end; edge labels shown: 0.1, 0.4.]
• In the best parse of a genomic sequence, each base is assigned a state from one of the two sets, A and B.
• Any substring assigned multiple states from B can be described as a CpG island.
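The best parse can be computed with the Viterbi algorithm in log space. The Python sketch below illustrates the idea under several assumptions that are not in the slides. The edge labels 0.1 and 0.4 are assumed to belong to C→G and C→T in the background set A, with every other within-set edge at 0.25. The switching probability `SWITCH` between the two sets is invented. The start and end nodes are replaced by a uniform start over all eight states. Because each node emits its own base with probability 1, the state at a position is determined by the base and a set label, so the parse reduces to choosing a label, A or B, for each base.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}

# Set A (background): edges out of C are not equiprobable. Which edge carries
# 0.1 and which 0.4 is an assumption read off the figure labels.
BACKGROUND = {b: dict(UNIFORM) for b in BASES}
BACKGROUND["C"] = {"A": 0.25, "C": 0.25, "G": 0.10, "T": 0.40}
# Set B (CpG island): all edges equiprobable.
ISLAND = {b: dict(UNIFORM) for b in BASES}

WITHIN = {"A": BACKGROUND, "B": ISLAND}
SWITCH = 0.05  # hypothetical probability of moving between the two sets


def log_trans(set_from, base_from, set_to, base_to):
    """Log probability of moving from state (set_from, base_from) to (set_to, base_to)."""
    stay_or_switch = 1.0 - SWITCH if set_from == set_to else SWITCH
    return math.log(stay_or_switch * WITHIN[set_to][base_from][base_to])


def label_cpg(seq):
    """Return a string of set labels (A or B), one per base, for the best parse of seq."""
    if not seq:
        return ""
    sets = "AB"
    # Uniform start over the 8 states; emissions have probability 1 and add log 1 = 0.
    score = {s: math.log(1.0 / 8.0) for s in sets}
    back = []
    for prev, cur in zip(seq, seq[1:]):
        new_score, pointers = {}, {}
        for t in sets:
            best = max(sets, key=lambda s: score[s] + log_trans(s, prev, t, cur))
            new_score[t] = score[best] + log_trans(best, prev, t, cur)
            pointers[t] = best
        score = new_score
        back.append(pointers)
    # Trace back from the best final label.
    state = max(sets, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


seq = "CT" * 8 + "CG" * 10 + "CT" * 8
print(seq)
print(label_cpg(seq))
```

With these made-up parameters, the CG-rich middle of the example is labeled B and the CT-rich flanks are labeled A. The input is assumed to contain only the uppercase bases A, C, G, and T.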

HMM: Summary
• HMMs are a natural technique for modeling many biological domains.
• They can capture position-dependent as well as compositional properties.
• HMMs have been very useful in an important bioinformatics application: gene finding.
