Lecture 5

Lecture 5 Hidden Markov Models Jones & Pevzner, Chapt. 11 (handout)

What are HMMs? • An important machine learning method. Widely used in sequence analysis, voice recognition, pattern analysis. • A state machine. • An HMM has a finite number of states. • Each state can emit a character, and make a transition to another state. • An HMM is a probabilistic state machine; character emissions and state transitions are not deterministic, but are random events that obey fixed probability distributions.

Focus problem - CG Islands • CG is the most infrequently-observed dinucleotide. • The C in this pair is easily methylated, and subsequently replaced with a T. • Methylation is an important mechanism for silencing genes, and genes that are actively-expressed are in regions of the genome that are maintained in an unmethylated state. As a consequence, CG dinucleotides are observed much more frequently in regions of the genome that include active genes. • Location of a putative gene in a “CG island” provides important supporting evidence that it is in fact an expressed gene. • Identifying CG islands computationally requires that we be able to observe a subtle shift in dinucleotide frequencies.

Related problem - the “Fair Bet Casino” • In the Fair Bet Casino you can wager on the outcome of a coin-flipping game. • The trouble is, the dealer uses two different coins: one is fair and the other is weighted. He changes coins randomly, but infrequently. • Given a series of tosses, can we estimate when the fair coin was being used and when the biased one was in play?

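The probabilities… For the outcome of a single flip (0=tails, 1=heads) - The fair coin: $P(0 \mid \text{fair}) = P(1 \mid \text{fair}) = 1/2$. The biased coin: $P(0 \mid \text{biased}) = 1/4$, $P(1 \mid \text{biased}) = 3/4$. For a series of throws $x_1 x_2 \ldots x_n$ (where there are k heads): $P(x \mid \text{fair}) = (1/2)^n$ and $P(x \mid \text{biased}) = (1/4)^{n-k} (3/4)^k = 3^k / 4^n$.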

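An elementary approach: • Slide a window n characters long along the sequence of coin flips, and at each position compute a measure of the likelihood that the fair coin was used as opposed to the biased one; we often use the log-odds ratio. For a fair coin with P(heads) = 1/2, a biased coin with P(heads) = 3/4, and k heads in the window: $\log_2 \frac{P(x \mid \text{fair})}{P(x \mid \text{biased})} = \log_2 \frac{2^{-n}}{3^k \, 4^{-n}} = n - k \log_2 3$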
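The window scan described above can be sketched in a few lines of Python. This is only an illustration: the function names, the example flips, and the window width of 5 are assumptions made here, and the score uses the log-odds expression n - k log2(3) that follows from a fair coin with P(heads) = 1/2 and a biased coin with P(heads) = 3/4.

```python
import math

def log_odds(window):
    """log2 of P(window | fair) / P(window | biased) for a list of 0/1 flips."""
    n = len(window)
    k = sum(window)  # number of heads (1s) in the window
    return n - k * math.log2(3)

def sliding_log_odds(flips, width):
    """Log-odds score for every window of the given width (hypothetical helper)."""
    return [log_odds(flips[i:i + width]) for i in range(len(flips) - width + 1)]

# Made-up example data, not taken from the lecture.
flips = [0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0]
for start, score in enumerate(sliding_log_odds(flips, width=5)):
    label = "fair" if score > 0 else "biased"
    print(start, round(score, 2), label)
```

A positive score favors the fair coin and a negative score favors the biased one; the choice of width is exactly the weakness of this approach, since it has to be fixed in advance.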

The elementary approach - evaluation • If the ratio of probabilities of fair to biased is unity, then $n - k \log_2 3 = 0$, or $k = n / \log_2 3$ (about 0.63n). If we have this many heads in the sequence, a fair and a biased coin are equally likely. • If the ratio is > unity, then we are more likely to have a fair coin; in this case $n - k \log_2 3 > 0$, or $k < n / \log_2 3$. • If the ratio is < unity, then it is more likely we have a biased coin; in this case $n - k \log_2 3 < 0$, or $k > n / \log_2 3$.
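To make the threshold concrete, here is a small Python check of the break-even number of heads for a few window lengths. The function name and the chosen lengths are illustrative assumptions, not part of the lecture.

```python
import math

def head_threshold(n):
    """Number of heads k at which fair and biased are equally likely: n / log2(3)."""
    return n / math.log2(3)

def classify(n, k):
    """Hypothetical helper: label a window of n flips containing k heads."""
    score = n - k * math.log2(3)
    return "fair" if score > 0 else "biased"

for n in (10, 20, 100):
    print(n, round(head_threshold(n), 2))

# With n = 10 the threshold is about 6.31 heads, so 6 heads leans fair and 7 leans biased.
print(classify(10, 6), classify(10, 7))
```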

How useful is the elementary approach? • Our underlying assumption is that only one variety of coin is used to generate a sequence; we cannot consider the possibility that the coins are swapped in the middle of generating the sequence. • If we slide a window across the sequence, we need to “capture” a subsequence generated by either the fair or biased coins. • Do we know ahead of time how big the islands will be? In fact they will have a distribution of sizes, which cannot be captured using a window of fixed width.

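The Hidden Markov (state machine) Approach [Figure: two-state diagram with states F (fair) and B (biased). Each state returns to itself with probability 9/10 and switches to the other state with probability 1/10. F emits H(1) and T(0) with probability 1/2 each; B emits H(1) with probability 3/4 and T(0) with probability 1/4.]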

Components of the Model • S, an alphabet of symbols to be emitted. • Q, a set of states, each of which can emit symbols from the alphabet S. • A, a |Q| × |Q| matrix; $A_{ij}$ is the probability that the machine will switch to state j from state i. These are the transition probabilities. • E = ($e_k(b)$), a |Q| × |S| matrix of probabilities. $e_k(b)$ is the probability of emitting character b while in state k. These are the emission probabilities.

The Components of the Fair Bet Casino Model • S = {0, 1}, corresponding to tails (0) or heads (1). • Q = {F, B}, corresponding to fair or biased coin. • $A_{FF} = A_{BB} = 0.9$, $A_{FB} = A_{BF} = 0.1$ • $e_F(0) = e_F(1) = 0.5$, $e_B(0) = 0.25$, $e_B(1) = 0.75$
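As a minimal sketch, the four components can be written down as plain Python data, and the model can then be run forward to generate flips. The dictionary layout, the `sample` function, and the choice of a 1/2-1/2 starting state are assumptions made for illustration.

```python
import random

STATES = ["F", "B"]   # Q: fair or biased coin
SYMBOLS = [0, 1]      # S: tails (0) or heads (1)

# A[i][j]: probability of switching to state j from state i
A = {"F": {"F": 0.9, "B": 0.1},
     "B": {"F": 0.1, "B": 0.9}}

# E[k][b]: probability of emitting character b while in state k
E = {"F": {0: 0.5, 1: 0.5},
     "B": {0: 0.25, 1: 0.75}}

def sample(n, seed=None):
    """Generate n flips and the hidden path that produced them (hypothetical helper)."""
    rng = random.Random(seed)
    state = rng.choice(STATES)  # assumption: either coin is equally likely at the start
    path, flips = [], []
    for _ in range(n):
        path.append(state)
        flips.append(rng.choices(SYMBOLS, weights=[E[state][b] for b in SYMBOLS])[0])
        state = rng.choices(STATES, weights=[A[state][t] for t in STATES])[0]
    return "".join(path), flips

path, flips = sample(30, seed=0)
print(path)
print("".join(str(b) for b in flips))
```

Printing the path next to the flips shows why the problem is hard: runs of B are long because coin swaps are rare, but the flips alone only hint at where they are.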

Paths in HMMs A path in an HMM is a sequence of states (not emitted characters). It is symbolized $\pi = \pi_1 \pi_2 \ldots \pi_n$. We can match up an observed sequence of characters with a hypothetical series of states, one state $\pi_i$ for each character $x_i$.

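The Probability of a Path Need to factor in the probabilities of transitions between states as well as the emissions. For a path $\pi = \pi_1 \pi_2 \ldots \pi_n$ and observed sequence $x = x_1 x_2 \ldots x_n$: $P(x \mid \pi) = A_{\pi_0 \pi_1} \prod_{i=1}^{n} e_{\pi_i}(x_i) \, A_{\pi_i \pi_{i+1}}$, where $\pi_0$ and $\pi_{n+1}$ stand for fictitious start and end states.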

Notational Extensions… • In the preceding equation, there is an initial transition probability to get us to the first state: $A_{\pi_0 \pi_1}$. This is introduced to get us into the sequence; we can set it to 1/2 to represent that the preceding state (fair or biased) is unknown. • We also introduce a final probability for exiting the sequence, $A_{\pi_n \pi_{n+1}}$. I have set this to unity (10/10).
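The path probability, including the start and exit factors just described, can be sketched as follows. The function name and the four-flip example are illustrative assumptions; the transition and emission values are those of the Fair Bet Casino model, with the initial probability set to 1/2 and the final one to 1.

```python
A = {"F": {"F": 0.9, "B": 0.1},
     "B": {"F": 0.1, "B": 0.9}}
E = {"F": {0: 0.5, 1: 0.5},
     "B": {0: 0.25, 1: 0.75}}

def path_probability(flips, path, start=0.5, end=1.0):
    """Product of the start factor, every emission, every transition, and the exit factor."""
    if len(flips) != len(path):
        raise ValueError("need exactly one state per observed character")
    p = start
    for i, (x, state) in enumerate(zip(flips, path)):
        p *= E[state][x]                  # emit character x from this state
        if i + 1 < len(path):
            p *= A[state][path[i + 1]]    # move on to the next state
    return p * end

flips = [0, 1, 1, 0]  # made-up observation, not the sequence used in the lecture
print(path_probability(flips, "FFFF"))
print(path_probability(flips, "FBFB"))
```

Comparing the two printed values shows how heavily the 1/10 switching probability penalizes a path that changes coins at every step.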

Examples: • For the given sequence, we compute the probability of the path FFFBBBBBFFF to be $2.66 \times 10^{-6}$. Is this an optimal choice for the path? • NO - the sequence FFFBBBFFFFF has a probability of $3.54 \times 10^{-6}$.

The HMM Decoding Problem: • Given a sequence of observed characters $x_1 x_2 \ldots x_n$ generated by the HMM M = (S, Q, A, E), find the path that maximizes P(x|path).

Solving the Decoding Problem • Solution due to Viterbi (1967): Use dynamic programming to find the optimal series of states. • Have a matrix of n columns and |Q| rows. Being in the kth row of the ith column associates state k with character i. (The columns are thus labeled by the characters $x_i$ of the sequence.) • Each state in column i is connected by an edge to every state in column i+1; the edge from (k, i) to (l, i+1) has weight $e_l(x_{i+1}) A_{kl}$. • A path through the matrix corresponds to a path in the HMM; multiplying the weights of all the edges reproduces the probability of the HMM path. • The optimal path through the matrix is the optimal path in the HMM.
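A minimal sketch of this dynamic program for the Fair Bet Casino model is given below. It is not the lecture's own code: the function name, the use of log-probabilities (to avoid underflow on long sequences), and the starting probability of 1/2 for each state are assumptions made here.

```python
import math

STATES = "FB"
A = {"F": {"F": 0.9, "B": 0.1},
     "B": {"F": 0.1, "B": 0.9}}
E = {"F": {0: 0.5, 1: 0.5},
     "B": {0: 0.25, 1: 0.75}}

def viterbi(flips, start=0.5):
    """Return (best path, its probability) for a non-empty list of 0/1 flips."""
    if not flips:
        raise ValueError("need at least one observed character")
    # score[k]: log-probability of the best path that ends in state k at this column
    score = {k: math.log(start) + math.log(E[k][flips[0]]) for k in STATES}
    back = []  # back[i][l]: best predecessor of state l in column i + 1
    for x in flips[1:]:
        prev, score, pointers = score, {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: prev[k] + math.log(A[k][l]))
            pointers[l] = best
            # edge weight e_l(x) * A_kl, added in log space
            score[l] = prev[best] + math.log(A[best][l]) + math.log(E[l][x])
        back.append(pointers)
    # trace back from the best final state
    last = max(STATES, key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), math.exp(score[last])

print(viterbi([0, 1, 1, 0]))
print(viterbi([0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0]))
```

Each column needs only the scores of the previous column, so the running time is proportional to n·|Q|², which is what makes this preferable to enumerating all |Q|^n paths.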

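Viterbi Decoding: [Figure: decoding matrix for the observed sequence 0 1 1 0, with one row of F states and one row of B states and one column per character. The path shown corresponds to FBFB.]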

Parameter Estimation • The Viterbi algorithm assumes that we already know the emission and transition probabilities, and given these we want to know the most probable series of states for an observed sequence of characters. • The usual case is that we don’t know the parameters (probabilities), and need to estimate them from data. • General approach: we are given a set of training strings, and our goal is to find the parameter set that maximizes the probability of generating the strings. This is a difficult problem; we need an initial set of parameters, which are then adjusted by an optimization algorithm. The Baum-Welch algorithm is a popular iterative method.

An easier situation… • Sometimes we are blessed in knowing not just the sequence of characters in a set of training strings, but also the state sequences! In this lucky situation we can directly estimate the probabilities by accumulating simple statistics: Let $A_{kl}$ be the number of times we observe transitions from state k to l (a count here, not a probability), and let $E_k(b)$ be the number of times we observe character b emitted by state k; then our estimated parameters are $a_{kl} = A_{kl} / \sum_{q \in Q} A_{kq}$ and $e_k(b) = E_k(b) / \sum_{\sigma \in S} E_k(\sigma)$.
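When the state sequences are known, the counting estimate amounts to a few lines of Python. The sketch below is illustrative only: the function name and the tiny training set are made up, and returning 0.0 for a state that was never observed is an assumption made here (no pseudocounts are applied).

```python
from collections import Counter

def estimate(training, states="FB", symbols=(0, 1)):
    """Estimate transition and emission probabilities from (flips, path) pairs."""
    trans, emit = Counter(), Counter()
    for flips, path in training:
        for x, state in zip(flips, path):
            emit[(state, x)] += 1          # character x emitted by this state
        for k, l in zip(path, path[1:]):
            trans[(k, l)] += 1             # transition observed from k to l
    a, e = {}, {}
    for k in states:
        t_total = sum(trans[(k, q)] for q in states)
        e_total = sum(emit[(k, b)] for b in symbols)
        # assumption: a state with no observations gets probability 0.0 everywhere
        a[k] = {l: trans[(k, l)] / t_total if t_total else 0.0 for l in states}
        e[k] = {b: emit[(k, b)] / e_total if e_total else 0.0 for b in symbols}
    return a, e

# Made-up labeled training string, not data from the lecture.
a, e = estimate([([0, 1, 1, 1, 0, 1], "FFBBBF")])
print(a)
print(e)
```

With so little data the estimates are far from the true casino parameters; in practice many long labeled strings would be needed before the ratios settle down.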
