
Lecture 7. Hidden Markov Models



1. Lecture 7. Hidden Markov Models
The Chinese University of Hong Kong, CSCI3220 Algorithms for Bioinformatics

2. Lecture outline
• Motivating example: gene finding
• Markov models and hidden Markov models
• The three problems related to HMMs
  • Computing data likelihood
  • Using a model
  • Learning a model
• Applications

  3. Part 1 Motivating Example

4. Gene structure
• Recall the structure of a typical eukaryotic (higher organisms, with a cell nucleus) gene:
[Figure: structure of a gene. Image credit: Wikipedia]

5. From sequences to meanings
• Suppose you are given the following DNA sequence:
  TCGTGTCTAGTCGGATATAGCTATGCATCTAGCTGAGGCGATCTGAGCGGATCGATGCTAGGGCGATCGGAGCTAGCTGAGCTAGCTAGCTGAGCGCTAGCGAGCGTACGAGCGATCGAGCGAGTCTAGCGAGCGATTCTAGCGATCGAGCGTCTACGATCGTATGCTAGCTAGGGCTAGCATGCGGATCTATCGAGCGGCTAT
• Some questions:
  • How likely is it part of a gene?
  • If it is a gene, where are the
    • Exons and introns?
    • Coding sequence (open reading frame) and untranslated regions?

6. Useful information
• Some patterns that could be useful for answering the questions:
  • An intron starts with a “donor site”, usually with sequence GT
  • An intron ends with an “acceptor site”, usually with sequence AG
  • The coding sequence is composed of codons (nucleotide triplets)
    • It should start with the start codon ATG
    • It should end with a stop codon
    • Some amino acids (and thus the codons) are more common
    • Some amino acids frequently appear together
• Difficulty: there are exceptions and many false positives
[Image credit: http://svmcompbio.tuebingen.mpg.de/img/splice.png, Wikipedia]

7. Constructing and using a gene model
• General approach to answering the questions:
  • Identify typical sequence patterns on known genes with annotated exons, introns, etc.
  • Provide an abstraction of the patterns in the form of a (statistical) model
  • Use the model to evaluate the potential that an input sequence is part of a gene
• We are going to study one type of commonly used model for this purpose: the hidden Markov model (HMM)

8. Illustration of a rule-based model
• Learning a model for a sequence pattern:
  • Training examples:
    • AAACCCCAA
    • ACCCCCAA
    • AAAAACAA
    • AACCCCCAA
  • Model (abstraction): some A’s, followed by some C’s, followed by two A’s
  • Application: look for new instances
    • CGAAACCGGCATCGGACCAACTTCAACCGTCACG
• This model is based on exact rules (a subsequence either matches the pattern or it does not), while the type of model we are going to study next is based on probabilities (a subsequence matches the pattern with a certain probability)

9. Illustration of a probabilistic model
• Learning a model for a sequence pattern:
  • Training examples:
    • AACAAAC
    • ACCAAACA
    • ACAAAC
    • AACAAAC
  • Possible models:
    • Model 1: for every position, 70% chance A, 30% chance C
    • Model 2: for every position, 10% chance A, 90% chance C
  • Which model do you think is more likely to be correct? (See the likelihood comparison sketched below.)
• Also missing: dependency between positions → need more advanced models (HMMs)
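The question on this slide can be made concrete by computing how probable the four training sequences are under each candidate model. A minimal Python sketch (not from the slides; it simply multiplies the per-position probabilities, using the independence assumption both models make):

```python
# Compare the two candidate models by the likelihood each assigns to the
# training sequences, assuming every position is generated independently.
training = ["AACAAAC", "ACCAAACA", "ACAAAC", "AACAAAC"]

def likelihood(seqs, p_A, p_C):
    """Probability of observing all sequences under the given per-position model."""
    prob = 1.0
    for seq in seqs:
        for base in seq:
            prob *= p_A if base == "A" else p_C
    return prob

print(likelihood(training, 0.7, 0.3))  # Model 1: about 2.2e-08
print(likelihood(training, 0.1, 0.9))  # Model 2: about 3.9e-20, far less likely
```

Model 1 assigns the data a likelihood about twelve orders of magnitude higher, which matches the intuition that the training sequences are mostly A's.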

10. Part 2 Markov Models and Hidden Markov Models

11. Sequence of events
• Suppose we play this game:
  • I have two coins in my pocket.
  • Coin A is fair, with 50% chance of head (H) and 50% chance of tail (T) in a toss.
  • Coin B is unfair, with 25% chance of H and 75% chance of T in a toss.
• Now I do this:
  1. Take out one of the coins from my pocket (in a way that will be described later) without letting you know which one it is.
  2. Throw the coin and let you see the outcome.
  3. Repeat #1 and #2 a few times (each time I could use a different coin).
• Then you need to guess which coin I used each time.
Example modified from one provided in Chapter 12 of Ewens and Grant, Statistical Methods in Bioinformatics (2nd Edition), Springer (2005).

12. Example
• Suppose you see this sequence of outcomes: T, H, T
• Which coin did I use each time?
  • Impossible to tell, because each time both coins could have produced the outcome
• A more sensible question: which “coin sequence” is most likely, given the observed “outcome sequence”?
  • Need to first know something about how I chose the coins
• Why is it related to gene modeling?
  • Suppose you observe ACCGATAGCTTAC... and you want to guess, at each position, whether I used the “exon coin” or the “intron coin” to produce the nucleotide...

13. Terminology and symbols
• To facilitate our discussion, we need to define some terms and symbols:
  • O is a sequence of observed outcomes. Let m be the length of O; the m outcomes are written as o1, o2, ..., om
    • In our example, m=3 and O=THT
  • S is a set of possible states, known as s1, s2, ..., sN
    • In our example, each state is a coin: s1=A and s2=B
  • Q is a sequence of m states that produced the m corresponding observations, known as q1, q2, ..., qm. The actual value of each qt should be one of the N si’s
    • In our example, Q could be AAB, BAB, etc.

14. Terminology and symbols
• For each state si, the probability that a particular outcome o is produced (“emitted”) is written as ei(o)
• In our example, these “emission probabilities” can be summarized by the following table:
  • Coin A (s1): e1(H) = 0.5, e1(T) = 0.5
  • Coin B (s2): e2(H) = 0.25, e2(T) = 0.75

15. Terminology and symbols
• How did I determine which coin to use each time?
  • First time: determined randomly based on fixed initial probabilities, πi for state si, i.e., πi = Pr(q1=si)
    • In our example, suppose π1=Pr(q1=A)=0.5 and π2=Pr(q1=B)=0.5, i.e., I just randomly took one from my pocket with equal chance
  • Second time onwards: determined randomly based on the coins I actually used before
    • It means I had to determine qt based on q1, q2, ..., qt-1
    • If I determined qt based on qt-1 only, it is said that I assumed a first-order Markov model
    • If I determined qt based on qt-1 and qt-2 only, it is said that I assumed a second-order Markov model
    • And so on for higher orders

16. Terminology and symbols
• In our example, we assume a first-order Markov model
• Given that si was chosen at time t, the probability of choosing state sj at time t+1 is called the transition probability from si to sj
  • It is written as pij, which is defined as pij = Pr(qt+1=sj | qt=si)
  • These probability values do not change over time
• In our example, suppose we have the following transition probabilities:
  • From coin A: pAA = 0.9, pAB = 0.1
  • From coin B: pBA = 0.8, pBB = 0.2
  • Say coin B is heavier and can easily sink to the bottom of my pocket, or I tend not to use the unfair coin B too often

17. Terminology and symbols
• A hidden Markov model is fully described by the parameters λ = (Π, P, E)
  • Probabilities of: initial states, state transitions and emissions of observations
• A capital letter denotes the whole set of variables written with the corresponding small letter. For example, P is the set that contains the variables p11, p12, ..., p1N, p21, ..., pNN.
• Sometimes we also use the set notation: P = {pij}, where i and j are state indices, in this case with the value ranges 1 ≤ i ≤ N and 1 ≤ j ≤ N.

18. Summary of symbols
• States:
  • si: the i-th possible state. There are N possible states in total
  • S = {si}
• Model parameters:
  • πi: initial probability to enter state si
  • pij: transition probability from state si to state sj
  • ei(o): probability of emitting o at state si
  • Π = {πi}
  • P = {pij}
  • E = {ei(o)}
  • λ = (Π, P, E)
• Hidden states:
  • qt: unknown state at time t. There are m time points in total
  • Q = {qt}
• Observed outcomes:
  • ot: observed outcome at time t
  • O = {ot}
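To make the notation concrete, here is a minimal Python sketch (not part of the slides) of the two-coin example written as λ = (Π, P, E); the numbers are the parameter values used throughout this lecture:

```python
# The two-coin HMM of this lecture, written as lambda = (Pi, P, E).
states = ["A", "B"]               # s1 = fair coin A, s2 = biased coin B
outcomes = ["H", "T"]             # possible emitted outcomes

Pi = {"A": 0.5, "B": 0.5}         # initial probabilities pi_i
P = {"A": {"A": 0.9, "B": 0.1},   # transition probabilities p_ij
     "B": {"A": 0.8, "B": 0.2}}
E = {"A": {"H": 0.5,  "T": 0.5},  # emission probabilities e_i(o)
     "B": {"H": 0.25, "T": 0.75}}
```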

19. Graphical representation
• From my perspective: the model (states A and B with their initial, transition and emission probabilities) plus a possible run, in which I know both the coin used and the outcome at each step
• From your perspective: the same model, but in a run you only see the outcomes T, H, T; the states behind them are hidden (?)
• Note: the most likely state sequence statistics-wise is not necessarily the actual state sequence
[Figure: state diagrams of the two-coin model (initial probabilities 0.5/0.5; transition probabilities pAA=0.9, pAB=0.1, pBA=0.8, pBB=0.2; emission probabilities 0.5/0.5 for coin A and 0.25/0.75 for coin B) and a possible run producing T, H, T]

20. Linking back to gene model
• See the close correspondence between our example and the gene finding scenario using the defined terminology (coins correspond to hidden states such as exon/intron; toss outcomes correspond to observed nucleotides)
• Coming back to our question: “Given the outcome sequence T, H, T, which coin sequence was most likely?”
• Using our terminology, we can restate it as “Given the observed outcome sequence O and the model parameters λ, what state sequence Q was most likely among all possible state sequences?”
  • We first assume the initial and transition probabilities are known
• Since the states were not observed (i.e., “hidden”), the model is a “hidden” Markov model
• Using the symbols defined, we can write the question as finding the following:
  argmaxQ Pr(Q|O,λ)  (“the Q such that Pr(Q|O,λ) is maximized”)

21. Basic probabilities: revision
• Pr(X)Pr(Y|X) = Pr(X and Y)
  • If there is a 20% chance to rain tomorrow, and whenever it rains, there is a 60% chance that the temperature will drop, then there is a 0.2*0.6=0.12 chance that tomorrow it will both rain and have a temperature drop
  • Capital letters mean it is true for all values of X and Y
  • Can also write Pr(X=x)Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y
• Law of total probability: Pr(X) = Σy Pr(X and Y=y) (the summation should consider all possible values of Y)
  • If there is
    • A 0.12 chance that it will both rain and have a temperature drop tomorrow, and
    • A 0.08 chance that it will both rain and not have a temperature drop tomorrow
  • Then there is a 0.12+0.08 = 0.2 chance that it will rain tomorrow
• Bayes’ rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) when Pr(Y) ≠ 0
  • Because Pr(X|Y)Pr(Y) = Pr(Y|X)Pr(X) = Pr(X and Y)
  • Similarly, Pr(X|Y,Z) = Pr(Y|X,Z)Pr(X|Z)/Pr(Y|Z) when Pr(Y|Z) ≠ 0

22. Finding the most likely state sequence
• Back to our question: argmaxQ Pr(Q|O,λ)
• Pr(Q|O,λ) = Pr(O|Q,λ)Pr(Q|λ) / Pr(O|λ)  [Bayes’ rule]
• If we want to know which Q maximizes the above, we do not need to care about the part that does not change for different Q, i.e., Pr(O|λ)
• Therefore, we only need to compare Pr(O|Q,λ)Pr(Q|λ) for different values of Q
• In other words, argmaxQ Pr(Q|O,λ) = argmaxQ Pr(O|Q,λ)Pr(Q|λ)
• Why do we want to write it this way?
  • Because we cannot compute Pr(Q|O,λ) directly, but we can compute Pr(O|Q,λ) and Pr(Q|λ), as we will see

23. How to find the Q that maximizes Pr(O|Q,λ)Pr(Q|λ)?
• Simplest way: try all possibilities
• HMM parameters λ:
  • Initial states Π: π1=Pr(q1=A)=0.5, π2=Pr(q1=B)=0.5
  • Transition probabilities P: pAA=0.9, pAB=0.1, pBA=0.8, pBB=0.2
  • Emission probabilities E: e1(H)=0.5, e1(T)=0.5, e2(H)=0.25, e2(T)=0.75
• Observed outcomes: O=THT
• Pr(O|Q,λ)Pr(Q|λ) for each possible Q:
  • AAA: 0.050625
  • AAB: 0.0084375
  • ABA: 0.0025
  • ABB: 0.0009375
  • BAA: 0.0675 ← most likely
  • BAB: 0.01125
  • BBA: 0.0075
  • BBB: 0.0028125

24. Another example
• Same HMM parameters λ as before
• Observed outcomes: O=HHH
• Pr(O|Q,λ)Pr(Q|λ) for each possible Q:
  • AAA: 0.050625 ← most likely
  • AAB: 0.0028125
  • ABA: 0.0025
  • ABB: 0.0003125
  • BAA: 0.0225
  • BAB: 0.00125
  • BBA: 0.0025
  • BBB: 0.0003125
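The two tables above can be reproduced by explicitly enumerating all N^m state sequences. A minimal Python sketch (not the lecture's own code; it assumes the coin parameters listed above):

```python
from itertools import product

# Brute-force search: enumerate every state sequence Q of length m and compute
# Pr(O|Q,lambda) * Pr(Q|lambda) for each, then keep the largest.
Pi = {"A": 0.5, "B": 0.5}
P  = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
E  = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}

def joint(Q, O):
    """Pr(O|Q,lambda) * Pr(Q|lambda) for one state sequence Q."""
    prob = Pi[Q[0]] * E[Q[0]][O[0]]
    for t in range(1, len(O)):
        prob *= P[Q[t-1]][Q[t]] * E[Q[t]][O[t]]
    return prob

for O in ["THT", "HHH"]:
    best = max(product("AB", repeat=len(O)), key=lambda Q: joint(Q, O))
    print(O, "".join(best), joint(best, O))
    # THT -> BAA with 0.0675; HHH -> AAA with 0.050625
```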

25. How much time does the calculation take?
• For each Q: compute the product of
  • m emission probabilities
  • 1 initial probability
  • m-1 transition probabilities
  • Total: 2m factors, which is OK
• Number of possible Q’s:
  • N^m, which is NOT OK
• How to get the answer using less time?
  • Reuse partial results → dynamic programming!

26. Part 3 The Three Problems Related to HMM

27. The three problems
• There are three fundamental problems related to hidden Markov models:
  1. Evaluating data likelihood: given a fully specified HMM (i.e., with known λ), compute the probability of getting a sequence of outcomes, Pr(O|λ)
  2. Using a model: given a fully specified HMM and a sequence of outcomes, determine the most likely sequence of states (the problem we have just studied)
  3. Learning a model: given a set of possible states S and a set of multiple outcome sequences (without knowing λ), determine the parameters of an HMM that maximize the data likelihood

28. Problem 1: Evaluating data likelihood
• Given a fully specified HMM, compute the probability of getting a sequence of outcomes, Pr(O|λ)
• Since the actual sequence of states is unknown, we need to try them all:
  Pr(O|λ) = ΣQ Pr(O|Q,λ)Pr(Q|λ)
  • Pr(O|Q,λ) is simply a product of emission probabilities
  • Pr(Q|λ) is simply a product of initial and transition probabilities
• However, the problem is again the exponential number of possible state sequences Q that need to be summed up
• We now study dynamic programming algorithms that can solve this problem efficiently

29. The forward algorithm
• First we define α(t,i) = Pr(o1, o2, ..., ot, qt=si | λ), which is the probability that we are at state si in step t and we have observed o1, o2, ..., ot
• The problem of computing α(t,i) can be solved by solving some sub-problems:
  • If we were at state sj in step t-1, we could reach the current situation by transiting from state sj to state si, and then emitting ot
  • Since we do not really know which state we were at in step t-1, we sum over all possibilities

30. Derivation of recursive formula
α(t,i) = Pr(o1, ..., ot, qt=si | λ)
       = Σj Pr(o1, ..., ot-1, qt-1=sj | λ) Pr(qt=si | qt-1=sj, λ) Pr(ot | qt=si, λ)
       = [Σj α(t-1,j) pji] ei(ot)

31. The forward algorithm (cont’d)
• Recall the definition: α(t,i) = Pr(o1, o2, ..., ot, qt=si | λ)
• Now we treat α as a dynamic programming table to calculate Pr(O|λ) – calculate the probabilities forward:
  • Initialization: when t=1, the table entries can be computed directly for any i: α(1,i) = Pr(o1, q1=si | λ) = Pr(q1=si | λ) Pr(o1 | q1=si, λ) = πi ei(o1)
  • Recursion: for any i, α(t,i) can be computed from all the α(t-1,j)’s: α(t,i) = [Σj α(t-1,j) pji] ei(ot)
  • The final value of Pr(O|λ) can be computed from the α(m,j)’s: Pr(O|λ) = Σj α(m,j)

32. Example
• To compute: Pr(O=THT|λ)
• HMM parameters λ (as before):
  • Initial states Π: π1=Pr(q1=A)=0.5, π2=Pr(q1=B)=0.5
  • Transition probabilities P: pAA=0.9, pAB=0.1, pBA=0.8, pBB=0.2
  • Emission probabilities E: e1(H)=0.5, e1(T)=0.5, e2(H)=0.25, e2(T)=0.75
• The α table, α(t,i) = Pr(o1, o2, ..., ot, qt=si | λ), with α(1,i) = πi ei(o1):
  • t=1 (o1=T): α(1,A) = 0.25, α(1,B) = 0.375
  • t=2 (o2=H): α(2,A) = 0.2625, α(2,B) = 0.025
  • t=3 (o3=T): α(3,A) = 0.128125, α(3,B) = 0.0234375
• Final answer: 0.128125 + 0.0234375 = 0.1515625
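A minimal Python sketch of the forward computation (illustrative code, not the lecture's own; the parameters are the coin example above). On O=THT it reproduces the table and the final answer 0.1515625:

```python
# Forward algorithm for the two-coin HMM.
Pi = {"A": 0.5, "B": 0.5}
P  = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
E  = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def forward(O):
    """Return alpha, where alpha[t][i] = Pr(o_1..o_{t+1}, q_{t+1}=s_i | lambda) (0-based t)."""
    alpha = [{i: Pi[i] * E[i][O[0]] for i in states}]                  # initialization
    for t in range(1, len(O)):
        alpha.append({i: sum(alpha[t-1][j] * P[j][i] for j in states) * E[i][O[t]]
                      for i in states})                                # recursion
    return alpha

alpha = forward("THT")
print(sum(alpha[-1].values()))   # 0.1515625
```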

33. Time and space analysis
• The table contains N·m entries (reminder: N = number of states, m = length of the observed sequence)
• Computing the value of each entry requires summing up N numbers
  • Each of these numbers is produced by a constant number of product operations
• The total amount of time required is proportional to N^2·m
• Since α(t,i) depends only on the α(t-1,j)’s, we can throw away everything from two or more steps before. The space needed is thus proportional to N
  • Plus N|U| for storing the emission probabilities (where U is the set of possible outcomes), N for the initial probabilities, and N^2 for the transition probabilities of the HMM

34. The backward algorithm
• For the same problem, there can be multiple solutions
• Here we can find the data likelihood by another algorithm
• First we define β(t,i) = Pr(ot+1, ot+2, ..., om | qt=si, λ)
• β(t,i) can be computed from the β(t+1,j)’s:
  β(t,i) = Σj pij ej(ot+1) β(t+1,j)

35. Derivation of recursive formula
β(t,i) = Pr(ot+1, ..., om | qt=si, λ)
       = Σj Pr(qt+1=sj | qt=si, λ) Pr(ot+1 | qt+1=sj, λ) Pr(ot+2, ..., om | qt+1=sj, λ)
       = Σj pij ej(ot+1) β(t+1,j)

36. The backward algorithm (cont’d)
• β(t,i) = Pr(ot+1, ot+2, ..., om | qt=si, λ)
• Now we treat β as a dynamic programming table to calculate Pr(O|λ) – calculate the probabilities backward:
  • Initialization: β(m,i) = 1 for every i
  • Recursion: β(t,i) = Σj pij ej(ot+1) β(t+1,j)
  • Final answer: Pr(O|λ) = Σi πi ei(o1) β(1,i)
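A minimal Python sketch of the backward computation on the same coin example (again illustrative, not the lecture's code); the final answer equals the forward result, Pr(O=THT|λ) = 0.1515625:

```python
# Backward algorithm for the two-coin HMM.
Pi = {"A": 0.5, "B": 0.5}
P  = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
E  = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def backward(O):
    """Return beta, where beta[t][i] = Pr(o_{t+2}..o_m | q_{t+1}=s_i, lambda) (0-based t)."""
    m = len(O)
    beta = [dict() for _ in range(m)]
    beta[m-1] = {i: 1.0 for i in states}                           # initialization
    for t in range(m - 2, -1, -1):                                 # recursion
        beta[t] = {i: sum(P[i][j] * E[j][O[t+1]] * beta[t+1][j] for j in states)
                   for i in states}
    return beta

O = "THT"
beta = backward(O)
print(sum(Pi[i] * E[i][O[0]] * beta[0][i] for i in states))        # 0.1515625
```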

37. Problem 2: Using a model
• Given a fully specified HMM and a sequence of outcomes, determine the most likely state sequence
• We have already learned how to do it by listing all possible state sequences. Now we want to do it faster (using dynamic programming)
• The first observation is that instead of maximizing the conditional probability, we can maximize the joint probability as well:
  argmaxQ Pr(Q|O,λ) = argmaxQ Pr(Q,O|λ)

38. The Viterbi algorithm
• Again, we want to break down the original problem into simpler sub-problems:
  • Consider states and observations up to step t
  • Assume we know the state at step t is si
• Then we can solve each sub-problem recursively by solving its sub-problems
• We will first find out the maximum probability maxQ Pr(Q,O|λ), and then find out a state sequence Q that produces such a probability

39. Find the maximum probability
• As discussed, we define a dynamic programming table for the sub-problems:
  δ(t,i) = max over q1, ..., qt-1 of Pr(o1, ..., ot, q1, ..., qt-1, qt=si | λ)
• Initialization: δ(1,i) = πi ei(o1)
• Recursion: δ(t,i) = [maxj δ(t-1,j) pji] ei(ot)
• Final answer: maxQ Pr(Q,O|λ) = maxi δ(m,i)

40. Derivation of recursive formula
δ(t,i) = max over q1, ..., qt-1 of Pr(o1, ..., ot, q1, ..., qt-1, qt=si | λ)
       = maxj [ (max over q1, ..., qt-2 of Pr(o1, ..., ot-1, q1, ..., qt-2, qt-1=sj | λ)) Pr(qt=si | qt-1=sj, λ) Pr(ot | qt=si, λ) ]
       = [maxj δ(t-1,j) pji] ei(ot)

41. Trace back
• At each step, δ(t,i) is determined by taking the maximum of δ(t-1,j) pji over the different j’s
• We can imagine drawing “red arrows” from these δ(t-1,j) cells to the δ(t,i) cells. Trace-back is then simply following the red arrows backwards, starting from the cell δ(m,i) with the largest value
[Figure: the δ table with columns i=1, ..., N and rows t=1, ..., m, with red arrows linking each cell to the cell that attained the maximum in its recursion]
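A minimal Python sketch of the Viterbi algorithm with trace-back (illustrative code on the coin example); for O=THT it recovers the state sequence BAA found earlier by brute force, with probability 0.0675:

```python
# Viterbi algorithm with trace-back for the two-coin HMM.
Pi = {"A": 0.5, "B": 0.5}
P  = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
E  = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def viterbi(O):
    # delta[t][i] = max over q_1..q_t of Pr(o_1..o_{t+1}, q_1..q_t, q_{t+1}=s_i | lambda)
    delta = [{i: Pi[i] * E[i][O[0]] for i in states}]
    back = [{}]                                    # the "red arrows"
    for t in range(1, len(O)):
        delta.append({})
        back.append({})
        for i in states:
            j_best = max(states, key=lambda j: delta[t-1][j] * P[j][i])
            back[t][i] = j_best
            delta[t][i] = delta[t-1][j_best] * P[j_best][i] * E[i][O[t]]
    # trace back from the most probable final state
    q = max(states, key=lambda i: delta[-1][i])
    path = [q]
    for t in range(len(O) - 1, 0, -1):
        q = back[t][q]
        path.append(q)
    return "".join(reversed(path)), max(delta[-1].values())

print(viterbi("THT"))    # ('BAA', 0.0675)
```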

42. Problem 3: Learning a model
• Given a set of possible states S and a set of outcome sequences, determine the parameters of an HMM that maximize the data likelihood
• It is a difficult problem with a large “search space” (possible combinations of parameter values)
• We will study a heuristic method

43. The Baum-Welch algorithm
• Assume we have a set of observed outcome sequences: {O(d)} = {O(1), O(2), ...}
  • Each element is a sequence of outcomes: O(d) = o1(d), o2(d), ..., om(d)
• Correspondingly, there is a set of hidden state sequences: {Q(d)} = {Q(1), Q(2), ...}
  • Each element is a sequence of states: Q(d) = q1(d), q2(d), ..., qm(d)

44. The Baum-Welch algorithm
• High-level procedure:
  1. Assign some initial values to the parameter variables πi, pij and ei(ok)
  2. Update the values based on the current estimates:
    • πi to the expected fraction of times that the first state is si
    • pij to the expected fraction of times that the next state is sj given that the current state is si
    • ei(ok) to the expected fraction of times that the outcome ok is emitted given that the current state is si
  3. Repeat #2 until the increase in Pr({O(d)}|λ) across two iterations is smaller than a threshold
• It can be proved that Pr({O(d)}|λ) never decreases
• However, there is no guarantee that the final set of parameter values is one that maximizes Pr({O(d)}|λ)

45. The Baum-Welch algorithm
• We will not discuss all the details of the algorithm, but just one interesting part
• The updates in #2 require this quantity:
  Pr(qt(d)=si, qt+1(d)=sj | O(d), λ)
  which is the probability that in the d-th sequence, the t-th state was si and the (t+1)-th state was sj, given the d-th outcome sequence and the current estimates of the parameter values, λ
• By Bayes’ rule, this equals Pr(qt(d)=si, qt+1(d)=sj, O(d) | λ) / Pr(O(d) | λ)
• We have learned how to compute the denominator efficiently (Problem 1)

46. Forward and backward variables
• How about the numerator?
• To simplify the notation, let’s suppose there is only one outcome sequence, O
• The numerator is then (not showing the derivation):
  α(t,i) β(t+1,j) pij ej(ot+1)
• We have learned how to compute the first two terms (the “forward and backward variables”)
• The last two terms can be read directly from the current estimates
• When the probability of a hidden state is estimated using both the outcomes before and after it, the process is called “smoothing”
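A minimal Python sketch (assuming a single outcome sequence O, and reusing the alpha and beta tables from the forward and backward sketches shown earlier; 0-based time index t) of the quantity used in the Baum-Welch updates:

```python
# xi(t,i,j) = Pr(q_t = s_i, q_{t+1} = s_j | O, lambda), built from the
# forward table alpha, the backward table beta, and the current parameters.
def xi(alpha, beta, O, t, i, j, P, E):
    numerator = alpha[t][i] * beta[t+1][j] * P[i][j] * E[j][O[t+1]]
    denominator = sum(alpha[-1].values())        # Pr(O | lambda), from Problem 1
    return numerator / denominator
```

For example, with alpha = forward("THT") and beta = backward("THT") from the earlier sketches, xi(alpha, beta, "THT", 0, "A", "B", P, E) estimates the probability that the first coin was A and the second was B.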

  47. Part 4 Applications

48. Applying HMMs to biological sequences
• In a real situation, the first task is to learn a model, from either
  • Data with outcomes only (e.g., sequences that contain genes), OR
  • Data with both outcomes and states (e.g., annotated genes with known exons, introns, etc.)
• Then the model is used to scan unannotated sequences to identify those that match the model well

49. Biological applications
• Genes
• Sequence profiles
• Protein binding sites on DNA
• Whole-genome segmentation
• RNA structures
• Protein domains
• Protein families
• ...

50. HMM gene models
• VEIL (Henderson et al., Journal of Computational Biology 4(2):127-141, 1997)
• An overall model, then a sub-model for each module
[Figures: the overall VEIL model structure, and the sub-model for the exon and stop codon]
