
CSE182-L8



  1. CSE182-L8: HMMs, Gene Finding

  2. Midterm 1 • In class next Tuesday • Syllabus: L1-L8 • Please review HW, questions, and other notes • Bring one sheet of written formulae • Most of the questions will test concepts, so hopefully, you don’t have to remember too much. • Final will be a take-home exam • We will discuss class projects next time.

  3. QUIZ! • Question: • Your ‘friend’ likes to gamble. • He tosses a coin: TAILS, he gives you a dollar; HEADS, you give him a dollar. • Usually he uses a fair coin, but ‘once in a while’ he uses a loaded coin. • Can you say what fraction of the time he uses the loaded coin? • Suppose you know that for the loaded coin Pr(H)=0.8, and that he switches coins with 30% probability.

  4. Modeling the coin problem as an HMM • Two hidden states: F (fair, Pr(H)=0.5) and L (loaded, Pr(H)=0.8) • Transitions: Pr(F->F)=0.7, Pr(F->L)=0.3, Pr(L->L)=0.7, Pr(L->F)=0.3 • Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH • Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF • Q: Given the HMM, can you predict the hidden states?

  5. HMM? • It is an automaton with different ‘states’ (Ex: L, F) • We start from a specific start state • Each state of the automaton generates a symbol from an alphabet ({‘H’,’T’}), according to a distribution. • After generating, we move (transition) to a new state. • While the output (the generated symbols) is visible, the states themselves are hidden. • Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH • Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF
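A minimal sketch of how such an HMM generates a visible toss sequence while the state sequence stays hidden, assuming the transition and emission numbers from the slide and (an added assumption) a start in the fair state:

```python
import random

# Two-state coin HMM from the slides: F = fair (Pr(H)=0.5), L = loaded (Pr(H)=0.8),
# with a 0.3 probability of switching states after each toss.
TRANS = {'F': {'F': 0.7, 'L': 0.3}, 'L': {'L': 0.7, 'F': 0.3}}
EMIT = {'F': {'H': 0.5, 'T': 0.5}, 'L': {'H': 0.8, 'T': 0.2}}

def generate(n, start='F'):
    """Sample n tosses; return (hidden state string, observed symbol string)."""
    state, hidden, observed = start, [], []
    for _ in range(n):
        hidden.append(state)
        observed.append('H' if random.random() < EMIT[state]['H'] else 'T')
        # transition to the next hidden state
        state = 'F' if random.random() < TRANS[state]['F'] else 'L'
    return ''.join(hidden), ''.join(observed)

hidden, observed = generate(26)
print(observed)  # visible, e.g. HHTHHHTHH...
print(hidden)    # hidden,  e.g. LLFFLLLLL...
```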

  6. Profile HMMs • Many natural problems can be expressed as an HMM. Ex: protein sequence profiles • Recall that a profile is created by looking at position-specific distributions • (Figure: a profile column with residue frequencies such as 0.71, 0.14, ...)

  7. The generative model • Think of each column in the alignment as generating a distribution. • For each column, build a node that outputs a residue with the appropriate distribution • (Figure: a column node with, e.g., Pr[F]=0.71, Pr[Y]=0.14)

  8. A simple Profile HMM • Connect the nodes for each column into a chain. This chain generates sequences according to the profile distribution. • What is the probability of generating FKVVGQVILD? • In this representation, Prob[new sequence S belongs to the family] = Prob[HMM generates sequence S] • How does this differ from profiles?
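A minimal sketch of that computation for a chain of match states, assuming the profile is given as one emission dictionary per column (the columns and probabilities below are illustrative, not the actual profile from the slides):

```python
import math

# Hypothetical profile: one emission distribution per match column (values illustrative).
profile = [
    {'F': 0.71, 'Y': 0.14, 'W': 0.15},
    {'K': 0.60, 'R': 0.40},
    {'V': 0.50, 'I': 0.30, 'L': 0.20},
]

def log_prob(seq, profile, pseudo=1e-6):
    """Log-probability that the chain of match states emits seq (one residue per column)."""
    assert len(seq) == len(profile)
    return sum(math.log(col.get(res, pseudo)) for res, col in zip(seq, profile))

print(log_prob("FKV", profile))
```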

  9. Profile HMMs can handle gaps • The match states are the same as on the previous page. • Insertion and deletion states help introduce gaps. • A sequence may be generated using different paths.

  10. Example • The HMM M models a family of 3 aligned sequences: AL-L, AIVL, AI-L. • Question: Is sequence S=ALIL part of the family? • Solution: Compute Pr(S|M), the probability that the HMM M generated S. • If Pr(S|M) is ‘High’, S belongs to the family. If it is ‘Low’, S does not belong. • How can we compute Pr(S=ALIL|M)?

  11. Example (continued; alignment AL-L, AIVL, AI-L) • Probability Pr[ALIL|M]? • Note that multiple paths can generate this sequence, e.g. • P1=M1I1M2M3 • P2=M1M2I2M3 • We ask a simpler question first: suppose a specific path is given. • Compute the joint probability Pr(S,P|M)

  12. Joint probability computation • Joint probability of seeing a sequence S and path P • Pr[S,P|M] = Pr[S|P,M] Pr[P|M] • Pr[ALIL AND M1I1M2M3|M] = Pr[ALIL | M1I1M2M3, M] Pr[M1I1M2M3|M] = ?

  13. Formally • The emitted sequence is S = S1S2…Sm • The path traversed is P1P2P3… • ej(s) = emission probability of symbol s in state j • Transition probability T[Pj,Pk]: probability of transitioning from state Pj to state Pk • Pr(P,S|M) = eP1(S1) T[P1,P2] eP2(S2) … • Let P[k,n] = Pr(P1…Pk, S1…Sn|M) • Then P[k,n] = P[k-1,n-1] T[Pk-1,Pk] ePk(Sn) • What is Pr(S|M)?
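A minimal sketch of that product for a fixed path, assuming the model is given as plain transition and emission dictionaries (the states mirror the paths on the previous slide, but the probability values are illustrative):

```python
# Joint probability Pr(P, S | M) for a fixed path: multiply, in order,
# each transition probability and each emission probability along the path.
trans = {('start', 'M1'): 1.0, ('M1', 'I1'): 0.2, ('I1', 'M2'): 0.8,
         ('M2', 'M3'): 0.9}                                   # illustrative numbers
emit = {'M1': {'A': 0.6}, 'I1': {'L': 0.05}, 'M2': {'I': 0.4}, 'M3': {'L': 0.7}}

def joint_prob(path, seq):
    """Pr(path, seq | M), assuming every state on the path emits one symbol."""
    p, prev = 1.0, 'start'
    for state, symbol in zip(path, seq):
        p *= trans.get((prev, state), 0.0) * emit[state].get(symbol, 0.0)
        prev = state
    return p

print(joint_prob(['M1', 'I1', 'M2', 'M3'], "ALIL"))
```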

  14. Two solutions • An unknown (hidden) path is traversed to produce (emit) the sequence S. • The probability that M emits S can be defined either as • the sum of the joint probabilities over all paths: Pr(S|M) = ∑P Pr(S,P|M) • OR as the joint probability of the most likely path: Pr(S|M) = maxP Pr(S,P|M) • Both are appropriate ways to model the problem, and they have similar algorithms to solve them.

  15. Viterbi Algorithm for HMMs • Let Pmax(i,j|M) be the probability of the most likely path that emits S1…Si and ends in state j (is it sufficient to compute this?) • Pmax(i,j|M) = maxk Pmax(i-1,k) T[k,j] ej(Si) (Viterbi) • Psum(i,j|M) = ∑k Psum(i-1,k) T[k,j] ej(Si) (forward)
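A minimal sketch of both recurrences on the coin HMM from earlier (same slide parameters; the uniform start distribution over the two states is an added assumption):

```python
STATES = ['F', 'L']
START = {'F': 0.5, 'L': 0.5}                      # assumed uniform start
TRANS = {'F': {'F': 0.7, 'L': 0.3}, 'L': {'L': 0.7, 'F': 0.3}}
EMIT = {'F': {'H': 0.5, 'T': 0.5}, 'L': {'H': 0.8, 'T': 0.2}}

def viterbi(obs):
    """Most likely hidden state path (Pmax recurrence) with backtracking."""
    V = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda k: V[-1][k] * TRANS[k][j])
            col[j] = V[-1][best] * TRANS[best][j] * EMIT[j][x]
            ptr[j] = best
        V.append(col)
        back.append(ptr)
    # backtrack from the best final state
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return ''.join(reversed(path))

def forward(obs):
    """Pr(S|M): sum over all paths (Psum recurrence)."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        f = {j: sum(f[k] * TRANS[k][j] for k in STATES) * EMIT[j][x] for j in STATES}
    return sum(f.values())

obs = "HHTHHHTHHHHTHHTHHHHHTTTHTH"
print(viterbi(obs))   # predicted hidden states
print(forward(obs))   # total probability of the observations
```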

  16. Profile HMM membership • We can use the Viterbi/sum algorithm to compute the probability that the sequence belongs to the family. • Backtracking can be used to recover the path, which allows us to give an alignment: sequence A L I L, path M1 M2 I2 M3

  17. Summary • To search for remote homologs (family members) of a protein sequence, we need to have alignments of “disparate protein sequences”. • HMMs allow us to model position-specific gap penalties, and allow for automated training to get a good alignment. • Patterns/Profiles/HMMs allow us to represent families and focus on key residues. • Each has its advantages and disadvantages, and each needs special algorithms to query efficiently.

  18. Protein Domain databases • A number of databases capture proteins (domains) using various representations (e.g., HMMs, 3D structure) • Each domain is also associated with structure/function information, parsed from the literature. • Each database has specific query mechanisms that allow us to compare our sequences against them, and assign function.

  19. Sequence databases • We talked about DNA and protein sequence databases. • Next, we will talk about the connection between DNA and proteins • Proteins are the cellular machinery • DNA is the inherited material, and must contain the code for generating proteins • The solution of the genetic code was one of the great intellectual achievements of the last century.

  20. Gene Finding • What is a Gene?

  21. Gene • We define a gene as a location on the genome that codes for proteins. • The genic information is used to manufacture proteins through transcription and translation. • There is a unique mapping from triplets to amino acids.

  22. Eukaryotic gene structure

  23. Translation • The ribosomal machinery reads mRNA. • Each triplet is translated into a unique amino acid until the STOP codon is encountered. • There is also a special signal where translation starts, usually at the ATG (M) codon.

  24. Translation (continued) • The ribosomal machinery reads mRNA. • Each triplet is translated into a unique amino acid until the STOP codon is encountered. • There is also a special signal where translation starts, usually at the ATG (M) codon. • Given a DNA sequence, how many ways can you translate it?
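As a rough sketch of the answer (six reading frames: three offsets on each strand), a hypothetical helper that enumerates them; the codon table used is the standard one:

```python
# Standard codon table packed as a 64-character string (codons in TCAG order).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def revcomp(dna):
    """Reverse complement of an ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate one reading frame; '*' marks STOP codons."""
    return ''.join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """All six translations: frames 0, 1, 2 on the forward and reverse strands."""
    return [translate(strand[offset:])
            for strand in (dna, revcomp(dna))
            for offset in (0, 1, 2)]

for frame in six_frames("ATGGCCATTGTAATGGGCCGCTGA"):
    print(frame)
```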

  25. Gene Features • (Figure: gene structure showing transcription start, 5’ UTR, translation start (ATG), exons, introns, donor and acceptor splice sites, and 3’ UTR)

  26. End of L8

  27. Gene identification • Eukaryotic gene definitions: • the location that codes for a protein • the transcript sequence(s) that encode the protein • the protein sequence(s) • Suppose you want to know all of the genes in an organism. • This was a major problem in the 70s: PhDs and careers were spent isolating a single gene sequence. • All of that changed with better reagents and the development of high-throughput methods like EST sequencing.

  28. Computational Gene Finding • (Figure: gene structure with transcription start, 5’ UTR, translation start (ATG), exons, introns, donor/acceptor splice sites, 3’ UTR) • Given genomic DNA, identify all the coordinates of the gene • TRIVIA QUIZ! What is the name of the FIRST gene finding program? (google testcode)

  29. Gene Finding: The 1st generation • Given genomic DNA, does it contain a gene (or not)? • Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions. • Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

  30. Coding versus non-coding • You are given a collection of exons and a collection of intergenic sequence. • Count the number of occurrences of the hexamer ATGATG in each. • Suppose 1% of the hexamers in the exons are ATGATG, • but only 0.01% of the hexamers in the intergenic sequence are ATGATG. • How can you use this idea to find genes?
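A minimal sketch of the underlying counting step, assuming the sequences are plain strings (the function name and toy data are illustrative):

```python
from collections import Counter

def hexamer_freqs(seqs):
    """Frequency of every overlapping hexamer in a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()}

exon_freqs = hexamer_freqs(["ATGATGGCCATTGTAATG", "CCGATGATGATGGTTAAC"])        # toy data
intergenic_freqs = hexamer_freqs(["TTTTATATATAAATTTTCG", "ATAAATTTGCGCGAATATT"])  # toy data
print(exon_freqs.get("ATGATG", 0.0), intergenic_freqs.get("ATGATG", 0.0))
```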

  31. Generalizing • (Table: frequencies (x 10^-5) of hexamers AAAAAA, AAAAAC, AAAAAG, AAAAAT, … in intergenic sequence (I), exons (E), and a query sequence (X)) • Compute a frequency count for all hexamers. • Exons, intergenic sequence, and the sequence X are all vectors in a multi-dimensional space. • Use this to decide whether a sequence X is exonic or intergenic.

  32. A geometric approach (2 hexamers) • Plot the following vectors • E = [10, 20] • I = [10, 5] • V3 = [6, 10] • V4 = [9, 15] • Is V3 more like E or more like I? • (Figure: the four vectors plotted in the plane)

  33. Choosing between Intergenic and Exonic • Normalize: V’ = V/||V|| • All vectors then have the same length (they lie on the unit circle) • Next, compute the angle of X to E and to I. • Choose the class that is ‘closer’ (smaller angle).
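A minimal sketch of that decision rule using the vectors from the previous slide, comparing the cosine of the angle instead of the angle itself (which gives the same ordering):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two frequency vectors (larger = closer)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

E, I = [10, 20], [10, 5]          # exon and intergenic hexamer-frequency vectors
for name, V in [("V3", [6, 10]), ("V4", [9, 15])]:
    label = "exonic" if cosine(V, E) > cosine(V, I) else "intergenic"
    print(name, label)
```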

  34. Coding versus non-coding signals • Fickett and Tung (1992) compared various measures • Measures that preserve the triplet frame are the most successful. • Genscan uses a 5th order Markov Model

  35. 5th order Markov chain • PrEXON[AAAAAACGAGAC…] = T[AAAAA,A] T[AAAAA,C] T[AAAAC,G] T[AAACG,A] … = (20/78) (50/78) … • (Figure: transition diagram from state AAAAA to states AAAAA, AAAAC, AAAAG on emitting A, C, G, with a table of hexamer counts, e.g. AAAAAA 20, AAAAAC 50, AAAAAG 5, AAAAAT 3 in exons, and a second column of counts for the other class)

  36. Scoring for coding regions • The coding differential can be computed as the log odds of the probability that a sequence is an exon vs. an intron. • In Genscan, separate transition matrices are trained for each frame, as different frames have different hexamer distributions.
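A minimal sketch of that log-odds score, assuming the 5th-order transition probabilities have already been trained and are supplied as dictionaries keyed by (previous pentamer, next base); the frame-specific matrices used by Genscan are omitted here:

```python
import math

def score_sequence(seq, exon_trans, noncoding_trans, pseudo=1e-6):
    """Log-odds coding score under two 5th-order Markov chains.

    exon_trans / noncoding_trans map (previous 5 bases, next base) -> probability.
    A positive score favors "coding", a negative score favors "non-coding".
    """
    score = 0.0
    for i in range(5, len(seq)):
        context, base = seq[i - 5:i], seq[i]
        p_exon = exon_trans.get((context, base), pseudo)
        p_non = noncoding_trans.get((context, base), pseudo)
        score += math.log(p_exon / p_non)
    return score
```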

  37. Coding differential for 380 genes

  38. Coding regions can be detected • Plot the coding score using a sliding window of fixed length. • The (large) exons will show up reliably. • This is not enough to predict gene boundaries reliably. • (Figure: coding score along the sequence)
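A sketch of the sliding-window computation behind such a plot; the scorer is passed in as a parameter (for example, the log-odds score_sequence sketch above), and the window and step sizes are arbitrary choices:

```python
def window_scores(seq, score_fn, window=120, step=10):
    """Coding score of each fixed-length window along the sequence.

    Returns (start position, score) pairs suitable for plotting.
    """
    return [(start, score_fn(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```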

  39. Other Signals • Signals at exon boundaries (ATG, AG/GT splice sites) are precise but not specific. Coding signals are specific but not precise. • When combined, they can be effective.

  40. Combining Signals • We can compute the following: • E-score[i,j] (exon) • I-score[i,j] (intron/intergenic) • D-score[i] (donor site) • A-score[i] (acceptor site) • The goal is to find coordinates that maximize the total score.

  41. The second generation of gene finding • Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure. • It was not easy to train on many parameters. The Burset &amp; Guigó evaluation revealed that accuracy was still very low. • There were also problems with multiple genes in a single genomic region.

  42. Combining signals using D.P. • An HMM is the best way to model and optimize the combination of signals • Here, we will use a simpler approach which is essentially the same as the Viterbi algorithm for HMMs, but without the formalism.

  43. Hidden states &amp; gene structure • (Figure: an example labeling IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEEIIIII with exon boundaries at positions i1, i2, i3, i4) • Identifying a gene is equivalent to labeling each nucleotide as E/I/intergenic, etc. • These ‘labels’ are the hidden states • For simplicity, consider only two states, E and I

  44. Gene finding reformulated • (Figure: the same labeling with exon boundaries i1, i2, i3, i4) • Given a labeling L, we can score it as • I-score[0..i1] + E-score[i1..i2] + D-score[i2+1] + I-score[i2+1..i3-1] + A-score[i3-1] + E-score[i3..i4] + … • The goal is to compute a labeling with maximum score.

  45. Optimum labeling using D.P. (Viterbi) • Define VE(i) = Best score of a labeling of the prefix 1..i such that the i-th position is labeled E • Define VI(i) = Best score of a labeling of the prefix 1..i such that the i-th position is labeled I • Why is it enough to compute VE(i) &amp; VI(i)?
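A minimal sketch of that two-state DP, assuming per-position E/I scores and donor/acceptor boundary bonuses are already available as lists (all names and the toy numbers are illustrative; boundary cases and the multi-state version are left out):

```python
def best_labeling(e_score, i_score, d_bonus, a_bonus):
    """Two-state DP: VE[i] / VI[i] = best score of a labeling of positions 0..i
    ending in E or I. Switching I->E pays an acceptor bonus, E->I a donor bonus."""
    n = len(e_score)
    VE, VI = [0.0] * n, [0.0] * n
    VE[0], VI[0] = e_score[0], i_score[0]
    for i in range(1, n):
        VE[i] = e_score[i] + max(VE[i - 1], VI[i - 1] + a_bonus[i])
        VI[i] = i_score[i] + max(VI[i - 1], VE[i - 1] + d_bonus[i])
    return max(VE[-1], VI[-1])

# Toy example: 6 positions where the middle looks exonic.
print(best_labeling(e_score=[-1, 2, 3, 3, 2, -1],
                    i_score=[1, -1, -2, -2, -1, 1],
                    d_bonus=[0, -3, -3, -3, -3, 0.5],
                    a_bonus=[0, 0.5, -3, -3, -3, -3]))
```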

  46. Optimum parse of the gene • (Figure: recurrence diagram over positions j and i)

  47. Generalizing • Note that we deal with two states, and consider all paths that move between the two states. • (Figure: the two states E and I at each position i)

  48. Generalizing • We did not deal with the boundary cases in the recurrence. • Instead of labeling with two states, we can label with multiple states: • Einit, Emid, Efin, • I, IG (intergenic) • (Figure: state diagram over Einit, Emid, Efin, I, IG; not all links are shown)

  49. An HMM for Gene structure

  50. Generalized HMMs, and other refinements • A probabilistic model for each of the states (ex: exon, splice site) needs to be described • In standard HMMs, the duration of time spent in a state follows a geometric (exponentially decaying) distribution. • This is violated by many states of the gene-structure HMM. The solution is to model these durations explicitly using generalized HMMs.
