
CSE182-L8



  1. CSE182-L8: HMMs, Gene Finding

  2. Midterm 1 • In class next Tuesday • Syllabus: L1-L8 • Please review HW, questions, and other notes • Bring one sheet of written formulae • Most of the questions will test concepts, so hopefully, you don’t have to remember too much. • Final will be a take-home exam • We will discuss class projects next time.

  3. QUIZ! • Question: • Your ‘friend’ likes to gamble. • He tosses a coin: TAILS, he gives you a dollar; HEADS, you give him a dollar. • Usually he uses a fair coin, but ‘once in a while’ he uses a loaded coin. • Can you say what fraction of the time he uses the loaded coin? • Suppose you know that for the loaded coin Pr(H)=0.8, and that he switches coins with 30% probability.

  4. Modeling the coin problem as an HMM • Two hidden states: F (fair, Pr(H)=0.5) and L (loaded, Pr(H)=0.8) • Transitions: Pr(F->F)=0.7, Pr(F->L)=0.3, Pr(L->L)=0.7, Pr(L->F)=0.3 • Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH • Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF • Q: Given the HMM, can you predict the hidden states?

  5. HMM? • It is an automaton with different ‘states’ (Ex: L, F) • We start from a specific start state • Each state of the automaton generates a symbol from an alphabet ({‘H’,’T’}), according to a distribution. • After generating, we move (transition) to a new state. • While the output (the generated symbols) is visible, the states themselves are hidden. • Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH • Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF
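A minimal sketch of how such an HMM generates a visible toss sequence while the state sequence stays hidden, assuming the transition and emission numbers from the slide and (an added assumption) a start in the fair state:

```python
import random

# Two-state coin HMM from the slides: F = fair (Pr(H)=0.5), L = loaded (Pr(H)=0.8),
# with a 0.3 probability of switching states after each toss.
TRANS = {'F': {'F': 0.7, 'L': 0.3}, 'L': {'L': 0.7, 'F': 0.3}}
EMIT = {'F': {'H': 0.5, 'T': 0.5}, 'L': {'H': 0.8, 'T': 0.2}}

def generate(n, start='F'):
    """Sample n tosses; return (hidden state string, observed symbol string)."""
    state, hidden, observed = start, [], []
    for _ in range(n):
        hidden.append(state)
        observed.append('H' if random.random() < EMIT[state]['H'] else 'T')
        # transition to the next hidden state
        state = 'F' if random.random() < TRANS[state]['F'] else 'L'
    return ''.join(hidden), ''.join(observed)

hidden, observed = generate(26)
print(observed)  # visible, e.g. HHTHHHTHH...
print(hidden)    # hidden,  e.g. LLFFLLLLL...
```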

  6. Profile HMMs • Many natural problems can be expressed as an HMM. Ex: protein sequence profiles • Recall that a profile is created by looking at position-specific distributions • (Figure: a profile column with residue frequencies such as 0.71, 0.14, ...)

  7. The generative model • Think of each column in the alignment as generating a distribution. • For each column, build a node that outputs a residue with the appropriate distribution • (Figure: a column node with, e.g., Pr[F]=0.71, Pr[Y]=0.14)

  8. A simple Profile HMM • Connect the nodes for each column into a chain. This chain generates sequences according to the profile distribution. • What is the probability of generating FKVVGQVILD? • In this representation, Prob[new sequence S belongs to the family] = Prob[HMM generates sequence S] • How does this differ from profiles?
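A minimal sketch of that computation for a chain of match states, assuming the profile is given as one emission dictionary per column (the columns and probabilities below are illustrative, not the actual profile from the slides):

```python
import math

# Hypothetical profile: one emission distribution per match column (values illustrative).
profile = [
    {'F': 0.71, 'Y': 0.14, 'W': 0.15},
    {'K': 0.60, 'R': 0.40},
    {'V': 0.50, 'I': 0.30, 'L': 0.20},
]

def log_prob(seq, profile, pseudo=1e-6):
    """Log-probability that the chain of match states emits seq (one residue per column)."""
    assert len(seq) == len(profile)
    return sum(math.log(col.get(res, pseudo)) for res, col in zip(seq, profile))

print(log_prob("FKV", profile))
```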

  9. Profile HMMs can handle gaps • The match states are the same as on the previous page. • Insertion and deletion states help introduce gaps. • A sequence may be generated using different paths.

  10. Example • The HMM M models a family of 3 aligned sequences: AL-L, AIVL, AI-L. • Question: Is sequence S=ALIL part of the family? • Solution: Compute Pr(S|M), the probability that the HMM M generated S. • If Pr(S|M) is ‘High’, S belongs to the family. If it is ‘Low’, S does not belong. • How can we compute Pr(S=ALIL|M)?

  11. Example (continued; alignment AL-L, AIVL, AI-L) • Probability Pr[ALIL|M]? • Note that multiple paths can generate this sequence, e.g. • P1=M1I1M2M3 • P2=M1M2I2M3 • We ask a simpler question first: suppose a specific path is given. • Compute the joint probability Pr(S,P|M)

  12. Joint probability computation • Joint probability of seeing a sequence S and path P • Pr[S,P|M] = Pr[S|P,M] Pr[P|M] • Pr[ALIL AND M1I1M2M3|M] = Pr[ALIL | M1I1M2M3, M] Pr[M1I1M2M3|M] = ?

  13. Formally • The emitted sequence is S = S1S2…Sm • The path traversed is P1P2P3… • ej(s) = emission probability of symbol s in state j • Transition probability T[Pj,Pk]: probability of transitioning from state Pj to state Pk • Pr(P,S|M) = eP1(S1) T[P1,P2] eP2(S2) … • Let P[k,n] = Pr(P1…Pk, S1…Sn|M) • Then P[k,n] = P[k-1,n-1] T[Pk-1,Pk] ePk(Sn) • What is Pr(S|M)?
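A minimal sketch of that product for a fixed path, assuming the model is given as plain transition and emission dictionaries (the states mirror the paths on the previous slide, but the probability values are illustrative):

```python
# Joint probability Pr(P, S | M) for a fixed path: multiply, in order,
# each transition probability and each emission probability along the path.
trans = {('start', 'M1'): 1.0, ('M1', 'I1'): 0.2, ('I1', 'M2'): 0.8,
         ('M2', 'M3'): 0.9}                                   # illustrative numbers
emit = {'M1': {'A': 0.6}, 'I1': {'L': 0.05}, 'M2': {'I': 0.4}, 'M3': {'L': 0.7}}

def joint_prob(path, seq):
    """Pr(path, seq | M), assuming every state on the path emits one symbol."""
    p, prev = 1.0, 'start'
    for state, symbol in zip(path, seq):
        p *= trans.get((prev, state), 0.0) * emit[state].get(symbol, 0.0)
        prev = state
    return p

print(joint_prob(['M1', 'I1', 'M2', 'M3'], "ALIL"))
```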

  14. Two solutions • An unknown (hidden) path is traversed to produce (emit) the sequence S. • The probability that M emits S can be defined either as • the sum of the joint probabilities over all paths: Pr(S|M) = ∑P Pr(S,P|M) • OR as the joint probability of the most likely path: Pr(S|M) = maxP Pr(S,P|M) • Both are appropriate ways to model the problem, and they have similar algorithms to solve them.

  15. Viterbi Algorithm for HMMs • Let Pmax(i,j|M) be the probability of the most likely path that emits S1…Si and ends in state j (is it sufficient to compute this?) • Pmax(i,j|M) = maxk Pmax(i-1,k) T[k,j] ej(Si) (Viterbi) • Psum(i,j|M) = ∑k Psum(i-1,k) T[k,j] ej(Si) (forward)
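A minimal sketch of both recurrences on the coin HMM from earlier (same slide parameters; the uniform start distribution over the two states is an added assumption):

```python
STATES = ['F', 'L']
START = {'F': 0.5, 'L': 0.5}                      # assumed uniform start
TRANS = {'F': {'F': 0.7, 'L': 0.3}, 'L': {'L': 0.7, 'F': 0.3}}
EMIT = {'F': {'H': 0.5, 'T': 0.5}, 'L': {'H': 0.8, 'T': 0.2}}

def viterbi(obs):
    """Most likely hidden state path (Pmax recurrence) with backtracking."""
    V = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda k: V[-1][k] * TRANS[k][j])
            col[j] = V[-1][best] * TRANS[best][j] * EMIT[j][x]
            ptr[j] = best
        V.append(col)
        back.append(ptr)
    # backtrack from the best final state
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return ''.join(reversed(path))

def forward(obs):
    """Pr(S|M): sum over all paths (Psum recurrence)."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        f = {j: sum(f[k] * TRANS[k][j] for k in STATES) * EMIT[j][x] for j in STATES}
    return sum(f.values())

obs = "HHTHHHTHHHHTHHTHHHHHTTTHTH"
print(viterbi(obs))   # predicted hidden states
print(forward(obs))   # total probability of the observations
```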

  16. Profile HMM membership • We can use the Viterbi/sum algorithm to compute the probability that the sequence belongs to the family. • Backtracking can be used to recover the path, which allows us to give an alignment: sequence A L I L, path M1 M2 I2 M3

  17. Summary • To search for remote homologs (family members) of a protein sequence, we need to have alignments of “disparate protein sequences”. • HMMs allow us to model position-specific gap penalties, and allow for automated training to get a good alignment. • Patterns/Profiles/HMMs allow us to represent families and focus on key residues. • Each has its advantages and disadvantages, and each needs special algorithms to query efficiently.

  18. Protein Domain databases • A number of databases capture proteins (domains) using various representations (e.g., HMMs, 3D structure) • Each domain is also associated with structure/function information, parsed from the literature. • Each database has specific query mechanisms that allow us to compare our sequences against them, and assign function.

  19. Sequence databases • We talked about DNA and protein sequence databases. • Next, we will talk about the connection between DNA and proteins • Proteins are the cellular machinery • DNA is the inherited material, and must contain the code for generating proteins • The solution of the genetic code was one of the great intellectual achievements of the last century.

  20. Gene Finding • What is a Gene?

  21. Gene • We define a gene as a location on the genome that codes for proteins. • The genic information is used to manufacture proteins through transcription and translation. • There is a unique mapping from triplets to amino acids.

  22. Eukaryotic gene structure

  23. Translation • The ribosomal machinery reads mRNA. • Each triplet is translated into a unique amino acid until the STOP codon is encountered. • There is also a special signal where translation starts, usually at the ATG (M) codon.

  24. Translation (continued) • The ribosomal machinery reads mRNA. • Each triplet is translated into a unique amino acid until the STOP codon is encountered. • There is also a special signal where translation starts, usually at the ATG (M) codon. • Given a DNA sequence, how many ways can you translate it?
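As a rough sketch of the answer (six reading frames: three offsets on each strand), a hypothetical helper that enumerates them; the codon table used is the standard one:

```python
# Standard codon table packed as a 64-character string (codons in TCAG order).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def revcomp(dna):
    """Reverse complement of an ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate one reading frame; '*' marks STOP codons."""
    return ''.join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """All six translations: frames 0, 1, 2 on the forward and reverse strands."""
    return [translate(strand[offset:])
            for strand in (dna, revcomp(dna))
            for offset in (0, 1, 2)]

for frame in six_frames("ATGGCCATTGTAATGGGCCGCTGA"):
    print(frame)
```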

  25. Gene Features • (Figure: gene structure showing transcription start, 5’ UTR, translation start (ATG), exons, introns, donor and acceptor splice sites, and 3’ UTR)

  26. End of L8

  27. Gene identification • Eukaryotic gene definitions: • the location that codes for a protein • the transcript sequence(s) that encode the protein • the protein sequence(s) • Suppose you want to know all of the genes in an organism. • This was a major problem in the 70s: PhDs and careers were spent isolating a single gene sequence. • All of that changed with better reagents and the development of high-throughput methods like EST sequencing.

  28. Computational Gene Finding • (Figure: gene structure with transcription start, 5’ UTR, translation start (ATG), exons, introns, donor/acceptor splice sites, 3’ UTR) • Given genomic DNA, identify all the coordinates of the gene • TRIVIA QUIZ! What is the name of the FIRST gene finding program? (google testcode)

  29. Gene Finding: The 1st generation • Given genomic DNA, does it contain a gene (or not)? • Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions. • Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

  30. Coding versus non-coding • You are given a collection of exons and a collection of intergenic sequence. • Count the number of occurrences of the hexamer ATGATG in each. • Suppose 1% of the hexamers in the exons are ATGATG, • but only 0.01% of the hexamers in the intergenic sequence are ATGATG. • How can you use this idea to find genes?
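A minimal sketch of the underlying counting step, assuming the sequences are plain strings (the function name and toy data are illustrative):

```python
from collections import Counter

def hexamer_freqs(seqs):
    """Frequency of every overlapping hexamer in a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()}

exon_freqs = hexamer_freqs(["ATGATGGCCATTGTAATG", "CCGATGATGATGGTTAAC"])        # toy data
intergenic_freqs = hexamer_freqs(["TTTTATATATAAATTTTCG", "ATAAATTTGCGCGAATATT"])  # toy data
print(exon_freqs.get("ATGATG", 0.0), intergenic_freqs.get("ATGATG", 0.0))
```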

  31. Generalizing • (Table: frequencies (x 10^-5) of hexamers AAAAAA, AAAAAC, AAAAAG, AAAAAT, … in intergenic sequence (I), exons (E), and a query sequence (X)) • Compute a frequency count for all hexamers. • Exons, intergenic sequence, and the sequence X are all vectors in a multi-dimensional space. • Use this to decide whether a sequence X is exonic or intergenic.

  32. A geometric approach (2 hexamers) • Plot the following vectors • E = [10, 20] • I = [10, 5] • V3 = [6, 10] • V4 = [9, 15] • Is V3 more like E or more like I? • (Figure: the four vectors plotted in the plane)

  33. Choosing between Intergenic and Exonic • Normalize: V’ = V/||V|| • All vectors then have the same length (they lie on the unit circle) • Next, compute the angle of X to E and to I. • Choose the class that is ‘closer’ (smaller angle).
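A minimal sketch of that decision rule using the vectors from the previous slide, comparing the cosine of the angle instead of the angle itself (which gives the same ordering):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two frequency vectors (larger = closer)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

E, I = [10, 20], [10, 5]          # exon and intergenic hexamer-frequency vectors
for name, V in [("V3", [6, 10]), ("V4", [9, 15])]:
    label = "exonic" if cosine(V, E) > cosine(V, I) else "intergenic"
    print(name, label)
```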

  34. Coding versus non-coding signals • Fickett and Tung (1992) compared various measures • Measures that preserve the triplet frame are the most successful. • Genscan uses a 5th order Markov Model

  35. 5th order Markov chain • PrEXON[AAAAAACGAGAC…] = T[AAAAA,A] T[AAAAA,C] T[AAAAC,G] T[AAACG,A] … = (20/78) (50/78) … • (Figure: transition diagram from state AAAAA to states AAAAA, AAAAC, AAAAG on emitting A, C, G, with a table of hexamer counts, e.g. AAAAAA 20, AAAAAC 50, AAAAAG 5, AAAAAT 3 in exons, and a second column of counts for the other class)

  36. Scoring for coding regions • The coding differential can be computed as the log odds of the probability that a sequence is an exon vs. an intron. • In Genscan, separate transition matrices are trained for each frame, as different frames have different hexamer distributions.
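A minimal sketch of that log-odds score, assuming the 5th-order transition probabilities have already been trained and are supplied as dictionaries keyed by (previous pentamer, next base); the frame-specific matrices used by Genscan are omitted here:

```python
import math

def score_sequence(seq, exon_trans, noncoding_trans, pseudo=1e-6):
    """Log-odds coding score under two 5th-order Markov chains.

    exon_trans / noncoding_trans map (previous 5 bases, next base) -> probability.
    A positive score favors "coding", a negative score favors "non-coding".
    """
    score = 0.0
    for i in range(5, len(seq)):
        context, base = seq[i - 5:i], seq[i]
        p_exon = exon_trans.get((context, base), pseudo)
        p_non = noncoding_trans.get((context, base), pseudo)
        score += math.log(p_exon / p_non)
    return score
```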

  37. Coding differential for 380 genes

  38. Coding regions can be detected • Plot the coding score using a sliding window of fixed length. • The (large) exons will show up reliably. • This is not enough to predict gene boundaries reliably. • (Figure: coding score along the sequence)
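A sketch of the sliding-window computation behind such a plot; the scorer is passed in as a parameter (for example, the log-odds score_sequence sketch above), and the window and step sizes are arbitrary choices:

```python
def window_scores(seq, score_fn, window=120, step=10):
    """Coding score of each fixed-length window along the sequence.

    Returns (start position, score) pairs suitable for plotting.
    """
    return [(start, score_fn(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```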

  39. Other Signals • Signals at exon boundaries (ATG, AG/GT splice sites) are precise but not specific. Coding signals are specific but not precise. • When combined, they can be effective.

  40. Combining Signals • We can compute the following: • E-score[i,j] (exon) • I-score[i,j] (intron/intergenic) • D-score[i] (donor site) • A-score[i] (acceptor site) • The goal is to find coordinates that maximize the total score.

  41. The second generation of gene finding • Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure. • It was not easy to train on many parameters. The Burset &amp; Guigó evaluation revealed that accuracy was still very low. • There were also problems with multiple genes in a single genomic region.

  42. Combining signals using D.P. • An HMM is the best way to model and optimize the combination of signals • Here, we will use a simpler approach which is essentially the same as the Viterbi algorithm for HMMs, but without the formalism.

  43. Hidden states &amp; gene structure • (Figure: an example labeling IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEEIIIII with exon boundaries at positions i1, i2, i3, i4) • Identifying a gene is equivalent to labeling each nucleotide as E/I/intergenic, etc. • These ‘labels’ are the hidden states • For simplicity, consider only two states, E and I

  44. Gene finding reformulated • (Figure: the same labeling with exon boundaries i1, i2, i3, i4) • Given a labeling L, we can score it as • I-score[0..i1] + E-score[i1..i2] + D-score[i2+1] + I-score[i2+1..i3-1] + A-score[i3-1] + E-score[i3..i4] + … • The goal is to compute a labeling with maximum score.

  45. Optimum labeling using D.P. (Viterbi) • Define VE(i) = Best score of a labeling of the prefix 1..i such that the i-th position is labeled E • Define VI(i) = Best score of a labeling of the prefix 1..i such that the i-th position is labeled I • Why is it enough to compute VE(i) &amp; VI(i)?
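A minimal sketch of that two-state DP, assuming per-position E/I scores and donor/acceptor boundary bonuses are already available as lists (all names and the toy numbers are illustrative; boundary cases and the multi-state version are left out):

```python
def best_labeling(e_score, i_score, d_bonus, a_bonus):
    """Two-state DP: VE[i] / VI[i] = best score of a labeling of positions 0..i
    ending in E or I. Switching I->E pays an acceptor bonus, E->I a donor bonus."""
    n = len(e_score)
    VE, VI = [0.0] * n, [0.0] * n
    VE[0], VI[0] = e_score[0], i_score[0]
    for i in range(1, n):
        VE[i] = e_score[i] + max(VE[i - 1], VI[i - 1] + a_bonus[i])
        VI[i] = i_score[i] + max(VI[i - 1], VE[i - 1] + d_bonus[i])
    return max(VE[-1], VI[-1])

# Toy example: 6 positions where the middle looks exonic.
print(best_labeling(e_score=[-1, 2, 3, 3, 2, -1],
                    i_score=[1, -1, -2, -2, -1, 1],
                    d_bonus=[0, -3, -3, -3, -3, 0.5],
                    a_bonus=[0, 0.5, -3, -3, -3, -3]))
```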

  46. Optimum parse of the gene • (Figure: recurrence diagram over positions j and i)

  47. Generalizing • Note that we deal with two states, and consider all paths that move between the two states. • (Figure: the two states E and I at each position i)

  48. Generalizing • We did not deal with the boundary cases in the recurrence. • Instead of labeling with two states, we can label with multiple states: • Einit, Emid, Efin, • I, IG (intergenic) • (Figure: state diagram over Einit, Emid, Efin, I, IG; not all links are shown)

  49. An HMM for Gene structure

  50. Generalized HMMs, and other refinements • A probabilistic model for each of the states (ex: exon, splice site) needs to be described • In standard HMMs, the duration of time spent in a state follows a geometric (exponentially decaying) distribution. • This is violated by many states of the gene-structure HMM. The solution is to model these durations explicitly using generalized HMMs.
