Gene Prediction

Doug Raiford, Lesson 3: Gene Prediction. What's the problem? We have a fully sequenced genome; how do we identify the genes? What do we know so far? Look for start and stop codons. Remember: the start codon codes for methionine, and stop codons do not code for an amino acid.

Presentation Transcript


  1. Doug Raiford • Lesson 3 • Gene Prediction

  2. What's the problem? • We have a fully sequenced genome • How do we identify the genes? • What do we know so far?

  3. Look for start and stop codons • Remember: the start codon (ATG) codes for methionine • Stop codons (TAG, TAA, or TGA) do not code for an amino acid • Does every ATG mark the beginning of a gene? • Does every TAG, TAA, or TGA mark the end?

  4. In frame • The start and stop codons must be “in frame” • A whole number of codons must fit between them • The length is evenly divisible by three • Open reading frame (ORF): a series of codons bracketed by in-frame start and stop codons
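
The in-frame rule above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own method: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the forward strand is scanned, and each ORF is reported from the first ATG seen in a frame to the next in-frame stop.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs of ORFs on the given strand.

    start is the index of the A in ATG; end is one past the stop codon,
    so dna[start:end] is the whole ORF. min_codons (a hypothetical knob)
    is the minimum number of codons from the start codon up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):  # step one codon at a time
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                       # open an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATTTTAGGATGCCCTGA"
    for start, end in sorted(find_orfs(example)):
        print(start, end, example[start:end])
```

Because the stop codon is only accepted when it sits a multiple of three positions after the ATG, every reported ORF has a length evenly divisible by three. A fuller scan would repeat the search on the reverse complement.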

  5. Gene length • In real genes, the distance between start and stop codons tends to be longer than expected by chance • How long would we expect that distance to be?

  6. Randomly drawn nucleotides • There are 64 different codons • If nucleotides are drawn at random, a given codon should show up about once every 64 codons, or 192 nt (64 × 3) • There are 3 stop codons • So expect a stop codon 3 times in every 64 codons, or once every 21 1/3 codons (21 1/3 × 3 = 64 nt)
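
The arithmetic on this slide can be checked directly. The sketch below assumes, as the slide does, that all 64 codons are equally likely and independent; the helper name `prob_no_stop` is hypothetical. It also shows how quickly the chance of a long stop-free run falls off, which is why unusually long ORFs are suspicious.

```python
from fractions import Fraction

# Assumption from the slide: 3 of the 64 equally likely codons are stops.
p_stop = Fraction(3, 64)

# Mean number of codons between stops is 1 / p (geometric distribution).
expected_codons = 1 / p_stop          # 64/3, i.e. 21 1/3 codons
expected_nt = expected_codons * 3     # 64 nucleotides


def prob_no_stop(n_codons):
    """Probability that n_codons random codons in a row contain no stop."""
    return float((1 - p_stop) ** n_codons)


if __name__ == "__main__":
    print("expected codons between stops:", expected_codons)
    print("expected nucleotides:", expected_nt)
    for n in (21, 43, 100, 300):
        print(n, "codons with no stop:", prob_no_stop(n))
```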

  7. How far beyond 64? • E. coli (Escherichia coli) has 4356 genes • Shortest is 44 nt, longest is 8621 nt • 8 are shorter than 64 nt • 143 are shorter than 128 nt (3%) • Length is a good start, but there must be more to it • Approximately 77,000 ORFs on each strand are longer than 2× the expected length

  8. Parts of a gene • To “find” a gene, look for nucleotide sequences that look like the parts of a gene • Start codon: ATG (methionine) • Stop codons (non-coding): TAA, TAG, or TGA • [Figure: RNA polymerase on a gene, showing the promoter region, coding region, and terminator region]

  9. Upstream region • Attracts polymerase • Contains specific sequences • Involved in gene regulation • Each promoter has a unique pattern • Motifs: TTGACA for the -35 sequence, TATAAT for the -10 sequence • [Figure: upstream region, showing the polymerase binding site (-35 and -10), transcription start site, ribosomal binding site, start codon, and coding region]

  10. Genes of a feather… • Slightly different -35 and -10 motifs attract different sigma factors • Genes with similar upstream regions tend to be related: they are expressed similarly

  11. Termination region (downstream) • Hairpin • Followed by a U-run in the mRNA (an A-run in the template DNA)

  12. Termination • Weak uracil bonds, coupled with the hairpin binding to the NusA protein bound to the polymerase • [Figure: polymerase with the mRNA U-run (UUUUUUU) paired to the DNA A-run (AAAAAAAA)]
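
One rough way to look for the "hairpin followed by a U-run" pattern is to search the coding strand for an inverted repeat (a stem whose reverse complement appears a few bases later) followed by a run of T's, since a U-run in the mRNA reads as a T-run on the coding strand. The sketch below is only an illustration: the function names and all thresholds (`stem`, `min_loop`, `max_loop`, `t_run`) are hypothetical, it demands a perfect stem, and it assumes the input contains only A, C, G, and T.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Reverse complement of a DNA string made of A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def find_terminators(dna, stem=5, min_loop=3, max_loop=8, t_run=4):
    """Return (stem_start, loop_length) for each hairpin + T-run found.

    A hit is: stem, then a loop of min_loop..max_loop bases, then the
    reverse complement of the stem, then at least t_run T's.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2 * stem - min_loop - t_run + 1):
        left = dna[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop                  # start of the right arm
            right = dna[j:j + stem]
            tail = dna[j + stem:j + stem + t_run]
            if right == reverse_complement(left) and tail == "T" * t_run:
                hits.append((i, loop))
    return hits


if __name__ == "__main__":
    # Hand-made example: stem GCCGC, loop TTAA, stem GCGGC, then T's.
    example = "CC" + "GCCGC" + "TTAA" + "GCGGC" + "TTTTTT" + "GA"
    print(find_terminators(example))
```

Real terminators tolerate mismatches in the stem, so a practical search would score the stem rather than require an exact inverted repeat.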

  13. Motifs • How do we find them? • Difficult: they are fuzzy, not carved in stone • Motifs: TTGACA for the -35 sequence, TATAAT for the -10 sequence • [Figure: upstream region, showing the polymerase binding site (-35 and -10), transcription start site, ribosomal binding site, start codon, and coding region]
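
Before turning to statistical models, the simplest way to cope with a fuzzy motif is to allow a few mismatches against the consensus. The sketch below is an assumption-laden illustration rather than the lecture's method: `scan_motif` and `max_mismatches` are hypothetical names, and every position in the motif is treated as equally important, which real promoters do not do.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_motif(dna, consensus, max_mismatches=1):
    """Return (index, window, mismatches) for every near-match to consensus."""
    dna = dna.upper()
    k = len(consensus)
    hits = []
    for i in range(len(dna) - k + 1):
        window = dna[i:i + k]
        distance = hamming(window, consensus)
        if distance <= max_mismatches:
            hits.append((i, window, distance))
    return hits


if __name__ == "__main__":
    # Hand-made sequence: an exact -35 box, a 17 nt spacer, a -10 box with one change.
    example = "GG" + "TTGACA" + "G" * 17 + "TATACT" + "GG"
    print(scan_motif(example, "TTGACA", max_mismatches=0))
    print(scan_motif(example, "TATAAT", max_mismatches=1))
```

A fixed mismatch budget cannot express "this position matters more than that one", which is one motivation for the probabilistic models on the following slides.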

  14. Motifs • Hidden Markov Models are often used • All about the statistics • Markov chain: a series of events along with their probabilities • [Figure: Markov chain that begins at a start state emitting G, C, A, or T, then steps through T, A, T, A, A, T to “Yay! I found one”]
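
To make "a series of events along with probabilities" concrete, the sketch below scores a sequence under a first-order Markov chain: the probability of the first symbol times the probability of each step given the previous symbol. The two chains and all their numbers are made up for illustration (a uniform chain and a hypothetical AT-rich chain); they do not come from the slides.

```python
BASES = "ACGT"

# Hypothetical chains: each row gives P(next base | current base).
UNIFORM = {b: {n: 0.25 for n in BASES} for b in BASES}
AT_RICH = {b: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4} for b in BASES}
INITIAL = {b: 0.25 for b in BASES}   # assumed starting distribution


def chain_probability(seq, initial, transition):
    """Probability of seq: P(first) times the product of P(next | previous)."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p


if __name__ == "__main__":
    motif = "TATAAT"
    print("uniform chain:", chain_probability(motif, INITIAL, UNIFORM))
    print("AT-rich chain:", chain_probability(motif, INITIAL, AT_RICH))
```

Comparing the two scores is the basic statistical move: a sequence is assigned to whichever model makes it more probable.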

  15. Hidden Markov Models • The previous slide was a “state machine” representation • An HMM should have states and observations • The states are “hidden” • [Figure: HMM state diagram for the TATAAT motif, states 0 to 8, labeled with emission and transition probabilities]

  16. Probabilities • Each state has a probability of “emitting” any given observation • Each state has a probability of “transitioning” to any given next state • [Figure: HMM state diagram for the TATAAT motif, states 0 to 8, labeled with emission and transition probabilities]

  17. Representing a model: two matrices • Transition probability matrix (TRANS) • Rows represent the current state • Columns represent the state to which a transition will occur • Each entry is the probability associated with that transition • Emission probability matrix (EMIS) • Rows represent states • Columns represent which observation is emitted • Each entry is the probability associated with that emission
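
As a small illustration of the two-matrix representation, the sketch below writes down the six-state TATA model that appears later in the Matlab example (slide 23) as Python lists and checks that every row is a probability distribution. The state labels and the helper `is_row_stochastic` are hypothetical names added here; note that Python indexes from 0 where Matlab indexes from 1.

```python
from fractions import Fraction as F

# Hypothetical labels for the six states of the slide 23 model.
STATES = ["background", "T1", "A2", "T3", "A4", "after"]
SYMBOLS = ["A", "C", "G", "T"]

# TRANS[i][j] = probability of moving from state i to state j.
TRANS = [
    [F(16, 17), F(1, 17), 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]

# EMIS[i][k] = probability that state i emits SYMBOLS[k].
EMIS = [
    [F(1, 4)] * 4,
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [F(1, 4)] * 4,
]


def is_row_stochastic(matrix):
    """True if every row is non-negative and sums to exactly 1."""
    return all(min(row) >= 0 and sum(row) == 1 for row in matrix)


if __name__ == "__main__":
    print("TRANS rows sum to 1:", is_row_stochastic(TRANS))
    print("EMIS rows sum to 1:", is_row_stochastic(EMIS))
    print("P(background -> T1) =", TRANS[0][1])
    print("P(T1 emits T) =", EMIS[1][SYMBOLS.index("T")])
```

Using exact fractions keeps the row-sum check free of floating-point rounding; ordinary floats with a small tolerance would work as well.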

  18. General approach • Requires a subject matter expert to build a model • Often start with a state for each position in a possible match • Example: looking for something similar to TATAAT • Might not have both A’s • Might have an extra one in the first slot • Never has G’s or C’s • [Figure: HMM state diagram for the TATAAT motif, states 0 to 8, labeled with emission and transition probabilities]

  19. Model • Also need a state for non-participating regions • [Figure: HMM state diagram for the TATAAT motif, states 0 to 8, labeled with emission and transition probabilities]

  20. Model • First guess as to the probabilities • Maybe the transition from the state for the first T to A is 100% • Then 50%/50% whether A or T • Then 50%/50% whether A or T • Then 100% T • [Figure: HMM state diagram for the TATAAT motif, states 0 to 8, labeled with emission and transition probabilities]

  21. Can “train” the model • Baum-Welch or Viterbi algorithm • Pass the algorithm a sequence of observations and a first guess as to the probabilities • It refines the probability matrices • Assumes that the sequence adheres to the underlying probabilities • Traverses the states, keeping track of the actual frequency of emissions and transitions • Adjusts the matrices accordingly
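
The "keep track of frequencies, then adjust the matrices" step can be sketched as plain counting. The code below shows only that re-estimation step, assuming a state path is already in hand (for example, the most probable path under the current guess, as in Viterbi training); Baum-Welch replaces these hard counts with expected counts over all paths. The function name `reestimate` and the `pseudocount` parameter are hypothetical.

```python
def reestimate(obs, path, n_states, n_symbols, pseudocount=0.0):
    """Re-estimate (TRANS, EMIS) from one observation sequence and state path.

    obs and path are equal-length lists of integer indices. Rows that were
    never visited are left as all zeros unless a pseudocount is supplied.
    """
    trans_counts = [[pseudocount] * n_states for _ in range(n_states)]
    emis_counts = [[pseudocount] * n_symbols for _ in range(n_states)]

    for t, (state, symbol) in enumerate(zip(path, obs)):
        emis_counts[state][symbol] += 1          # state emitted symbol
        if t + 1 < len(path):
            trans_counts[state][path[t + 1]] += 1  # state moved to next state

    def normalize(rows):
        result = []
        for row in rows:
            total = sum(row)
            result.append([c / total if total else 0.0 for c in row])
        return result

    return normalize(trans_counts), normalize(emis_counts)


if __name__ == "__main__":
    # Toy example with 2 states and 2 symbols.
    obs = [0, 0, 1, 1, 0]
    path = [0, 0, 1, 1, 0]
    trans, emis = reestimate(obs, path, n_states=2, n_symbols=2)
    print("TRANS:", trans)
    print("EMIS:", emis)
```

Full training alternates between finding paths under the current matrices and re-estimating the matrices from them until the estimates stop changing.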

  22. Using the model • Called checking the posterior probabilities • Given a sequence, check all possible paths through the model • Multiply the associated probabilities along each path • The path with the highest probability is likely the path through the hidden states • Dynamic programming (the “forward algorithm” and the closely related Viterbi algorithm) avoids enumerating every path • A location in the sequence where the most probable states are “TATAAT” is a match • [Figure: HMM for the TATA motif, states 0 to 5; the two background states emit A, C, G, T with probability .25 each; transitions out of the first background state are 16/17 (stay) and 1/17 (enter the motif), and all others are 1]
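
The saving from dynamic programming can be seen by computing the same quantity two ways. The sketch below uses the six-state TATA model from the Matlab example (slide 23) and computes the total probability of a short sequence first by enumerating every path and multiplying probabilities, then with the forward algorithm. It assumes the model starts in the first background state with probability 1; the function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

SYMBOLS = "ACGT"
TRANS = [
    [F(16, 17), F(1, 17), 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
EMIS = [[F(1, 4)] * 4, [0, 0, 0, 1], [1, 0, 0, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [F(1, 4)] * 4]
START = [1, 0, 0, 0, 0, 0]   # assumption: begin in the background state


def encode(seq):
    """Map a DNA string to 0-based symbol indices (A=0, C=1, G=2, T=3)."""
    return [SYMBOLS.index(base) for base in seq.upper()]


def brute_force(obs, trans, emis, start):
    """Sum the probability of every possible state path (exponential time)."""
    n = len(trans)
    total = 0
    for path in product(range(n), repeat=len(obs)):
        p = start[path[0]] * emis[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emis[path[t]][obs[t]]
        total += p
    return total


def forward(obs, trans, emis, start):
    """Same total, computed one position at a time (time linear in len(obs))."""
    n = len(trans)
    alpha = [start[s] * emis[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emis[s][symbol]
                 for s in range(n)]
    return sum(alpha)


if __name__ == "__main__":
    obs = encode("GTATAG")
    print("all paths:", brute_force(obs, TRANS, EMIS, START))
    print("forward:  ", forward(obs, TRANS, EMIS, START))
```

The brute-force version visits 6^L paths for a sequence of length L, while the forward recursion does a fixed amount of work per position. Replacing the sum over previous states with a maximum turns the same recursion into the Viterbi search for the single best path.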

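  23. Example using Matlab • Matlab is very useful for matrix operations

      % Observed sequence, then the same sequence encoded as a=1, c=2, g=3, t=4
      seq = ['a','g','c','g','a','t','a','c','g','c','g','a','t','c','g','a','t','a','t','a','g','t','g','c'];
      seq = [1,3,2,3,1,4,1,2,3,2,3,1,4,2,3,1,4,1,4,1,3,4,3,2];
      % Emission matrix: one row per state, columns are A, C, G, T
      EMIS = [.25,.25,.25,.25;
              0,0,0,1;
              1,0,0,0;
              0,0,0,1;
              1,0,0,0;
              .25,.25,.25,.25];
      % Transition matrix: row = current state, column = next state
      TRANS = [16/17,1/17,0,0,0,0;
               0,0,1,0,0,0;
               0,0,0,1,0,0;
               0,0,0,0,1,0;
               0,0,0,0,0,1;
               0,0,0,0,0,1];

      [Figure: HMM for the TATA motif, states 0 to 5; the two background states emit A, C, G, T with probability .25 each; transitions out of the first background state are 16/17 (stay) and 1/17 (enter the motif), and all others are 1]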
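
To show what can be done with these matrices, here is a hedged Python sketch of the Viterbi algorithm run on the slide's sequence and model. It is not the slide's own code: the function names are hypothetical, indices are 0-based (so Matlab state 1 is state 0 here), and the model is assumed to start in the first background state.

```python
from fractions import Fraction as F

SYMBOLS = "ACGT"
TRANS = [
    [F(16, 17), F(1, 17), 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
EMIS = [[F(1, 4)] * 4, [0, 0, 0, 1], [1, 0, 0, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [F(1, 4)] * 4]
START = [1, 0, 0, 0, 0, 0]   # assumption: begin in the background state


def encode(seq):
    """Map a DNA string to 0-based symbol indices (A=0, C=1, G=2, T=3)."""
    return [SYMBOLS.index(base) for base in seq.upper()]


def viterbi(obs, trans, emis, start):
    """Return the most probable state path for the observations."""
    n = len(trans)
    score = [start[s] * emis[s][obs[0]] for s in range(n)]
    back = []                                  # back-pointers per position
    for symbol in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            # Best previous state from which to arrive in state s.
            best = max(range(n), key=lambda r: prev[r] * trans[r][s])
            pointers.append(best)
            score.append(prev[best] * trans[best][s] * emis[s][symbol])
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):            # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    seq = "agcgatacgcgatcgatatagtgc"           # the sequence from the slide
    path = viterbi(encode(seq), TRANS, EMIS, START)
    print(path)
    if 1 in path:
        print("motif starts at 1-based position", path.index(1) + 1)
```

Exact fractions are used so that zero probabilities need no special handling; for long sequences, log probabilities with floats are the usual choice. In Matlab itself, the Statistics and Machine Learning Toolbox function `hmmviterbi(seq, TRANS, EMIS)` performs the same decoding.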

  24. Sites • GeneMark (Georgia Institute of Technology): http://exon.biology.gatech.edu/ • GENSCAN: http://genes.mit.edu/GENSCAN.html • Genie (Berkeley): http://www.fruitfly.org/seq_tools/genie.html • Glimmer (University of Maryland): http://www.cbcb.umd.edu/software/GlimmerHMM/

  25. Elaborate models • Can include all regions in the model • States for each position in each region • Coding region could be a simple set of three regions • Motifs: TTGACA for the -35 sequence, TATAAT for the -10 sequence • [Figure: full gene layout, showing the polymerase binding site (-35 and -10), transcription start site, ribosomal binding site, start codon, coding region, and termination region]

  26. Used in many applications • Classic example: the hidden states are rainy or sunny • If we know whether someone is walking, shopping, or cleaning, we can predict the state • [Figure: diagram labeled with states, emissions, and observations]

  27. Gene Prediction

  28. State is hidden • If something observable depends on an underlying state, we can use an HMM • With motifs, the sequence is visible; whether or not a region is a promoter site is not

  29. Probabilities • Each state has a probability of emitting any given observation • Each state has a probability of transitioning to any given next state • [Figure: probabilistic parameters of a hidden Markov model (example); x: states, y: possible observations, a: state transition probabilities, b: output probabilities]
