
From Genomes to Genes

From Genomes to Genes. Rui Alves. …atgattattggcggaatcggcggtgcaaggacacaaacaggactcagattcgaagaacgtacagacttacgaaagttgtttgaagaaattcc…. How to make sense of genome sequences?. How do I know where genes are?. Predicting ORFs is easy, predicting genes is hard.


Presentation Transcript


  1. From Genomes to Genes Rui Alves

  2. …atgattattggcggaatcggcggtgcaaggacacaaacaggactcagattcgaagaacgtacagacttacgaaagttgtttgaagaaattcc… How to make sense of genome sequences? How do I know where genes are?

  3. Predicting ORFs is easy, predicting genes is hard • An ORF is a sequence of nucleotides that runs from a start codon (ATG, GTG, …) to an in-frame stop codon (TAA, TAG, or TGA) • Finding them is as easy as reading the DNA sequence • How do we know if an ORF is a gene?
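The "reading the DNA sequence" step can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it scans only the forward strand, the function name `find_orfs` and the `min_codons` parameter are hypothetical, and treating GTG, TTG, and CTG as start codons follows the minor start codons listed later in the slides.

```python
START_CODONS = {"ATG", "GTG", "TTG", "CTG"}  # ATG plus the minor start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the first base of the start codon, end is the
    index just past the stop codon. min_codons is the minimum number of
    codons before the stop codon (start codon included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open an ORF at the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next one
    return orfs
```

Raising `min_codons` discards ORFs whose stop codon follows the start codon too closely, which is the simplest filter for ORFs that are unlikely to be genes.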

  4. There are several ways to predict genes • By homology

  5. Homology predictions (diagram labels: sequence of known gene, sequenced genome, homologue gene)

  6. How are sequences aligned? Substitution probability table …UUACAUUUCCCGUCCGCUCU… …GGGGUUAAUUUGCCCGUCCA… …UUACAUUUCCCGUCCGCUCU… …GGGGUUAAUUUGCCCGUCCA… S2>S1 S1
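The idea of comparing alignments by score (S2 > S1) can be illustrated with a toy ungapped alignment. This sketch is an assumption-laden simplification: it replaces the substitution probability table with a flat match/mismatch score, it does not allow gaps, and the names `alignment_score` and `best_offset` are hypothetical.

```python
def alignment_score(a, b, offset, match=1, mismatch=-1):
    """Score b placed against a with b[0] aligned to a[offset] (no gaps)."""
    score = 0
    for i, base in enumerate(b):
        j = i + offset
        if 0 <= j < len(a):  # only overlapping positions contribute
            score += match if a[j] == base else mismatch
    return score

def best_offset(a, b):
    """Return the offset of b relative to a with the highest score."""
    offsets = range(-len(b) + 1, len(a))
    return max(offsets, key=lambda k: alignment_score(a, b, k))
```

A real aligner would look up each pair of residues in a substitution table and would also consider insertions and deletions.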

  7. Problems of homology predictions: the genetic code …UUAAUUUCCCGUCCG… …CUUAUAAGUAGACCA… NO HOMOLOGY!! Yet both sequences code for the same peptide …LISRP…
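The two RNA fragments on this slide can be checked directly. The sketch below builds the standard genetic code table and translates both fragments; the helper names `translate` and `identity` are hypothetical, and `identity` is a naive position-by-position comparison rather than a real alignment.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna):
    """Translate an RNA (or DNA) string read in frame 0."""
    rna = rna.upper().replace("T", "U")
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

def identity(a, b):
    """Fraction of identical positions over the shorter sequence."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))

seq1 = "UUAAUUUCCCGUCCG"
seq2 = "CUUAUAAGUAGACCA"
print(translate(seq1), translate(seq2), identity(seq1, seq2))
```

Both fragments translate to the same peptide even though fewer than half of their nucleotides match, which is why nucleotide-level homology searches can miss true homologues.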

  8. Solution for the redundancy of the genetic code: use synonymous substitutions when doing the DNA alignment. The problem with doing this: …UUAAUUUCCCGUCCG… …UUAAUUUCCCGUCCA… …UUAAUUUCCAGACCG… … …CUUAUAAGUAGACCA… Combinatorial explosion!!! Solutions? Not many: efficient algorithms, more computer power, patience
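The size of the combinatorial explosion is easy to quantify: the number of RNA sequences encoding a peptide is the product of the number of synonymous codons for each residue. The sketch below assumes the standard genetic code; `count_encodings` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Number of codons that encode each amino acid (e.g. L -> 6, M -> 1)
DEGENERACY = Counter(CODON_TABLE.values())

def count_encodings(peptide):
    """Number of distinct RNA sequences that translate to the peptide."""
    return prod(DEGENERACY[aa] for aa in peptide)

print(count_encodings("LISRP"))  # 6 * 3 * 6 * 6 * 4
```

Even the five-residue peptide on the slide has thousands of possible encodings, and the count grows multiplicatively with every additional residue.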

  9. Homology predictions are most effective for closely related organisms. Thus, homology-based gene prediction works best when the genome of a closely related organism has been fully sequenced and annotated!!!

  10. There are other ways to predict if ORFs are genes • By homology • Ab initio methods • Signal Sensors • ATG sites • Promoter elements id • Regulatory elements id • Shine-Dalgarno sequences id (i.e. ribosome binding sites) • …

  11. Using initiation and termination codons to identify ORFs • ATG is the start codon • GTG, CTG, TTG are minor start codons • If the termination codon is too close to the ATG, the ORF is unlikely to be a gene • atgaatgaatgctgccgaagatctctggcaccaaattttggagcggttgcag… • atgaatgaatgctgccgaagatctctggcaccaaattttggagcggtgacag…

  12. Using promoter sequences to identify ORFs • Many promoters have a known structure • Identifying promoters close to initiation codons increases the likelihood of an ORF being a gene (figure: Lac promoter)

  13. Using response elements to identify ORFs • Regulatory binding sites (RBS) have a known structure • Identifying RBS close to initiation codons increases the likelihood of an ORF being a gene

  14. Using ribosomal binding sequences to identify ORFs • Consensus Shine-Dalgarno sequence: AGGAGG • Ribosomal binding sites (SDS) have a known structure • Identifying SDS close to initiation codons increases the likelihood of an ORF being a gene
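A signal sensor for the Shine-Dalgarno sequence can be sketched as a motif search upstream of a candidate start codon. The consensus AGGAGG comes from the slide; the 20-base search window, the one-mismatch tolerance, and the function name `has_shine_dalgarno` are illustrative assumptions, not values given in the presentation.

```python
def has_shine_dalgarno(dna, start, motif="AGGAGG", window=20, max_mismatches=1):
    """Check for a motif-like site within `window` bases upstream of `start`.

    start is the index of the first base of the candidate start codon.
    """
    dna = dna.upper()
    upstream = dna[max(0, start - window):start]
    for i in range(len(upstream) - len(motif) + 1):
        candidate = upstream[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(candidate, motif))
        if mismatches <= max_mismatches:
            return True
    return False
```

The same pattern (scan a fixed window near the start codon for a known motif) applies to promoter and regulatory element sensors, with a different motif and window.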

  15. There are several ways to predict genes • By homology • Ab initio methods • Signal Sensors • Promoter elements id • Regulatory elements id • Shine-Dalgarno sequences id (i.e. ribosome binding sites) • ATG sites • … • Content Sensors • Codon usage • GC content • Position asymmetry • CpG islands • …

  16. Using codon bias to predict expressed ORFs • The frequencies of synonymous codons in an organism are not uniform • The frequency of synonymous codons in coding sequences is different from that in non-coding sequences • This can be used to predict coding open reading frames • atgaatgcatgctgccgaagatctctggcaccaaattttggagcggttgcag… • The third reading frame is the most likely to be a gene
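A codon-bias content sensor can be sketched as follows: estimate codon frequencies from sequences known to be coding, then score each reading frame by how well its codons match that usage. Everything here is a simplified assumption: the function names are hypothetical, the background model is uniform (1/64 per codon), and a pseudocount stands in for proper smoothing.

```python
from collections import Counter
from itertools import product
from math import log

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS) + 64 * pseudocount
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def frame_scores(dna, freqs):
    """Mean log-odds (coding usage vs. uniform) for each of the 3 frames."""
    dna = dna.upper()
    scores = []
    for frame in range(3):
        codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
        codons = [c for c in codons if c in freqs]  # skip ambiguous bases
        total = sum(log(freqs[c] * 64) for c in codons)
        scores.append(total / max(len(codons), 1))
    return scores
```

The frame with the highest score uses codons most like those of the training genes; in practice the training set would be annotated genes from the same or a closely related organism.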

  17. Using GC content to predict expressed ORFs • gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag… • The G+C content of the third position of codons in coding sequences is biased • Genes have a very high (low) G+C content in the third position of the codons in the reading frame • Frame 1 (3) more likely to be expressed • Not very useful for eukaryotes
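The third-position G+C content (often called GC3) can be computed per reading frame with a slice. This is a minimal sketch; `gc3_by_frame` is a hypothetical name and only the forward strand is considered.

```python
def gc3_by_frame(dna):
    """Return the G+C fraction at codon position 3 for frames 0, 1, 2."""
    dna = dna.upper()
    result = []
    for frame in range(3):
        # Third codon positions of this frame: frame+2, frame+5, ...
        third = dna[frame + 2::3]
        gc = sum(base in "GC" for base in third)
        result.append(gc / len(third) if third else 0.0)
    return result
```

A frame whose GC3 value stands far above (or below) the others is the candidate coding frame under this sensor.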

  18. Using position asymmetry to predict expressed ORFs Coding sequences have a characteristic distribution of nucleotides in each of the three positions of codons gtgaatgtatgctctgccgaagatctctggcaccaaattttggagcggttgcag…

  19. Using position asymmetry to predict expressed ORFs Reading frame 1 is the most likely because it has the highest similarity to the position asymmetry of known genes.
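Position asymmetry can be sketched by building a per-position nucleotide profile for each frame and comparing it with a profile derived from known genes. The distance measure (sum of absolute frequency differences) and the names `position_profile`, `profile_distance`, and `best_frame` are assumptions made for illustration; the slides do not specify a similarity measure.

```python
def position_profile(seq, frame=0):
    """Nucleotide frequencies at codon positions 1-3 of the given frame."""
    seq = seq.upper()
    n_codons = (len(seq) - frame) // 3
    end = frame + 3 * n_codons  # ignore a trailing incomplete codon
    profile = []
    for pos in range(3):
        column = seq[frame + pos:end:3]
        total = max(len(column), 1)
        profile.append({b: column.count(b) / total for b in "ACGT"})
    return profile

def profile_distance(p, q):
    """Sum of absolute frequency differences between two profiles."""
    return sum(abs(p[i][b] - q[i][b]) for i in range(3) for b in "ACGT")

def best_frame(dna, reference):
    """Return (frame index, distances) for the frame closest to reference."""
    distances = [
        profile_distance(position_profile(dna, f), reference) for f in range(3)
    ]
    return distances.index(min(distances)), distances
```

Here `reference` would be computed with `position_profile` from a concatenation of known coding sequences. Note that frames are indexed from 0 in the code, so the slide's "reading frame 1" corresponds to index 0.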

  20. CpG islands are signals for transcription initiation • Near the promoter of known genes, the content of CG dinucleotides is higher than away from transcription initiation sites • Thus, ATGs preceded by a CpG island are more likely to belong to genes
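One common way to quantify CG dinucleotide enrichment is the observed/expected CpG ratio of a window. The sketch below computes that ratio for the region upstream of a candidate ATG; the 200-base window and the function names are assumptions, and the slides do not say which measure or threshold to use.

```python
def cpg_obs_exp(window):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    window = window.upper()
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)

def upstream_cpg_ratio(dna, start, length=200):
    """CpG ratio of the `length` bases immediately upstream of `start`."""
    return cpg_obs_exp(dna[max(0, start - length):start])
```

A ratio well above that of the bulk genome suggests that the ATG sits downstream of a CpG island.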

  21. Other asymmetry measures of gene likelihood • Dinucleotide bias • Hexanucleotide bias • …

  22. Summary • Genes can be predicted by • Homology • Content sensors • Signal sensors • If you need to annotate a genome, go to a resource such as TIGR

  23. How are eukaryotic genes different? (diagram labels: DNA, RNA Pol, mRNA, ribosome, protein)

  24. How are eukaryotic genes different? (diagram labels: DNA, RNA Pol, mRNA, spliceosome, ribosome, protein) Correctly identifying splicing sites is not a trivial task

  25. How do we predict splicing sites? • By Homology • Ab initio • SS motifs • Codon usage • Exonic Splicing Enhancers • Intronic Splicing Enhancers • Exonic Splicing Silencers • Intronic Splicing Silencers

  26. Homology Splice Site Prediction (diagram labels: known spliced gene, predicted spliced gene)

  27. Splice Site Motifs

  28. Exonic Splicing Enhancers

  29. Exonic Splicing Silencers Genes & Development 18:1241-1250

  30. Interaction between SE and SI

  31. Rules for Splicing • 3’ end likely target for repression • Distance between SE and 3’ end < 100 bp • Splicing efficiency ∝ p(interaction SEC-3’ end)

  32. Methods for splicing detection (flow diagram): set of known spliced genes → training set of known spliced genes and test set of known spliced genes → algorithm (GA, NN, HMM, Bayesian, ME) → test set predictions

  33. A Genetic Algorithm Method • Shuffle lines and columns k times and each time calculate the probability of a given combination of motifs getting spliced • Select the m best combinations and continue to evolve the algorithm until it predicts the training set

  34. A Neural Net Method (diagram labels: sequences, weight table for splice elements, hidden nodes, predicted splicing, corrected weight table for splice elements)

  35. Summary • Eukaryotic genes have exons • Biological rules combined with mathematical and statistical approaches can be used to predict the boundaries of the exons and to predict the splice variants

  36. How to find what genes a string of DNA contains Rui Alves

  37. Simple steps • Go to a known gene prediction server (or google for one) • Input the sequence and wait for the prediction • Get the prediction(s), either as cDNA or as a translated protein sequence, and do homology searches to identify them in a known database (e.g. NCBI or SWISSPROT)

  38. Simple steps • Go to a known gene prediction server (or google for one) • Input the sequence and wait for the prediction • Get the prediction(s), either as cDNA or as a translated protein sequence, and do homology searches to identify them

  39. Paper Presentation • The human genome (Science) vs. The human genome (Nature) • Nature: pages 875 to 901 • Science: pages 1317 to 1337 • Compare the differences in methods and results for the annotation • DO NOT SPEND TIME TALKING ABOUT THE SEQUENCING OR ASSEMBLY ITSELF • Do not go into the comparative genome analysis
