
BCB 444/544



Presentation Transcript


  1. BCB 444/544 Lecture 27 Gene Prediction II #27_Oct24 BCB 444/544 F07 ISU Dobbs #27 - Gene Prediction II

  2. Required Reading (before lecture)
     • Mon Oct 22 - Lecture 26: Gene Prediction (Chp 8, pp 97-112)
     • Wed Oct 24 - Lecture 27 (will not be covered on Exam 2): Promoter & Regulatory Element Prediction (Chp 9, pp 113-126)
     • Thurs Oct 25 - Review Session & Project Planning
     • Fri Oct 26 - EXAM 2

  3. Assignments & Announcements
     • Mon Oct 22 - Study Guide for Exam 2 was posted, finally…
     • Mon Oct 22 - HW#4 Due (no "correct" answer to post)
     • Thu Oct 25 - no Lab => Optional Review Session for Exam; 544 Project Planning/Consult with DD & MT
     • Fri Oct 26 - Exam 2 - Will cover:
        - Lectures 13-26 (thru Mon Oct 22)
        - Labs 5-8
        - HW# 3 & 4
        - All assigned reading: Chps 6 (beginning with HMMs), 7-8, 12-16; Eddy: What is an HMM; Ginalski: Practical Lessons…

  4. BCB 544 "Team" Projects
     • 544 Extra HW#2 is next step in Team Projects
        - Write ~1 page outline
        - Schedule meeting with Michael & Drena to discuss topic
        - Read a few papers
        - Write a more detailed plan
     • You may work alone if you prefer
     • Last week of classes will be devoted to Projects
        - Written reports due: Mon Dec 3 (no class that day)
        - Oral presentations (15-20') will be: Wed-Fri Dec 5, 6, 7
        - 1 or 2 teams will present during each class period
     • See Guidelines for Projects posted online

  5. BCB 544 Only: New Homework Assignment
     • 544 Extra#2 (posted online Thurs?) No - sorry! Sent by email on Sat…
     • Due: PART 1 - ASAP; PART 2 - Fri Nov 2 by 5 PM
     • Part 1 - Brief outline of Project, email to Drena & Michael; after response/approval, then:
     • Part 2 - More detailed outline of project
        - Read a few papers and summarize status of problem
        - Schedule meeting with Drena & Michael to discuss ideas

  6. Seminars this Week
     BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
     • Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
        - Dave Segal, UC Davis: Zinc Finger Protein Design
     • Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
        - Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations

  7. Chp 8 - Gene Prediction
     SECTION III: GENE AND PROMOTER PREDICTION
     Xiong: Chp 8 Gene Prediction
     • Categories of Gene Prediction Programs
     • Gene Prediction in Prokaryotes
     • Gene Prediction in Eukaryotes

  8. What is a Gene?
     A segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
     • Genes can encode:
        - mRNA (for protein)
        - other types of RNA (tRNA, rRNA, miRNA, etc.)
     • Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation

  9. Synthesis & Processing of Eukaryotic mRNA
     [Figure: gene in DNA (5' exon 1 - intron - exon 2 - intron - exon 3 3') → Transcription → primary transcript (RNA) → Splicing (remove introns) → Capping & polyadenylation → mature mRNA (5' 7MeG cap … AAAAA 3') → Export to cytoplasm]

  10. What are cDNAs & ESTs?
     • cDNA libraries are important for determining gene structure & studying regulation of gene expression
        - Isolate RNA (always from a specific organism, region, and time point)
        - Convert RNA to complementary DNA (with reverse transcriptase)
        - Clone into cDNA vector
        - Sequence the cDNA inserts
     • Short cDNAs are called ESTs or Expressed Sequence Tags
     • ESTs are strong evidence for genes
     • Full-length cDNAs can be difficult to obtain
     [Figure labels: insert, vector]

  11. UniGene: Unique genes via ESTs
     • Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
     • UniGene clusters contain many ESTs
     • UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

  12. Gene Prediction in Prokaryotes vs Eukaryotes
     • Prokaryotes
        - Small genomes: 0.5-10 × 10^6 bp
        - About 90% of genome is coding
        - Simple gene structure
        - Prediction success ~99%
     • Eukaryotes
        - Large genomes: 10^7-10^10 bp
        - Often less than 2% coding
        - Complicated gene structure (splicing, long introns)
        - Prediction success 50-95%
     [Figure: prokaryotic gene (promoter, then open reading frame (ORF) from start codon ATG to stop codon TAA) vs eukaryotic gene (promoter, 5' UTR, exons and introns with splice sites between start codon ATG and stop codon TAA, 3' UTR)]

  13. Prediction is Easier in Microbial Genomes
     Why?
     • Smaller genomes
     • Simpler gene structures
     • Many more sequenced genomes! (for comparative approaches)
     Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available
     • Programs, e.g., GeneMark.hmm, Glimmer
     • Resources: TIGR Comprehensive Microbial Resource (CMR), NCBI Microbial Genomes

  14. Gene Prediction - The Problem
     Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
     ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
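
As a rough illustration of the problem stated on this slide, the sketch below scans the slide's example sequence for forward-strand open reading frames (an ATG followed by an in-frame stop codon). The function name `find_orfs` and the `min_codons` parameter are hypothetical, and a real gene finder would also examine the reverse strand and use far more evidence than start and stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (start, end) pairs for forward-strand ORFs.

    An ORF here is an ATG followed by the first in-frame stop codon.
    `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the ATG until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs

seq = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

On this sequence the only complete forward-strand ORF starts at the first ATG; the second ATG has no in-frame stop before the sequence ends.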

  15. Computational Gene Prediction: Approaches
     • Ab initio methods
        - Search by signal: find DNA sequences involved in gene expression
        - Search by content: test statistical properties distinguishing coding from non-coding DNA
     • Similarity-based methods
        - Database search: exploit similarity to proteins, ESTs, cDNAs
        - Comparative genomics: exploit aligned genomes (do other organisms have similar sequence?)
     • Hybrid methods - best

  16. Computational Gene Prediction: Algorithms
     • Neural Networks (NNs) (more on these later…), e.g., GRAIL
     • Linear discriminant analysis (LDA) (see text), e.g., FGENES, MZEF
     • Markov Models (MMs) & Hidden Markov Models (HMMs), e.g.,
        - GeneSeqer - uses MMs
        - GENSCAN - uses 5th order HMMs (see text)
        - HMMgene - uses conditional maximum likelihood (see text)

  17. Gene Prediction Strategies
     • What sequence signals can be used?
        - Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
        - Processing signals: splice donor/acceptors, polyA signal
        - Translation: start (AUG = Met) & stop (UGA, UAA, UAG)
        - ORFs, codon usage
     • What other types of information can be used?
        - Homology (sequence comparison, BLAST)
        - cDNAs & ESTs (experimental data, pairwise alignment)

  18. Signals Search
     Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes
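
A minimal sketch of the signal-search idea, assuming a log-odds position-specific scoring matrix (PSSM) built from a handful of aligned example sites and slid along a DNA sequence. The four donor-like training sites, the pseudocount, and the uniform background of 0.25 are all invented for illustration and are not taken from the lecture.

```python
import math

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM (one dict per position) from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm

def scan(seq, pssm):
    """Score every window of the sequence; return (position, score) pairs."""
    width = len(pssm)
    return [
        (i, sum(pssm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]

# Hypothetical training sites, made up for this example.
donor_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pssm = build_pssm(donor_sites)
hits = scan("CCAGGTAAGTCC", pssm)
best_pos, best_score = max(hits, key=lambda hit: hit[1])
print(best_pos, round(best_score, 2))
```

High-scoring windows are only candidate signals; as the slide says, they provide evidence for genes rather than a prediction on their own.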

  19. DNA Signals Used in Gene Prediction
     • Exploit the regular gene structure: ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
     • Recognize "coding bias": CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
     • Recognize splice sites: Intron - cAGt - Exon - gGTgag - Intron
     • Model the duration of regions
        - Introns tend to be much longer than exons, in mammals
        - Exons are biased to have a given minimum length
     • Use cross-species comparison
        - Gene structure is conserved in mammals
        - Exons are more similar (~85%) than introns

  20. Content Search
     Observation: Encoding a protein affects statistical properties of DNA sequence:
     • Nucleotide composition
     • Hexamer frequency
     • GC content (CpG islands, exon/intron)
     • Uneven usage of synonymous codons (codon bias)
     Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

  21. Human Codon Usage

  22. Predicting Genes based on Codon Usage Differences
     Algorithm: Process sliding window
     • Use codon frequencies to compute probability of coding versus non-coding
     • Plot the log-likelihood ratio of coding versus non-coding
     [Figure: coding profile of β-globin gene, with exons marked]
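
The sliding-window idea can be sketched as follows, assuming the score for a window is the sum over its codons of log(P(codon | coding) / P(codon | non-coding)). The frequency tables here are toy values: a uniform non-coding background and a coding table in which three arbitrarily chosen codons are favored. Real tables would be estimated from annotated genes of the organism in question.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

# Toy tables (assumptions): uniform background, three favored coding codons.
background = {c: 1 / 64 for c in CODONS}
weights = {c: 1.0 for c in CODONS}
for favored in ("CTG", "GCC", "GAG"):
    weights[favored] = 4.0
total_weight = sum(weights.values())
coding = {c: w / total_weight for c, w in weights.items()}

def codon_llr(window, coding_freq, background_freq):
    """Log-likelihood ratio of coding vs non-coding for one window."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        score += math.log(coding_freq[codon] / background_freq[codon])
    return score

def sliding_llr(seq, coding_freq, background_freq, window=6, step=3):
    """Return (start, score) for each window along the sequence."""
    return [
        (start, codon_llr(seq[start:start + window], coding_freq, background_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]

scores = sliding_llr("CTGGCCGAGAAATTTAAA", coding, background)
for start, score in scores:
    print(start, round(score, 3))
```

Positive scores mark windows that look more like coding sequence under these tables. A practical profile would use a much longer window and score each of the three reading frames separately.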

  23. Similarity-Based Methods: Database Search
     ATTGCGTAGGGCGCT
     TAACGCATCCCGCGA
     • In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
     • Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
     Problems:
     • Will not find "new" or RNA genes (non-coding genes)
     • Limits of similarity are hard to define
     • Small exons might be overlooked
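
A small sketch of six-frame translation, applied to the 15-nt example sequence on this slide. It uses the standard genetic code written in the usual compact TCAG ordering; the function names are hypothetical, and the code assumes an uppercase sequence containing only A, C, G, T.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: codon index = 16*first + 4*second + third (TCAG order).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames

for name, protein in six_frame("ATTGCGTAGGGCGCT").items():
    print(name, protein)
```

Each of the six translated frames would then be compared against a protein database, which is the search BLASTX performs for a nucleotide query.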

  24. Similarity-Based Methods: Comparative Genomics
     human  GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
     mouse  C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
     Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
     Advantages:
     • May find uncharacterized or RNA genes
     Problems:
     • Finding suitable evolutionary distance
     • Finding limits of high similarity (functional regions)
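
To make "high similarity in alignment" concrete, the sketch below counts identical columns in the human/mouse alignment shown on this slide. Reporting identity over ungapped columns is one common convention and is an assumption here; the lecture does not say which denominator it uses.

```python
def alignment_identity(a, b):
    """Return (matches, ungapped_columns) for two aligned, equal-length strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    ungapped = sum(1 for x, y in zip(a, b) if x != "-" and y != "-")
    return matches, ungapped

human = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
mouse = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"

matches, ungapped = alignment_identity(human, mouse)
print(f"{matches} matches over {ungapped} ungapped columns "
      f"({100 * matches / ungapped:.1f}% identity)")
```

Computing the same quantity in a window that slides along a longer alignment gives a conservation profile in which candidate functional regions appear as peaks.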

  25. Human-Mouse Homology
     • Comparison of 1196 orthologous genes
     • Sequence identity between genes in human vs mouse
        - Exons: 84.6%
        - Protein: 85.4%
        - Introns: 35%
        - 5' UTRs: 67%
        - 3' UTRs: 69%

  26. Gene Prediction Flowchart Fig 5.15 Baxevanis & Ouellette 2005

  27. Predicting Genes - Basic steps:
     • Obtain genomic sequence
     • BLAST it!
        - Perform database similarity search (with EST & cDNA databases, if available)
        - Translate in all 6 reading frames (i.e., "6-frame translation")
        - Compare with protein sequence databases
     • Use Gene Prediction software to locate genes
     • Compare results obtained using different programs
     • Analyze regulatory sequences, too
     • Refine gene prediction

  28. Predicting Genes - a few Details:
     1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
     2. Perform database search on translated DNA (BlastX, TFasta)
     3. Use several programs to predict genes & find ORFs (GENSCAN, GeneSeqer, GeneMark.hmm, GRAIL)
     4. Search for functional motifs in translated ORFs & in neighboring DNA sequences (InterPro, Transfac)
     5. Repeat

  29. Thanks to Volker Brendel, ISU for the following Figs & Slides
     Slightly modified from: BBSI Genome Informatics Module http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
     V Brendel vbrendel@iastate.edu
     Brendel et al (2004) Bioinformatics 20: 1157

  30. GeneSeqer
     [Figure: flowchart. Genomic sequence and EST or protein database (Suffix Array/Suffix Tree) → Fast Search → Spliced Alignment → Assembly → Output]
     Brendel 2005

  31. Spliced Alignment Algorithm
     GeneSeqer - Brendel et al. - ISU http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
     Brendel et al (2004) Bioinformatics 20: 1157 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157
     • Perform pairwise alignment with large gaps in one sequence (due to introns)
        - Align genomic DNA with cDNA, ESTs, protein sequences
     • Score semi-conserved sequences at splice junctions
        - Using Bayesian probability model & 1st order MM
     • Score coding constraints in translated exons
        - Using Bayesian model
     [Figure: intron bounded by splice sites, GT at the donor site and AG at the acceptor site]
     Brendel 2005

  32. Signals: Pre-mRNA Splicing
     [Figure: genomic DNA (start codon … stop codon) → Transcription → pre-mRNA (Cap … Poly(A)) → Splicing → mRNA (Cap … Poly(A)) → Translation → protein. Exons are separated by introns; each intron begins with GT at the donor site and ends with AG at the acceptor site (splice sites)]
     Brendel 2005

  33. Brendel - Spliced Alignment I: Compare with cDNA or EST probes
     [Figure: genomic DNA (start codon … stop codon) aligned with mRNA (Cap, 5'-UTR, start codon … stop codon, 3'-UTR, Poly(A))]
     Brendel 2005

  34. Brendel - Spliced Alignment II: Compare with protein probes
     [Figure: genomic DNA (start codon … stop codon) aligned with protein]
     Brendel 2005

  35. Splice Site Detection
     Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
     • Measures shown on the slide: information content I_i and extent of splice signal window
     • i: ith position in sequence
     • Ī: avg information content over all positions >20 nt from splice site
     • σ_Ī: avg sample standard deviation of Ī
     Brendel 2005
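
The slide's own formula for I_i did not survive in this transcript. The sketch below therefore assumes the standard Shannon-based definition for DNA, I_i = 2 + Σ_B f_iB log2 f_iB (in bits), where f_iB is the frequency of base B at position i in a set of aligned sites. The four aligned sites are invented for illustration.

```python
import math

def information_content(sites):
    """Per-position information content (bits) for aligned DNA sites.

    Assumes the standard definition I_i = 2 - H_i, where H_i is the
    Shannon entropy of the base frequencies at position i.
    """
    info = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        entropy = 0.0
        for base in "ACGT":
            freq = column.count(base) / len(column)
            if freq > 0:
                entropy -= freq * math.log2(freq)
        info.append(2.0 - entropy)
    return info

# Hypothetical aligned sites: position 0 is random, positions 1-3 are invariant.
sites = ["AGGT", "CGGT", "GGGT", "TGGT"]
print(information_content(sites))
```

A position where all four bases are equally frequent carries 0 bits and an invariant position carries 2 bits. Positions whose information content stands well above the background average Ī are the ones that contribute to the signal.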

  36. Information Content vs Position
     [Figure: two panels, Human T2_GT and Human T2_AG]
     Which sequences are exons & which are introns? How can you tell?
     Brendel 2005

  37. Donor (GT) & Acceptor (AG) Sites Used for Model Training
     Number of true splice sites, by phase:
     | Species | Phase 1 GT | Phase 1 AG | Phase 2 GT | Phase 2 AG | Phase 3 GT | Phase 3 AG |
     | Homo sapiens | 6586 | 6555 | 5277 | 5194 | 3037 | 2979 |
     | Mus musculus | 1212 | 1194 | 1185 | 1139 | 521 | 504 |
     | Rattus norvegicus | 450 | 442 | 408 | 386 | 147 | 140 |
     | Gallus gallus | 288 | 284 | 238 | 228 | 107 | 103 |
     | Drosophila | 989 | 1001 | 670 | 671 | 524 | 536 |
     | C. elegans | 37029 | 36864 | 20500 | 20325 | 20789 | 20626 |
     | S. pombe | 170 | 179 | 118 | 122 | 119 | 118 |
     | Aspergillus | 221 | 217 | 176 | 172 | 157 | 163 |
     | Arabidopsis thaliana | 23019 | 22929 | 9297 | 9247 | 8653 | 8611 |
     | Zea mays | 316 | 311 | 107 | 104 | 88 | 83 |
     Brendel 2005

  38. PG PG (1-PG)(1-PD(n+1)) en en+1 (1-PG)PD(n+1) PA(n)PG (1-PG)PD(n+1) in in+1 1-PA(n) Markov Model for Spliced Alignment Brendel 2005 BCB 444/544 F07 ISU Dobbs #27 - Gene Prediction II

  39. Evaluation of Predictions
     | | Actual True | Actual False | Total |
     | Predicted True | TP (true positives) | FP (false positives) | PP = TP+FP (predicted positives) |
     | Predicted False | FN | TN | PN = FN+TN |
     | Total | AP = TP+FN | AN = FP+TN | |
     • Measures defined on the slide: specificity, sensitivity (= coverage, recall), misclassification rates, normalized specificity
     Do not memorize this!

  40. Evaluation of Predictions - in English
     | | Actual True | Actual False | Total |
     | Predicted True | TP | FP | PP = TP+FP |
     | Predicted False | FN | TN | PN = FN+TN |
     | Total | AP = TP+FN | AN = FP+TN | |
     • Sensitivity (= coverage, recall): Sn = TP / AP
        - In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
        - IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be trivially achieved by labeling all test cases positive!
     • Specificity: Sp = TP / PP
        - In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
        - IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
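
The two definitions on this slide translate directly into code. The sketch below uses the lecture's convention, in which "specificity" means TP / (TP + FP), i.e. positive predictive value. The counts are invented to show why sensitivity alone can mislead.

```python
def sensitivity(tp, fn):
    """Sn = TP / AP: fraction of actual positives that were predicted."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Sp = TP / PP, as defined in this lecture (positive predictive value)."""
    return tp / (tp + fp)

# Hypothetical test set: 100 true sites and 900 false sites.
# A reasonable predictor finds 90 true sites and makes 30 false calls.
print(sensitivity(tp=90, fn=10), specificity(tp=90, fp=30))

# Labeling everything positive gives perfect sensitivity but poor specificity.
print(sensitivity(tp=100, fn=0), specificity(tp=100, fp=900))
```

Reporting both numbers together exposes the trade-off that a single number hides.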

  41. Best Measures for Comparison?
     • ROC curves (Receiver Operating Characteristic ?!!)
        - http://en.wikipedia.org/wiki/Roc_curve
        - In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. Here specificity has its conventional meaning, TN / (TN + FP), not the positive predictive value used on the previous slide.
        - The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate)
     • Correlation Coefficient
        - Matthews correlation coefficient (MCC)
        - MCC = 1 for a perfect prediction, 0 for a completely random assignment, -1 for a "perfectly incorrect" prediction
     Do not memorize this!
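
A short sketch of the Matthews correlation coefficient using its standard formula, MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Returning 0 when the denominator is zero is a common convention adopted here as an assumption; the counts are invented to reproduce the three reference values listed on the slide.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denominator

print(mcc(tp=50, fp=0, fn=0, tn=50))    # perfect prediction
print(mcc(tp=25, fp=25, fn=25, tn=25))  # no better than random
print(mcc(tp=0, fp=50, fn=50, tn=0))    # "perfectly incorrect" prediction
```

Unlike sensitivity or specificity alone, MCC uses all four cells of the confusion matrix, which is why it is often preferred when a single summary number is unavoidable.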

  42. GeneSeqer Performance?
     [Figure: four panels (Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site), each with Sn on one axis]
     • Plots such as these (& ROCs) are much better than using a "single number" to compare different methods
     • Such plots illustrate trade-off: Sn vs Sp
     • Note: the above are not ROC curves (plots of Sn vs 1-Sp)
     Brendel 2005

  43. GeneSeqer Results on Different Genomes
     Sn, σ (normalized specificity) and Sp are listed for Bayes factor thresholds 0 / 3 / 6.
     | Species | Model | Site | Test set: true sites | Test set: false sites | Sn (%) | σ (%) | Sp (%) |
     | Homo sapiens | 2C | GT | 921 | 44411 | 98.5 / 91.7 / 66.3 | 90.5 / 96.3 / 98.5 | 16.4 / 34.8 / 57.6 |
     | Homo sapiens | 2C | AG | 920 | 65103 | 96.3 / 90.3 / 76.1 | 88.4 / 92.9 / 96.1 | 9.7 / 15.7 / 25.6 |
     | Drosophila | 2C | GT | 329 | 11501 | 95.4 / 90.0 / 83.9 | 94.8 / 97.6 / 99.1 | 34.1 / 53.6 / 75.0 |
     | Drosophila | 2C | AG | 329 | 14920 | 95.7 / 92.1 / 85.1 | 94.8 / 97.0 / 98.5 | 28.7 / 41.4 / 59.4 |
     | C. elegans | 7C | GT | 400 | 7460 | 97.8 / 94.2 / 84.8 | 92.7 / 97.1 / 99.1 | 40.4 / 64.3 / 85.4 |
     | C. elegans | 7C | AG | 400 | 10132 | 98.8 / 96.2 / 90.2 | 97.2 / 98.8 / 99.5 | 58.2 / 76.9 / 88.5 |
     | A. thaliana | 7C | GT | 613 | 9027 | 99.5 / 95.6 / 87.1 | 93.2 / 97.6 / 99.3 | 48.1 / 73.2 / 91.0 |
     | A. thaliana | 7C | AG | 614 | 10196 | 99.2 / 96.4 / 87.1 | 92.3 / 96.4 / 98.6 | 41.9 / 62.0 / 81.2 |
     Brendel 2005

  44. Performance of GeneSeqer vs Others?
     • Comparison with ab initio gene prediction: vs GENSCAN, an HMM-based ab initio method
     • "Winner" depends on:
        - Availability of ESTs
        - Level of similarity to protein homologs
     Brendel 2005

  45. GeneSeqer vs GENSCAN (Exon prediction)
     [Figure: Exon (Sn + Sp) / 2 (0.00-1.00) vs target protein alignment score (0-100); curves for GeneSeqer, NAP, and GENSCAN]
     GENSCAN - Burge, MIT
     Brendel 2005

  46. GeneSeqer vs GENSCAN (Intron prediction)
     [Figure: Intron (Sn + Sp) / 2 (0.00-1.00) vs target protein alignment score (0-100); curves for GeneSeqer, NAP, and GENSCAN]
     GENSCAN - Burge, MIT
     Brendel 2005

  47. GeneSeqer: Input http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi Brendel 2005

  48. GeneSeqer: Output Brendel 2005

  49. GeneSeqer: Gene Evidence Summary Brendel 2005

  50. Gene Prediction - Problems & Status?
     Common errors?
     • False positive intergenic regions: 2 annotated genes actually correspond to a single gene
     • False negative intergenic region: one annotated gene structure actually contains 2 genes
     • False negative gene prediction: missing gene (no annotation)
     • Other:
        - Partially incorrect gene annotation
        - Missing annotation of alternative transcripts
     Current status?
     • For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
        - Limitation? Training data: predictions are organism specific
     • Combined ab initio/homology based predictions: improved accuracy
        - Limitation? Availability of identifiable sequence homologs in databases
