BCB 444/544

Lecture 27

Gene Prediction II

#27_Oct24

BCB 444/544 F07 ISU Dobbs #27 - Gene Prediction II


Required Reading (before lecture)

Mon Oct 22 - Lecture 26

Gene Prediction

  • Chp 8 - pp 97 - 112

Wed Oct 24 - Lecture 27 (will not be covered on Exam 2)

Promoter & Regulatory Element Prediction

  • Chp 9 - pp 113 - 126

Thurs Oct 25 - Review Session & Project Planning

Fri Oct 26 - EXAM 2

Assignments & Announcements

Mon Oct 22 - Study Guide for Exam 2 was posted, finally…

Mon Oct 22- HW#4 Due

(no "correct" answer to post)

Thu Oct 25 - no Lab => Optional Review Session for Exam

544 Project Planning/Consult with DD & MT

Fri Oct 26 - Exam 2 - Will cover:

  • Lectures 13-26 (thru Mon Oct 22)

  • Labs 5-8

  • HW# 3 & 4

  • All assigned reading:

    Chps 6 (beginning with HMMs), 7-8, 12-16

    Eddy: What is an HMM

    Ginalski: Practical Lessons…

BCB 544 "Team" Projects

  • 544 Extra HW#2 is next step in Team Projects

    • Write ~ 1 page outline

    • Schedule meeting with Michael & Drena to discuss topic

    • Read a few papers

    • Write a more detailed plan

  • You may work alone if you prefer

  • Last week of classes will be devoted to Projects

  • Written reports due: Mon Dec 3 (no class that day)

  • Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7

    • 1 or 2 teams will present during each class period

  • See Guidelines for Projects posted online

    BCB 544 Only: New Homework Assignment

    544 Extra #2 (posted online Thurs?)

    No - sorry! sent by email on Sat…

    Due: PART 1 - ASAP

    PART 2 - Fri Nov 2 by 5 PM

    Part 1 - Brief outline of Project, email to Drena & Michael

    after response/approval, then:

    Part 2 - More detailed outline of project

    Read a few papers and summarize status of problem

    Schedule meeting with Drena & Michael to discuss ideas

    Seminars this Week

    BCB List of URLs for Seminars related to Bioinformatics:

    http://www.bcb.iastate.edu/seminars/index.html

    • Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB

      • Dave Segal, UC Davis: Zinc Finger Protein Design

    • Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI

      • Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations

    Chp 8 - Gene Prediction

    SECTION III: GENE AND PROMOTER PREDICTION

    Xiong: Chp 8 Gene Prediction

    • Categories of Gene Prediction Programs

    • Gene Prediction in Prokaryotes

    • Gene Prediction in Eukaryotes



    What is a Gene?

    What is a gene? segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"

    • Genes can encode:

      • mRNA (for protein)

      • other types of RNA (tRNA, rRNA, miRNA, etc.)

  • Genes differ in eukaryotes vs prokaryotes (& archaea), both structure & regulation

    Synthesis & Processing of Eukaryotic mRNA

    [Figure: gene in DNA (5' to 3'; exon 1, intron, exon 2, intron, exon 3) -> transcription -> primary transcript (RNA) -> splicing (remove introns) -> capping & polyadenylation -> mature mRNA (5' 7MeG cap ... AAAAA 3') -> export to cytoplasm]

    What are cDNAs & ESTs?

    • cDNA libraries are important for determining gene structure & studying regulation of gene expression

    • Isolate RNA (always from a specific organism, region, and time point)

    • Convert RNA to complementary DNA (with reverse transcriptase)

    • Clone into cDNA vector

    • Sequence the cDNA inserts

    • Short cDNAs are called ESTs or Expressed Sequence Tags

    • ESTs are strong evidence for genes

    • Full-length cDNAs can be difficult to obtain

    UniGene: Unique genes via ESTs

    • Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene

    • UniGene clusters contain many ESTs

    • UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

    Gene Prediction in Prokaryotes vs Eukaryotes

    Prokaryotes

    Small genomes: 0.5 - 10 × 10^6 bp

    About 90% of genome is coding

    Simple gene structure

    Prediction success ~99%

    Eukaryotes

    Large genomes: 10^7 - 10^10 bp

    Often less than 2% coding

    Complicated gene structure (splicing, long introns)

    Prediction success 50-95%

    [Figure: prokaryotic gene = promoter, start codon (ATG), open reading frame (ORF), stop codon (TAA); eukaryotic gene = promoter, 5' UTR, start codon (ATG), exons and introns with splice sites, stop codon (TAA), 3' UTR]

    Prediction is Easier in Microbial Genomes

    Why? Smaller genomes

    Simpler gene structures

    Many more sequenced genomes! (for comparative approaches)

    Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available

    e.g., GeneMark.hmm, Glimmer

    TIGR Comprehensive Microbial Resource (CMR)

    NCBI Microbial Genomes

    Gene Prediction - The Problem

    Problem:

    Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences

    ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT

    ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
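A minimal sketch of the problem stated above, not the method of any particular program: scan the forward strand of the example sequence for open reading frames, i.e. an ATG followed in frame by a stop codon. The function name `find_orfs` and the `min_codons` parameter are illustrative assumptions; real gene finders also scan the reverse strand and, in eukaryotes, must deal with introns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the reading frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                break
    return orfs


example = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
for start, end, orf in find_orfs(example):
    print(start, end, orf)
```

For the slide's sequence this reports a single ORF, ATGGGGCAGGGTCAGATATAA (Met-Gly-Gln-Gly-Gln-Ile-stop); the second ATG has no in-frame stop codon before the sequence ends.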

    Computational Gene Prediction: Approaches

    • Ab initio methods

      • Search by signal: find DNA sequences involved in gene expression.

      • Search by content: Test statistical properties distinguishing coding from non-coding DNA

    • Similarity-based methods

      • Database search: exploit similarity to proteins, ESTs, cDNAs

      • Comparative genomics: exploit aligned genomes

        • Do other organisms have similar sequence?

    • Hybrid methods - best

    Computational Gene Prediction: Algorithms

    • Neural Networks (NNs) (more on these later…)

      e.g., GRAIL

    • Linear discriminant analysis (LDA) (see text)

      e.g., FGENES, MZEF

    • Markov Models (MMs) & Hidden Markov Models (HMMs)

      e.g., GeneSeqer - uses MMs

      GENSCAN - uses 5th order HMMs - (see text)

      HMMgene - uses conditional maximum likelihood (see text)

    Gene Prediction Strategies

    • What sequence signals can be used?

    • Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.

    • Processing signals: Splice donor/acceptors, polyA signal

    • Translation: Start (AUG = Met) & stop (UGA, UAA, UAG)

    • ORFs, codon usage

    • What other types of information can be used?

    • Homology (sequence comparison, BLAST)

    • cDNAs & ESTs(experimental data, pairwise alignment)

    Signals Search

    Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes
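The approach above can be sketched with a position-specific scoring matrix (PSSM): build log-odds scores from a set of aligned signal sites, then slide the matrix along the DNA. This is only an illustration; the four donor-like training sites, the pseudocount, and the uniform background of 0.25 are assumptions, not data or parameters from the lecture.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds score per position and base from equal-length aligned sites."""
    pssm = []
    for column in zip(*sites):
        counts = {base: pseudocount for base in "ACGT"}
        for base in column:
            counts[base] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pssm


def scan(seq, pssm):
    """Score every window of the sequence; returns (position, score) pairs."""
    width = len(pssm)
    return [
        (i, sum(pssm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


# Hypothetical aligned donor-site examples (exon|intron boundary after "AG").
training_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGA", "AGGTAAGG"]
pssm = build_pssm(training_sites)
hits = scan("CCCTTCAGGTAAGTCCTT", pssm)
best_position, best_score = max(hits, key=lambda hit: hit[1])
print(best_position, round(best_score, 2))
```

High-scoring windows are candidate signal instances; in practice a score threshold is chosen from known true and false sites.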

    DNA Signals Used in Gene Prediction

    • Exploit the regular gene structure

      ATG—Exon1—Intron1—Exon2—…—ExonN—STOP

    • Recognize “coding bias”

      CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…

    • Recognize splice sites

      Intron—cAGt—Exon—gGTgag—Intron

    • Model the duration of regions

      Introns tend to be much longer than exons, in mammals

      Exons are biased to have a given minimum length

    • Use cross-species comparison

      Gene structure is conserved in mammals

      Exons are more similar (~85%) than introns

    Content Search

    Observation: Encoding a protein affects statistical properties of DNA sequence:

    • Nucleotide composition

    • Hexamer frequency

    • GC content (CpG islands, exon/intron)

    • Uneven usage of synonymous codons (codon bias)

      Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

    Human Codon Usage

    Predicting Genes based on Codon Usage Differences

    [Figure: coding profile of β-globin gene, with exons marked]

    Algorithm:

    • Process sliding window

    • Use codon frequencies to compute probability of coding versus non-coding

    • Plot log-likelihood ratio: log [ P(window | coding) / P(window | non-coding) ]
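The sliding-window algorithm can be sketched as follows. Two assumptions are made here that the slide does not specify: the coding model is a table of codon frequencies estimated from a training sequence (with pseudocounts), and the non-coding model is uniform over the 64 codons. The function names and the toy training sequence are hypothetical.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(training_seq, pseudocount=1.0):
    """Estimate codon frequencies from an in-frame coding training sequence."""
    counts = {codon: pseudocount for codon in CODONS}
    for i in range(0, len(training_seq) - 2, 3):
        codon = training_seq[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def coding_profile(seq, coding_freq, window=30, step=3):
    """Log-likelihood ratio (coding vs uniform non-coding) per window.

    The window length should be a multiple of 3; codons are read in the
    frame that starts at each window position.
    """
    background = 1.0 / len(CODONS)
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        score = 0.0
        for i in range(start, start + window, 3):
            # Unknown codons (e.g. containing N) contribute a ratio of 1.
            freq = coding_freq.get(seq[i:i + 3], background)
            score += math.log(freq / background)
        profile.append((start, score))
    return profile


# Toy training data: a "coding" sequence that reuses three codons heavily.
coding_freq = codon_frequencies("CTGGCCGAG" * 20)
profile = coding_profile("CTGGCCGAGCTG" + "TCGTCGTCGTCG", coding_freq,
                         window=12, step=12)
for start, score in profile:
    print(start, round(score, 2))
```

Windows built from codons that are frequent in the training set score above zero, and windows built from rare codons score below zero, which is the contrast a coding profile plots along the gene.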

    Similarity-Based Methods: Database Search

    ATTGCGTAGGGCGCT

    TAACGCATCCCGCGA

    In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)

    Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.).

    Problems:

    • Will not find “new” or RNA genes (non-coding genes).

    • Limits of similarity are hard to define

    • Small exons might be overlooked
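The 6-frame translation step mentioned on this slide can be sketched in a few lines. The code uses the standard genetic code and writes unknown codons as "X"; the function names are illustrative, and this is not how BLASTX or TBLASTX are implemented internally.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse, to read the opposite strand 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation("ATTGCGTAGGGCGCT").items():
    print(name, protein)
```

Each of the six protein strings would then be compared against a protein database; only the frame that matches a known protein is likely to be the coding frame.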

    Similarity-Based Methods: Comparative Genomics

    human

    mouse

    GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA

    | ||||| ||||| ||| ||||| ||||||||||||| | |

    C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

    Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene

    Advantages:

    • May find uncharacterized or RNA genes

      Problems:

    • Finding suitable evolutionary distance

    • Finding limits of high similarity (functional regions)
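As a rough illustration of "high similarity in alignment", percent identity can be computed over the aligned (non-gap) columns of a pairwise alignment. The function below is a sketch; ignoring gap columns is one of several possible conventions and is an assumption here. The two rows are the alignment shown on this slide, taking the first row as human and the second as mouse.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical bases among columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    aligned = identical = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == gap or b == gap:
            continue
        aligned += 1
        identical += a == b
    return 100.0 * identical / aligned if aligned else 0.0


human = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
mouse = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"
print(round(percent_identity(human, mouse), 1))
```

Computing the same statistic in a window that slides along a genome-scale alignment gives a conservation profile in which exons tend to stand out from introns.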

    Human-Mouse Homology

    • Comparison of 1196 orthologous genes

    • Sequence identity between genes in human vs mouse

      • Exons: 84.6%

      • Protein: 85.4%

      • Introns: 35%

      • 5’ UTRs: 67%

      • 3’ UTRs: 69%

    Gene Prediction Flowchart

    Fig 5.15

    Baxevanis & Ouellette 2005

    Predicting Genes - Basic steps:

    • Obtain genomic sequence

    • BLAST it!

      • Perform database similarity search

        • (with EST & cDNA databases, if available)

      • Translate in all 6 reading frames

        • (i.e., "6-frame translation")

      • Compare with protein sequence databases

    • Use Gene Prediction software to locate genes

    • Compare results obtained using different programs

    • Analyze regulatory sequences, too

    • Refine gene prediction

    Predicting Genes - a few Details:

    • 1st, mask to "remove" repetitive elements (ALUs, etc.)

    • Perform database search on translated DNA (BlastX, TFasta)

    • Use several programs to predict genes & find ORFs (GENSCAN, GeneSeqer, GeneMark.hmm, GRAIL)

    • Search for functional motifs in translated ORFs & in neighboring DNA sequences (InterPro, Transfac)

    • Repeat


    Thanks to Volker Brendel, ISU for the following Figs & Slides

    Slightly modified from:

    BBSI Genome Informatics Module

    http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB

    V Brendel

    Brendel et al (2004) Bioinformatics 20: 1157

    GeneSeqer

    [Figure: GeneSeqer workflow. Components shown: Genomic Sequence, EST or protein database (Suffix Array/Suffix Tree), Fast Search, Spliced Alignment, Assembly, Output. Brendel 2005]

    Spliced Alignment Algorithm

    [Figure: intron bounded by splice sites, GT at the donor site and AG at the acceptor site]

    GeneSeqer - Brendel et al. - ISU

    http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

    Brendel et al (2004) Bioinformatics 20: 1157

    http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157

    • Perform pairwise alignment with large gaps in one sequence (due to introns)

      • Align genomic DNA with cDNA, ESTs, protein sequences

    • Score semi-conserved sequences at splice junctions

      • Using Bayesian probability model & 1st order MM

    • Score coding constraints in translated exons

      • Using Bayesian model
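The sketch below is a much simpler stand-in for the first step of spliced alignment, not GeneSeqer's algorithm: it enumerates every GT...AG pair as a candidate intron and shows that removing the right candidate joins the flanking exons. The probabilistic scoring of splice junctions and coding constraints described above is deliberately left out. The function names, the `min_len` cutoff, and the toy sequences are assumptions.

```python
def candidate_introns(genomic, min_len=20):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive.
    """
    genomic = genomic.upper()
    donors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


def splice(genomic, intron):
    """Remove one intron span and join the flanking sequence."""
    start, end = intron
    return genomic[:start] + genomic[end:]


# Toy gene: two short exons separated by one GT...AG intron.
exon1 = "ATGGCC"
intron = "GTAAGT" + "C" * 10 + "TTTCAG"
exon2 = "CTCTAA"
genomic = exon1 + intron + exon2

cdna = exon1 + exon2
matches = [span for span in candidate_introns(genomic) if splice(genomic, span) == cdna]
print(matches)
```

Because GT and AG dinucleotides occur frequently by chance, a real spliced aligner has to weigh each candidate by how well the flanking sequence resembles true splice sites and by how well the resulting exons match the cDNA, EST, or protein.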

    Brendel 2005

    Signals: Pre-mRNA Splicing

    [Figure: genomic DNA (start codon, stop codon; exons and introns, each intron with a GT donor site and an AG acceptor site) -> transcription -> pre-mRNA (Cap- ... -Poly(A)) -> splicing -> mRNA (Cap- ... -Poly(A)) -> translation -> protein. Brendel 2005]

    Brendel - Spliced Alignment I:

    Compare with cDNA or EST probes

    [Figure: genomic DNA (start codon, stop codon) aligned with mRNA (Cap-, 5'-UTR, start codon, stop codon, 3'-UTR, -Poly(A)). Brendel 2005]

    Brendel - Spliced Alignment II:

    Compare with protein probes

    [Figure: genomic DNA (start codon, stop codon) aligned with protein. Brendel 2005]

    Splice Site Detection

    Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal?

    YES

    • Extent of Splice Signal Window:

    i: ith position in sequence

    Ī: avg information content over all positions >20 nt from splice site

    σ_Ī: avg sample standard deviation of Ī

    Brendel 2005
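Information content per position can be computed from a set of aligned sites. The sketch below assumes the usual Shannon definition for DNA, I_i = 2 + sum over bases of f_ib * log2(f_ib), in bits; the exact formula used on the slide did not survive in this transcript, and the example sites are hypothetical.

```python
import math


def information_content(sites):
    """Per-position information (bits) for equal-length aligned DNA sites."""
    n = len(sites)
    profile = []
    for column in zip(*sites):
        bits = 2.0  # maximum for a four-letter alphabet
        for base in "ACGT":
            f = column.count(base) / n
            if f > 0:
                bits += f * math.log2(f)
        profile.append(bits)
    return profile


# Hypothetical donor sites: the invariant GT sits at columns 2 and 3.
sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGA", "AGGTAAGG"]
for position, bits in enumerate(information_content(sites)):
    print(position, round(bits, 2))
```

An invariant column carries 2 bits and a column with all four bases equally frequent carries 0 bits, so positions whose information content stands clearly above the background average can be counted as part of the splice signal window.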

    Information Content vs Position

    [Figure: information content vs position, panels Human T2_GT and Human T2_AG. Brendel 2005]

    Which sequences are exons & which are introns?

    How can you tell?

    Donor (GT) & Acceptor (AG) Sites Used for Model Training

    Number of True Splice Sites / Phase

    Species                Type   Phase 1   Phase 2   Phase 3
    Homo sapiens           GT        6586      5277      3037
                           AG        6555      5194      2979
    Mus musculus           GT        1212      1185       521
                           AG        1194      1139       504
    Rattus norvegicus      GT         450       408       147
                           AG         442       386       140
    Gallus gallus          GT         288       238       107
                           AG         284       228       103
    Drosophila             GT         989       670       524
                           AG        1001       671       536
    C. elegans             GT       37029     20500     20789
                           AG       36864     20325     20626
    S. pombe               GT         170       118       119
                           AG         179       122       118
    Aspergillus            GT         221       176       157
                           AG         217       172       163
    Arabidopsis thaliana   GT       23019      9297      8653
                           AG       22929      9247      8611
    Zea mays               GT         316       107        88
                           AG         311       104        83

    Brendel 2005

    Markov Model for Spliced Alignment

    [Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1. Transition probabilities shown: P_G; (1-P_G)(1-P_D(n+1)); (1-P_G)P_D(n+1); P_A(n)P_G; 1-P_A(n). Brendel 2005]

    Evaluation of Predictions

                        Actual True    Actual False
    Predicted True      TP             FP              PP=TP+FP
    Predicted False     FN             TN              PN=FN+TN
                        AP=TP+FN       AN=FP+TN

    • Specificity: Sp = TP / PP (true positives / predicted positives)

    • Sensitivity: Sn = TP / AP (true positives / actual positives; also called coverage or recall)

    • Misclassification rates:

    • Normalized specificity:

    Do not memorize this!

    Evaluation of Predictions - in English

                        Actual True    Actual False
    Predicted True      TP             FP              PP=TP+FP
    Predicted False     FN             TN              PN=FN+TN
                        AP=TP+FN       AN=FP+TN

    • Sensitivity: Sn = TP / AP (= Coverage, = Recall)

    In English? Sensitivity is the fraction of all positive instances having a true positive prediction.

    IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be trivially achieved by labeling all test cases positive!

    • Specificity: Sp = TP / PP

    In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.

    IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
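A short sketch of the two definitions, using this lecture's convention that specificity means TP / (TP + FP). The function names are illustrative. The worked case shows the warning above: labeling every test case positive gives 100% sensitivity but very poor specificity. The counts (921 true and 44411 false sites) are the human GT test set sizes listed in the GeneSeqer results table later in the lecture.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive: TP / AP."""
    return tp / (tp + fn)


def specificity_ppv(tp, fp):
    """Fraction of predicted positives that are true positives: TP / PP.

    This is the lecture's "specificity"; elsewhere it is usually called
    positive predictive value.
    """
    return tp / (tp + fp)


# Trivial classifier: call every candidate site a true splice site.
actual_positives, actual_negatives = 921, 44411
tp, fp, fn, tn = actual_positives, actual_negatives, 0, 0

sn = sensitivity(tp, fn)
sp = specificity_ppv(tp, fp)
print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")
```

Reporting both numbers together, or a curve of one against the other, avoids being misled by a method that maximizes only one of them.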



    Best Measures for Comparison?

    • ROC curves (Receiver Operating Characteristic) (?!!)

      • http://en.wikipedia.org/wiki/Roc_curve

  • Correlation Coefficient

    • Matthews correlation coefficient (MCC)

    • MCC = 1 for a perfect prediction

    • 0 for a completely random assignment

    • -1 for a "perfectly incorrect" prediction

  • In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting the fraction of true positives (TPR = true positive rate) vs the fraction of false positives (FPR = false positive rate). Note that specificity in this definition has its standard meaning, TN / AN, not the positive predictive value used on the previous slides.

    Do not memorize this!
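For reference, the Matthews correlation coefficient and a single ROC point can be computed directly from the four confusion-matrix counts. This is a sketch using the standard formulas; returning 0 when the MCC denominator is zero is a common convention and an assumption here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, in the range -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # undefined case, reported as 0 by convention
    return (tp * tn - fp * fn) / denominator


def roc_point(tp, fp, fn, tn):
    """(FPR, TPR) pair for one threshold; a ROC curve joins many such points."""
    return fp / (fp + tn), tp / (tp + fn)


print(mcc(10, 0, 0, 10))   # perfect prediction
print(mcc(5, 5, 5, 5))     # no better than random
print(mcc(0, 10, 10, 0))   # "perfectly incorrect" prediction
print(roc_point(5, 5, 5, 5))
```

Sweeping the classifier's score threshold and collecting one `roc_point` per threshold traces out the ROC curve.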

    GeneSeqer Performance?

    [Figure: four panels with Sn as an axis label: Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site]

    • Plots such as these (& ROCs) are much better than using a "single number" to compare different methods

    • Such plots illustrate trade-off: Sn vs Sp

    • Note: the above are not ROC curves (plots of Sn vs 1-Sp)

    Brendel 2005

    GeneSeqer Results on Different Genomes

                                 Test Site Set     Bayes
    Species       Model  Site    True    False     Factor   Sn (%)   (%)    Sp (%)
    Homo sapiens  2C     GT       921    44411     0         98.5    90.5    16.4
                                                   3         91.7    96.3    34.8
                                                   6         66.3    98.5    57.6
                         AG       920    65103     0         96.3    88.4     9.7
                                                   3         90.3    92.9    15.7
                                                   6         76.1    96.1    25.6
    Drosophila    2C     GT       329    11501     0         95.4    94.8    34.1
                                                   3         90.0    97.6    53.6
                                                   6         83.9    99.1    75.0
                         AG       329    14920     0         95.7    94.8    28.7
                                                   3         92.1    97.0    41.4
                                                   6         85.1    98.5    59.4
    C. elegans    7C     GT       400     7460     0         97.8    92.7    40.4
                                                   3         94.2    97.1    64.3
                                                   6         84.8    99.1    85.4
                         AG       400    10132     0         98.8    97.2    58.2
                                                   3         96.2    98.8    76.9
                                                   6         90.2    99.5    88.5
    A. thaliana   7C     GT       613     9027     0         99.5    93.2    48.1
                                                   3         95.6    97.6    73.2
                                                   6         87.1    99.3    91.0
                         AG       614    10196     0         99.2    92.3    41.9
                                                   3         96.4    96.4    62.0
                                                   6         87.1    98.6    81.2

    Brendel 2005

    Performance of GeneSeqer vs Others?

    • Comparison with ab initio gene prediction:

      vs GENSCAN, an HMM-based ab initio method

    • "Winner" depends on:

      • Availability of ESTs

      • Level of similarity to protein homologs

    Brendel 2005

    GeneSeqer vs GENSCAN (Exon prediction)

    [Figure: Exon (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN]

    GENSCAN - Burge, MIT

    Brendel 2005

    GeneSeqer vs GENSCAN (Intron prediction)

    [Figure: Intron (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN]

    GENSCAN - Burge, MIT

    Brendel 2005



    GeneSeqer: Input

    http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

    Brendel 2005



    GeneSeqer: Output

    Brendel 2005



    GeneSeqer: Gene Evidence Summary

    Brendel 2005

    Gene Prediction - Problems & Status?

    Common errors?

    • False positive intergenic regions:

      • 2 annotated genes actually correspond to a single gene

    • False negative intergenic region:

      • One annotated gene structure actually contains 2 genes

    • False negative gene prediction:

      • Missing gene (no annotation)

    • Other:

      • Partially incorrect gene annotation

      • Missing annotation of alternative transcripts

    Current status?

  • For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries

    • Limitation? Training data: predictions are organism specific

  • Combined ab initio/homology based predictions: Improved accuracy

    • Limitation? Availability of identifiable sequence homologs in databases

    Recommended Gene Prediction Software

    • Ab initio

      • GENSCAN: http://genes.mit.edu/GENSCAN.html

      • GeneMark.hmm: http://exon.gatech.edu/GeneMark/

      • others: GRAIL, FGENES, MZEF, HMMgene

    • Similarity-based

      • BLAST, GenomeScan, EST2Genome, Twinscan

    • Combined:

      • GeneSeqer, ROSETTA

    • Consensus: because results depend on organisms & specific task, always use more than one program!

      • Two servers that report consensus predictions

        • GeneComber

        • DIGIT

    Other Gene Prediction Resources: at ISU

    http://www.bioinformatics.iastate.edu/bioinformatics2go/

    Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.

    Lists of Gene Prediction Software

    http://www.bioinformaticsonline.org/links/ch_09_t_1.html

    http://cmgm.stanford.edu/classes/genefind/

    Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)

    Chapter 4 Finding Genes

    4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations

    4.2 Using MZEF To Find Internal Coding Exons

    4.3 Using GENEID to Identify Genes

    4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes

    4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm

    4.6 Eukaryotic Gene Prediction Using GeneMark.hmm

    4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome

    4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences

    4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation

    4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
