
The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

Martin G. Reese

Nomi L. Harris

George Hartzell

Suzanna E. Lewis

Drosophila Genome Center
Department of Molecular and Cell Biology
539 Life Sciences Addition
University of California, Berkeley


Abstract

Many of the technical issues involved in sequencing complete genomes are essentially solved. Technologies already exist that provide sufficient solutions for ascertaining sequencing error rates and for assembling sequence data. Currently, however, standards or rules for the annotation process are still an outstanding problem.

How shall the genomes be annotated, what shall be annotated, which computational tools are most effective, how reliable are these annotations, how organism-specific do the tools have to be and ultimately how should the computational results be presented to the community? All these questions are unsolved. This tutorial will give an overview and assessment of the current state of annotation based upon experiences gained at the Drosophila melanogaster genome project.

In the tutorial we will do three things. First, we will break down the annotation process and discuss the various aspects of the problem. This will serve to clarify the term "annotation", which is often used to collectively describe a process that has a number of discrete steps. Second, with the participation of computational biologists from the community we will compare existing tools for sequence annotation. We will do this by providing a 3 megabase sequence that has already been well-characterized at our center as a testbed for evaluating other feature-finding algorithms. This is similar to what has been done at the CASP (critical assessment of techniques for protein structure prediction) conferences (http://predictioncenter.llnl.gov) for protein structure prediction. Third, we will discuss which annotation problems are essentially solved and which problems remain.


Tutorial goals

  • Review the algorithms currently used in annotation

  • Assess existing methods under “field” conditions

  • Identify open issues in annotation


Tutorial organization

  • Definitions

  • Annotation

    • “Biological” issues

    • “Engineering” issues

    • Application of tools within an existing annotation system

  • Break (20 minutes)

  • Review of existing tools

  • Our annotation experiment

  • Conclusions and outstanding issues


What is a gene?

  • Definition: An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule, which in turn influences some characteristic phenotype of the organism.


What are annotations?

  • Definition: Features on the genome derived through the transformation of raw genomic sequences into information by integrating computational tools, auxiliary biological data, and biological knowledge.


How does an annotation differ from a gene?

  • Many annotations are the same as ‘genes’

    • The annotation describes an inheritable trait associated with a region of DNA.

  • But an annotation may not always correspond in this way, e.g. an STS or a sequence overlap

    • Here the annotated region of genomic DNA or RNA is not transcribed or translated




Sequence feature types

  • Transcribed region

    • mRNA, tRNA, snoRNA, snRNA, rRNA

  • Structural region

    • Exon, intron, 5’ UTR, 3’ UTR, ORF, cleavage product

    • Mutations: insertion, deletion, substitution, inversion, translocation

  • Functional or signal region

    • Promoter, enhancer, DNA/RNA binding site, splice site signal, poly-adenylation signal

    • Protein processing: glycosylation, methylation, phosphorylation site

  • Similarity

    • Homolog, paralog, genomic overlap (syntenic region)

  • Other feature types

    • Transposable element, repetitive element

    • Pseudogene

    • STS, insertion site


DNA transcription unit features

  • Promoter elements

    • Core promoter elements

      • TATA box

      • Initiator (Inr)

      • Downstream promoter element (DPE)

    • Transcription factor (“TF”) binding sites

      • CAAT boxes

      • GC boxes

      • SP-1 sites

      • GAGA boxes

    • Enhancer site(s)


mRNA features

  • Exon

    • Initial, internal, terminal

      • Codon usage, preference

      • Control elements (e.g. splice enhancers)

  • Intron

    • 5’ splice site (“GT”), branchpoint (lariat), 3’ splice site (“AG”)

    • Repeat elements

  • Start codon (translation start site)

    • “Kozak” rule

  • UTR (untranslated regions)

    • 5’ UTR

      • Translation regulatory elements

      • RNA binding sites

    • Initial, internal, terminal

      • Control elements (e.g. splice enhancers)

    • 3’ UTR

      • RNA binding sites (cis-acting elements)

  • Stop codon

  • Poly-adenylation signal and site

  • RNA destabilization signal


Definitions for data modeling

  • Feature: An interval or an ordered set of intervals on a sequence that describes some biological attribute and is justified by evidence.

  • Sequence: A linear molecule of DNA, RNA or amino acids.

  • Evidence: A computational or experimental result coming out of an analysis of a sequence

  • Annotation: A set of features
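
The four definitions above map naturally onto a small data model. The following is a minimal sketch, assuming 0-based half-open intervals and hypothetical field names (`program`, `score`, `kind`, `sequence_id`); it is not the schema used at the BDGP.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Evidence:
    """A computational or experimental result from analyzing a sequence."""
    program: str   # hypothetical field: name of the analysis, e.g. "sim4"
    score: float   # hypothetical field: score reported by that analysis


@dataclass
class Feature:
    """An ordered set of intervals on a sequence, justified by evidence."""
    kind: str                              # e.g. "exon" or "mRNA"
    intervals: List[Tuple[int, int]]       # assumed 0-based, half-open (start, end)
    evidence: List[Evidence] = field(default_factory=list)

    def span(self) -> Tuple[int, int]:
        """Smallest single interval that covers every part of the feature."""
        return (min(s for s, _ in self.intervals),
                max(e for _, e in self.intervals))

    def length(self) -> int:
        """Total number of residues covered by the feature's intervals."""
        return sum(e - s for s, e in self.intervals)


@dataclass
class Annotation:
    """A set of features on one sequence."""
    sequence_id: str
    features: List[Feature] = field(default_factory=list)


# Illustrative values only: a two-exon transcript supported by one alignment.
mrna = Feature("mRNA", [(100, 250), (400, 520)], [Evidence("sim4", 99.1)])
annotation = Annotation("clone_A", [mrna])
```

Keeping evidence attached to each feature is what lets a curator later ask why a feature exists, which matters for the maintenance questions raised later in the tutorial.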


Annotation

[Figure: Annotation links detailed analysis (typically biological) of single genes with large-scale analysis (typically computational) of the entire genome to produce an annotated genome. Axis labels: depth of knowledge, breadth of knowledge.]


Annotation process overview

[Figure: flow diagram. Labels: Data, Methods, Genome sequence, Auxiliary data, Computational tools, Database resources, all feeding Annotation systems, which lead to Understanding of a genome.]


Types of sequence data

  • Chromosomal sequence

    • Euchromatic

    • Heterochromatic

  • mRNA sequences

    • Full length cDNA

    • 5’ EST

    • 3’ EST

  • Protein sequences

  • Insertion site flanking sequences


Auxiliary data

  • Maps

    • Genetic, physical, radiation hybrid map (RH), deletion, cytogenetic

  • Expression data

    • Tissue, stage

  • Phenotypes

    • Lethality, sterility


Computational annotation tools

  • Gene finding

  • Repeat finding

  • EST/cDNA alignment

  • Homology searching

    • BLAST, FASTA, HMM-based methods, etc.

  • Protein family searching

    • PFAM, Prosite, etc.


Database resources

  • Curated sequence feature data sets

    • Repeat elements

    • Transposons

    • Non-redundant mRNA

    • STSs and other sequence markers

  • Genome sequence from related species

    • D. melanogaster vs. D. virilis, D. hydei

  • Genome sequence from more distant species

  • Protein sequences from distant species


Biological issues in annotation

  • Common

    • Genes within genes

    • Alternative splicing

    • Alternative poly-adenylation sites

  • Rare

    • Translational frame shifting

    • mRNA editing

    • Eukaryotic operons

    • Alternative initiation


Engineering issues in annotation

  • What sequence to start with?

    • Because features are intervals on a sequence, problems can be caused by gaps, frameshifts, and other changes to the sequence. How do you track these changes over time and model features that span gaps?

  • When to annotate?

    • Feature identification can aid in sequencing. It may be advisable to carry out sequencing and annotation in parallel thus enabling them to complement one another.

  • What analyses need to be run and how?

    • What dependencies are there between various analysis programs?

    • What parameter settings should be used?


Engineering issues in annotation

  • What public sequence data sets are needed?

    • What are the mechanics of obtaining public sequence databases?

    • Are curated data sets available, or do you need to set up a means of maintaining your own (for repeats, insertions, organism of interest)?

  • How do you achieve computational throughput?

    • Workstation farm, or simply a big, powerful box?

    • Job flow control

  • What do you do with the results?

    • Homogenize results into single format?

    • Filter results for significance and redundancy


Engineering issues in annotation

  • Interpreting the results

    • Is human curation needed?

    • How can you achieve consistency between curators?

    • How do you design the user interface so that it is simple enough to get the task completed speedily but complex enough to deal with biology?

    • How do you capture curations?

  • How are annotation translations to be described?

    • EC terminology

    • ProSite families

    • Pfam domains

    • Is function distinguishable from process?


Engineering issues in annotation

  • How do you manage data?

    • What is the appropriate database schema design?

    • How is the database to be kept up to date? Will it be directly from programs running user interfaces and analyses or via a middleware layer?

    • Is a flat file format needed and what should it be?

    • What query and retrieval support is needed?

  • How do you distribute data?

    • For bulk downloads what is the format of the data?

    • What information is best summarized in tables?

    • What information requires an integrated graphical view?


Engineering issues in annotation

  • How do you update the annotations?

    • How frequently are they re-evaluated?

    • How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)?

    • How can differences between old and new computational results be detected?

    • Changes in computational results may need to trigger changes in curated annotations


Drosophila melanogaster

  • Drosophila is the most important model organism*

  • Drosophila genome:

    • 4 chromosomes

    • 180 Mb total sequence

    • 140 Mb euchromatic sequence

    • 12,000-14,000 genes

* source: G.M. Rubin


Drosophila Genome Project

  • Laboratories working on Drosophila sequencing:

    • BDGP (Berkeley Drosophila Genome Project)

    • EDGP (European Drosophila Genome Project)

    • Celera Genomics Inc.

  • “Complete” D. melanogaster sequence will be finished by the end of 1999

  • Comprehensive database - FlyBase


Goals of the Drosophila Genome Project

  • Complete genome sequence

  • Structure of all transcripts

  • Expression pattern of all genes

  • Phenotype resulting from mutation of all ORFs

  • And more...


Sequencing at the BDGP

  • Genomic sequence

    • P1 and BAC clones

    • 24 Mb of completed sequence (as of July 22, 1999)

    • 18 Mb of unfinished sequence in process

  • Complete tiling path in BACs

    • 1.5x-path draft sequencing

  • ESTs and cDNAs

    • 80,942 ESTs finished (as of March 19, 1999)

    • Over 800 full-length cDNAs



What sequence to start with?

  • Unit of sequencing at the BDGP

    • Completed high-quality clone sequences

  • Reassembling the genomic sequence

    • Need to place clones in correct genomic positions

    • Need to integrate genes that span multiple clones

    • Solved by using genomic overlaps to reconstitute full genomic sequence


Which analyses need to be run?

  • Similarity searches

    • BLAST (Altschul et al., 1990)

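      • BLASTN (nucleotide query against nucleotide databases)

      • BLASTX (six-frame translation of the query against amino acid databases)

      • TBLASTX (six-frame translation of the query against six-frame translations of nucleotide databases)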

    • sim4 (Miller et al., 1998)

      • Sequence alignment program for finding near-perfect matches between nucleotide sequences containing introns

  • Gene predictors

    • Genefinder (Green, unpublished)

    • GenScan (Burge and Karlin, 1997)

    • Genie (Reese et al., 1997)

  • Other analyses

    • tRNAscan-SE (Lowe and Eddy, 1996)


Which analyses need to be run and how?

  • mRNAs

    • ORFFinder (Frise, unpublished)

  • Protein translations

    • HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1; Sonnhammer et al. 1997, Bateman et al. 1999)

    • Ppsearch (Fuchs 1994) against ProSite (release 15.0) filtered with EMOTIF (Nevill-Manning et al. 1998)

    • Psort II (Horton and Nakai 1997)

    • ClustalW (Higgins et al. 1996)


What public sequence data sets are needed?

  • Automating updates of public databases:

    • Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP

  • Curated data sets

    • D. melanogaster genes (FlyBase)

    • Transposable elements (EDGP)

    • Repeat elements (EDGP)

    • STSs (BDGP)



How do you achieve computational throughput?

  • BDGP computing power

    • Sun Ultra 450 (3 machines, 4 processors each)

    • Sun Enterprise (1 machine, 8 processors)

    • Used these directly, without any system for distributed computing.

  • Job flow control: the Genomic Daemon

    • Automatic batch analysis of genomic clones

    • Berkeley Fly Database is used for queuing system and storage of results

    • Many clones can be analyzed simultaneously

    • Results are processed and saved in XML format for interactive browsing


What do you do with the results?

  • Berkeley Output Parser (BOP)

    • Input to BOP:

      • Genomic sequence

      • Results of computational analyses

      • Filtering preferences

    • Parses results from BLAST, sim4, GeneFinder, GenScan, and tRNAscan-SE analyses

    • Filters BLAST and sim4 results

      • Eliminates redundant or insignificant hits

      • Merges hits that represent single region of homology

    • Homogenizes results into single format

      • Output: sequence and filtered results in XML format
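
The filtering step described above (drop insignificant hits, merge hits that represent a single region of homology) can be sketched as follows. This is an illustration only, not BOP's actual code: the `Hit` record, the score cutoff, and the `max_gap` parameter are all assumptions.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str    # database entry that was hit
    start: int      # 0-based start on the genomic sequence
    end: int        # exclusive end on the genomic sequence
    score: float


def filter_and_merge(hits: List[Hit], min_score: float, max_gap: int = 0) -> List[Hit]:
    """Drop hits scoring below min_score, then merge hits to the same subject
    that overlap or lie within max_gap bases of each other."""
    kept = sorted((h for h in hits if h.score >= min_score),
                  key=lambda h: (h.subject, h.start, h.end))
    merged: List[Hit] = []
    for h in kept:
        last = merged[-1] if merged else None
        if last and last.subject == h.subject and h.start <= last.end + max_gap:
            # Extend the previous region and keep the better of the two scores
            merged[-1] = Hit(last.subject, last.start,
                             max(last.end, h.end), max(last.score, h.score))
        else:
            merged.append(h)
    return merged


# Illustrative hits only
hits = [
    Hit("geneX", 100, 200, 80.0),
    Hit("geneX", 150, 300, 95.0),   # overlaps the first hit
    Hit("geneX", 900, 950, 10.0),   # below the score cutoff
    Hit("geneY", 120, 180, 60.0),
]
result = filter_and_merge(hits, min_score=50.0)
```

Here the two overlapping geneX hits collapse into one region and the weak hit is discarded, so a curator sees two results instead of four.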


Is human curation needed?

  • Not for everything

    • Some features are obvious and can be identified computationally

      • Known D. melanogaster genes are detected automatically by GeneSkimmer

      • Repetitive elements

  • But still for many things

    • Annotating complete gene structure is still hard

    • We use CloneCurator (BDGP’s Java graphical editor) for curation


Gene Skimmer

  • Quick way of identifying genes in new sequence before curation

  • Start with XML output from BOP

  • Look for sim4 hits with known Drosophila genes

  • Find gene hits with sequence identity >98%, coverage >30%

  • Verify that hits represent real genes
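
The threshold step above can be sketched in a few lines. The slide gives only the cutoffs (identity >98%, coverage >30%); the sketch assumes identity means matching bases divided by aligned bases and coverage means aligned bases divided by gene length, and the tuple layout and gene names are hypothetical.

```python
from typing import Iterable, List, Tuple

# (gene name, gene length, aligned bases, matching bases); layout is an assumption
Alignment = Tuple[str, int, int, int]


def skim_genes(alignments: Iterable[Alignment],
               min_identity: float = 0.98,
               min_coverage: float = 0.30) -> List[str]:
    """Return the names of genes whose alignment exceeds both thresholds."""
    found = []
    for gene, gene_len, aligned, matches in alignments:
        if gene_len == 0 or aligned == 0:
            continue  # nothing aligned, so no ratios to compute
        identity = matches / aligned
        coverage = aligned / gene_len
        if identity > min_identity and coverage > min_coverage:
            found.append(gene)
    return found


# Illustrative numbers only
candidates = [
    ("geneA", 2000, 1500, 1495),  # identity ~0.997, coverage 0.75: kept
    ("geneB", 1800, 400, 399),    # coverage ~0.22: rejected
    ("geneC", 1000, 900, 850),    # identity ~0.944: rejected
]
skimmed = skim_genes(candidates)
```

Hits that pass would still go through the final step on the slide, verifying that they represent real genes.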


Gene Skimmer

URL: http://www.fruitfly.org/sequence/genomic-clones.html


CloneCurator

  • Displays computational results and annotations on a genomic clone

  • Interactive browsing

    • Zoom/scroll

    • Change cutoffs for display of results

    • Analyze GC content, restriction sites, etc.

  • Interactive annotation editing

    • Expert “endorses” selected results

  • Presents annotations to community via Web site


How do we annotate gene/protein function?

  • Gene Ontology Project

    • Controlled hierarchical vocabulary for multiple-genome annotations and comparisons

    • Standardized vocabulary facilitates collaboration

    • Good data modeling allows better database querying

    • Ontology browser provides interactive search of hierarchical terms

    • “GO” project (http://www.ebi.ac.uk/~ashburn/GO)




How do you distribute the data?

  • Bulk downloads

    • FASTA at http://www.fruitfly.org/sequence/download.html

    • Curated data sets

  • Tabular data

    • At http://www.fruitfly.org/sequence/

    • Sequenced genomic clones

    • Clone contigs sorted by genomic location

    • Clone contigs sorted by size

  • Ribbon provides integrated graphical view of annotations on physical contigs


Ribbon

  • Human curator annotates individual clones (~100Kb)

  • Clones are assembled into physical contigs (regions of physical map)

  • Clone annotations are merged and renumbered for display on whole physical contigs

  • Ribbon is our Java display tool for displaying curated annotations on physical contigs

  • Will soon be available on Web
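
Renumbering clone annotations for display on a contig amounts to a coordinate transform. The sketch below assumes half-open intervals, a known contig offset for each clone, and a flag for clones that lie in the opposite orientation; the function name and parameters are hypothetical and do not describe Ribbon's internals.

```python
from typing import Tuple


def clone_to_contig(start: int, end: int, clone_length: int,
                    clone_offset: int, reverse: bool = False) -> Tuple[int, int]:
    """Map a half-open interval [start, end) on a clone to contig coordinates.

    clone_offset is the contig position of the clone's leftmost base; reverse
    is True when the clone lies on the contig in the opposite orientation.
    """
    if not 0 <= start <= end <= clone_length:
        raise ValueError("interval lies outside the clone")
    if reverse:
        # Flip the interval end-for-end before shifting it onto the contig
        start, end = clone_length - end, clone_length - start
    return clone_offset + start, clone_offset + end


# Illustrative values: a ~90 kb clone placed 100 kb into its contig
forward = clone_to_contig(500, 1500, clone_length=90_000, clone_offset=100_000)
flipped = clone_to_contig(500, 1500, clone_length=90_000, clone_offset=100_000,
                          reverse=True)
```

A real merge would also have to reconcile features annotated twice in the overlap between neighboring clones, which this sketch leaves out.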



How do you manage the data?

  • Using Informix as our database server

  • Updated via the Perl DBI module (DBI.pm)

  • Development underway in

    • Schema revisions

    • GAME DTD (Genome Annotation Markup Entities)

    • Perl module for annotation objects

    • http://www.bioxml.org/ (Ewan Birney)


How do you maintain annotations?

  • Open questions

    • How frequently are annotations re-evaluated?

    • How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)?

    • How can differences between old and new computational results be detected?

    • Changes in computational results may need to trigger changes in curated annotations


Integrated annotation systems

  • ACeDB

  • Genotator

  • Magpie

  • GAIA

  • TIGR


Integrated annotation systems: ACeDB

  • Developed for analysis of the C. elegans genome

  • Sophisticated database designed for storing annotations and related information

  • New Java and Web-based versions available

  • Written by Jean Thierry-Mieg and Richard Durbin

  • http://www.sanger.ac.uk/Software/Acedb/



Genotator

  • Back end automates sequence analysis; browser provides interactive viewing and editing of annotations

  • Nomi Harris (1997), Genome Research 7(7), 754-762.

  • http://www-hgc.lbl.gov/inf/annotation.html


Magpie

  • Expert system based (PROLOG)

    • Data collection daemon

    • Data analysis and report daemon

  • “Intelligent” integration of various individual feature prediction systems

  • Allows human interactions

  • Gaasterland and Sensen (1996), TIG, 12, 76-78.

  • http://genomes.rockefeller.edu/magpie/magpie.html


GAIA

  • Web-based system

  • Results displayed as Java applets

  • Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J. Crabtree, D.B. Searls, and G.C. Overton (1998), Genome Research.

  • http://daphne.humgen.upenn.edu:1024/gaia/


TIGR Human Gene Index

  • Gene Indices for various organisms

  • Databases for transcribed genes linked into external/internal genomic databases

  • Internal backend analysis software

  • http://www.tigr.org/tdb/tdb.html


Computational analysis tools

  • Gene finding

  • Repeat finding

  • EST/cDNA alignment

  • Homology searching

    • BLAST, FASTA, HMM-based methods, etc.

  • Protein family searching

    • PFAM, Prosite, etc.


Gene finding: Prokaryotes vs. Eukaryotes

  • Prokaryotes

    • Contiguous open reading frames (ORF)

    • Short intergenic sequences

    • Good method: detecting large ORFs

    • Complications:

      • Partial sequences

      • Sequencing errors

      • Start codon prediction

      • Overlapping genes on both strands
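
The "detect large ORFs" method for prokaryotes can be sketched as a scan of all six reading frames. This is a minimal illustration, assuming ATG as the only start codon and the standard three stop codons; it ignores the complications listed above (partial sequences, sequencing errors, alternative start codons).

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> List[Tuple[str, int, int]]:
    """Return (strand, start, end) for each ATG...stop ORF with at least
    min_codons codons before the stop. Coordinates are 0-based, half-open,
    include the stop codon, and refer to the strand that was scanned."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                      # first ATG opens the ORF
                elif codon in STOP_CODONS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


# Toy sequence with one short ORF: ATG AAA TTT GGG TAA
toy_orfs = find_orfs("CCATGAAATTTGGGTAACC", min_codons=3)
```

The default of 100 codons is an arbitrary assumption; a short cutoff finds many spurious ORFs, which is why long ORFs are the informative ones.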


Gene finding: Prokaryotes vs. Eukaryotes

  • Eukaryotes

    • Complex gene structures (exon/introns)

      • D. melanogaster has an average of 4 introns/gene

      • Very long genes (D. melanogaster X gene 160 kb)

      • Very long introns

      • Many introns

      • “Nested”, overlapping, and alternatively spliced genes

      • 5’ UTRs with non-coding exons

      • Long 3’ UTRs

      • Complex transcription machinery

    • ORF-finding alone is not adequate


Integrated gene finding

  • Assumptions

    • Signal sensors and content sensors alone are not sufficient for predicting gene structure

    • Gene structure is hierarchical

    • Each component (exon, intron, splice site, etc.) can be modeled independently

  • The approach

    • Generate a list of candidates for each component (with scores)

    • Assemble the components into a “gene model”


Integrated gene finding: Dynamic programming

  • Determines the best combination of components

  • Two-part problem:

    • Develop an “optimal” scoring function

    • Use dynamic programming to find an “optimal” path through the scoring matrix
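
As a rough illustration of the second part, the sketch below chains scored exon candidates into the highest-scoring set of non-overlapping exons. It is a simplified stand-in, not the method of any particular gene finder: the additive scoring, the `min_intron` parameter, and the omission of reading-frame and splice-signal compatibility are all assumptions.

```python
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (start, end, score), 0-based half-open


def best_gene_model(candidates: List[Exon], min_intron: int = 0) -> Tuple[float, List[Exon]]:
    """Return (total score, exon chain) for the best chain of non-overlapping
    exons in which consecutive exons are at least min_intron bases apart."""
    exons = sorted(candidates, key=lambda e: e[1])
    n = len(exons)
    if n == 0:
        return 0.0, []
    best = [0.0] * n   # best[i]: top score of any chain that ends with exon i
    prev = [-1] * n    # prev[i]: index of the exon before i in that chain
    for i, (start, _end, score) in enumerate(exons):
        best[i] = score
        for j in range(i):
            if exons[j][1] + min_intron <= start and best[j] + score > best[i]:
                best[i] = best[j] + score
                prev[i] = j
    # Trace back from the best-scoring final exon
    i = max(range(n), key=lambda k: best[k])
    total = best[i]
    chain = []
    while i != -1:
        chain.append(exons[i])
        i = prev[i]
    return total, chain[::-1]


# Illustrative candidates with made-up scores
candidates = [(0, 100, 5.0), (50, 150, 6.0), (200, 300, 4.0),
              (280, 400, 3.0), (450, 500, 2.0)]
total, model = best_gene_model(candidates, min_intron=40)
```

The quadratic inner loop is fine for a handful of candidates per gene; the scoring function, the first part of the problem, is where real systems differ most.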



Integrated gene finding: Linear and Quadratic Discriminant Analysis (LDA/QDA)

  • LDA

    • Deterministic calculation of thresholds

    • n-class discrimination

    • Example:

      • HSPL, Solovyev et al. (1997), ISMB, 5, 294-302.

  • QDA

    • Can represent a great improvement over LDA

    • Example:

      • MZEF, Michael Zhang (1997), PNAS, 94, 565-568.


Integrated gene finding: Feed-forward neural networks

  • Supervised learning

  • Training to discriminate between several feature classes

  • Computing units

  • Gradient descent optimization

  • Multi-layer networks

  • Limitations

    • Black-box predictions

    • Local minima

  • Example:

    • GRAIL, Uberbacher et al. (1991), PNAS, 88, 11261-11265.


Approaches to gene finding: Hidden Markov models

  • Model

    • A finite model describing a probability distribution over all possible sequences of equal length

    • “Natural” scoring function

    • (Conditional) Maximum likelihood “training”

  • Markov

    • k-th order Markov chain: the current state depends on the k previous states

    • In a 1st-order Markov model, the next state depends only on the current state

  • Hidden

    • Hidden states generate visible symbols

  • Assumptions

    • Independence of states

      • No long range correlation

  • Example: HMMgene, A. Krogh (1998), In Guide to Human Genome Computing, 261-274.
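
To make the "hidden states generate visible symbols" idea concrete, here is a minimal Viterbi decoder for a toy two-state HMM. The states ("C" for a GC-rich, coding-like region and "N" for an AT-rich one) and all probabilities are invented for illustration and are not taken from HMMgene or any trained model.

```python
import math
from typing import Dict, List, Sequence


def viterbi(obs: str,
            states: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable hidden-state path for obs (log-space Viterbi).
    All probabilities must be non-zero, since their logarithms are taken."""
    if not obs:
        return []
    # scores[s]: log-probability of the best path so far that ends in state s
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers: List[Dict[str, str]] = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(trans_p[best_prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model: every number below is an illustrative assumption
STATES = ("C", "N")
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {"C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

labels = "".join(viterbi("AATTGCGCGCAATT", STATES, START, TRANS, EMIT))
```

The GC-rich run in the middle is long enough to outweigh the cost of two state switches, so it is labeled "C" while the AT-rich flanks are labeled "N".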


Approaches to gene finding: Generalized hidden Markov models

  • Each HMM state can be a probabilistic sub-model

  • Complex hierarchical system

  • Requires care in modeling state overlaps

  • Examples:

    • Genie, Kulp et al. (1996), ISMB, 4, 134-142

    • GenScan, Burge and Karlin (1997), JMB, 268(1), 78-94


Gene finding software

  • Signal recognition

    • Promoter prediction

    • Splice site prediction

    • Start codon prediction

    • Poly-adenylation site prediction

  • Coding potential

  • Coding exons

  • Gene structure prediction

    • Spliced alignment

    • LDA/QDA

    • Neural networks

    • HMMs and GHMMs


Promoter recognition

  • PromoterScan

    • Identify potential promoter regions

    • Based on databases of known TF binding sites

      • TFD (Ghosh (1991), TIBS, 16, 445-447)

      • TRANSFAC (Heinemeyer et al. (1999), NAR, 27, 318-322)

    • Prestridge (1995), JMB, 249, 923-932

    • http://bimas.dcrt.nih.gov/molbio/proscan/

  • MatInd and MatInspector

    • Finding consensus matches to known TF binding sites

    • Based on TRANSFAC

      • Heinemeyer et al. (1999), NAR, 27, 318-322

    • Quandt et al. (1995), NAR, 23, 4878-4884.

    • http://transfac.gbf.de/TRANSFAC/


Promoter recognition (cont.)

  • TSSG/TSSW

    • LDA based combination of several features (TATA-box, Inr signal, upstream regions)

    • Solovyev et al. (1997), ISMB, 5, 294-302.

    • http://genomic.sanger.ac.uk/gf/gf.shtml

  • Transcription Element Search Software

    • Identify TF binding sites

    • Based on TRANSFAC

    • http://agave.humgen.upenn.edu/tess/index.html


Promoter recognition (cont.)

  • CBS Promoter 2.0 Prediction Server

    • Simulated transcription factors

    • Principles common to neural networks and genetic algorithms

    • Knudsen (1999), Bioinformatics, 15(5), 356-361.

    • http://genome.cbs.dtu.dk/services/promoter/

  • CorePromoter

    • Position dependent 5-tuple

    • QDA

    • Michael Zhang (1998), Genome Research, 8, 319-326.

    • http://scislio.cshl.org/genefinder/CPROMOTER/


Promoter recognition (cont.)

  • Neural network promoter prediction (NNPP)

    • Time-delay neural network

    • Combining TATA box and initiator

    • Reese (1999), in preparation.

    • http://www-hgc.lbl.gov/projects/promoter.html



Promoter recognition (cont.)

  • Markov chain promoter finder

    • Competing interpolated Markov chains for promoters, exons, introns

    • Promoter model consists of five states representing the core promoter parts

    • Ohler, Reese et al. (1999), Bioinformatics, 15(5), 362-369.


Splice site prediction

  • Nakata, 1985

    • Nakata (1985), NAR, 13(14), 5327-5340.

  • BCM GeneFinder

    • HSPL - Prediction of splice sites in human DNA sequences

    • Triplet frequencies in various functional parts of splice site regions

    • Combined with codon statistics

    • Solovyev et al. (1994), NAR, 22(24), 5156-5163.

    • http://genomic.sanger.ac.uk/gf/gf.shtml


Splice site prediction (cont.)

  • Neural Network splice site predictor (NNSPLICE)

    • Multi-layered feed-forward neural network

    • Modeled after Brunak et al. (1991), JMB, 220, 49-65.

    • Reese et al. (1997), JCB, 4(3), 311-323.

    • http://www-hgc.lbl.gov/projects/splice.html

  • NetGene2

    • Combination of neural networks and rule-based system

    • Splice site signal neural network combined with coding potential

    • Hebsgaard et al. (1996), NAR, 24(17), 3439-3452.

    • Brunak et al. (1991), JMB, 220, 49-65.

    • http://www.cbs.dtu.dk/services/NetGene2/


Splice site prediction (cont.)

  • SplicePredictor

    • Logitlinear models for splice site regions

      • Degree of matching to the splice site consensus

      • Local compositional contrast

    • Brendel and Kleffe (1998), NAR, 26(20), 4748-4757.

    • http://gnomic.stanford.edu/~volker/SplicePredictor.html


Start codon prediction

  • NetStart

    • Trained on cDNA-like sequences

    • Neural network based

      • Local start codon information

      • Global sequence information

    • Pedersen and Nielsen (1997), ISMB, 5, 226-233.

    • http://www.cbs.dtu.dk/services/NetStart/


Poly-adenylation signal prediction

  • BCM GeneFinder

    • POLYAH - Recognition of 3'-end cleavage and poly-adenylation region

    • Triplet frequencies in various functional parts in poly-adenylation regions

    • LDA

    • Solovyev et al. (1994), NAR, 22(24), 5156-5163.

    • http://genomic.sanger.ac.uk/gf/gf.shtml


Prediction of coding potential

  • Periodicity detection

    • Coding sequences have an inherent periodicity of three

    • Especially good on long coding sequences

    • Auto-correlation

      • Seeking the strongest response when shifted sequence is compared with original

      • Michel (1986), J. Theor. Biol. 120, 223-236.

    • Fourier transformation: Spectral analysis

      • Detection of a spectral peak at frequency 1/3, i.e. a period of three nucleotides

      • Silverman and Linsker (1986), J. Theor. Biol., 118, 295-300.
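
The spectral approach can be illustrated by evaluating the discrete Fourier transform at the single frequency 1/3 for each base's indicator sequence. This is a minimal sketch of the general idea, not the procedure of the cited papers; the normalization by sequence length is an assumption.

```python
import cmath


def period3_power(seq: str) -> float:
    """Summed spectral power at frequency 1/3 over the four base-indicator
    sequences, divided by the sequence length."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / n


periodic = period3_power("GCA" * 10)    # perfect period-3 signal
aperiodic = period3_power("ACGT" * 3)   # period 4: no power at 1/3
```

A perfectly three-periodic sequence gives a large value, while the period-4 sequence gives essentially zero. Real coding regions fall in between, and the signal is stronger on long sequences, as the slide notes.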


Prediction of coding potential (cont.)

  • Trifonov (1980;1987)

    • G-notG-U periodicity

    • JMB, 194, 643-652.

  • Fickett (1982)

    • Position asymmetry in the three codon positions

    • NAR, 10(17), 5303-5318.

  • Staden (1984)

    • Codon usage in tables

    • NAR, 12, 551-567.


Prediction of coding potential (cont.)

  • Claverie and Bougueleret (1987)

    • Hexamer frequency differentials

    • NAR, 14, 179-196.

  • Fichant and Gautier (1987)

    • Codon usage homogeneity

    • CABIOS, 3(4), 287-295.

  • GRAIL I (1991)

    • Neural network using a shifting fixed size window

    • 7 sensors as input, 2 hidden layers and 1 unit as output

    • Uberbacher et al. (1991), PNAS, 88(24), 11261-11265.


Prediction of coding potential (cont.)

  • GeneMark (1986)

    • Inhomogeneous Markov chain models

    • Easily trainable (closed-form solution for maximum likelihood)

    • Used extensively in prokaryotic genomes

    • Borodovsky et al. (1993), Computers & Chemistry, 17, 123-133.

  • Glimmer (1998)

    • Interpolated Markov chains from first to eighth order

    • Salzberg et al. (1998), NAR, 26(2), 544-548.

    • http://www.tigr.org/softlab/glimmer/glimmer.html
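
The Markov chain idea behind these tools can be sketched as a log-odds score between a "coding" and a "non-coding" model. The version below is deliberately simplified: it uses a single homogeneous k-th order chain per class, whereas GeneMark uses inhomogeneous (codon-position-dependent) chains and Glimmer interpolates between orders. The pseudocount and the toy training strings are assumptions.

```python
import math
from collections import defaultdict
from typing import Callable, Iterable

LogProb = Callable[[str, str], float]


def train_markov(seqs: Iterable[str], k: int = 2, pseudo: float = 1.0) -> LogProb:
    """Estimate a k-th order Markov chain by counting; the closed-form
    maximum-likelihood estimate is smoothed with a pseudocount."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def log_prob(context: str, base: str) -> float:
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudo
        return math.log((seen.get(base, 0.0) + pseudo) / total)

    return log_prob


def log_odds(seq: str, coding: LogProb, noncoding: LogProb, k: int = 2) -> float:
    """Sum of per-base log-likelihood ratios; positive favors 'coding'."""
    return sum(coding(seq[i - k:i], seq[i]) - noncoding(seq[i - k:i], seq[i])
               for i in range(k, len(seq)))


# Toy training data, invented for illustration
coding_model = train_markov(["GCAGCAGCAGCAGCA"])
noncoding_model = train_markov(["ATATATATATATAT"])

coding_like = log_odds("GCAGCAGCA", coding_model, noncoding_model)
noncoding_like = log_odds("ATATATAT", coding_model, noncoding_model)
```

Training is just counting, which is the "closed-form solution" that makes these models easy to retrain for a new organism.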


Prediction of coding potential (cont.)

  • Review by Fickett (1992)

    • “Assessment of protein coding measures”, NAR, 20, 6441-6450.


Prediction of coding exons

  • SorFind

    • Detection of “spliceable” ORFs

    • Hutchinson, NAR, 20(13), 3453-3462.

  • BCM GeneFinder

    • FEXD, FEXN, FEXA, FEXY, FEXH, HEXON

    • LDA

    • Solovyev et al. (1994), NAR, 22(24), 5156-5163.

    • http://genomic.sanger.ac.uk/gf/gf.shtml

  • GRAIL II

    • Exon candidates, heuristic integration, learning with neural network

    • Uberbacher et al., Genet. Eng., 16, 241-253.

    • http://compbio.ornl.gov/


“Integrated” gene models: LDA/QDA

  • FGene

    • LDA based

    • Dynamic programming for the integration of LDA output

    • Solovyev et al. (1995), ISMB, 3, 367-375.

    • http://genomic.sanger.ac.uk/gf/gf.shtml


“Integrated” gene models: NN

  • GeneParser

    • “Gene-parsing” approach

    • Potential alternative splicing recognized

    • Neural network and dynamic programming

    • Snyder and Stormo (1995), JMB, 248, 1-18.


“Integrated” gene models: Artificial intelligence approaches

  • GeneID

    • Rule-based system

    • Homology integration

    • Guigó et al. (1992), JMB, 226, 141-157.

    • http://www1.imim.es/geneid.html

  • GeneID using DP

    • DP to combine a set of potential exons

    • Guigó et al. (1998), JCB, 5, 681-702.
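The "DP to combine a set of potential exons" idea can be pictured as a chaining problem. The following is only a minimal sketch and not GeneID's actual algorithm: it assumes each candidate exon is a hypothetical `(start, end, score)` tuple with inclusive coordinates, and it ignores reading-frame and splice-site compatibility, which a real gene assembler has to enforce.

```python
from bisect import bisect_left


def best_exon_chain(exons):
    """Pick the highest-scoring set of non-overlapping candidate exons.

    exons: list of (start, end, score) tuples (hypothetical format).
    Returns (total_score, chain) with the chain ordered by position.
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the best (score, chain) using only the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this start
        j = bisect_left(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]
```

For example, `best_exon_chain([(1, 10, 5.0), (8, 20, 6.0), (12, 18, 3.0), (21, 30, 4.0)])` prefers the three compatible exons over the single higher-scoring exon that overlaps two of them.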


“Integrated” gene models: Artificial intelligence approaches

  • GenLang

    • Syntactic pattern recognition system

    • Formal grammar

    • Tools from computational linguistics

    • Dong and Searls (1994), Genomics, 23, 540-551.

    • http://cbil.humgen.upenn.edu/~sdong/genlang_home.html


“Integrated” gene models: HMMs

  • HMMGene

    • Several genes per sequence possible

    • User constraints possible

    • Krogh (1997), ISMB, 5, 179-186.

    • http://www.cbs.dtu.dk/services/HMMgene/

  • GeneMark.hmm

    • Based on GeneMark program for bacterial sequences

    • Can predict frame shifts

    • Trained for various organisms

    • Lukashin and Borodovsky (1998), NAR, 26, 1107-1115.

    • http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html


“Integrated” gene models: GHMMs

  • Genie

    • Generalized hidden Markov model with length distribution

    • Integration of multiple content and signal sensors

      • Content: codon statistics, repeats, intron, intergenic, database homology hits

      • Signal: promoter, start codon, splice sites, stop codon

    • Dynamic programming to find optimal parse

    • Several genes per sequence possible

    • Kulp et al. (1996), ISMB, 4, 134-142.

    • Reese et al. (1997), JCB, 4(3), 311-323.

    • http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie


Example: Genie


“Integrated” gene models: GHMMs

  • GenScan

    • Multiple content and signal models

    • Hidden semi-Markov model sensors with length distributions

    • Takes GC content into account (separate models)

    • Several genes per sequence possible

    • Burge and Karlin (1997), JMB, 268(1), 78-94.

    • http://CCR-081.mit.edu/GENSCAN.html


EST/cDNA alignment for gene finding: Spliced alignments

  • PROCRUSTES

    • Spliced alignment algorithm

    • Dynamic programming to combine a set of potential exons

    • Frame conservation

    • Homologous sequence needed

    • Gelfand et al. (1996), PNAS, 93, 9061-9066.

    • http://hto-13.usc.edu/software/procrustes/


EST/cDNA alignment

  • Sim4

    • Aligns cDNA to genomic sequence

    • Uses local similarity

    • Florea et al. (1998), Genome Research, 8, 967-974.

  • GeneWise

    • Dynamic programming

    • Partial genes allowed

    • Based on Pfam and statistical splice site models

    • Birney (1999), unpublished

    • http://www.sanger.ac.uk/Software/Wise2


EST/cDNA alignment (cont.)

  • ACEMBLY

    • Aligns ESTs to genomic sequence

    • Identifies alternative splicing

    • Integrated in ACeDB

    • Jean Thierry-Mieg (unpublished)


Repeat finders

  • Censor

    • Uses database of repeat sequences

    • Jurka et al. (1996), Comp. and Chem., 20(1), 119-122.

  • BLAST

    • Integrated masking operations

    • XBLAST procedure

      • Claverie (1994), In Automated DNA Sequencing and Analysis Techniques, M. D. Adams, C. Fields and J. C. Venter, eds., 267-279.

    • http://www.ncbi.nlm.nih.gov/BLAST


Repeat finders (cont.)

  • RepeatMasker

    • Detection of interspersed repeats

    • Smit and Green, unpublished results

    • http://ftp.genome.washington.edu/RM/RepeatMasker.html


Homology searching

  • BLAST suite

    • BLASTN, BLASTX, TBLASTX, PSI-BLAST

    • Altschul et al. (1990), JMB, 215, 403-410.

    • http://www.ncbi.nlm.nih.gov/BLAST

  • FASTA suite

    • FASTA, TFASTA

    • Pearson and Lipman (1988), PNAS, 85, 2444-2448.

  • HMM-based searching

    • SAM (UCSC group)

      • http://www.cse.ucsc.edu/research/compbio/sam.html

    • HMMER, Sean Eddy

      • http://hmmer.wustl.edu/


Gene family searching

  • BLOCKS

    • http://www.blocks.fhcrc.org

  • PROSITE

    • http://www.expasy.ch/prosite/

  • PFAM

    • http://pfam.wustl.edu/

  • SCOP

    • http://scop.mrc-lmb.cam.ac.uk/scop/


The genome annotation experiment (GASP1)

  • Genome Annotation Assessment Project (GASP1)

  • Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA

  • Open to everybody, announced on several mailing lists

  • Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods.

  • “CASP” like

  • 12 participating groups



Goals of the experiment

  • Compare and contrast various genome annotation methods

  • Objective assessment of the state of the art in gene finding and functional site prediction

  • Identify outstanding problems in computational methods for the annotation process


Adh contig

  • 2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best studied genomic regions

    • From chromosome 2L (34D-36A)

    • Ashburner et al. (to appear in Genetics)

    • 222 gene annotations (as of July 22, 1999)

    • 375,585 bases are coding (12.95%)

  • We chose the Adh region because it was thought to be typical: a representative test bed for evaluating annotation techniques.
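As a quick check of the coding percentage quoted above, the arithmetic can be sketched as follows. The contig length is an assumption: the slide only says "2.9 Mb", which is taken here as exactly 2,900,000 bases.

```python
CODING_BASES = 375_585     # from the slide
CONTIG_LENGTH = 2_900_000  # assumption: "2.9 Mb" read as exactly 2.9 million bases


def coding_percent(coding_bases, total_bases):
    """Return the percentage of bases that are annotated as coding."""
    return 100.0 * coding_bases / total_bases


print(round(coding_percent(CODING_BASES, CONTIG_LENGTH), 2))
```

Under that assumption the result agrees with the 12.95% given on the slide.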


Adh paper (to appear in Genetics)

URL: http://www.fruitfly.org/publications/PDF/ADH.pdf


Raw sequence: Adh.fa

GAATTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTAAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCC
TTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGTGCGGCGATCTCGTACTGGACGGAAATGTCAGGAGATAGGAGAAGAAAA


Drosophila data sets provided to participants

  • Curated Drosophila nuclear DNA "coding sequences" (CDS)

  • Curated non-redundant Drosophila genomic DNA data (275 “multi”- and 144 “single”-exon sequence entries from Genbank)

  • Drosophila 5' and 3' splice sites

  • Drosophila start codon sites

  • Drosophila promoter sequences

  • Drosophila repeat sequences

  • Drosophila transposon sequences

  • Drosophila cDNA sequences

  • Drosophila EST sequences

URL: http://www.fruitfly.org/GASP1/data/data.html


Timetable

  • May 13, 1999 - June 30, 1999

    • Distribution of the sample sequence and associated data to the predictors. Collection of predictions.

  • June 30, 1999 - July 31, 1999

    • Evaluation of the predictions by the Drosophila Genome Center.

  • August 4, 1999

    • External expert assessment of the prediction results (HUGO meeting, EMBL)

  • August 6, 1999

    • Tutorial #3 at the ISMB ‘99 conference in Heidelberg, Germany


Resources for assessing predictions

  • 80 cDNA sequences NOT in Genbank before experiment deadline

    • Sequenced from 5 different cDNA libraries

    • 3 paralogs to other genes in the genome

    • 19 cDNAs with cloning artifacts

      • 2 apparently representing unspliced RNA

      • Multiple inserts (2 cDNAs cloned in the same vector)

    • 58 “usable” cDNAs

  • 33 cDNA sequences in Genbank during experiment

  • Annotations from Adh paper


Curated data sets for assessing predictions

  • Standard 1 (Adh.std1.gff) “conservative gene set”

    • 43 gene structures (7 single- and 36 multi- coding exon genes)

    • Criteria for inclusion:

      • >=95% (most >=99%) of the cDNA aligned to genomic DNA (using sim4)

      • “GT”/“AG” splice site consensus sequences

      • Splice site score from neural net

        • 5’ splice sites: >=0.35 threshold (98% True Positive score)

        • 3’ splice sites: >=0.25 threshold (92% True Positive score)

      • Start codon and stop codon annotations from Standard 3 (derived from Adh paper)

    • These 43 genes represent “typical” genes
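The Standard 1 inclusion criteria read naturally as a filter. The sketch below is illustrative only and is not the curation code used for the experiment: the `CandidateGene` record and its field names are hypothetical, while the thresholds (95% alignment, GT/AG consensus, 0.35 and 0.25 splice scores) are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CandidateGene:
    """Hypothetical record for one cDNA-supported gene structure."""
    aligned_fraction: float  # fraction of the cDNA aligned to genomic DNA
    # one tuple per intron: (donor dinucleotide, acceptor dinucleotide,
    #                        5' splice score, 3' splice score)
    introns: List[Tuple[str, str, float, float]] = field(default_factory=list)


def passes_standard1(gene, min_aligned=0.95, min_donor=0.35, min_acceptor=0.25):
    """Apply the alignment, consensus and splice-score criteria."""
    if gene.aligned_fraction < min_aligned:
        return False
    for donor, acceptor, score5, score3 in gene.introns:
        if donor.upper() != "GT" or acceptor.upper() != "AG":
            return False
        if score5 < min_donor or score3 < min_acceptor:
            return False
    return True
```

A single-exon gene has no introns, so in this sketch it is accepted on the alignment criterion alone.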


Curated data sets for assessing predictions

  • Standard 2 (Adh.std2.gff)

    • Superset of Standard 1

    • 15 additional gene structures

    • Same alignment criteria as Standard 1 but no splice site consensus requirement

    • Not used in the experiment


Curated data sets for assessment

  • Standard 3 (Adh.std3.gff) “more complete gene set”

    • 222 gene structures (39 single- and 183 multi- coding exon genes)

    • Criteria:

      • Annotated as described in Ashburner et al.

      • cDNA to genomic alignment using sim4

      • Start codons predicted by ORFFinder (Frise et al., unpublished)

      • ~182 genes have similarity to a homologous protein sequence in another organism or have a Drosophila EST hit

        • Edge verification by partial EST/cDNA alignments

        • BLASTX, TBLASTX homology results

        • PFAM alignments

        • Gene structure verification using GenScan (human)

      • 14 genes had EST/homology hits but no gene finding predictions

      • ~40 genes only have “strong” GenScan predictions


Submission format

  • GFF (Durbin and Haussler, 1998, unpublished)

    • http://www.sanger.ac.uk/Software/GFF/


Sample submission

# organism: Drosophila melanogaster

# std1

Adh std1 TFBS 32002 32006 . + .

Adh std1 TATA_signal 32009 32012 . + . transcript "1"

Adh std1 TSS 32033 32034 . + . transcript "1"

Adh std1 prim_transcript 32034 33122 . + . transcript "1"

Adh std1 exon 32034 32277 . + . transcript "1"

Adh std1 start_codon 32122 32124 . + . transcript "1"

Adh std1 CDS 32122 32277 . + . transcript "1"

Adh std1 splice5 32277 32278 . + . transcript "1"

Adh std1 splice3 32332 32333 . + . transcript "1"

Adh std1 exon 32785 32830 . + . transcript "1"

Adh std1 CDS 32785 32830 . + . transcript "1"

Adh std1 splice5 32830 32831 . + . transcript "1"

Adh std1 splice3 32825 32826 . + . transcript "1"

Adh std1 CDS 32826 33003 . + . transcript "1"

Adh std1 exon 32826 33122 . + . transcript "1"

Adh std1 stop_codon 33001 33003 . + . transcript "1"

Adh std1 polyA_signal 33090 33095 . + . transcript "1"

Adh std1 polyA_site 33101 33102 . + . transcript "1"

Adh std1 prim_transcript 38100 41973 . - . transcript "2"

Adh std1 exon 38100 41973 . - . transcript "2"

Adh std1 polyA_site 39620 39621 . - . transcript "2"

Adh std1 polyA_signal 39685 39690 . - . transcript "2"

Adh std1 stop_codon 40125 40127 . - . transcript "2"

Adh std1 CDS 40125 40390 . - . transcript "2"

Adh std1 start_codon 40388 40390 . - . transcript "2"

Adh std1 TSS 41973 41974 . - . transcript "2"

Adh std1 TATA_signal 41998 42001 . - . transcript "2"

Adh std1 TFBS 42187 42193 . - .

Adh std1 TFBS 42211 42216 . - .

Gene 1: features of transcript "1" (+ strand)

Gene 2: features of transcript "2" (- strand)
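Records like those in the sample submission can be read with a few lines of code. This is a minimal sketch, assuming whitespace-separated fields as they appear in this transcript (real GFF files are tab-delimited) and treating everything after the eighth field as the optional group; the function names are illustrative, not part of any GFF toolkit.

```python
from collections import defaultdict


def parse_gff_line(line):
    """Split one GFF line into a dict; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 8)  # assumption: whitespace-separated fields
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields: {line!r}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand, "frame": frame,
        "group": fields[8] if len(fields) > 8 else None,
    }


def group_by_transcript(lines):
    """Collect parsed records under their group field (None if ungrouped)."""
    groups = defaultdict(list)
    for line in lines:
        record = parse_gff_line(line)
        if record is not None:
            groups[record["group"]].append(record)
    return dict(groups)
```

Grouping by the last field recovers the two genes of the sample, while ungrouped features such as the TFBS lines end up under `None`.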


Submissions

  • MAGPIE Team

    • Credit

      • Terry Gaasterland, Alexander Sczyrba, Elizabeth Thomas, Gulriz Kurban, Paul Gordon, Christoph Sensen

      • Laboratory for Computational Genomics, Rockefeller and Institute for Marine Biosciences, Canada

    • Method

      • Automatic genome analysis system integrating Drosophila Genscan predictions, confirmation of exon boundaries using database searches, repeat finding (Calypso, REPuter) and gene function annotations.


Submissions (cont.)

  • References

    • “Multigenome MAGPIE” poster at ISMB ‘99.

    • Gaasterland and Ragan (1998), J. of Microbial and Comparative Genomics, 3, 305-312.

    • Gaasterland and Sensen (1996), Biochimie, 78, 302-310.

    • REPuter: Kurtz and Schleiermacher (1999), Bioinformatics, 15(5), 426-427.


Submissions (cont.)

  • Computational Genomics Group, The Sanger Centre

    • Credit

      • Victor Solovyev, Asaf Salamov

    • Method

      • Discriminant analysis based gene prediction programs FGenes (trained for human) and FGenesH (trained for Drosophila); the output of FGenes, FGenesH and BLAST is combined using FGenesH+. Three different “threshold” annotations were submitted.

      • The program's running time is linear in the sequence length.

      • Automatic, plus additional user interactive screening.

      • Non-redundant NCBI database used for BLAST.

    • URL/References

      • http://genomic.sanger.ac.uk/gf/gf.shtml


Submissions (cont.)

  • Genome Annotation Group, The Sanger Centre

    • Credit

      • Ewan Birney

    • Method

      • Protein family based gene identification using Wise2 (previously Genewise) and PFAM.

    • URL

      • http://www.sanger.ac.uk/Software/Wise2


Submissions (cont.)

  • Pattern Recognition, The University of Erlangen

    • Credit

      • Uwe Ohler, Georg Stemmer, Stefan Harbeck, Heinrich Niemann

    • Method

      • Promoter recognition based on interpolated Markov chains; “Genscan” like promoter model (MCPromoter); maximal mutual information based estimation of interpolated Markov chains.

      • Automatic.

      • Promoter training data set from http://www.fruitfly.org/data/genesets


Submissions (cont.)

  • References

    • Ohler, Harbeck, Niemann, Noeth and Reese (1999), Bioinformatics, 15(5), 362-369.

    • Ohler, Harbeck and Niemann (1999), Proc. EUROSPEECH, to appear.

  • URL

    • http://www5.informatik.uni-erlangen/HTML/English/Research/Promoter


Submissions (cont.)

  • Computational Biosciences, Oak Ridge National Laboratory

    • Credit

      • Richard J. Mural, Douglas Hyatt, Frank Larimer, Manesh Shah, Morey Parang

    • Method

      • Integrated neural network based system including gene assembly using EST and homology information (GRAILexp).

    • URL:

      • http://compbio.ornl.gov/droso


Submissions (cont.)

  • Center for Biological Sequence Analysis, Technical University of Denmark

    • Credit

      • Anders Krogh

    • Method

      • Modular HMM incorporating database hits (proteins and ESTs/cDNAs) and other “external information” probabilistically (HMMGene); the HMM has modules for coding regions, splice sites, translation start/stop, etc.

      • It will be a fully automated system.

      • Trained on Drosophila data

        • http://www.fruitfly.org/GSAC1/data/data.html

      • and

        • Victor Solovyev (personal communication)


Submissions (cont.)

  • References

    • Krogh (1998), In S.L. Salzberg et al., eds., Computational Methods in Molecular Biology, 45-63, Elsevier.

    • Krogh (1997), In Gaasterland et al., eds., Proc. ISMB 97, 179-186.

    • http://www.cbs.dtu.dk/krogh/refs.html

  • URL

    • http://www.cbs.dtu.dk/services/HMMgene/

    • Not yet for Drosophila.


Submissions (cont.)

  • BLOCKS group, Fred Hutchinson Cancer Research Center in Seattle, Washington

    • Credit

      • Jorja Henikoff, Steve Henikoff

    • Method

      • DNA translation in 6 frames and search against BLOCKS+ and against BLOCKS extracted from Smart3.0 (http://coot-embl-heidelberg.de/SMART/) using BLIMPS; automatic post-processing to join multiple predictions from the same block.

      • Automatic with some user interactive screening of results.
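The first step of this method, translating DNA in six frames, can be sketched as below. This is an illustrative implementation using the standard genetic code, not the BLOCKS group's own code; the function names are hypothetical, and codons containing characters other than A, C, G or T are rendered as "X".

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Each of the six protein strings could then be searched against a block database; that search step is outside the scope of this sketch.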


Submissions (cont.)

  • References

    • Henikoff, Henikoff and Pietrokovski (1999), Nucl. Acids Res., 27, 226-228.

    • Henikoff and Henikoff (1994), Proc. 27th Ann. Hawaii Intl. Conf. On System Sciences, 265-274.

    • Henikoff and Henikoff (1994), Genomics, 19, 97-107.

  • URL

    • http://blocks.fhcrc.org

    • http://blocks.fhcrc.org/blocks-bin/getblock.sh?<block name>


Submissions (cont.)

  • Genome Informatics Team, IMIM, Barcelona, Spain

    • Credit

      • Roderic Guigó, Josep F. Abril, Enrique Blanco, Moises Burset, Genis Parra

    • Method

      • Dynamic programming based system to combine potential exon candidates modeled as a fifth order Markov model and functional sequence sites modeled as a position weight matrix (Geneid version 3).

      • Fully automatic, very fast.

      • Trained on Drosophila data

        • http://www.fruitfly.org/GSAC1/data/data.html


Submissions (cont.)

  • References

    • Guigó et al. (1998), JCB, 5, 681-702.

  • URL

    • Information on training process:

      • http://www1.imim.es/~rguigo/AnnotationExperiment/index.html

    • http://www1.imim.es/geneid.html


Submissions (cont.)

  • Mark Borodovsky's Lab, School of Biology, Georgia Institute of Technology

    • Credit

      • Mark Borodovsky, John Besemer

    • Method

      • Markov chain models combined with HMM technology (Genemark.hmm).

    • URL

      • http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html


Submissions (cont.)

  • Biodivision, GSF Forschungszentrum für Umwelt und Gesundheit, Neuherberg, Germany

    • Credit

      • Matthias Scherf, Andreas Klingenhoff, Thomas Werner

    • Method

      • Universal sequence classifier which is based on a correlated word analysis to predict initiators and promoter associated TATA boxes (CoreInspector V1.0 beta). Sequences of 100 bp are classified at once.

      • Trained on Eukaryotic Promoter Database (EPD version 5.9).

      • Fully automatic, 2 seconds per 1 kb.

    • References

      • Scherf et al. (1999), in preparation.

    • URL

      • http://www.gsf.de/biodv/


Submissions (cont.)

  • The Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York

    • Credit

      • Gary Benson

    • Method

      • Tandem repeats finder (TRF v2.02) uses a theoretical model of the similarity between adjacent copies of a pattern (patterns of 1-500 bp recognized); dynamic programming for candidate validation.

      • Fully automatic; very fast (seconds per 1 Mb).

      • http://c3.biomath.mssm.edu/trf/Adh.fa.2.7.7.80.10.50.500.1.html

    • References

      • Benson (1999), Nucl. Acids Res., 27(2), 573-580.

    • URL

      • http://c3.biomath.mssm.edu/trf.html


Submissions (cont.)

  • Genie, UC Berkeley/UC Santa Cruz/ Neomorphic Inc.

    • Credit

      • Martin G. Reese, David Kulp, Hari Tammana, David Haussler

    • Method

      • Generalized hidden Markov model with optional integration of EST hits and homology searches (Genie).

      • Trained on Drosophila data

        • http://www.fruitfly.org/GSAC1/data/data.html

      • Semi-automatic, in that the overlaps between the analyzed sequence contigs (110 kb) were manually rerun with Genie to resolve conflicts.

      • BLAST used for homology searches on non-redundant protein database (nr).


Submissions (cont.)

  • References

    • Reese et al. (1997), JCB, 4(3), 311-323.

    • Kulp et al. (1997), Biocomputing: Proc. of the 1997 PSB conference, 232-244.

    • Kulp et al. (1996), ISMB, 4, 134-142.

  • URL

    • http://www.neomorphic.com/genie


Submission classes




Measuring success

  • By nucleotide

    • Sensitivity/Specificity (Sn/Sp)

  • By exon

    • Sn/Sp

    • Missed exons (ME), wrong exons (WE)

  • By gene

    • Sn/Sp

    • Missed genes (MG), wrong genes (WG)

    • Average overlap statistics

  • Based on Burset and Guigó (1996), “Evaluation of gene structure prediction programs”. Genomics, 34(3), 353-367.


Definitions and formulae

Sn = TP/(TP+FN)

Sp = TP/(TP+FP)

  • TP = True positive

  • FP = False positive

  • FN = False negative
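At the nucleotide level the two formulae can be evaluated directly from annotated and predicted coding intervals. The sketch below is one simple way to do it, assuming hypothetical inclusive `(start, end)` intervals on a single strand; note that Sp as defined here, TP/(TP+FP), is the quantity called precision in some other fields.

```python
def covered_positions(intervals):
    """Expand inclusive (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def base_level_sn_sp(actual, predicted):
    """Return (Sn, Sp) with Sn = TP/(TP+FN) and Sp = TP/(TP+FP)."""
    actual_pos = covered_positions(actual)
    predicted_pos = covered_positions(predicted)
    tp = len(actual_pos & predicted_pos)  # coding and predicted coding
    fn = len(actual_pos - predicted_pos)  # coding but not predicted
    fp = len(predicted_pos - actual_pos)  # predicted but not coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp
```

Expanding intervals into sets keeps the sketch short; for a multi-megabase contig an interval-sweep implementation would use far less memory.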





Toy example 1 (1)

Sn = TP/(TP+FN)

Sp = TP/(TP+FP)




Toy example 1 (2)

Sn = TP/(TP+FN)

Sp = TP/(TP+FP)


Genes: Std 1 versus Std 3

Std1: “conservative gene set”

Std3: “more complete gene set”


Toy example 1 (3)

Sn = TP/(TP+FN)

Sp = TP/(TP+FP)



Toy example 1 (4)





Definition: “Joined” and “split” genes

JG = (# actual genes that overlap predicted genes) / (# predicted genes that overlap one or more actual genes)

SG = (# predicted genes that overlap actual genes) / (# actual genes that overlap one or more predicted genes)

  • JG > 1, tendency to join multiple actual genes into one prediction

  • SG > 1, tendency to split actual genes into separate gene predictions

Inspired by Hayes and Guigó (1999), unpublished.
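The JG and SG ratios can be sketched in code under one literal reading of the slide: count each actual gene that overlaps at least one prediction, and each predicted gene that overlaps at least one actual gene. This reading is an assumption (under it SG is simply the reciprocal of JG), and the definition used in the final assessment may count overlaps differently. Genes are hypothetical inclusive `(start, end)` spans on one strand.

```python
def spans_overlap(a, b):
    """True if two inclusive (start, end) spans share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def joined_split(actual, predicted):
    """Return (JG, SG) under the literal reading described above."""
    actual_hit = [a for a in actual
                  if any(spans_overlap(a, p) for p in predicted)]
    predicted_hit = [p for p in predicted
                     if any(spans_overlap(p, a) for a in actual)]
    if not actual_hit or not predicted_hit:
        return float("nan"), float("nan")
    jg = len(actual_hit) / len(predicted_hit)
    sg = len(predicted_hit) / len(actual_hit)
    return jg, sg
```

With two actual genes covered by one long prediction the sketch gives JG = 2, and with one actual gene covered by two predictions it gives SG = 2.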


Toy example 2 (1)


Annotation experiment results

  • Results available during tutorial and at

http://www.fruitfly.org/GASP1/results/


Results: Base level

  • Sensitivity:

    • Low variability among predictors

    • ~95% coverage of the proteome

  • Specificity

    • ~90%

    • Programs that are more like Genscan (used for original annotation) might do better?


Results: Exon level

  • Higher variability among predictors

  • Up to ~75% sensitivity (both exon boundaries correct)

  • 55% specificity

  • Low specificity because partial exon overlaps do not count

  • Missing exons below 5%

  • Many wrong exons (~20%)
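The exon-level figures depend on how a match is counted. The sketch below follows the convention stated on the slide (an exon is correct only if both boundaries match) and assumes, in the spirit of Burset and Guigó (1996), that a missed exon is an actual exon with no overlapping prediction and a wrong exon is a predicted exon with no overlapping actual exon. Exons are hypothetical inclusive `(start, end)` tuples.

```python
def exon_level_stats(actual, predicted):
    """Return exon-level Sn, Sp, ME and WE as fractions."""
    actual_set, predicted_set = set(actual), set(predicted)
    if not actual_set or not predicted_set:
        raise ValueError("need at least one actual and one predicted exon")

    def overlap(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    exact = actual_set & predicted_set  # both boundaries identical
    missed = [a for a in actual_set
              if not any(overlap(a, p) for p in predicted_set)]
    wrong = [p for p in predicted_set
             if not any(overlap(p, a) for a in actual_set)]
    return {
        "Sn": len(exact) / len(actual_set),
        "Sp": len(exact) / len(predicted_set),
        "ME": len(missed) / len(actual_set),
        "WE": len(wrong) / len(predicted_set),
    }
```

A prediction that overlaps an actual exon but misses one boundary lowers Sn and Sp without counting as either missed or wrong, which is why exon-level specificity can be low even when few exons are wrong.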



Results: Gene level

  • 60% of actual genes predicted completely correctly

  • Specificity only 30-40%

  • 5-10% missed genes (comparable to Sanger Centre)

  • 40% wrong genes, a lot of short genes over-predicted (possibly not annotated in Standard 3)

  • Splitting genes is a bigger problem than joining genes


Results (protein homology): Base level


Results (protein homology): Exon level


Results (protein homology): Gene level



TSS: Standard 3


Results: TSS recognition


Interesting gene examples: bubblegum


Adh/Adhr (Alcohol dehydrogenase/Adh related)


Adh/Adhr (cont.)


osp (outspread)

  • Contains Adh and Adhr embedded in an intron


cact (cactus)


kuz (kuzbanian)


beat (beaten path)


Idgf1, Idgf2, Idgf3 (Imaginal Disc Growth Factor)


Idgf1, Idgf2, Idgf3 (cont.)

  • Chitinase-related

  • Gene function has changed (now a growth factor)


Conclusion of GASP1

  • 95% coverage of the proteome

  • Base level prediction is easier, exon level prediction is harder

  • Small genes overpredicted (?)

  • Long introns

  • The high number of “wrong genes” indicates possible incomplete annotation in Standard 3 (Are there more genes?)

  • HMMs currently seem to be the best approach

  • Major improvements in multiple gene regions


Conclusion GASP1 (cont.)

  • Much lower false positive rates

  • Methods optimized for organism of interest do better

  • Including homology in gene finding does not always improve prediction

  • Split genes are more of a problem than joined genes

  • No program is perfect


Discussion GASP1

  • Genes in introns

  • Alternative splicing

  • Genomic contamination in cDNA libraries

  • Translation start prediction

  • Biological verification of prediction needed

    • Improve test bed by cDNA sequencing

    • More regulation data needed to confirm promoter assessment

  • Combining methods

  • Better methods needed

  • GASP 2 ?


Conclusions on annotating complete eukaryotic genomes

  • Throughput has to improve dramatically

  • Not only genes but also their relationships have to be elucidated

  • Complete transcript cDNAs are a very powerful tool for annotation, including alternative transcripts

  • Comparative genomics as well as expression analysis improve/complete genome annotation

  • Standardization efforts needed (ontology working group, OMG, OiB, NCBI/EBI, Bioxml, etc.)

    • Standards for description of gene products

    • Exchange format (GFF, Genbank, EMBL, XML)


Conclusions on annotating complete eukaryotic genomes (cont.)

  • Maintenance requires even more effort than the original development

  • Automated methods are not good enough

  • Human curators can cause problems too

  • Functional assignment by homology is sometimes unreliable


Discussion on annotating complete eukaryotic genomes

  • Re-annotation: updating results and annotations over time

    • Genomic sequence changes (indels, point mutations)

    • Analysis software changes

    • New entries in public sequence databases

    • Entries removed from sequence databases

  • Audit trail for annotations

  • Master copy of genome annotations should reside in the model organism databases where the expertise resides

  • Community collaborative annotation


Acknowledgments

  • Uwe Ohler (University of Erlangen, Germany)

  • Gerry Rubin (UC Berkeley)

  • Sima Misra (UC Berkeley)

  • Erwin Frise (UC Berkeley)

  • Roderic Guigó (Barcelona)

  • GFF team (headed by Richard Bruskiewich, Sanger Centre)

  • Assessment team: Michael Ashburner (EBI), Peer Bork (EMBL), Richard Durbin (Sanger), Roderic Guigó (Barcelona), Tim Hubbard (Sanger)

  • Annotation experiment participants

