The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

The challenge of annotating a complete eukaryotic genome:A case study in Drosophila melanogaster Martin G. Reese (mgreese@lbl.gov) Nomi L. Harris (nlharris@lbl.gov) George Hartzell (hartzell@cs.berkeley.edu) Suzanna E. Lewis (suzi@fruitfly.berkeley.edu) Drosophila Genome CenterDepartment of Molecular and Cell Biology539 Life Sciences AdditionUniversity of California, Berkeley

Abstract Many of the technical issues involved in sequencing complete genomes are essentially solved. Technologies already exist that provide sufficient solutions for ascertaining sequencing error rates and for assembling sequence data. Currently, however, standards or rules for the annotation process are still an outstanding problem. How shall the genomes be annotated, what shall be annotated, which computational tools are most effective, how reliable are these annotations, how organism-specific do the tools have to be and ultimately how should the computational results be presented to the community? All these questions are unsolved. This tutorial will give an overview and assessment of the current state of annotation based upon experiences gained at the Drosophila melanogaster genome project. In the tutorial we will do three things. First, we will break down the annotation process and discuss the various aspects of the problem. This will serve to clarify the term "annotation", which is often used to collectively describe a process that has a number of discrete steps. Second, with the participation of computational biologists from the community we will compare existing tools for sequence annotation. We will do this by providing a 3 megabase sequence that has already been well-characterized at our center as a testbed for evaluating other feature-finding algorithms. This is similar to what has been done at the CASP (critical assessment of techniques for protein structure prediction) conferences (http://predictioncenter.llnl.gov) for protein structure prediction. Third, we will discuss which annotation problems are essentially solved and which problems remain.

Tutorial goals • Review the algorithms currently used in annotation • Assess existing methods under “field” conditions • Identify open issues in annotation

Tutorial organization • Definitions • Annotation • “Biological” issues • “Engineering” issues • Application of tools within an existing annotation system • Break (20 minutes) • Review of existing tools • Our annotation experiment • Conclusions and outstanding issues

What is a gene? • Definition: An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule which in turn have an influence on some characteristic phenotype of the organism.

What are annotations? • Definition: Features on the genome derived through the transformation of raw genomic sequences into information by integrating computational tools, auxiliary biological data, and biological knowledge.

How does an annotation differ from a gene? • Many annotations are the same as ‘genes’ • The annotation describes an inheritable trait associated with a region of DNA. • But an annotation may not always correspond in this way, e.g. an STS, or sequence overlap • Region of genomic DNA or RNA is not translated or transcribed

Transcription and translation

Schematic gene structure

Sequence feature types • Transcribed region • mRNA, tRNA, snoRNA, snRNA, rRNA • Structural region • Exon, intron, 5’ UTR, 3’ UTR, ORF, cleavage product • Mutations: insertion, deletion, substitution, inversion, translocation • Functional or signal region • Promoter, enhancer, DNA/RNA binding site, splice site signal, poly-adenylation signal • Protein processing: glycosylation, methylation, phosphorylation site • Similarity • Homolog, paralog, genomic overlap (syntenic region) • Other feature types • Transposable element, repetitive element • Pseudogene • STS, insertion site

DNA transcription unit features • Promoter elements • Core promoter elements • TATA box • Initiator (Inr) • Downstream promoter element (DPE) • Transcription factor (“TF”) binding sites • CAAT boxes • GC boxes • SP-1 sites • GAGA boxes • Enhancer site(s)

mRNA features • Exon • Initial, internal, terminal • Codon usage, preference • Control elements (e.g. splice enhancers) • Intron • 5’ splice site (“GT”), branchpoint (lariat), 3’ splice site (“AG”) • Repeat elements • Start codon (translation start site) • “Kozak” rule • UTR (untranslated regions) • 5’ UTR • Translation regulatory elements • RNA binding sites • Initial, internal, terminal • Control elements (e.g. splice enhancers) • 3’ UTR • RNA binding sites (cis-acting elements) • Stop codon • Poly-adenylation signal and site • RNA destabilization signal

Definitions for data modeling • Feature: An interval or an ordered set of intervals on a sequence that describes some biological attribute and is justified by evidence. • Sequence: A linear molecule of DNA, RNA or amino acids. • Evidence: A computational or experimental result coming out of an analysis of a sequence • Annotation: A set of features

Detailed analysis (typically biological) of single genes Large-scale analysis (typically computational) of entire genome Annotation Annotated genome Depth of knowledge Breadth of knowledge

Annotation process overview Methods Data Genome Sequence Auxiliary Data Computational Tools Database Resources Annotation Systems Understanding of a Genome

Types of sequence data • Chromosomal sequence • Euchromatic • Heterochromatic • mRNA sequences • Full length cDNA • 5’ EST • 3’ EST • Protein sequences • Insertion site flanking sequences

Auxiliary data • Maps • Genetic, physical, radiation hybrid map (RH), deletion, cytogenetic • Expression data • Tissue, stage • Phenotypes • Lethality, sterility

Computational annotation tools • Gene finding • Repeat finding • EST/cDNA alignment • Homology searching • BLAST, FASTA, HMM-based methods, etc. • Protein family searching • PFAM, Prosite, etc.

Database resources • Curated sequence feature data sets • Repeat elements • Transposons • Non-redundant mRNA • STSs and other sequence markers • Genome sequence from related species • D. melanogaster vs. D. virilis, D. hydei • Genome sequence from more distant species • Protein sequences from distant species

Biological issues in annotation • Common • Genes within genes • Alternative splicing • Alternative poly-adenylation sites • Rare • Translational frame shifting • mRNA editing • Eukaryotic operons • Alternative initiation

Engineering issues in annotation • What sequence to start with? • Because features are intervals on a sequence, problems can be caused by gaps, frameshifts, and other changes to the sequence. How do you track these changes over time and model features that span gaps? • When to annotate? • Feature identification can aid in sequencing. It may be advisable to carry out sequencing and annotation in parallel thus enabling them to complement one another. • What analyses need to be run and how? • What dependencies are there between various analysis programs? • What parameters settings to use?

Engineering issues in annotation • What public sequence data sets are needed? • What are the mechanics of obtaining public sequence databases? • Are curated data sets available or do you need to set up a means of maintaining your own (for repeats, insertions, organism of interest) • How do you achieve computational throughput? • Workstation farm, or simply a big, powerful box? • Job flow control • What do you do with the results? • Homogenize results into single format? • Filter results for significance and redundancy

Engineering issues in annotation • Interpreting the results • Is human curation needed? • How can you achieve consistency between curators? • How do you design the user interface so that it is simple enough to get the task completed speedily but complex enough to deal with biology? • How do you capture curations? • How are annotation translations to be described? • EC terminology • ProSite families • Pfam domains • Is function distinguishable from process?

Engineering issues in annotation • How do you manage data? • What is the appropriate database schema design? • How is the database to be kept up to date? Will it be directly from programs running user interfaces and analyses or via a middleware layer? • Is a flat file format needed and what should it be? • What query and retrieval support is needed? • How do you distribute data? • For bulk downloads what is the format of the data? • What information is best summarized in tables? • What information requires an integrated graphical view?

Engineering issues in annotation • How do you update the annotations? • How frequently are they re-evaluated? • How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)? • How can differences between old and new computational results be detected? • Changes in computational results may need to trigger changes in curated annotations

Drosophila melanogaster • Drosophila is the most important model organism* • Drosophila genome: • 4 chromosomes • 180 Mb total sequence • 140 Mb euchromatic sequence • 12-14,000 genes * source: G.M. Rubin

Drosophila Genome Project • Laboratories working on Drosophila sequencing: • BDGP (Berkeley Drosophila Genome Project) • EDGP (European Drosophila Genome Project) • Celera Genomics Inc. • “Complete” D. melanogaster sequence will be finished by the end of 1999 • Comprehensive database - FlyBase

Goals of the Drosophila Genome Project • Complete genome sequence • Structure of all transcripts • Expression pattern of all genes • Phenotype resulting from mutation of all ORFs • And more...

Sequencing at the BDGP • Genomic sequence • P1 and BAC clones • 24Mb of completed sequence (as of July 22, 1999) • 18Mb unfinished sequence in process • Complete tiling path in BACs • 1.5x-path draft sequencing • ESTs and cDNAs • 80,942 ESTs finished (as of March 19, 1999) • Over 800 full-length cDNAs

The BDGP sequence annotation process

What sequence to start with? • Unit of sequencing at the BDGP • Completed high-quality clone sequences • Reassembling the genomic sequence • Need to place clones in correct genomic positions • Need to integrate genes that span multiple clones • Solved by using genomic overlaps to reconstitute full genomic sequence

Which analyses need to be run? • Similarity searches • BLAST (Altschul et al., 1990) • BLASTN (nucleotide databases) • BLASTX (amino acid databases) • TBLASTX (amino acid databases, six-frame translation) • sim4 (Miller et al., 1998) • Sequence alignment program for finding near-perfect matches between nucleotide sequences containing introns • Gene predictors • Genefinder (Green, unpublished) • GenScan (Burge and Karlin, 1997) • Genie (Reese et al., 1997) • Other analyses • tRNAscanSE (Lowe and Eddy, 1996)

Which analyses need to be run and how? • mRNAs • ORFFinder(Frise, unpublished) • Protein translations • HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1 Sonnhammer et al. 1997, Bateman et al. 1999) • Ppsearch (Fuchs 1994) against ProSite (release 15.0) filtered with EMOTIF ( Nevill-Manning et al. 1998) • Psort II (Horton and Nakai 1997) • ClustalW (Higgins et al. 1996)

What public sequence data sets are needed? • Automating updates of public databases: • Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP • Curated data sets • D. melanogaster genes (FlyBase) • Transposable elements (EDGP) • Repeat elements (EDGP) • STSs (BDGP)

Which analyses need to be run and how?

How do you achieve computational throughput? • BDGP computing power • Sun Ultra 450 (3 machines, 4 processors each) • Sun Enterprise (1 machine, 8 processors) • Used these directly, without any system for distributed computing. • Job flow control: the Genomic Daemon • Automatic batch analysis of genomic clones • Berkeley Fly Database is used for queuing system and storage of results • Many clones can be analyzed simultaneously • Results are processed and saved in XML format for interactive browsing

What do you do with the results? • Berkeley Output Parser (BOP) • Input to BOP: • Genomic sequence • Results of computational analyses • Filtering preferences • Parses results from BLAST, sim4, GeneFinder, GenScan, and tRNAscan-SE analyses • Filters BLAST and sim4 results • Eliminates redundant or insignificant hits • Merges hits that represent single region of homology • Homogenizes results into single format • Output: sequence and filtered results in XML format

Is human curation needed? • Not for everything • Some features are obvious and can be identified computationally • Known D. melanogaster genes are detected automatically by GeneSkimmer • Repetitive elements • But still for many things • Annotating complete gene structure is still hard • We use CloneCurator (BDGP’s Java graphical editor) for curation

Gene Skimmer • Quick way of identifying genes in new sequence before curation • Start with XML output from BOP • Look for sim4 hits with known Drosophila genes • Find gene hits with sequence identity >98%, coverage >30% • Verify that hits represent real genes

Gene Skimmer URL: http://www.fruitfly.org/sequence/genomic-clones.html

CloneCurator • Displays computational results and annotations on a genomic clone • Interactive browsing • Zoom/scroll • Change cutoffs for display of results • Analyze GC content, restriction sites, etc. • Interactive annotation editing • Expert “endorses” selected results • Presents annotations to community via Web site

How do we annotate gene/protein function? • Gene Ontology Project • Controlled hierarchical vocabulary for multiple-genome annotations and comparisons • Standardized vocabulary facilitates collaboration • Good data modeling allows better database querying • Ontology browser provides interactive search of hierarchical terms • “GO” project (http://www.ebi.ac.uk/~ashburn/GO)

Ontology browser

Ontology browser: searching for terms

How do you distribute the data? • Bulk downloads • FASTA at http://www.fruitfly.org/sequence/download.html • Curated data sets • Tabular data • Athttp://www.fruitfly.org/sequence/ • Sequenced genomic clones • Clone contigs sorted by genomic location • Clone contigs sorted by size • Ribbon provides integrated graphical view of annotations on physical contigs

Ribbon • Human curator annotates individual clones (~100Kb) • Clones are assembled into physical contigs (regions of physical map) • Clone annotations are merged and renumbered for display on whole physical contigs • Ribbon is our Java display tool for displaying curated annotations on physical contigs • Will soon be available on Web

Ribbon

The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster