GUS: The Genomics Unified Schema
A Platform for Genomics Databases

V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert
Center for Bioinformatics, University of Pennsylvania
stevef,stoeckrt@pcbi.upenn.edu

Overview

Abstract
The Genomics Unified Schema (GUS) is a strongly typed relational database schema and accompanying portable object-based software platform used for integration, analysis, curation, mining and presentation of sequence-based genomics information. The schema is organized into five domains: a detailed model of the central dogma (gene, RNA, protein) including DNA, assembled RNA, and protein sequence, and a diversity of sequence annotation (DoTS); an MGED-compliant warehouse of transcript expression experiments (RAD); a catalogue of grammars describing regulatory regions (TESS); a wide range of controlled vocabularies and ontologies (SRES); and a detailed representation of data provenance (CORE). (A sixth domain for protein expression is in progress.) GUS's normalized relational structure and the extent of its integrated data enable powerful queries not viable in many other genomics systems. The platform facilitates maintenance of the warehouse and its use in web and data mining applications.

Goals of GUS
• Generic platform for model organism or disease specific databases
• Freely available at www.gusdev.org and www.cbil.upenn.edu
• Integration of genome, transcript and protein data, including:
  • Sequence
  • Function
  • Expression
  • Interaction
  • Regulation
  • Orthologs and paralogs
• Support for:
  • automated annotation and integration
  • manual curation
  • data mining/analysis and sophisticated queries
  • web access

GUS Powers Multiple Genomics DBs
[Diagram: the GUS domains (DoTS, RAD, TESS, SRes, Core) in an Oracle RDBMS, with an object layer for data loading and Java servlets serving AllGenes, PlasmoDB, EPConDB, and other sites and projects.]

Components of GUS
• Relational database schema
• Lightweight object layer
• Application frameworks
  • Data access
  • Pipeline/workflow
  • Web (servlets)
• Applications
  • Annotator's interface
  • Parsers and exporters (using standards)
  • Annotation and analysis programs
  • Schema browser
• Utilizes Oracle 9i

Architecture of GUS
[Diagram. Inputs: genomic sequence; GSSs & ESTs; microarray & SAGE experiments; mapping data; annotation; GenBank, InterPro, GO, etc.; QTL, POP, SNP, clinical. Loading: automated analysis & integration and the annotator's interface, through the object layer. Storage: Oracle/SQL holding Core, SRes, DoTS, RAD and TESS. Access: Java servlets & Perl CGI for WWW queries, browsing & download, plus mining applications.]

Usage of GUS
• Annotation
  • Of genomes: gene models, sequence features
  • Of genes: function, expression, regulation
• Integration
  • From sequence to expression
  • Map identifiers to/from external databases
• Data mining, creating curated datasets
  • Algorithm-based: GO function prediction
  • Genome-wide querying: find all pancreas-specific transcripts
  • PancChip: non-redundant genes expressed in pancreas found using ESTs, microarrays and cDNA libraries

GUS Schema

Schema features
• Extensive integrated genomics schema (300 tables)
  • Divided into 5 distinct domains
• Highly normalized
• Strongly typed
  • Controlled vocabularies used extensively
  • Avoid using name-value pairs
• Subclassing
  • Use views of superclass to define subclasses
  • Useful for mapping into the object layer
• Warehousing
  • Include databases such as GenBank, GO terms, ProDom, CDD
  • Facilitates management of value-added annotation across updates
• Cross references to external databases
• Tracking and versioning
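The "views of superclass" idea can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than Oracle; the table `na_feature`, the discriminator column `subclass_view`, and the generic column `string1` are hypothetical stand-ins, not the actual GUS definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical superclass table: generic columns shared by all subclasses.
    CREATE TABLE na_feature (
        na_feature_id INTEGER PRIMARY KEY,
        subclass_view TEXT NOT NULL,   -- names the subclass each row belongs to
        name          TEXT,
        string1       TEXT             -- generic column, renamed by each view
    );

    -- Each subclass is a view that filters on the discriminator and
    -- exposes the generic column under a meaningful name.
    CREATE VIEW promoter AS
        SELECT na_feature_id, name, string1 AS promoter_type
        FROM na_feature WHERE subclass_view = 'Promoter';

    CREATE VIEW binding_site AS
        SELECT na_feature_id, name, string1 AS factor_name
        FROM na_feature WHERE subclass_view = 'BindingSite';
""")

conn.executemany(
    "INSERT INTO na_feature (subclass_view, name, string1) VALUES (?, ?, ?)",
    [("Promoter", "p1", "TATA"), ("BindingSite", "b1", "c-fos")],
)

promoters = conn.execute("SELECT name, promoter_type FROM promoter").fetchall()
binding_sites = conn.execute("SELECT name, factor_name FROM binding_site").fetchall()
```

Because every subclass lives in one physical table, an object layer can map each view to a class while generic code still works against the superclass.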

Five domains
GUS is divided into 5 domains* (separate name spaces)

Namespace                        Domain                   Highlights
Core                             Data Provenance          Evidence
SRes (Shared Resources)          Shared Resources         Ontologies
DoTS (DB of Transcribed Seqs)    Sequence and annotation  Central dogma
RAD (RNA Abundance DB)           Gene expression          MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation          Grammars

* Protein interaction domain underway

Querying across the domains
[Diagram: the five GUS domains (Core, SRes, DoTS, RAD, TESS) labeled against the kinds of data they hold.]
• Transcript Expression: arrays, SAGE, conditions
• Transcribed Sequence: characterize transcripts, RH mapping, library analysis, cross-species analysis, DoTS assemblies
• Protein Sequence: domains, function, structure, cross-species analysis
• Gene Regulation: binding sites, patterns, grammars
• Data Provenance: ownership, protection, algorithms, versioning, workflows
• Genomic Sequence: genes, gene models; STSs, repeats, etc.; cross-species analysis
• Ontologies: GO, species, anatomy/tissue, developmental stage, disease state
Example cross-domain query: "Transcription factors upregulated in acute myeloid leukemia with sequence similarity to c-fos and common promoter motifs"

DoTS central dogma schema
• Gene: Gene Instance, Gene Feature (isa NA Feature), Genomic Sequence (isa NA Sequence)
• RNA: RNA Instance, RNA Feature (isa NA Feature), RNA Sequence (isa NA Sequence)
• Protein: Protein Instance, Protein Feature (isa AA Feature), Protein Sequence (isa AA Sequence)

RAD schema uses MAGE/MIAME
• MAGE: Experiment; Array; BioMaterial; BioAssay; BioAssayData; Protocol, Descr.; HigherLevelAnalysis
• MIAME: Experimental Design; Array design; Samples; Hybridization, Measure; Normalization

TESS schema
[Diagram of tables and their subclass views.]
• DoTS.NaFeature: BindingSite, Promoter, ...
• TESS.Moiety: Moiety, MoietyHeterodimer, MoietyMultimer, MoietyComplex
• TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
• TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
• Other tables: TESS.FootprintInstance, TESS.TrainingSet, TESS.ParameterGroup, TESS.Note, DoTS.NaSequence

Ontologies and vocabularies
• Ontologies
  • Gene Ontology (GO)
  • Sequence Ontology (SO) (sequence features)
  • Phenotype and Trait Ontology (PATO)
  • Taxon (NCBI)
  • Anatomy (Penn)
  • Disease (ICD9)
  • Developmental stage (multiple sources)
• And vocabularies
  • External database names
  • Genetic codes
  • Review status

Evidence trail
• Evidence and tracking
  • Data tables have columns for user, date, project, algorithm invocation
  • Tables dedicated to algorithm, algorithm version and parameters
  • 176 algorithms, including public and in-house
  • Tracks automated and manual annotation, similarity and integration
• Versioning
  • All updated or deleted rows are copied to version table
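The versioning rule (every updated or deleted row is copied to a version table) can be sketched with SQLite triggers. The poster does not say how GUS implements the copy, so the trigger mechanism here is an assumption chosen to keep the example short, and the `gene` / `gene_ver` tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data table with tracking columns.
    CREATE TABLE gene (
        gene_id            INTEGER PRIMARY KEY,
        name               TEXT,
        row_user           TEXT,
        row_alg_invocation INTEGER
    );
    -- Version table: same columns plus the time the old row was retired.
    CREATE TABLE gene_ver (
        gene_id            INTEGER,
        name               TEXT,
        row_user           TEXT,
        row_alg_invocation INTEGER,
        version_date       TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Copy the OLD row before it is overwritten.
    CREATE TRIGGER gene_update BEFORE UPDATE ON gene
    BEGIN
        INSERT INTO gene_ver (gene_id, name, row_user, row_alg_invocation)
        VALUES (OLD.gene_id, OLD.name, OLD.row_user, OLD.row_alg_invocation);
    END;
    -- Copy the OLD row before it is removed.
    CREATE TRIGGER gene_delete BEFORE DELETE ON gene
    BEGIN
        INSERT INTO gene_ver (gene_id, name, row_user, row_alg_invocation)
        VALUES (OLD.gene_id, OLD.name, OLD.row_user, OLD.row_alg_invocation);
    END;
""")

conn.execute("INSERT INTO gene VALUES (1, 'c-fos', 'loader', 10)")
conn.execute("UPDATE gene SET name = 'Fos', row_user = 'curator', "
             "row_alg_invocation = 11 WHERE gene_id = 1")
conn.execute("DELETE FROM gene WHERE gene_id = 1")

history = conn.execute(
    "SELECT name, row_user, row_alg_invocation FROM gene_ver ORDER BY rowid"
).fetchall()
```

After the update and the delete, `history` holds both retired states of the row, each still carrying the user and algorithm invocation that produced it.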

Sophisticated queries
• Sample queries from three projects that utilize GUS's data integration and analysis
• www.allgenes.org
  • "Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?"
• http://plasmodb.org
  • "List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum's late schizont stage"
• www.cbil.upenn.edu/EPConDB
  • "Which genes on chromosome 2 are expressed in pancreas and are involved in signal transduction based on GO function assignments?"
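To show why an integrated, normalized schema makes such questions a single query, here is a toy version of the EPConDB example using Python's sqlite3 module. The three tables, their columns, and the gene symbols are invented for illustration and are far simpler than the real GUS tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for sequence, expression and ontology data.
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
    CREATE TABLE expression (gene_id INTEGER, anatomy TEXT);
    CREATE TABLE go_assignment (gene_id INTEGER, go_term TEXT);

    INSERT INTO gene VALUES (1, 'geneA', '2'), (2, 'geneB', '2'), (3, 'geneC', '5');
    INSERT INTO expression VALUES (1, 'pancreas'), (2, 'liver'), (3, 'pancreas');
    INSERT INTO go_assignment VALUES
        (1, 'signal transduction'),
        (2, 'signal transduction'),
        (3, 'signal transduction');
""")

# One join answers: on chromosome 2, expressed in pancreas,
# and assigned the GO function "signal transduction".
hits = conn.execute("""
    SELECT DISTINCT g.symbol
    FROM gene g
    JOIN expression e    ON e.gene_id = g.gene_id
    JOIN go_assignment a ON a.gene_id = g.gene_id
    WHERE g.chromosome = '2'
      AND e.anatomy = 'pancreas'
      AND a.go_term = 'signal transduction'
    ORDER BY g.symbol
""").fetchall()
```

Only `geneA` satisfies all three conditions: `geneB` is not expressed in pancreas and `geneC` is on a different chromosome.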

Application Frameworks

GUS Object layer
• Lightweight Perl implementation
• Java on the way
• One object per table
• Parent/child relationships
• Cascading delete
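The object-layer bullets can be sketched as below. The real layer is written in Perl; this Python version is only an illustration, and the `Row` base class, the method names, and the Gene/GeneInstance/GeneSynonym parent-child arrangement are hypothetical.

```python
class Row:
    """In-memory object standing for one row of one table."""
    table = None  # set by each subclass: one class per table

    def __init__(self, **attrs):
        self.attrs = attrs
        self.parent = None
        self.children = []
        self.deleted = False

    def add_child(self, child):
        """Record a parent/child relationship (a foreign key in the schema)."""
        child.parent = self
        self.children.append(child)
        return child

    def mark_deleted(self):
        """Cascading delete: mark children first, then this row.

        Returns the table names in the order rows would be deleted, so that
        no child row is left pointing at a missing parent.
        """
        order = []
        for child in self.children:
            order.extend(child.mark_deleted())
        self.deleted = True
        order.append(self.table)
        return order


class Gene(Row):
    table = "Gene"


class GeneInstance(Row):
    table = "GeneInstance"


class GeneSynonym(Row):
    table = "GeneSynonym"


gene = Gene(name="example gene")
instance = gene.add_child(GeneInstance())
synonym = gene.add_child(GeneSynonym(synonym="alias"))
delete_order = gene.mark_deleted()
```

Deleting the parent walks the tree depth-first, so `delete_order` lists the child tables before `Gene`.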

Data input
• The GusApplication program manages inserts and updates to GUS, handling tracking and versioning.
• Specific tasks are implemented as plugins.
• Plugins use either GUS objects or SQL access.
• Low-level database access is provided by DBI classes.
[Diagram: GusApplication runs a Plugin, which reaches the database either through SQL or through objects (object superclasses plus the Core, SRes, DoTS, RAD and TESS objects), all on top of DBI.]
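The division of labor between the application and its plugins can be sketched like this. It is a Python illustration of the pattern only, not the Perl GusApplication API: the class names, the `run` signature, and the in-memory list standing in for the database are all assumptions.

```python
import itertools
from datetime import date


class Plugin:
    """A specific loading task; subclasses return the rows they want stored."""
    name = "abstract"

    def run(self, args):
        raise NotImplementedError


class LoadGeneNames(Plugin):
    """Hypothetical plugin that turns a list of names into Gene rows."""
    name = "LoadGeneNames"

    def run(self, args):
        return [{"table": "Gene", "name": n} for n in args["names"]]


class Application:
    """Runs plugins and stamps every row with tracking information."""

    def __init__(self, user, project):
        self.user = user
        self.project = project
        self._invocation_ids = itertools.count(1)
        self.database = []  # in-memory stand-in for the real database

    def execute(self, plugin, args):
        invocation_id = next(self._invocation_ids)
        for row in plugin.run(args):
            # The plugin supplies the data; the application supplies provenance.
            row.update(
                user=self.user,
                project=self.project,
                date=date.today().isoformat(),
                algorithm=plugin.name,
                algorithm_invocation=invocation_id,
            )
            self.database.append(row)
        return invocation_id


app = Application(user="annotator", project="demo")
first_run = app.execute(LoadGeneNames(), {"names": ["geneA", "geneB"]})
second_run = app.execute(LoadGeneNames(), {"names": ["geneC"]})
```

Keeping the tracking logic in one place means a plugin author cannot forget to record who loaded a row or which run produced it.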

Pipeline
• Perl API for defining annotation pipelines
• Supports sequential protocols
• Distributes compute-intensive work to compute cluster
• Used for 90-stage pipeline to build DoTS transcript index
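A sequential protocol can be sketched as an ordered list of named steps. The real API is Perl; this Python sketch is illustrative only. Skipping steps that are already recorded as done, so a long run can resume after a failure, is an assumption on my part and is not described on the poster. The step names are hypothetical.

```python
def run_pipeline(steps, done=None):
    """Run (name, function) steps in order, skipping names already in `done`.

    Returns the names of the steps executed in this call. `done` is updated
    in place so the caller can persist it between runs.
    """
    done = set() if done is None else done
    executed = []
    for name, func in steps:
        if name in done:
            continue  # finished in an earlier run
        func()
        done.add(name)
        executed.append(name)
    return executed


log = []
steps = [
    ("extract_sequences", lambda: log.append("extract")),
    ("cluster", lambda: log.append("cluster")),
    ("assemble", lambda: log.append("assemble")),
]

# Pretend an earlier run already completed the first step.
completed = {"extract_sequences"}
ran_now = run_pipeline(steps, completed)
```

In a real deployment each step function would submit work to the compute cluster and block until it finished; here the steps only append to a log.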

Web
• Servlets and CGI based design (JSP on the way)
• Automatic generation of HTML FORMs
• Automated input checking
• Integrated help features
• INPUT elements populated from the database
• Query history facility
• Boolean queries (AND, OR, SUBTRACT)
• Declarative configuration file
• Base system is relatively independent of GUS
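Boolean queries over a query history reduce to set operations on stored result identifiers. The sketch below assumes each earlier query result is kept as a collection of IDs; the `history` dictionary, its keys, and the IDs are hypothetical.

```python
def combine(left, op, right):
    """Combine two result sets with AND, OR or SUBTRACT."""
    left, right = set(left), set(right)
    if op == "AND":
        return left & right   # in both results
    if op == "OR":
        return left | right   # in either result
    if op == "SUBTRACT":
        return left - right   # in the first result but not the second
    raise ValueError(f"unknown operator: {op}")


# Hypothetical query history: name of each earlier query -> matching IDs.
history = {
    "expressed_in_pancreas": {101, 102, 103},
    "signal_transduction": {102, 103, 104},
}

both = combine(history["expressed_in_pancreas"], "AND", history["signal_transduction"])
either = combine(history["expressed_in_pancreas"], "OR", history["signal_transduction"])
only_first = combine(history["expressed_in_pancreas"], "SUBTRACT", history["signal_transduction"])
```

Since each combination is itself a set of IDs, it can be stored back into the history and reused in later combinations.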

Provided Applications

Annotator's interface
[Screenshot: panels for Assign Gene Name/Symbol, Assign Gene Description, Assign Gene Synonym(s), and Evidence.]

Parsing & exporting
• Parsing
  • Sequence DBs: GenBank (main, dbEST, NRDB), SWISS-PROT, TIGR
  • Protein Motifs: CDD, ProDom, InterPro
  • Expression: MAGE
  • Ontologies: GO, SO, PATO
  • Mapping data: RH maps
  • Gene predictors: GLIMMER, Genscan, PHAT, GeneFinder
  • Similarity: BLAST, BLAT, Sim4
  • CAP4
• Exporting
  • FASTA
  • MAGE
  • Table dumps
  • DoTS Assemblies

Analysis & annotation
• GO functional assignment
• Expression analysis (PaGE)
• Anatomy classification
• Library distribution
• Genes from BLAT of DoTS against genome
• DoTS assembly and annotation
  • Refresh warehouse
  • Cluster and assemble mRNAs/ESTs into putative transcripts
  • Annotate transcripts through similarity, GO function and markers
  • Integrate previously existing manual curation

DoTS Pipeline
[Flow diagram. Recoverable elements:]
• Inputs: genomic sequence; mRNA/EST sequence
• Gene predictions (GenScan/HMMer, PHAT): predicted genes
• Clustering and assembly: DoTS consensus sequences
• SIM4 or BLAT; merge genes; gene/RNA cluster assignment; Gene Index
• Annotate DoTS: framefinder translation (RNAs, proteins); BLASTX, BLASTP (BLAST similarities); PFAM, Smart, ProDom (protein motifs); GO functions (functional predictions)
• Other computed annotation: EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis
• Manual annotation tasks

References & Acknowledgements
• References
  • Scearce, L. Marie, Brestelli, John E., McWeeney, Shannon K., Lee, Catherine S., Mazzarelli, Joan, Pinney, Deborah F., Pizarro, Angel, Stoeckert, C. J. Jr., Clifton, Sandra, Permutt, M. Alan, Brown, Juliana, Melton, Douglas A., Kaestner, Klaus H. (2002) Functional Genomics of the Endocrine Pancreas: The Pancreas Clone Set and PancChip, New Resources for Diabetes Research. Diabetes 51: 1997-2004.
  • Schug, J., Diskin, S., Mazzarelli, J., Brunk, Brian P., Stoeckert, C.J. (2002) Predicting Gene Ontology Functions from ProDom and CDD Protein Domains. Genome Res. 12: 648-655.
  • Bahl, A., Brunk, B., Coppel, R.L., Crabtree, J., Diskin, S.J., Fraunholz, M.J., Grant, G.R., Gupta, D., Huestis, R.L., Kissinger, J.C., Labo, P., Li, L., McWeeney, S.K., Milgram, A.J., Roos, D.S., Schug, J., Stoeckert, C.J. (2002) PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression and sequence data (both finished and unfinished). Nucleic Acids Res. 30: 87-90.
  • Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. (2001) Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data. Nature Genetics 29: 365-371.
  • Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE, vol. 4266, pp. 68-78.
  • Davidson, S.B., Crabtree, J., Brunk, Brian P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal 40(2): 512-531.
  • Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C., Bucan, M. (2001) High-resolution BAC-based Map of the Central Portion of Mouse Chromosome 5. Genome Res. 11: 1746-1757.
• Acknowledgements
  • NIH grant RO1-HG-01539-03
  • DOE grant DE-FG02-00ER62893
  • Burroughs Wellcome Fund
  • NIDDK 56947 and 56954 with cosponsorship from the JDFI

Related posters
• 114A. Web-Based Biological Discovery using the GUS Integrated Database.
• 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars.
• 148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?
