GUS: The Genomics Unified Schema. A Platform for Genomics Databases

V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert

Center for Bioinformatics, University of Pennsylvania

stevef,stoeckrt@pcbi.upenn.edu

Abstract

The Genomics Unified Schema (GUS) is a strongly typed relational database schema and accompanying portable object-based software platform used for integration, analysis, curation, mining and presentation of sequence-based genomics information. The schema is organized into five domains: a detailed model of the central dogma (gene, RNA, protein) including DNA, assembled RNA, and protein sequence, and a diversity of sequence annotation (DoTS); an MGED-compliant warehouse of transcript expression experiments (RAD); a catalogue of grammars describing regulatory regions (TESS); a wide range of controlled vocabularies and ontologies (SRES); and a detailed representation of data provenance (CORE). (A sixth domain for protein expression is in progress.) GUS’s normalized relational structure and extent of integrated data enable powerful queries not viable in many other genomics systems. The platform facilitates maintenance of the warehouse and its utilization in web and data mining applications.

Goals of GUS
  • Generic platform for model organism or disease specific databases
  • Freely available at www.gusdev.org and www.cbil.upenn.edu
  • Integration of genome, transcript and protein data, including:
    • Sequence
    • Function
    • Expression
    • Interaction
    • Regulation
    • Orthologs and paralogs
  • Support for:
    • automated annotation and integration
    • manual curation
    • data mining/analysis and sophisticated queries
    • web access
GUS Powers Multiple Genomics DBs
  • Web sites: AllGenes, PlasmoDB, EPConDB, other sites and other projects
  • Web access: Java Servlets
  • Data loading: object layer
  • Storage: Oracle RDBMS
  • Schema domains: DoTS, RAD, TESS, SRes, Core

Components of GUS
  • Relational database schema
  • Lightweight object layer
  • Application frameworks
    • Data access
    • Pipeline/workflow
    • Web (servlets)
  • Applications
    • Annotator’s interface
    • Parsers and exporters (using standards)
    • Annotation and analysis programs
  • Schema browser
  • Utilizes Oracle 9i
Architecture of GUS
  • Data sources: GenBank, InterPro, GO, etc.; genomic sequence; GSSs & ESTs; microarray & SAGE experiments; mapping data; annotation; QTL, POP, SNP, clinical
  • Storage: Oracle/SQL with the schema domains Core, SRes, DoTS, RAD and TESS
  • Access layers: object layer; Java Servlets & Perl CGI
  • Applications: automated analysis & integration; annotator’s interface; WWW queries, browsing & download; mining

Usage of GUS
  • Annotation
    • Of genomes: gene models, sequence features
    • Of genes: function, expression, regulation
  • Integration
    • From sequence to expression
    • Map identifiers to/from external databases
  • Data mining, creating curated datasets
    • Algorithm-based: GO function prediction
    • Genome-wide querying: find all pancreas-specific transcripts
    • PANCchip: non-redundant genes expressed in pancreas found using ESTs, microarrays and cDNA libraries
Schema features
  • Extensive integrated genomics schema (300 tables)
  • Divided into 5 distinct domains
  • Highly normalized
  • Strongly typed
    • Controlled vocabularies used extensively
    • Avoid using name-value pairs
  • Subclassing
    • Use views of superclass to define subclasses
    • Useful for mapping into the object layer
  • Warehousing
    • Includes databases such as GenBank, GO terms, ProDom and CDD
    • Facilitates management of value-added annotation across updates
  • Cross references to external databases
  • Tracking and versioning
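
The subclassing bullet (views of a superclass table define the subclasses) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than Oracle, and the table, column and view names (na_feature_imp, subclass_view, string1, gene_feature, rna_feature) are assumptions made for illustration, not the actual GUS schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one superclass table and two subclass views."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical superclass table: every feature row is stored here.
        CREATE TABLE na_feature_imp (
            na_feature_id INTEGER PRIMARY KEY,
            subclass_view TEXT NOT NULL,  -- discriminator naming the subclass
            name          TEXT NOT NULL,
            string1       TEXT            -- generic column reused by each subclass
        );
        -- Each subclass is a view that filters on the discriminator
        -- and gives the generic column a meaningful name.
        CREATE VIEW gene_feature AS
            SELECT na_feature_id, name, string1 AS gene_type
            FROM na_feature_imp WHERE subclass_view = 'GeneFeature';
        CREATE VIEW rna_feature AS
            SELECT na_feature_id, name, string1 AS rna_type
            FROM na_feature_imp WHERE subclass_view = 'RNAFeature';
    """)
    conn.executemany(
        "INSERT INTO na_feature_imp (subclass_view, name, string1) VALUES (?, ?, ?)",
        [
            ("GeneFeature", "geneA", "protein_coding"),
            ("RNAFeature", "rnaA", "mRNA"),
            ("GeneFeature", "geneB", "pseudogene"),
        ],
    )
    return conn


conn = build_demo_db()
# Querying the view returns only rows of that subclass, with typed column names.
gene_rows = conn.execute(
    "SELECT name, gene_type FROM gene_feature ORDER BY name"
).fetchall()
```

Because every subclass shares one physical table, an object layer can map each view to its own class while queries over all features still hit a single table.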
Five domains

GUS is divided into 5 domains* (separate name spaces)

Namespace                        Domain             Highlights
Core                             Data provenance    Evidence
SRes (Shared Resources)          Shared resources   Ontologies
DoTS (DB of Transcribed Seqs)    Central dogma      Sequence and annotation
RAD (RNA Abundance DB)           Gene expression    MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation    Grammars

* Protein interaction domain underway

Querying across the domains
  • Core (data provenance): ownership, protection, algorithms, versioning, workflows
  • SRes (ontologies): GO, species, anatomy/tissue, developmental stage, disease state
  • DoTS (genomic sequence): genes and gene models; STSs, repeats, etc.; cross-species analysis
  • DoTS (transcribed sequence): characterize transcripts, RH mapping, library analysis, cross-species analysis, DoTS assemblies
  • DoTS (protein sequence): domains, function, structure, cross-species analysis
  • RAD (transcript expression): arrays, SAGE, conditions
  • TESS (gene regulation): binding sites, patterns, grammars
  • Example query spanning TESS, SRes, RAD, DoTS and Core: "Transcription factors upregulated in acute myeloid leukemia with sequence similarity to c-fos and common promoter motifs"

DoTS central dogma schema
  • Central dogma axis: Gene → RNA → Protein
  • Gene → Gene Instance → Gene Feature (isa NA Feature) → Genomic Sequence (isa NA Sequence)
  • RNA → RNA Instance → RNA Feature (isa NA Feature) → RNA Sequence (isa NA Sequence)
  • Protein → Protein Instance → Protein Feature (isa NA Feature) → Protein Sequence (isa AA Sequence)

RAD schema uses MAGE/MIAME
  • MAGE: Experiment; Array; BioMaterial; BioAssay; BioAssayData; Protocol, Descr.; HigherLevelAnalysis
  • MIAME: Experimental design; Array design; Samples; Hybridization, Measure; Normalization

TESS schema
  • DoTS.NaFeature: BindingSite, Promoter, ...
  • TESS.Moiety: Moiety, MoietyHeterodimer, MoietyMultimer, MoietyComplex
  • TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
  • TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
  • Other tables shown: TESS.FootprintInstance, TESS.TrainingSet, TESS.ParameterGroup, TESS.Note, DoTS.NaSequence

Ontologies and vocabularies
  • Ontologies
    • Gene Ontology (GO)
    • Sequence Ontology (SO) (sequence features)
    • Phenotype and Trait Ontology (PATO)
    • Taxon (NCBI)
    • Anatomy (Penn)
    • Disease (ICD9)
    • Developmental stage (multiple sources)
  • And vocabularies
    • External database names
    • Genetic codes
    • Review status
Evidence trail
  • Evidence and tracking
    • Data tables have columns for user, date, project, algorithm invocation
    • Tables dedicated to algorithm, algorithm version and parameters
    • 176 algorithms, including public and in-house
    • Tracks automated and manual annotation, similarity and integration
  • Versioning
    • All updated or deleted rows are copied to a version table
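
The versioning rule can be sketched as follows: before a row is updated or deleted, its old contents are copied to a version table. This is a minimal illustration using Python's sqlite3 module; the tables gene and gene_ver, their columns, and the helper functions are hypothetical and do not reproduce the GUS implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data table with a tracking column for the modifying user.
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, modified_by TEXT);
    -- Matching version table: same columns plus the action that retired the row.
    CREATE TABLE gene_ver (gene_id INTEGER, symbol TEXT, modified_by TEXT, version_action TEXT);
""")


def archive(conn, gene_id, action):
    """Copy the current row into the version table before it changes."""
    conn.execute(
        "INSERT INTO gene_ver (gene_id, symbol, modified_by, version_action) "
        "SELECT gene_id, symbol, modified_by, ? FROM gene WHERE gene_id = ?",
        (action, gene_id),
    )


def update_symbol(conn, gene_id, symbol, user):
    """Archive the old row, then apply the update."""
    archive(conn, gene_id, "update")
    conn.execute(
        "UPDATE gene SET symbol = ?, modified_by = ? WHERE gene_id = ?",
        (symbol, user, gene_id),
    )


def delete_gene(conn, gene_id):
    """Archive the old row, then delete it."""
    archive(conn, gene_id, "delete")
    conn.execute("DELETE FROM gene WHERE gene_id = ?", (gene_id,))


conn.execute("INSERT INTO gene VALUES (1, 'Fos', 'loader')")
update_symbol(conn, 1, "c-fos", "curator")
delete_gene(conn, 1)
history = conn.execute(
    "SELECT symbol, modified_by, version_action FROM gene_ver ORDER BY rowid"
).fetchall()
```

After the update and the delete, the version table holds both earlier states of the row, so the full history can be reconstructed even though the data table is empty.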
Sophisticated queries
  • Sample queries from three projects that utilize GUS’s data integration and analysis
  • www.allgenes.org
    • “Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?”
  • http://plasmodb.org
    • “List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum’s late schizont stage”
  • www.cbil.upenn.edu/EPConDB
    • “Which genes on chromosome 2 are expressed in pancreas and are involved in signal transduction based on GO function assignments?”
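
To show how an integrated schema turns such a question into a single query, here is a minimal sketch of the EPConDB-style example using Python's sqlite3 module. The three tables (gene, expression, go_assignment), their columns and the sample rows are invented for illustration; the real GUS tables are far more detailed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, heavily simplified stand-ins for sequence, expression
    -- and ontology data held in one database.
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
    CREATE TABLE expression (gene_id INTEGER, anatomy TEXT);
    CREATE TABLE go_assignment (gene_id INTEGER, go_term TEXT);

    INSERT INTO gene VALUES (1, 'geneA', '2'), (2, 'geneB', '2'), (3, 'geneC', '5');
    INSERT INTO expression VALUES (1, 'pancreas'), (2, 'liver'), (3, 'pancreas');
    INSERT INTO go_assignment VALUES
        (1, 'signal transduction'),
        (2, 'signal transduction'),
        (3, 'signal transduction');
""")

# One join answers a question that spans location, expression and function.
QUERY = """
    SELECT DISTINCT g.symbol
    FROM gene g
    JOIN expression e ON e.gene_id = g.gene_id
    JOIN go_assignment a ON a.gene_id = g.gene_id
    WHERE g.chromosome = ? AND e.anatomy = ? AND a.go_term = ?
    ORDER BY g.symbol
"""
hits = [row[0] for row in conn.execute(QUERY, ("2", "pancreas", "signal transduction"))]
```

Only geneA satisfies all three conditions: geneB is expressed in liver and geneC lies on chromosome 5.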
GUS Object layer
  • Lightweight Perl implementation
  • Java on the way
  • One object per table
  • Parent/child relationships
  • Cascading delete
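
The ideas of one object per table, parent/child relationships and cascading delete can be sketched in a few lines. The sketch is in Python rather than the Perl of the actual object layer, and the class and method names (GusRow, add_child, mark_deleted, pending_deletes) are hypothetical.

```python
class GusRow:
    """Hypothetical base class: one subclass per table, one instance per row."""

    table = None

    def __init__(self, **attrs):
        self.attrs = attrs
        self.parent = None
        self.children = []
        self.deleted = False

    def add_child(self, child):
        """Link a child row to this row and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def mark_deleted(self):
        """Cascading delete: mark this row and every descendant."""
        self.deleted = True
        for child in self.children:
            child.mark_deleted()

    def pending_deletes(self):
        """List tables of deleted rows, children first so foreign keys stay valid."""
        out = []
        for child in self.children:
            out.extend(child.pending_deletes())
        if self.deleted:
            out.append(self.table)
        return out


class Gene(GusRow):
    table = "Gene"


class RNA(GusRow):
    table = "RNA"


class Protein(GusRow):
    table = "Protein"


gene = Gene(name="geneA")
rna = gene.add_child(RNA(name="rnaA"))
protein = rna.add_child(Protein(name="protA"))
gene.mark_deleted()
order = gene.pending_deletes()
```

Deleting the gene marks the RNA and protein as well, and the rows are listed child-first, which is the order a database with foreign-key constraints would require.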
Data input
  • The GusApplication program manages inserts and updates to GUS, handling tracking and versioning.
  • Specific tasks are implemented as plugins.
  • Plugins use either GUS objects or SQL access.
  • Low-level database access is provided by DBI classes.
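
The division of labour described above (an application that manages tracking, with each task implemented as a plugin) might look like the following minimal sketch. It is written in Python, not the Perl of GusApplication, and every name in it (Plugin, LoadSymbols, TrackedDb, Application) is hypothetical.

```python
class Plugin:
    """Hypothetical plugin interface: one subclass per loading task."""

    name = "base"

    def run(self, db, args):
        raise NotImplementedError


class LoadSymbols(Plugin):
    """Example task: insert one Gene row per symbol."""

    name = "LoadSymbols"

    def run(self, db, args):
        for symbol in args["symbols"]:
            db.insert("Gene", {"symbol": symbol})
        return len(args["symbols"])


class TrackedDb:
    """Toy stand-in for the database; stamps every row with tracking columns."""

    def __init__(self, user, invocation_id):
        self.user = user
        self.invocation_id = invocation_id
        self.rows = []

    def insert(self, table, values):
        row = dict(values, table=table, user=self.user,
                   algorithm_invocation=self.invocation_id)
        self.rows.append(row)


class Application:
    """Runs plugins so that tracking is handled in one place, not in each task."""

    def __init__(self):
        self.plugins = {}
        self.next_invocation = 1

    def register(self, plugin_cls):
        self.plugins[plugin_cls.name] = plugin_cls

    def run(self, plugin_name, user, args):
        db = TrackedDb(user, self.next_invocation)
        self.next_invocation += 1
        count = self.plugins[plugin_name]().run(db, args)
        return count, db.rows


app = Application()
app.register(LoadSymbols)
count, rows = app.run("LoadSymbols", "curator", {"symbols": ["geneA", "geneB"]})
```

The plugin only says what to insert; the user and invocation identifier are attached by the surrounding application, which is the point of routing all writes through it.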

[Diagram: GusApplication runs a Plugin, which accesses the database through SQL or through objects (Core, SRes, DoTS, RAD and TESS objects built on object SuperClasses); both paths use DBI.]

Pipeline
  • Perl API for defining annotation pipelines
  • Supports sequential protocols
  • Distributes compute-intensive work to a compute cluster
  • Used for the 90-stage pipeline that builds the DoTS transcript index
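
A sequential protocol can be sketched as an ordered list of named stages. The example below is in Python, not the Perl API, and the Pipeline class, its methods and the stage names are hypothetical. Skipping stages that already finished is an added assumption, and distribution to a cluster is not modelled.

```python
class Pipeline:
    """Hypothetical sequential pipeline: stages run once, in the order added."""

    def __init__(self):
        self.stages = []   # (name, function) pairs in execution order
        self.done = set()  # names of stages that already finished

    def add_stage(self, name, func):
        self.stages.append((name, func))

    def run(self, state):
        """Run every unfinished stage in order; return the state and what ran."""
        executed = []
        for name, func in self.stages:
            if name in self.done:
                continue  # assumption: finished stages are skipped on a rerun
            state = func(state)
            self.done.add(name)
            executed.append(name)
        return state, executed


pipeline = Pipeline()
pipeline.add_stage("cluster", lambda s: s + ["clustered"])
pipeline.add_stage("assemble", lambda s: s + ["assembled"])
pipeline.add_stage("annotate", lambda s: s + ["annotated"])

state, first_run = pipeline.run([])
_, second_run = pipeline.run(state)
```

With 90 stages, remembering which stages finished lets a failed build resume from the failing stage instead of starting over.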
Web
  • Servlet- and CGI-based design (JSP on the way)
  • Automatic generation of HTML FORMs
    • Automated input checking
    • Integrated help features
    • INPUT elements populated from the database
  • Query history facility
  • Boolean queries (AND, OR, SUBTRACT)
  • Declarative configuration file
  • Base system is relatively independent of GUS
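
Boolean queries over a query history amount to set operations on earlier result sets. The following Python sketch illustrates the idea; the QueryHistory class and the identifiers in the example are hypothetical, and the actual servlet implementation may work differently.

```python
class QueryHistory:
    """Hypothetical query history: each entry is the set of ids a query returned."""

    def __init__(self):
        self.results = []

    def record(self, ids):
        """Store a result set and return its 1-based query number."""
        self.results.append(frozenset(ids))
        return len(self.results)

    def combine(self, op, left, right):
        """Combine two earlier queries and record the outcome as a new query."""
        a, b = self.results[left - 1], self.results[right - 1]
        if op == "AND":
            combined = a & b
        elif op == "OR":
            combined = a | b
        elif op == "SUBTRACT":
            combined = a - b
        else:
            raise ValueError(f"unknown operation: {op}")
        return self.record(combined)


history = QueryHistory()
q1 = history.record({"DT.1", "DT.2", "DT.3"})  # e.g. expressed in pancreas
q2 = history.record({"DT.2", "DT.3", "DT.4"})  # e.g. on chromosome 2
q3 = history.combine("AND", q1, q2)
q4 = history.combine("SUBTRACT", q1, q2)
q5 = history.combine("OR", q3, q4)
```

Because each combination is itself recorded, a user can build a complex query step by step from simpler ones.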
Annotator’s interface
  • Assign gene name/symbol
  • Assign gene description
  • Assign gene synonym(s)
  • Evidence

Parsing & exporting
  • Parsing
    • Sequence DBs: GenBank (main, dbEST, NRDB), SWISS-PROT, TIGR
    • Protein Motifs: CDD, ProDom, InterPro
    • Expression: MAGE
    • Ontologies: GO, SO, PATO
    • Mapping data: RH maps
    • Gene predictors: GLIMMER, Genscan, PHAT, GeneFinder
    • Similarity: BLAST, BLAT, Sim4
    • Assembly: CAP4
  • Exporting
    • FASTA
    • MAGE
    • Table dumps
    • DoTS Assemblies
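
As an example of the simplest exporter in the list, the sketch below writes records in FASTA format. It is a minimal Python illustration; the function name to_fasta, the record layout and the identifier DT.1 are assumptions, not part of GUS.

```python
def to_fasta(records, width=60):
    """Format (identifier, description, sequence) tuples as FASTA text."""
    lines = []
    for identifier, description, sequence in records:
        # Header line: '>' followed by the identifier and an optional description.
        lines.append(f">{identifier} {description}".rstrip())
        # Sequence lines are wrapped to a fixed width.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


fasta_text = to_fasta([("DT.1", "example assembly", "ACGTACGTAC")], width=4)
```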
Analysis & annotation
  • GO functional assignment
  • Expression analysis (PaGE)
  • Anatomy classification
  • Library distribution
  • Genes from BLAT of DoTS against genome
  • DoTS assembly and annotation
    • Refresh warehouse
    • Cluster and assemble mRNAs/ESTs into putative transcripts
    • Annotate transcripts through similarity, GO function and markers
    • Integrate previously existing manual curation
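
The clustering step can be illustrated with single-linkage clustering: any two sequences joined by a similarity hit end up in the same cluster. This Python sketch uses a union-find structure and is a generic technique chosen for illustration; it is not claimed to be the algorithm used to build DoTS, and the sequence identifiers are invented.

```python
def cluster(sequence_ids, similar_pairs):
    """Group sequences so that any two linked by a similarity pair share a cluster."""
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for s in sequence_ids:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in groups.values())


clusters = cluster(
    ["est1", "est2", "est3", "mrna1", "est4"],
    [("est1", "est2"), ("est2", "mrna1")],
)
```

Here est1 and mrna1 share a cluster only through est2, which is the transitive behaviour that lets overlapping ESTs be gathered into one putative transcript before assembly.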
DoTS Pipeline
  • Inputs: genomic sequence; mRNA/EST sequence
  • Genomic sequence: gene predictions (GenScan/HMMer, PHAT) give predicted genes
  • mRNA/EST sequence: clustering and assembly give DoTS consensus sequences; SIM4 or BLAT against the genome
  • Merge genes; gene/RNA cluster assignment; gene index
  • RNAs to proteins: framefinder translation
  • Annotate DoTS: BLAST similarities (BLASTX, BLASTP); protein motifs (PFAM, Smart, ProDom); functional predictions (GO functions); other computed annotation (EPCR, AssemblyAnatomyPercent, index key words, SNP analysis)
  • Manual annotation tasks

References & Acknowledgements
  • References
    • Scearce, L. Marie, Brestelli, John E., McWeeney, Shannon K., Lee, Catherine S., Mazzarelli, Joan, Pinney, Deborah F., Pizarro, Angel, Stoeckert, C. J. Jr., Clifton, Sandra, Permutt, M. Alan, Brown, Juliana, Melton, Douglas A., Kaestner, Klaus H. (2002) Functional Genomics of the Endocrine Pancreas: The Pancreas Clone Set and PancChip, New Resources for Diabetes Research. Diabetes 51: 1997-2004.
    • Schug, J., Diskin, S., Mazzarelli, J., Brunk, Brian P., Stoeckert, C.J. (2002) Predicting Gene Ontology Functions from ProDom and CDD Protein Domains. Genome Res. 2002 12: 648-655.
    • Bahl, A., Brunk, B., Coppel, R.L., Crabtree, J., Diskin, S.J., Fraunholz, M.J., Grant, G.R., Gupta, D., Huestis, R.L., Kissinger, J.C., Labo, P., Li, L., McWeeney, S.K., Milgram, A.J., Roos, D.S., Schug, J., Stoeckert, C.J. (2002) PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression and sequence data (both finished and unfinished). Nucleic Acids Res. 2002 30: 87-90
    • Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. (2001) Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data. Nature Genetics 29:365-371, 2001.
    • Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE, vol 4266, pp. 68-78.
    • Davidson, S.B., Crabtree, J., Brunk, Brian P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal: 40(2), p. 512-531.
    • Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C., Bucan, M. (2001) High-resolution BAC-based Map of the Central Portion of Mouse Chromosome 5. Genome Res. October 2001; 11: 1746-1757.
  • Acknowledgements
    • NIH grant RO1-HG-01539-03
    • DOE grant DE-FG02-00ER62893
    • Burroughs Wellcome Fund
    • NIDDK 56947 and 56954 with cosponsorship from the JDFI
Related posters
  • 114A. Web-Based Biological Discovery using the GUS Integrated Database.
  • 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars
  • 148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?