the ecocyc and metacyc pathway genome databases l.
Download
Skip this Video
Download Presentation
The EcoCyc and MetaCyc Pathway/Genome Databases

Loading in 2 Seconds...

play fullscreen
1 / 35

The EcoCyc and MetaCyc Pathway/Genome Databases - PowerPoint PPT Presentation


  • 355 Views
  • Uploaded on

The EcoCyc and MetaCyc Pathway/Genome Databases. Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com http://www.ai.sri.com/pkarp/ http://EcoCyc.org/. Overview. Motivations and terminology Pathway/genome databases BioCyc collection EcoCyc, MetaCyc

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'The EcoCyc and MetaCyc Pathway/Genome Databases' - Patman


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
the ecocyc and metacyc pathway genome databases

The EcoCyc and MetaCyc Pathway/Genome Databases

Peter D. Karp, Ph.D.

Bioinformatics Research Group

SRI International

pkarp@ai.sri.com

http://www.ai.sri.com/pkarp/

http://EcoCyc.org/

overview
Overview
  • Motivations and terminology
  • Pathway/genome databases
    • BioCyc collection
    • EcoCyc, MetaCyc
  • Pathway Tools software
  • Bioinformatics Database Warehouse project
slide3

A

E

what to do when theories become larger than minds can grasp
What to do When Theories BecomeLarger than Minds can Grasp?
  • Example: E. coli metabolic network
    • 160 pathways involving 744 reactions and 791 substrates
  • Example: E. coli genetic network
    • Control by 97 transcription factors of 1174 genes in 630 transcription units
  • Past solutions:
    • Partition theories across multiple minds
    • Encode theories in natural-language text
  • We cannot compute with theories in those forms
    • Evaluate theories for consistency with new data: microarrays
    • Refine theories with respect to new data
    • Compare theories describing different organisms
solution biological knowledge bases
Solution: Biological Knowledge Bases
  • Store biological knowledge and theories in computers in a declarative form
    • Amenable to computational analysis and generative user interfaces
  • Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
  • Accepted to store data in computers, but not knowledge
  • Such knowledge bases are an integral part of the scientific enterprise
pathway definition
Pathway Definition
  • Chemical reactions interconvert chemical compounds
  • An enzyme is a protein that accelerates chemical reactions
  • A pathway is a linked set of reactions
  • Often regulated as a unit
  • A conceptual unit of cell’s biochemical machine

A + B C + D

A C E

terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism

Pathway/Genome Database (PGDB) – MOD that combines information about

Pathways, reactions, substrates

Enzymes, transporters

Genes, replicons

Transcription factors, promoters, operons, DNA binding sites

BioCyc – Collection of 15 PGDBs at BioCyc.org

EcoCyc, AgroCyc, YeastCyc

Terminology
biocyc collection of pathway genome dbs
BioCyc Collection ofPathway/Genome DBs

Computationally Derived Datasets:

  • Agrobacterium tumefaciens
  • Caulobacter crescentus
  • Chlamydia trachomatis
  • Bacillus subtilis
  • Helicobacter pylori
  • Haemophilus influenzae
  • Mycobacterium tuberculosis RvH37
  • Mycobacterium tuberculosis CDC1551
  • Mycoplasma pneumonia
  • Pseudomonas aeruginosa
  • Saccharomyces cerevisiae
  • Treponema pallidum
  • Vibrio cholerae
  • Yellow = Open Database
  • Literature-based Datasets:
  • MetaCyc
  • Escherichia coli (EcoCyc)

http://BioCyc.org/

terminology pathway tools software
Terminology –Pathway Tools Software
  • PathoLogic
    • Prediction of metabolic network from genome
    • Computational creation of new Pathway/Genome Databases
  • Pathway/Genome Editors
    • Distributed curation of PGDBs
    • Distributed object database system, interactive editing tools
  • Pathway/Genome Navigator
    • WWW publishing of PGDBs
    • Querying, visualization of pathways, chromosomes, operons
    • Analysis operations
      • Pathway visualization of gene-expression data
      • Global comparisons of metabolic networks
  • Bioinformatics 18:S225 2002
pathway tools algorithms
Query, visualization and editing tools for these datatypes:

Full Metabolic Map

Paint gene expression data on metabolic network; compare metabolic networks

Pathways

Pathway prediction

Reactions

Balance checker

Compounds

Chemical substructure comparison

Enzymes, Transporters, Transcription Factors

Genes: Blast search

Chromosomes

Operons

Operon prediction

Pathway Tools Algorithms
model organism databases
Model Organism Databases
  • DBs that describe the genome and other information about an organism
  • Every sequenced organism with an active experimental community requires a MOD
    • Integrate genome data with information about the biochemical and genetic network of the organism
  • MODs are platforms for global analyses of an organism
    • Interpret gene expression data in a pathway context
    • Characterize systems properties of metabolic and genetic networks
    • Determine consistency of metabolic and transport networks
    • In silico prediction of essential genes
ecocyc project ecocyc org
EcoCyc Project – EcoCyc.org
  • E.coli Encyclopedia
    • Model-Organism Database for E. coli
    • Computational symbolic theory of E. coli
    • Electronic review article for E. coli – over 3500 literature citations
    • Tracks the evolving annotation of the E. coli genome
  • Collaborative development via Internet
    • Karp (SRI) -- Bioinformatics architect
    • John Ingraham -- Advisor
    • (SRI) Metabolic pathways
    • Saier (UCSD) and Paulsen (TIGR)-- Transport
    • Collado (UNAM)-- Regulation of gene expression
  • Database content: 18,000 objects
ecocyc e coli dataset pathway genome navigator
EcoCyc = E.coli Dataset + Pathway/Genome Navigator

Pathways: 165

Reactions: 2,760

Compounds: 774

Enzymes: 914

Transporters: 162

Promoters: 812

TransFac Sites: 956

Citations: 3,508

Proteins: 4,273

Transcription

Units: 724

Factors: 110

Genes: 4,393

http://EcoCyc.org/

ecocyc procedures
EcoCyc Procedures
  • All DB updates by 5 staff curators
    • Information gathered from biomedical literature
    • Corrections solicited from E. coli researchers
  • Review-level database
  • Four releases per year
  • Available through WWW site, as data files, as downloadable application
  • Quality assurance of data and software:
    • Evaluate database consistency constraints
    • Perform element balancing of reactions
    • Run other checking programs
    • Display every DB object
metacyc meta bolic en cyc lopedia
MetaCyc: Metabolic Encyclopedia
  • Nonredundant metabolic pathway database
  • Describe a representative sample of every experimentally determined metabolic pathway
  • Literature-based DB with extensive references and commentary
  • Pathways, reactions, enzymes, substrates
  • 460 pathways, 1267 enzymes, 4294 reactions
    • 172 E. coli pathways, 2735 citations
  • Nucleic Acids Research 30:59-61 2002.
  • Jointly developed by SRI and Carnegie Institution
    • New focus on plant pathways
pathway tools implementation details
Pathway Tools Implementation Details
  • Allegro Common Lisp
  • Sun and PC platforms
  • Ocelot object database
  • 250,000 lines of code
  • Lisp-based WWW server at BioCyc.org
    • Manages 15 PGDBs
pathway tools architecture
Pathway Tools Architecture

WWW

Server

X-Windows

Graphics

Object Editor

Pathway Editor

Reaction Editor

GFP API

Oracle

Pathway

Genome

Navigator

Object DBMS

ocelot knowledge server architecture
Ocelot Knowledge Server Architecture
  • Frame data model
    • Classes, instances, inheritance
    • Frames have slots that define their properties, attributes, relationships
    • A slot has one or more values
    • Each value can be any Lisp datatype
    • Slotunits define metadata about slots:
      • Domain, range, inverse
      • Collection type, number of values, value constraints
  • Transaction logging facility
  • Schema evolution
ocelot storage system architecture
Ocelot Storage System Architecture
  • Persistent storage via disk files, Oracle DBMS
    • Concurrent development: Oracle
    • Single-user development: disk files
    • Read-only delivery: bundle data into binary program
  • Oracle storage
    • DBMS is submerged within Ocelot, invisible to users
    • Relational schema is domain independent, supports multiple KBs simultaneously
    • Frames transferred from DBMS to Ocelot
      • On demand
      • By background prefetcher
      • Memory cache
      • Persistent disk cache to speed performance via Internet
the common lisp programming environment
The Common Lisp ProgrammingEnvironment
  • Gatt studied Lisp and Java implementation of 16 programs by 14 programmers (Intelligence 11:21 2000)
pathway genome dbs created by external users
Plasmodium falciparum, Stanford University

plasmocyc.stanford.edu

Mycobacterium tuberculosis, Stanford University

BioCyc.org

Arabidopsis thaliana and Synechosistis, Carnegie Institution of Washington

Arabidopsis.org:1555

Methanococcus janaschii, EBI

Maine.ebi.ac.uk:1555

Other PGDBs in progress by 24 other users

Software freely available

Each PGDB owned by its creator

Pathway/Genome DBs Created byExternal Users
global consistency checking of biochemical network
Global Consistency Checking of Biochemical Network
  • Given:
    • A PGDB for an organism
    • A set of initial metabolites
  • Infer:
    • What set of products can be synthesized by the small-molecule metabolism of the organism
  • Can known growth medium yield known essential compounds?
  • Pacific Symposium on Biocomputing p471 2001
algorithm forward propagation
Algorithm:Forward Propagation

Nutrient

set

Products

PGDB

reaction

pool

Transport

“Fire”

reactions

Metabolite

set

Reactants

results
Results
  • Phase I: Forward propagation
    • 21 initial compounds yielded only half of 38 essential compounds for E. coli
  • Phase II: Manually identify
    • Bugs in EcoCyc (e.g., two objects for tryptophan)
    • Missing initial protein substrates (e.g., ACP)
    • Missing pathways in EcoCyc
  • Phase III: Forward propagation with 11 more initial metabolites
    • Yielded all 38 essential compounds
nutrient related analysis validation of the ecocyc database
Nutrient-Related Analysis:Validation of the EcoCyc Database

Results on EcoCyc:

  • Phase I:
    • Essential compounds
      • produced 19
      • not produced 19
    • Total compounds
      • produced: (28%)
    • Reactions
      • Fired (31%)
missing essential compounds due to
Missing Essential Compounds Due To
  • Bugs in EcoCyc
  • Narrow conceptualization of the problem
    • Protein substrates
  • Incomplete biochemical knowledge
nutrient related analysis validation of the ecocyc database30
Nutrient-Related Analysis:Validation of the EcoCyc Database

Results on EcoCyc:

  • Phase II (After adding 11 extra metabolites):
    • Essential compounds
      • produced 38
      • not produced 0
    • Total compounds
      • produced: (49%)
      • not produced: (51%)
    • Reactions
      • Fired (58%)
      • Not fired (42%)
pathway tools misconceptions
Pathway Tools Misconceptions
  • PathoLogic
    • Does not re-annotate genomes
  • Pathway Tools does not handle quantitative information
  • Pathway/Genome Editors do not work through the web
humancyc human metabolic pathway database consortium
HumanCyc: Human Metabolic PathwayDatabase Consortium
  • Construct DB of human metabolic pathways using PathoLogic
  • Link to human genome web sites
  • Hire one curator to refine and curate with respect to literature over a 2 year period
    • Remove false-positive predictions
    • Insert known pathways missed by PathoLogic
    • Add comments and citations from pathways and enzymes to the literature
    • Add enzyme activators, inhibitors, cofactors, tissue information
  • Available as flatfiles and with Pathway/Genome Navigator
  • New versions to be released every 6 months
summary
Summary
  • Pathway/Genome Databases
    • MetaCyc non-redundant DB of literature-derived pathways
    • 14 organism-specific PGDBs available through SRI at BioCyc.org
    • Computational theories of biochemical machinery
  • Pathway Tools software
    • Extract pathways from genomes
    • Morph annotated genome into structured ontology
    • Distributed curation tools for MODs
    • Query, visualization, WWW publishing
biocyc and pathway tools availability
BioCyc and Pathway Tools Availability
  • WWW BioCyc freely available to all
    • BioCyc.org
    • Six BioCyc DBs openly available to all
  • BioCyc DBs freely available to non-profits
    • Flatfiles downloadable from BioCyc.org
    • Binary executable:
      • Sun UltraSparc-170 w/ 64MB memory
      • PC, 400MHz CPU, 64MB memory, Windows-98 or newer
    • PerlCyc API
  • Pathway Tools freely available to non-profits
acknowledgements
SRI

Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud

EcoCyc Project

Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier

MetaCyc Project

Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville

Stanford

Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh

Funding sources:

NIH National Center for Research Resources

NIH National Institute of General Medical Sciences

NIH National Human Genome Research Institute

Department of Energy Microbial Cell Project

DARPA BioSpice, UPC

Acknowledgements

BioCyc.org