The EcoCyc and MetaCyc Pathway/Genome Databases

Peter D. Karp, Ph.D.

Bioinformatics Research Group

SRI International

pkarp@ai.sri.com

http://www.ai.sri.com/pkarp/

http://EcoCyc.org/


Overview

  • Motivations and terminology

  • Pathway/genome databases

    • BioCyc collection

    • EcoCyc, MetaCyc

  • Pathway Tools software

  • Bioinformatics Database Warehouse project




What to do When Theories Become Larger than Minds can Grasp?

  • Example: E. coli metabolic network

    • 160 pathways involving 744 reactions and 791 substrates

  • Example: E. coli genetic network

    • Control by 97 transcription factors of 1174 genes in 630 transcription units

  • Past solutions:

    • Partition theories across multiple minds

    • Encode theories in natural-language text

  • We cannot compute with theories in those forms

    • Evaluate theories for consistency with new data: microarrays

    • Refine theories with respect to new data

    • Compare theories describing different organisms


Solution: Biological Knowledge Bases

  • Store biological knowledge and theories in computers in a declarative form

    • Amenable to computational analysis and generative user interfaces

  • Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases

  • Storing data in computers is accepted practice; storing knowledge is not yet

  • Such knowledge bases are an integral part of the scientific enterprise


Pathway Definition

  • Chemical reactions interconvert chemical compounds

  • An enzyme is a protein that accelerates chemical reactions

  • A pathway is a linked set of reactions

  • Often regulated as a unit

  • A conceptual unit of the cell’s biochemical machinery

A + B → C + D

A → C → E
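The definition above can be sketched in code. This is a minimal illustration only, assuming a linear pathway in which each reaction consumes at least one product of the previous one; the `Reaction` class, the `is_linked` helper, and the compound names are hypothetical and are not part of Pathway Tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction converting a set of reactants into a set of products."""
    name: str
    reactants: frozenset
    products: frozenset


def is_linked(pathway):
    """Return True if every reaction consumes a product of the one before it."""
    return all(
        bool(prev.products & nxt.reactants)
        for prev, nxt in zip(pathway, pathway[1:])
    )


# Hypothetical two-step pathway A -> C -> E, mirroring the slide's example.
r1 = Reaction("r1", frozenset({"A"}), frozenset({"C"}))
r2 = Reaction("r2", frozenset({"C"}), frozenset({"E"}))
pathway = [r1, r2]
```

Here `is_linked(pathway)` holds because the product C of the first reaction is a reactant of the second; reversing the order breaks the link.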


Terminology

  • Model Organism Database (MOD) – DB describing the genome and other information about an organism

  • Pathway/Genome Database (PGDB) – MOD that combines information about

    • Pathways, reactions, substrates

    • Enzymes, transporters

    • Genes, replicons

    • Transcription factors, promoters, operons, DNA binding sites

  • BioCyc – Collection of 15 PGDBs at BioCyc.org

    • EcoCyc, AgroCyc, YeastCyc


BioCyc Collection of Pathway/Genome DBs

Computationally Derived Datasets:

  • Agrobacterium tumefaciens

  • Caulobacter crescentus

  • Chlamydia trachomatis

  • Bacillus subtilis

  • Helicobacter pylori

  • Haemophilus influenzae

  • Mycobacterium tuberculosis H37Rv

  • Mycobacterium tuberculosis CDC1551

  • Mycoplasma pneumoniae

  • Pseudomonas aeruginosa

  • Saccharomyces cerevisiae

  • Treponema pallidum

  • Vibrio cholerae

  • Yellow = Open Database

Literature-based Datasets:

  • MetaCyc

  • Escherichia coli (EcoCyc)

http://BioCyc.org/


Terminology – Pathway Tools Software

  • PathoLogic

    • Prediction of metabolic network from genome

    • Computational creation of new Pathway/Genome Databases

  • Pathway/Genome Editors

    • Distributed curation of PGDBs

    • Distributed object database system, interactive editing tools

  • Pathway/Genome Navigator

    • WWW publishing of PGDBs

    • Querying, visualization of pathways, chromosomes, operons

    • Analysis operations

      • Pathway visualization of gene-expression data

      • Global comparisons of metabolic networks

  • Bioinformatics 18:S225 2002


Pathway Tools Algorithms

Query, visualization and editing tools for these datatypes:

  • Full Metabolic Map

    • Paint gene expression data on metabolic network; compare metabolic networks

  • Pathways

    • Pathway prediction

  • Reactions

    • Balance checker

  • Compounds

    • Chemical substructure comparison

  • Enzymes, Transporters, Transcription Factors

  • Genes

    • Blast search

  • Chromosomes

  • Operons

    • Operon prediction


Model Organism Databases

  • DBs that describe the genome and other information about an organism

  • Every sequenced organism with an active experimental community requires a MOD

    • Integrate genome data with information about the biochemical and genetic network of the organism

  • MODs are platforms for global analyses of an organism

    • Interpret gene expression data in a pathway context

    • Characterize systems properties of metabolic and genetic networks

    • Determine consistency of metabolic and transport networks

    • In silico prediction of essential genes


EcoCyc Project – EcoCyc.org

  • E. coli Encyclopedia

    • Model-Organism Database for E. coli

    • Computational symbolic theory of E. coli

    • Electronic review article for E. coli – over 3500 literature citations

    • Tracks the evolving annotation of the E. coli genome

  • Collaborative development via Internet

    • Karp (SRI) -- Bioinformatics architect

    • John Ingraham -- Advisor

    • (SRI) -- Metabolic pathways

    • Saier (UCSD) and Paulsen (TIGR) -- Transport

    • Collado (UNAM) -- Regulation of gene expression

  • Database content: 18,000 objects


EcoCyc = E. coli Dataset + Pathway/Genome Navigator

  • Pathways: 165

  • Reactions: 2,760

  • Compounds: 774

  • Enzymes: 914

  • Transporters: 162

  • Promoters: 812

  • TransFac Sites: 956

  • Citations: 3,508

  • Proteins: 4,273

  • Transcription Units: 724

  • Transcription Factors: 110

  • Genes: 4,393

http://EcoCyc.org/


EcoCyc Procedures

  • All DB updates by 5 staff curators

    • Information gathered from biomedical literature

    • Corrections solicited from E. coli researchers

  • Review-level database

  • Four releases per year

  • Available through WWW site, as data files, as downloadable application

  • Quality assurance of data and software:

    • Evaluate database consistency constraints

    • Perform element balancing of reactions

    • Run other checking programs

    • Display every DB object
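The element-balancing check listed above can be illustrated with a short sketch. It is not the EcoCyc balance checker: it assumes flat formulas such as `C6H12O6` with no parentheses, charges, or generic groups, and the function names `element_counts` and `is_balanced` are hypothetical.

```python
import re
from collections import Counter


def element_counts(formula):
    """Count atoms in a flat formula such as 'C6H12O6' (no parentheses or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum element counts over one side, given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in element_counts(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(reactants, products):
    """A reaction is balanced when both sides contain the same atoms."""
    return side_total(reactants) == side_total(products)


# Glucose oxidation: C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O
balanced = is_balanced([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")])
# Deliberately unbalanced: H2 + O2 -> H2O
unbalanced = is_balanced([(1, "H2"), (1, "O2")], [(1, "H2O")])
```

A check of this kind flags reactions whose stored equation drops or duplicates a substrate, which is why it is useful as a routine quality-assurance step.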


MetaCyc: Metabolic Encyclopedia

  • Nonredundant metabolic pathway database

  • Describe a representative sample of every experimentally determined metabolic pathway

  • Literature-based DB with extensive references and commentary

  • Pathways, reactions, enzymes, substrates

  • 460 pathways, 1267 enzymes, 4294 reactions

    • 172 E. coli pathways, 2735 citations

  • Nucleic Acids Research 30:59-61 2002.

  • Jointly developed by SRI and Carnegie Institution

    • New focus on plant pathways


Family of Pathway/Genome Databases

MetaCyc


Pathway Tools Implementation Details

  • Allegro Common Lisp

  • Sun and PC platforms

  • Ocelot object database

  • 250,000 lines of code

  • Lisp-based WWW server at BioCyc.org

    • Manages 15 PGDBs


Pathway Tools Architecture

[Architecture diagram. Component labels: WWW Server; X-Windows Graphics; Pathway/Genome Navigator; Object Editor; Pathway Editor; Reaction Editor; GFP API; Object DBMS; Oracle]


Ocelot Knowledge Server Architecture

  • Frame data model

    • Classes, instances, inheritance

    • Frames have slots that define their properties, attributes, relationships

    • A slot has one or more values

    • Each value can be any Lisp datatype

    • Slotunits define metadata about slots:

      • Domain, range, inverse

      • Collection type, number of values, value constraints

  • Transaction logging facility

  • Schema evolution
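The frame data model above can be sketched as follows. This is an illustration in Python rather than Lisp and does not reproduce the Ocelot API: the `Frame` class, the `INVERSES` table (a stand-in for the inverse metadata a slotunit would carry), and all frame and slot names are assumptions made for the example.

```python
class Frame:
    """A frame with named slots; each slot holds a list of values."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent  # class frame this frame inherits from, if any
        self.slots = {slot: list(values) for slot, values in slots.items()}

    def get(self, slot):
        """Return slot values, falling back to the parent frame (inheritance)."""
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)
        return []


# Hypothetical slot metadata: slot name -> name of its inverse slot.
INVERSES = {"catalyzes": "catalyzed-by", "catalyzed-by": "catalyzes"}


def add_value(frame, slot, value):
    """Add a value to a slot and keep the inverse slot on the other frame in sync."""
    frame.slots.setdefault(slot, []).append(value)
    inverse = INVERSES.get(slot)
    if inverse and isinstance(value, Frame):
        value.slots.setdefault(inverse, []).append(frame)


# A class frame, an instance that inherits from it, and a related instance.
enzymes = Frame("Enzymes", comment=["Proteins that catalyze reactions"])
enzyme_1 = Frame("enzyme-1", parent=enzymes)
reaction_1 = Frame("reaction-1")
add_value(enzyme_1, "catalyzes", reaction_1)
```

The instance `enzyme_1` has no `comment` slot of its own, so `enzyme_1.get("comment")` is answered by the class frame, and adding the `catalyzes` value also fills `catalyzed-by` on the reaction.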


Ocelot Storage System Architecture

  • Persistent storage via disk files, Oracle DBMS

    • Concurrent development: Oracle

    • Single-user development: disk files

    • Read-only delivery: bundle data into binary program

  • Oracle storage

    • DBMS is submerged within Ocelot, invisible to users

    • Relational schema is domain independent, supports multiple KBs simultaneously

    • Frames transferred from DBMS to Ocelot

      • On demand

      • By background prefetcher

      • Memory cache

      • Persistent disk cache to speed performance via Internet
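The on-demand transfer and memory cache described above might look like the following sketch. It is not Ocelot code: the `FrameStore` class is hypothetical, a plain dictionary stands in for the DBMS, and the prefetcher runs synchronously here rather than in the background.

```python
class FrameStore:
    """Fetch frames from a slow backing store on demand and cache them in memory."""

    def __init__(self, backend):
        self.backend = backend   # mapping-like object standing in for the DBMS
        self.cache = {}          # frames already transferred
        self.backend_reads = 0   # how many times the backing store was consulted

    def get(self, frame_id):
        """Return a frame, reading the backing store only on a cache miss."""
        if frame_id not in self.cache:
            self.backend_reads += 1
            self.cache[frame_id] = self.backend[frame_id]
        return self.cache[frame_id]

    def prefetch(self, frame_ids):
        """Warm the cache (a synchronous stand-in for a background prefetcher)."""
        for frame_id in frame_ids:
            self.get(frame_id)


# Hypothetical frames keyed by ID.
backend = {"frame-1": {"name": "one"}, "frame-2": {"name": "two"}}
store = FrameStore(backend)
store.get("frame-1")
store.get("frame-1")                   # second call is served from the cache
reads_after_repeat = store.backend_reads
store.prefetch(["frame-1", "frame-2"])  # only frame-2 needs a backend read
```

The point of the design is that users see only frames; whether a frame came from the cache or the DBMS is invisible to them.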


The Common Lisp Programming Environment

  • Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)


EcoCyc WWW Server


Pathway/Genome DBs Created by External Users

  • Plasmodium falciparum, Stanford University

    • plasmocyc.stanford.edu

  • Mycobacterium tuberculosis, Stanford University

    • BioCyc.org

  • Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington

    • Arabidopsis.org:1555

  • Methanococcus jannaschii, EBI

    • Maine.ebi.ac.uk:1555

  • Other PGDBs in progress by 24 other users

  • Software freely available

  • Each PGDB owned by its creator


Global Consistency Checking of Biochemical Network

  • Given:

    • A PGDB for an organism

    • A set of initial metabolites

  • Infer:

    • What set of products can be synthesized by the small-molecule metabolism of the organism

  • Can known growth medium yield known essential compounds?

  • Pacific Symposium on Biocomputing p471 2001


Algorithm: Forward Propagation

[Diagram labels: Nutrient set; Transport; Metabolite set; Reactants; PGDB reaction pool; “Fire” reactions; Products]
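Forward propagation can be sketched as a fixed-point loop: a reaction "fires" once all of its reactants are in the metabolite set, and its products are then added to that set. The code below is a minimal illustration under stated assumptions, not the published implementation: reactions are plain (reactants, products) pairs, transport steps are treated as ordinary reactions, and the compound names are toy placeholders.

```python
def forward_propagate(nutrients, reactions):
    """Return (reachable metabolites, fired reactions) for a nutrient set.

    reactions: iterable of (reactants, products) pairs, each a set of names.
    """
    metabolites = set(nutrients)
    pending = list(reactions)
    fired = []
    changed = True
    while changed:  # repeat until a full pass fires nothing new
        changed = False
        still_pending = []
        for reactants, products in pending:
            if set(reactants) <= metabolites:
                metabolites |= set(products)  # products become available
                fired.append((reactants, products))
                changed = True
            else:
                still_pending.append((reactants, products))
        pending = still_pending
    return metabolites, fired


# Toy network: C -> D is listed first but can only fire after A + B -> C.
toy_reactions = [({"C"}, {"D"}), ({"A", "B"}, {"C"}), ({"X"}, {"Y"})]
reachable, fired = forward_propagate({"A", "B"}, toy_reactions)

# Essential compounds that the nutrient set cannot yield.
essential = {"D", "Y"}
missing = essential - reachable
```

In this toy network D is produced but Y is not, because nothing supplies X; missing essential compounds of this kind are what point to database bugs, overlooked initial substrates, or gaps in pathway knowledge.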


Results

  • Phase I: Forward propagation

    • 21 initial compounds yielded only half of 38 essential compounds for E. coli

  • Phase II: Manually identify

    • Bugs in EcoCyc (e.g., two objects for tryptophan)

    • Missing initial protein substrates (e.g., ACP)

    • Missing pathways in EcoCyc

  • Phase III: Forward propagation with 11 more initial metabolites

    • Yielded all 38 essential compounds


Nutrient-Related Analysis: Validation of the EcoCyc Database

Results on EcoCyc:

  • Phase I:

    • Essential compounds

      • produced: 19

      • not produced: 19

    • Total compounds

      • produced: (28%)

    • Reactions

      • Fired (31%)


Missing Essential Compounds Due To

  • Bugs in EcoCyc

  • Narrow conceptualization of the problem

    • Protein substrates

  • Incomplete biochemical knowledge


Nutrient-Related Analysis: Validation of the EcoCyc Database

Results on EcoCyc:

  • Phase II (After adding 11 extra metabolites):

    • Essential compounds

      • produced: 38

      • not produced: 0

    • Total compounds

      • produced: (49%)

      • not produced: (51%)

    • Reactions

      • Fired (58%)

      • Not fired (42%)


Pathway Tools Misconceptions

  • PathoLogic

    • Does not re-annotate genomes

  • Pathway Tools does not handle quantitative information

  • Pathway/Genome Editors do not work through the web


HumanCyc: Human Metabolic Pathway Database Consortium

  • Construct DB of human metabolic pathways using PathoLogic

  • Link to human genome web sites

  • Hire one curator to refine and curate with respect to literature over a 2 year period

    • Remove false-positive predictions

    • Insert known pathways missed by PathoLogic

    • Add comments and citations from pathways and enzymes to the literature

    • Add enzyme activators, inhibitors, cofactors, tissue information

  • Available as flatfiles and with Pathway/Genome Navigator

  • New versions to be released every 6 months


Summary

  • Pathway/Genome Databases

    • MetaCyc non-redundant DB of literature-derived pathways

    • 14 organism-specific PGDBs available through SRI at BioCyc.org

    • Computational theories of biochemical machinery

  • Pathway Tools software

    • Extract pathways from genomes

    • Morph annotated genome into structured ontology

    • Distributed curation tools for MODs

    • Query, visualization, WWW publishing


BioCyc and Pathway Tools Availability

  • WWW BioCyc freely available to all

    • BioCyc.org

    • Six BioCyc DBs openly available to all

  • BioCyc DBs freely available to non-profits

    • Flatfiles downloadable from BioCyc.org

    • Binary executable:

      • Sun UltraSparc-170 w/ 64MB memory

      • PC, 400MHz CPU, 64MB memory, Windows-98 or newer

    • PerlCyc API

  • Pathway Tools freely available to non-profits


Acknowledgements

  • SRI

    • Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud

  • EcoCyc Project

    • Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier

  • MetaCyc Project

    • Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville

  • Stanford

    • Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh

  • Funding sources:

    • NIH National Center for Research Resources

    • NIH National Institute of General Medical Sciences

    • NIH National Human Genome Research Institute

    • Department of Energy Microbial Cell Project

    • DARPA BioSpice, UPC

BioCyc.org

