
Pathway/Genome Databases and Software Tools

Peter D. Karp, Ph.D., Bioinformatics Research Group, SRI International. pkarp@ai.sri.com. http://ecocyc.DoubleTwist.com/ecocyc/


Presentation Transcript


  1. Pathway/Genome Databases and Software Tools Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com http://ecocyc.DoubleTwist.com/ecocyc/

  2. Overview • Overview of bioinformatics • Motivations for the EcoCyc project • EcoCyc demo • Description of EcoCyc database and Pathway Tools software • Underlying technologies • Ocelot object database • GKB Editor • X-windows to WWW translator

  3. Definition of Bioinformatics • Computational techniques for management and analysis of biological data and knowledge • Methods for disseminating, archiving, interpreting, and mining scientific information

  4. Motivations for Bioinformatics • Growth in molecular-biology knowledge • Industrialization of biological experimentation • High-throughput biology • Genome sequences • Gene and protein expression data • Protein-protein interaction data • Protein 3-D structures • ….

  5. [Diagram; only the labels "A" and "E" survive in the transcript]

  6. Motivations for EcoCyc -- E. coli Encyclopedia • Integrate E. coli information dispersed in the literature • New paradigm of scientific publishing • Model the full metabolic network of an organism • Integrate genomic data with functional data • Develop algorithms for computing with function • Provide a challenging domain for computer-science research

  7. Definitions • A chemical reaction interconverts chemical compounds • An enzyme is a protein that accelerates chemical reactions • A pathway is a linked set of reactions • A conceptual unit of the cell's biochemical machine • Example reaction: A + B = C + D

  8. Organism-Specific Pathway/Genome Databases • Layer functional information above the genome • Rich ontology to encode biological information with high fidelity • Chromosomes, genes, operons, gene products, reactions, pathways • Curated by experts for that organism • Integrate literature and computational predictions

  9. Pathway Tools Software • Pathway/Genome Navigator • WWW publishing of PGDBs • Graphic depictions of pathways, chromosomes, operons • Pathway visualization of gene-expression data • Pathway/Genome Editors • Distributed curation of genome annotations • Distributed object database system • Interactive editing tools • PathoLogic • Prediction of metabolic network from genome

  10. EcoCyc = E. coli Dataset + Pathway/Genome Navigator • Operons: 375 • Metabolic Network • Pathways: 158 • Reactions: 1,117 • Compounds: 1,887 • Gene Products: 4,393 • Genes: 4,393 • http://ecocyc.DoubleTwist.com/ecocyc/

  11. EcoCyc • Collaborative development via internet • Karp -- Bioinformatics architect • Riley -- Metabolic pathways, signal transduction • Saier and Paulsen -- Transport • Collado -- Regulation of gene expression • Ontology of 1000 biological classes • 14,000 instances • Over 2,600 registered users

  12. Pathway Tools Software • Pathway/Genome Navigator • Pathway/Genome Databases • PathoLogic Pathway Predictor • Pathway/Genome Editors

  13. Creation of the Overview Graph • Run layout algorithms on individual pathway graphs • Automatically determine topology of pathway graph • Apply associated layout algorithm (linear, circular, tidy tree) • Use superpathways to create hierarchical layouts • Treat each individual pathway as a single node • Pathway connections are edges • Run appropriate layout algorithm • Manually position the resulting pathway clusters
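The "automatically determine topology" step on slide 13 can be illustrated with a small sketch. This is not the Pathway Tools implementation (which the deck says is written in Common Lisp); it is a minimal Python illustration that assumes a pathway is given as a list of (substrate, product) edges, ignores edge direction, and uses the hypothetical function name `classify_topology`.

```python
from collections import defaultdict


def classify_topology(edges):
    """Classify a pathway graph as 'linear', 'circular', 'tree', or 'other'.

    edges: list of (substrate, product) pairs. Direction is ignored, and
    self-loops are assumed not to occur (an assumption of this sketch).
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = list(neighbors)
    if not nodes:
        return "other"

    # Depth-first search to confirm the graph is one connected piece.
    seen = set()
    stack = [nodes[0]]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(neighbors[node] - seen)
    if len(seen) != len(nodes):
        return "other"

    n_edges = len({frozenset(edge) for edge in edges})
    degrees = [len(neighbors[n]) for n in nodes]
    # A connected graph with n-1 edges is a tree; a tree with no branching is a path.
    if n_edges == len(nodes) - 1:
        return "linear" if max(degrees) <= 2 else "tree"
    # A connected graph where every node has degree 2 is a single cycle.
    if n_edges == len(nodes) and all(d == 2 for d in degrees):
        return "circular"
    return "other"
```

Each result would then select the matching layout routine named on the slide (linear, circular, or tidy tree); how graphs that fit none of these are handled is not stated in the deck.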

  14. Inference of Metabolic Pathways (diagram labels) • Annotated Genome • Structured ASCII Text File • List of Gene Products • List of Genes/ORFs • DNA Sequence • Pathway/Genome Database • MetaCyc • Metabolic Network • Pathway • PathoLogic • Compounds • Reactions • Gene Products • Genes • Reports • Genomic Map

  15. Summary of H. pylori Analysis • For 121 E. coli pathways, what is the evidence that each pathway occurs in H. pylori? • Strong evidence: 41 • Medium evidence: 29 • Little or no evidence: 51 • 31 reactions catalyzed by H. pylori but not by E. coli • H. pylori has partial abilities to synthesize cofactors and amino acids, extremely limited carbohydrate catabolism, some amino acid utilization, and a reductive citric acid pathway

  16. Microbial Pathway/Genome DBs • Literature-based Datasets: • MetaCyc • Escherichia coli • PathoLogic-based Datasets: • Bacillus subtilis • Mycobacterium tuberculosis • Helicobacter pylori • Haemophilus influenzae • Mycoplasma pneumoniae • Treponema pallidum • Chlamydia trachomatis • Saccharomyces cerevisiae

  17. Pathway Tools Software Architecture • Implemented in Common Lisp • WWW server runs as a single Unix process with a separate thread to service each query • Grasper-CL graph manager • Ocelot object database • GKB Editor schema-driven editor

  18. EcoCyc WWW Server

  19. Pathway Tools Architecture: Development Configuration (diagram labels) • WWW Server • X-Windows Graphics • Object Editor • Pathway Editor • Reaction Editor • GFP API • Oracle • Pathway/Genome Navigator • Ocelot DBMS

  20. Ocelot Database System • Object Database Manager • Persistence via filesystem or relational DBMS • Demand and background faulting of objects from RDBMS • Two-level object caching • Extensive bioinformatics schema • Stored transaction history • Inspect object history

  21. Ocelot Knowledge Server Architecture • Frame data model • Persistent storage via • Disk files • Oracle DBMS • Optimistic concurrency-control protocol • Schema evolution • Logging facility
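Slide 21 names an optimistic concurrency-control protocol but does not describe it. The sketch below shows the general technique only, under stated assumptions: each frame carries a version number, a transaction remembers the version it first saw, and a commit is rejected if any of those versions has changed. The names `VersionedStore`, `Transaction`, and `ConflictError` are hypothetical and are not Ocelot's API.

```python
class ConflictError(Exception):
    """Raised when another transaction changed a frame this one depends on."""


class VersionedStore:
    def __init__(self):
        self.data = {}  # frame id -> (version, value)

    def read(self, key):
        # Frames that were never written are reported as version 0.
        return self.data.get(key, (0, None))

    def commit(self, read_versions, writes):
        # Validation phase: every frame must still be at the version we saw.
        for key, seen_version in read_versions.items():
            current_version, _ = self.read(key)
            if current_version != seen_version:
                raise ConflictError(key)
        # Write phase: install new values and bump their versions.
        for key, value in writes.items():
            version, _ = self.read(key)
            self.data[key] = (version + 1, value)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # frame id -> version at first touch
        self.writes = {}         # frame id -> pending value

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        version, value = self.store.read(key)
        self.read_versions.setdefault(key, version)
        return value

    def put(self, key, value):
        # Record the version at first touch so blind writes are validated too.
        if key not in self.read_versions:
            self.read_versions[key] = self.store.read(key)[0]
        self.writes[key] = value

    def commit(self):
        self.store.commit(self.read_versions, self.writes)
```

No locks are held while a curator edits; the cost is that the second of two conflicting commits fails and must be retried, which suits workloads where conflicts are rare.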

  22. The Frame Data Model • Frames are of two types: classes, instances • Frames have slots that define their properties, attributes, relationships • A slot has one or more values • Each value can be any Lisp datatype • Slotunits define metadata about slots: • Domain, range, inverse • Collection type, number of values, value constraints

  23. Inference Capabilities • Inheritance of defaults • Slot values computed via attached procedures • Maintenance of inverse relationships • Constraint system • Deferred evaluation • Tolerant of nonconformant data
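Two of the capabilities from slides 22 and 23, inheritance of defaults and maintenance of inverse relationships, can be sketched briefly. This is an illustrative Python model, not Ocelot itself; the class names `Frame` and `KB`, the method names, and the example frame and slot names in the usage note are all assumptions made for the sketch.

```python
class Frame:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # frames this frame inherits from
        self.slots = {}               # slot name -> list of values


class KB:
    def __init__(self):
        self.frames = {}
        self.inverses = {}  # slot name -> inverse slot name (slotunit-style metadata)

    def create_frame(self, name, parents=()):
        frame = Frame(name, [self.frames[p] for p in parents])
        self.frames[name] = frame
        return frame

    def declare_inverse(self, slot, inverse):
        self.inverses[slot] = inverse
        self.inverses[inverse] = slot

    def add_value(self, frame_name, slot, value):
        values = self.frames[frame_name].slots.setdefault(slot, [])
        if value not in values:
            values.append(value)
        # Maintain the inverse relationship when the value names another frame.
        inverse = self.inverses.get(slot)
        if inverse and isinstance(value, str) and value in self.frames:
            back = self.frames[value].slots.setdefault(inverse, [])
            if frame_name not in back:
                back.append(frame_name)

    def get_values(self, frame_name, slot):
        # Local values win; otherwise inherit from the nearest ancestor
        # that defines the slot (breadth-first search up the parents).
        queue = [self.frames[frame_name]]
        visited = set()
        while queue:
            frame = queue.pop(0)
            if frame.name in visited:
                continue
            visited.add(frame.name)
            if slot in frame.slots:
                return list(frame.slots[slot])
            queue.extend(frame.parents)
        return []
```

For example, after `kb.declare_inverse("product", "gene")`, adding the value `"protein-1"` to the `product` slot of a frame `gene-1` also records `gene-1` in the `gene` slot of `protein-1`, and a gene instance with no local value for a slot falls back to the default stored on its class.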

  24. Storage System Architecture • Oracle KBs • DBMS is submerged within FRS • Relational schema is domain independent, supports multiple KBs simultaneously • Frames transferred from DBMS to Ocelot • On demand • By background prefetcher • Memory cache • Persistent disk cache to speed performance via Internet

  25. Frame Faulting (get-slot-value gene 'map-position) • Gene present in in-memory object cache? • Gene present in cache on local disk? • Query Oracle DBMS
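The three-step lookup on slide 25 can be sketched as a tiered cache. This is a Python illustration under stated assumptions, not the Ocelot code: plain dictionaries stand in for the local disk cache and for the Oracle DBMS, and `FaultingStore`, `fault_in`, and the `trace` list are hypothetical names added for the example.

```python
class FaultingStore:
    def __init__(self, dbms, disk_cache=None):
        self.memory = {}  # in-memory object cache
        self.disk = disk_cache if disk_cache is not None else {}  # stand-in for the local disk cache
        self.dbms = dbms  # stand-in for the relational DBMS
        self.trace = []   # records which tier served each lookup

    def fault_in(self, frame_id):
        # 1. Frame already in the in-memory cache?
        if frame_id in self.memory:
            self.trace.append("memory")
            return self.memory[frame_id]
        # 2. Frame in the cache on local disk?
        if frame_id in self.disk:
            self.trace.append("disk")
            frame = self.disk[frame_id]
        else:
            # 3. Fall back to querying the DBMS, then fill the disk cache.
            self.trace.append("dbms")
            frame = self.dbms[frame_id]
            self.disk[frame_id] = frame
        self.memory[frame_id] = frame
        return frame

    def get_slot_value(self, frame_id, slot):
        return self.fault_in(frame_id).get(slot)
```

The first access to a frame pays for the DBMS query; later accesses in the same session are served from memory, and a new session that shares the disk cache skips the network round trip.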

  26. Logging • Oracle DBMS stores: • The latest version of each frame • A history of all OKBC operations applied to KB • Reconstruct earlier versions of KB • View history of changes to an object • Update replicates • Concurrency control
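Reconstructing an earlier version of a KB from a stored operation history, as slide 26 describes, amounts to replaying a prefix of the log. The sketch below assumes a simplified log of `(operation, frame, slot, value)` tuples and three operation names modeled loosely on OKBC; the real log format and the functions `replay` and `frame_history` are assumptions of this example.

```python
def replay(history, upto=None):
    """Rebuild the KB state after the first `upto` operations (all if None)."""
    kb = {}
    for op, frame, slot, value in history[:upto]:
        if op == "create-frame":
            kb.setdefault(frame, {})
        elif op == "put-slot-value":
            kb.setdefault(frame, {})[slot] = value
        elif op == "delete-frame":
            kb.pop(frame, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return kb


def frame_history(history, frame):
    """Return every logged operation that touched one frame, in order."""
    return [entry for entry in history if entry[1] == frame]
```

Replaying the whole log yields the latest version, replaying a shorter prefix yields an earlier one, and filtering by frame gives the per-object change history listed on the slide.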

  27. Schema Management • FRSs store and process class and instance information similarly • Applications can query schema information as easily as they can query instances

  28. GKB Editor • Browser and editor for KBs and ontologies • Four editing tools • GKB Editor reusable with multiple FRSs • All database queries via OKBC/GFP API • Interoperability achieved with Ocelot, LOOM, Ontolingua • All operations are schema driven • http://www.ai.sri.com/~gkb/overview.html

  29. Editors • Taxonomy editor • Frame editor • Relationships editor • Spreadsheet editor

  30. Results • Ocelot in use in the EcoCyc project for 5 years • Supports collaborative development of EcoCyc by four groups in North America • Distributed architecture • GKB Editor in active use • Supports development of 8 Pathway/Genome Databases

  31. Summary • Pathway/Genome Databases • Pathway Tools software • Extract pathways from genomes • Distributed curation tools • Query, visualization, WWW publishing • Analysis algorithms

  32. Computer Science Results • Extend scalability and multiuser access for knowledge representation systems • Reusable, schema-driven KB editor • Hierarchical graph layout algorithms • Dynamic translation from X-windows to HTML+GIF • Importance of ontologies and of content: • Discovery = Algorithm + Database

  33. Problem Solving Depends on Algorithms and Content (diagram labels) • Compute Time • Algorithm Quality • Solution Quality • Database Size and Quality

  34. Bioinformatics Results: Content • The EcoCyc database describes the full metabolic map of an organism • The MetaCyc database describes over 300 metabolic pathways • Ontology spans genome to pathway information

  35. Bioinformatics Results: Algorithms • Software environment for genome and pathway information • Query and visualization • Distributed database development • PathoLogic algorithm predicts the metabolic network of an organism from its genome • Algorithms under development for qualitative modeling of the cell

  36. Acknowledgements • Funding sources: • NIH National Center for Research Resources • Collaborators: • Monica Riley, Marine Biological Laboratory • Milton Saier, UC San Diego • Julio Collado, UNAM • Christos Ouzounis, European Bioinformatics Institute Peter D. Karp, Ph.D. http://www.ai.sri.com/pkarp/ pkarp@ai.sri.com
