Computing with Pathway/Genome Databases


  1. Computing with Pathway/Genome Databases

  2. Approx. presentation time: 1.5 hrs

  3. Overview • Summary of Pathway Tools data access mechanisms and formats • Pathway Tools APIs • Overview of Pathway Tools schema

  4. Motivations for Understanding the Schema • When writing complex queries to PGDBs, those queries must refer to classes and slots within the schema • Queries using the Lisp, Perl, and Java APIs • Queries using the Structured Advanced Query Form • Queries using BioVelo • Example: collect the products of all genes that span more than 1,000 base pairs

      (loop for g in (get-class-all-instances '|Genes|)
            when (< 1000 (abs (- (get-slot-value g 'left-end-position)
                                 (get-slot-value g 'right-end-position))))
            collect (get-slot-value g 'product))

  5. More Information • Pathway Tools Web Site, Tutorial Slides • http://bioinformatics.ai.sri.com/ptools/ • PTools APIs: http://brg.ai.sri.com/ptools/ptools-resources.html • Web services: http://biocyc.org/web-services.shtml • Guide to the Pathway Tools Schema • http://biocyc.org/schema.shtml • Curator's Guide • http://bioinformatics.ai.sri.com/ptools/curatorsguide.pdf

  6. References • Ontology Papers section of http://biocyc.org/publications.shtml • "An Evidence Ontology for use in Pathway/Genome Databases" • "An ontology for biological function based on molecular interactions" • "Representations of metabolic knowledge: Pathways" • "Representations of metabolic knowledge"

  7. Data Exchange • APIs: Lisp API, Java API, and Perl API • Read and modify access • Web services • Cyclone • Export to files • BioPAX export -- biopax.org • Export PGDB genome to GenBank format • Export entire PGDB as column-delimited and attribute-value file formats • Export PGDB reactions as SBML -- sbml.org • Import/Export of Pathways: between PGDBs • Import/Export of Selected Frames, for Spreadsheets • Import/Export of Compounds as Molfile, CML • BioWarehouse: Loader for Flatfiles, SQL access • http://bioinformatics.ai.sri.com/biowarehouse/ • BMC Bioinformatics 7:170 2006

  8. Pathway Tools Ontology / Schema • Ontology classes: 1621 • Datatype classes: Define objects from genomes to pathways • Classification systems for pathways, chemical compounds, enzymatic reactions (EC system) • Protein Feature ontology • Controlled vocabularies: • Cell Component Ontology • Evidence codes • Comprehensive set of 279 attributes and relationships

  9. High-Level Classes in the Pathway Tools Ontology • Chemicals -- All molecules • Polymer-Segments -- Regions of polymers • Protein-Features -- Features on proteins • Organisms • Reactions -- Biochemical reactions • Enzymatic-Reactions -- Link enzymes to reactions they catalyze • Pathways -- Metabolic and signaling pathways • Regulation -- Regulatory interactions • CCO -- Cell Component Ontology • Evidence -- Evidence ontology • Gene-Ontology-Terms -- GO • Growth-Observations -- Observations of growth of organism • Notes -- Timestamped, person-stamped notes • Organizations, People • Publications

  10. Navigating the Schema

  11. Use GKB Editor to Inspect the Pathway Tools Ontology • GKB Editor = Generic Knowledge Base Editor • Type in Navigator window: (GKB) or • [Right-Click] Edit->Ontology Editor • View->Browse Class Hierarchy • [Middle-Click] to expand hierarchy • To view classes or instances, select them and: • Frame -> List Frame Contents • Frame -> Edit Frame

  12. Use the SAQP to Inspect the Schema

  13. Pathway Tools Schema • Guide to the Pathway Tools Schema • Schema overview diagram

  14. Principal Classes • Class names are capitalized, plural, separated by dashes • Genetic-Elements, with subclasses: • Chromosomes • Plasmids • Genes • Transcription-Units • RNAs • rRNAs, snRNAs, tRNAs, Charged-tRNAs • Proteins, with subclasses: • Polypeptides • Protein-Complexes

  15. Principal Classes • Reactions, with subclasses: • Transport-Reactions • Enzymatic-Reactions • Pathways • Compounds-And-Elements

  16. Principal Classes • Regulation

  17. Slot Links • [Diagram: frames connected by slots] • Frames shown: TCA Cycle; Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; Succinate dehydrogenase; Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; sdhC, sdhD, sdhA, sdhB • Slot labels shown: in-pathway, reaction, catalyzes, component-of, product
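
The slot links named on this slide can be read as a path from a gene to a pathway. The sketch below is a toy in-memory frame store in Java, not the Ocelot, GFP, or JavaCyc API. The frame IDs, the direction in which each slot points, and the pairing of sdhC and sdhD with the two membrane subunits are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy frame store: frame name -> slot name -> values.
// This is NOT the Ocelot or JavaCyc API; every frame ID here is illustrative.
class SlotLinkDemo {
    static final Map<String, Map<String, List<String>>> KB = new HashMap<>();

    // Record that frame.slot has the given value.
    static void link(String frame, String slot, String value) {
        KB.computeIfAbsent(frame, f -> new HashMap<>())
          .computeIfAbsent(slot, s -> new ArrayList<>())
          .add(value);
    }

    // Return all values of frame.slot, or an empty list if there are none.
    static List<String> values(String frame, String slot) {
        Map<String, List<String>> slots = KB.get(frame);
        if (slots == null || !slots.containsKey(slot)) {
            return new ArrayList<String>();
        }
        return slots.get(slot);
    }

    // Follow a chain of slots from one frame; return the distinct frames reached.
    static List<String> follow(String start, String... slots) {
        List<String> current = new ArrayList<String>();
        current.add(start);
        for (String slot : slots) {
            List<String> next = new ArrayList<String>();
            for (String frame : current) {
                for (String value : values(frame, slot)) {
                    if (!next.contains(value)) {
                        next.add(value);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    // Load the succinate dehydrogenase example (slot directions are assumed).
    static void load() {
        KB.clear();
        link("sdhA", "product", "Sdh-flavo");
        link("sdhB", "product", "Sdh-Fe-S");
        link("sdhC", "product", "Sdh-membrane-1");
        link("sdhD", "product", "Sdh-membrane-2");
        String[] subunits = {"Sdh-flavo", "Sdh-Fe-S", "Sdh-membrane-1", "Sdh-membrane-2"};
        for (String subunit : subunits) {
            link(subunit, "component-of", "Succinate-dehydrogenase");
        }
        link("Succinate-dehydrogenase", "catalyzes", "Enzymatic-reaction-1");
        link("Enzymatic-reaction-1", "reaction", "Succinate-FAD-reaction");
        link("Succinate-FAD-reaction", "in-pathway", "TCA-Cycle");
    }

    public static void main(String[] args) {
        load();
        // Prints [TCA-Cycle]
        System.out.println(follow("sdhA", "product", "component-of",
                                  "catalyzes", "reaction", "in-pathway"));
    }
}
```

Each hop in `follow` corresponds to one slot lookup, which is roughly what a chain of nested get-slot-values calls would do against a real PGDB.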

  18. Programmatic Access to BioCyc • Common LISP • Native language of Pathway Tools • Interactive & Mature Environment • Full Access to the Data & Many Utility Functions • Source code is available for academics • PerlCyc • API of Functions, Exposed to Perl • Communication through UNIX Socket • JavaCyc • API of Functions, Exposed to Java • Communication through UNIX Socket • Cyclone

  19. Cyclone • Developed by Schachter and colleagues from Genoscope • http://nemo-cyclone.sourceforge.net/archi.php • Cyclone is a Java-based system that: • Extracts data from a Pathway Tools PGDB • Converts it to an XML schema • Maps the data to Java objects and to a relational database • Changes made to the data on the Java side can be committed back to a Pathway Tools PGDB

  20. Lisp API • Accessible whenever you start Pathway Tools with the -lisp argument • Lisp queries evaluate against the running Pathway Tools binary and execute very fast

  21. Ocelot Object Database

  22. Pathway Tools Implementation Details • Platforms: • Macintosh, PC/Linux, and PC/Windows platforms • Same binary can run as desktop app or Web server • Production-quality software • Version control • Two regular releases per year • Extensive quality assurance • Extensive documentation • Auto-patch • Automatic DB-upgrade • 600,000 lines of Lisp code

  23. Pathway Tools Architecture • [Diagram] • Pathway/Genome Navigator: Web mode, desktop mode • APIs: Lisp, Perl, Java • Editors: Protein Editor, Pathway Editor, Reaction Editor • GFP API • Ocelot DBMS • Storage: Oracle or MySQL, or disk file

  24. Ocelot Object Database • Frame data model • Classes, instances, inheritance • Frames have slots that define their properties, attributes, relationships • A slot has one or more values • Datatypes include numbers, strings, etc. • Slotunit frames define metadata about slots: • Domain, range, inverse • Collection type, number of values, value constraints
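
To make "classes, instances, inheritance" concrete, here is a minimal Java sketch of a class hierarchy in which asking for all instances of a class also returns the instances of its subclasses. It is a toy model, not Ocelot itself. The class names come from the Principal Classes slides, while the instance IDs (`MONOMER-1` and so on) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of classes, instances, and inheritance; not the Ocelot API.
class FrameModelDemo {
    static final Map<String, List<String>> SUBCLASSES = new HashMap<>();
    static final Map<String, List<String>> DIRECT_INSTANCES = new HashMap<>();

    static void addSubclass(String parent, String child) {
        SUBCLASSES.computeIfAbsent(parent, p -> new ArrayList<>()).add(child);
    }

    static void addInstance(String cls, String instance) {
        DIRECT_INSTANCES.computeIfAbsent(cls, c -> new ArrayList<>()).add(instance);
    }

    // Instances attached directly to cls (returned as a fresh copy).
    static List<String> directInstances(String cls) {
        List<String> direct = DIRECT_INSTANCES.get(cls);
        return direct == null ? new ArrayList<String>() : new ArrayList<String>(direct);
    }

    // Direct instances of cls plus the instances of every subclass, recursively.
    static List<String> allInstances(String cls) {
        List<String> result = directInstances(cls);
        List<String> subs = SUBCLASSES.get(cls);
        if (subs != null) {
            for (String sub : subs) {
                result.addAll(allInstances(sub));
            }
        }
        return result;
    }

    // Build a small hierarchy; the instance IDs are hypothetical.
    static void load() {
        SUBCLASSES.clear();
        DIRECT_INSTANCES.clear();
        addSubclass("Proteins", "Polypeptides");
        addSubclass("Proteins", "Protein-Complexes");
        addInstance("Polypeptides", "MONOMER-1");
        addInstance("Polypeptides", "MONOMER-2");
        addInstance("Protein-Complexes", "COMPLEX-1");
    }

    public static void main(String[] args) {
        load();
        System.out.println(directInstances("Proteins")); // prints []
        System.out.println(allInstances("Proteins"));    // prints [MONOMER-1, MONOMER-2, COMPLEX-1]
    }
}
```

The difference between the two printed lists is the difference between direct instances and direct-plus-indirect instances.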

  25. Storage System Architecture • File KBs • Read-only applications can be distributed without a relational DBMS • Load all objects and code into Lisp memory • Dump virtual memory to binary executable file

  26. Ocelot Storage System Architecture • Persistent storage via disk files, MySQL or Oracle DBMS • Concurrent development: MySQL or Oracle • Single-user development: disk files • Relational DBMS storage • RDBMS is submerged within Ocelot, invisible to users • Frames transferred from RDBMS to Ocelot • On demand • By background prefetcher • Memory cache • Persistent disk cache to speed performance via Internet

  27. Transaction Logging • Relational DBMS stores • The latest version of each Ocelot frame • A log of all GFP operations applied to KB • Transaction log enables: • Reconstruction of earlier versions of KB • View history of changes to an object • Update replicates of a KB • Detection of update conflicts during concurrency control • Undo of updates
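
One way to see why a log of operations enables both reconstruction of earlier versions and undo is to replay a prefix of the log. The Java sketch below is a simplified illustration under stated assumptions: it logs only "add" and "remove" operations on slot values, and the frame and slot names are hypothetical. It does not reflect how Ocelot actually stores its log in the relational DBMS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy operation log; the real Ocelot log format is not shown in the slides.
class TransactionLogDemo {
    // One logged update: kind is "add" or "remove".
    static final class Op {
        final String kind, frame, slot, value;
        Op(String kind, String frame, String slot, String value) {
            this.kind = kind;
            this.frame = frame;
            this.slot = slot;
            this.value = value;
        }
    }

    // Rebuild the KB state produced by the first `count` logged operations.
    static Map<String, List<String>> replay(List<Op> log, int count) {
        Map<String, List<String>> kb = new TreeMap<>();
        for (Op op : log.subList(0, count)) {
            String key = op.frame + "." + op.slot;
            List<String> values = kb.computeIfAbsent(key, k -> new ArrayList<>());
            if (op.kind.equals("add")) {
                values.add(op.value);
            } else if (op.kind.equals("remove")) {
                values.remove(op.value);
            }
        }
        return kb;
    }

    // Undo the most recent update by replaying everything except the last entry.
    static Map<String, List<String>> undoLast(List<Op> log) {
        return replay(log, Math.max(0, log.size() - 1));
    }

    // A hypothetical three-step edit history for one slot.
    static List<Op> sampleLog() {
        List<Op> log = new ArrayList<>();
        log.add(new Op("add", "GENE-1", "SYNONYMS", "a"));
        log.add(new Op("add", "GENE-1", "SYNONYMS", "b"));
        log.add(new Op("remove", "GENE-1", "SYNONYMS", "a"));
        return log;
    }

    public static void main(String[] args) {
        List<Op> log = sampleLog();
        System.out.println(replay(log, log.size())); // prints {GENE-1.SYNONYMS=[b]}
        System.out.println(undoLast(log));           // prints {GENE-1.SYNONYMS=[a, b]}
    }
}
```

Replaying any prefix gives the KB as it stood at that point, which is also what viewing the history of an object amounts to in this simplified model.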

  28. Optimistic Concurrency Control • Locking approach: edits to one object can require locking all connected objects • No locking • User performs updates in local workspace • When user commits changes, storage system compares user changes against all other committed changes
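
The commit-time comparison can be sketched as follows. This Java example assumes a common form of optimistic concurrency control: each workspace remembers the commit counter at which it started, and a commit is rejected if another user has since committed a change to any frame the workspace edited. The slide does not say how Ocelot performs the comparison, so the class and method names here are hypothetical and the conflict rule is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Toy optimistic store: no locks, conflicts are detected only at commit time.
class OptimisticStoreDemo {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Integer> lastCommit = new HashMap<>();
    private int commitCounter = 0;

    // A user's private workspace; edits stay local until commit().
    final class Workspace {
        private final int startedAt = commitCounter;
        private final Map<String, String> edits = new HashMap<>();

        void put(String frame, String value) {
            edits.put(frame, value);
        }

        // Returns false (and changes nothing) if any edited frame was
        // committed by someone else after this workspace was opened.
        boolean commit() {
            for (String frame : edits.keySet()) {
                Integer committedAt = lastCommit.get(frame);
                if (committedAt != null && committedAt > startedAt) {
                    return false;
                }
            }
            commitCounter++;
            for (Map.Entry<String, String> e : edits.entrySet()) {
                values.put(e.getKey(), e.getValue());
                lastCommit.put(e.getKey(), commitCounter);
            }
            return true;
        }
    }

    Workspace begin() {
        return new Workspace();
    }

    String get(String frame) {
        return values.get(frame);
    }

    public static void main(String[] args) {
        OptimisticStoreDemo store = new OptimisticStoreDemo();
        Workspace alice = store.begin();
        Workspace bob = store.begin();
        alice.put("GENE-1", "edited by Alice");
        bob.put("GENE-1", "edited by Bob");
        System.out.println(alice.commit());      // prints true
        System.out.println(bob.commit());        // prints false
        System.out.println(store.get("GENE-1")); // prints edited by Alice
    }
}
```

Neither user ever waits on a lock; the cost is that the second committer has to resolve the conflict and try again.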

  29. Ocelot Knowledge Server Schema Evolution • FRSs store and process class and instance information similarly • Application can query schema information as easily as it can query instances • Schema is stored within the DB • Schema is self documenting • Schema evolution facilitated by • Easy addition/removal of slots, or alteration of slot datatypes • Flexible data formats that do not require dumping/reloading of data

  30. Generic Frame Protocol (GFP) • A library of procedures for accessing Ocelot DBs • GFP specification: • http://www.ai.sri.com/~gfp/spec/paper/paper.html • A small number of GFP functions are sufficient for most complex queries

  31. Example of a Single GFP Call • The general pattern, written on these slides as gfp-function(frame slot value ...), is called in Lisp as (gfp-function frame slot value ...) • Lisp example: (get-slot-values 'TRYPSYN-RXN 'LEFT) ==> (INDOLE-3-GLYCEROL-P SER)

  32. Frame References • At the GFP level, every Ocelot frame can be referred to either by its symbol frame name or by its frame object • Most GFP functions return frame objects • Use fequal for comparisons, so that a frame name and a frame object that refer to the same frame compare as equal

  33. Generic Frame Protocol • get-class-all-instances (Class) • Returns direct and indirect instances of Class • coercible-to-frame-p (Thing) • Is Thing a frame? Returns True if Thing is the name of a frame, or a frame object; else False

  34. Generic Frame Protocol • Notation Frame.Slot means a specified slot of a specified frame. Note: Slot must be a symbol! • get-slot-value(Frame Slot) • Returns first value of Frame.Slot • get-slot-values(Frame Slot) • Returns all values of Frame.Slot as a list • slot-has-value-p(Frame Slot) • Returns True if Frame.Slot has at least one value; else False • member-slot-value-p(Frame Slot Value) • Returns True if Value is one of the values of Frame.Slot; else False • instance-all-instance-of-p(Instance Class) • Returns True if Instance is a direct or indirect instance of Class

  35. Generic Frame Protocol • print-frame(Frame) • Prints the contents of Frame

  36. Generic Frame Protocol – Update Operations • put-slot-value(Frame Slot Value) • Replace the current value(s) of Frame.Slot with Value • put-slot-values(Frame Slot Value-List) • Replace the current value(s) of Frame.Slot with Value-List, which must be a list of values • add-slot-value(Frame Slot Value) • Add Value to the current value(s) of Frame.Slot, if any • remove-slot-value(Frame Slot Value) • Remove Value from the current value(s) of Frame.Slot • replace-slot-value(Frame Slot Old-Value New-Value) • In Frame.Slot, replace Old-Value with New-Value • remove-local-slot-values(Frame Slot) • Remove all of the values of Frame.Slot
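
The effect of each update operation on a slot's value list can be traced with a small in-memory stand-in. The Java class below is a toy, not the GFP or JavaCyc implementation. Its method names simply mirror the operation names above in camelCase, the frame and slot names are hypothetical, and details such as duplicate handling, value constraints, and saving the KB are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy slot store illustrating the update operations; not the GFP API.
class SlotUpdateDemo {
    private final Map<String, List<String>> kb = new HashMap<>();

    private static String key(String frame, String slot) {
        return frame + "." + slot;
    }

    // All values of Frame.Slot as a fresh list (empty if the slot has none).
    List<String> getSlotValues(String frame, String slot) {
        List<String> values = kb.get(key(frame, slot));
        return values == null ? new ArrayList<String>() : new ArrayList<String>(values);
    }

    // Replace the current values with a single value.
    void putSlotValue(String frame, String slot, String value) {
        List<String> single = new ArrayList<String>();
        single.add(value);
        kb.put(key(frame, slot), single);
    }

    // Replace the current values with a list of values.
    void putSlotValues(String frame, String slot, List<String> newValues) {
        kb.put(key(frame, slot), new ArrayList<String>(newValues));
    }

    // Append one value to whatever is already there.
    void addSlotValue(String frame, String slot, String value) {
        kb.computeIfAbsent(key(frame, slot), k -> new ArrayList<>()).add(value);
    }

    // Remove one value, leaving the others in place.
    void removeSlotValue(String frame, String slot, String value) {
        List<String> values = kb.get(key(frame, slot));
        if (values != null) {
            values.remove(value);
        }
    }

    // Swap oldValue for newValue at the same position, if oldValue is present.
    void replaceSlotValue(String frame, String slot, String oldValue, String newValue) {
        List<String> values = kb.get(key(frame, slot));
        if (values == null) {
            return;
        }
        int i = values.indexOf(oldValue);
        if (i >= 0) {
            values.set(i, newValue);
        }
    }

    // Remove every value of Frame.Slot.
    void removeLocalSlotValues(String frame, String slot) {
        kb.remove(key(frame, slot));
    }

    public static void main(String[] args) {
        SlotUpdateDemo db = new SlotUpdateDemo();
        db.putSlotValues("GENE-1", "SYNONYMS", Arrays.asList("a", "b"));
        db.addSlotValue("GENE-1", "SYNONYMS", "c");
        db.removeSlotValue("GENE-1", "SYNONYMS", "a");
        db.replaceSlotValue("GENE-1", "SYNONYMS", "b", "d");
        System.out.println(db.getSlotValues("GENE-1", "SYNONYMS")); // prints [d, c]
    }
}
```

Note that put replaces while add appends, which is the distinction most likely to cause accidental data loss when editing a PGDB.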

  37. Generic Frame Protocol – Update Operations • save-kb • Saves the current KB

  38. Additional Pathway Tools Functions – Semantic Inference Layer • Semantic inference layer defines built-in functions to compute commonly required relationships in a PGDB • http://bioinformatics.ai.sri.com/ptools/ptools-fns.html

  39. PerlCyc and JavaCyc • Work on Unix (Solaris or Linux) only • Start up Pathway Tools with the -api argument • Pathway Tools listens on a Unix socket; the Perl or Java program communicates through this socket • Supports both querying and editing PGDBs • Must run the Perl or Java program on the same machine that runs Pathway Tools • This is a security measure, as the API server has no built-in security • Can only handle one connection at a time

  40. Obtaining PerlCyc and JavaCyc • Download from http://www.sgn.cornell.edu/downloads/ • PerlCyc written and maintained by Lukas Mueller at Boyce Thompson Institute for Plant Research • JavaCyc written by Thomas Yan at Carnegie Institute, maintained by Lukas Mueller • Easy to extend…

  41. Examples of PerlCyc, JavaCyc Functions • GFP functions (require knowledge of Pathway Tools schema): • PerlCyc: get_slot_values, get_class_all_instances, put_slot_values • JavaCyc: getSlotValues, getClassAllInstances, putSlotValues • Pathway Tools functions (described at http://bioinformatics.ai.sri.com/ptools/ptools-fns.html): • PerlCyc: genes_of_reaction, find_indexed_frame, pathways_of_gene, transport_p • JavaCyc: genesOfReaction, findIndexedFrame, pathwaysOfGene, transportP

  42. Writing a PerlCyc or JavaCyc program • Create a PerlCyc or JavaCyc object: perlcyc -> new("ORGID") or new Javacyc("ORGID") • Call PerlCyc or JavaCyc functions on this object • Functions return object IDs, not objects • Must connect to the server again to retrieve attributes of an object

      PerlCyc:

          use perlcyc;
          my $cyc = perlcyc -> new("ECOLI");
          my @pathways = $cyc -> all_pathways();
          # Each element is a pathway ID; fetch its name with a second call
          foreach my $p (@pathways) {
              print $cyc -> get_slot_value($p, "COMMON-NAME");
          }

      JavaCyc:

          Javacyc cyc = new Javacyc("ECOLI");
          ArrayList pathways = cyc.allPathways();
          // Each element is a pathway ID; fetch its name with a second call
          for (int i = 0; i < pathways.size(); i++) {
              String pwy = (String) pathways.get(i);
              System.out.println(cyc.getSlotValue(pwy, "COMMON-NAME"));
          }

  43. Sample PerlCyc Query • Number of proteins in E. coli

      use perlcyc;
      my $cyc = perlcyc -> new("ECOLI");
      my @proteins = $cyc->get_class_all_instances("|Proteins|");
      my $protein_count = scalar(@proteins);
      print "Protein count: $protein_count.\n";

  44. Sample PerlCyc Query • Print IDs of all proteins with molecular weight between 10 and 20 kD and pI between 4 and 5.

      use perlcyc;
      my $cyc = perlcyc -> new("ECOLI");
      foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
          my $mw = $cyc->get_slot_value($p, "molecular-weight-kd");
          my $pI = $cyc->get_slot_value($p, "pi");
          if ($mw <= 20 && $mw >= 10 && $pI <= 5 && $pI >= 4) {
              print "$p\n";
          }
      }

  45. Sample PerlCyc Query • List all the transcription factors in E. coli, and the list of genes that each regulates:

      use perlcyc;
      my $cyc = perlcyc -> new("ECOLI");
      foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
          if ($cyc->transcription_factor_p($p)) {
              my $name = $cyc->get_slot_value($p, "common-name");
              # Keyed by gene ID so that each regulated gene is listed once
              my %genes = ();
              foreach my $tu ($cyc->regulon_of_protein($p)) {
                  foreach my $g ($cyc->transcription_unit_genes($tu)) {
                      $genes{$g} = $cyc->get_slot_value($g, "common-name");
                  }
              }
              print "\n\n$name: ";
              print join " ", values %genes;
          }
      }

  46. Sample Editing Using PerlCyc • Add a link from each gene to the corresponding object in MY-DB (assume ID is same in both cases)

      use perlcyc;
      my $cyc = perlcyc -> new("HPY");
      my @genes = $cyc->get_class_all_instances("|Genes|");
      foreach my $g (@genes) {
          $cyc->add_slot_value($g, "DBLINKS", "(MY-DB \"$g\")");
      }
      $cyc->save_kb();

  47. Sample JavaCyc Query: Enzymes for which ATP is a regulator

      import java.util.*;

      public class JavacycSample {
          public static void main(String[] args) {
              Javacyc cyc = new Javacyc("ECOLI");
              ArrayList regframes = cyc.getClassAllInstances("|Regulation-of-Enzyme-Activity|");
              for (int i = 0; i < regframes.size(); i++) {
                  String reg = (String) regframes.get(i);
                  // Is ATP one of the values of the Regulator slot?
                  boolean bool = cyc.memberSlotValueP(reg, "Regulator", "ATP");
                  if (bool) {
                      String enzrxn = cyc.getSlotValue(reg, "Regulated-Entity");
                      String enzyme = cyc.getSlotValue(enzrxn, "Enzyme");
                      System.out.println(enzyme);
                  }
              }
          }
      }

  48. Simple Lisp Query Example: Enzymes for which ATP is a regulator

      (defun atp-inhibits ()
        (loop for x in (get-class-all-instances '|Regulation-of-Enzyme-Activity|)
              ;; Does the Regulator slot contain the compound ATP, and is the mode
              ;; of regulation negative (inhibition)?
              when (and (member-slot-value-p x 'Regulator 'ATP)
                        (member-slot-value-p x 'Mode "-"))
              ;; Whenever the test is positive, we collect the value of the slot Enzyme
              ;; of the Regulated-Entity of the regulatory interaction frame.
              ;; The collected values are returned as a list, once the loop terminates.
              collect (get-slot-value (get-slot-value x 'Regulated-Entity) 'Enzyme)))

      ;;; Invoking the query:
      (select-organism :org-id 'ECOLI)
      (atp-inhibits)

  49. Simple Perl Query Example: Enzymes for which ATP is a regulator

      use perlcyc;
      my $cyc = perlcyc -> new("ECOLI");
      my @regs = $cyc -> get_class_all_instances("|Regulation-of-Enzyme-Activity|");
      ## We check every instance of the class
      foreach my $reg (@regs) {
          ## We test whether the Regulator slot contains the compound frame ATP
          ## and whether the mode of regulation is negative (inhibition)
          my $bool1 = $cyc -> member_slot_value_p($reg, "Regulator", "ATP");
          my $bool2 = $cyc -> member_slot_value_p($reg, "Mode", "-");
          if ($bool1 && $bool2) {
              ## Whenever the test is positive, we look up the value of the slot Enzyme
              ## of the regulated entity. The results are printed in the terminal.
              my $enzrxn = $cyc -> get_slot_value($reg, "Regulated-Entity");
              my $enz = $cyc -> get_slot_value($enzrxn, "Enzyme");
              print STDOUT "$enz\n";
          }
      }

  50. Getting started with Lisp • pathway-tools -lisp • (load "file") (compile-file "file.lisp") • Emacs is a useful editor • Pathway Tools source code is available: ask • Overview of Lisp information resources: • http://bioinformatics.ai.sri.com/ptools/ptools-resources.html • Documented Pathway Tools Lisp functions: • http://brg.ai.sri.com/ptools/ptools-fns.html
