
Data Creation, Federation, and Delivery Tools: Biology and IT at SDSC

Mark A. Miller and Tracy Zhao


Presentation Transcript


  1. Data Creation, Federation, and Delivery Tools: Biology and IT at SDSC • Mark A. Miller and Tracy Zhao

  2. How do we get from Genomics to Genomic Medicine?

  3. Four strands in technology are coming together: • increasingly sensitive instruments can analyze many molecular parameters of a living organism simultaneously in near real time. • bioinformatic tools are making it possible to identify correlations between molecular characteristics and observable features at increasing levels of biological scale. • sophisticated analysis of these data streams is conceivable due to the increasing power of computers at diminishing cost. • wireless and wired technologies, together with diminishing storage costs, make it possible to aggregate these data and analyze them over enormous numbers of organisms.

  4. Where can these technologies take us? [Diagram; labels in slide order: Risk Avoidance, Molecular Data, Diagnostic Data, Federation, Modeling, Treatment Concepts, Preventative Measures, Environmental Data]

  5. Multi-site Projects have common needs! • BIRN • Alliance for Cell Signaling • Joint Center for Structural Genomics • Protein Data Bank • CIPRes (Tree of Life) • Lipid Maps (Lipid Metabonomics) • The Biology Workbench

  6. [Architecture diagram; labels in slide order] Interdisciplinary Modeling • Thematic Interfaces: Cell modeling, Pathogen/Host Interactions, Epidemiology • Data Views and Tools • Data Capture • Domain Scientists • Disciplinary Interfaces: Molecular Biology, Epidemiology, Pharmacology, Microbiology, Pathology • Virtual Collections (three instances) • Cycles • User Communities • MIDAS Informatics Center • Linux Clusters • Distributed Resources • SMPs • Discovery Portal • Data • Federated Data Collections • Integrated Views • CDC, SIO, NASA, NCBI, EBI, WHO

  7. SDSC is working to serve the community by creating new functionalities needed by data-intensive biology. The following tools are under active development: • Tools for creating and depositing experimental data • Workflow software • Access to distributed computing resources • Peer-to-peer electronic notebook • Integrated data resources • Tools for viewing and analyzing data

  8. 1. Tools for creating and depositing experimental data: the EOL project

  9. Goal of EOL Project:
       BASE COUNT 531 a 506 c 464 g 496 t
       ORIGIN
          1 aaacacaact ggtatctctt ccaaggcttg acagaccatt tcgctgtcat ggctctcatc
         61 accagtttgc aggatgtccg gctcgacatg ctggcatatt ttgttgcctt tctcgtcgta
        121 gtatccgtcg tacgaaagaa gctggccccg caacccagcg catacttgct caatcccaga
        181 cgctggtacg agtttaccga tgctcgtgca gtctcagaag tccttcacac cacccgccaa
        241 accctcgaag aatggttcca caagcaccca acaacccctg tccgcctgac aaccgatttc
        301 ggtgaaatga cctttttgcc tcccactctg gccgatgaaa tcaagagtga taagcgtctc
        361 agcttcatca aggcagctaa tgattcggta tgtggaaccc gttaaataca tggcaggccc
        421 atatagctaa acaaccaata ggccttccac actgaaatcc ccggttttga gcctttccgc
        481 gagggcggaa gaaatgaggc agcactgatc aaggaggtta ttcacggtca attgaagaaa
        541 actctgagta agccgggaac ccatccagat tataacatgc catgtcgatg ctaattgttt
        601 ttgtttagac aagatgacct ttccattggc tcaagaaacc cagctggctg ttgaacacta
        661 cctcggtgct aacaagggta aggaccatcg cgacacgttg ttgactttaa gctgattgtc
        721 tatagaatgg cacaagattc gactcagaga cgcactgcta cccctggtca ctagaatctc
        781 aacacgtatc tttttgggtg aagatctatg ccaaaatgac aaatggatta gcatcacttc
        841 ggaatacgct gccaacagtc tcgaggtcgc aaaccgcctg cgcgtctggc ccaagtacat
        901 gcgttacgtc gtttcatact tctctccagg atgcggaatt ctacgaaacc aggtcaagaa
        961 tgctcgcgaa ctaatcactc ccattgttga acgccgtcga tccgaggaaa agggtaagga
       1021 atacaatgat tctctgggct ggtttgagaa gactgccaaa caagcgtaca accctgctgc
       1081 tacccaacta ttcctttctg ctgtatctgt ccacaccacc accgatctca tctgccaatg
       1141
     Software pipeline

  10. Creating a Genome Annotation Pipeline [Diagram: individual genome sequences; annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4); one Output per tool]

  11. Hard-code a script that links the tools: the integrated Genome Annotation Program (iGAP) [Diagram: ALL genome sequences; annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4); Federated DATABASES] • Features: • annotation of all genomes by automated program portfolio • all runs stored in federated database • federation of local and public databases at API level • results served via SOAP server • interface facilitates novel queries • interface facilitates data management and exchange
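The idea of a hard-coded script that links a fixed portfolio of tools can be illustrated with a minimal sketch. This is not the iGAP code: the slides do not name the annotation programs, so `sequence_length` and `gc_content` are hypothetical stand-ins, and a plain dictionary stands in for the federated database that stores every run.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for community annotation tools. The slides do not
# name the real programs; these only illustrate how tools are chained.
def sequence_length(seq: str) -> int:
    return len(seq)

def gc_content(seq: str) -> float:
    s = seq.lower()
    return (s.count("g") + s.count("c")) / len(s) if s else 0.0

# The "hard-coded" part: a fixed portfolio of tools, applied in this order.
PIPELINE: Dict[str, Callable[[str], object]] = {
    "length": sequence_length,
    "gc_content": gc_content,
}

def annotate_all(genomes: Dict[str, str]) -> Dict[str, Dict[str, object]]:
    """Run every tool on every sequence and keep all outputs together.

    The returned dict is an assumed placeholder for the federated database.
    """
    results: Dict[str, Dict[str, object]] = {}
    for name, seq in genomes.items():
        results[name] = {tool: fn(seq) for tool, fn in PIPELINE.items()}
    return results

# Example input: the first ten bases shown on the EOL slide, plus a toy sequence.
runs = annotate_all({"seq1": "aaacacaact", "seq2": "ggcc"})
```

Because the tool list is fixed in the script, swapping in a different program means editing the code, which is the limitation the later workflow slides set out to remove.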

  12. Deploy the hard-coded pipeline on many resources, coordinated by a workflow management system: [Diagram labels: Home resource; Distributed resources; iGAP (multiple instances); jAPST; APST; WMSD; Job Status Database; JSP; Federated Database]

  13. Next step: flexible visual programming workflow tools:

  14. Promoter Identification Workflow (PIW)

  15. Execution Semantics Promoter Identification Workflow in Kepler (SSDBM’03)

  16. What the future holds: All of the iGAP programs are “actors”, along with a number of other, alternative community codes. APST, the grid scheduler, is also an actor. Now users can create and deploy their own unique iGAP with their favorite programs.

  17. 2. Integrated databases. Encyclopedia of Life Notebook: the interface allows scientists access to all federated databases. Layers: User Interface, Query interface, Federation, Data Sources. Data sources: • Encyclopedia of Life (55 annotated genomes) • Protein Data Bank (15,000 structures) • Joint Center for Structural Genomics (4,000 targets) • Protein Kinase Resource (12,000 kinases) • Alliance for Cell Signaling (3,000 mol. pages) • Small Molecule QM Database (all PDB molecules). Six databases can be queried by the entire community through a user interface, and portals will create new functionalities for data manipulation.

  18. The interface allows scientists access to all federated databases with BLAST wrappers and HMMER wrappers. Example queries: • “Show me whether there is a kinase domain with 90% homology to mine in the PDB, or whether one of the Structural Genomics Centers is working on it.” • “Give me all the proteins that have my kinase domain, and tell me what the AfCS is studying that is closely related.” Layers: BLAST Wrapper, HMMER Wrapper, Federation, Data Sources. Data sources: • Encyclopedia of Life (55 annotated genomes) • Protein Data Bank (15,000 structures) • Joint Center for Structural Genomics (4,000 targets) • Protein Kinase Resource (12,000 kinases) • Alliance for Cell Signaling (3,000 mol. pages) • Small Molecule QM Database (all PDB molecules)

  19. The interface allows scientists to integrate experimental data with all federated databases. Example query: • “Show me the homologues to my chosen microarray sequence, and tell me what is known in Medline about it.” Layers: Text Wrappers, Excel Wrappers, Medline Wrappers, Federation, Data Sources. Data sources: • Encyclopedia of Life (55 annotated genomes) • Protein Data Bank (15,000 structures) • Joint Center for Structural Genomics (4,000 targets) • Protein Kinase Resource (12,000 kinases) • Alliance for Cell Signaling (3,000 mol. pages) • Small Molecule QM Database (all PDB molecules)

  20. Issues in database integration: • How do we make logical federation across multiple databases? • Sit down with the creators of each database and federate elegantly. This non-scalable solution was useful in finance. • Use a brute-force search tool to locate fields common to each database. Such a schema-smashing tool is under development.
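The brute-force option can be pictured as a scan over every field of every database, looking for names that recur. The sketch below is only a naive illustration of that idea, not the schema-smashing tool mentioned on the slide (the slides give no details about it). The database, table, and field names are hypothetical, and matching on normalized names alone is an assumption; a real tool would presumably also compare data types and values.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Location = Tuple[str, str, str]  # (database, table, field)

def normalize(field: str) -> str:
    """Crude name normalization: ignore case and underscores."""
    return field.lower().replace("_", "")

def common_fields(schemas: Dict[str, Dict[str, List[str]]]) -> Dict[str, Set[Location]]:
    """Return normalized field names that occur in at least two databases."""
    seen: Dict[str, Set[Location]] = defaultdict(set)
    for db, tables in schemas.items():
        for table, fields in tables.items():
            for field in fields:
                seen[normalize(field)].add((db, table, field))
    return {
        name: locations
        for name, locations in seen.items()
        if len({db for db, _, _ in locations}) > 1
    }

# Hypothetical schemas, for illustration only.
schemas = {
    "db_a": {"entries": ["protein_id", "resolution"]},
    "db_b": {"kinases": ["ProteinID", "family"]},
}
shared = common_fields(schemas)
```

Here `shared` contains a single key, `proteinid`, pointing at the two places where that field appears; those shared fields are the candidate join points for a logical federation.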

  21. 3. Innovative data access tools • Access to software that is interoperable • Access to cycles to run the software • Local resources • Grid mapping/scheduling tools for remote resources • Data Visualization Tools • User control of the toolkit • Access to federated data resources • Workflow Tools. These are largely solved problems, but the solutions must be tailored to each user community.

  22. An example workbench

  23. Turn on tracking feature Select an organism

  24. Select a sequence

  25. Display sequence

  26. Display properties

  27. Lightweight SVG graphics to manipulate structure

  28. What the future holds: User requirements for the Next Generation Biology Workbench • Can be used with a browser/dial-up modem • Low requirements of the user's machine • Requires no install of plug-ins • Provides a variety of useful programs with interoperable data formats • Provides flexibility/modularity in adding new functions • Web services based • User-configurable interface • Provides access to federated data sources • Offers the option of a more intelligent client interface

  29. What does an electronic notebook provide? The Notebook will serve as a “web browser” for SOAP services. It will feature a local database to store the results of computations and searches, notify you when new data updates are available, and enable peer-to-peer data sharing.

  30. What are the data issues for the notebook? • How to store and organize data in a manner that accommodates a wide variety of data types, from text strings to large images. • How to capture data, and offload to the public data resource.

  31. CYBER INFRASTRUCTURE FOR PHYLOGENETIC RESEARCH BUILDING THE TREE OF LIFE: A NATIONAL RESOURCE FOR PHYLOINFORMATICS AND COMPUTATIONAL PHYLOGENETICS The CIPRes project is funded by NSF Information Technology Research (ITR) program grants entitled "BUILDING THE TREE OF LIFE: A National Resource for Phyloinformatics and Computational Phylogenetics"

  32. Data Issues for the Phylogeny Community • Devise new methodologies for storing trees and graphs – RDF? • How to capture data in remote environs – Notebook Project • Create a production repository of phylogenetic information – Treebase

  33. Treebase • www.treebase.org • relational database of phylogenetic information • Types of data include: Users, Studies, Analysis, Matrices, Taxa, Trees • Current treebase schema

  34. General Problems/Improvements? • User interface is outdated and difficult to use; it is not easy to search and pull up desired records. For example, treebase does not support searching by distinct taxon name. • Difficult for an end user to retrieve all data in an analysis, tweak certain parameters, and replicate it. • Lacks electronic-notebook-like features where users can easily access their own data and/or studies of interest. • Lacks search options such as Boolean operators. • Needs more dependable links to outside sources, morphological data, sequence data, etc. • Basically, we’re not currently getting the most out of the data we have … and therefore treebase2

  35. Specific problems • Efficiency of db queries / how best to store the data (taxa, matrices) • How to store trees (rooted and un-rooted) in a relational database and be able to do quick searches by taxa and sub-trees. • Recursive queries such as: all ancestors of species A, least common ancestor of species A and B, etc. • Difficult to get a consensus from the community on the best way to proceed.
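To make the recursive-query problem concrete, here is a minimal sketch of one common way to hold a rooted tree in a relational table: an adjacency list walked with a recursive common table expression. It is not the treebase or treebase2 schema; the `node` table, its columns, and the taxon labels are assumptions for illustration, labels are assumed unique, and un-rooted trees are not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed adjacency-list layout: each row points at its parent node.
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, None, "root"),
        (2, 1, "clade_X"),
        (3, 2, "species_A"),
        (4, 2, "species_B"),
        (5, 1, "species_C"),
    ],
)

# Walk from the starting node up to the root, counting steps as we go.
ANCESTORS_SQL = """
WITH RECURSIVE up(id, parent_id, label, dist) AS (
    SELECT n.id, n.parent_id, n.label, 0 FROM node n WHERE n.label = ?
    UNION ALL
    SELECT p.id, p.parent_id, p.label, up.dist + 1
    FROM node p JOIN up ON p.id = up.parent_id
)
SELECT label FROM up WHERE dist > 0 ORDER BY dist
"""

def ancestors(label):
    """All ancestors of a node, nearest first."""
    return [row[0] for row in conn.execute(ANCESTORS_SQL, (label,))]

def least_common_ancestor(a, b):
    """Nearest ancestor of `a` that is also an ancestor of `b`, or None."""
    others = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in others:
            return candidate
    return None
```

With the sample rows, `ancestors("species_A")` walks up through `clade_X` to `root`, and `least_common_ancestor("species_A", "species_C")` is `root`. Each lookup walks one parent link per level, which is part of why query efficiency on deep trees is listed as an open problem.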

  36. Our Approach • Focus on getting things done one step at a time • Problems we want to address immediately (in early versions of treebase2) include: 1. Richer user interface / better site design 2. Improved search capabilities (by distinct taxa, references) 3. Give users more options with visualization tools for trees So based on that, we have the treebase2 version 0.01 schema and current mirror site: http://www.phylo.org/treebase

  37. Overall View Treebase2 Object model Java Thick Client Web Interface

  38. Future plans summary • Do more with trees and data! • Improve current submission and curation process – get more papers/analyses/studies into treebase2, faster • Improve web interface for users • Increase user community & interaction
