
Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

UC Davis Department of Computer Science. The Kepler/pPOD Team: Shawn Bowers, Timothy McPhillips, Sean Riddle, Manish Anand, and Bertram Ludäscher.





  1. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life
     UC Davis Department of Computer Science
     The Kepler/pPOD Team: Shawn Bowers, Timothy McPhillips, Sean Riddle, Manish Anand, and Bertram Ludäscher
     DAKS Lab, Genome Center, Univ. of California at Davis
     Dept. of Computer Science, Univ. of California at Davis

  2. Background
     “The AToL initiative (Assembling the Tree of Life) is a large research effort sponsored by the National Science Foundation. Its goal is to reconstruct the evolutionary origins of all living things.” – http://atol.sdsc.edu
     AToL projects
     • Investigate relationships among specific groups of organisms
     • Develop new computational techniques
     • Expectation that projects will collaborate & share data
     Technology barriers
     • Exchanging data between collaborators & other projects
     • Data “lives” in many different kinds of applications
     • Similar analyses performed, but ad hoc (manually or with scripts)
     • Provenance of data and results

  3. Project Overview
     pPOD (processing phylodata)
     • Develop core database technologies for the AToL community
     • Data access, data integration, scientific analysis, provenance
     • Collaboration among Univ. of Pennsylvania, Yale Univ., Univ. of Florida, and UC Davis
     Kepler/pPOD @ UC Davis
     • Scientific workflows for phylogenetic data analysis
     • Workflow execution and data provenance

  4. Basic architecture
     Existing Applications (mappings to core model via Orchestra)
     • Tolkin
     • TreeBASE
     • AToL Lab DB
     Workflow Automation (Kepler/pPOD)
     • Tools & analyses
     • Integrate w/ data model
     • Provenance recording within and across workflow runs
     Core AToL Data Model
     • Data types for sequences, trees, …
     • Provenance relationships
     • Expressive query language (OQL)
     • Persistence tools
     Data Integration & Exchange (Orchestra)
     • Application schema mappings
     • Curation (w/ provenance)
     • Privacy and trust policies
     • P2P support

  5. Kepler/pPOD workflows
     Uses
     • Sequence alignment, tree inference, post-tree analysis, …
     • Track analyses run and data produced within projects
     • Use, test, compare different computational techniques
     Characteristics
     • Exploratory (design, run, modify, commit, …)
     • Intertwined with manual steps (e.g., edit alignment)
     • Many formats, few data types (sequences, trees, matrices, …)
     • Pipelined (e.g., multiple sets of sequences)
     Kepler/pPOD Status
     • “Preview release” of Kepler/pPOD: Kepler + pPOD extensions
       • workflow design (via Comad)
       • wrapped apps: Phylip, Clustal, MrBayes, RAxML, tree drawing, …
       • provenance recording and browsing

  6. Kepler/pPOD workflows
     [Annotated screenshot; the labels below are the callouts recovered from it]
     GUI components
     • workspace extension
     • access to workflows
     • access to run “traces”
     New director
     • data types, collections
     • assembly-line processing
     • provenance enabled
     Actor library
     • Cipres web services
     • local applications
     • format conversion

  7. Kepler/pPOD workflows
     Integrated provenance browser
     • data & process dependencies
     • “forward” & “rewind” run
     • multiple views

  8. Comad: “Virtual Assembly Lines”
     [Figure: a nested token stream (a Proj collection containing Seqs, Aligns, and Trees collections, with items such as S1, S10, A1, A2, T1, T5, T6) flowing through a Compute Consensus actor]
     • Actors select parts of token stream, forward rest
     • Special tokens denote collections, metadata, & parameters
     • Actors insert tokens into and remove tokens from stream
     • Some advantages of Comad
       • workflows with loops, branches, composition (subworkflows)
       • concurrency, pipelining
       • resilient to change (data nesting, add/remove actors)
       • simpler workflow designs
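The assembly-line idea on this slide can be illustrated with a minimal Python sketch. It is not Kepler or Comad code: the token classes `Open`, `Close`, and `Data`, and the actor `compute_consensus`, are hypothetical names invented for illustration, and the "consensus" is only a placeholder string. The actor works on the `Trees` collection, forwards every other token untouched, and inserts one new token into the stream.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Union


@dataclass(frozen=True)
class Open:
    """Hypothetical delimiter token: start of a named collection."""
    name: str


@dataclass(frozen=True)
class Close:
    """Hypothetical delimiter token: end of a named collection."""
    name: str


@dataclass(frozen=True)
class Data:
    """Hypothetical data token carrying a kind ("seq", "tree", ...) and a value."""
    kind: str
    value: str


Token = Union[Open, Close, Data]


def compute_consensus(stream: Iterable[Token]) -> Iterator[Token]:
    """Forward every token; add a placeholder consensus tree to each Trees collection."""
    trees: Optional[List[str]] = None  # None while outside a Trees collection
    for tok in stream:
        if tok == Open("Trees"):
            trees = []
        elif isinstance(tok, Data) and tok.kind == "tree" and trees is not None:
            trees.append(tok.value)
        elif tok == Close("Trees") and trees is not None:
            # Insert the new token just before the collection closes.
            yield Data("tree", "consensus(" + ",".join(trees) + ")")
            trees = None
        yield tok  # everything the actor reads is passed downstream


stream: List[Token] = [
    Open("Proj"),
    Open("Seqs"), Data("seq", "S1"), Data("seq", "S10"), Close("Seqs"),
    Open("Aligns"), Data("align", "A1"), Data("align", "A2"), Close("Aligns"),
    Open("Trees"), Data("tree", "T1"), Data("tree", "T5"), Data("tree", "T6"),
    Close("Trees"),
    Close("Proj"),
]

for token in compute_consensus(stream):
    print(token)
```

Because the actor is a generator, a downstream actor can start consuming tokens before the upstream one has finished, which is one simple way to picture the pipelining the slide mentions. The actor also never inspects `Seqs` or `Aligns`, so changing how those collections are nested would not require changing it.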

  9. … but (efficiently) representing provenance?
     [Figure: “Conventional”: all of X and Y stored for A1. “Comad”: store change and explicit dependencies for A1, e.g. del(A1), ins(A1)]
     • Many approaches require storing all input and output for each actor invocation (transformers)
       • can lead to significant redundancy in Comad
     • We use an “XML-diff” approach augmented with data provenance
       • special provenance tokens …
       • … insertions, (marked) deletions, invocation dependencies
       • exploit collections and apply inference rules
       • only store final result containing input and provenance
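To make the diff-style bookkeeping concrete, here is a small Python sketch under stated assumptions. It does not reproduce the Kepler/pPOD trace format: the `Item` record, the invocation names (`align#1`, `refine#1`, `infer#1`), and the helpers `state_after` and `lineage` are all hypothetical. The point is only that a single final record, annotated with who inserted each item, who marked it deleted, and what it was derived from, is enough to recover intermediate states and dependencies without storing each invocation's full input and output.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Item:
    """One data item in the final trace, annotated with provenance (hypothetical layout)."""
    item_id: str
    inserted_by: Optional[str] = None   # None means the item was workflow input
    deleted_by: Optional[str] = None    # marked deletion: the item stays in the trace
    depends_on: Tuple[str, ...] = ()    # ids of the items it was derived from


def state_after(trace: List[Item], order: List[str], invocation: str) -> List[str]:
    """Ids of the items visible in the stream right after `invocation` ran."""
    step = order.index(invocation)

    def happened(inv: Optional[str]) -> bool:
        return inv is not None and order.index(inv) <= step

    return [
        it.item_id
        for it in trace
        if (it.inserted_by is None or happened(it.inserted_by))
        and not happened(it.deleted_by)
    ]


def lineage(trace: List[Item], item_id: str) -> Set[str]:
    """All items that `item_id` transitively depends on."""
    by_id = {it.item_id: it for it in trace}
    seen: Set[str] = set()
    stack = list(by_id[item_id].depends_on)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(by_id[current].depends_on)
    return seen


# Assumed execution order of three invented invocations.
order = ["align#1", "refine#1", "infer#1"]
trace = [
    Item("S1"),
    Item("S10"),
    Item("A1", inserted_by="align#1", deleted_by="refine#1", depends_on=("S1", "S10")),
    Item("A2", inserted_by="refine#1", depends_on=("A1",)),
    Item("T1", inserted_by="infer#1", depends_on=("A2",)),
]

print(state_after(trace, order, "align#1"))
print(state_after(trace, order, "refine#1"))
print(sorted(lineage(trace, "T1")))
```

In this sketch `A1` is stored once even though it is an output of one invocation and an input of the next; a store-everything scheme would record it twice.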

  10. Kepler/pPOD Provenance Browser
      • Reusable “widgets” for viewing different aspects of a trace
      • Move “forward” and “backward” through execution
      • Data dependencies, collection structure, actor invocations

  11. Kepler/pPOD Provenance Browser
      • Collection and invocation view
      • Incrementally step through execution history
      • Actor invocation graph shows pipelining, implicit branches
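One way an actor invocation graph could be derived from recorded data dependencies is sketched below in Python. This is an assumption for illustration, not the browser's actual implementation: the record layout `(item, producing invocation, items it depends on)` and the function `invocation_graph` are hypothetical, as are the invocation names.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical record: (item id, invocation that produced it, item ids it was derived from)
Record = Tuple[str, str, Tuple[str, ...]]


def invocation_graph(records: List[Record]) -> Dict[str, Set[str]]:
    """Map each invocation to the upstream invocations whose outputs it consumed."""
    produced_by = {item: inv for item, inv, _ in records}
    graph: Dict[str, Set[str]] = {}
    for _, inv, deps in records:
        upstream_set = graph.setdefault(inv, set())
        for dep in deps:
            upstream = produced_by.get(dep)  # None for workflow inputs
            if upstream is not None and upstream != inv:
                upstream_set.add(upstream)
    return graph


records: List[Record] = [
    ("A1", "align#1", ("S1",)),
    ("A2", "align#2", ("S10",)),
    ("T1", "infer#1", ("A1",)),
    ("T5", "infer#2", ("A2",)),
    ("T6", "consensus#1", ("T1", "T5")),
]

for inv, upstream in invocation_graph(records).items():
    print(inv, "<-", sorted(upstream))
```

In this invented example the two `align`/`infer` chains share no dependencies, so they show up as separate branches that only meet at `consensus#1`, which is the kind of implicit branching and pipelining the slide says the graph reveals.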

  12. Poster/Demo & Questions …
      • Please come to our poster/demo :-)
      • Preview release of Kepler/pPOD available
        • http://daks.ucdavis.edu/kepler-ppod
      • Ongoing and future work
        • Adding more actors for phylogenetic analyses
        • Extending with “project histories”
        • Incremental query support
        • Integrate with AToL Core Data Model
