
Shawn Bowers Timothy McPhillips Bertram Ludaescher in collaboration with Ilkay Altintas

Provenance Management in a COllection-oriented Scientific Workflow Framework, aka Kepler/DAKS (for Luc’s collection: before: “We do provenance!”; now: “… and it almost killed us!”). Shawn Bowers, Timothy McPhillips, Bertram Ludaescher, in collaboration with Ilkay Altintas


Presentation Transcript


  1. Provenance Management in a COllection-oriented Scientific Workflow Framework, aka Kepler/DAKS (for Luc’s collection: before: “We do provenance!”; now: “… and it almost killed us!”)
     Shawn Bowers, Timothy McPhillips, Bertram Ludaescher, in collaboration with Ilkay Altintas and Norbert Podhorszki

  2. Goals for the Provenance Challenge
     Implement an RWS-style provenance model for Collection-Oriented Scientific Workflows
     • Take advantage of Collection-Oriented SWFs to
       • Automatically infer state-reset events
       • Reduce the number of provenance-relevant events that need to be recorded (keep it minimal)
       • Simplify association of traces and provenance into one self-contained “trace” file for input, output, and dependencies
     • Support science-oriented provenance and queries
     • Emphasize data dependencies (lineage) as well as process details
     • Decouple provenance representation from particular scientific workflow technology (Kepler)

  3. Collection-Oriented Workflows
     Generic support for workflows that operate over nested data collections (trees)
     • Abstract Model
       • Actors receive input trees, read the contents of subtrees matching some criteria (the scope), and optionally add or delete subtree nodes
       • Each scope instance corresponds to one actor invocation
     [Figure: AlignWarp actor with Scope = AnatomyImage. Node labels: AnatomyImage, Image, ReferenceImage, WarpParamSet.]
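
The abstract model above can be illustrated with a minimal Python sketch. It is not the Kepler implementation: the `Node` class, `find_scopes`, `run_actor`, and the `align_warp` stand-in are hypothetical names introduced here, and the sketch assumes that a scope is matched by node label and that matches are not nested.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A node in a nested data collection (a tree)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def find_scopes(root: Node, scope: str) -> List[Node]:
    """Return the subtrees whose root label equals the scope, in document order."""
    if root.label == scope:
        return [root]  # assumption: scope instances are not nested in each other
    matches: List[Node] = []
    for child in root.children:
        matches.extend(find_scopes(child, scope))
    return matches


def run_actor(root: Node, scope: str, actor: Callable[[Node, int], None]) -> int:
    """Invoke the actor once per scope instance and return the invocation count."""
    count = 0
    for subtree in find_scopes(root, scope):
        count += 1
        actor(subtree, count)
    return count


def align_warp(subtree: Node, invocation: int) -> None:
    """Hypothetical stand-in for AlignWarp: adds a WarpParamSet node to its scope."""
    subtree.children.append(Node("WarpParamSet"))


tree = Node("ImageCollection", [
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
])
invocations = run_actor(tree, "AnatomyImage", align_warp)
labels = [[child.label for child in anatomy.children] for anatomy in tree.children]
```

Each `AnatomyImage` subtree triggers exactly one invocation, and the actor only touches nodes inside the subtree it was handed.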

  4. Collection-Oriented Workflows
     • Kepler Implementation
       • Collections are serialized within heterogeneous token streams
       • Actor execution is pipelined based on each actor’s scope
       • Enables concurrent processing of nested data collections
       • Collections can contain data, metadata, actor parameters, and other collections
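
One way to picture the serialization of nested collections into a token stream is sketched below in Python. The token kinds (`open`, `close`, `data`) and the tuple-based collection encoding are assumptions made for illustration; the slides do not specify the actual Kepler token types.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is "open", "close", or "data"


def serialize(item) -> Iterator[Token]:
    """Flatten a nested collection into a stream of delimiter and data tokens.

    A collection is a (label, [children]) tuple; a data item is a plain string.
    """
    if isinstance(item, str):
        yield ("data", item)
        return
    label, children = item
    yield ("open", label)
    for child in children:
        yield from serialize(child)
    yield ("close", label)


def deserialize(tokens: List[Token]):
    """Rebuild the nested collection from a well-formed token stream."""
    stack = [("__root__", [])]
    for kind, value in tokens:
        if kind == "open":
            stack.append((value, []))
        elif kind == "close":
            finished = stack.pop()
            stack[-1][1].append(finished)
        else:
            stack[-1][1].append(value)
    return stack[0][1][0]


nested = ("ImageCollection", [
    ("AnatomyImage", ["Image", "Header"]),
    ("AnatomyImage", ["Image", "Header"]),
])
tokens = list(serialize(nested))
round_trip = deserialize(tokens)
```

Because the stream is flat, a downstream actor can start working on the first `AnatomyImage` as soon as its closing token arrives, which is the property that makes pipelined execution possible.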

  5. Collection-Oriented Provenance Challenge SWF
     • Input data is read by a collection reader
       • Execution is driven by the number and size of anatomy image sets specified in an XML file
     • Slicer is configured on the fly via parameter tokens
       • E.g., to create the 3 slices required for each image set
     • Output trace is serialized into XML by a collection writer
       • Trace implicitly contains input data, output data, and lineage

  6. Embedded Provenance Tokens
     • Data and invocation dependencies are stored as tokens within the stream
     • Actor API for declaring data dependencies
     • Invocation dependencies are added automatically
     Data Dependencies
     • Insertion and deletion events capture actor, invocation count, and direct data dependencies
     Process Dependencies
     • Invocation dependencies record which steps created data or modified collections used by another actor invocation
     [Figure: Collection-Oriented Provenance. Node labels: AnatomyImage, WarpParamSet, Image, ReferenceImage; annotations: Insertion, Dependencies.]
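
The Python sketch below illustrates the kind of information an insertion event carries and how invocation dependencies could be derived from data dependencies. The `ProvenanceRecorder` class and its methods are hypothetical and are not the actor API mentioned on the slide; the `Softmean` actor name and the input node ids 200 and 201 are likewise made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Invocation = Tuple[str, int]  # (actor name, invocation count)


@dataclass(frozen=True)
class Insertion:
    """An insertion event: which invocation created a node, and from what."""
    node: int
    actor: str
    invocation: int
    depends_on: Tuple[int, ...]


class ProvenanceRecorder:
    """Hypothetical recorder, for illustration only."""

    def __init__(self) -> None:
        self.events: List[Insertion] = []
        self._counts: Dict[str, int] = {}

    def begin_invocation(self, actor: str) -> int:
        """Increment and return the invocation count for an actor."""
        self._counts[actor] = self._counts.get(actor, 0) + 1
        return self._counts[actor]

    def insert(self, node: int, actor: str, invocation: int,
               depends_on: Iterable[int]) -> Insertion:
        """Record that an invocation inserted a node derived from other nodes."""
        event = Insertion(node, actor, invocation, tuple(depends_on))
        self.events.append(event)
        return event

    def invocation_dependencies(self) -> Set[Tuple[Invocation, Invocation]]:
        """Derive (producer, consumer) invocation pairs from data dependencies."""
        creator = {e.node: (e.actor, e.invocation) for e in self.events}
        pairs: Set[Tuple[Invocation, Invocation]] = set()
        for e in self.events:
            for dep in e.depends_on:
                if dep in creator:  # inputs read from outside have no creator
                    pairs.add((creator[dep], (e.actor, e.invocation)))
        return pairs


recorder = ProvenanceRecorder()
n = recorder.begin_invocation("Softmean")
recorder.insert(308, "Softmean", n, depends_on=[200, 201])  # hypothetical ids
m = recorder.begin_invocation("Slicer")
recorder.insert(337, "Slicer", m, depends_on=[308])
process_deps = recorder.invocation_dependencies()
```

Only the insertion events are stored explicitly; the process-level dependencies fall out of them, which is one reading of how the recorded information can be kept small.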

  7. Minimal Provenance Information
     [Figure: two panels, “Without Provenance” and “With Provenance”.]

  8. Querying Collection-Oriented Provenance
     Execution traces imply provenance graphs
     Graph edges encode data lineage and process relations:
       Lineage(Trace, Node, DependentNode, Actor, InvocCount)
     Provenance operations work over traces and graphs:
       Input(Trace, Node)
       Output(Trace, Node)
       Param(Trace, Name, Value, Actor, InvocCount)
       Metadata(Trace, Key, Value, Node)
       etc.
     [Figure: two lineage graphs, “Data/Collection creation lineage” and “Collection ‘last version’ lineage”, over the nodes AtlasImage (308), Image (311), Header (312), and AtlasSlice (337), with edges labeled “Slicer : 1”.]
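
A lineage query over such edges amounts to a backward graph traversal. The Python sketch below is an illustration, not the authors' Prolog implementation: `lineage_edges` is a hypothetical function, the edge tuples assume that `DependentNode` was derived from `Node`, and the node ids and the `Reslice` actor name are made up.

```python
from typing import Dict, List, Set, Tuple

# (node, dependent_node, actor, invocation_count):
# assumption: dependent_node was derived from node by that actor invocation.
Edge = Tuple[int, int, str, int]


def lineage_edges(edges: List[Edge], targets: List[int]) -> Set[Edge]:
    """Collect every edge on a dependency path leading to any target node."""
    by_dependent: Dict[int, List[Edge]] = {}
    for edge in edges:
        by_dependent.setdefault(edge[1], []).append(edge)

    found: Set[Edge] = set()
    seen: Set[int] = set()
    pending = list(targets)
    while pending:
        node = pending.pop()
        if node in seen:
            continue
        seen.add(node)
        for edge in by_dependent.get(node, []):
            found.add(edge)
            pending.append(edge[0])  # keep walking back toward the inputs
    return found


# Two independent chains: outputs 3 and 12 do not share any input.
trace = [
    (1, 2, "AlignWarp", 1),
    (2, 3, "Reslice", 1),
    (10, 11, "AlignWarp", 2),
    (11, 12, "Reslice", 2),
]
single = lineage_edges(trace, [3])
combined = lineage_edges(trace, [3, 12])
```

Querying a single output returns only the edges of its own chain, while querying several outputs returns the union of their lineages.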

  9. Challenge Results
     We used two different runs
     • Each run has embedded metadata and parameter settings
     • First run equivalent to the challenge workflow
     • Second run containing three sets of image collections, containing different numbers of images
     [Figure: input to first run. A WorkflowInput tree with one ImageCollection holding four AnatomyImage collections. Node labels: Image1 to Image4, Header1 to Header4, Reference, Image, Header.]

  10. Challenge Results
      We used two different runs
      • Each run has embedded metadata and parameter settings
      • First run equivalent to the challenge workflow
      • Second run containing three sets of image collections, containing different numbers of images
      [Figure: input to second run. A WorkflowInput tree with three ImageCollection nodes, each holding several AnatomyImage collections whose contents are elided as “…”.]

  11. Challenge Results (Trace 1)
      Full Data Dependencies
      Query:
        ?- trace(1, T), nodeId(T, 341, N1), nodeId(T, 349, N2), nodeId(T, 357, N3),
           lineageEdges(T, [N1, N2, N3], Edges), drawEdges(Edges).

  12. Challenge Results (Trace 1)
      • Question 1: Process that led to Atlas X Graphic
      Returns a subset of the lineage edges
      Query:
        ?- trace(1, T), nodeId(T, 341, N), lineageEdges(T, N, Edges), drawEdges(Edges).

  13. Challenge Results (Trace 2)
      • Question 1: Process that led to Atlas X Graphic
      Single workflow run in which not all outputs depend on all inputs.
      Query:
        ?- trace(2, T), nodeId(T, 973, N1), nodeId(T, 1093, N2), nodeId(T, 1193, N3),
           lineageEdges(T, [N1, N2, N3], Edges), drawEdges(Edges).

  14. Summary
      Benefits of our approach
      • Provenance support for Collection-Oriented SWFs
      • Minimal provenance information stored in a self-contained trace file
      • Provenance automatically embedded within the data stream; simple actor provenance API
      • Able to answer provenance challenge queries using simple operations (see Wiki entry). Note that we ignored question 7.
      Suggestions for a Future Provenance Challenge
      • More complex/realistic workflows (e.g., from bioinformatics)
        • Loops, nesting, partial dependencies, concurrency
      • More “scientist-oriented” provenance queries
        • Explicit queries for data dependencies (e.g., see Wiki entry)
        • Assume the user doesn’t know the structure of the trace (Query 5)

  15. References
      • Timothy McPhillips and Shawn Bowers. An Approach for Pipelining Nested Collections in Scientific Workflows. SIGMOD Record 34, 12-17, 2005.
      • Shawn Bowers, Timothy McPhillips, Bertram Ludaescher, Shirley Cohen, and Susan B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW'06), 2006.
      • Timothy McPhillips, Shawn Bowers, and Bertram Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. 3rd International Workshop on Data Integration in the Life Sciences (DILS'06), 2006.
