Linking Multiple Workflow Provenance Traces for Collaborative Science

Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science • Paolo Missier(1), Bertram Ludäscher(2), Shawn Bowers(3), Saumen Dey(2), Anandarup Sarkar(3), Biva Shrestha(4), Ilkay Altintas(5), Manish Kumar Anand(5), Carole Goble(1) • (1) School of Computer Science, University of Manchester • (2) Dept. of Computer Science, University of California, Davis • (3) Dept. of Computer Science, Gonzaga University • (4) Dept. of Computer Science, Appalachian State University • (5) San Diego Supercomputer Center, University of California, San Diego • WORKS’10, New Orleans

Context: Data Sharing • Implicit collaboration through data sharing • Alice uses nth generation input dataset x and produces n+1st generation output dataset z • … as part of run RA of workflow WA • … output z is published in some data-space • Bob uses Alice’s output z and produces n+2nd generation dataset v • … using workflow WB, possibly with pre-processing f • Alice and Bob may not know each other

Motivation: Virtual Joint Experiments • How do we ensure that Charlie gets a complete account of the history of WC’s outputs? • How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v? ➔ traces TA and TB will be critical ➔ need to compose them to obtain TC • We can view the composition WC as a new, virtual workflow

Provenance Composition: the Data Tree of Life (DToL) • We can formulate our questions in terms of provenance of the datasets produced by virtual workflow WC: • What is the complete provenance of v? • Answering the question requires tracing v’s derivation all the way to x • But, to achieve this, we need to ensure: • TA and TB are properly connected • Provenance queries run seamlessly over and across TA and TB

Test scenario: 1st Provenance Challenge Workflow • DataONE Summer-of-Code Project • Split First Provenance Challenge workflow at various points • Publish Part-I from system X, use as input for Part-II on system Y • X, Y in { Kepler/SDF, Kepler/COMAD, Taverna}

Common Model of Provenance (approx. OPM) • Data provenance for a single workflow run is well understood • Workflow spec: digraph W = (VW, EW) • VW = A ∪ C • actors A (processors) • channels C (FIFO data buffers) • EW = Ein ∪ Eout • in edges Ein ⊆ A × C • out edges Eout ⊆ C × A • Trace graph: acyclic digraph T = (VT, ET) • VT = I ∪ D (invocations I, data D) • ET = Eread ∪ Ewrite • read edges Eread ⊆ D × I • write edges Ewrite ⊆ I × D • TA is a trace instance of WA: there is a homomorphism h: TA ➔ WA, e.g. h(x1 ➔ a1) = h(x2 ➔ a2) = X ➔ A, h(a1 ➔ y1) = h(a2 ➔ y2) = A ➔ Y, ...
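The trace-instance condition can be checked mechanically. The following is a minimal sketch, assuming a toy workflow with one actor A between channels X and Y; the edge sets and the helper name `is_homomorphism` are illustrative and do not come from the prototype.

```python
# Hypothetical toy encoding of a workflow spec W and one trace T.
# Workflow: channel X -> actor A -> channel Y
W_edges = {("X", "A"), ("A", "Y")}

# Trace: two invocations a1, a2 of actor A, each reading one item and writing one.
read_edges = {("x1", "a1"), ("x2", "a2")}    # Eread  is a subset of D x I
write_edges = {("a1", "y1"), ("a2", "y2")}   # Ewrite is a subset of I x D

# h maps every trace node to the workflow node it instantiates.
h = {"x1": "X", "x2": "X", "a1": "A", "a2": "A", "y1": "Y", "y2": "Y"}

def is_homomorphism(trace_edges, spec_edges, mapping):
    """True if every trace edge (u, v) maps to a spec edge (h(u), h(v))."""
    return all((mapping[u], mapping[v]) in spec_edges for u, v in trace_edges)

trace_edges = read_edges | write_edges
print(is_homomorphism(trace_edges, W_edges, h))
```

A trace edge that runs against the workflow structure, such as y1 ➔ a1, would make the check fail.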

Data and Invocation Dependencies (ddep, idep) • read, write are natural observables for a workflow run • possible additional relations (recorded or inferred), either explicit or via rules: • invocation dependencies: “a2 depends on a1” because a1 has written data d and a2 has read d • data dependencies: “d2 depends on d1” because some actor invocation a read d1 prior to writing d2 • (Note: in some models of computation the rules above are not correct)
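The two inference rules can be sketched as set comprehensions over the read and write edges. This is an illustration under a simplifying assumption: the trace carries no timestamps, so the "read prior to writing" condition is not checked, and every read by an invocation is taken to precede every write by it. The function name `infer_dependencies` is hypothetical.

```python
def infer_dependencies(read_edges, write_edges):
    """Derive idep and ddep from read/write observables.

    read_edges:  set of (data, invocation) pairs
    write_edges: set of (invocation, data) pairs
    """
    # idep: a2 depends on a1 if a1 wrote some d that a2 read.
    idep = {(a2, a1)
            for (a1, d) in write_edges
            for (d_read, a2) in read_edges
            if d == d_read}
    # ddep: d2 depends on d1 if some invocation a read d1 and wrote d2.
    ddep = {(d2, d1)
            for (d1, a) in read_edges
            for (a_w, d2) in write_edges
            if a == a_w}
    return idep, ddep

# Toy trace: x -> a1 -> y -> a2 -> z
read_edges = {("x", "a1"), ("y", "a2")}
write_edges = {("a1", "y"), ("a2", "z")}
idep, ddep = infer_dependencies(read_edges, write_edges)
print(idep)  # {('a2', 'a1')}
print(ddep)
```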

Provenance queries • Local (“non-closure”) queries on a trace T: • (1) Find the data and traces published by Alice / Bob • (2) Find the inputs, outputs, and intermediate data products of T • (3) Find (selected) actors and channels used in T • (4) Find inputs and outputs of an invocation ai in T • Easy and not very interesting, e.g. the answer to (3) is just the set of nodes in h(T) • Closure queries: • operate on the transitive closure ddep* over ddep • suppose ddep* spans multiple traces TA, TB • we must define the standard closure query so that it operates on the composition of TA, TB
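For a single trace, a closure query over ddep* amounts to graph reachability. The sketch below assumes ddep is given as a set of (dependent, dependency) pairs; the function name `ddep_closure` is illustrative, and the slides do not prescribe a particular evaluation strategy.

```python
def ddep_closure(ddep, start):
    """Return all data items that `start` transitively depends on (ddep*)."""
    # Index direct dependencies: item -> items it directly depends on.
    direct = {}
    for d2, d1 in ddep:
        direct.setdefault(d2, set()).add(d1)
    seen, stack = set(), [start]
    while stack:
        d = stack.pop()
        for parent in direct.get(d, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy single-trace lineage: z depends on y, y depends on x.
ddep = {("y", "x"), ("z", "y")}
print(ddep_closure(ddep, "z"))
```

When the lineage crosses from one trace into another, this traversal stops at the trace boundary, which is the problem the remaining slides address.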

Issues in Provenance Composition • Closure queries now must span multiple provenance traces • Main problems and approaches: • I - Trace disconnect: traces that should “join” on the shared data are really disconnected ➔ make the data sharing process itself provenance-aware • II - Model heterogeneity: heterogeneity of both workflow and provenance models ➔ common provenance model with local ➔ global mapping • III - Data identifier mismatch: different workflows adopt different data identification schemes ➔ assert data equivalence as part of provenance

Part I – Provenance Stitching • The missing link: make every data copy step provenance-aware • r: data reference in store S • trace-equivalence of data items d in S, d’ in S’: d ≃ d’ if d’ is obtained by copying d from S to S’
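A provenance-aware copy step might look like the following minimal sketch. The `Store` class, the reference format, and the global `equivalences` set are assumptions made for illustration; only the idea that copying d from S to S’ records d ≃ d’ comes from the slide.

```python
class Store:
    """Hypothetical in-memory data store keyed by local reference."""
    def __init__(self, name):
        self.name = name
        self.items = {}

equivalences = set()  # recorded assertions d ≃ d'

def copy(r, src, dst):
    """Copy the item referenced by r from src to dst and log d ≃ d'."""
    r_new = f"{dst.name}:{len(dst.items)}"   # dst assigns its own reference
    dst.items[r_new] = src.items[r]
    # The assertion pairs (store, reference) of the original with that of the copy.
    equivalences.add(((src.name, r), (dst.name, r_new)))
    return r_new

S, S2 = Store("S"), Store("S2")
S.items["z"] = b"alice output"
r2 = copy("z", S, S2)
print(r2, equivalences)
```

Recording the assertion inside the copy operation means no later analysis has to guess which items in different stores are the same data.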

Part II - Mapping to a Common Provenance Model • Mapping rules (= code, queries) defined from Kepler and Taverna provenance models to common model (details omitted) • In the result TP, each reference r found in TS is replaced with ρ(r) • OPM used as intermediate target model • … doesn’t “nail” everything • a mixed blessing … • … but team-work made it work!

Part III – Data Identifier Reconciliation • We have seen that the copy operation r’ = copy(r, S, S’) on shared data store S generates a data equivalence assertion • It also keeps track of ID mappings, which are added to a renaming map from a set of S-specific references to a set of public references
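Applying a renaming map to a trace can be sketched as below. The map is written `rho` after the ρ(r) used on the mapping slide; the reference strings and the function name `rename_trace` are hypothetical, and the sketch assumes that nodes without an entry in the map (such as invocation ids) keep their local names.

```python
def rename_trace(edges, rho):
    """Replace each reference in a trace's edges with its public reference.

    Nodes absent from rho (e.g. invocation ids) are kept unchanged.
    """
    return {(rho.get(u, u), rho.get(v, v)) for u, v in edges}

# Hypothetical S-specific reference mapped to a public one.
rho = {"local:17": "pub:z"}

# Source trace TS: a1 wrote the item, a2 read it.
T_S = {("a1", "local:17"), ("local:17", "a2")}
T_P = rename_trace(T_S, rho)
print(T_P)
```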

Extended (across-runs) Provenance Queries • Closure queries are redefined on the extended provenance trace, which includes the trace-equivalences d ≃ d’
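One way to evaluate a closure query over the extended trace is to treat each equivalence as an extra edge from the copy back to its original. This is a sketch under that assumption, using hypothetical item names (Bob's copy of z is called z2); the slides do not spell out the exact redefinition.

```python
def extended_closure(ddep, equiv, start):
    """Items `start` depends on, following ddep edges and d ≃ d' links."""
    parents = {}
    for d2, d1 in ddep:
        parents.setdefault(d2, set()).add(d1)
    # Assumption: lineage flows from a copy back to the item it was copied from.
    for d, d_copy in equiv:
        parents.setdefault(d_copy, set()).add(d)
    seen, stack = {start}, [start]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen - {start}

ddep_A = {("z", "x")}        # Alice's trace: z derived from x
ddep_B = {("v", "z2")}       # Bob's trace: v derived from his copy z2
equiv = {("z", "z2")}        # z ≃ z2, asserted by the copy step
print(extended_closure(ddep_A | ddep_B, equiv, "v"))
```

Without the equivalence, the same query on v stops at z2 and never reaches Alice's input x.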

Prototype Architecture

Conclusions 1/2 • In theory, provenance interoperability should be solved/easy using e.g. OPM • In practice it isn’t (cf. Provenance Challenge workshops), e.g. • different mappings to OPM • different identifier schemes • traces broken “at the seams” • Summer-of-code DToL prototype demonstrates feasibility of provenance-aware collaboration / workflow interoperation through data • Extends potential of provenance analysis beyond isolated workflow-based experiments • Findings relevant for data preservation in • Tracing data access is key

Conclusions 2/2 • DataONE: • http://www.dataone.org/ • Data Tree-of-Life (DToL Summer Project) • https://sites.google.com/site/datatolproject/ • Runtime wf systems interoperability can be very hard • … and benefits not clear (unless “layered” approach w/ different roles of wf systems) • wf provenance interoperability to the rescue! • Next Steps: • DataONE Working Group on Provenance for Scientific Workflows • Develop DOPM (DataONE Provenance Model; OPM++)
