
An Open Provenance Model for Scientific Workflows

An Open Provenance Model for Scientific Workflows. Professor Luc Moreau, L.Moreau@ecs.soton.ac.uk, University of Southampton, www.ecs.soton.ac.uk/~lavm. Provenance & PASOA Teams, University of Southampton.




Presentation Transcript


  1. An Open Provenance Model for Scientific Workflows Professor Luc Moreau L.Moreau@ecs.soton.ac.uk University of Southampton www.ecs.soton.ac.uk/~lavm

  2. Provenance & PASOA Teams • University of Southampton • Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen • IBM UK (EU Project Coordinator) • John Ibbotson, Neil Hardman, Alexis Biller • University of Wales, Cardiff • Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari • Universitat Politècnica de Catalunya (UPC) • Steven Willmott, Javier Vazquez • SZTAKI • Laszlo Varga, Arpad Andics, Tamas Kifor • German Aerospace • Andreas Schreiber, Guy Kloss, Frank Danneman

  3. Contents • Motivation • Provenance Concept Map • Process documentation in a concrete bioinformatics application • Conclusions

  4. Motivation

  5. Peer Review/Audit • Academic publishing • Accounting • Healthcare • Banking

  6. e-Science datasets • How to undertake peer-reviewing and validation of e-Scientific results?

  7. Current Solutions • Proprietary, Monolithic • Silos, Closed • Do not inter-operate with other applications • Not adaptable to new regulations

  8. Provenance • Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the history or pedigree of a work of art, manuscript, rare book, etc.; • concretely, a record of the passage of an item through its various owners. • Concept vs representation

  9. Application Drivers • Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients • Aerospace engineering: maintain a historical record of design processes, up to 99 years • Bioinformatics: verification and auditing of "experiments" (e.g. for drug approval) • High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

  10. Provenance Concept Map

  11. [Concept map diagram. Nodes: Process, Process Documentation, Provenance (concept), Provenance (representation), Provenance Query, P-structure, P-assertions, Application, Data product, Services. Edge labels: documents, is defined as a past, has a structure, produces, is an execution of, is represented by, has, operates over, is obtained by, contains, assert, consists of.]

  12. Making Applications Provenance Aware • Assert p-assertions and record them as Process Documentation • Obtain the provenance of data by issuing provenance queries [Diagram labels: Application, Data Product, Provenance Store]
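The two roles on this slide (recording p-assertions, then querying for provenance) can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not the PASOA interfaces: the class `ProvenanceStore`, its methods `record` and `query`, and the example identifiers are all hypothetical names chosen here.

```python
from collections import defaultdict


class ProvenanceStore:
    """Hypothetical in-memory provenance store (illustration only)."""

    def __init__(self):
        # Maps a data product id to the p-assertions recorded about it.
        self._assertions = defaultdict(list)

    def record(self, data_id, asserter, statement):
        """Record one p-assertion made by `asserter` about `data_id`."""
        self._assertions[data_id].append((asserter, statement))

    def query(self, data_id):
        """Return the p-assertions about `data_id`, in recording order."""
        return list(self._assertions.get(data_id, []))


# The application asserts as it runs; a later query retrieves the documentation.
store = ProvenanceStore()
store.record("result.dat", "service-A", "received input.dat")
store.record("result.dat", "service-A", "sent result.dat")
print(store.query("result.dat"))
```

A real store would be a separate service shared by many actors; the sketch only shows the separation between the recording and querying sides.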

  13. Process Documentation • Interaction p-assertions: I received M1, M4; I sent M2, M3 • Relationship p-assertions: M3 = f1(M1); M2 = f2(M1,M4); M2 is in reply to M1 • Service state p-assertions: I received M1 at time t; I used algorithm x.y.z [Diagram: a service applying f1 and f2, with messages M1, M2, M3, M4]

  14. Data flow • Interaction p-assertions allow us to specify a flow of data between services • Relationship p-assertions allow us to characterise the flow of data "inside" a service • Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result
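To make the DAG idea concrete, here is a small Python sketch that walks the derivation edges backwards from a result. The edges for M2 and M3 come from the relationships on the process documentation slide (M3 = f1(M1), M2 = f2(M1,M4)); the upstream message `M0` and the function name `ancestors` are assumptions added purely for illustration.

```python
# Derivation edges: each item maps to the items it was derived from.
derived_from = {
    "M3": ["M1"],        # M3 = f1(M1), asserted inside the service
    "M2": ["M1", "M4"],  # M2 = f2(M1, M4), asserted inside the service
    "M1": ["M0"],        # hypothetical upstream service that produced M1
}


def ancestors(item, edges):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("M2", derived_from)))
```

Because the traversal follows edges across service boundaries as well as inside a service, the returned set reflects both the internal and the external data flow.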

  15. Process Documentation in a Concrete Bioinformatics Application

  16. Biology • How do protein sequences fold into a 3D structure? • Structure of protein sequences may help to answer this question. • Structure can be quantified by textual compressibility. • Which amino acid groupings maximize compressibility?

  17. Collaboration Diagram

  18. Actual Call DAG

  19. The P-Structure The logical structure of a provenance store

  20. Interaction Record The set of p-assertions pertaining to a given interaction (i.e., message exchange between a sender and a receiver)

  21. Interaction Key A unique identifier for an interaction Sender identity Receiver identity Local id

  22. View The set of p-assertions created by an asserter involved in an interaction (sender or receiver view)

  23. Asserter The identity of an asserter

  24. Interaction P-Assertion An assertion of the contents of a message by an actor that has sent or received that message

  25. Interaction P-Assertion Content The content of an interaction p-assertion: here, the invocation of blast (through a wrapper)

  26. Interaction Content Provenance-related information passed in application messages

  27. Actor State P-Assertion An assertion made by an actor about its internal state in the context of a specific interaction

  28. Relationship P-Assertion With respect to an interaction, a relationship p-assertion is an assertion, made by an actor, that describes how the actor obtained output data or the whole message sent in that interaction by applying some function to input data or messages from other interactions.

  29. Subject Id The identity of the subject of a relationship

  30. Object Id The identity of the object of a relationship
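The elements of the p-structure described on the preceding slides could be modelled along these lines. This is a hedged sketch using Python dataclasses, not the published schema: all class and field names are assumptions derived from the slide titles, and the example actors ("client", "blast-wrapper") are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class InteractionKey:
    """Unique identifier for an interaction."""
    sender: str
    receiver: str
    local_id: str


@dataclass
class InteractionPAssertion:
    content: str  # contents of the message sent or received


@dataclass
class ActorStatePAssertion:
    content: str  # actor's internal state in the context of the interaction


@dataclass
class RelationshipPAssertion:
    subject_id: str        # data produced
    relation: str          # name of the function or relation applied
    object_ids: List[str]  # data the subject was obtained from


@dataclass
class View:
    """P-assertions created by one asserter (sender or receiver)."""
    asserter: str
    p_assertions: list = field(default_factory=list)


@dataclass
class InteractionRecord:
    """All p-assertions pertaining to one interaction."""
    key: InteractionKey
    sender_view: View
    receiver_view: View


key = InteractionKey(sender="client", receiver="blast-wrapper", local_id="1")
record = InteractionRecord(key, View("client"), View("blast-wrapper"))
record.receiver_view.p_assertions.append(
    InteractionPAssertion(content="invocation of blast")
)
print(record)
```

Keeping one view per asserter mirrors the point that sender and receiver document the same interaction independently.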

  31. Process Documentation Characteristics • Common logical structure of the provenance store shared by all asserting and querying actors • Can be produced autonomously, asynchronously by the different application components • Open, extensible model, for which we are producing a public specification • Tools can operate on it (e.g. visualisation, reasoning)

  32. Performance (HPDC’05)

  33. Standardisation Philosophy • Thin layer common between systems: extensible data model • Model can be extended for specific: • technologies (WS, Web, …), or • application domains (Bio, Healthcare, Desktop, …) • Service interfaces

  34. Proposed List of Specifications [Diagram labels: Generic Profiles; Domain Specific Profiles; WS-Prov-DM-Sec; WS-Prov-Intro; WS-Prov-DM-Link; WS-Prov-Glo; WS-Prov-DM-Infer; WS-Prov-DM; WS-Prov-DM-DS; WS-Prov-Primer; WS-Prov-DM-Rel; WS-Prov-Rec; WS-Prov-Query; Technology Bindings; WS-Prov-SOAP; WS-Prov-WWW]

  35. Conclusions

  36. To Sum Up: Standardising the documentation of Business Processes [Diagram labels: Apply • Provenance • Architecture • Methodology; Record; Provenance Store; Query • Compliance check • Rerun/Reproduce • Analyse; sectors: Finance, Distribution, Aerospace, Healthcare, Automobile, Pharmaceutical] Slide from John Ibbotson

  37. Conclusions • Crucial topic for many applications • Full architectural specification • Implementation available for download • Methodology to make applications provenance-aware • Draft standardisation proposal to be released • www.pasoa.org • www.gridprovenance.org

  38. Provenance Challenge • Provenance Challenge Workshop at OGF18, Washington, September 11-14 • twiki.ipaw.info

  39. Questions
