1 / 62

Provenance: an open approach to experiment validation in e-Science

Provenance: an open approach to experiment validation in e-Science. Professor Luc Moreau L.Moreau@ecs.soton.ac.uk University of Southampton www.ecs.soton.ac.uk/~lavm. Provenance & PASOA Teams. University of Southampton

xuxa
Download Presentation

Provenance: an open approach to experiment validation in e-Science

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Provenance: an open approach to experiment validation in e-Science Professor Luc Moreau L.Moreau@ecs.soton.ac.uk University of Southampton www.ecs.soton.ac.uk/~lavm

  2. Provenance & PASOA Teams • University of Southampton • Luc Moreau, Victor Tan, Paul Groth, Simon Miles, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Luc Moreau • IBM UK (EU Project Coordinator) • John Ibbotson, Neil Hardman, Alexis Biller • University of Wales, Cardiff • Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten • Universitad Politecnica de Catalunya (UPC) • Steven Willmott, Javier Vazquez • SZTAKI • Laszlo Varga, Arpad Andics • German Aerospace • Andreas Schreiber, Guy Kloss, Frank Danneman

  3. Contents • Motivation • Provenance Concepts • Provenance Architecture • Standardisation • E-Science experiment validation • Conclusion

  4. Motivation

  5. Scientific Research Academic Peer Review

  6. Audit (Basel II) Audit (Sabanes-Oxley) Business Regulations Accounting Banking

  7. Health Care Management European Recommendation R(97)5: on the protection of medical data

  8. e-Science • How to undertake peer-reviewing and validation of e-Scientific results?

  9. The “next-compliance” problem Can we be certain that by ensuring compliance to a new regulation, we do not break previous compliance? Compliance to Regulations

  10. Current Solutions • Proprietary, Monolithic • Silos, Closed • Do not inter-operate with other applications • Not adaptable to new regulations

  11. Provenance • Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the historyor pedigree of a work of art, manuscript, rare book, etc.; • concretely, a record of the ultimate derivation and passage of an item through its various owners. • Concept vs representation

  12. Provenance in Computer Systems • Our definition of provenance in the context of applications for which process matters to end users: The provenance of a piece of data is the process that led to that piece of data • Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases

  13. Our Approach • Define core concepts pertaining to provenance • Specify functionality required to become “provenance-aware” • Define open data models and protocols that allow systems to inter-operate • Standardise data models and protocols • Provide a reference implementation • Provide reasoning capability

  14. Provenance Concepts

  15. Application Results Record Documentation of Execution Core Interfaces to Provenance Store Provenance “Lifecycle” Provenance Store Query and Reason over Provenance of Data Administer Store and its contents

  16. Nature of Documentation • We represent the provenance of some data by documenting the process that led to the data: • documentation can be complete or partial; • it can be accurate or inaccurate; • it can present conflicting or consensual views of the actors involved; • it can provide operational details of execution or it can be abstract.

  17. p-assertion • A given element of process documentation will be referred to as a p-assertion • p-assertion: is an assertion that is made by an actor and pertains to a process.

  18. Service Oriented Architecture • Broad definition of service as component that takes some inputs and produces some outputs. • Services are brought together to solve a given problem typically via a workflow definition that specifies their composition. • Interactions with services take place with messages that are constructed according to services interface specification. • The term actor denotes either a client or a service in a SOA. • A process is defined as execution of a workflow

  19. I received M1, M4 I sent M2, M3 I received M3 I sent M4 Process Documentation (1) From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4) Actor 2 Actor 1 M1 M3 If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages M4 M2

  20. M2 is in reply to M1 M3 is caused by M1 M2 is caused by M4 M4 is in reply to M3 Process Documentation (2) Actor 2 Actor 1 M1 M3 These assertions help identify order of messages, but not how data were computed M4 M2

  21. f1 f f2 M3 = f1(M1) M2 = f2(M1,M4) M4 = f(M3) Process Documentation (3) Actor 2 Actor 1 M1 M3 These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc) M4 M2

  22. I used sparc processor I used algorithm x version x.y.z I used 386 cluster Request sat in queue for 6min Process Documentation (4) Actor 2 Actor 1 M1 M3 M4 M2

  23. I received M1, M4 I sent M2, M3 Types of p-assertions (1) • Interaction p-assertion: is an assertion of the contents of a message by an actor that has sent or received that message

  24. M2 is in reply to M1 M3 is caused by M1 M2 is caused by M4 M3 = f1(M1) M2 = f2(M1,M4) Types of p-assertions (2) • Relationship p-assertion: is an assertion, made by an actor, that describes how the actor obtained output data or the whole message sent in an interaction by applying some function to input data or messages from other interactions.

  25. Types of p-assertions (3) • Actor state p-assertion: assertion made by an actor about its internal state in the context of a specific interaction I used sparc processor I used algorithm x version x.y.z

  26. Data flow • Interaction p-assertions allow us to specify a flow of data between actors • Relationship p-assertions allow us to characterise the flow of data “inside” an actor • Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result

  27. Provenance Architecture

  28. Application Results Record Documentation of Execution Interfaces to Provenance Store Provenance Store Query and Reason over Provenance Of Data Administer Store and its contents

  29. Data Structures

  30. Abstract machines DS Properties Termination Liveness Safety Statelessness Documentation Properties Immutability Attribution Datatype safety Foundation for adding necessary cryptographic techniques Recording Protocol (Groth04,05)

  31. Querying Functionality (Miles05) • Obtain the provenance of some specific data • Allow for “navigation” of the documentation of execution • Allows us to view the provenance store as if containing XML data structures • Independent of technology used for running application and internal store representation • Seamless navigation of application dependent and application independent provenance representation • Filter in/out p-assertions according to actors, process, types of relationships, etc

  32. Available Software • PReServ (Paul Groth & Simon Miles) • Offer recording and querying interfaces • Available from www.pasoa.org • Is being used in a bioinformatics application (cf. hpdc’05, iswc’05)

  33. Standardisation

  34. Standardisation Options

  35. Record Documentation of Execution Purpose of Standardisation Application Application Provenance Stores Allow for multiple applications to document their execution. Applications may be running in different institutions.

  36. Record Documentation of Execution Purpose of Standardisation Application Provenance Store Provenance Store Provenance Store Allow for multiple stores from multiple IT providers

  37. Purpose of Standardisation Provenance Store Provenance Store Query Provenance Of Data Allow for multiple stores from multiple IT providers

  38. Purpose of Standardisation Convert in standard data format Allow for legacy, monolithic applications to expose their contents (according to standard schema)

  39. Provenance Store Purpose of Standardisation Application Allow third parties to host provenance stores, which are trusted by application owners but also auditors

  40. Compliance verification Record Documentation of Execution Compliance Oriented Architectures Application Provenance Store Query Provenance Of Data

  41. Compliance Oriented Architectures • Separate execution documentation from compliance verification • Allows for multiple compliance verifications • Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting). • Approach is suitable for e-scientific peer-reviewing and business compliance verification

  42. Provenance Based Validationin e-Science (iswc’05)

  43. Context (1) Aerospace engineering: maintain a historical record of design processes, up to 99 years. Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients

  44. Context (2) Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval) High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

  45. The case for Provenance-based Validation • Static Validation • Operates on workflow source code • Programming language static analyses (e.g. type inference, escape analysis, etc.) • Workflow specific (concurrency analysis, graph-based partitioning, model checking, quality of service) • Workflow script may not be accessible or may be expressed in a language not supported by analysis tool • Dynamic Validation • Service based: interface matching, runtime type checking

  46. The case (2) • Provenance based reasoning • Allows for validation of experiments after execution • Third parties, such as reviewers and other scientists, may want to verify that the results obtained were computed correctly according to some criteria. • These criteria may not be known when the experiment was designed or run. • Important because science progresses (and models evolve!)

  47. Bioinformatics Scenario • A biologist has a set of proteins, for each of which he/she wishes to determine a particular biological property ?

  48. Experiment Design • The biologist designs a high-level plan of an experiment, describing each activity that must be performed • Each activity determines new information from analysing the information discovered in previous steps • The steps can be seen as a linked flow of data from the protein to the final property HIGH LEVEL PLAN ?

  49. Experiment Services • Service C • ………. • ……… • …………….. • …. • For each activity in the plan, the biologist must decide on the concrete service they will perform • Each service may be designed by the biologist him/herself or adopted from the work of another biologist • For each service there is a description of that service stating: • what the service does • what type of data it analyses (its inputs) and • what type of results it produces (its outputs) • All the descriptions are stored in a registry • Service B • ………. • ……… • …………….. • …. • Service A • ………. • ……… • …………….. • …. Registry Description of Service A Function: ….. Inputs: ….. Outputs: …..

More Related