provenance an open approach to experiment validation in e science l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Provenance: an open approach to experiment validation in e-Science PowerPoint Presentation
Download Presentation
Provenance: an open approach to experiment validation in e-Science

Loading in 2 Seconds...

play fullscreen
1 / 62

Provenance: an open approach to experiment validation in e-Science - PowerPoint PPT Presentation


  • 119 Views
  • Uploaded on

Provenance: an open approach to experiment validation in e-Science. Professor Luc Moreau L.Moreau@ecs.soton.ac.uk University of Southampton www.ecs.soton.ac.uk/~lavm. Provenance & PASOA Teams. University of Southampton

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Provenance: an open approach to experiment validation in e-Science' - xuxa


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
provenance an open approach to experiment validation in e science

Provenance: an open approach to experiment validation in e-Science

Professor Luc Moreau

L.Moreau@ecs.soton.ac.uk

University of Southampton

www.ecs.soton.ac.uk/~lavm

provenance pasoa teams
Provenance & PASOA Teams
  • University of Southampton
    • Luc Moreau, Victor Tan, Paul Groth, Simon Miles, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Luc Moreau
  • IBM UK (EU Project Coordinator)
    • John Ibbotson, Neil Hardman, Alexis Biller
  • University of Wales, Cardiff
    • Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten
  • Universitad Politecnica de Catalunya (UPC)
    • Steven Willmott, Javier Vazquez
  • SZTAKI
    • Laszlo Varga, Arpad Andics
  • German Aerospace
    • Andreas Schreiber, Guy Kloss, Frank Danneman
contents
Contents
  • Motivation
  • Provenance Concepts
  • Provenance Architecture
  • Standardisation
  • E-Science experiment validation
  • Conclusion
scientific research
Scientific Research

Academic Peer Review

business regulations

Audit (Basel II)

Audit (Sabanes-Oxley)

Business Regulations

Accounting

Banking

health care management
Health Care Management

European Recommendation R(97)5: on the protection of medical data

e science
e-Science
  • How to undertake peer-reviewing and validation of e-Scientific results?
compliance to regulations
The “next-compliance” problem

Can we be certain that by ensuring compliance to a new regulation, we do not break previous compliance?

Compliance to Regulations
current solutions
Current Solutions
  • Proprietary, Monolithic
  • Silos, Closed
  • Do not inter-operate with other applications
  • Not adaptable to new regulations
provenance
Provenance
  • Oxford English Dictionary:
    • the fact of coming from some particular source or quarter; origin, derivation
    • the historyor pedigree of a work of art, manuscript, rare book, etc.;
    • concretely, a record of the ultimate derivation and passage of an item

through its various owners.

  • Concept vs representation
provenance in computer systems
Provenance in Computer Systems
  • Our definition of provenance in the context of applications for which process matters to end users:

The provenance of a piece of data is the process that led to that piece of data

  • Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases
our approach
Our Approach
  • Define core concepts pertaining to provenance
  • Specify functionality required to become “provenance-aware”
  • Define open data models and protocols that allow systems to inter-operate
  • Standardise data models and protocols
  • Provide a reference implementation
  • Provide reasoning capability
provenance lifecycle

Application

Results

Record Documentation of Execution

Core Interfaces to Provenance Store

Provenance “Lifecycle”

Provenance

Store

Query and

Reason over

Provenance

of Data

Administer

Store and its

contents

nature of documentation
Nature of Documentation
  • We represent the provenance of some data by documenting the process that led to the data:
    • documentation can be complete or partial;
    • it can be accurate or inaccurate;
    • it can present conflicting or consensual views of the actors involved;
    • it can provide operational details of execution or it can be abstract.
p assertion
p-assertion
  • A given element of process documentation will be referred to as a p-assertion
    • p-assertion: is an assertion that is made by an actor and pertains to a process.
service oriented architecture
Service Oriented Architecture
  • Broad definition of service as component that takes some inputs and produces some outputs.
  • Services are brought together to solve a given problem typically via a workflow definition that specifies their composition.
  • Interactions with services take place with messages that are constructed according to services interface specification.
  • The term actor denotes either a client or a service in a SOA.
  • A process is defined as execution of a workflow
process documentation 1

I received M1, M4

I sent M2, M3

I received M3

I sent M4

Process Documentation (1)

From these p-assertions, we can derive that M3 was sent by Actor 1

and received by Actor 2 (and likewise for M4)

Actor 2

Actor 1

M1

M3

If actors are black boxes, these assertions are not very useful because

we do not know dependencies between messages

M4

M2

process documentation 2

M2 is in reply to M1

M3 is caused by M1

M2 is caused by M4

M4 is in reply to M3

Process Documentation (2)

Actor 2

Actor 1

M1

M3

These assertions help identify order of messages,

but not how data were computed

M4

M2

process documentation 3

f1

f

f2

M3 = f1(M1)

M2 = f2(M1,M4)

M4 = f(M3)

Process Documentation (3)

Actor 2

Actor 1

M1

M3

These assertions help identify how data is computed,

but provide no information about non-functional

characteristics of the computation

(time, resources used, etc)

M4

M2

process documentation 4

I used sparc

processor

I used algorithm

x version x.y.z

I used 386 cluster

Request sat in

queue for 6min

Process Documentation (4)

Actor 2

Actor 1

M1

M3

M4

M2

types of p assertions 1

I received M1, M4

I sent M2, M3

Types of p-assertions (1)
  • Interaction p-assertion: is an assertion of the contents of a message by an actor that has sent or received that message
types of p assertions 2

M2 is in reply to M1

M3 is caused by M1

M2 is caused by M4

M3 = f1(M1)

M2 = f2(M1,M4)

Types of p-assertions (2)
  • Relationship p-assertion: is an assertion, made by an actor, that describes how the actor obtained output data or the whole message sent in an interaction by applying some function to input data or messages from other interactions.
types of p assertions 3
Types of p-assertions (3)
  • Actor state p-assertion: assertion made by an actor about its internal state in the context of a specific interaction

I used sparc

processor

I used algorithm x

version x.y.z

data flow
Data flow
  • Interaction p-assertions allow us to specify a flow of data between actors
  • Relationship p-assertions allow us to characterise the flow of data “inside” an actor
  • Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result
interfaces to provenance store

Application

Results

Record Documentation of Execution

Interfaces to Provenance Store

Provenance

Store

Query and

Reason over

Provenance

Of Data

Administer

Store and its

contents

recording protocol groth04 05
Abstract machines

DS Properties

Termination

Liveness

Safety

Statelessness

Documentation Properties

Immutability

Attribution

Datatype safety

Foundation for adding necessary cryptographic techniques

Recording Protocol (Groth04,05)
querying functionality miles05
Querying Functionality (Miles05)
  • Obtain the provenance of some specific data
  • Allow for “navigation” of the documentation of execution
    • Allows us to view the provenance store as if containing XML data structures
    • Independent of technology used for running application and internal store representation
    • Seamless navigation of application dependent and application independent provenance representation
  • Filter in/out p-assertions according to actors, process, types of relationships, etc
available software
Available Software
  • PReServ (Paul Groth & Simon Miles)
  • Offer recording and querying interfaces
  • Available from www.pasoa.org
  • Is being used in a bioinformatics application (cf. hpdc’05, iswc’05)
purpose of standardisation

Record Documentation

of Execution

Purpose of Standardisation

Application

Application

Provenance

Stores

Allow for multiple applications to document their execution.

Applications may be running in different institutions.

purpose of standardisation37

Record Documentation

of Execution

Purpose of Standardisation

Application

Provenance

Store

Provenance

Store

Provenance

Store

Allow for multiple stores from multiple IT providers

purpose of standardisation38
Purpose of Standardisation

Provenance

Store

Provenance

Store

Query

Provenance

Of Data

Allow for multiple stores from multiple IT providers

purpose of standardisation39
Purpose of Standardisation

Convert in standard data

format

Allow for legacy, monolithic applications to expose their

contents (according to standard schema)

purpose of standardisation40

Provenance

Store

Purpose of Standardisation

Application

Allow third parties to host provenance stores,

which are trusted by application owners but also auditors

compliance oriented architectures

Compliance

verification

Record Documentation

of Execution

Compliance Oriented Architectures

Application

Provenance

Store

Query

Provenance

Of Data

compliance oriented architectures42
Compliance Oriented Architectures
  • Separate execution documentation from compliance verification
  • Allows for multiple compliance verifications
  • Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).
  • Approach is suitable for e-scientific peer-reviewing and business compliance verification
context 1
Context (1)

Aerospace engineering: maintain a historical record of design processes, up to 99 years.

Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients

context 2
Context (2)

Bioinformatics: verification and

auditing of “experiments” (e.g.

for drug approval)

High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

the case for provenance based validation
The case for Provenance-based Validation
  • Static Validation
    • Operates on workflow source code
    • Programming language static analyses (e.g. type inference, escape analysis, etc.)
    • Workflow specific (concurrency analysis, graph-based partitioning, model checking, quality of service)
    • Workflow script may not be accessible or may be expressed in a language not supported by analysis tool
  • Dynamic Validation
    • Service based: interface matching, runtime type checking
the case 2
The case (2)
  • Provenance based reasoning
    • Allows for validation of experiments after execution
    • Third parties, such as reviewers and other scientists, may want to verify that the results obtained were computed correctly according to some criteria.
    • These criteria may not be known when the experiment was designed or run.
    • Important because science progresses (and models evolve!)
bioinformatics scenario
Bioinformatics Scenario
  • A biologist has a set of proteins, for each of which he/she wishes to determine a particular biological property

?

experiment design
Experiment Design
  • The biologist designs a high-level plan of an experiment, describing each activity that must be performed
  • Each activity determines new information from analysing the information discovered in previous steps
  • The steps can be seen as a linked flow of data from the protein to the final property

HIGH

LEVEL

PLAN

?

experiment services
Experiment Services
  • Service C
  • ……….
  • ………
  • ……………..
  • ….
  • For each activity in the plan, the biologist must decide on the concrete service they will perform
  • Each service may be designed by the biologist him/herself or adopted from the work of another biologist
  • For each service there is a description of that service stating:
    • what the service does
    • what type of data it analyses (its inputs) and
    • what type of results it produces (its outputs)
  • All the descriptions are stored in a registry
  • Service B
  • ……….
  • ………
  • ……………..
  • ….
  • Service A
  • ……….
  • ………
  • ……………..
  • ….

Registry

Description of Service A

Function: …..

Inputs: …..

Outputs: …..

performing experiment

Provenance

Store

Performing Experiment
  • Service A
  • ……….
  • ………
  • …..
  • The biologist now performs the experiment as many times as they wish
  • The details of each experimental process is documented in a provenance store
  • This is achieved each service documenting its own execution using the recording interface
  • Service B
  • ……….
  • ………
  • …..
  • Service C
  • ……….
  • ………
  • …..
  • Service D
  • ……….
  • ………
  • …..
  • Service E
  • ……….
  • ………
  • …..
provenance questions
Provenance Questions
  • Later, the biologist examines the experiment results and has questions about the validity of process that produced them
    • Did the services I used actually fulfil my high level plan?
    • Two of the experiments were performed on the same protein but have different results – did I alter the services between these experiments?
    • Did I perform each service on the type of data that the service was intended to analyse, i.e. were the inputs and outputs of each activity compatible?
answering the questions
Answering the Questions
  • Using the documentation in the Provenance Store, we can reconstruct the process that led to each result
  • Along with the high level plan and the descriptions in the registry we have all the information required to answer the questions
q1 did the experiment follow the plan

Registry

Provenance

Store

Q1: Did the experiment follow the plan?

Retrieve

procedure

descriptions

Compare

procedure

function

to planned

activity

Description of Service A

Function: …..

Inputs: …..

Outputs: …..

Retrieve

documentation

of experiment

that led to a

result

A

?

q2 do services differ between experiments

Provenance

Store

Q2: Do services differ between experiments?

Retrieve documentation of experiments

  • Service A
  • ……….
  • ………
  • ……………..
  • Service A
  • ……….
  • ………
  • ……………..
  • ….

Highlight differences in services between experiments

q3 were the inputs and outputs compatible

Provenance

Store

Registry

Q3: Were the inputs and outputs compatible?

Retrieve descriptions for

each service

A

Retrieve each pair of services performed in an experiment, where one service’s output is the other’s input

B

Description of Service A

Function: …..

Inputs: …..

Outputs: …..

Description of Service B

Function: …..

Inputs: …..

Outputs: …..

Compare the output type of the first service with the input type of the second

ontological reasoning
Ontological Reasoning

Ontology

  • In some cases, the high-level activity may be described in a more general way than the service which performs it
  • Also, one service’s input may be a generalisation of the preceding service’s output
  • Therefore, exact matching of types may produce a false negative: the biologist will wrongly be told the experiment was invalid
  • By using an ontology, describing how types are related, we can reason about types and determine whether they are truly compatible

Protein

is generalisation

of

Human

Protein

to sum up

Standardising the

documentation of

Business Processes

Apply

Record

  • Provenance
    • Architecture
    • Methodology
  • Compliance check
  • Rerun/Reproduce
  • Analyse

Provenance

Store

Query

To Sum Up

Finance

Distribution

Aerospace

Healthcare

Automobile

Pharmaceutical

Slide from John Ibbotson

conclusions60
Conclusions
  • Crucial topic for many applications
  • Full architectural specification
  • An implementation available for download
  • Methodology to make application provenance-aware
  • www.pasoa.org
  • www.gridprovenance.org
publications
Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.

Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.

Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.

Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.

Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Lazslo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing --- Position Paper. In , 2003.

Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.

Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.

Publications