Automated Provenance Capture Architecture and Implementation

  1. Automated Provenance Capture Architecture and Implementation
  Karen Schuchardt, Eric Stephan, Tara Gibson, George Chin (PNNL)

  2. Methodology
  • Simulated using the Kepler workflow system. We did not attempt to leverage looping.
  • Programmed stub actors for each step with proper inputs, outputs, and user-controlled parameters.
  • Implemented an execution event listener that optionally records workflow execution; no changes to core Kepler were made (see the sketch after this list).
  • Applied/extended our existing content management/provenance system to see how far we could go with it.
  • Implemented actors/workflows for queries/visual analysis using XSLT/GraphViz.
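  As a rough illustration of the event-listener approach, the sketch below forwards workflow execution events to a provenance capture service without touching the engine itself. The listener class, the event record, and the ProvCaptureClient interface are hypothetical stand-ins for this write-up, not Kepler's actual listener API; the predicate names follow the logical overlay shown later.

    import java.time.Instant;
    import java.util.List;

    // Minimal sketch of an execution event listener that forwards workflow
    // events to a provenance capture service. The class, event record, and
    // client interface are hypothetical; the real implementation plugs into
    // Kepler's own listener mechanism.
    public class ProvenanceEventListener {

        // Hypothetical event carrying the identifiers the listener needs.
        public record ActorEvent(String workflowId, String actorId,
                                 Instant time, List<String> outputIds) {}

        // Hypothetical client that wraps HTTP calls to the capture service.
        public interface ProvCaptureClient {
            void assertTriple(String subject, String predicate, String object);
        }

        private final ProvCaptureClient capture;
        private final boolean enabled;   // user control over whether to record

        public ProvenanceEventListener(ProvCaptureClient capture, boolean enabled) {
            this.capture = capture;
            this.enabled = enabled;
        }

        public void actorStarted(ActorEvent e) {
            if (!enabled) return;
            capture.assertTriple(e.actorId(), "isPartOf", e.workflowId());
            capture.assertTriple(e.actorId(), "startedExecution", e.time().toString());
        }

        public void actorFinished(ActorEvent e) {
            if (!enabled) return;
            capture.assertTriple(e.actorId(), "finishedExecution", e.time().toString());
            for (String out : e.outputIds()) {
                capture.assertTriple(e.actorId(), "hasOutput", out);  // link produced values
            }
        }
    }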

  3. Provenance Capture Architecture
  [Architecture diagram. Components: Client Tools (Analysis Tools, Query Tools, Browsing Tools, Annotation Tools) and Workflow Tools (Workflow Engine, Workflow UI) emitting events; a Naming Service, Provenance Capture Service, Provenance Service, and Query Service; all backed by a Provenance/Content System providing metadata extraction, a Triple Store, a Content Store, indexing, Harvesting Tools, and translation.]

  4. SDG Provenance Capture Implementation (all within Kepler)
  [Implementation diagram mapping the architecture to concrete components: Kepler workflows, the Kepler Engine, and the Kepler UI as the workflow tools; URL/LSID naming; a Prov Capture Service reached via URIQA (RDF) and WebDAV; SEDASL and Nettool for queries; node-by-node comparison and the Batagelj/Mrvar algorithm for analysis (GML over WebDAV); Prov processor and Query processor; metadata extraction via Defuddle; Triple Store and Content Store with Lucene indexing, triple harvesters, and SAM; DavExplorer/Ecce as browsing tools; translation.]

  5. Physical Model
  [Model diagram: a Named Thing has 0..n Properties (a property can be a link or a value) and associated Content.]
  Any "thing" for which we want to capture some information is given a unique id, with which properties and relationships can be associated. Additionally, content can be associated with these "things" (see the sketch after this slide).
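  A minimal sketch of that physical model, assuming a simple in-memory representation: each named thing gets a generated unique id, a list of properties that are either literal values or links to other things, and optional attached content. Class and method names here are invented for illustration, not the project's actual schema.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    // Sketch of the physical model: any "thing" gets a unique id, 0..n
    // properties (each a link or a literal value), and optional content.
    public class NamedThing {

        // A property is either a literal value or a link to another thing's id.
        public record Property(String name, String value, boolean isLink) {}

        private final String id = "urn:uuid:" + UUID.randomUUID();  // unique id
        private final List<Property> properties = new ArrayList<>();
        private byte[] content;   // optional associated content (e.g. a file)

        public String id() { return id; }

        public void addValue(String name, String value) {
            properties.add(new Property(name, value, false));
        }

        public void addLink(String name, String otherThingId) {
            properties.add(new Property(name, otherThingId, true));
        }

        public void attachContent(byte[] bytes) { this.content = bytes; }

        public List<Property> properties() { return List.copyOf(properties); }
    }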

  6. Logical Overlay
  [Logical model diagram overlaid on the physical model (attachment of some properties to entities is approximate):
  • Workflow Instance: uid, title, creator, owningInstitution, created, createdWith, hasSource, wasRunBy, hasStatus, format, startedExecution, finishedExecution, [arbitrary triple]*
  • Actor Instance (0..n, isPartOf a Workflow Instance): uid, title, format, startedExecution, finishedExecution, hasParameter, [arbitrary triple]*
  • Parameter: hasValue OR hasHashOfValue
  • Port Value (0..n, related to an Actor Instance via isInput / hasOutput): uid, title, created, format, hasValue OR hasHashOfValue, [arbitrary triple]*
  Example assertions follow below.]
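  To make the overlay concrete, the snippet below prints a handful of example assertions for one run using the vocabulary above (isPartOf, hasParameter, hasOutput, hasHashOfValue). The URIs, ids, and values are invented for illustration (the actor and file names echo the alignment workflow mentioned on later slides), and the actual system asserts these through its capture service rather than printing them.

    // Illustrative only: prints example provenance assertions for one run
    // using the logical-overlay vocabulary. All ids and URIs are invented,
    // and the parameter is shown inline rather than as its own node.
    public class LogicalOverlayExample {

        static void triple(String s, String p, String o) {
            System.out.printf("%s  %s  %s%n", s, p, o);
        }

        public static void main(String[] args) {
            String wf    = "http://example.org/prov/workflow/run-001";
            String actor = "http://example.org/prov/actor/AlignWarp-1";
            String out   = "http://example.org/prov/value/warp1.warp";

            triple(wf,    "title",            "example alignment run");
            triple(wf,    "startedExecution", "2006-05-01T10:15:00Z");
            triple(actor, "isPartOf",         wf);
            triple(actor, "hasParameter",     "model=rigid");
            triple(actor, "hasOutput",        out);
            triple(out,   "format",           "application/octet-stream");
            triple(out,   "hasHashOfValue",   "sha1:...");   // hash captured instead of the value
        }
    }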

  7. Semantically Extended DASL Queries
  • Select: all properties or a specific list; result format (GXL, RDF, WebDAV)
  • Scope: a URL or a query (i.e., 2-phase); names of properties to follow (and direction); stop conditions (property/value comparisons, depth)
  • Where: property name/value comparisons, content search
  (A rough example request is sketched below.)
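  The sketch below shows roughly how a client might send such a query over WebDAV's SEARCH method: a Select list with a result format, a Scope naming a starting URL, a property to follow in reverse, and a depth limit, and a Where comparison. The XML element names only mirror the slide's Select/Scope/Where structure; they are not the project's actual SEDASL syntax, and the endpoint URL is made up.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch of issuing a semantically extended DASL query via the WebDAV
    // SEARCH method. The body's element names are invented to mirror the
    // Select/Scope/Where structure; they are not the real SEDASL syntax.
    public class SedaslQueryExample {
        public static void main(String[] args) throws Exception {
            String body = """
                <searchrequest>
                  <select><prop>title</prop><prop>creator</prop><format>rdf</format></select>
                  <scope>
                    <href>http://example.org/prov/workflow/run-001</href>
                    <follow direction="reverse">hasOutput</follow>
                    <stop depth="3"/>
                  </scope>
                  <where><eq><prop>format</prop><value>image/gif</value></eq></where>
                </searchrequest>
                """;

            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://example.org/prov/query"))
                .header("Content-Type", "text/xml")
                .method("SEARCH", HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());   // e.g. RDF describing matches
        }
    }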

  8. Workflow Comparisons
  [Diagram: provenance graphs traversed over isPartOf, isInput, and hasOutput plus their reverse properties; node attributes compared include title, instantiationOf, source, value, and format.]
  • Node-by-node comparisons
  • Nodes match if all node attributes and incoming and outgoing edges match
  • Nodes are similar if attributes and edges match to some specified XX%
  • After node comparisons, edges are compared
  • Edges match if connecting nodes were found to be exactly matching or similar and edge attributes match
  • Edges are similar if attributes match to some specified XX%
  • Outputs include: matching or similar nodes, matching or similar edges, nodes only in the first or second graph, edges only in the first or second graph (a simplified matching sketch follows below)
  Example output, nodes only in the first graph: node52 (atlas-z.gif), node14 (imageformat), node53 (convertyimage), node57 (atlas-y.gif), node36 (convertzimage), node34 (imageformat), node15 (imageformat), node78 (atlas-x.gif), node26 (convertximage). Count: 9
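  A rough sketch of the node-matching step described above: two nodes match when all attributes agree, and count as "similar" when the fraction of agreeing attributes meets a threshold. The node representation is simplified for illustration, and edge comparison (omitted here) would follow the same pattern.

    import java.util.Map;
    import java.util.Set;

    // Simplified sketch of node-by-node comparison: exact match when all
    // attributes agree, "similar" when the fraction of agreeing attributes
    // meets a threshold. The real comparison also checks incoming and
    // outgoing edges of each node.
    public class NodeComparison {

        public record Node(String id, Map<String, String> attributes) {}

        public enum Result { MATCH, SIMILAR, DIFFERENT }

        public static Result compare(Node a, Node b, double threshold) {
            Set<String> keys = new java.util.HashSet<>(a.attributes().keySet());
            keys.addAll(b.attributes().keySet());
            if (keys.isEmpty()) return Result.MATCH;

            long agree = keys.stream()
                .filter(k -> java.util.Objects.equals(a.attributes().get(k),
                                                      b.attributes().get(k)))
                .count();
            double fraction = (double) agree / keys.size();

            if (fraction == 1.0) return Result.MATCH;
            return fraction >= threshold ? Result.SIMILAR : Result.DIFFERENT;
        }

        public static void main(String[] args) {
            Node n1 = new Node("node52", Map.of("title", "atlas-z.gif", "format", "image/gif"));
            Node n2 = new Node("node99", Map.of("title", "atlas-z.gif", "format", "image/png"));
            System.out.println(compare(n1, n2, 0.5));   // prints SIMILAR
        }
    }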

  9. Workflow Graph Distances
  [Diagram: each workflow graph is reduced to a triad census vector, e.g. (0, 4, 6, 1, ...).]
  • Implements a social network algorithm based on the triad census (Batagelj and Mrvar, 2001; Chin, Whitney, Powers, and Johnson, 2004)
  • Examines every possible combination of three nodes in a workflow graph
  • Each three-node combination falls into 1 of 64 possible triad states
  • The census is the count of triads in each state, which may be used to summarize or profile the overall graph structure
  • Distance is computed by taking the Euclidean distance of two triad censuses, normalized to a 0.0..1.0 value (see the sketch below)
  • Most useful for assessing similarity across large, complex workflows
  • Distance computed for two workflow graphs: 0.095888
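  A simplified sketch of this idea: classify every 3-node combination by its pattern of directed edges (2^6 = 64 possible states), build a 64-element census per graph, then take a Euclidean distance between the two censuses. The normalization used here (census proportions divided by sqrt(2)) is one reasonable choice, not necessarily the one in the cited algorithm, and the state code here depends on node ordering; a production version would canonicalize each triad.

    import java.util.List;
    import java.util.Set;

    // Sketch of a triad-census distance between two directed graphs.
    public class TriadCensusDistance {

        /** Graph given as a node list and a set of edges of the form "from->to". */
        public static double[] census(List<String> nodes, Set<String> edges) {
            double[] counts = new double[64];
            for (int i = 0; i < nodes.size(); i++)
                for (int j = i + 1; j < nodes.size(); j++)
                    for (int k = j + 1; k < nodes.size(); k++) {
                        String a = nodes.get(i), b = nodes.get(j), c = nodes.get(k);
                        // One bit per possible directed edge among the three nodes.
                        String[][] pairs = {{a, b}, {b, a}, {a, c}, {c, a}, {b, c}, {c, b}};
                        int state = 0;
                        for (int bit = 0; bit < 6; bit++)
                            if (edges.contains(pairs[bit][0] + "->" + pairs[bit][1]))
                                state |= 1 << bit;
                        counts[state]++;
                    }
            return counts;
        }

        /** Euclidean distance between census proportions, scaled into 0.0..1.0. */
        public static double distance(double[] c1, double[] c2) {
            double n1 = java.util.Arrays.stream(c1).sum();
            double n2 = java.util.Arrays.stream(c2).sum();
            if (n1 == 0 || n2 == 0) return 0.0;
            double sum = 0.0;
            for (int s = 0; s < 64; s++) {
                double d = c1[s] / n1 - c2[s] / n2;
                sum += d * d;
            }
            return Math.sqrt(sum) / Math.sqrt(2.0);  // max distance between proportion vectors
        }
    }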

  10. What's Cool
  • Combined RDF assertions with scientific content management
  • Flexible capabilities for metadata extraction (e.g., Defuddle to extract data from a warp file); existing RDF harvesters could be plugged in through the same mechanism
  • Extensible translation mechanism (browse tools can provide views of raw data, such as a table of warp parameters)
  • Conceptually simple model that can apply to much more than workflow execution; readily adaptable to alternative models, constructs, and relationships
  • Indexing and query of content or metadata
  • All relationships are reverse indexed automatically: you can search up or down and even mix directions on specific properties
  • Flexible event-based model so as to minimize connections into the workflow engine
  • Actors can contribute their own metadata easily through events
  • User control over which actors to capture provenance on
  • Automatic content type determination
  • Multiple output formats
  • Capability to capture hashes instead of values (see the sketch below)
  • Leveraged DASL extension mechanisms
  • Based on existing standards (HTTP), so existing tools can be leveraged
  • Pluggable authentication model based on JAAS
  • Everything is open source
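  For the hash-instead-of-value option, a capture client might record a digest of a large port value rather than the value itself, along the lines of the sketch below. It uses the standard Java MessageDigest API; the file name and the printed predicate (hasHashOfValue, from the logical overlay) are illustrative.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    // Sketch of capturing a hash instead of a (possibly large) value:
    // digest the bytes and assert hasHashOfValue rather than hasValue.
    public class HashCapture {
        public static String hashOf(Path file) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(Files.readAllBytes(file));
            return "sha1:" + HexFormat.of().formatHex(digest);
        }

        public static void main(String[] args) throws Exception {
            Path warp = Path.of("warp1.warp");   // example output file; must exist to run
            System.out.println("hasHashOfValue " + hashOf(warp));
        }
    }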

  11. Limitations
  • Prov capture is slow. We do one assertion at a time currently, but they could all be packaged up into one request.
  • RDF predicates can't contain special characters, but things like parameters often have these characters.
  • SAM can be made to work, but the current implementation based on WebDAV ties resources to metadata. We had to create dummy resources.
  • SAM is not RDF based.
  • Big files (reference images) are duplicated as part of provenance tracking because they are data inputs to multiple actors.
  • Did not get to the LSID service, but it would be nice if this wasn't a separate protocol to deal with.

  12. Kepler/Workflow Comments
  • Decided to stay with the brute-force model instead of a loop-based model. A loop-based model would probably introduce controller actors that would obscure the provenance capture.
  • Issue of what to capture provenance on for more general workflows.
  • Coding actors for each thing you want to do doesn't scale and is a barrier to adoption by scientists.
  • Can't control actor firing order, which resulted in things like AlignWarp4 producing warp1.warp.
  • We used string constant actors to supply input files, but it makes more sense for Kepler to support the concept of a data source.
  • We could not tell if a port value was a file except by using File.exists().
  • Would like to see events be external for complete separation from the workflow engine.

  13. Out of (Current) Scope
  • Dynamically changing and continuing workflows (i.e., evolving workflows).
  • Pointing back to provenance on actors. A real system would do this, and the actors themselves would have global ids that could be referenced.
  • Capturing provenance on workflow descriptors and pointing back to them (same as above for actors).
  • Use of LSIDs. We have the service running but never got to the point of inserting it; instead we used a URL name-generation service.
  • Signing results.

  14. How data was generated
  • User-set parameter values
  • Workflow structure/execution capture
  • Outside tools
  • Auto-generated metadata/content
  The structure of the query
  • 2-phase query (or recursive)
  • Specifying what to include
  • Specifying what to exclude
  What it will be used for
  • Exploratory analysis
  • Directed query to answer a specific question
  • Debugging
  • Verification
  • Comparison
  • Brainstorming
  • Categorization
