Automated Provenance Capture Architecture and Implementation

  1. Automated Provenance Capture Architecture and Implementation
  Karen Schuchardt, Eric Stephan, Tara Gibson, George Chin (PNNL)

  2. Methodology
  • Simulated using the Kepler workflow system. We did not attempt to leverage looping.
  • Programmed stub actors for each step with proper inputs, outputs, and user-controlled parameters.
  • Implemented an execution event listener that optionally records workflow execution; no changes to core Kepler were made (see the sketch after this list).
  • Applied/extended our existing content management/provenance system to see how far we could go with it.
  • Implemented actors/workflows for queries/visual analysis using XSLT/GraphViz.
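  As a rough illustration of the event-listener approach, the sketch below forwards workflow execution events to a provenance capture service without touching the engine itself. The listener class, the event record, and the ProvCaptureClient interface are hypothetical stand-ins for this write-up, not Kepler's actual listener API; the predicate names follow the logical overlay shown later.

    import java.time.Instant;
    import java.util.List;

    // Minimal sketch of an execution event listener that forwards workflow
    // events to a provenance capture service. The class, event record, and
    // client interface are hypothetical; the real implementation plugs into
    // Kepler's own listener mechanism.
    public class ProvenanceEventListener {

        // Hypothetical event carrying the identifiers the listener needs.
        public record ActorEvent(String workflowId, String actorId,
                                 Instant time, List<String> outputIds) {}

        // Hypothetical client that wraps HTTP calls to the capture service.
        public interface ProvCaptureClient {
            void assertTriple(String subject, String predicate, String object);
        }

        private final ProvCaptureClient capture;
        private final boolean enabled;   // user control over whether to record

        public ProvenanceEventListener(ProvCaptureClient capture, boolean enabled) {
            this.capture = capture;
            this.enabled = enabled;
        }

        public void actorStarted(ActorEvent e) {
            if (!enabled) return;
            capture.assertTriple(e.actorId(), "isPartOf", e.workflowId());
            capture.assertTriple(e.actorId(), "startedExecution", e.time().toString());
        }

        public void actorFinished(ActorEvent e) {
            if (!enabled) return;
            capture.assertTriple(e.actorId(), "finishedExecution", e.time().toString());
            for (String out : e.outputIds()) {
                capture.assertTriple(e.actorId(), "hasOutput", out);  // link produced values
            }
        }
    }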

  3. Provenance Capture Architecture
  [Architecture diagram. Components: Client Tools (Analysis Tools, Query Tools, Browsing Tools, Annotation Tools) and Workflow Tools (Workflow Engine, Workflow UI) emitting events; a Naming Service, Provenance Capture Service, Provenance Service, and Query Service; all backed by a Provenance/Content System providing metadata extraction, a Triple Store, a Content Store, indexing, Harvesting Tools, and translation.]

  4. SDG Provenance Capture Implementation (all within Kepler)
  [Implementation diagram mapping the architecture to concrete components: Kepler workflows, the Kepler Engine, and the Kepler UI as the workflow tools; URL/LSID naming; a Prov Capture Service reached via URIQA (RDF) and WebDAV; SEDASL and Nettool for queries; node-by-node comparison and the Batagelj/Mrvar algorithm for analysis (GML over WebDAV); Prov processor and Query processor; metadata extraction via Defuddle; Triple Store and Content Store with Lucene indexing, triple harvesters, and SAM; DavExplorer/Ecce as browsing tools; translation.]

  5. Physical Model
  [Model diagram: a Named Thing has 0..n Properties (a property can be a link or a value) and associated Content.]
  Any "thing" for which we want to capture some information is given a unique id, with which properties and relationships can be associated. Additionally, content can be associated with these "things" (see the sketch after this slide).
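  A minimal sketch of that physical model, assuming a simple in-memory representation: each named thing gets a generated unique id, a list of properties that are either literal values or links to other things, and optional attached content. Class and method names here are invented for illustration, not the project's actual schema.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    // Sketch of the physical model: any "thing" gets a unique id, 0..n
    // properties (each a link or a literal value), and optional content.
    public class NamedThing {

        // A property is either a literal value or a link to another thing's id.
        public record Property(String name, String value, boolean isLink) {}

        private final String id = "urn:uuid:" + UUID.randomUUID();  // unique id
        private final List<Property> properties = new ArrayList<>();
        private byte[] content;   // optional associated content (e.g. a file)

        public String id() { return id; }

        public void addValue(String name, String value) {
            properties.add(new Property(name, value, false));
        }

        public void addLink(String name, String otherThingId) {
            properties.add(new Property(name, otherThingId, true));
        }

        public void attachContent(byte[] bytes) { this.content = bytes; }

        public List<Property> properties() { return List.copyOf(properties); }
    }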

  6. Logical Overlay
  [Logical model diagram overlaid on the physical model (attachment of some properties to entities is approximate):
  • Workflow Instance: uid, title, creator, owningInstitution, created, createdWith, hasSource, wasRunBy, hasStatus, format, startedExecution, finishedExecution, [arbitrary triple]*
  • Actor Instance (0..n, isPartOf a Workflow Instance): uid, title, format, startedExecution, finishedExecution, hasParameter, [arbitrary triple]*
  • Parameter: hasValue OR hasHashOfValue
  • Port Value (0..n, related to an Actor Instance via isInput / hasOutput): uid, title, created, format, hasValue OR hasHashOfValue, [arbitrary triple]*
  Example assertions follow below.]
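  To make the overlay concrete, the snippet below prints a handful of example assertions for one run using the vocabulary above (isPartOf, hasParameter, hasOutput, hasHashOfValue). The URIs, ids, and values are invented for illustration (the actor and file names echo the alignment workflow mentioned on later slides), and the actual system asserts these through its capture service rather than printing them.

    // Illustrative only: prints example provenance assertions for one run
    // using the logical-overlay vocabulary. All ids and URIs are invented,
    // and the parameter is shown inline rather than as its own node.
    public class LogicalOverlayExample {

        static void triple(String s, String p, String o) {
            System.out.printf("%s  %s  %s%n", s, p, o);
        }

        public static void main(String[] args) {
            String wf    = "http://example.org/prov/workflow/run-001";
            String actor = "http://example.org/prov/actor/AlignWarp-1";
            String out   = "http://example.org/prov/value/warp1.warp";

            triple(wf,    "title",            "example alignment run");
            triple(wf,    "startedExecution", "2006-05-01T10:15:00Z");
            triple(actor, "isPartOf",         wf);
            triple(actor, "hasParameter",     "model=rigid");
            triple(actor, "hasOutput",        out);
            triple(out,   "format",           "application/octet-stream");
            triple(out,   "hasHashOfValue",   "sha1:...");   // hash captured instead of the value
        }
    }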

  7. Semantically Extended DASL Queries
  • Select: all properties or a specific list; result format (GXL, RDF, WebDAV)
  • Scope: a URL or a query (i.e., 2-phase); names of properties to follow (and direction); stop conditions (property/value comparisons, depth)
  • Where: property name/value comparisons, content search
  (A rough example request is sketched below.)
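  The sketch below shows roughly how a client might send such a query over WebDAV's SEARCH method: a Select list with a result format, a Scope naming a starting URL, a property to follow in reverse, and a depth limit, and a Where comparison. The XML element names only mirror the slide's Select/Scope/Where structure; they are not the project's actual SEDASL syntax, and the endpoint URL is made up.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch of issuing a semantically extended DASL query via the WebDAV
    // SEARCH method. The body's element names are invented to mirror the
    // Select/Scope/Where structure; they are not the real SEDASL syntax.
    public class SedaslQueryExample {
        public static void main(String[] args) throws Exception {
            String body = """
                <searchrequest>
                  <select><prop>title</prop><prop>creator</prop><format>rdf</format></select>
                  <scope>
                    <href>http://example.org/prov/workflow/run-001</href>
                    <follow direction="reverse">hasOutput</follow>
                    <stop depth="3"/>
                  </scope>
                  <where><eq><prop>format</prop><value>image/gif</value></eq></where>
                </searchrequest>
                """;

            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://example.org/prov/query"))
                .header("Content-Type", "text/xml")
                .method("SEARCH", HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());   // e.g. RDF describing matches
        }
    }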

  8. Workflow Comparisons
  [Diagram: provenance graphs traversed over isPartOf, isInput, and hasOutput plus their reverse properties; node attributes compared include title, instantiationOf, source, value, and format.]
  • Node-by-node comparisons
  • Nodes match if all node attributes and incoming and outgoing edges match
  • Nodes are similar if attributes and edges match to some specified XX%
  • After node comparisons, edges are compared
  • Edges match if connecting nodes were found to be exactly matching or similar and edge attributes match
  • Edges are similar if attributes match to some specified XX%
  • Outputs include: matching or similar nodes, matching or similar edges, nodes only in the first or second graph, edges only in the first or second graph (a simplified matching sketch follows below)
  Example output, nodes only in the first graph: node52 (atlas-z.gif), node14 (imageformat), node53 (convertyimage), node57 (atlas-y.gif), node36 (convertzimage), node34 (imageformat), node15 (imageformat), node78 (atlas-x.gif), node26 (convertximage). Count: 9
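  A rough sketch of the node-matching step described above: two nodes match when all attributes agree, and count as "similar" when the fraction of agreeing attributes meets a threshold. The node representation is simplified for illustration, and edge comparison (omitted here) would follow the same pattern.

    import java.util.Map;
    import java.util.Set;

    // Simplified sketch of node-by-node comparison: exact match when all
    // attributes agree, "similar" when the fraction of agreeing attributes
    // meets a threshold. The real comparison also checks incoming and
    // outgoing edges of each node.
    public class NodeComparison {

        public record Node(String id, Map<String, String> attributes) {}

        public enum Result { MATCH, SIMILAR, DIFFERENT }

        public static Result compare(Node a, Node b, double threshold) {
            Set<String> keys = new java.util.HashSet<>(a.attributes().keySet());
            keys.addAll(b.attributes().keySet());
            if (keys.isEmpty()) return Result.MATCH;

            long agree = keys.stream()
                .filter(k -> java.util.Objects.equals(a.attributes().get(k),
                                                      b.attributes().get(k)))
                .count();
            double fraction = (double) agree / keys.size();

            if (fraction == 1.0) return Result.MATCH;
            return fraction >= threshold ? Result.SIMILAR : Result.DIFFERENT;
        }

        public static void main(String[] args) {
            Node n1 = new Node("node52", Map.of("title", "atlas-z.gif", "format", "image/gif"));
            Node n2 = new Node("node99", Map.of("title", "atlas-z.gif", "format", "image/png"));
            System.out.println(compare(n1, n2, 0.5));   // prints SIMILAR
        }
    }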

  9. Workflow Graph Distances
  [Diagram: each workflow graph is reduced to a triad census vector, e.g. (0, 4, 6, 1, ...).]
  • Implements a social network algorithm based on the triad census (Batagelj and Mrvar, 2001; Chin, Whitney, Powers, and Johnson, 2004)
  • Examines every possible combination of three nodes in a workflow graph
  • Each three-node combination falls into 1 of 64 possible triad states
  • The census is the count of triads in each state, which may be used to summarize or profile the overall graph structure
  • Distance is computed by taking the Euclidean distance of two triad censuses, normalized to a 0.0..1.0 value (see the sketch below)
  • Most useful for assessing similarity across large, complex workflows
  • Distance computed for two workflow graphs: 0.095888
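  A simplified sketch of this idea: classify every 3-node combination by its pattern of directed edges (2^6 = 64 possible states), build a 64-element census per graph, then take a Euclidean distance between the two censuses. The normalization used here (census proportions divided by sqrt(2)) is one reasonable choice, not necessarily the one in the cited algorithm, and the state code here depends on node ordering; a production version would canonicalize each triad.

    import java.util.List;
    import java.util.Set;

    // Sketch of a triad-census distance between two directed graphs.
    public class TriadCensusDistance {

        /** Graph given as a node list and a set of edges of the form "from->to". */
        public static double[] census(List<String> nodes, Set<String> edges) {
            double[] counts = new double[64];
            for (int i = 0; i < nodes.size(); i++)
                for (int j = i + 1; j < nodes.size(); j++)
                    for (int k = j + 1; k < nodes.size(); k++) {
                        String a = nodes.get(i), b = nodes.get(j), c = nodes.get(k);
                        // One bit per possible directed edge among the three nodes.
                        String[][] pairs = {{a, b}, {b, a}, {a, c}, {c, a}, {b, c}, {c, b}};
                        int state = 0;
                        for (int bit = 0; bit < 6; bit++)
                            if (edges.contains(pairs[bit][0] + "->" + pairs[bit][1]))
                                state |= 1 << bit;
                        counts[state]++;
                    }
            return counts;
        }

        /** Euclidean distance between census proportions, scaled into 0.0..1.0. */
        public static double distance(double[] c1, double[] c2) {
            double n1 = java.util.Arrays.stream(c1).sum();
            double n2 = java.util.Arrays.stream(c2).sum();
            if (n1 == 0 || n2 == 0) return 0.0;
            double sum = 0.0;
            for (int s = 0; s < 64; s++) {
                double d = c1[s] / n1 - c2[s] / n2;
                sum += d * d;
            }
            return Math.sqrt(sum) / Math.sqrt(2.0);  // max distance between proportion vectors
        }
    }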

  10. What's Cool
  • Combined RDF assertions with scientific content management
  • Flexible capabilities for metadata extraction (e.g., Defuddle to extract data from a warp file); existing RDF harvesters could be plugged in through the same mechanism
  • Extensible translation mechanism (browse tools can provide views of raw data, such as a table of warp parameters)
  • Conceptually simple model that can apply to much more than workflow execution; readily adaptable to alternative models, constructs, and relationships
  • Indexing and query of content or metadata
  • All relationships are reverse indexed automatically: you can search up or down and even mix directions on specific properties
  • Flexible event-based model so as to minimize connections into the workflow engine
  • Actors can contribute their own metadata easily through events
  • User control over which actors to capture provenance on
  • Automatic content type determination
  • Multiple output formats
  • Capability to capture hashes instead of values (see the sketch below)
  • Leveraged DASL extension mechanisms
  • Based on existing standards (HTTP), so existing tools can be leveraged
  • Pluggable authentication model based on JAAS
  • Everything is open source
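  For the hash-instead-of-value option, a capture client might record a digest of a large port value rather than the value itself, along the lines of the sketch below. It uses the standard Java MessageDigest API; the file name and the printed predicate (hasHashOfValue, from the logical overlay) are illustrative.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    // Sketch of capturing a hash instead of a (possibly large) value:
    // digest the bytes and assert hasHashOfValue rather than hasValue.
    public class HashCapture {
        public static String hashOf(Path file) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(Files.readAllBytes(file));
            return "sha1:" + HexFormat.of().formatHex(digest);
        }

        public static void main(String[] args) throws Exception {
            Path warp = Path.of("warp1.warp");   // example output file; must exist to run
            System.out.println("hasHashOfValue " + hashOf(warp));
        }
    }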

  11. Limitations
  • Prov capture is slow. We do one assertion at a time currently, but they could all be packaged up into one request.
  • RDF predicates can't contain special characters, but things like parameters often have these characters.
  • SAM can be made to work, but the current implementation based on WebDAV ties resources to metadata. We had to create dummy resources.
  • SAM is not RDF based.
  • Big files (reference images) are duplicated as part of provenance tracking because they are data inputs to multiple actors.
  • Did not get to the LSID service, but it would be nice if this wasn't a separate protocol to deal with.

  12. Kepler/Workflow Comments
  • Decided to stay with the brute-force model instead of a loop-based model. A loop-based model would probably introduce controller actors that would obscure the provenance capture.
  • Issue of what to capture provenance on for more general workflows.
  • Coding actors for each thing you want to do doesn't scale and is a barrier to adoption by scientists.
  • Can't control actor firing order, which resulted in things like AlignWarp4 producing warp1.warp.
  • We used string constant actors to supply input files, but it makes more sense for Kepler to support the concept of a data source.
  • We could not tell if a port value was a file except by using File.exists().
  • Would like to see events be external for complete separation from the workflow engine.

  13. Out of (Current) Scope
  • Dynamically changing and continuing workflows (i.e., evolving workflows).
  • Pointing back to provenance on actors. A real system would do this, and the actors themselves would have global ids that could be referenced.
  • Capturing provenance on workflow descriptors and pointing back to them (same as above for actors).
  • Use of LSIDs. We have the service running but never got to the point of inserting it; instead we used a URL name-generation service.
  • Signing results.

  14. How data was generated
  • User-set parameter values
  • Workflow structure/execution capture
  • Outside tools
  • Auto-generated metadata/content
  The structure of the query
  • 2-phase query (or recursive)
  • Specifying what to include
  • Specifying what to exclude
  What it will be used for
  • Exploratory analysis
  • Directed query to answer a specific question
  • Debugging
  • Verification
  • Comparison
  • Brainstorming
  • Categorization
