UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems

Was Derived From. UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems. Artem Chebotko. Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly. Department of Computer Science, University of Texas - Pan American.


Presentation Transcript


  1. Was Derived From • UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems • Artem Chebotko • Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly • Department of Computer Science, University of Texas - Pan American • 6th IEEE International Workshop on Scientific Workflows, June 24, 2012

  2. Provenance in eScience • Metadata that captures history of an experiment • Problem diagnosis • Result interpretation • Experiment reproducibility • Scientific Workflow Community Provenance Challenges • 2006: understanding and sharing information about provenance representations and capabilities • 2006: interoperability of different provenance • 2009: evaluating various aspects of OPM • 2010: showcase OPM in the context of novel applications • Open Provenance Model • W3C Provenance Working Group UTPB – University of Texas Provenance Benchmark

  3. SWFMS and Provenance • Support provenance collection • Use proprietary or third-party systems to manage provenance • Differ in provenance models, provenance vocabularies, inference support, and query languages • Taverna • Kepler • View • VisTrails • Pegasus • Swift • Galaxy • Triana • OPMProv • Karma • RDFProv • etc.

  4. Provenance Management Requirements • Non-functional • Data storage and querying efficiency and scalability • Inference soundness and completeness • Functional • Support of a particular provenance model, provenance vocabulary, query type, inference feature, visualization and analysis • No standard way to evaluate provenance systems with respect to these requirements

  5. Provenance System Benchmarking Challenges • Well-documented and easy-to-understand datasets • Provenance data in a range of sizes • Provenance data with predefined inferred results that are known to be correct and complete • Test queries • Performance metrics • Result interpretation • Existing empirical studies of provenance systems use ad-hoc benchmarks or benchmarks developed in other research domains (see the paper for details)

  6. Our Contributions • University of Texas Provenance Benchmark (UTPB) • http://faculty.utpa.edu/chebotkoa/utpb/ • Focus on scalability and inference • Flexible data generator • 27 provenance templates • 3 virtual workflows • 3 workflow execution scenarios • 3 provenance vocabularies • 27 test queries in 11 categories • 5 performance metrics

  7. Talk Outline • University of Texas Provenance Benchmark • UTPB Architecture • Provenance Templates • Provenance Generation • UTPB Queries • Performance Metrics • Interpretation of Benchmark Results • Summary and Future work

  8. UTPB Architecture

  9. UTPB Architecture

  10. Provenance Templates

  11. Provenance Templates • A provenance template is a document that serializes provenance of one workflow execution according to a particular provenance model and a provenance vocabulary. • Provenance templates make the benchmark extensible and thus adaptable to the changing requirements of the field. • UTPB currently supports: • 1 provenance model (OPM) • 3 virtual workflows • 3 provenance vocabularies (OPMV, OPMO, OPMX) • 3 workflow execution scenarios • 1 x 3 x 3 x 3 = 27 provenance templates
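
The template count on this slide is a Cartesian product of the four supported dimensions. The sketch below only illustrates that arithmetic; the list names and the `enumerate_templates` helper are hypothetical and are not part of the UTPB generator, and the workflow and scenario labels are paraphrased from the slides.

```python
from itertools import product

# Dimension values taken from the slides; the variable names are illustrative.
MODELS = ["OPM"]
WORKFLOWS = ["Database Experiment", "Jeans Manufacturing", "French Press Coffee"]
VOCABULARIES = ["OPMV", "OPMO", "OPMX"]
SCENARIOS = [
    "successful execution",
    "incomplete execution with an error",
    "successful execution with materialized inferences",
]


def enumerate_templates():
    """Return every (model, workflow, vocabulary, scenario) combination."""
    return list(product(MODELS, WORKFLOWS, VOCABULARIES, SCENARIOS))


templates = enumerate_templates()
print(len(templates))  # 1 x 3 x 3 x 3 = 27
```

Adding a second provenance model to `MODELS` would double the count, which is one way to read the extensibility claim above.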

  12. Virtual Workflow 1 • Database Experiment • Processes: 7 • Artifacts: 14 • Accounts: 2 • Agents: 1

  13. Virtual Workflow 2 • Jeans Manufacturing • Processes: 13 • Artifacts: 18 • Accounts: 3 • Agents: 2 • Several processes use and generate the same artifacts and are “executed” in parallel

  14. Virtual Workflow 3 • French Press Coffee • Processes: 15 • Artifacts: 15 • Accounts: 4 • Agents: 0 • Several branches with multiple processes are “executed” in parallel • Several processes trigger each other without the record of using or generating artifacts

  15. Provenance Vocabularies • Almost every existing scientific workflow management system defines its own proprietary model for provenance • Each model is serialized in some format, such as RDF, XML, or relational data, according to one or more predefined vocabularies or schemas. • Open Provenance Model (OPM) – a layer of interoperability • OPM Vocabulary • OPM Ontology • OPM XML Schema

  16. Workflow Execution Scenarios • successful execution • incomplete execution with an error • successful execution with materialized provenance inferences

  17. Provenance Generation

  18. Provenance Generation

  19. Provenance Generation

  20. Provenance Generation (OPMV)
     # Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
     @prefix opmv: <http://purl.org/net/opmv/ns#> .
     @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
     @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
     @prefix utpb: <http://cs.panam.edu/utpb#> .
     utpb:account_black_C0_T0 rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> .
     utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
     utpb:denim_C0_T0 rdfs:label "blue" .
     utpb:andrey_C0_T0 rdf:type opmv:Agent .
     utpb:cutDenim_C0_T0 opmv:used utpb:cuttingMachine_C0_T0, utpb:cuttingPattern_C0_T0, utpb:denim_C0_T0 .
     utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0 .
     # Default graph
     <http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .

  21. Provenance Generation (OPMO)
     # Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
     @prefix opmo: <http://openprovenance.org/model/opmo#> .
     @prefix opmv: <http://purl.org/net/opmv/ns#> .
     @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
     @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
     @prefix owl: <http://www.w3.org/2002/07/owl#> .
     @prefix utpb: <http://cs.panam.edu/utpb#> .
     utpb:account_black_C0_T0 rdf:type opmo:Account .
     utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
     utpb:propertyDenim_C0_T0 opmo:key utpb:keyDenimType_C0_T0 ;
         opmo:value "blue" .
     utpb:andrey_C0_T0 rdf:type opmv:Agent .
     utpb:used1_C0_T0 rdf:type opmo:Used ;
         opmo:effect utpb:cutDenim_C0_T0 ;
         opmo:cause utpb:cuttingMachine_C0_T0 ;
         opmo:role utpb:roleMachine_C0_T0 ;
         opmo:pname utpb:_used1 ;
         opmo:account utpb:account_black_C0_T0 .
     utpb:wgb1_C0_T0 rdf:type opmo:WasGeneratedBy ;
         opmo:effect utpb:cutDenim_C0_T0 ;
         opmo:cause utpb:denimParts_C0_T0 ;
         opmo:role utpb:roleDenim_C0_T0 ;
         opmo:pname utpb:_wgb1 ;
         opmo:account utpb:account_black_C0_T0 .
     # Default graph
     <http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .

  22. Provenance Generation (OPMX)
     <utpb xmlns="http://openprovenance.org/model/opmx#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
       <dictionary> <opmGraph id="opmGraph_C0_T0"> </dictionary>
       <opmGraph id="opmGraph_C0_T0">
         <accounts>
           <account id="account_black"/>
         </accounts>
         <artifacts>
           <artifact id="cuttingMachine">
             <account ref="account_black"/>
             <annotation>
               <property key="value"> <value>laser</value></property>
               <property key="label"> <value>Cutting machine</value></property>
             </annotation>
           </artifact>
         </artifacts>
         <agents>
           <agent id="andrey"><account ref="account_black"/></agent>
         </agents>
         <dependencies>
           <used id="used1">
             <effect ref="cutDenim"/>
             <role id="roleMachine1" value="machine"/>
             <cause ref="cuttingMachine"/>
             <account ref="account_black"/>
           </used>

  23. UTPB Queries

  24. UTPB Queries • 27 Queries • 11 Categories • Graphs • Dependencies • Artifacts • Processes • Accounts • Agents • Roles • Values • Cross-Graph Queries • Inferences • Application-Specific

  25. UTPB Queries

  26. UTPB Queries

  27. UTPB Queries

  28. UTPB Queries (OPMV)
     effectArtifact               causeArtifact
     ---------------------------------------------
     utpb:denimParts_C0_T0        utpb:denim_C0_T0
     utpb:rawJeans_C0_T0          utpb:denimParts_C0_T0
     utpb:rawJeans_C0_T0          utpb:sewingThread_C0_T0
     utpb:washedJeans_C0_T0       utpb:rawJeans_C0_T0
     utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
     utpb:buttonedJeans_C0_T0     utpb:inspectedJeans_C0_T0
     utpb:buttonedJeans_C0_T0     utpb:buttons_C0_T0
     utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0
     utpb:jeans_C0_T0             utpb:qualityJeans_C0_T0
     utpb:jeans_C0_T0             utpb:labels_C0_T0
     utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
     utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0

  29. UTPB Queries (OPMO)
     effectArtifact               causeArtifact
     ---------------------------------------------
     utpb:denimParts_C0_T0        utpb:denim_C0_T0
     utpb:rawJeans_C0_T0          utpb:denimParts_C0_T0
     utpb:rawJeans_C0_T0          utpb:sewingThread_C0_T0
     utpb:washedJeans_C0_T0       utpb:rawJeans_C0_T0
     utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
     utpb:buttonedJeans_C0_T0     utpb:inspectedJeans_C0_T0
     utpb:buttonedJeans_C0_T0     utpb:buttons_C0_T0
     utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0
     utpb:jeans_C0_T0             utpb:qualityJeans_C0_T0
     utpb:jeans_C0_T0             utpb:labels_C0_T0
     utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
     utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0
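
The result tables above list direct artifact-to-artifact dependencies. The slides do not say which inference rules the "Inferences" query category exercises, so the following is only a sketch of one inference that OPM does define: multi-step derivation, computed as the transitive closure of direct wasDerivedFrom edges. The function name `multi_step_derivations` is hypothetical, and the sample pairs are the first four rows of the table with the `utpb:` prefix and `_C0_T0` suffix dropped for brevity.

```python
from collections import defaultdict


def multi_step_derivations(direct_pairs):
    """Return all (effect, cause) pairs linked by one or more direct edges."""
    causes_of = defaultdict(set)
    for effect, cause in direct_pairs:
        causes_of[effect].add(cause)

    closure = set()
    for start in list(causes_of):
        # Depth-first walk over everything reachable from `start`.
        stack = list(causes_of[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(causes_of.get(node, ()))
        closure.update((start, cause) for cause in seen)
    return closure


direct = [
    ("denimParts", "denim"),
    ("rawJeans", "denimParts"),
    ("rawJeans", "sewingThread"),
    ("washedJeans", "rawJeans"),
]
inferred = multi_step_derivations(direct)
print(("washedJeans", "denim") in inferred)  # True: three derivation steps
```

A system under test that answers such a query should return the direct pairs plus the inferred ones, which is where predefined, known-correct inferred results become useful.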

  30. UTPB Queries (OPMX)
     <result xmlns="http://openprovenance.org/model/opmx#" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
       <wasDerivedFrom> <effect ref="denimParts_C0_T0"/> <cause ref="denim_C0_T0"/> </wasDerivedFrom>
       <wasDerivedFrom> <effect ref="rawJeans_C0_T0"/> <cause ref="denimParts_C0_T0"/> </wasDerivedFrom>
       <wasDerivedFrom> <effect ref="rawJeans_C0_T0"/> <cause ref="sewingThread_C0_T0"/> </wasDerivedFrom>
       <wasDerivedFrom> <effect ref="washedJeans_C0_T0"/> <cause ref="rawJeans_C0_T0"/> </wasDerivedFrom>
       …
       <wasDerivedFrom> <effect ref="inspectedJeans_C0_T0"/> <cause ref="washedJeans_C0_T0"/> </wasDerivedFrom>
       <wasDerivedFrom> <effect ref="qualityJeans_C0_T0"/> <cause ref="buttonedJeans_C0_T0"/> </wasDerivedFrom>
     </result>
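
To compare an OPMX answer with the tabular OPMV and OPMO answers, the XML result has to be reduced to the same (effect, cause) pairs. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; `SAMPLE_RESULT` is a shortened copy of the slide's result and `derivation_pairs` is a hypothetical helper, not part of UTPB.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the slide; the "opmx" prefix is chosen here.
OPMX_NS = {"opmx": "http://openprovenance.org/model/opmx#"}

SAMPLE_RESULT = """<result xmlns="http://openprovenance.org/model/opmx#">
  <wasDerivedFrom>
    <effect ref="denimParts_C0_T0"/>
    <cause ref="denim_C0_T0"/>
  </wasDerivedFrom>
  <wasDerivedFrom>
    <effect ref="rawJeans_C0_T0"/>
    <cause ref="denimParts_C0_T0"/>
  </wasDerivedFrom>
</result>"""


def derivation_pairs(xml_text):
    """Extract (effect, cause) reference pairs from an OPMX-style result."""
    root = ET.fromstring(xml_text)
    pairs = []
    for edge in root.findall("opmx:wasDerivedFrom", OPMX_NS):
        effect = edge.find("opmx:effect", OPMX_NS).get("ref")
        cause = edge.find("opmx:cause", OPMX_NS).get("ref")
        pairs.append((effect, cause))
    return pairs


print(derivation_pairs(SAMPLE_RESULT))
```

Note that the OPMX identifiers carry no `utpb:` prefix, so a cross-vocabulary comparison would need to normalize names first.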

  31. Performance Metrics

  32. Performance Metrics • Data loading time • Repository size • Query response time • Query soundness • Query completeness
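
The slides name the five metrics without giving formulas. As a sketch under that caveat, soundness can be read as the fraction of returned answers that are correct and completeness as the fraction of expected answers that were returned; these ratio definitions, the function names, and the sample data are assumptions for illustration. The `timed` helper shows one way to measure loading or response time around any callable.

```python
import time


def soundness(returned, expected):
    """Fraction of returned answers that appear in the expected answer set."""
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0  # nothing returned, so nothing incorrect was returned
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Fraction of expected answers that were actually returned."""
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0  # nothing was expected, so nothing is missing
    return len(returned & expected) / len(expected)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


expected = {("rawJeans", "denimParts"), ("rawJeans", "sewingThread"),
            ("washedJeans", "rawJeans"), ("denimParts", "denim")}
returned = {("rawJeans", "denimParts"), ("washedJeans", "rawJeans"),
            ("jeans", "denim")}

score, seconds = timed(completeness, returned, expected)
print(score)                          # 0.5: two of four expected pairs returned
print(soundness(returned, expected))  # two of three returned pairs are expected
```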

  33. Interpretation of Benchmark Results

  34. Interpretation of Benchmark Results • Comparison across datasets of varying sizes • Comparison using a fixed dataset • Comparison across data serialized with different vocabularies (e.g., OPMV vs. OPMO) • Comparison across data managed using different technologies (e.g., RDF vs. XML) • Comparison across data of different provenance models (e.g., OPM vs. PROV-DM) – in the future

  35. Summary and Future Work

  36. Summary and Future Work • UTPB: A first formal benchmark for scientific workflow provenance management systems • Extensible with new provenance templates • Flexible data generation • Large selection of test queries • Well-defined performance metrics • Future work • Benchmarking existing systems using UTPB • Extending UTPB (functional requirements, PROV-DM, new metrics such as query expressiveness)

  37. THANK YOU! Questions? • UTPB website: • http://faculty.utpa.edu/chebotkoa/utpb/ • My contact information: • Artem Chebotko, Department of Computer Science, University of Texas – Pan American • chebotkoa@utpa.edu • http://www.cs.panam.edu/~artem