
Applying the Virtual Data Provenance Model

Applying the Virtual Data Provenance Model. IPAW 2006 Yong Zhao, Ian Foster, Michael Wilde 4 May 2006. Virtual Data Origins: The Grid Physics Network. Enhance scientific productivity through… Discovery, application and management of data and processes at all scales





Presentation Transcript


  1. Applying the Virtual Data Provenance Model IPAW 2006 Yong Zhao, Ian Foster, Michael Wilde 4 May 2006

  2. Virtual Data Origins: The Grid Physics Network Enhance scientific productivity through… • Discovery, application and management of data and processes at all scales • Using a worldwide data grid as a scientific workstation The key to this approach is Virtual Data – creating and managing datasets through workflow “recipes” and provenance recording.

  3. The purpose of Virtual Data • Better understanding of data (and tools) • Assess what happened at run-time • Easier to express and execute work • Discover useful workflow patterns • Adapt workflow patterns to new needs

  4. Virtual Data Example: Galaxy Cluster Search (figure panels: DAG; Sloan Data; Galaxy cluster size distribution) Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

  5. How are Workflow and Provenance connected? • Workflow – specifies what to do • Provenance – tracks what was done • Virtual Data integrates these capabilities (diagram labels: What I Want to Do; What I Am Doing; What I Did; Executable; Executing; Executed; Waiting; Schedule; Edit; Query; Execution environment; …)

  6. Temporal aspects of provenance • Prospective provenance • The workflow recipes for how to produce data • Metadata annotating code and data • Retrospective provenance • Run-time records of data production environment: where, how long, how much

  7. Expressing Workflow in VDL • Define a “function” wrapper for an application • Define “formal arguments” for the application • Define a “call” to invoke the application • Provide “actual” argument values for the invocation • Connect applications via output-to-input dependencies (diagram: file1 → grep → file2 → sort → file3)

      TR grep (in a1, out a2) {
        argument stdin = ${a1};
        argument stdout = ${a2};
      }
      TR sort (in a1, out a2) {
        argument stdin = ${a1};
        argument stdout = ${a2};
      }
      DV grep (a1=@{in:file1}, a2=@{out:file2});
      DV sort (a1=@{in:file2}, a2=@{out:file3});
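The output-to-input chaining on this slide can be sketched in Python. This is only an illustration of the idea, not VDS code: the `derivations` dictionary and the `build_dependencies` helper are hypothetical names, and the file lists are copied from the two DV statements above. A derivation depends on whichever derivation produces one of its input files, and a topological sort then yields a valid run order.

```python
from graphlib import TopologicalSorter

# Each derivation (DV) lists the logical files it reads and writes,
# mirroring the grep/sort example on the slide. Hypothetical structure.
derivations = {
    "grep": {"in": ["file1"], "out": ["file2"]},
    "sort": {"in": ["file2"], "out": ["file3"]},
}


def build_dependencies(dvs):
    """Map each derivation to the set of derivations that produce its inputs."""
    producer = {f: name for name, dv in dvs.items() for f in dv["out"]}
    return {
        name: {producer[f] for f in dv["in"] if f in producer}
        for name, dv in dvs.items()
    }


deps = build_dependencies(derivations)
# static_order() lists every node after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Because file1 has no producer, grep has no predecessors and runs first; sort waits for file2.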

  8. Terminology • virtual data • defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation • VDL – Virtual Data Language • A language (text and XML) that defines the functions and function calls of a workflow • VDC – Virtual Data Catalog • The database and schema that store VDL definitions • VDS – Virtual Data System • The tools to define, store, manipulate and execute virtual data workflows and query data provenance

  9. Executing VDL Workflows (diagram) • Stages: Workflow spec → Create Execution Plan → Grid Workflow Execution • Components shown: VDL Program, Virtual Data catalog, Virtual Data Workflow Generator, Abstract workflow, Pegasus Planner, DAGman DAG, DAGman & Condor-G, Job Planner, Job Cleanup (slide note: show world and results in large DAG on right, as animated overlay)

  10. Dimensions of Provenance Data (diagram: Virtual Data Catalog)

  11. Virtual Data Catalog Schema

  12. Run Time Environment and Provenance Collection (diagram) • Stages: Specify Workflow → Create and run DAG → Grid Workflow Execution (on worker nodes) • Components shown: VDL, Virtual Data Workflow Generator, Abstract workflow, Pegasus Planner, DAGman script, DAGman & Condor-G, Virtual Data catalog, Provenance collector • Worker-node chain as drawn: file1, launcher, grep, Provenance data, file2, launcher, sort, Provenance data, file3

  13. Provenance Query Types • Virtual data relationships • Annotations • Lineage • Multi-dimension • Compositional

  14. Context for Query Examples: Functional MRI Analysis Workflow courtesy James Dobson, Dartmouth Brain Imaging Center

  15. Virtual Data Relationships • Simple query by signature: • Show procedures in namespace /pub/bin/std that have inputs of type SubjectImage and outputs of type ThumbNailImage. • Actual arguments and runtime provenance: • Show alignlinear calls (including all arguments), in XML format, with argument model=rigid, and which generated more than 10,000 page faults, on ia64 processors. • Show calls to procedure alignlinear, and their runtimes, with argument model=rigid that ran in less than 30 minutes on non-ia64 processors. • Aggregate query: • Show me the average runtime of all alignlinear calls with argument model=rigid that ran in less than 30 minutes.
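The runtime and aggregate queries above can be sketched as filters over invocation records. The `Invocation` record, its field names and the four sample rows are invented for illustration; they are not the Virtual Data Catalog schema or real provenance data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Invocation:
    """One recorded procedure call (hypothetical record layout)."""
    procedure: str
    args: dict
    runtime_min: float
    arch: str
    page_faults: int


# Invented sample data, not taken from the talk.
records = [
    Invocation("alignlinear", {"model": "rigid"}, 12.0, "ia32", 900),
    Invocation("alignlinear", {"model": "rigid"}, 25.0, "ia64", 15000),
    Invocation("alignlinear", {"model": "affine"}, 8.0, "ia32", 400),
    Invocation("alignlinear", {"model": "rigid"}, 45.0, "ia32", 2000),
]

# alignlinear calls with model=rigid that ran in less than 30 minutes
fast_rigid = [
    r for r in records
    if r.procedure == "alignlinear"
    and r.args.get("model") == "rigid"
    and r.runtime_min < 30
]

# ... restricted to non-ia64 processors
fast_rigid_non_ia64 = [r for r in fast_rigid if r.arch != "ia64"]

# ... on ia64 with more than 10,000 page faults (no runtime limit)
heavy_ia64 = [
    r for r in records
    if r.procedure == "alignlinear"
    and r.args.get("model") == "rigid"
    and r.arch == "ia64"
    and r.page_faults > 10_000
]

# Aggregate query: average runtime of the fast rigid calls
average_runtime = mean(r.runtime_min for r in fast_rigid)
print(average_runtime)
```

Each query combines a prospective fact (the procedure name and its arguments) with a retrospective one (runtime, architecture, page faults), which is the integration these slides argue for.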

  16. Provenance for Large-scale ATLAS Simulation • How much compute time was delivered?

      | years | mon | year |
      +-------+-----+------+
      | .45   | 6   | 2004 |
      | 20    | 7   | 2004 |
      | 34    | 8   | 2004 |
      | 40    | 9   | 2004 |
      | 15    | 10  | 2004 |
      | 15    | 11  | 2004 |
      | 8.9   | 12  | 2004 |
      +-------+-----+------+

      Selected statistics for one of these jobs:
      start: 2004-09-30 18:33:56
      duration: 76103.33
      pid: 6123
      exitcode: 0
      args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
      utime: 75335.86
      stime: 28.88
      minflt: 862341
      majflt: 96386

      • Which Linux kernel releases were used? • How many jobs were run on a Linux 2.4.28 kernel? (Data from work of Robert Gardner and Marco Mambelli, University of Chicago)
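As a worked example, the figures on this slide can be summed and cross-checked in a few lines. Two assumptions are made that the slide does not state: the "years" column is compute time in CPU-years per month, and duration, utime and stime are in seconds.

```python
import math

# Compute time per month of 2004, copied from the slide's table.
# Assumption: the "years" column is measured in CPU-years.
cpu_years_by_month = {6: 0.45, 7: 20, 8: 34, 9: 40, 10: 15, 11: 15, 12: 8.9}
total_cpu_years = math.fsum(cpu_years_by_month.values())

# Statistics for the single job shown on the slide.
# Assumption: all three values are in seconds.
duration_s = 76103.33
utime_s = 75335.86
stime_s = 28.88

# Share of the job's wall-clock time spent on the CPU (user + system).
cpu_fraction = (utime_s + stime_s) / duration_s

print(f"total compute time: {total_cpu_years:.2f} CPU-years")
print(f"job wall time: {duration_s / 3600:.1f} hours")
print(f"CPU time / wall time: {cpu_fraction:.3f}")
```

Under those assumptions the seven months add up to about 133 CPU-years, and the sample job ran for roughly 21 hours while spending about 99% of that time on the CPU.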

  17. Annotation queries • Return annotations for any object type • procedures, calls, arguments, and files • select based on subject, predicate, (object, type) • select a set of virtual data objects: • find all objects (of any type) annotated with predicate p of type t and value v • find objects of a specific type annotated with predicate p of type t and value v; • find objects (one type or any type) annotated by same set of attribute predicates.
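The three selection styles listed above can be sketched over a flat list of annotation tuples. The `Annotation` layout, the helper names and the sample rows are hypothetical; only the file and predicate names echo examples used elsewhere in the talk.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """One annotation row (hypothetical layout, not the VDC schema)."""
    object_kind: str   # "procedure", "call", "argument" or "file"
    object_name: str
    predicate: str
    value_type: str
    value: object


# Invented sample annotations.
annotations = [
    Annotation("file", "3472-3_anonymized.img", "privacy", "string", "anonymized"),
    Annotation("file", "3472-4_anonymized.img", "privacy", "string", "anonymized"),
    Annotation("file", "3472-4_anonymized.img", "subjectType", "string", "young"),
    Annotation("procedure", "align_warp", "dataType", "string", "subject_image"),
]


def find_objects(anns, predicate, value_type, value, object_kind=None):
    """Objects annotated with predicate p of type t and value v.

    With object_kind=None, objects of any type are returned.
    """
    return sorted({
        (a.object_kind, a.object_name)
        for a in anns
        if a.predicate == predicate
        and a.value_type == value_type
        and a.value == value
        and (object_kind is None or a.object_kind == object_kind)
    })


def group_by_predicate_set(anns):
    """Group objects that carry exactly the same set of predicates."""
    predicates = defaultdict(set)
    for a in anns:
        predicates[(a.object_kind, a.object_name)].add(a.predicate)
    groups = defaultdict(list)
    for obj, preds in predicates.items():
        groups[frozenset(preds)].append(obj)
    return dict(groups)


anonymized = find_objects(annotations, "privacy", "string", "anonymized")
groups = group_by_predicate_set(annotations)
print(anonymized)
```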

  18. fMRI Virtual Data Queries Which transformations can process a “subject image”? • Q: xsearchvdc -q tr_meta dataType subject_image input • A: fMRIDC.AIR::align_warp List anonymized subject-images for young subjects: • Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young • A: 3472-4_anonymized.img Show files that were derived from patient image 3472-3: • Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img • A: 3472-3_anonymized.img 3472-3_anonymized.sliced.hdr atlas.hdr atlas.img … atlas_z.jpg 3472-3_anonymized.sliced.img

  19. Annotation Queries • Show all the annotations of [datasets that have metadata annotation studyModality with values speech, visual or audio]. • Show all the developerName annotations of [procedures that accept or produce an argument of type Study with annotation studyModality=audio]

  20. Lineage Queries • Basic lineage graph queries refer to information that has been propagated along derivation relationships: • find datasets derived from dataset d • find ancestor datasets to dataset d that have type t • find datasets that were derived within 2 levels of procedure p • find datasets that are the result of workpattern wp; • find the procedure calls in workflow w whose inputs have been processed by any subgraph matching workpattern wp.
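The first three lineage queries reduce to graph reachability. The sketch below assumes a hypothetical list of calls, each with its input and output datasets; the dataset names and types are invented, and the level limit is measured from a dataset rather than from a procedure, which is a simplification of the third query.

```python
from collections import deque

# (procedure, input datasets, output datasets); invented sample workflow.
calls = [
    ("align_warp", ["raw.img"], ["warped.img"]),
    ("reslice", ["warped.img"], ["sliced.img"]),
    ("softmean", ["sliced.img"], ["atlas.img"]),
    ("slicer", ["atlas.img"], ["atlas_x.jpg"]),
]
dataset_type = {
    "raw.img": "SubjectImage",
    "warped.img": "SubjectImage",
    "sliced.img": "SubjectImage",
    "atlas.img": "ImageAtlas",
    "atlas_x.jpg": "ThumbNailImage",
}

# Derivation edges in both directions.
forward, backward = {}, {}
for _, inputs, outputs in calls:
    for i in inputs:
        for o in outputs:
            forward.setdefault(i, set()).add(o)
            backward.setdefault(o, set()).add(i)


def reachable(edges, start, max_levels=None):
    """Breadth-first search returning {dataset: distance from start}."""
    found = {}
    queue = deque([(start, 0)])
    while queue:
        node, level = queue.popleft()
        if max_levels is not None and level >= max_levels:
            continue
        for nxt in edges.get(node, ()):
            if nxt not in found:
                found[nxt] = level + 1
                queue.append((nxt, level + 1))
    return found


derived = reachable(forward, "raw.img")               # datasets derived from d
near = reachable(forward, "raw.img", max_levels=2)    # within 2 levels
typed_ancestors = sorted(                             # ancestors of d with type t
    d for d in reachable(backward, "atlas.img")
    if dataset_type[d] == "SubjectImage"
)
print(derived, near, typed_ancestors)
```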

  21. Planned approach: Workpatterns • Match graph patterns of transformations, calls, and invocations; fixed or varying numbers of nodes • Match on expressions with argument name, argument values, argument types, and/or annotations • Workpattern query yields set of workflows with subgraphs that match the workpattern • The target search space of a workpattern query can be entire database, or a specific set of workflows selected by a prior search.

  22. Workflow Patterns • Given the workpattern: • align (model=affine) → reslice (axis=x, intensify=3) → softmean • Show me all output datasets of softmean calls that were linear-aligned with model=affine. • I.e., “where softmean was preceded in the workflow, directly or indirectly, by an alignlinear call with argument model=affine” • Show me all output datasets of softmean that were resliced with intensify=3. (Looking for a softmean that is directly preceded by the requested pattern)

  23. Workflow Pattern Searches • align_warp/*/softmean • softmean/slicer*
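The slides do not define the pattern syntax, so the following is a guess at one workable reading rather than the VDS implementation: a path segment that is exactly `*` stands for any number of intermediate procedures, and a `*` inside a name is an ordinary wildcard. The sample `workflow` graph and all function names are hypothetical.

```python
import re

# Procedure-level successor graph of one workflow (invented example).
workflow = {
    "align_warp": ["reslice"],
    "reslice": ["softmean"],
    "softmean": ["slicer"],
    "slicer": ["convert"],
    "convert": [],
}


def pattern_to_regex(pattern):
    """Translate a workpattern into a regex over '/'-terminated paths.

    Assumed semantics: a bare '*' segment matches zero or more whole
    steps; '*' within a name matches any characters except '/'.
    """
    parts = []
    for segment in pattern.split("/"):
        if segment == "*":
            parts.append(r"(?:[^/]+/)*")
        else:
            parts.append(re.escape(segment).replace(r"\*", "[^/]*") + "/")
    return re.compile("".join(parts))


def paths_from(graph, node):
    """Yield every path in the DAG that starts at node."""
    yield [node]
    for nxt in graph.get(node, []):
        for rest in paths_from(graph, nxt):
            yield [node] + rest


def match_workpattern(graph, pattern):
    """Return all paths (subgraph chains) that match the pattern."""
    regex = pattern_to_regex(pattern)
    return [
        path
        for start in graph
        for path in paths_from(graph, start)
        if regex.fullmatch("/".join(path) + "/")
    ]


print(match_workpattern(workflow, "align_warp/*/softmean"))
print(match_workpattern(workflow, "softmean/slicer*"))
```

Enumerating every path is exponential in the worst case, so this only suits small workflows; it is meant to make the matching semantics concrete, not to be efficient.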

  24. Multidimension Provenance Queries • “find transformations with signature S that have been called with arguments V and which match: • an annotation query • the metadata values for a specified set of predicates from a transformation list returned by another query • the minimum, maximum, and average run times of a set of procedure calls matching workpattern wp and annotation query q.”

  25. Multi-dimension queries • Find procedures that take in ImageAtlas and Date, have been called with atlas.std.2005.img, and have annotation QALevel > 5.6 • Display metadata tags study-type on result datasets that were linearly aligned with parameter model=affine and with an input dataset annotated with center set to UofChicago • Show me the output dataset names (and all their metadata tags) that were linearly aligned with model=affine and with input LFN metadata center=UChicago • Show annotations school of output datasets of softmean with values in set {UIUC, UChicago, UIC}. • Show annotations school with values in set {UIUC, UChicago, UIC} of outputs of softmean that were aligned with model=affine (graph relationship)

  26. Modification and composition queries • Change arguments in a set of calls • Change procedures in a set of workflows • Edit subgraphs of a workflow, creating new workflows • Edit metadata throughout a workflow
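The first two edits in this list can be sketched as functions that copy a workflow and return a modified version, leaving the original definition (and hence its provenance) untouched. The list-of-dicts representation and both helper names are hypothetical, chosen only for illustration.

```python
from copy import deepcopy

# A workflow as an ordered list of calls (invented representation).
workflow = [
    {"procedure": "alignlinear", "args": {"model": "rigid"}},
    {"procedure": "reslice", "args": {"axis": "x"}},
    {"procedure": "softmean", "args": {}},
]


def change_arguments(calls, procedure, **new_args):
    """Return a new workflow with arguments updated on matching calls."""
    edited = deepcopy(calls)
    for call in edited:
        if call["procedure"] == procedure:
            call["args"].update(new_args)
    return edited


def change_procedure(calls, old, new):
    """Return a new workflow with one procedure swapped for another."""
    edited = deepcopy(calls)
    for call in edited:
        if call["procedure"] == old:
            call["procedure"] = new
    return edited


affine = change_arguments(workflow, "alignlinear", model="affine")
warped = change_procedure(workflow, "alignlinear", "align_warp")
print(affine[0], warped[0], workflow[0])
```

Returning a copy rather than editing in place means the new workflow can be stored as a separate definition alongside the one it was derived from.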

  27. Workflow Transformation • More accurate alignment tool? • Better rendering algorithm?

  28. Workflow transformation (diagram nodes: warpnslice ×4, render)

  29. Conclusion • Separation of dimensions facilitates schema design and comprehension • Workflow and provenance are inextricably linked • Integration of dimensions in query is powerful • Graph matching and editing paradigms are essential tools in this unified treatment • Efficient implementation of these searches will take time and research • Elegance and usability are key factors in the query language

  30. Acknowledgements …many thanks to the entire OSG Collaboration and our application science partners in ATLAS, CMS, LIGO, SDSS, Dartmouth DBIC and fMRIDC, SCEC, and Argonne’s Computational Biology and Climate Science Groups of the Mathematics and Computer Science Division. The Virtual Data System group is: • ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi • U of Chicago: Catalin Dumitrescu, Ian Foster, Luiz Meyer (UFRJ, Brazil), Doug Scheftner, Jens Voeckler, Mike Wilde, Yong Zhao • www.griphyn.org/vds GriPhyN and iVDGL are supported by the National Science Foundation. Many of the research efforts involved in this work are supported by the US Department of Energy, Office of Science.
