
Harvard’s PASS Takes on The Provenance Challenge

Harvard’s PASS Takes on The Provenance Challenge. September 13, 2006. Margo Seltzer, Harvard University Division of Engineering and Applied Sciences.




Presentation Transcript


  1. Harvard’s PASS Takes on The Provenance Challenge September 13, 2006 Margo Seltzer Harvard University Division of Engineering and Applied Sciences

  2. Reminder: What is PASS?
  • Storage systems (e.g., file systems) in which provenance is a first-class entity.
  • Provenance:
    • is generated and maintained as transparently as possible.
    • can be indexed and queried.
    • will be created from objects imported from non-PASS sources.
    • is maintained in the presence of deletes, copies, renames, etc.

  3. Collecting Provenance
  [Slide diagram: a kernel-level view of provenance collection. A shell session (% sort a > b) issues fork, exec "sort a", open a (R), open b (W), read a, write b, close a, close b; the kernel's inode cache records provenance (input=sort, input=a) together with process state from the task_struct — name, argv, environment, loaded modules, kernel version — before flushing to the file system.]

  4. Things to Keep in Mind
  • Our focus is provenance collection, not query.
  • We collect provenance of everything.
  • Provenance collection is done in the operating system.
  • Queries are simply queries against the database maintained by the kernel.
  • Our kernel database is Berkeley DB.

  5. Results Summary
  • Workflow: we ran the shell script
    • Dropped in all the programs and simply ran them on Linux.
    • Chose not to run the slicer, because its license worried us.
  • Query: the command-line query tool nq
    • Successfully ran all queries.
    • Generated a lot more output than you really want.
    • Our strategy is to keep everything and provide pruning to let users see what they want.

  6. Query Tool: NQ
  • General form: nq [SELECTION] SEARCH [FILTER] OUTPUT-TYPE
  • SELECTION: select FIELD … from
  • FIELD: FIELD-NAME, concat(FIELD-NAME), $ANNOTATION, nameof(FIELD-NAME), typeof(FIELD-NAME)
  • SEARCH: ancestors FILE*, descendents FILE*, everything
  • FILTER: depth NUM, anchor EXPR, hide TYPE, where EXPR
  • OUTPUT-TYPE: report, report html, table
  • EXPR: existing, nonexisting, EXPR op EXPR
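The multi-stage queries on the later slides splice one query's output into the braces of the next, which hinges on a shell quoting idiom: close the single-quoted query string, expand the variable inside double quotes, then reopen the single quotes. A minimal sketch of that pattern, with a literal ident list standing in for nq output (nq is PASS-specific and not assumed to be installed; the ident values are illustrative):

```shell
# Pretend output of a first-phase query such as:
#   IDENTS=`nq 'select ident from everything where ... table'`
IDENTS="922.2 923.0"

# '...'"$IDENTS"'...' — leave single quotes, expand the variable
# in double quotes, re-enter single quotes. The braces end up
# containing the expanded ident list, as the second-phase nq
# query expects.
QUERY='select name from descendents { '"$IDENTS"' } where type == "passfile" table'
echo "$QUERY"
```

Everything around the splice stays single-quoted so that nq's own `$ANNOTATION` syntax and `*` globs reach the tool unexpanded.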

  7. Q1: Provenance of Graphic X
  • Query: nq 'ancestors atlas-x.gif report'
  • Output (excerpt):
    922.0 [passfile; challenge/atlas-x.gif] version 1
      type: passfile
      name: challenge/atlas-x.gif
      input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
      annotation: dim=x
      annotation: run=base
      annotation: studyModality=mindreading
  • And 4806 other objects…
  • Results: QUERIES\q1.html

  8. Q2: Q1 excluding prior to softmean
  • Query: nq 'ancestors atlas-x.gif anchor (type == "proc" && name == "AIR5.2.5/bin/softmean") report'
  • Result: essentially a subset of Q1
  • “only” 148 objects identified

  9. Q3: Q2 w/stages • We did not create annotations to map to stages, so this query degenerates to the same one as Query 2.

  10. Q4: align_warp w/specific parameter values
  • Query: nq 'everything where basename == "align_warp" && concat(argv) ~ "*-m 12*" && freezetime ~ "*Mon*" report'
  • Results:
    • We did our run on Monday
    • Returns 8 instances:
      • Four from the main workflow
      • Four from the variant workflow used in Query 7

  11. Q5: images with max=4095
  • Two alternate approaches:
  • Three-phase solution:
    • Create a list of header files that are ancestors of align_warp
    • Pass the list of files to scanheader; grep for max=4095
    • Find all the descendents of those headers
  • Annotation approach:
    • Run scanheader on all headers
    • Store the scanheader results as annotations
    • Query on the annotations
  • We used the first approach
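The three-phase plan above is an ordinary shell pipeline: query the provenance store, filter outside it, then query again. A minimal sketch of the middle (filtering) phase, with grep over throwaway text files playing the role of the challenge's scanheader tool (the file names and header contents here are illustrative, not from the actual run):

```shell
# Phase 2: given the header files from phase 1, keep only those
# whose metadata contains max=4095. In the real run scanheader
# prints each header's fields; grep over stand-in files mimics that.

tmpdir=$(mktemp -d)
printf 'max=4095\n' > "$tmpdir/anatomy1.hdr"   # would match
printf 'max=255\n'  > "$tmpdir/anatomy2.hdr"   # would not

MATCHES=""
for hdr in "$tmpdir"/*.hdr; do
    if grep -q 'max=4095' "$hdr"; then
        MATCHES="$MATCHES $(basename "$hdr")"
    fi
done
echo "headers with max=4095:$MATCHES"
rm -r "$tmpdir"
```

Phase 3 then hands the surviving names to a `descendents { … }` query, as shown on the next slide.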

  12. Q5 Continued
  • Create the list of files to query:
    ALIGN_WARPS=`$NQ $NQOPTS 'select ident from everything where type == "proc" && basename == "align_warp" table'`
    $NQ $NQOPTS 'select name from ancestors { '"$ALIGN_WARPS"' } depth where basename ~ "*.hdr" table'
  • Call scanheader on everything returned above, selecting those files where max=4095

  13. Q5 Continued
  • Query on the list returned above:
    nq 'descendents { anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr } where basename ~ "atlas*.gif" || basename ~ "atlas*.jpg" report'
  • Results

  14. Q6: images produced by softmean with a particular align_warp parameter
  • Three-stage query:
  • Find align_warp processes:
    ALIGN_WARPS=`nq 'select ident from everything where type == "proc" && basename == "align_warp" && concat(argv) ~ "*-m 12*" table'`
  • Find the appropriate softmean processes:
    SOFTMEANS=`nq 'select ident from descendents { '"$ALIGN_WARPS"' } where type == "proc" && basename == "softmean" table'`
  • Find the images produced by those softmean processes:
    nq 'select name from descendents { '"$SOFTMEANS"' } depth 1 where type == "passfile" && basename ~ "*.img" report'
  • Results:
    940.0 [passfile; challenge/q7/atlas.img] version 1
      name: challenge/q7/atlas.img
    917.0 [passfile; challenge/atlas.img] version 1
      name: challenge/atlas.img

  15. Q7: Difference between original and new workflow
  • We use standard diff of the textual output:
    nq 'ancestors atlas-x.gif report' > q7-a.tmp
    nq 'ancestors q7/atlas-x.jpg report' > q7-b.tmp
    diff -u q7-a.tmp q7-b.tmp
  • Result:
    -922.0 [passfile; challenge/atlas-x.gif] version 1
    +945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
      type: passfile
    - name: challenge/atlas-x.gif
    - input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
    + name: challenge/q7/atlas-x.jpg
    + input: 945.2 [proc; pid 2961; /usr/bin/pnmtojpeg] version 0
      annotation: dim=x
    - annotation: run=base
    - annotation: studyModality=mindreading
    + annotation: run=q7
    + annotation: studyModality=visual

  16. Q8: Find UChicago align_warp outputs
  • Three-stage query:
  • Find everything annotated with UChicago:
    INPUTS=`nq 'select ident from everything where $center == "UChicago" table'`
  • Find those UChicago objects that are the result of align_warp:
    WARPS=`nq 'select ident from descendents { '"$INPUTS"' } depth 1 where type == "proc" && basename == "align_warp" table'`
  • Now, find all the outputs of those processes:
    nq 'descendents { '"$WARPS"' } anchor type == "passfile" where type == "passfile" report'

  17. Q8 Continued
  • Results:
    930.0 [passfile; challenge/q7/warp3.warp] version 1
      type: passfile
      name: challenge/q7/warp3.warp
    929.0 [passfile; challenge/q7/warp2.warp] version 1
      type: passfile
      name: challenge/q7/warp2.warp
    907.0 [passfile; challenge/warp3.warp] version 1
      type: passfile
      name: challenge/warp3.warp
    906.0 [passfile; challenge/warp2.warp] version 1
      type: passfile
      name: challenge/warp2.warp

  18. Q9: Find user annotations for objects where some annotations have a given value
  • Setup:
    • We added annotations to all six output images.
    • We annotated one set of outputs with modality visual and the other with modality mind-reading.
  • Query:
    nq 'select annotations from everything where (basename ~ "atlas*.gif" || basename ~ "atlas*.jpg") && ($studyModality == "speech" || $studyModality == "visual" || $studyModality == "audio") report'

  19. Q9 Continued
  • Results:
    947.0 [passfile; challenge/q7/atlas-z.jpg] version 1
      annotation: dim=z
      annotation: run=q7
      annotation: studyModality=visual
    946.0 [passfile; challenge/q7/atlas-y.jpg] version 1
      annotation: dim=y
      annotation: run=q7
      annotation: studyModality=visual
    945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
      annotation: dim=x
      annotation: run=q7
      annotation: studyModality=visual

  20. Conclusions/Observations
  • We have the data.
  • We are not UI people.
  • Output is remarkably complete.
    • Sometimes that makes it difficult to extract the information you want.
  • Output is BIG if you ask for everything, but …
    • … you can ask for everything and get it.
