
Harvard’s PASS Takes on The Provenance Challenge

Harvard’s PASS Takes on The Provenance Challenge. September 13, 2006. Margo Seltzer, Harvard University Division of Engineering and Applied Sciences.




Presentation Transcript


  1. Harvard’s PASS Takes on The Provenance Challenge September 13, 2006 Margo Seltzer Harvard University Division of Engineering and Applied Sciences

  2. Reminder: What is PASS?
  • Storage systems (e.g., file systems) in which provenance is a first-class entity.
  • Provenance:
    • is generated and maintained as transparently as possible.
    • can be indexed and queried.
    • will be created from objects imported from non-PASS sources.
    • is maintained in the presence of deletes, copies, renames, etc.

  3. Collecting Provenance
  [Slide diagram: a kernel-level view of provenance collection. A shell session (% sort a > b) issues fork, exec "sort a", open a (R), open b (W), read a, write b, close a, close b; the kernel's inode cache records provenance (input=sort, input=a) together with process state from the task_struct — name, argv, environment, loaded modules, kernel version — before flushing to the file system.]

  4. Things to Keep in Mind
  • Our focus is provenance collection, not query.
  • We collect provenance of everything.
  • Provenance collection is done in the operating system.
  • Queries are simply queries against the database maintained by the kernel.
  • Our kernel database is Berkeley DB.

  5. Results Summary
  • Workflow: we ran the shell script
    • Dropped in all the programs and simply ran them on Linux.
    • Chose not to run the slicer, because its license worried us.
  • Query: the command-line query tool nq
    • Successfully ran all queries.
    • Generated a lot more output than you really want.
    • Our strategy is to keep everything and provide pruning to let users see what they want.

  6. Query Tool: NQ
  • General form: nq [SELECTION] SEARCH [FILTER] OUTPUT-TYPE
  • SELECTION: select FIELD … from
  • FIELD: FIELD-NAME, concat(FIELD-NAME), $ANNOTATION, nameof(FIELD-NAME), typeof(FIELD-NAME)
  • SEARCH: ancestors FILE*, descendents FILE*, everything
  • FILTER: depth NUM, anchor EXPR, hide TYPE, where EXPR
  • OUTPUT-TYPE: report, report html, table
  • EXPR: existing, nonexisting, EXPR op EXPR
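The multi-stage queries on the later slides splice one query's output into the braces of the next, which hinges on a shell quoting idiom: close the single-quoted query string, expand the variable inside double quotes, then reopen the single quotes. A minimal sketch of that pattern, with a literal ident list standing in for nq output (nq is PASS-specific and not assumed to be installed; the ident values are illustrative):

```shell
# Pretend output of a first-phase query such as:
#   IDENTS=`nq 'select ident from everything where ... table'`
IDENTS="922.2 923.0"

# '...'"$IDENTS"'...' — leave single quotes, expand the variable
# in double quotes, re-enter single quotes. The braces end up
# containing the expanded ident list, as the second-phase nq
# query expects.
QUERY='select name from descendents { '"$IDENTS"' } where type == "passfile" table'
echo "$QUERY"
```

Everything around the splice stays single-quoted so that nq's own `$ANNOTATION` syntax and `*` globs reach the tool unexpanded.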

  7. Q1: Provenance of Graphic X
  • Query: nq 'ancestors atlas-x.gif report'
  • Output (excerpt):
    922.0 [passfile; challenge/atlas-x.gif] version 1
      type: passfile
      name: challenge/atlas-x.gif
      input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
      annotation: dim=x
      annotation: run=base
      annotation: studyModality=mindreading
  • And 4806 other objects…
  • Results: QUERIES\q1.html

  8. Q2: Q1 excluding prior to softmean
  • Query: nq 'ancestors atlas-x.gif anchor (type == "proc" && name == "AIR5.2.5/bin/softmean") report'
  • Result: essentially a subset of Q1
  • “only” 148 objects identified

  9. Q3: Q2 w/stages • We did not create annotations to map to stages, so this query degenerates to the same one as Query 2.

  10. Q4: align_warp w/specific parameter values
  • Query: nq 'everything where basename == "align_warp" && concat(argv) ~ "*-m 12*" && freezetime ~ "*Mon*" report'
  • Results:
    • We did our run on Monday
    • Returns 8 instances:
      • Four from the main workflow
      • Four from the variant workflow used in Query 7

  11. Q5: images with max=4095
  • Two alternate approaches:
  • Three-phase solution:
    • Create a list of header files that are ancestors of align_warp
    • Pass the list of files to scanheader; grep for max=4095
    • Find all the descendents of those headers
  • Annotation approach:
    • Run scanheader on all headers
    • Store the scanheader results as annotations
    • Query on the annotations
  • We used the first approach
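The three-phase plan above is an ordinary shell pipeline: query the provenance store, filter outside it, then query again. A minimal sketch of the middle (filtering) phase, with grep over throwaway text files playing the role of the challenge's scanheader tool (the file names and header contents here are illustrative, not from the actual run):

```shell
# Phase 2: given the header files from phase 1, keep only those
# whose metadata contains max=4095. In the real run scanheader
# prints each header's fields; grep over stand-in files mimics that.

tmpdir=$(mktemp -d)
printf 'max=4095\n' > "$tmpdir/anatomy1.hdr"   # would match
printf 'max=255\n'  > "$tmpdir/anatomy2.hdr"   # would not

MATCHES=""
for hdr in "$tmpdir"/*.hdr; do
    if grep -q 'max=4095' "$hdr"; then
        MATCHES="$MATCHES $(basename "$hdr")"
    fi
done
echo "headers with max=4095:$MATCHES"
rm -r "$tmpdir"
```

Phase 3 then hands the surviving names to a `descendents { … }` query, as shown on the next slide.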

  12. Q5 Continued
  • Create the list of files to query:
    ALIGN_WARPS=`$NQ $NQOPTS 'select ident from everything where type == "proc" && basename == "align_warp" table'`
    $NQ $NQOPTS 'select name from ancestors { '"$ALIGN_WARPS"' } depth where basename ~ "*.hdr" table'
  • Call scanheader on everything returned above, selecting those files where max=4095

  13. Q5 Continued
  • Query on the list returned above:
    nq 'descendents { anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr } where basename ~ "atlas*.gif" || basename ~ "atlas*.jpg" report'
  • Results

  14. Q6: images produced by softmean with a particular align_warp parameter
  • Three-stage query:
  • Find align_warp processes:
    ALIGN_WARPS=`nq 'select ident from everything where type == "proc" && basename == "align_warp" && concat(argv) ~ "*-m 12*" table'`
  • Find the appropriate softmean processes:
    SOFTMEANS=`nq 'select ident from descendents { '"$ALIGN_WARPS"' } where type == "proc" && basename == "softmean" table'`
  • Find the images produced by those softmean processes:
    nq 'select name from descendents { '"$SOFTMEANS"' } depth 1 where type == "passfile" && basename ~ "*.img" report'
  • Results:
    940.0 [passfile; challenge/q7/atlas.img] version 1
      name: challenge/q7/atlas.img
    917.0 [passfile; challenge/atlas.img] version 1
      name: challenge/atlas.img

  15. Q7: Difference between original and new workflow
  • We use standard diff of the textual output:
    nq 'ancestors atlas-x.gif report' > q7-a.tmp
    nq 'ancestors q7/atlas-x.jpg report' > q7-b.tmp
    diff -u q7-a.tmp q7-b.tmp
  • Result:
    -922.0 [passfile; challenge/atlas-x.gif] version 1
    +945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
      type: passfile
    - name: challenge/atlas-x.gif
    - input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
    + name: challenge/q7/atlas-x.jpg
    + input: 945.2 [proc; pid 2961; /usr/bin/pnmtojpeg] version 0
      annotation: dim=x
    - annotation: run=base
    - annotation: studyModality=mindreading
    + annotation: run=q7
    + annotation: studyModality=visual

  16. Q8: Find UChicago align_warp outputs
  • Three-stage query:
  • Find everything annotated with UChicago:
    INPUTS=`nq 'select ident from everything where $center == "UChicago" table'`
  • Find those UChicago objects that are the result of align_warp:
    WARPS=`nq 'select ident from descendents { '"$INPUTS"' } depth 1 where type == "proc" && basename == "align_warp" table'`
  • Now, find all the outputs of those processes:
    nq 'descendents { '"$WARPS"' } anchor type == "passfile" where type == "passfile" report'

  17. Q8 Continued
  • Results:
    930.0 [passfile; challenge/q7/warp3.warp] version 1
      type: passfile
      name: challenge/q7/warp3.warp
    929.0 [passfile; challenge/q7/warp2.warp] version 1
      type: passfile
      name: challenge/q7/warp2.warp
    907.0 [passfile; challenge/warp3.warp] version 1
      type: passfile
      name: challenge/warp3.warp
    906.0 [passfile; challenge/warp2.warp] version 1
      type: passfile
      name: challenge/warp2.warp

  18. Q9: Find user annotations for objects where some annotations have a given value
  • Setup:
    • We added annotations to all six output images.
    • We annotated one set of outputs with modality visual and the other with modality mind-reading.
  • Query:
    nq 'select annotations from everything where (basename ~ "atlas*.gif" || basename ~ "atlas*.jpg") && ($studyModality == "speech" || $studyModality == "visual" || $studyModality == "audio") report'

  19. Q9 Continued
  • Results:
    947.0 [passfile; challenge/q7/atlas-z.jpg] version 1
      annotation: dim=z
      annotation: run=q7
      annotation: studyModality=visual
    946.0 [passfile; challenge/q7/atlas-y.jpg] version 1
      annotation: dim=y
      annotation: run=q7
      annotation: studyModality=visual
    945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
      annotation: dim=x
      annotation: run=q7
      annotation: studyModality=visual

  20. Conclusions/Observations
  • We have the data.
  • We are not UI people.
  • Output is remarkably complete.
    • Sometimes that makes it difficult to extract the information you want.
  • Output is BIG if you ask for everything, but …
    • … you can ask for everything and get it.
