Enhancing Data Provenance in Research with lbsh: Improving Reproducibility

lbsh: Breadcrumbs as You Work • Eric Osterweil

Problem • Measurement studies, simulations, and many other investigations require a lot of data work • Data processing (or experimentation) can be ad-hoc • Given some raw data (measurements, observations, etc.) we often “try a number of things” • Simulations are often tweaked and re-run numerous times • In this mode, experiments can recursively lead to subsequent experiments • How can a researcher always remember the exact provenance of their results?

What is Data Provenance? • The concept of data provenance is a lot like “chain of custody” • More specifically, we borrow a definition from [1]: • “… We defined data provenance as information that helps determine the derivation history of a data product, starting from its original sources.”

Why Does Provenance Matter? • Do we need to be able to remember exactly how we got results? • Setting: • Student: does a lot of processing, gets “compelling” results • Advisor: wants to re-run with new data • Student: panics (silently of course) • Reviewers: hope that results are reproducible

Sharing Work With Others? • How many people have had to re-implement someone else’s algos for a paper? • How about getting a tarball of crufty scripts from an author and trying to get them to work? • What if you could get a tarball that was totally self-descriptive? • What if the tarball could totally describe the work that led to that user’s results? • What if it could allow you to re-run the whole thing?

Example
• sort data2.out > sorted-data2.out
• awk '{print $1 "\t" $5}' sorted-data2.out > dook
• sort data3.out > sorted-data3.out
• join -1 1 -2 1 dook sorted-data3.out > blah
• vi script.sh
• script.sh dook
• awk '{tot+=$1*$1}END{print tot}' blah > day1.out
• vi blah.pl
• blah.pl data1.out data2.out > day2.out
• sort data1.out > sorted-data1.out
• join -1 1 -2 1 dook sorted-data1.out > blah
• vi blindluck.awk
• blindluck.awk blah > day3.out

Results? • What if it turns out that “day3.out” has the results I wanted? • Can anyone recall what the commands were for that? • Were any of the files overwritten?

Outline • Inspiration • Goal • lbsh (Pound-Shell) • Usage • Contribution • Future

Inspiration • Computer scientists cannot be the first group to have this problem • In fact, we’re not • Science is predicated on reproducibility, so how do (for example) biologists deal with this? • They have lab-books, and they take notes

Can We Do the Same? • A biologist may make a few notes and then spend several days conducting experiments • Conversely, we process data as fast as we can type, and block on I/O occasionally • Note taking is a small task in proportion to a biologist’s experiment • Note taking is a large task in proportion to our fast-fingers • Even then, a lab-book can look like a dictionary (too full of noise to use)

What Else Do People Do? • Scientific Workflow • Design experiments in workflow environments • Lets each experiment be re-run and transparent • Lower level of noise • Of course, users must do all work in a foreign, and oftentimes restrictive, environment

Observation • We can’t always (ever?) know what experiments will be fruitful before we run them • So, we may not want to set up a large experiment and design a workflow every time we try something • Corollary: We may not realize our results are good until some time after we first examine them

What Holds Us Back? • A lack of motivation? • Shouldn’t a solution: • Be easy • Support automation that makes it worth doing • Why bother if it isn’t directly useful?

Goal • What we really want is to know how “day3.out” was generated because: • We need to be sure we did it right • We need to be able to show our collaborators that we aren’t smoking crack • We often want to re-run our analysis with new data • More? Let’s stop here for now…

How COULD We Do This? • Keep a manual lab-book file of all commands run • This is feasible, but very prone to both bloat and stale/missing/mistaken info • It’s a very manual process and a pain. You can’t copy-and-paste w/o stripping the prompts, etc. • Look at the history file • Multiple shells will cause holes in the history • What about commands issued in: R, gnuplot, etc? • An ideal solution… • Automatic, just specify start and stop points. • Wasted experiments are not a factor


lbsh (Pound-Shell) • Let’s provide lab-book support on the command line! • While typing we should be able to just “start an experiment”, do some work, and then “stop” it • In addition, we should keep track of what files were accessed and modified during this • Goal: provide provenance for files based on lab-book entries

Level-Setting • lbsh is in alpha • The code works well, but there are certainly bugs • The features that are there are a starting point • Feedback is welcome • Tell me about bugs, tell me what you like, tell me what you dislike, etc. • The page is hosted here, with links to SourceForge for bug tracking and feature requests: http://lbsh.cs.ucla.edu/

How Does it Work? • lbsh is a monitor that spawns a worker shell and passes commands to it • When a user “starts an experiment” lbsh starts recording • The experiments are entered as separate lab-book entries
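The monitor idea can be sketched in a few lines of shell. This is only an illustration of the concept, not lbsh's code: the function names (exp_start, exp_stop, run), the LABBOOK variable, and the lab-book file format are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical sketch of a "monitor" that hands commands to a worker shell.
# None of these names come from lbsh itself.

LABBOOK=${LABBOOK:-./labbook.txt}   # assumed location of the lab-book file
RECORDING=0                         # 1 while an experiment is open

# Open a new lab-book entry and start recording.
exp_start() {
    RECORDING=1
    printf '# experiment start: %s\n' "$1" >> "$LABBOOK"
}

# Close the current lab-book entry and stop recording.
exp_stop() {
    RECORDING=0
    printf '# experiment stop\n' >> "$LABBOOK"
}

# Pass one command line to a worker shell; record it first
# if (and only if) an experiment is currently open.
run() {
    if [ "$RECORDING" -eq 1 ]; then
        printf '%s\n' "$1" >> "$LABBOOK"
    fi
    sh -c "$1"
}
```

In this sketch, commands issued through run outside an exp_start / exp_stop pair are executed but leave no trace in the lab-book, which mirrors the "just specify start and stop points" goal.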

Specifically… • lbsh uses a user config file ($HOME/.lbshrc) • Records commands (even in R, etc.) • Stats files in a user-specified directory (atime/mtime) • Can repeat experiments • Is able to avoid repeating editor sessions (vi, emacs, etc.) • Can report the experimental provenance of individual files • i.e. “How did I get ‘day3.out’?”
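One plausible way to notice which files an experiment touched is to compare modification times against a marker created when the experiment starts. The sketch below assumes that approach; the function names and the MARKER variable are hypothetical, and lbsh's actual bookkeeping may differ. It only looks at mtime, since whether atime is updated at all depends on how the filesystem is mounted.

```shell
#!/bin/sh
# Hypothetical sketch: find files modified since an experiment started.
# Not lbsh's implementation; names are invented for illustration.

# snapshot_start: remember "now" by creating a marker file
# (mktemp places it outside the watched directory).
snapshot_start() {
    MARKER=$(mktemp) || return 1
}

# modified_since_start DIR: list regular files under DIR whose
# mtime is strictly newer than the marker.
modified_since_start() {
    find "$1" -type f -newer "$MARKER" | sort
}
```

A session would call snapshot_start when the experiment begins and modified_since_start on the user-specified directory when it ends, then attach the resulting file list to the lab-book entry.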

Usage • To use lbsh, just launch it • To start/stop an experiment: • ctrl-b • To tell if lbsh is running, or if an experiment is running: • lbshrunning.sh -v • exprunning.sh -v • To find a file’s provenance: • file-provenance.pl • To re-run an old experiment: • exeggutor.pl <experiment ID>
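To make the re-run step concrete, here is a minimal sketch of replaying a recorded experiment while skipping editor sessions. It is not exeggutor.pl: the replay function and the log format (one command per line, lines starting with # treated as comments) are assumptions for this example.

```shell
#!/bin/sh
# Hypothetical sketch of replaying a lab-book entry.
# The log format is assumed, not taken from lbsh.

# replay LOG: re-run each recorded command in order.
replay() {
    while IFS= read -r line; do
        case $line in
            ''|'#'*)
                continue ;;                       # blank lines and comments
            vi\ *|vim\ *|emacs\ *)
                # Editor sessions are interactive; skip them on replay.
                printf 'skipping editor session: %s\n' "$line" >&2
                continue ;;
        esac
        # Detach stdin so the command cannot consume the rest of the log.
        sh -c "$line" </dev/null || return 1      # stop at the first failure
    done < "$1"
}
```

Skipping the editor only reproduces the original results if the edited scripts are still in the state the experiment left them in, which is one reason tracking file modification times alongside the commands matters.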

Revisiting Example
• sort data2.out > sorted-data2.out
• awk '{print $1 "\t" $5}' sorted-data2.out > dook
• sort data3.out > sorted-data3.out
• join -1 1 -2 1 dook sorted-data3.out > blah
• vi script.sh
• script.sh dook
• awk '{tot+=$1*$1}END{print tot}' blah > day1.out
• vi blah.pl
• blah.pl data1.out data2.out > day2.out
• sort data1.out > sorted-data1.out
• join -1 1 -2 1 dook sorted-data1.out > blah
• vi blindluck.awk
• blindluck.awk blah > day3.out
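To see what answering “How did I get ‘day3.out’?” involves for this example, the sketch below walks backwards through a recorded command list. It is a deliberately naive stand-in, not lbsh's method: it assumes a hypothetical log with one command per line in execution order, and it assumes every output is written with a trailing "> file" redirection. The provenance function name is invented for this example.

```shell
#!/bin/sh
# Hypothetical sketch: trace a file's derivation from a command log.
# Assumes each output is produced by a command ending in "> file".

# provenance LOG TARGET: print, in order, the commands that led to TARGET.
provenance() {
    awk -v target="$2" '
    { cmd[NR] = $0 }

    # Mark the last command before line "before" that redirected into
    # "file", then trace every word of that command the same way.
    function trace(file, before,    i, n, k, words) {
        for (i = before - 1; i >= 1; i--) {
            n = split(cmd[i], words, /[ \t]+/)
            if (n >= 3 && words[n-1] == ">" && words[n] == file) {
                if (needed[i]) return      # already traced
                needed[i] = 1
                for (k = 1; k <= n - 2; k++) trace(words[k], i)
                return
            }
        }
    }

    END {
        trace(target, NR + 1)
        for (i = 1; i <= NR; i++) if (needed[i]) print cmd[i]
    }' "$1"
}
```

Run against the thirteen commands above with day3.out as the target, this selects the sorts of data2.out and data1.out, the awk step that builds dook, the second join, and the final blindluck.awk step. It skips the first join, whose output in blah was later overwritten. It also shows the limits of parsing commands alone: nothing in the log says what the vi session changed in blindluck.awk, which is why recording file modifications alongside commands is useful.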

Real Experiments • This example is too simple to be interesting • Though simple is good • Let’s see the result of some real usage from a paper submission:

Contribution • What we want is to make reproducibility a foregone conclusion, not a pipe dream • Can we do it? • lbsh is a simple tool that is NOT fool-proof • Evidence: I’ve already found ways to trick it • lbsh is just a useful tool that makes it easier for each of us to be more diligent • What lbsh really contributes is: • An automation framework for us to be more efficient, and more secure in our work (reproducing data, etc.) • An enabling technology for us to do better

Future • In addition to tending our own farm, can we build on someone else’s work now? • Ex: IMC requires datasets to be made public to be considered for best-paper • From public data, can I automatically see how someone got their results and try to do follow-on work? • Feature requests: • SVN support: version control some files • File cleanup • Fix NFS support

http://lbsh.cs.ucla.edu/

References [1] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34(3):31–36, 2005.

Thanks! Questions? Ideas?
