
Issues in Automatic Provenance Collection

Issues in Automatic Provenance Collection. May 4, 2006. Margo Seltzer, Harvard University, Division of Engineering and Applied Sciences. Imagine … Every computational object you created had complete provenance. You could identify the source and history of every object you ever received.


Presentation Transcript


  1. Issues in Automatic Provenance Collection. May 4, 2006. Margo Seltzer, Harvard University, Division of Engineering and Applied Sciences.

  2. Imagine … • Every computational object you created had complete provenance. • You could identify the source and history of every object you ever received. • You could query this complete history. • All these features worked regardless of what tools you used.

  3. What is Automatic Provenance Collection? • A system that makes all this happen. • Requires no user intervention. • Provenance collection is the default. • Works seamlessly and unobtrusively while you work using whatever tools you normally use. • Examples • Is this memo based upon confidential data? • Why do these two invocations produce different results?

  4. Provenance-Aware Storage Systems (PASS) • Storage systems (e.g., file systems) in which provenance is a first-class entity. • Provenance: • is generated and maintained as transparently as possible. • can be indexed and queried. • will be created from objects imported from non-PASS sources. • is maintained in the presence of deletes, copies, renames, etc.
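
A question such as "Is this memo based upon confidential data?" amounts to an ancestry query over stored provenance. The following is a minimal sketch, assuming provenance is held as a plain mapping from each object to its direct inputs. The object names and the `PARENTS` table are hypothetical illustrations, not PASS data structures.

```python
from collections import deque

# Hypothetical provenance graph: each object maps to its direct inputs.
PARENTS = {
    "memo.txt": ["editor", "summary.csv"],
    "summary.csv": ["sort", "payroll.db"],
    "editor": [],
    "sort": [],
    "payroll.db": [],
}

def ancestors(obj, parents):
    """Return every object that obj transitively depends on."""
    seen = set()
    queue = deque(parents.get(obj, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(parents.get(node, []))
    return seen

def based_on(obj, tagged, parents):
    """Return the ancestors of obj that carry a given tag (e.g. confidential)."""
    return ancestors(obj, parents) & tagged
```

Calling `based_on("memo.txt", {"payroll.db"}, PARENTS)` returns the confidential ancestors of the memo; an empty set means none were found in the recorded history.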

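  5. Collecting Provenance • Example: % sort a > b • Events seen by the kernel: fork, open b (W), exec “sort a”, open a (R), read a, write b, close a, close b • Process record: name=“sort”, argv=“sort a”, env=“USER…”, modules=“pasta…”, kernel=“Linux…” • Dependencies recorded: sort has input=a; b has input=sort • [Figure: the kernel observes the events and passes provenance through the file cache to the file system; dependency chain a, sort, b.]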
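
The `sort a > b` example on this slide can be sketched as a small event processor. This is an illustration only, assuming events arrive as hypothetical `(pid, op, target)` tuples; the real system observes them inside the kernel, and the tuple format, the pid 42, and the `collect` function are assumptions made for the sketch.

```python
def collect(events):
    """Turn a stream of (pid, op, target) events into dependency edges.

    Returns a set of (child, parent) pairs meaning "child depends on parent".
    """
    names = {}      # pid -> process name, learned from exec
    edges = set()
    for pid, op, target in events:
        if op == "exec":
            names[pid] = target.split()[0]
        elif op == "read":
            # The process now depends on the file it read.
            edges.add((names.get(pid, f"pid{pid}"), target))
        elif op == "write":
            # The file now depends on the process that wrote it.
            edges.add((target, names.get(pid, f"pid{pid}")))
    return edges

# Hypothetical trace of "sort a > b", in the order listed on the slide.
EVENTS = [
    (42, "fork", ""),
    (42, "open", "b"),
    (42, "exec", "sort a"),
    (42, "open", "a"),
    (42, "read", "a"),
    (42, "write", "b"),
    (42, "close", "a"),
    (42, "close", "b"),
]
```

Running `collect(EVENTS)` yields the two edges from the slide: `sort` has input `a`, and `b` has input `sort`.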

  6. Observed vs Disclosed Provenance • Observed provenance: • Extract provenance from stream of events • System does not control events • Disclosed provenance: • Application or user identifies provenance • Provenance disclosed to database • Examples: • User annotations • Provenance-aware applications • Workflow specifications

  7. Challenges in Observed Provenance • Granularity • Versions • Cycles • False provenance • Security

  8. Granularity • Automatic systems track provenance at the granularity that they see (files, tuples, …). • Users think about provenance in coarser, semantically meaningful terms (experiments, projects, workflows). • This mismatch leads to problems: • Users want to know about “gcc 4.0,” not its change history from the beginning of time.

  9. Versions • Provenance + mutable data = versioning. • Consider: • Open A, read A, write A, close A • A’s provenance changed. • We implicitly created a new version. • The provenance system must preserve versions. • Avoiding excessive versions leads to … [Figure: process P reads A, then writes A, producing a new version A’.]
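
The read A, write A sequence can be sketched with a toy store that gives every write a fresh version, so the old version and its provenance are preserved. This is a minimal sketch under the assumption that versions are `(name, number)` pairs; the `VersionedStore` class is hypothetical and makes no attempt to avoid excessive versions.

```python
class VersionedStore:
    """Toy store in which every write creates a new version of the object."""

    def __init__(self):
        self.current = {}    # name -> current version number
        self.parents = {}    # (name, version) -> set of input versions

    def _cur(self, name):
        version = self.current.setdefault(name, 0)
        self.parents.setdefault((name, version), set())
        return (name, version)

    def read(self, name):
        """Return the version that a reader observes."""
        return self._cur(name)

    def write(self, name, inputs):
        """Record a write whose inputs are versions returned by read()."""
        _, old = self._cur(name)
        new = (name, old + 1)
        self.current[name] = old + 1
        self.parents[new] = set(inputs)
        return new
```

For example, `a0 = store.read("A")` followed by `store.write("A", [a0])` produces `("A", 1)` with `("A", 0)` as its input, which corresponds to A and A’ on the slide.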

  10. Cycles • Cannot really happen: A cannot be both B’s parent and B’s child. • Violate causality. • So, how do they happen? • Open A, read A, write A, close A • A is its own parent, unless you create A’. • But what if (read A, write A) is in a loop? • One version per loop iteration? • Ideally, one version for entire loop. • How do you identify the loops?

  11. Cycles (2) • [Figure: process P reads A and writes A, producing A’; P then reads A’, and the resulting version is marked ?’.]

  12. Cycles (3) • The cycles can be arbitrarily complex. • Why do they happen in observed systems and not disclosed systems? • In disclosed systems, the disclosures are made by someone who knows how to do the grouping. • Cycle detection/breaking is automatically doing what the human is doing when s/he decides where and what to disclose. • Our algorithm is not as smart as people.
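
One simple way to keep an observed provenance graph acyclic is to check, before recording a dependency, whether it would make an object its own ancestor, and to start a new version if so. The sketch below is an assumption-laden illustration and not the algorithm the talk refers to: nodes are hypothetical `(name, version)` pairs and `add_dependency` is an invented helper.

```python
def reaches(parents, start, goal):
    """True if goal is start itself or one of its transitive ancestors."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False

def add_dependency(parents, versions, child, parent):
    """Record that child depends on parent, versioning child if needed.

    Nodes are (name, version) pairs. Returns the node that received the edge.
    """
    if reaches(parents, parent, child):
        # The edge would make child its own ancestor: start a new version
        # that descends from the old one instead.
        name, number = child
        versions[name] = max(versions.get(name, 0), number) + 1
        new_child = (name, versions[name])
        parents.setdefault(new_child, set()).add(child)
        child = new_child
    parents.setdefault(child, set()).add(parent)
    return child
```

With P reading A and then writing A, the second call returns `("A", 1)` rather than closing a loop. Inside a read/write loop this naive check creates one version per iteration, which is exactly the cost the slides say a smarter grouping would avoid.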

  13. False Provenance • Recorded provenance that did not affect the output. • Examples: • Many utilities read one or more start-up files, but not all those startup files affect every output. • A workflow might specify an input file that is only sometimes used. • Neither observed nor disclosed systems can avoid this completely.
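
The start-up file case can be illustrated with a toy tracer. This is a sketch under stated assumptions: `run_traced`, the `upper` utility, and the file names are all hypothetical, and the tracer stands in for an observing system that sees reads but not how the data is used.

```python
def run_traced(tool, files):
    """Run tool with a read() callback and return (output, files it read)."""
    touched = set()

    def read(name):
        touched.add(name)
        return files[name]

    return tool(read), touched

def upper(read):
    """Hypothetical utility: reads a start-up file, then ignores it."""
    read("startup.rc")
    return read("input.txt").upper()
```

Tracing `upper` records both `startup.rc` and `input.txt` as inputs, yet changing the contents of `startup.rc` never changes the output. The dependency on the start-up file is therefore false provenance that observation alone cannot rule out.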

  14. Security • Provenance and the data it describes have different security characteristics. • Protecting provenance requires protecting: • Attributes (e.g., command line, environment) • Relationships (e.g., ancestors) • Composition of security is hard. • Unfortunately, it is a requirement.

  15. Conclusions • Automatic collection is useful. • It is also challenging. • There is a ton of interesting research to do.

  16. Questions! • Thanks to: • Network Appliance • IBM Research • The Harvard PASS Team: Uri Braun, Simson Garfinkel, David Holland, Kiran-Kumar Muniswamy-Reddy • Participants in the October, 2005 PASS Workshop • Our users! http://www.eecs.harvard.edu/~margo/syrah
