1 / 60

PANDA A System for Provenance and Data

PANDA A System for Provenance and Data. Jennifer Widom — Stanford University. Example: Sales Prediction Workflow. CustList 1. Europe. CustList 2. . . . Union. Predict. Dedup. ItemAgg. Split. CustList n-1. USA. Item Sales. Buying Patterns. Catalog Items. CustList n.

lexiss
Download Presentation

PANDA A System for Provenance and Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. PANDAA System for Provenance and Data Jennifer Widom — Stanford University

  2. Example: Sales Prediction Workflow CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn

  3. Example: Sales Prediction Workflow Backward Tracing CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn ? ? ?

  4. Example: Sales Prediction Workflow Backward Tracing CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA USA Item Sales Buying Patterns Catalog Items CustListn CustListn ?

  5. Example: Sales Prediction Workflow Forward Propagation Backward Tracing CustList1 Europe Europe CustList2 . . . Union Predict Dedup ItemAgg Split Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn CustListn

  6. Provenance Where data came from How it was derived, manipulated, combined, processed, … How it has evolved over time

  7. Uses for Provenance Sources and evolution of data; deeper understanding • Buggy or stale source data? Buggy processing? • Error propagation paths • Auditing Propagate changes to affected “downstream” data Explanation Debugging and Verification Recomputation

  8. Some Application Domains • Sales prediction workflows  • Scientific-data workflows • Including human-curated data • Including evolving versions of data • Any analytic pipeline • “Extract-transform-load” (ETL) processes • Information-extraction pipelines

  9. Third Time’s a Charm 1. Data Warehousing project (long ago) • Lineage of relational views in warehouse: • formal foundations, system/caching issues • Lineage in ETL pipelines: foundations & algorithms 2. “Trio” project (recently) • Data + Uncertainty + Lineage • Lineage primarily in support of uncertainty Isn’tprovenancethe same thing as lineage? Pretty much Haven’t you worked on it before? Yes

  10. Panda’s Ambitions Panda will… Previous provenance work tends to be… • Either data-based or process-based • Either fine-grainedor coarse-grained • Focused on modeling and capturing provenance • Geared to specific functions or domains Capture both: “data-oriented workflows” Cover the spectrum in a unified fashion Also support provenance operators and queries End with a general-purpose open-source system

  11. Remainder of Talk Fundamentals Capturing provenance Exploiting provenance Concrete progress and results What’s next

  12. Remainder of Talk • Fundamentals • Data-oriented workflows • Provenance model • Capturing provenance • Exploiting provenance • Concrete progress and results • What’s next

  13. Remainder of Talk • Fundamentals • Data-oriented workflows • Provenance model • Capturing provenance • Exploiting provenance • Backward tracing & forward tracing • Forward propagation & refresh • Ad-hoc queries • Concrete progress and results • What’s next

  14. Data-Oriented Workflows • Graph of processing nodes; data sets on edges • Assume (for now): • Statically-defined; batch execution; acyclic • Don’t assume (for now): • Specific types of data sets or processing nodes

  15. Processing Nodes • Goal • Exploit knowledge and properties when present • Provide fallback when processing is opaque • Sample properties • Known relational operator or query • Monotonic • One-many or many-one • Map functionorReduce function

  16. Processing Nodes • General principle • Stronger properties  finer-grained input-output data relationships; more useful and efficient provenance • Sample properties • Known relational operator or query • Monotonic • One-many or many-one • Map functionorReduce function

  17. Processing Nodes: Example CustList1 Known relational operators Many-one, nonmonotonic One-one, monotonic Opaque Europe Europe CustList2 . . . Predict Predict Union Dedup Dedup Union ItemAgg ItemAgg Split Split CustListn-1 USA USA Item Sales Buying Patterns Catalog Items CustListn

  18. Provenance Model • Ultimate goals: • Support provenance at spectrum of granularities • Mesh data-oriented and process-oriented provenance • Composability/transitivity • For now, simple underlying model: • Mappings between input and output data elements Understandability

  19. Provenance Capture • Processing nodes provide provenance • information along with output Eager — generated at data-processing time versus Lazy — “tracing procedure”

  20. Processing Capture: Example CustList1 • Relational operators ― automatic, • previous work, eager or lazy • Dedup ― eager easy, lazy hard • One-ones ― eager or lazy easy • Predict ― it depends Europe Europe CustList2 . . . Predict Predict Union Dedup Dedup Union ItemAgg ItemAgg Split Split CustListn-1 USA USA Item Sales Buying Patterns Catalog Items CustListn Worst Case: No access to fine-grained provenance

  21. Provenance Operations — Basic • Backward tracing • Where did the Cowboy Hat record come from? • Forward tracing • Which sales predictions did Amelie contribute to? CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn

  22. Additional Functionality • Forward propagation • Update all affected predictions after customers • move from Texas to France CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn

  23. Additional Functionality • Refresh • Get latest prediction for Cowboy Hat sales (only) • based on modified buying patterns • ≈Backward tracing + Forward propagation CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn

  24. Provenance Queries • How many people from each country contributed to the Cowboy Hat prediction? • Which customer list contributed the most to the top 100 predicted items? CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn

  25. Provenance Queries • For a specific customer list, which items have higher demand than for the entire customer set? • Which customers have more duplication — those processed by USA or by Europe? CustList1 Europe CustList2 . . . Union Predict Dedup ItemAgg Split CustListn-1 USA Item Sales Buying Patterns Catalog Items CustListn

  26. Provenance Queries • For a specific customer list, which items have higher demand than for the entire customer set? • Which customers have more duplication — those processed by USA or by Europe? • Query language goals • Declarative ad-hoc queries à la database systems • Seamlessly combine provenance and data • Amenable to optimization

  27. Concrete Progress and Results 1. Provenance predicates • Motivated by making refresh problem concrete • Drove initial Panda prototype 2. Attribute mappings 3. Generalized map and reduce workflows • Provenance capture, backward tracing, • forward tracing, forward-propagation, • refresh • Ad-hoc queries, optimizations

  28. Concrete Progress and Results 1. Provenance predicates • Motivated by making refresh problem concrete • Drove initial Panda prototype 2. Attribute mappings 3. Generalized map and reduce workflows • Predicates  Attribute Mappings  GMRWs

  29. Provenance Predicates Provenance of output o is σp(I) I o O

  30. Provenance Predicates Provenance of output oi is σpi(I) • Worst case: pi = TRUE • Think: formalism to be instantiated • Predicates can have compact representations • Predicates can sometimes be generated automatically • Natural recursive definition • Extends to multiple inputs/outputs {[o1 , p1], [o2 , p2], …, [om , pm]} I Captures most existing provenance definitions

  31. Selective Refresh Problem • Exploit provenance to efficiently compute the • up-to-date value of selected output elements • after the input (or processing nodes) may have changed

  32. Selective Refresh Problem • Refreshing oi through one processing node P • Backward trace • Forward propagate P {[o1 , p1], [o2 , p2], …, [om , pm]} I Inew • I* = σpi(Inew) • onew = P(I*)

  33. Selective Refresh Problem • Refreshing oi through one processing node P P {[o1 , p1], [o2 , p2], …, [om , pm]} Inew … [ (Beret,high), item=‘Beret’ ] … … [ (Beret,medium), item=‘Beret’ ] … I Inew ItemAgg Refresh • σitem=‘Beret’

  34. Selective Refresh Problem • Refreshing oi through one processing node P P {[o1 , p1], [o2 , p2], …, [om , pm]} Inew Properties of processing nodes and their provenance • I* = σpi(Inew) • Does this always “work”? • Does it make sense? • onew = P(I*)

  35. Selective Refresh Problem • Refreshing oi through entire workflow • Backward trace recursively 2) Forward propagate through workflow I1 { … [oi , pi] … } I2 + Properties of workflow • Does this always “work”? • Does it make sense? • Is it efficient?

  36. Selective Refresh Example Euros2$ CitySum City=‘Paris’ Person=‘Amelie’ Person=‘Pierre’ City=‘Paris’ Person=‘Amelie’ Person=‘Pierre’ City=‘Paris’ Person=‘Amelie’ Person=‘Pierre’ Person=‘Marie’

  37. Required Properties

  38. Panda System (version 0.1) Graphical Interface Command-line Client Panda Layer Create Table SQLite File System Data Tables (user) Workflow Table (Panda) Python Transformations (user) SQL Transformations (user) Provenance Predicate Tables (Panda) Forward Filter Tables (Panda)

  39. Panda System (version 0.1) Graphical Interface Command-line Client Panda Layer Create SQL Transformation SQLite File System Data Tables (user) Workflow Table (Panda) Python Transformations (user) SQL Transformations (user) Provenance Predicate Tables (Panda) Forward Filter Tables (Panda)

  40. Panda System (version 0.1) Graphical Interface Command-line Client Panda Layer Create Python Transformation SQLite File System Data Tables (user) Workflow Table (Panda) Python Transformations (user) SQL Transformations (user) Provenance Predicate Tables (Panda) Forward Filter Tables (Panda)

  41. Panda System (version 0.1) Graphical Interface Command-line Client Panda Layer Backward Trace SQLite File System Data Tables (user) Workflow Table (Panda) Python Transformations (user) SQL Transformations (user) Provenance Predicate Tables (Panda) Forward Filter Tables (Panda)

  42. Panda System (version 0.1) Graphical Interface Command-line Client Panda Layer Forward Trace SQLite File System Data Tables (user) Workflow Table (Panda) Python Transformations (user) SQL Transformations (user) Provenance Predicate Tables (Panda) Forward Filter Tables (Panda)

  43. Panda System (version 0.1) Graphical Interface Command-line Client Panda Layer Refresh SQLite File System Data Tables (user) Workflow Table (Panda) Python Transformations (user) SQL Transformations (user) Provenance Predicate Tables (Panda) Forward Filter Tables (Panda)

  44. Attribute Mappings Attribute mapping: I.AO.B Provenance of output oO is:σI.A=o.B(I) More generally: x:σB=x(O) = P(σA=x(I)) P O (B, …) I (A, …) ItemAgg I.Item  O.item I(cust,item,prob) O(item,sales)

  45. Attribute Mappings Attribute mapping: I.AO.B Provenance of output oO is: σI.A=o.B(I) More generally: x:σB=x(O) = P(σA=x(I)) • Can generate automatically in many cases (e.g., SQL) • Worst case: { } { } • Generalize to Datalog-like rules • I(__, item, __) :- O(item, __) • I(__, item, prob) :- O1(item, __)prob > .95 • I(__, item, prob) :- O2(item, __)prob ≤ .95 • Allow functions • I.name  ToCaps(O.name)

  46. Attribute Mappings Attribute mapping: I.AO.B Provenance of output oO is: σI.A=o.B(I) More generally: x:σB=x(O) = T(σA=x(I)) • Rules for AMs: combining, splitting, transitivity • soundness and completeness • “Strongest possible mapping”  

  47. Provenance Operations Backward and forward tracing Forward propagation and refresh Key challenge: “broken chains” Proofs of correctness and minimality

  48. Generalized Map and Reduce Workflows What if every transformation was a Map or Reduce function? Very specific properties Provenance easier to define, capture, and exploit Automatic wrapping, doesn’t interfere with parallelism R M R M M

  49. Map and Reduce Provenance • Map functions • M(I) = UiI(M({i})) • Provenance of oOis iIsuch that oM({i}) • Reduce functions • R(I) = U1≤k ≤ n(R(Ik)) I1,…,Inpartition I on reduce-key • Provenance of oOis Ik  I such that oR(Ik)

  50. Recursive MR Provenance • Intuitive recursive definition Workflow W with inputs I1,…,In; output element o PW(o) = (I*1,…, I*n) I*1  I1, …, I*n In • Desirable property oW(I*1,…, I*n) R M R M M • Usually holds, but not always

More Related