Provenance for Generalized Map and Reduce Workflows



    Presentation Transcript
    1. Provenance for Generalized Map and Reduce Workflows
       Robert Ikeda, Hyunjung Park, Jennifer Widom
       Stanford University
       Pei Zhang, Yue Lu

    2. Provenance
       • Where data came from
       • How it was derived, manipulated, combined, processed, …
       • How it has evolved over time
       • Uses:
         • Explanation
         • Debugging and verification
         • Recomputation

    3. The Panda Environment
       • Data-oriented workflows
       • Graph of processing nodes
       • Data sets on edges
       • Statically defined; batch execution; acyclic
       [Figure: workflow graph with inputs I1, …, In and output O]

    4. Provenance
       • Backward tracing
         • Find the input subsets that contributed to a given output element
       • Forward tracing
         • Determine which output elements were derived from a particular input element
       [Figure: example with data sets Twitter Posts and Movie Sentiments]
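Backward and forward tracing can be read as two lookups over the same provenance relation. The sketch below assumes that the provenance of a single processing node is available as (output ID, input ID) pairs; the IDs and the `links` data are made up for illustration and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical provenance relation for one node: (output_id, input_id)
# means "this input element contributed to this output element".
links = [
    ("sentiment1", "tweet1"),
    ("sentiment1", "tweet2"),
    ("sentiment2", "tweet2"),
    ("sentiment3", "tweet3"),
]

# Build one index per tracing direction.
backward_index = defaultdict(set)
forward_index = defaultdict(set)
for out_id, in_id in links:
    backward_index[out_id].add(in_id)
    forward_index[in_id].add(out_id)


def trace_backward(out_id):
    """Input elements that contributed to the given output element."""
    return set(backward_index.get(out_id, set()))


def trace_forward(in_id):
    """Output elements derived from the given input element."""
    return set(forward_index.get(in_id, set()))
```

Here `trace_backward("sentiment1")` returns both tweets that fed that output, and `trace_forward("tweet2")` returns both outputs that depend on that tweet.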

    5. Provenance
       • Basic idea
         • Capture provenance one node at a time (lazy or eager)
         • Use it for backward and forward tracing
         • Handle processing nodes of all types

    6. Generalized Map and Reduce Workflows
       What if every node were a Map or Reduce function?
       • Provenance easier to define, capture, and exploit than in the general case
       • Transparent provenance capture in Hadoop
       • Doesn’t interfere with parallelism or fault tolerance
       [Figure: workflow graph of Map (M) and Reduce (R) nodes]

    7. Remainder of Talk
       • Defining Map and Reduce provenance
       • Recursive workflow provenance
       • Capturing and tracing provenance
       • System description and performance
       • Future work

    8. Remainder of Talk
       • Defining Map and Reduce provenance
       • Recursive workflow provenance
       • Capturing and tracing provenance
       • System description and performance
       • Future work
       Callouts: Surprising theoretical result; Implementation details

    9. Example
       [Figure: example workflow. Processing nodes: TweetScan, DiggScan, Aggregate, Filter. Data sets: Tweets, Diggs, TM, DM, AM, Good Movies, Bad Movies]

    10. Transformation Properties
        • Deterministic functions
        • Multiplicity for Map functions
        • Multiplicity for Reduce functions
        • Monotonicity

    11. Map and Reduce Provenance
        • Map functions
          • $M(I) = \bigcup_{i \in I} M(\{i\})$
          • Provenance of $o \in O$ is $i \in I$ such that $o \in M(\{i\})$
        • Reduce functions
          • $R(I) = \bigcup_{1 \le k \le n} R(I_k)$, where $I_1, \ldots, I_n$ partition $I$ on reduce-key
          • Provenance of $o \in O$ is $I_k \subseteq I$ such that $o \in R(I_k)$
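The two definitions above can be evaluated directly, if inefficiently, by re-running the function on single elements (Map) or on single reduce-key groups (Reduce). The following is a minimal sketch under that reading; the helper names `map_provenance` and `reduce_provenance` and the word-count functions are illustrative assumptions, not part of the talk.

```python
def map_provenance(map_fn, inputs, o):
    """Inputs i such that o is in M({i}), following the Map definition."""
    return [i for i in inputs if o in map_fn([i])]


def reduce_provenance(reduce_fn, key_fn, inputs, o):
    """Groups I_k (partition of the input on reduce-key) with o in R(I_k)."""
    groups = {}
    for i in inputs:
        groups.setdefault(key_fn(i), []).append(i)
    return [group for group in groups.values() if o in reduce_fn(group)]


# Illustrative Map function: split lines into words (one-many).
def split_words(lines):
    return [word for line in lines for word in line.split()]


# Illustrative Reduce function: count the words in one reduce-key group.
def count_group(words):
    return [(words[0], len(words))]


lines = ["a b", "b c"]
words = split_words(lines)  # ["a", "b", "b", "c"]

map_prov = map_provenance(split_words, lines, "b")
reduce_prov = reduce_provenance(count_group, lambda w: w, words, ("b", 2))
```

With these inputs, `map_prov` is `["a b", "b c"]` because both lines produce the word `b`, and `reduce_prov` is `[["b", "b"]]`, the single group that yields `("b", 2)`.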

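    12. Workflow Provenance
        • Intuitive recursive definition
        • Desirable “replay” property: $o \in W(I^*_1, \ldots, I^*_n)$
        • Usually holds, but not always
        [Figure: workflow $W$ of Map (M) and Reduce (R) nodes with inputs $I_1, \ldots, I_n$, intermediate data sets $E_1$ and $E_2$, and output $O$; the provenance of $o \in O$ is $I^*_1 \subseteq I_1, \ldots, I^*_n \subseteq I_n$]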

    13. Replay Property Example
        [Figure: workflow with nodes TweetScan (M), Count (R), Summarize (R) and data sets Twitter Posts, Inferred Movie Ratings, Rating Medians, #Movies Per Rating]
        Twitter posts:
        • “Avatar was great”
        • “I hated Twilight”
        • “Twilight was pretty bad”
        • “I enjoyed Avatar”
        • “I loved Twilight”
        • “Avatar was okay”


    15. Replay Property Example
        [Figure: workflow with nodes TweetScan (M), Count (R), Summarize (R) and data sets Twitter Posts, Inferred Movie Ratings, Rating Medians, #Movies Per Rating]
        Twitter posts:
        • “Avatar was great”
        • “I hated Twilight”
        • “Twilight was pretty bad”
        • “I enjoyed Avatar and Twilight too”
        • “Avatar was okay”


    17. Replay Property Example
        [Figure: workflow with nodes TweetScan (M), Count (R), Summarize (R) and data sets Twitter Posts, Inferred Movie Ratings, Rating Medians, #Movies Per Rating; annotations: One-Many Function, Nonmonotonic Reduce, Nonmonotonic Reduce; the figure also shows the values 2 and 7]
        Twitter posts:
        • “Avatar was great”
        • “I hated Twilight”
        • “Twilight was pretty bad”
        • “I enjoyed Avatar and Twilight too”
        • “Avatar was okay”
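One way the replay property can fail with a one-many Map followed by nonmonotonic Reduces is sketched below. Everything numeric here is an assumption: the slides give no ratings, so the `SCORES` table is invented, and the node order (TweetScan, then Summarize, then Count) is a guess at the figure. The provenance of the chosen output element is worked out by hand from the recursive definition rather than computed.

```python
from collections import defaultdict
from statistics import median

# Assumed keyword-to-rating table; not given in the slides.
SCORES = {"great": 9, "enjoyed": 7, "okay": 5, "bad": 3, "hated": 1}
MOVIES = ("Avatar", "Twilight")


def tweet_scan(tweets):
    """Map (one-many): one (movie, rating) pair per movie mentioned."""
    out = []
    for tweet in tweets:
        rating = next(s for word, s in SCORES.items() if word in tweet)
        out.extend((movie, rating) for movie in MOVIES if movie in tweet)
    return out


def summarize(ratings):
    """Reduce on movie: emit (movie, median rating)."""
    by_movie = defaultdict(list)
    for movie, rating in ratings:
        by_movie[movie].append(rating)
    return [(movie, median(rs)) for movie, rs in by_movie.items()]


def count(medians):
    """Reduce on rating: emit (rating, number of movies with that median)."""
    by_rating = defaultdict(list)
    for movie, rating in medians:
        by_rating[rating].append(movie)
    return [(rating, len(movies)) for rating, movies in by_rating.items()]


def workflow(tweets):
    return count(summarize(tweet_scan(tweets)))


tweets = [
    "Avatar was great",
    "I hated Twilight",
    "Twilight was pretty bad",
    "I enjoyed Avatar and Twilight too",
    "Avatar was okay",
]

full_output = workflow(tweets)

# Provenance of output element (7, 1), traced by hand: Count's group for
# rating 7 holds only Avatar, Summarize's group for Avatar holds its three
# ratings, and those ratings come from the three tweets mentioning Avatar.
target = (7, 1)
provenance = [t for t in tweets if "Avatar" in t]

# Replaying the workflow on the provenance alone.
replayed = workflow(provenance)
```

Under these assumed scores the full run contains `(7, 1)`, but the replay yields `(7, 2)`: the two-movie tweet drags a lone Twilight rating of 7 into the replay, so Twilight's median changes and the original output element is not reproduced.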

    18. Capturing and Tracing Provenance
        • Map functions
          • Add the input ID to each of the output elements
        • Reduce functions
          • Add the input reduce-key to each of the output elements
        • Tracing
          • Straightforward recursive algorithms
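The capture and tracing scheme above can be sketched for a linear two-node workflow. This is an in-memory illustration only: the ID formats, the helper names `run_map`, `run_reduce`, and `trace_back`, and the word-count example are assumptions, not the system's actual interfaces.

```python
def run_map(map_fn, inputs):
    """Run a Map node over {id: element}; remember each output's input ID."""
    outputs, in_id_of = {}, {}
    for in_id, elem in inputs.items():
        for n, out in enumerate(map_fn(elem)):
            out_id = f"{in_id}.{n}"  # hypothetical output ID scheme
            outputs[out_id] = out
            in_id_of[out_id] = in_id
    return outputs, lambda out_id: {in_id_of[out_id]}


def run_reduce(reduce_fn, key_fn, inputs):
    """Run a Reduce node; remember each output's reduce-key."""
    ids_by_key, elems_by_key = {}, {}
    for in_id, elem in inputs.items():
        key = key_fn(elem)
        ids_by_key.setdefault(key, set()).add(in_id)
        elems_by_key.setdefault(key, []).append(elem)
    outputs, key_of = {}, {}
    for key, elems in elems_by_key.items():
        for n, out in enumerate(reduce_fn(key, elems)):
            out_id = f"{key}#{n}"  # hypothetical output ID scheme
            outputs[out_id] = out
            key_of[out_id] = key
    # Backward step: output -> reduce-key -> all inputs with that key.
    return outputs, lambda out_id: set(ids_by_key[key_of[out_id]])


def trace_back(ids, backs):
    """Recursive backward tracing; backs lists per-node lookups, first node first."""
    if not backs:
        return set(ids)
    *earlier, last = backs
    step = set()
    for out_id in ids:
        step |= last(out_id)
    return trace_back(step, earlier)


# Word count as a Map node followed by a Reduce node.
lines = {"t1": "a b", "t2": "b c"}
words, map_back = run_map(str.split, lines)
counts, reduce_back = run_reduce(lambda k, vs: [(k, len(vs))], lambda w: w, words)

sources = trace_back({"b#0"}, [map_back, reduce_back])
```

Here `counts["b#0"]` is `("b", 2)` and `sources` is `{"t1", "t2"}`, the two lines that contain the word.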

    19. RAMP System
        • Built as an extension to Hadoop
        • Supports MapReduce workflows
          • Each node is a MapReduce job
        • Provenance capture is transparent
          • Retaining Hadoop’s parallel execution and fault tolerance
          • Users need not be aware of provenance capture
          • Wrapping is automatic
        • RAMP stores provenance separately from the input and output data

    20. RAMP System: Provenance Capture
        • Hadoop components
          • Record-reader
          • Mapper
          • Combiner (optional)
          • Reducer
          • Record-writer

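    21. RAMP System: Provenance Capture
        [Figure: map-side data flow, without and with provenance wrappers]
        Without wrappers: Input → RecordReader → (ki, vi) → Mapper → (km, vm) → Map Output
        With wrappers: Input → Wrapper[RecordReader] → (ki, 〈vi, p〉) → Wrapper[(ki, vi) → Mapper → (km, vm)] → (km, 〈vm, p〉) → Map Output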

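    22. RAMP System: Provenance Capture
        [Figure: reduce-side data flow, without and with provenance wrappers]
        Without wrappers: Map Output → (km, [vm1, …, vmn]) → Reducer → (ko, vo) → RecordWriter → Output
        With wrappers: Map Output → (km, [〈vm1, p1〉, …, 〈vmn, pn〉]) → Wrapper[(km, [vm1, …, vmn]) → Reducer → (ko, vo)] → (ko, 〈vo, kmID〉) → Wrapper[(ko, vo) → RecordWriter] → Output
        Provenance: (kmID, pj) and (q, kmID)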
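The wrapper data flow on the two slides above can be imitated in plain Python to show where each provenance pair comes from. This is a single-process sketch, not Hadoop code: the function names are hypothetical, p is assumed to be an input record ID, kmID an ID assigned per reduce-key, and q an output record ID.

```python
def wrapped_mapper(mapper, ki, tagged_value):
    """Strip p, run the unmodified mapper, re-attach p to every output."""
    vi, p = tagged_value                 # (ki, <vi, p>)
    for km, vm in mapper(ki, vi):        # mapper sees plain (ki, vi)
        yield km, (vm, p)                # (km, <vm, p>)


def wrapped_reducer(reducer, km, tagged_values, km_id, key_prov):
    """Strip the p's, record (kmID, pj), run the reducer, tag with kmID."""
    for _, p in tagged_values:
        key_prov.append((km_id, p))      # provenance pair (kmID, pj)
    values = [vm for vm, _ in tagged_values]
    for ko, vo in reducer(km, values):   # reducer sees plain values
        yield ko, (vo, km_id)            # (ko, <vo, kmID>)


def wrapped_writer(records, output, out_prov):
    """Write plain (ko, vo); store (q, kmID) separately from the output."""
    for q, (ko, (vo, km_id)) in enumerate(records):
        output.append((ko, vo))
        out_prov.append((q, km_id))      # provenance pair (q, kmID)


def run_job(lines, mapper, reducer):
    map_output = {}
    for p, line in enumerate(lines):     # record-reader wrapper: p = record ID
        for km, tagged in wrapped_mapper(mapper, p, (line, p)):
            map_output.setdefault(km, []).append(tagged)
    output, key_prov, out_prov, reduced = [], [], [], []
    for km_id, (km, tagged_values) in enumerate(sorted(map_output.items())):
        reduced.extend(wrapped_reducer(reducer, km, tagged_values, km_id, key_prov))
    wrapped_writer(reduced, output, out_prov)
    return output, key_prov, out_prov


def backward_trace(q, key_prov, out_prov):
    """Output ID q -> kmID -> input record IDs p."""
    km_ids = {km_id for out_q, km_id in out_prov if out_q == q}
    return {p for km_id, p in key_prov if km_id in km_ids}


# Word count with user functions that know nothing about provenance.
def wc_mapper(_, line):
    return [(word, 1) for word in line.split()]


def wc_reducer(word, ones):
    return [(word, sum(ones))]


output, key_prov, out_prov = run_job(["a b", "b c"], wc_mapper, wc_reducer)
```

In this run `output[1]` is `("b", 2)` and `backward_trace(1, key_prov, out_prov)` returns `{0, 1}`, the two input records containing `b`. The user's mapper and reducer are unchanged, which is the sense in which the slides call capture transparent.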

    23. Experiments
        • 51 large EC2 instances (Thank you, Amazon!)
        • Two MapReduce “workflows”
          • Wordcount
            • Many-one with large fan-in
            • Input sizes: 100, 300, 500 GB
          • Terasort
            • One-one
            • Input sizes: 93, 279, 466 GB

    24. Results: Wordcount

    25. Results: Terasort

    26. Summary of Results
        • Overhead of provenance capture
          • Terasort
            • 20% time overhead, 21% space overhead
          • Wordcount
            • 76% time overhead, space overhead depends directly on fan-in
        • Backward tracing
          • Terasort
            • 1.5 seconds for one element
          • Wordcount
            • Time directly dependent on fan-in

    27. Future Work
        • RAMP
          • Selective provenance capture
          • More efficient backward and forward tracing
          • Indexing
        • General
          • Incorporating SQL processing nodes

    28. PANDA
        A System for Provenance and Data
        “stanford panda”