
SkewReduce

Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions (SkewReduce). YongChul Kwon, Magdalena Balazinska, Bill Howe (University of Washington), Jerome Rolia (HP Labs). Published in SoCC 2010.



Presentation Transcript


  1. Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions (SkewReduce) YongChul Kwon, Magdalena Balazinska, Bill Howe, Jerome Rolia* • University of Washington, *HP Labs • Published in SoCC 2010

  2. Motivation • Science is becoming a data management problem • MapReduce is an attractive solution • Simple API, declarative layer, seamless scalability, … • But it is hard to • Express complex algorithms and • Get good performance (14 hours vs. 70 minutes) • SkewReduce: • Goal: Scalable analysis with minimal effort • Toward scalable feature extraction analysis

  3. Application 1: Extracting Celestial Objects • Input • { (x,y,ir,r,g,b,uv,…) } • Coordinates • Light intensities • … • Output • List of celestial objects • Star • Galaxy • Planet • Asteroid • … [Image: M34 from the Sloan Digital Sky Survey]

  4. Application 2: Friends of Friends • Simple clustering algorithm used in astronomy • Input: • Points in multi-dimensional space • Output: • List of clusters (e.g., galaxies, …) • Original data annotated with cluster IDs • Friends • Two points within distance ε • Friends of Friends • Transitive closure of the Friends relation
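The Friends-of-Friends definition above can be sketched directly: two points are friends if they lie within ε of each other, and a cluster is the transitive closure of that relation. Below is a minimal, naive O(N²) Python sketch using union-find; the function and helper names are illustrative, not from the paper's implementation.

```python
import math

def friends_of_friends(points, eps):
    """Naive Friends-of-Friends: returns one cluster ID per point,
    where clusters are the transitive closure of the 'within eps' relation."""
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:  # i and j are "friends"
                union(i, j)

    return [find(i) for i in range(n)]

# (0,0) and (1,0) are friends at eps=1.5; (10,10) is isolated.
ids = friends_of_friends([(0, 0), (1, 0), (10, 10)], eps=1.5)
```

A production version would use a spatial index (e.g., a k-d tree) to avoid the all-pairs distance check, which is exactly where the data-dependent cost variation discussed later comes from.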

  5. Parallel Friends of Friends • Partition • Local clustering • Merge: P1-P2, P3-P4 • Merge: P1-P2-P3-P4 • Finalize • Annotate original data [Figure: four partitions P1–P4 with local clusters C1–C6; merging relabels C4→C3, C5→C3, C6→C3]

  6. Parallel Feature Extraction • Partition multi-dimensional input data • Extract features from each partition • Merge (or reconcile) features • Finalize output [Figure: INPUT DATA → Features via Map, then a “Hierarchical Reduce”]
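The four-step pattern above (partition, extract, hierarchical merge, finalize) can be sketched as a small driver function. This is a hand-written illustration of the processing pattern, not SkewReduce's actual API; all names and the pairwise merge order are assumptions.

```python
def parallel_feature_extraction(data, partition, extract, merge, finalize):
    """Sketch of the partition / extract / hierarchical-merge / finalize
    pattern. In a real system the extract calls run as parallel map tasks
    and the merges form a reduce tree; here everything runs serially."""
    partitions = partition(data)                 # split multi-dimensional input
    features = [extract(p) for p in partitions]  # "map": local feature extraction
    while len(features) > 1:                     # "hierarchical reduce": merge pairwise
        features = [merge(features[i], features[i + 1])
                    if i + 1 < len(features) else features[i]
                    for i in range(0, len(features), 2)]
    return finalize(features[0])

# Toy usage: "features" are just sums, so the pipeline totals the input.
total = parallel_feature_extraction(
    list(range(10)),
    partition=lambda d: [d[:3], d[3:6], d[6:]],
    extract=sum,
    merge=lambda a, b: a + b,
    finalize=lambda x: x,
)
```

For Friends-of-Friends, `extract` would run local clustering on one bounding box and `merge` would reconcile cluster IDs across adjacent boxes, as in slide 5.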

  7. Problem: Skew [Figure: per-task timeline (Local Clustering / Map tasks and Merge / Reduce tasks) on an 8-node cluster with 32 map/32 reduce slots; most tasks finish within 5–35 minutes, but the top red line — a single straggler — runs for 1.5 hours]

  8. Unbalanced Computation: Skew • Computation skew • A characteristic of the algorithm • Same amount of input data ≠ same runtime: per-partition cost ranges from O(N log N) to O(N²) depending on the data (0 friends per particle vs. O(N) friends per particle) • Can we scale out an off-the-shelf implementation with no (or minimal) modifications?

  9. Solution 1? Micro-partitioning • Assign a tiny amount of work to each task to reduce skew

  10. How about having micro-partitions? (8-node cluster, 32 map/32 reduce slots) • It works! • But: framework overhead! • To find the sweet spot, we need to try different granularities! • Can we find a good partitioning plan without trial and error?

  11. Outline • Motivation • SkewReduce • API (in the paper) • Partition Optimization • Evaluation • Summary

  12. Partition Optimization • Given: a serial feature-extraction algorithm and a merge algorithm • Varying granularities of partitions • Can we automatically find a good partition plan and schedule? [Figure: example partition plan with numbered subpartitions]

  13. Approach • Goal: minimize expected total runtime • Inputs to the SkewReduce optimizer: a data sample, cost functions, and the cluster configuration • Output: a SkewReduce runtime plan • Bounding boxes for data partitions • Schedule [Figure: sample + cost functions + cluster configuration → SkewReduce Optimizer → runtime plan]

  14. Partition Plan Guided by Cost Functions • “Given a sample, how long will it take to process?” • Two cost functions: • Feature cost: (Bounding box, sample, sample rate) → cost • Merge cost: (Bounding boxes, sample, sample rate) → cost • Basis of two optimization decisions • How (axis, point) to split a partition • When to stop partitioning
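The two cost-function signatures above can be sketched as plain Python functions. Both take the sampled points, a candidate bounding box (or boxes), and the sampling rate, and return an estimated cost. The specific models here — O(n log n) feature cost and a merge cost linear in the points involved — are illustrative assumptions, not the paper's actual models; users supply whatever models fit their algorithm.

```python
import math

def in_box(sample, bbox):
    """Points of the sample inside bbox = (min_corner, max_corner)."""
    lo, hi = bbox
    return [p for p in sample
            if all(lo[d] <= p[d] <= hi[d] for d in range(len(lo)))]

def feature_cost(bbox, sample, sample_rate):
    """Estimated cost of extracting features from one partition.
    Assumes an O(n log n) algorithm; scale the sample count up by the
    sampling rate to estimate the full partition size."""
    n_est = len(in_box(sample, bbox)) / sample_rate
    return n_est * math.log2(max(n_est, 2))

def merge_cost(bboxes, sample, sample_rate):
    """Estimated cost of merging several partitions; assumed linear in
    the total estimated number of points involved."""
    return sum(len(in_box(sample, b)) for b in bboxes) / sample_rate
```

The optimizer only ever calls these on the sample, which is why planning stays cheap relative to the actual run (slide 20).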

  15. Search Partition Plan • Greedy top-down search • Split if total expected runtime improves • Evaluate costs for subpartitions and merge • Estimate new runtime [Figure: the original partition takes 100 time units; one candidate split yields schedule length 60 → ACCEPT, another yields 110 → REJECT]
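The accept/reject decision in the greedy search can be sketched as a single comparison: a split pays off only if running the two subpartitions in parallel and then merging them is cheaper than processing the partition whole. The function name and the simple `max + merge` schedule model are illustrative assumptions; the real optimizer also accounts for slot availability and the full schedule.

```python
def should_split(partition_cost, left_cost, right_cost, merge_cost_est):
    """Greedy split test: accept the split if the subpartitions (run in
    parallel, so their cost is the max) plus the merge beat processing
    the partition as one unit. All costs are in the same time units."""
    split_schedule = max(left_cost, right_cost) + merge_cost_est
    return split_schedule < partition_cost

# Mirrors the slide: splitting a 100-unit partition into 50- and 10-unit
# halves with a 10-unit merge gives a 60-unit schedule -> ACCEPT.
accept = should_split(100, 50, 10, 10)
# A split whose schedule exceeds the original cost is rejected.
reject = should_split(50, 40, 40, 30)
```

The search applies this test top-down, recursing into accepted subpartitions until no further split improves the expected runtime.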

  16. Summary of Contributions • Given a feature extraction application • Possibly with computation skew • SkewReduce • Automatically partitions input data • Improves runtime in spite of computation skew • Key technique: user-defined cost functions

  17. Evaluation • 8-node cluster • Dual quad-core CPUs, 16 GB RAM • Hadoop 0.20.1 + custom patch to the MapReduce API • Distributed Friends of Friends • Astro: gravitational simulation snapshot • 900 M particles • Seaflow: flow cytometry survey • 59 M observations

  18. Does SkewReduce work? • Datasets: Astro (18 GB, 3D; runtimes in hours) and Seaflow (1.9 GB, 3D; runtimes in minutes), compared against plain MapReduce • The SkewReduce plan yields 2–8× faster running times, at the cost of 1 hour of preparation

  19. Impact of Cost Function • Astro dataset • Higher-fidelity cost functions = better performance

  20. Highlights of Evaluation • Sample size • Representativeness of the sample is important • Runtime of SkewReduce optimization • Less than 15% of the actual runtime of the SkewReduce plan • Data volume in the Merge phase • Total volume during Merge = 1% of input data • Details in the paper

  21. Conclusion • Scientific analysis should be easy to write, scalable, and have predictable performance • SkewReduce • API for feature-extracting functions • Scalable execution • Good performance in spite of skew • Cost-based partition optimization using a data sample • Published in SoCC 2010 • More general version coming soon!
