
Predicting Execution Bottlenecks in Map-Reduce Clusters



Presentation Transcript


  1. Predicting Execution Bottlenecks in Map-Reduce Clusters Edward Bortnikov, Ari Frank, Eshcar Hillel, Sriram Rao Presenting: Alex Shraer Yahoo! Labs

  2. The Map Reduce (MR) Paradigm • Architecture for scalable information processing • Simple API • Computation scales to Web-scale data collections • Google MR • Pioneered the technology in the early 2000’s • Hadoop: open-source implementation • In use at Amazon, eBay, Facebook, Yahoo!, … • Scales to 10K’s of nodes (Hadoop 2.0) • Many proprietary implementations • MR technologies at Microsoft, Yandex, …
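
To make the "simple API" concrete, here is a minimal word-count job written against the Hadoop Java API (a textbook-style sketch, not code from the talk):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every token in an input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
          word.set(it.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the emitted counts for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
      }
    }

The framework handles partitioning, shuffling, and fault tolerance; the programmer supplies only these two functions.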

  3. Computational Model • [Figure: mappers M1–M4 read input from the DFS; their outputs feed reducers R1–R2, which write output to the DFS] • Synchronous execution: every R starts computing after all M’s have completed • The slowest task (straggler) affects the job latency
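
Under this barrier semantics, job latency is determined by the slowest task of each phase, not by the average. A toy illustration with hypothetical task times (not data from the talk):

    import java.util.Arrays;

    // Toy model of the synchronous M/R barrier: no reducer starts until the
    // last mapper finishes, so a single straggler stretches the whole job.
    public class BarrierLatency {
      static double phaseLatency(double[] taskTimes) {
        return Arrays.stream(taskTimes).max().orElse(0.0);
      }

      public static void main(String[] args) {
        double[] mapTimes = {10, 11, 9, 42};  // hypothetical; M4 is a straggler
        double[] reduceTimes = {20, 22};
        // Job latency = slowest map + slowest reduce = 64.0, even though
        // the average map time is only 18.
        System.out.println(phaseLatency(mapTimes) + phaseLatency(reduceTimes));
      }
    }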

  4. Predicting Straggler Tasks • Straggler tasks are an inherent bottleneck • Affect job latency, and to some extent throughput • Two approaches to tackle stragglers • Avoidance – reduce the probability of straggler emergence • Detection – once a task goes astray, speculatively fire a duplicate task somewhere else • This work – straggler prediction • Fits with both avoidance and detection scenarios
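
A minimal sketch of the detection approach, assuming a simple trigger rule (the class name and the threshold are illustrative, not Hadoop's actual scheduler logic):

    import java.util.List;

    // Simplified speculative-execution trigger: fire a duplicate of a task
    // whose elapsed time exceeds a multiple of the median among its siblings.
    class SpeculationPolicy {
      private final double threshold;  // e.g., 1.5x the median

      SpeculationPolicy(double threshold) { this.threshold = threshold; }

      boolean shouldSpeculate(double taskElapsed, List<Double> siblingElapsed) {
        List<Double> sorted = siblingElapsed.stream().sorted().toList();
        double median = sorted.get(sorted.size() / 2);
        return taskElapsed > threshold * median;
      }
    }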

  5. Background • Detection, Speculative Execution • First implemented in Google MR (OSDI ’04) • Hadoop employs a crude detection heuristic • LATE scheduler (OSDI ‘08) addresses the issues of heterogeneous hardware. Evaluated at small scale. • Microsoft MR (Mantri project, OSDI ‘10) • Avoidance • Local/rack-local data access is preferred for mappers • … so the network is less likely to become the bottleneck • All optimizations are heuristic
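
For reference, LATE ranks running tasks by their estimated time to completion, derived from each task's progress score; a sketch of the published heuristic (not the scheduler's actual code):

    // LATE's core estimate: time_left = (1 - progress) / progress_rate, where
    // progress_rate = progress / elapsed. Tasks with the Longest Approximate
    // Time to End are speculated first, preferably on fast, idle nodes.
    class LateEstimator {
      static double estimatedTimeLeft(double progress, double elapsedSec) {
        if (progress <= 0) return Double.POSITIVE_INFINITY;  // no signal yet
        double progressRate = progress / elapsedSec;
        return (1.0 - progress) / progressRate;
      }
    }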

  6. Machine-Learned vs Heuristic Prediction • Heuristics are hard to … • Tune for real workloads • Catch transient bottlenecks • Some evidence from Hadoop grids at Yahoo! • Speculative scheduling is untimely and wasteful • 90% of the fired duplicates are eventually killed • Data-local computation amplifies contention • Can we use the wealth of historical grid performance data to train a machine-learned bottleneck classifier?

  7. Why Should Machine Learning Work? • Huge recurrence of large jobs in production grids • In the target workload, 95% of mappers and reducers belong to jobs that ran 50+ times in a 5-month sample

  8. The Slowdown Metric • Task slowdown factor • Ratio between the task’s running time and the median running time among the sibling tasks in the same job. • Root causes • Data skew – input or output significantly exceeds the median for the job • Tasks with skew > 4x are rare. • Hotspots – all the other reasons • Congested/misconfigured/degraded nodes, disks, or network. • Typically transient. The resulting slowdown can be very high.
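
The metric itself is straightforward to compute; a small sketch with illustrative numbers:

    import java.util.Arrays;

    // Slowdown factor of a task = its running time divided by the median
    // running time among the sibling tasks of the same job and phase.
    class SlowdownMetric {
      static double slowdown(double taskRuntime, double[] siblingRuntimes) {
        double[] sorted = siblingRuntimes.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];
        return taskRuntime / median;
      }

      public static void main(String[] args) {
        double[] siblings = {10, 10, 11, 12, 10};  // hypothetical runtimes (sec)
        System.out.println(slowdown(55, siblings));  // 5.5 -> a 5x+ straggler
      }
    }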

  9. Jobs with Mapper Slowdown > 5x Sample of ~50K jobs • 1% among all jobs • 5% among jobs with 1000 mappers or more • 40% due to data skew (2x or above), 60% due to hotspots

  10. Jobs with Reducer Slowdown > 5x Sample of ~60K jobs • 5% among all jobs • 50% among jobs with 1000 reducers or more • 10% due to data skew (2x or above), 90% due to hotspots

  11. Locality is No Silver Bullet • [Chart: top contributors of straggler tasks over a 6-hour window] • The same nodes are constantly lagging behind • Weaker CPUs (grid HW is heterogeneous), data hotspots, etc. • Pushing for locality too hard amplifies the problem!

  12. Slowdown Predictor • An oracle plugin for a Map-Reduce system • Input: node features + task features • Output: slowdown estimate • Features • M/R metrics (job- and task-level) • DFS metrics (datanode-level) • System metrics (host-level: CPU, RAM, disk I/O, JVM, …) • Network traffic (host-, rack- and cross-rack-level)
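
A minimal sketch of how such an oracle might be exposed to a scheduler; the interface and the feature fields below are assumptions mirroring the feature groups on the slide, not the paper's actual plugin API:

    // Hypothetical predictor interface: the scheduler supplies node- and
    // task-level features and receives an estimated slowdown factor.
    interface SlowdownPredictor {
      double predictSlowdown(NodeFeatures node, TaskFeatures task);
    }

    // Illustrative feature records (field names are assumptions).
    record NodeFeatures(double cpuLoad, double ramFreeGb, double diskIoUtil,
                        double intraRackTraffic, double crossRackTraffic,
                        int hwGeneration) {}

    record TaskFeatures(long inputBytes, long outputBytes,
                        int tasksInJob, double medianSiblingRuntime) {}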

  13. Slowdown Prediction - Mappers • [Plot of prediction results; some tasks are mis-predicted and need improvement]

  14. Slowdown Prediction - Reducers • [Plot of prediction results; more dispersed than the mappers]

  15. Some Conclusions • Data skew is the most important signal, but many other signals matter • Node HW generation is a very significant signal for both mappers and reducers • Large grids undergo continuous HW upgrades • Network traffic features (intra-rack and cross-rack) are much more important for reducers than for mappers • How to collect them efficiently in a real-time setting? • Need to enhance data sampling/weighting to capture outliers better

  16. Takeaways • Slowdown prediction • ML approach to straggler avoidance and detection • Initial evaluation showed viability • Need to enhance training to capture outliers better • Challenge: runtime implementation • A good blend with the modern MR system architecture?

  17. Thank you

  18. Machine Learning Technique • Gradient Boosted Decision Trees (GBDT) • Additive regression model • Based on ensemble of binary decision trees • 100 trees, 10 leaf nodes each …
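
A minimal sketch of the additive form of such a model (generic GBDT scoring under the stated configuration, not the authors' trained ensemble):

    import java.util.List;

    // Generic GBDT scoring: an additive model over an ensemble of regression
    // trees. Each tree routes the feature vector to one of its ~10 leaves;
    // the slowdown estimate is the shrunken sum of the 100 leaf values.
    class GbdtModel {
      interface RegressionTree { double leafValue(double[] features); }

      private final List<RegressionTree> trees;  // e.g., 100 trees
      private final double learningRate;         // shrinkage factor

      GbdtModel(List<RegressionTree> trees, double learningRate) {
        this.trees = trees;
        this.learningRate = learningRate;
      }

      double predict(double[] features) {
        double score = 0.0;
        for (RegressionTree t : trees) {
          score += learningRate * t.leafValue(features);
        }
        return score;
      }
    }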

  19. Challenges – Hadoop Use Case • Hadoop 1.0 – centralized architecture • The single Job Tracker process manages all task assignment and scheduling • Full picture of Map and Reduce slots across the cluster • Hadoop 2.0 – distributed architecture • Resource management and scheduling functions split • Thin centralized Resource Manager (RM) creates application containers (e.g., for running a Map Reduce job) • Per-job App Master (AM) does scheduling within a container • May negotiate resource allocation with the RM • Challenge: working with a limited set of local signals
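
To illustrate the split, here is roughly how a per-job App Master negotiates containers with the central RM using the YARN client API (heavily simplified; configuration, callbacks, and error handling omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Sketch: a per-job App Master registering with the RM and requesting a
    // container; the AM then schedules its own tasks inside the containers
    // granted in subsequent allocate() responses.
    public class AmSketch {
      public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new Configuration());
        rm.start();
        rm.registerApplicationMaster("", 0, "");

        // Ask for one 1 GB / 1 vcore container anywhere in the cluster.
        Resource capability = Resource.newInstance(1024, 1);
        rm.addContainerRequest(
            new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        rm.allocate(0.0f);  // heartbeat to the RM; the response carries grants
      }
    }

Note that the AM sees only its own containers and local signals, which is exactly the stated challenge for prediction.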

  20. Possible Design – Hadoop 2.0 • Centralized prediction will not scale. Will distributed prediction be accurate enough? • [Diagram: the model lives in the Application Master (a new component or API); the Resource Manager handles app container creation and resource requests; Node Managers in the job execution environment report metrics upward, extending the existing heartbeat (HB) protocol] • Some metrics are already collected (CPU ticks, bytes R/W) • Others might be collected either by the NM, or externally
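
A hedged sketch of what the metrics piggybacked on the NodeManager heartbeat might look like; the record below is entirely illustrative (YARN's real heartbeat protocol and field set differ):

    // Illustrative per-node metrics record, piggybacked on the NM heartbeat.
    // Field names are assumptions mirroring the feature groups of slide 12.
    record NodeMetrics(String nodeId,
                       long cpuTicks,       // already collected by Hadoop
                       long bytesRead,      // already collected
                       long bytesWritten,   // already collected
                       double diskIoUtil,   // would need NM or external collection
                       double intraRackBw,  // rack-level network signal
                       double crossRackBw) {}

    // Hypothetical hook on the Application Master side: fold incoming
    // heartbeat metrics into the features used by the local model.
    interface MetricsSink {
      void onHeartbeat(NodeMetrics m);
    }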
