1 / 21

GRASS : Trimming Stragglers in Approximation Analytics

GRASS : Trimming Stragglers in Approximation Analytics. Ganesh Ananthanarayanan, Michael Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, Minlan Yu. Approximation Jobs. Data deluge  timely answers from part of dataset are often good enough

iram
Download Presentation

GRASS : Trimming Stragglers in Approximation Analytics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. GRASS: Trimming Stragglers in Approximation Analytics Ganesh Ananthanarayanan, Michael Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, Minlan Yu

  2. Approximation Jobs • Data deluge  timely answers from part of dataset are often good enough • Deadline-bound Jobs: Maximize accuracy within a deadline • E.g., get me the best ad to show within 1s • Error-bound Jobs: Minimize time taken to achieve desired accuracy • E.g., get me the #cars sold to the nearest 1000

  3. Prioritize Tasks • Spawn jobs on large dataset; subset of tasks need to complete • Accuracy α (Subset of completed tasks) • Prioritize tasks • Jobs run only part of their tasks at a time (multi-waved) • NP-Hard but many known heuristics…

  4. Stragglers • Occur despite prevention mechanisms • Speculation: Spawn a duplicate, earliest wins Challenge:dynamically prioritize while choosing between speculative/unscheduled tasks

  5. Opportunity Cost • Speculative copies consume extra resources T1 Is speculation worth the payoff? Slot 1 Slot 2 T2 Slot 3 • T1 T3 time 0 5 10 9

  6. Roadmap • Two extreme approaches • Switching between the two approaches • Evaluation

  7. Greedy Scheduling (GS) • Schedule the task that greedilyimproves accuracy, i.e., finish earliest Accuracy = 7/9 Slot 1 T1 T1 Deadline = 6 Slot 2 T2 • T3 • T4 • T5 • T6 • T7 time 3 0 1 6

  8. Resource Aware Scheduling (RAS) • Speculate only if (c)trem > (c+1)tnew Accuracy = 8/9 Slot 1 T1 • T4 • T6 • T7 T1 Deadline = 6 Slot 2 T2 • T1 • T3 • T5 • T8 time 3 0 1 6

  9. GS vs. RAS Accuracy = 7/9 Accuracy = 3/9 Slot 1 T1 T1 T1 GS Deadline = 6 Deadline = 3 Slot 2 T2 • T3 • T4 • T5 • T6 • T7 time 3 0 1 6 Accuracy = 2/9 Accuracy = 8/9 Slot 1 T1 • T4 • T6 • T7 T1 RAS Deadline = 6 Deadline = 3 Slot 2 T2 • T1 • T3 • T5 • T8 time

  10. How to pick between GS and RAS? • When to consider opportunity cost? • Analytical model [TBD]

  11. Guideline • Be “conservative” early in the job, and “aggressive” near its completion For jobs with >2 waves of tasks use RAS, otherwise use GS

  12. GRASS = GS+ RAS • Start with RAS and switch to GSwhen 2 waves of tasks remain • How to estimate two remaining waves? • Wave boundaries are not strict • Task durations are uncertain • Wave-width is not constant

  13. Learning switching point in GRASS • GRASS starts with RAS and switches to GS close to the deadline/error-bound • Learn the switching point from completed jobs • Generate GS-only and RAS-only samples • Switch when the GS-only samples are better for the remaining deadline

  14. GRASS Scheduling • GS  Greedy Scheduling • RAS  Opportunity cost aware Scheduling • Switch from RAS to GS close to deadline/error-bound • Learn switching point

  15. Implementation Details • Implemented inside Hadoop 0.20.2 and Spark 0.7.3 • Modified Fair Scheduler • Task Estimators • trem is extrapolated from remaining data; progress reports at 5% intervals • tnew is estimated from completed tasks

  16. Evaluating GRASS • Workload from Facebook and Bing traces • 1000’s of nodes, Hadoop and Dryad jobs • How was it generated? • Baselines: LATE orMantri • 200 node EC2 deployment; trace-driven sim.

  17. Deadline-bound Jobs

  18. Error-bound Jobs

  19. Value of Switching

  20. Related Work

  21. Conclusion

More Related