
Overprovisioning for Performance Consistency in Grids


Presentation Transcript


  1. Overprovisioning for Performance Consistency in Grids. Nezih Yigitbasi and Dick Epema, Parallel and Distributed Systems Group, Delft University of Technology. http://guardg.st.ewi.tudelft.nl/

  2. The Problem: Performance inconsistency in grids • Inconsistent performance is common in grids: • bursty workloads • variable background loads • high rates of failures • a highly dynamic & heterogeneous environment. How can we provide consistent performance in grids? [Figure: makespans of a Bag-of-Tasks with 128 tasks submitted every 15 minutes vary by a factor of ~70.]

  3. Our goals GOAL-1: Realistic performance evaluation of static and dynamic overprovisioning strategies (the system's perspective). GOAL-2: Dynamically determine the overprovisioning factor (Κ) for user-specified performance requirements (the user's perspective).

  4. Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions

  5. Overprovisioning (I) • Increasing the system capacity to provide better, and in particular consistent, performance even under variable workloads and unexpected demands. Pros: • simple • obviates the need for complex algorithms • easy to deploy & maintain. Cons: • cost-ineffective • workloads may evolve (e.g., a growing user base) • leads to lowly-utilized systems.

  6. Overprovisioning (II) • High overprovisioning factors (Κ) are common in modern systems • Google: 450,000 (2005) • Microsoft: 218,000 (mid-2008) • Facebook: 10,000+ (2009) • Preferred way of providing performance guarantees • typical data center utilization is no more than 15-50% • telecommunication systems have ~30% on average L. A. Barroso and U. Hölzle, The Case for Energy-Proportional Computing, IEEE Computer, December 2007.

  7. Overprovisioning strategies 1. Static: • Largest • All • Number • Where should we deploy the resources? • Does it make any difference? 2. Dynamic: • dynamic overprovisioning, a.k.a. auto-scaling • low/high utilization thresholds for acquiring/releasing resources (sketched below) • Given Κ, it is straightforward to determine the number of processors for a strategy. [Figure: demand and capacity over time for the static and dynamic strategies; the area between capacity and demand is waste.]
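As a minimal sketch of the dynamic (auto-scaling) strategy, the following Python fragment implements one step of a threshold-based controller, using the 60%/70% thresholds from the methodology slide; the step size, capacity bounds, and sampling interface are illustrative assumptions, not the paper's implementation.

```python
def autoscale_step(utilization, capacity, low=0.60, high=0.70, step=8,
                   min_capacity=64, max_capacity=1024):
    """One control step of a threshold-based auto-scaler (illustrative).

    utilization: fraction of busy processors, in [0, 1]
    capacity:    number of processors currently provisioned
    Returns the capacity after acquiring or releasing resources.
    """
    if utilization > high:    # overloaded: acquire more processors
        return min(capacity + step, max_capacity)
    if utilization < low:     # underloaded: release processors
        return max(capacity - step, min_capacity)
    return capacity           # inside the dead zone: no change
```

The gap between the low and high thresholds acts as a dead zone that prevents oscillation; the minimum/maximum acquisition and release delays on the methodology slide (69/129 s and 18/23 s) would additionally rate-limit how often this step runs.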

  8. Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions

  9. System model • DAS-3 multi-cluster grid • A Global Resource Manager (GRM) with a global queue interacts with the Local Resource Managers (LRMs), each with its own local queue; global jobs arrive at the GRM, local jobs arrive at the LRMs. [Figure: the GRM's global queue feeding the local queues of the LRMs.]
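A minimal sketch of this two-level model, assuming FCFS queues and a load-based dispatch rule; the class and method names are illustrative, not the simulator's API.

```python
from collections import deque

class LRM:
    """Local Resource Manager: one cluster with its own queue (illustrative)."""
    def __init__(self, name, processors):
        self.name = name
        self.processors = processors  # total processors in the cluster
        self.busy = 0                 # processors currently in use
        self.queue = deque()          # local job queue (local + dispatched jobs)

    def free(self):
        return self.processors - self.busy

class GRM:
    """Global Resource Manager: holds the global queue and dispatches jobs."""
    def __init__(self, lrms):
        self.lrms = lrms
        self.global_queue = deque()

    def dispatch(self):
        # Send each queued global job to the cluster with the most free processors.
        while self.global_queue:
            job = self.global_queue.popleft()
            target = max(self.lrms, key=lambda lrm: lrm.free())
            target.queue.append(job)
```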

  10. Workload • Realistic workloads consisting of Bags-of-Tasks (BoTs) • Simulations use 10 workloads imposing an 80% system load • each workload has ~1650 BoTs and ~10K tasks • each workload spans between 1 day and 1 week • Real background load trace: the DAS-3 trace of June '08 (http://gwa.ewi.tudelft.nl/). (Distribution parameters are determined after a base-two log transformation.)

  11. Scheduling model • We consider the following BoT scheduling policies: • Static Scheduling: statically partitions tasks across clusters • Dynamic Scheduling: takes cluster load into account, either per task (Dynamic Per Task) or per BoT (Dynamic Per BoT) • Prediction-based Scheduling: predicts a task's runtime as the average of its last two runtimes and sends the task to the cluster that is predicted to yield the earliest completion time (ECT); a sketch follows.
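A minimal sketch of the prediction-based (ECT) policy: the runtime predictor follows the slide (average of the last two runtimes), while the per-cluster wait estimate and the data structures are illustrative assumptions.

```python
def predict_runtime(history, default=600.0):
    """Predict a task's runtime as the average of its last two observed
    runtimes; fall back to the last one, or a default, with less history."""
    if len(history) >= 2:
        return (history[-1] + history[-2]) / 2.0
    return history[-1] if history else default

def ect_schedule(history, clusters):
    """Pick the cluster with the earliest predicted completion time (ECT).

    clusters: list of (name, queued_work_seconds, free_processors) tuples,
              an illustrative stand-in for real LRM state.
    """
    runtime = predict_runtime(history)

    def completion_time(cluster):
        _, queued_work, free = cluster
        wait = 0.0 if free > 0 else queued_work  # crude wait-time estimate
        return wait + runtime

    return min(clusters, key=completion_time)[0]
```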

  12. Methodology • Compare the overprovisioned system with the initial system (NO) • For the Dynamic strategy: • minimum/maximum acquisition delays of 69/129 s and release delays of 18/23 s • 60%/70% low/high utilization thresholds • Κ varies over time, so for a fair comparison we keep it within a ±10% range.

  13. Traditional performance metrics • Makespan of a BoT: the difference between the earliest submission time of any of its tasks and the latest completion time of any of its tasks • Normalized Schedule Length (NSL) of a BoT: the ratio of its makespan to the sum of the runtimes of its tasks on a reference processor (a slowdown metric). [Figure: timeline from the first task submitted to the last task done, spanning the makespan.]
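Both metrics translate directly into code; a minimal sketch, assuming each task is represented as a (submit_time, finish_time, reference_runtime) triple (an illustrative representation):

```python
def makespan(tasks):
    """Makespan of a BoT: latest completion minus earliest submission."""
    earliest_submit = min(t[0] for t in tasks)
    latest_finish = max(t[1] for t in tasks)
    return latest_finish - earliest_submit

def nsl(tasks):
    """Normalized Schedule Length: makespan divided by the sum of the
    tasks' runtimes on a reference processor (a slowdown metric)."""
    return makespan(tasks) / sum(t[2] for t in tasks)
```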

  14. Consistency metrics • We define two metrics, Cd and Cs, to capture the notion of consistency across two dimensions • The system gets more consistent as Cd gets closer to 1 and Cs gets closer to 0 • A tighter range of the NSL is a sign of better consistency.
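The slide does not reproduce the formulas for Cd and Cs, so the sketch below uses a hypothetical stand-in, the interquartile range of per-BoT NSL values, purely to illustrate the last bullet (a tighter NSL range means better consistency); it is not the paper's definition.

```python
def nsl_iqr(nsl_values):
    """Hypothetical consistency proxy: interquartile range of NSL values.

    NOT the paper's Cd/Cs metrics (their formulas are not in this
    transcript); smaller values indicate more consistent performance.
    """
    ordered = sorted(nsl_values)
    n = len(ordered)
    return ordered[(3 * n) // 4] - ordered[n // 4]
```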

  15. Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions

  16. Performance of scheduling policies • Dynamic Per Task is the best • ECT is the worst. [Figure: performance comparison of the BoT scheduling policies.]

  17. Performance of different strategies [Figures: results for the different strategies and for different overprovisioning factors (Κ).] • The consistency obtained with overprovisioning is much better than that of the initial system (NO) • The static strategies provide similar performance (only Κ matters) • All and Largest are viable alternatives to Number, as Number increases the administration, installation, and maintenance costs • The Dynamic strategy performs better than the static strategies • Κ = 2.5 is the critical value.

  18. Cost of different strategies • We use CPU-Hours: the time a processor is used [h] • A partial instance-hour is rounded up to one hour, similar to the Amazon EC2 on-demand instance pricing model (sketched below) • A significant reduction in cost, as high as ~40%.
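A minimal sketch of this cost model; the per-processor session representation is an illustrative assumption.

```python
import math

def cpu_hours(sessions):
    """Total cost in CPU-Hours: each processor session is rounded up to a
    full hour, as in Amazon EC2 on-demand instance pricing.

    sessions: iterable of per-processor usage durations in seconds.
    """
    return sum(math.ceil(seconds / 3600.0) for seconds in sessions)

# Example: sessions of 30 min, 90 min, and 2 h cost 1 + 2 + 2 = 5 CPU-Hours.
assert cpu_hours([1800, 5400, 7200]) == 5
```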

  19. Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions

  20. Determining Κ dynamically • So far the system's perspective; now the user's perspective • How can we dynamically determine Κ given the user's performance requirements? • We use a simple feedback-control approach that deploys additional resources dynamically to meet the user's performance requirements; a controller sketch follows.
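A minimal sketch of such a feedback loop, assuming the controller periodically observes the average makespan of recently completed BoTs and nudges Κ to keep it inside the user-specified target range; the gain (step), the bounds on Κ, and the interface are illustrative assumptions, not the paper's controller.

```python
def adjust_k(current_k, observed_makespan, target_low, target_high,
             step=0.25, k_min=1.0, k_max=4.0):
    """One feedback-control step for the overprovisioning factor K.

    observed_makespan: average makespan of recently finished BoTs (minutes)
    target_low/high:   user-specified makespan requirement range (minutes)
    """
    if observed_makespan > target_high:   # too slow: provision more resources
        return min(current_k + step, k_max)
    if observed_makespan < target_low:    # faster than required: scale back
        return max(current_k - step, k_min)
    return current_k                      # requirement met: hold K steady
```

For the slide deck's tight [250m-300m] scenario, for instance, this loop keeps raising Κ until observed makespans drop into the 250-300 minute window, then holds it there.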

  21. Evaluation • Simulated DAS-3 without background load • a ~1.5-month workload consisting of ~33K BoTs • We show empirically that the controller stabilizes • The average makespan for the workload in the initial system (without the controller) is ~3120 minutes • Three scenarios, from tight to loose performance requirements: • [250m-300m] • [700m-750m] • [1000m-1250m]

  22. Results (I) • Significant improvement, as high as ~65%, when the performance requirements are tight • ~40%-50% improvement for loose performance requirements

  23. Results (II) [Figures: results for the three scenarios, [250m-300m], [700m-750m], and [1000m-1250m].]

  24. Conclusions GOAL-1: Realistic performance evaluation of different strategies • Overprovisioning improves performance consistency significantly • The static strategies provide similar performance (only Κ matters) • The Dynamic strategy performs better than the static strategies • The critical value of Κ needs to be determined to maximize the benefit of overprovisioning. GOAL-2: Dynamically determining Κ for given user performance requirements • A feedback-controlled system tunes Κ dynamically using historical performance data and the specified performance requirements • The number of BoTs meeting the performance requirements increases significantly, by as much as 65%, compared to the initial system.

  25. Thank you! Questions? Comments? M.N.Yigitbasi@tudelft.nl http://www.st.ewi.tudelft.nl/~nezih/ • More information: • Guard-g Project: http://guardg.st.ewi.tudelft.nl/ • PDS publication database: http://www.pds.twi.tudelft.nl
