
Power-Aware Parallel Job Scheduling


Presentation Transcript


  1. Power-Aware Parallel Job Scheduling Maja Etinski Julita Corbalan Jesus Labarta Mateo Valero {maja.etinski,julita.corbalan,jesus.labarta,mateo.valero}@bsc.es

  2. EEHiPC'10 Power Consumption of Supercomputing Systems • Striving for performance has led to enormous power dissipation in HPC centers (Top500 list) [chart: system power in kW]

  3. EEHiPC'10 Power Reduction Approaches in HPC
  • Application level:
  • runtime systems exploit specific application characteristics (load imbalance, communication-intensive regions)
  • they rely on very fine-grain DVFS within the application
  • System level:
  • turning off idle nodes: allocate resources so that more nodes are completely idle; determine the number of online nodes
  • operating-system power management via DVFS: Linux governors act per core and are unaware of the rest of the system
  • What about DVFS that takes the entire system workload into account?

  4. EEHiPC'10 Parallel Job Scheduling • The job scheduler has a global view of the whole system [diagram: job submission (job with its requirements) → wait queue of queued jobs → HPC job scheduler → resource manager]

  5. EEHiPC'10 DVFS and Job Scheduling [diagram: the same scheduling flow, extended with a power-aware component that assigns each job a CPU frequency based on goals/constraints]

  6. EEHiPC'10 Outline
  • Parallel job scheduling:
  • short introduction to parallel job scheduling
  • the EASY backfilling policy
  • Power and run-time modelling:
  • first we need to understand how frequency scaling affects CPU power dissipation and run time
  • Energy-saving parallel job scheduling policies:
  • utilization-driven power-aware scheduling [Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Utilization driven power-aware parallel job scheduling. Energy Aware High Performance Computing Conference, Hamburg, September 2010]
  • BSLD-driven power-aware scheduling [Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. BSLD threshold driven power management policy for HPC centers. IEEE Parallel and Distributed Processing Symposium, HPPAC Workshops 2010, Atlanta, GA, April 2010]
  • Power budgeting:
  • how to maximize job performance under a given power budget? [Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Optimizing job performance under a given power constraint in HPC centers. IEEE International Green Computing Conference, Chicago, IL, August 2010]

  7. EEHiPC'10 About Parallel Job Scheduling
  • Parallel job scheduling can be seen as finding a free rectangle (CPUs × time) for the job being scheduled:
  • the FCFS policy was used in the beginning
  • backfilling policies were introduced to improve system utilization
  • Job performance metrics:
  • Response time: WaitTime(J) + RunTime(J)
  • Slowdown: (WaitTime(J) + RunTime(J)) / RunTime(J)
  • Bounded slowdown: max((WaitTime(J) + RunTime(J)) / max(Th, RunTime(J)), 1)
  [diagram: jobs 1–6 packed as rectangles in the CPUs × time plane; a job's wait time precedes its run time]
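
The three metrics above can be written directly as functions. This is a minimal sketch; the threshold Th = 10 s is an assumed value (it is there to keep very short jobs from dominating the metric), not one stated on the slide.

```python
# Job performance metrics from the slide. th is the bounded-slowdown
# threshold (assumed 10 s here) that caps the influence of very short jobs.

def response_time(wait, run):
    return wait + run

def slowdown(wait, run):
    return (wait + run) / run

def bounded_slowdown(wait, run, th=10.0):
    return max((wait + run) / max(th, run), 1.0)

# A job that waited 60 s and ran 5 s: slowdown 13x, but bounded slowdown 6.5x.
print(slowdown(60, 5), bounded_slowdown(60, 5))
```

Note how bounding by Th tames the metric for the 5-second job, which is exactly why BSLD is preferred over plain slowdown throughout these slides.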

  8. EEHiPC'10 The EASY Backfilling Policy
  • Jobs are executed in FCFS order except when the first job in the wait queue cannot start
  • Users have to submit an estimate of the job's run time – the requested time
  • When the first job in the wait queue cannot start, a reservation is made for it based on the requested times of the running jobs
  • A job is executed before previously arrived ones only if it does not delay the first job in the wait queue
  [diagram: after Job 5 arrives, MakeJobReservation(Job5) reserves CPUs for it; after Job 6 arrives, BackfillJob(Job6) starts it ahead of Job 5 because it does not delay the reservation]
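
The rule above can be sketched as a one-shot scheduling pass. This is an assumed simplified model, not the Alvio implementation: a single CPU pool, requested times taken as exact run times, and no re-reservation after backfilling.

```python
# One-shot sketch of the EASY rule. Running jobs are (finish_time, cpus);
# queued jobs are (cpus, requested_time) in arrival order.

def easy_schedule(free_cpus, running, queue, now):
    started, queue = [], list(queue)
    while queue and queue[0][0] <= free_cpus:       # plain FCFS while the head fits
        job = queue.pop(0)
        free_cpus -= job[0]
        started.append(job)
    if not queue:
        return started, free_cpus
    # Reservation for the blocked head: walk running jobs by finish time
    # until enough CPUs would be free (the "shadow" time).
    head_cpus = queue[0][0]
    avail, shadow, extra = free_cpus, float("inf"), 0
    for finish, cpus in sorted(running):
        avail += cpus
        if avail >= head_cpus:
            shadow, extra = finish, avail - head_cpus
            break
    for job in queue[1:]:
        cpus, req = job
        if cpus <= free_cpus and now + req <= shadow:    # ends before the reservation
            started.append(job); free_cpus -= cpus
        elif cpus <= min(free_cpus, extra):              # never touches the head's CPUs
            started.append(job); free_cpus -= cpus; extra -= cpus
    return started, free_cpus

# 10 CPUs total: 6 busy until t=100, the head needs 8, so (4, 50) backfills.
print(easy_schedule(4, [(100, 6)], [(8, 120), (4, 50), (2, 200)], now=0))
```

The two backfill branches correspond to the slide's single condition "does not delay the first job": a candidate either finishes before the shadow time or fits in the CPUs the head will not need.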

  9. High-Level DVFS modelling

  10. EEHiPC'10 Power Model
  • CPU power is one of the main system power components
  • It consists of dynamic and static power:
  • Pcpu = Pdynamic + Pstatic, where Pdynamic = A·C·f·V² and Pstatic = α·V
  • The fraction of static power in total CPU power is a model parameter:
  • Pstatic(Vtop) = X · (Pstatic(Vtop) + Pdynamic(ftop, Vtop)) (X = 25% in our experiments)
  • The average activity factor is assumed to be the same for all jobs (2.5 times higher than the idle activity)
  • Idle processors: either consume no power, or consume power at the lowest frequency (two energy scenarios)
  • DVFS gear set: [table of frequency/voltage pairs]
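
The model above fixes α from the static-fraction constraint: Pstatic = X·(Pstatic + Pdynamic) at the top gear gives α·Vtop = X·Pdynamic(ftop, Vtop)/(1 − X). A minimal sketch, in which the A·C constant and the (frequency, voltage) gear pairs are hypothetical values chosen for illustration:

```python
# Sketch of the slide's CPU power model. AC (activity factor times switched
# capacitance) is an arbitrary assumed constant; alpha is calibrated so that
# static power is X = 25% of total power at the top gear.

def power_per_gear(gears, AC=1.0, X=0.25):
    f_top, v_top = gears[-1]
    p_dyn_top = AC * f_top * v_top ** 2
    # Pstatic = X * (Pstatic + Pdyn)  =>  alpha * Vtop = X * Pdyn / (1 - X)
    alpha = X * p_dyn_top / ((1.0 - X) * v_top)
    return {f: AC * f * v ** 2 + alpha * v for f, v in gears}

gears = [(1.4, 1.0), (1.8, 1.1), (2.0, 1.15), (2.3, 1.25)]  # assumed gear set
for f, p in sorted(power_per_gear(gears).items()):
    print(f"{f} GHz -> {p:.3f} model units")
```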

  11. EEHiPC'10 Time Model
  • Execution-time dependence on frequency is captured by the following model:
  • F(f, ß) = T(f) / T(ftop) = ß·(ftop/f − 1) + 1 [Hsu, Feng SC05: A Power-Aware Run-Time System for High-Performance Computing]
  • ß is assumed to follow a normal distribution that depends on job size:
  Number of CPUs | ß distribution
  ≤ 4            | N(0.5, 0.01)
  4 – 32         | N(0.4, 0.01)
  > 32           | N(0.3, 0.064)
  • The global application ß depends on the communication/computation ratio
  • Two ß scenarios:
  • ß is known in advance (at scheduling time)
  • ß is not known in advance (at scheduling time the worst case, ß = 1, is assumed)
  [plot: normalized run time versus frequency for ß = 0.3, 0.5, 0.7]
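
The scaling function and the per-job ß draw can be sketched as follows. Reading N(μ, 0.01) as mean/variance (so σ = 0.1) is an assumption, as is clamping ß to [0, 1]:

```python
import random

# Hsu & Feng run-time model from the slide:
# T(f) / T(f_top) = beta * (f_top / f - 1) + 1.
def time_factor(f, f_top, beta):
    return beta * (f_top / f - 1.0) + 1.0

# Beta drawn per job from the slide's size-dependent normal distributions,
# interpreting N(mu, var) as mean/variance (assumed) and clamping to [0, 1].
def sample_beta(ncpus, rng=random):
    if ncpus <= 4:
        mu, var = 0.5, 0.01
    elif ncpus <= 32:
        mu, var = 0.4, 0.01
    else:
        mu, var = 0.3, 0.064
    return min(1.0, max(0.0, rng.gauss(mu, var ** 0.5)))

# A fully CPU-bound job (beta = 1) slows down by exactly f_top / f;
# a fully communication-bound job (beta = 0) is unaffected by DVFS.
print(time_factor(1.4, 2.3, beta=1.0))
print(time_factor(1.4, 2.3, beta=0.0))
```

This makes the worst-case scenario on the slide concrete: assuming ß = 1 at scheduling time means assuming the largest possible run-time penalty from frequency reduction.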

  12. Energy Saving Parallel Job Scheduling Policies

  13. EEHiPC'10 Utilization-Driven Policy
  • A frequency is assigned once (at job start time) for the entire job execution, based on system utilization
  • Utilization Uk is computed for each interval of length T
  • WQthreshold gives additional control over system load:
  • if there are more than WQthreshold jobs in the wait queue, no frequency scaling is applied
  • otherwise, a job started during interval k runs at a frequency Fk determined by the previous interval's utilization Uk-1
  [plot: assigned frequency Fk (flower, fupper, or ftop) as a function of Uk-1 relative to Ulower and Uupper]
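
The mapping from utilization to frequency can be sketched as below. The step shape (flower below Ulower, fupper between the thresholds, ftop above Uupper) is an assumption read off the slide's plot; the GHz values match the evaluation slide.

```python
# Sketch of the utilization-driven frequency choice with the assumed
# step mapping and the wait-queue override.

def pick_frequency(utilization, wq_len, wq_threshold,
                   f_top=2.3, f_upper=2.0, f_lower=1.4,
                   u_lower=0.5, u_upper=0.8):
    if wq_len > wq_threshold:       # long wait queue: disable scaling
        return f_top
    if utilization < u_lower:
        return f_lower
    if utilization < u_upper:
        return f_upper
    return f_top

print(pick_frequency(0.35, wq_len=2, wq_threshold=4))    # lightly loaded
print(pick_frequency(0.35, wq_len=20, wq_threshold=4))   # queue too long
```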

  14. EEHiPC'10 Evaluation
  • The Alvio simulator, a C++ event-driven parallel job scheduling simulator, has been upgraded
  • Policy parameters:
  • utilization thresholds: Ulower = 50%, Uupper = 80%
  • reduced frequencies: flower = 1.4 GHz, fupper = 2.0 GHz
  • utilization computation interval: T = 10 min
  • wait queue length threshold: WQthreshold = 0, 4, 16, no limit
  • Metric of job performance – bounded slowdown (BSLD) evaluated at frequency f [formula]

  15. EEHiPC'10 Workloads
  • Five workloads from production use have been simulated:
  • Cornell Theory Center: large jobs with a relatively low level of parallelism
  • San Diego Supercomputing Center: fewer sequential jobs than CTC; similar run-time distribution
  • Lawrence Livermore National Lab: small to medium-size jobs
  • Lawrence Livermore National Lab: large parallel jobs
  • San Diego Supercomputing Center: no sequential jobs
  • Parallel Workloads Archive: http://www.cs.huji.ac.il/labs/parallel/workload

  16. EEHiPC'10 Results: Normalized CPU Energy
  • Short wait-queue thresholds: very similar results in both energy scenarios
  • Savings of up to 12% for workloads that are not highly loaded

  17. EEHiPC'10 Results: Normalized Performance
  • High performance penalty in the least conservative case for the highly loaded workload
  • The WQ threshold has almost no impact
  • An increase in the number of backfilled jobs is observed

  18. EEHiPC'10 Average frequency - SDSCBlue

  19. EEHiPC'10 BSLD-Driven Policy
  • The frequency is assigned based on the job's predicted performance
  • A lower frequency means a longer execution time, and thus a worse job performance metric
  • BSLDth controls the allowable performance penalty ("target BSLD")
  • To be run at a lower frequency f, a job has to satisfy the BSLD condition at f: if the job's predicted BSLD at frequency f is lower than BSLDth, then it satisfies the BSLD condition at frequency f
  [flowchart: for job Ji, if WQsize > WQthreshold run Ji at Ftop; otherwise start from f = Flowest, find an allocation Alloc, and while Alloc does not satisfy the BSLD condition and f < Ftop move to the next higher frequency; run Ji at the first frequency that satisfies the condition]
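
The selection loop can be sketched as follows. The predicted-BSLD formula is an assumption pieced together from the bounded-slowdown metric on slide 7 and the time model on slide 11 (run time scaled by ß·(ftop/f − 1) + 1, the nominal-frequency run time in the denominator, and an assumed Th = 10 s):

```python
# Sketch of the BSLD-driven frequency selection; formula assumed as
# described in the lead-in, not taken verbatim from the slides.

def predicted_bsld(wait, run_top, f, f_top=2.3, beta=1.0, th=10.0):
    run_f = run_top * (beta * (f_top / f - 1.0) + 1.0)
    return max((wait + run_f) / max(th, run_top), 1.0)

def assign_frequency(wait, run_top, gears, bsld_th, wq_len, wq_threshold,
                     f_top=2.3, beta=1.0):
    if wq_len > wq_threshold:          # queue too long: no DVFS
        return f_top
    for f in sorted(gears):            # lowest gear passing the condition wins
        if predicted_bsld(wait, run_top, f, f_top, beta) <= bsld_th:
            return f
    return f_top                       # no reduced gear qualifies

print(assign_frequency(wait=0, run_top=1000, gears=[1.4, 1.8, 2.0],
                       bsld_th=2.0, wq_len=1, wq_threshold=4))
```

Because ß = 1 is assumed when ß is unknown, the predicted run time is an upper bound, which makes the condition conservative.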

  20. EEHiPC'10 Results: Normalized CPU Energy
  • Normalized energies behave in the same way in both energy scenarios
  • Average savings in the most aggressive case: 5% – 23%
  • The per-workload difference in savings between the most conservative and the most aggressive threshold combinations ranges from 5% (SDSC) to 15% (LLNLThunder)
  • WQthreshold controls DVFS aggressiveness much better than BSLDthreshold
  • BSLDthreshold has a stronger impact when WQthreshold is higher

  21. EEHiPC'10 Average BSLD
  • Strong impact on performance in the most aggressive case
  • The impact of WQthreshold is higher than that of BSLDthreshold
  • BSLDthreshold has a stronger impact when WQthreshold is higher
  • The decrease in performance is proportional to the energy savings
  [chart: baseline average BSLD per workload: 1, 1.08, 4.66, 5.15, 24.91]

  22. EEHiPC'10 Reduced Jobs (out of 5000)
  • Performance depends on the number of frequency-reduced jobs, and on the frequencies used as well
  • Notably, the performance of jobs run at the nominal frequency was affected as well
  • When the load is very high (SDSC), no DVFS is applied (the thresholds would have to be set to higher values for it to apply)

  23. EEHiPC'10 Wait Time
  • Main problem observed: high impact on wait time
  [plot: zoom of the SDSCBlue wait-time behavior]

  24. Power-Budgeting Policy

  25. EEHiPC'10 PB-Guided Policy: How DVFS Can Improve Overall Job Performance
  • Frequency scaling carries a run-time penalty, but under a power budget it lets more jobs run simultaneously
  [diagram: the same jobs J1–J5 scheduled under a power budget without DVFS (jobs serialized at ftop to respect the budget) and with DVFS (jobs run concurrently at flower, finishing earlier overall)]

  26. EEHiPC'10 Power Budgeting: PB-Guided Policy
  • Frequency assignment is guided by the predicted job performance and the current power draw
  • BSLD is predicted when selecting the frequency
  • BSLD condition: a job satisfies the BSLD condition at reduced frequency f if its predicted BSLD at f is lower than the current value of the BSLD threshold
  • The policy is power conservative: a job will be scheduled at the lowest frequency at which both the BSLD condition and the power limit are satisfied
  • The closer the current power draw is to the PB limit, the higher the BSLD threshold; the higher the BSLD threshold, the lower the frequency that will be selected
  [plot: BSLD threshold as a function of Pcurrent between Plower and Pupper, relative to the power budget]
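
The power-dependent threshold can be sketched as below. That the threshold stays flat outside [Plower, Pupper] and ramps linearly in between is assumed from the plot; the BSLD values in the example are the CTC numbers from the evaluation slide (BSLDlower = 4.66, BSLDupper = 2·BSLDlower).

```python
# Sketch of the power-dependent BSLD threshold: flat at bsld_lower below
# p_lower (as a fraction of the budget), flat at bsld_upper above p_upper,
# with an assumed linear ramp in between.

def bsld_threshold(p_current, p_budget, bsld_lower, bsld_upper,
                   p_lower=0.6, p_upper=0.9):
    frac = p_current / p_budget
    if frac <= p_lower:
        return bsld_lower
    if frac >= p_upper:
        return bsld_upper
    t = (frac - p_lower) / (p_upper - p_lower)
    return bsld_lower + t * (bsld_upper - bsld_lower)

# CTC example: at 75% of the budget the threshold sits halfway up the ramp.
print(bsld_threshold(75.0, 100.0, 4.66, 9.32))
```

Raising the threshold as power approaches the budget is what makes the policy conservative: near the limit, larger predicted slowdowns are tolerated so that lower frequencies (and lower power) are selected.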

  27. EEHiPC'10 Power Budgeting: PB-Guided Policy
  • A job can be scheduled with one of two functions; the lowest frequency that satisfies it will be selected
  • BSLD condition as on the previous slide; power limit: the power budget must not be violated during the entire job execution

  MakeJobReservation(J)
  1: scheduled <-- false;
  2: shiftInTime <-- 0;
  3: nextFinishJob <-- next(OrderedRunningQueue);
  4: while (!scheduled) {
  5:   f <-- FlowestReduced;
  6:   while (f < Fnominal) {
  7:     Alloc = findAllocation(J, currentTime + shiftInTime, f);
  8:     if (satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f)) {
  9:       schedule(J, Alloc);
  10:      scheduled <-- true;
  11:      break; }
         f <-- nextHigherFrequency; }
  12:  if (f == Fnominal) {
  13:    Alloc = findAllocation(J, currentTime + shiftInTime, Fnominal);
  14:    if (satisfiesPowerLimit(Alloc, J, Fnominal)) {
  15:      schedule(J, Alloc);
  16:      break; } }
  17:  shiftInTime <-- FinishTime(nextFinishJob) - currentTime;
  18:  nextFinishJob <-- next(OrderedRunningQueue); }

  BackfillJob(J)
  1: f <-- Flowest;
  2: while (f < Fnominal) {
  3:   Alloc = TryToFindBackfilledAllocation(J, f);
  4:   if (correct(Alloc) and satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f)) {
  5:     schedule(J, Alloc);
  6:     break; }
  7:   f <-- nextHigherFrequency; }
  8: if (f == Fnominal) {
  9:   Alloc = TryToFindBackfilledAllocation(J, Fnominal);
  10:  if (correct(Alloc) and satisfiesPowerLimit(Alloc, J, Fnominal))
  11:    schedule(J, Alloc); }

  28. Evaluation
  • Four workloads from production use have been simulated:
  Workload – #CPUs   | Jobs (K) | Avg BSLD | Utilization | Over PB
  CTC – 430          | 20 – 25  | 4.66     | 70%         | 72%
  SDSC – 128         | 40 – 45  | 24.91    | 85%         | 95%
  SDSCBlue – 1152    | 20 – 25  | 5.15     | 69%         | 74%
  LLNLThunder – 4008 | 20 – 25  | 1        | 80%         | 89%
  • Policy parameters:
  • power budget thresholds: Plower = 0.6, Pupper = 0.9
  • BSLD threshold values used: BSLDlower = avg(BSLD) without power budgeting, BSLDupper = 2 · BSLDlower
  • Power budget: set to 80% of the total CPU power consumed by the whole system when running at Fnominal

  29. EEHiPC'10 Baseline Power-Budgeting Policy
  • Power limited, without DVFS:
  • no job will start if it would violate the budget, even though there are available processors
  • this case is equivalent to EASY scheduling on a smaller machine
  [diagram: as jobs 4–6 arrive, MakeJobReservation(Job5) and BackfillJob(Job6) proceed as in EASY, but Job 6 cannot start because of the power budget despite free CPUs]

  30. EEHiPC'10 Results: Performance
  • Oracle case: it is assumed that ß values are known at scheduling time
  • The PB-guided policy shows better performance for all workloads!
  • The average wait time decreases with DVFS under a power constraint

  31. EEHiPC'10 Results: Normalized CPU Energy (idle=0) • Oracle case: it is assumed that ß values are known at the scheduling time

  32. EEHiPC'10 Utilization Over Time

  33. EEHiPC'10 Power Budget Consumed

  34. EEHiPC'10 Comparison of Unknown and Known ß Avg. BSLD, Avg. WT, and Avg. Energy values are normalized with respect to the corresponding baseline values (EASY backfilling with a power limit and without DVFS)

  35. EEHiPC'10 Conclusions
  • The energy-performance trade-off must be handled carefully: DVFS affects not only job run time, but can also significantly increase job wait time and further degrade job performance
  • The trade-off needs to be made at the job scheduling level, as it affects jobs in the wait queue, and only the scheduler can estimate the potential negative impact on queued jobs
  • Applying DVFS to highly loaded workloads (SDSC) leads to a very high performance penalty
  • Parallel job scheduling policies can be designed to maximize job performance under a given power constraint
  • It has been shown that DVFS can improve performance in power-constrained HPC centers (using lower CPU frequencies allows more jobs to run simultaneously)
  • It is not necessary to know ß values in advance; moreover, assuming the worst case at scheduling time can give better performance than when they are known in advance

  36. EEHiPC'10 Thank you for your attention!
