Power-Aware Parallel Job Scheduling

Maja Etinski Julita Corbalan Jesus Labarta Mateo Valero

{maja.etinski,julita.corbalan,jesus.labarta,mateo.valero}@bsc.es

Power Consumption of Supercomputing Systems (EEHiPC'10)
  • Striving for performance has led to enormous power dissipation of HPC centers (Top500 list)

[Figure: power dissipation in kW]

Power Reduction Approaches in HPC

Application level:
  • Runtime systems:
    • exploit certain application characteristics (load imbalance, communication-intensive regions)
    • based on very fine-grained application of DVFS

System level:
  • Turning off idle nodes:
    • resource allocation such that there are more completely idle nodes
    • determining the number of online nodes
  • Operating system power management via DVFS:
    • Linux governors: per core, unaware of the rest of the system
  • DVFS taking into account the entire system workload?

Parallel Job Scheduling
  • Job scheduler has a global view of the whole system

[Diagram: job submission (a job with its requirements) -> HPC job scheduler (wait queue of queued jobs, job scheduling) -> resource manager]

DVFS and Job Scheduling

[Diagram: job submission (a job with its requirements) -> HPC job scheduler (wait queue of queued jobs, job scheduling) -> resource manager, extended with a power-aware component that assigns each job's CPU frequency based on goals/constraints]

Outline
  • Parallel job scheduling:
    • short introduction to parallel job scheduling
    • the EASY backfilling policy
  • Power and run time modelling:
    • first we need to understand how frequency scaling affects CPU power dissipation and runtime
  • Energy-saving parallel job scheduling policies:
    • Utilization-driven power-aware scheduling

[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Utilization driven power-aware parallel job scheduling. Energy Aware High Performance Computing Conference, Hamburg, September 2010]

    • BSLD-driven power-aware scheduling

[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. BSLD threshold driven power management policy for HPC centers. IEEE Parallel and Distributed Processing Symposium, HPPAC Workshops 2010, Atlanta, GA, April 2010]

  • Power-budgeting:
    • how to maximize job performance under a given power budget?

[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Optimizing job performance under a given power constraint in HPC centers. IEEE International Green Computing Conference, Chicago, IL, August 2010]

About Parallel Job Scheduling
  • Parallel job scheduling can be seen as finding a free rectangle in the CPUs × time plane for the job being scheduled
  • FCFS policy was used in the beginning
  • Backfilling policies were introduced to improve system utilization
  • Job performance metrics:
    • Response time: WaitTime(J) + RunTime(J)
    • Slowdown: (WaitTime(J) + RunTime(J)) / RunTime(J)
    • Bounded slowdown: max((WaitTime(J) + RunTime(J)) / max(Th, RunTime(J)), 1), where the threshold Th limits the influence of very short jobs

[Figure: Jobs 1-6 drawn as rectangles, with CPUs on the vertical axis and time on the horizontal axis; job performance covers the job's wait time plus its run time]
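
As a small illustration of the three metrics above, here is a minimal Python sketch. The function names and the 600-second threshold used in the usage lines are assumptions made for the example; the slides do not state the value of Th.

```python
def response_time(wait: float, run: float) -> float:
    """Response time: WaitTime(J) + RunTime(J)."""
    return wait + run


def slowdown(wait: float, run: float) -> float:
    """Slowdown: response time divided by run time."""
    return (wait + run) / run


def bounded_slowdown(wait: float, run: float, th: float) -> float:
    """Bounded slowdown: the run time in the denominator is floored at th,
    and the result is never allowed to drop below 1."""
    return max((wait + run) / max(th, run), 1.0)


# A one-minute job that waited five minutes (th = 600 s is an assumed value).
short_job_sld = slowdown(300.0, 60.0)
short_job_bsld = bounded_slowdown(300.0, 60.0, th=600.0)
print(short_job_sld, short_job_bsld)
```

The short job shows why the bounded variant exists: plain slowdown blows up for very short jobs, while the bounded version keeps them from dominating the average.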

The EASY Backfilling Policy
  • Jobs are executed in FCFS order except when the first job in the wait queue (WQ) cannot start
  • Users have to submit an estimate of the job's runtime (the requested time)
  • When the first job in the WQ cannot start, a reservation is made for it based on the requested times of the running jobs
  • A job is executed before previously arrived ones only if it does not delay the first job in the wait queue

[Figure: CPUs versus time with Jobs 1-6. Job 5 arrives and cannot start, so MakeJobReservation(Job5) reserves a later slot for it; Job 6 arrives afterwards and BackfillJob(Job6) starts it ahead of Job 5]
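
The rules above can be sketched in a few lines of Python. This is a simplified illustration, not the simulator's code: the `Job` class, the `easy_schedule` function and its argument layout are all assumptions made for the example, and it assumes the head job never asks for more CPUs than the machine has.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    requested: float  # user-supplied runtime estimate


def easy_schedule(now, total_cpus, running, queue):
    """One EASY scheduling pass. `running` holds (end_time, cpus) pairs based
    on requested times; `queue` is in arrival order. Returns started job names."""
    free = total_cpus - sum(cpus for _, cpus in running)
    running, queue, started = list(running), list(queue), []

    # FCFS phase: start jobs from the head of the queue while they fit.
    while queue and queue[0].cpus <= free:
        job = queue.pop(0)
        free -= job.cpus
        running.append((now + job.requested, job.cpus))
        started.append(job.name)
    if not queue:
        return started

    # Reservation for the head job: find the earliest time enough CPUs are free.
    head, avail, shadow = queue[0], free, now
    for end, cpus in sorted(running):
        avail += cpus
        if avail >= head.cpus:
            shadow = end
            break
    extra = avail - head.cpus  # CPUs the head job will not need at `shadow`

    # Backfill phase: a later job may start only if it cannot delay the head job.
    for job in queue[1:]:
        if job.cpus > free:
            continue
        ends_before_reservation = now + job.requested <= shadow
        if ends_before_reservation or job.cpus <= extra:
            free -= job.cpus
            if not ends_before_reservation:
                extra -= job.cpus
            started.append(job.name)
    return started


# 8 CPUs, 6 busy until t=100. Job 5 needs all 8; Job 6 is small and short.
print(easy_schedule(0, 8, [(100, 6)], [Job("J5", 8, 50), Job("J6", 2, 80)]))
```

In the usage line, Job 6 fits in the two idle CPUs and its requested time ends before Job 5's reservation, so it is backfilled without delaying Job 5.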

Power Model
  • CPU power is one of the main components of system power
  • It consists of dynamic and static power:
    • $P_{cpu} = P_{dynamic} + P_{static}$
    • $P_{dynamic} = A c f V^2$, $P_{static} = \alpha V$
  • The fraction of static power in total CPU power is a model parameter:
    • $P_{static}(V_{top}) = X \, (P_{static}(V_{top}) + P_{dynamic}(f_{top}, V_{top}))$, with $X = 25\%$ in our experiments
  • The average activity factor is assumed to be the same for all jobs (2.5 times higher than the idle activity)
  • Idle processors, two scenarios: they either consume no power or consume power at the lowest frequency
  • DVFS gear set: [table not preserved]
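
To make the model concrete, the sketch below solves the static-fraction relation for α and evaluates CPU power at two gears. The gear values (2.3 GHz at 1.5 V, 0.8 GHz at 1.0 V) are hypothetical, since the slide's gear table was not preserved, and the activity factor and capacitance are normalized to 1 for the example.

```python
def static_coefficient(f_top, v_top, x=0.25, activity=1.0, capacitance=1.0):
    """Solve P_static(V_top) = X * (P_static(V_top) + P_dynamic(f_top, V_top))
    for alpha, using P_static = alpha * V."""
    p_dyn_top = activity * capacitance * f_top * v_top ** 2
    p_static_top = x / (1.0 - x) * p_dyn_top
    return p_static_top / v_top


def cpu_power(f, v, alpha, activity=1.0, capacitance=1.0):
    """P_cpu = A*c*f*V^2 + alpha*V."""
    return activity * capacitance * f * v ** 2 + alpha * v


# Hypothetical gears (GHz, V); these numbers are NOT from the slides.
F_TOP, V_TOP = 2.3, 1.5
F_LOW, V_LOW = 0.8, 1.0

alpha = static_coefficient(F_TOP, V_TOP)
p_top = cpu_power(F_TOP, V_TOP, alpha)
p_low = cpu_power(F_LOW, V_LOW, alpha)
# Idle activity is 2.5 times lower than the activity of a running job.
p_idle_low = cpu_power(F_LOW, V_LOW, alpha, activity=1.0 / 2.5)

print(f"static share at top gear: {alpha * V_TOP / p_top:.2f}")
print(f"low gear / top gear power: {p_low / p_top:.2f}")
```

Only ratios between gears are meaningful here, because A and c are normalized; this matches how the results are later reported as normalized energy.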
Time Model
  • Execution time dependence on frequency is captured by the following model:

$F(f, \beta) = T(f) / T(f_{top}) = \beta \, (f_{top} / f - 1) + 1$

[Hsu, Feng SC05: A Power-Aware Run Time System for High-Performance Computing]

  • β is assumed to have the following normal distributions:

Number of CPUs            Distribution
Less than or equal to 4   N(0.5, 0.01)
Between 4 and 32          N(0.4, 0.01)
More than 32              N(0.3, 0.064)

  • Global application β depends on the communication/computation ratio
  • Two β scenarios:
    • β is known in advance (at the moment of scheduling)
    • β is not known in advance (at the moment of scheduling, the worst case, β = 1, is assumed)

[Figure: curves for β = 0.7, β = 0.5 and β = 0.3]
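
The time model translates directly into code. The sketch below is only an illustration; the function names and the frequencies in the usage lines are assumptions for the example.

```python
def runtime_factor(f: float, f_top: float, beta: float) -> float:
    """F(f, beta) = T(f) / T(f_top) = beta * (f_top / f - 1) + 1."""
    return beta * (f_top / f - 1.0) + 1.0


def scaled_runtime(t_top: float, f: float, f_top: float, beta: float) -> float:
    """Run time at frequency f, given the run time t_top at the top frequency."""
    return t_top * runtime_factor(f, f_top, beta)


# Halving the frequency of a 1000-second job for several beta values.
for beta in (0.3, 0.5, 0.7, 1.0):
    print(beta, scaled_runtime(1000.0, f=1.0, f_top=2.0, beta=beta))
```

With β = 1 the run time scales inversely with frequency, which is why it serves as the worst case when β is unknown; with β = 0 the run time does not depend on frequency at all.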

Utilization-Driven Policy
  • Frequency is assigned once (at job start time) for the entire job execution, based on system utilization
  • Utilization is computed for each interval T: [formula not preserved]
  • An additional control over system load, WQthreshold:
    • If there are more than WQthreshold jobs in the wait queue, no frequency scaling is applied
    • Otherwise, a job started during interval k runs at frequency Fk

[Figure: Fk as a step function of the previous interval's utilization Uk-1, with frequency levels flower, fupper and ftop and utilization thresholds Ulower and Uupper]
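
A possible reading of this policy as code is sketched below. The step mapping (flower below Ulower, fupper between the thresholds, ftop at or above Uupper) is inferred from the figure's axis labels, and the treatment of the exact boundary values is an assumption. The default thresholds and reduced frequencies are the ones listed on the Evaluation slide; the 2.3 GHz top frequency in the usage lines is hypothetical.

```python
from typing import Optional


def assign_frequency(
    prev_utilization: float,
    wq_len: int,
    f_top: float,
    f_lower: float = 1.4,
    f_upper: float = 2.0,
    u_lower: float = 0.5,
    u_upper: float = 0.8,
    wq_threshold: Optional[int] = None,  # None means "no limit"
) -> float:
    """Frequency for a job starting now, given the last interval's utilization."""
    # Too many queued jobs: no frequency scaling.
    if wq_threshold is not None and wq_len > wq_threshold:
        return f_top
    if prev_utilization < u_lower:
        return f_lower
    if prev_utilization < u_upper:
        return f_upper
    return f_top


F_TOP = 2.3  # hypothetical nominal frequency in GHz
print(assign_frequency(0.30, wq_len=0, f_top=F_TOP))
print(assign_frequency(0.65, wq_len=0, f_top=F_TOP))
print(assign_frequency(0.30, wq_len=5, f_top=F_TOP, wq_threshold=4))
```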

Evaluation
  • Alvio simulator: a C++ event-driven parallel job scheduling simulator, which has been upgraded
  • Policy parameters:
    • utilization thresholds: Ulower = 50%, Uupper = 80%
    • reduced frequencies: flower = 1.4 GHz, fupper = 2.0 GHz
    • utilization computation interval: T = 10 min
    • wait queue length threshold: WQthreshold = 0, 4, 16, no limit
  • Metric of job performance: Bounded Slowdown (BSLD)
  • BSLD at frequency f: [formula not preserved]

Workloads
  • Five workloads from production use have been simulated:
    • Cornell Theory Center: large jobs with a relatively low level of parallelism
    • San Diego Supercomputer Center: fewer sequential jobs than CTC, similar runtime distribution
    • Lawrence Livermore National Lab: small to medium-size jobs
    • Lawrence Livermore National Lab: large parallel jobs
    • San Diego Supercomputer Center: no sequential jobs
  • Source: Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload

Results: Normalized CPU Energy
  • Short wait queues
  • Very similar results for both energy scenarios
  • Savings of up to 12% for workloads that are not highly loaded

Results: Normalized Performance
  • High penalty in the least conservative case for the highly loaded workload
  • WQ threshold has almost no impact
  • An increase in the number of backfilled jobs

BSLD-Driven Policy
  • Frequency is assigned based on the job's predicted performance
  • Lower frequency -> longer execution time -> worse job performance metric
  • BSLDth controls the allowable performance penalty ("target BSLD")
  • To run at a lower frequency f, a job has to satisfy the BSLD condition at f: its predicted BSLD at frequency f must be lower than BSLDth
  • Predicted BSLD: [formula not preserved]

Frequency assignment for job Ji (flow chart):
  • If WQsize ≤ WQthreshold does not hold, run Ji at Ftop
  • Otherwise, start with f = Flowest
  • Find an allocation Alloc
  • If satisfiesBSLD(Alloc, Ji, f) holds or f = Ftop, run job Ji at frequency f
  • Otherwise, set f to the next higher frequency and repeat from the allocation step
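
The flow chart can be approximated by the sketch below. Two things are assumptions rather than content from the slides: the predicted-BSLD formula, which was lost in the transcript and is reconstructed here by combining the bounded-slowdown definition with the time model, and the simplification that the wait time is passed in as a number instead of being derived from a fresh allocation at each frequency. All function and parameter names are hypothetical.

```python
def predicted_bsld(wait, requested, f, f_top, beta, th):
    """ASSUMED form: bounded slowdown with the requested time stretched by
    the time-model factor beta * (f_top / f - 1) + 1."""
    factor = beta * (f_top / f - 1.0) + 1.0
    return max((wait + requested * factor) / max(th, requested), 1.0)


def bsld_driven_frequency(wait, requested, beta, gears, wq_size,
                          wq_threshold, bsld_th, th=600.0):
    """Lowest gear whose predicted BSLD stays below bsld_th, else the top gear."""
    gears = sorted(gears)
    f_top = gears[-1]
    # Wait queue too long: no DVFS for this job.
    if wq_size > wq_threshold:
        return f_top
    for f in gears[:-1]:
        if predicted_bsld(wait, requested, f, f_top, beta, th) < bsld_th:
            return f
    return f_top


# Hypothetical two-gear system (1.0 and 2.0 GHz), job with no wait so far.
print(bsld_driven_frequency(0.0, 1000.0, beta=0.5, gears=[1.0, 2.0],
                            wq_size=0, wq_threshold=4, bsld_th=2.0))
```

A tighter BSLDth or a longer wait queue pushes the job back to the top frequency, which is the behavior the two thresholds are meant to control.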

Results: Normalized CPU Energy
  • Normalized energies in the two energy scenarios behave in the same way
  • Average savings in the most aggressive case: 5% - 23%
  • The difference in savings per workload between the most conservative and the most aggressive threshold combinations ranges from 5% (SDSC) to 15% (LLNLThunder)
  • WQthreshold controls DVFS aggressiveness much better than BSLDthreshold
  • BSLDthreshold has a stronger impact when WQthreshold is higher

Average BSLD
  • Strong impact on performance in the most aggressive case
  • The impact of WQthreshold is higher than that of BSLDthreshold
  • BSLDthreshold has a stronger impact when WQthreshold is higher
  • The decrease in performance is proportional to energy savings

[Figure: average BSLD per workload; value labels 24.91, 1, 1.08, 5.15, 4.66]

Reduced Jobs (out of 5000)
  • Performance depends on the number of reduced jobs
  • It depends on the frequencies used as well
  • The performance of jobs run at the nominal frequency was observed to be affected as well
  • When load is very high (SDSC), no DVFS is applied (to apply it, the thresholds have to be set to higher values)

Wait Time
  • Main problem observed: high impact on wait time

[Figure: zoom of SDSCBlue wait time behavior]

PB-Guided Policy: How DVFS Can Improve Overall Job Performance

[Figure: two schedules of jobs J1-J5 under the same power budget, with time marks T1 and T2. No-DVFS case: jobs run at ftop, with a wait queue shown. DVFS case: jobs run at flower]

  • Penalty in run time due to frequency scaling, but more jobs can run simultaneously

Power Budgeting: PB-Guided Policy
  • Frequency assignment is guided by predicted job performance and current power draw
  • Prediction of BSLD when selecting frequency: [formula not preserved]
  • BSLD condition: a job satisfies the BSLD condition at reduced frequency f if its predicted BSLD at frequency f is lower than the current value of the BSLD threshold
  • The policy is power conservative:
    • A job will be scheduled at the lowest frequency at which both the BSLD condition and the power limit are satisfied
    • The closer to the PB limit, the higher the BSLD threshold
    • The higher the BSLD threshold, the lower the frequency that will be selected

[Figure: BSLD threshold as a function of Pcurrent, with marks at 0, Plower, Pupper and the power budget]
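
One way to picture a threshold that rises as power draw approaches the budget is the sketch below. The exact curve from the slide's figure was not preserved, so the shape used here (clamped to BSLDlower below Plower, clamped to BSLDupper above Pupper, linear in between) is purely an assumed placeholder. The only property taken from the slide is that the threshold does not decrease as Pcurrent grows. Treating Plower and Pupper as fractions of the power budget is also an assumption.

```python
def bsld_threshold(p_current, power_budget, bsld_lower, bsld_upper,
                   p_lower=0.6, p_upper=0.9):
    """ASSUMED shape: piecewise-linear, non-decreasing in the power draw."""
    load = p_current / power_budget  # fraction of the budget in use
    if load <= p_lower:
        return bsld_lower
    if load >= p_upper:
        return bsld_upper
    share = (load - p_lower) / (p_upper - p_lower)
    return bsld_lower + share * (bsld_upper - bsld_lower)


# Illustrative values only: BSLD_upper is set to twice BSLD_lower.
for draw in (50.0, 75.0, 95.0):
    print(draw, bsld_threshold(draw, 100.0, bsld_lower=4.0, bsld_upper=8.0))
```

A higher threshold lets more jobs pass the BSLD condition at low frequencies, so the policy scales frequencies down harder exactly when the system is close to its power budget.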

Power Budgeting: PB-Guided Policy
  • A job can be scheduled with one of the two functions, MakeJobReservation or BackfillJob:
    • BSLD condition: the lowest frequency that satisfies it will be selected
    • Power limit: the power budget must not be violated during the entire job execution

MakeJobReservation(J)
    scheduled <-- false
    shiftInTime <-- 0
    nextFinishJob <-- next(OrderedRunningQueue)
    while (!scheduled)
    {
        // try the reduced frequencies from lowest to highest
        f <-- FlowestReduced
        while (f < Fnominal)
        {
            Alloc <-- findAllocation(J, currentTime + shiftInTime, f)
            if (satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f))
            {
                schedule(J, Alloc)
                scheduled <-- true
                break
            }
            f <-- nextHigherFrequency
        }
        // no reduced frequency worked: try the nominal one (power limit only)
        if (!scheduled)
        {
            Alloc <-- findAllocation(J, currentTime + shiftInTime, Fnominal)
            if (satisfiesPowerLimit(Alloc, J, Fnominal))
            {
                schedule(J, Alloc)
                scheduled <-- true
            }
        }
        // still not scheduled: retry at the finish time of the next running job
        if (!scheduled)
        {
            shiftInTime <-- FinishTime(nextFinishJob) - currentTime
            nextFinishJob <-- next(OrderedRunningQueue)
        }
    }

BackfillJob(J)
    // try the reduced frequencies from lowest to highest
    f <-- Flowest
    while (f < Fnominal)
    {
        Alloc <-- TryToFindBackfilledAllocation(J, f)
        if (correct(Alloc) and satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f))
        {
            schedule(J, Alloc)
            break
        }
        f <-- nextHigherFrequency
    }
    // the loop ran out of reduced frequencies: try the nominal one
    if (f == Fnominal)
    {
        Alloc <-- TryToFindBackfilledAllocation(J, Fnominal)
        if (correct(Alloc) and satisfiesPowerLimit(Alloc, J, Fnominal))
            schedule(J, Alloc)
    }
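
The frequency search shared by both functions can be isolated as a small Python helper. This is a sketch under stated assumptions: `pick_frequency` and its two callback parameters are hypothetical names, the callbacks stand in for satisfiesBSLD and satisfiesPowerLimit evaluated on a given allocation, and the gear values in the usage lines are invented.

```python
from typing import Callable, Optional, Sequence


def pick_frequency(
    gears: Sequence[float],
    satisfies_bsld: Callable[[float], bool],
    satisfies_power: Callable[[float], bool],
) -> Optional[float]:
    """Lowest gear passing both checks; the nominal (highest) gear only has
    to pass the power check. Returns None if the job cannot start now."""
    *reduced, nominal = sorted(gears)
    for f in reduced:
        if satisfies_bsld(f) and satisfies_power(f):
            return f
    return nominal if satisfies_power(nominal) else None


GEARS = [0.8, 1.4, 2.0, 2.3]  # hypothetical gear set in GHz

# BSLD condition holds from 1.4 GHz up; power limit allows up to 2.0 GHz.
print(pick_frequency(GEARS, lambda f: f >= 1.4, lambda f: f <= 2.0))
```

Returning None corresponds to the case where MakeJobReservation has to shift the start time to the next job completion, or where BackfillJob simply leaves the job in the queue.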

Evaluation
  • Policy parameters:
    • Power budget thresholds: Plower = 0.6, Pupper = 0.9
    • BSLD threshold values which have been used: BSLDlower = avg(BSLD) without power budgeting, BSLDupper = 2 * BSLDlower
  • Power budget:
    • set to 80% of the total CPU power consumed by the whole system when running at Fnominal
  • Four workloads from production use have been simulated:

Workload      CPUs   Jobs (K)   Avg BSLD   Utilization   Over PB
CTC           430    20 - 25    4.66       70%           72%
SDSC          128    40 - 45    24.91      85%           95%
SDSCBlue      1152   20 - 25    5.15       69%           74%
LLNLThunder   4008   20 - 25    1          80%           89%
Baseline Power Budgeting Policy
  • Power limited without DVFS:
    • No job will start if it would violate the budget, even though there are available processors
  • This case is equivalent to EASY scheduling on a smaller machine

[Figure: CPUs versus time with Jobs 1-6, showing the arrivals of Jobs 4, 5 and 6, MakeJobReservation(Job5) and BackfillJob(Job6); one job is annotated "cannot start because of power budget"]

Results: Performance
  • Oracle case: β values are assumed to be known at scheduling time
  • PB-guided policy shows better performance for all workloads!
  • Average wait time decreases with DVFS under the power constraint

Results: Normalized CPU Energy (idle=0)
  • Oracle case: β values are assumed to be known at scheduling time

Comparison of Unknown and Known β
  • Avg. BSLD, Avg. WT and Avg. Energy values are normalized with respect to the corresponding baseline values (EASY backfilling with power limit and without DVFS)

Conclusions

  • The energy-performance trade-off must be made carefully: DVFS affects not only job runtime but can also significantly increase job wait time, which further degrades job performance
  • The performance-energy trade-off needs to be made at the job scheduling level, because it affects jobs in the wait queue and only the scheduler can estimate the potential negative impact on queued jobs
  • Applying DVFS to highly loaded workloads (SDSC) leads to a very high performance penalty
  • Parallel job scheduling policies can be designed to maximize job performance under a given power constraint
  • DVFS has been shown to improve performance in power-constrained HPC centers (using lower CPU frequencies allows more jobs to run simultaneously)
  • It is not necessary to know β values in advance; moreover, assuming the worst case at scheduling time can give better performance than knowing them in advance
Thank you for your attention!
