
Rescheduling

Sathish Vadhiyar


Rescheduling Motivation

  • Heterogeneity and contention can cause an application’s performance to vary over time

  • Rescheduling decisions in response to changes in resource performance are necessary, triggered by:

    • Performance degradation of the running application

    • Availability of “better” resources


Modeling the Cost of Redistribution

  • Cthreshold depends on:

    • Model accuracy

    • Load dynamics of the system



Redistribution Cost Model for Jacobi 2D

  • Emax – average iteration time of the processor that is farthest behind

  • Cdev – processor performance deviation variable
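Taken together with Cthreshold from the previous slide, these quantities suggest a decision rule of the following shape. This is a hedged sketch for illustration only: the function and parameter names, and the exact way Cdev enters the comparison, are assumptions, since the slides only name the variables.

    #include <stdbool.h>

    /* Illustrative sketch of a redistribution decision built from the
       quantities named on these slides (Emax, Cdev, Cthreshold). The
       combination below is an assumption for exposition. */
    bool should_redistribute(double e_max,       /* avg. iteration time of the slowest proc */
                             double e_balanced,  /* predicted iteration time after redistribution */
                             double c_dev,       /* processor performance deviation */
                             long remaining_iters,
                             double redist_cost, /* predicted cost of moving the data */
                             double c_threshold) /* tolerance for model inaccuracy and load dynamics */
    {
        /* Predicted saving over the rest of the run, minus the cost of
           the redistribution itself. */
        double savings = (e_max - e_balanced) * remaining_iters - redist_cost;

        /* Redistribute only when the expected gain clearly exceeds the
           threshold; a large c_dev means noisy measurements, so widen
           the margin. */
        return savings > c_threshold + c_dev;
    }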



Experiments

  • 8 processors were used

  • A loading event consisting of a parallel program was introduced 3 minutes after Jacobi started

  • The number of tasks in the loading event was varied

  • Cthreshold – 15 seconds



Malleable Jobs

  • Parallel Jobs

    • Rigid – runs on exactly one set of processors

    • Moldable – processor count is flexible at job start, but cannot be reconfigured during execution

    • Malleable – flexible at job start as well as during execution


Rescheduling in GrADS

  • Performance-oriented migration framework

  • Tightly coupled policies for suspension and migration

  • Takes into account load characteristics and remaining execution times

  • Migration of an application depends on:

    • The amount of increase or decrease in load on the system

    • The point in the application’s execution at which the load is introduced into the system

    • The performance benefit that can be obtained from migration

Components:

  • Migrator

  • Contract Monitor

  • Rescheduler


SRS Checkpointing Library

  • End application instrumented with user-level checkpointing library

  • Enables reconfiguration of executing applications across distinct domains

  • Allows fault tolerance

  • Uses IBP (Internet Backplane Protocol) for storage and retrieval of checkpoints

  • Needs Runtime Support System (RSS) – an auxiliary daemon that is started with the parallel application

  • Simple API

    - SRS_Init()

    - SRS_Restart_Value()

    - SRS_Register()

    - SRS_Check_Stop()

    - SRS_Read()

    - SRS_Finish()

    - SRS_StoreMap(), SRS_DistributeFunc_Create(), SRS_DistributeMap_Create()


SRS Internals

[Diagram: the MPI application, linked with the SRS library, polls the Runtime Support System (RSS) for a STOP signal. On STOP, checkpoint data is stored to IBP depots; on restart, the application reads the checkpoints back from IBP, with possible redistribution across the new processor configuration.]


SRS API

Original code (pseudocode skeleton):

    /* begin code */
    MPI_Init()
    /* initialize data */
    loop {
    }
    MPI_Finalize()

SRS instrumented code (pseudocode skeleton):

    /* begin code */
    MPI_Init()
    SRS_Init()
    restart_value = SRS_Restart_Value()
    if (restart_value == 0) {
        /* initialize data */
    }
    else {
        SRS_Read("data", data, BLOCK, NULL)
    }
    SRS_Register("data", data, SRS_INT, data_size, BLOCK, NULL)
    loop {
        stop_value = SRS_Check_Stop()
        if (stop_value == 1) {
            exit()
        }
    }
    SRS_Finish()
    MPI_Finalize()


SRS Example – Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size / size;

    if (rank == 0) {
        /* root initializes the global array */
        for (i = 0; i < global_size; i++) {
            global_A[i] = i;
        }
    }
    MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);

    iter_start = 0;
    for (i = iter_start; i < global_size; i++) {
        proc_number = i / local_size;
        local_index = i % local_size;
        if (rank == proc_number) {
            /* element i is updated by the process that owns it */
            local_A[local_index] += 10;
        }
    }
    MPI_Finalize();


SRS Example – Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size / size;

    restart_value = SRS_Restart_Value();
    if (restart_value == 0) {
        /* fresh start: initialize and distribute the data */
        if (rank == 0) {
            for (i = 0; i < global_size; i++) {
                global_A[i] = i;
            }
        }
        MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
        iter_start = 0;
    }
    else {
        /* restarted run: recover the checkpointed data and iterator */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }

    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);


SRS Example – Modified Code (Continued)

    for (i = iter_start; i < global_size; i++) {
        stop_value = SRS_Check_Stop();
        if (stop_value == 1) {
            /* a stop was requested: registered data has been
               checkpointed, so shut down cleanly */
            MPI_Finalize();
            exit(0);
        }
        proc_number = i / local_size;
        local_index = i % local_size;
        if (rank == proc_number) {
            local_A[local_index] += 10;
        }
    }
    SRS_Finish();
    MPI_Finalize();


Components (Continued)

Contract Monitor:

  • Monitors the progress of the end application

  • Tolerance limits specified to the contract monitor

    • Upper contract limit – 2.0

    • Lower contract limit – 0.7

  • When it receives the actual execution time for an iteration from the application, it:

    • calculates the ratio between the actual and predicted times

    • adds it to the running average ratio

    • adds it to last_5_avg, the average over the last five iterations

A sketch of this update follows.
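A minimal sketch of that per-iteration update, assuming a zero-initialized state struct and a five-entry ring buffer; the names are illustrative, not the GrADS contract monitor's actual code:

    #define WINDOW 5

    /* Illustrative contract-monitor state (assumed zero-initialized). */
    typedef struct {
        double predicted_time;   /* per-iteration time promised by the contract */
        double ratio_sum;        /* running sum of actual/predicted ratios */
        long count;              /* iterations observed so far */
        double window[WINDOW];   /* ring buffer of the most recent ratios */
        double avg_ratio;        /* running average ratio */
        double last_5_avg;       /* average of the last five ratios */
    } monitor_t;

    void monitor_update(monitor_t *m, double actual_time)
    {
        double ratio = actual_time / m->predicted_time;

        m->window[m->count % WINDOW] = ratio;
        m->count++;

        /* Running average over all iterations so far. */
        m->ratio_sum += ratio;
        m->avg_ratio = m->ratio_sum / m->count;

        /* Average over the last five (or fewer) iterations. */
        long n = m->count < WINDOW ? m->count : WINDOW;
        double sum = 0.0;
        for (long k = 0; k < n; k++)
            sum += m->window[k];
        m->last_5_avg = sum / n;
    }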


Contract Monitor

  • If average ratio > upper contract limit

    • Contact rescheduler

    • Request for rescheduling

    • Receive reply

    • If the reply is “SORRY. CANNOT RESCHEDULE”

      • Calculate new_predicted_time based on last_5_avg and orig_predicted_time

      • Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit

      • Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit

      • prev_predicted_time = new_predicted_time


Contract Monitor

  • If average ratio < lower contract limit

    • Calculate new_predicted_time based on last_5_avg and orig_predicted_time

    • Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit

    • Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit

    • prev_predicted_time = new_predicted_time
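The slides give only the inputs of this adjustment, not the formula. One plausible reading, sketched below purely as an assumption, rescales the limits proportionally so that the same absolute iteration times keep triggering them under the new prediction:

    /* Hedged sketch of contract-limit adjustment; the proportional
       rescaling is one plausible interpretation of the slides, not the
       published algorithm. */
    void adjust_limits(double last_5_avg,
                       double orig_predicted_time,
                       double *prev_predicted_time,
                       double *upper_contract_limit,
                       double *lower_contract_limit)
    {
        /* Fold recent behavior into a new per-iteration prediction. */
        double new_predicted_time = last_5_avg * orig_predicted_time;

        /* Rescale both limits so the same absolute iteration times
           still trigger them under the new prediction. */
        double scale = *prev_predicted_time / new_predicted_time;
        *upper_contract_limit *= scale;
        *lower_contract_limit *= scale;

        *prev_predicted_time = new_predicted_time;
    }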


Rescheduler

  • A metascheduling service

  • Operates in 2 modes:

    • When the contract monitor requests rescheduling, i.e. during performance degradation

    • Periodically queries the Database Manager for recently completed GrADS applications and migrates executing applications to make use of the freed resources, i.e. opportunistic rescheduling (a decision sketch follows)
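In both modes the core decision is the same comparison; a minimal sketch, with all names assumed for illustration:

    /* Migrate only if the predicted remaining time on the candidate
       resources, plus the cost of stopping, redistributing, and
       restarting, beats staying put. Names are illustrative. */
    int should_migrate(double remaining_on_current,
                       double remaining_on_candidate,
                       double rescheduling_cost)
    {
        return remaining_on_candidate + rescheduling_cost
               < remaining_on_current;
    }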





Application and Metascheduler Interactions

[Flowchart: the user submits problem parameters to Resource Selection, which produces an initial list of machines. Permission is requested from the Permission Service; if permission is denied, the run is aborted. Otherwise Application-Specific Scheduling produces an application-specific schedule, and the Contract Negotiator performs contract development; if the contract is not approved, new resource information is obtained and scheduling is repeated. The problem parameters and final schedule are passed to Application Launching. If the application completes, it exits; if it was stopped, it waits for a restart signal, obtains new resource information, and the cycle repeats.]


Rescheduler Architecture

[Diagram: the Application Manager launches the application and checks for completion; if the application was stopped rather than completed, it waits for a restart signal. The running application reports execution times to the Contract Monitor, which sends requests for migration to the Rescheduler. The Rescheduler stores STOP and RESUME signals through the Database Manager; the application's Runtime Support System (RSS) queries for the STOP signal.]

Experiments and Results – Rescheduling on Request

  • Different problem sizes of ScaLAPACK QR

  • msc – fast machines; opus – slow machines

  • Initial set of resources consisted of 4 msc and 8 opus machines

  • The performance model always chose the 4 msc machines for the application run

  • 5 minutes into the application run, an artificial load was introduced on the 4 msc machines

  • The application migrated from UT to UIUC

[Chart: execution times with and without rescheduling for each problem size. The rescheduler decided not to reschedule for size 8000 – a wrong decision.]


Rescheduling Depending on Amount of Load

  • ScaLAPACK QR problem size – 12000

  • Load introduced 20 minutes after application start

  • The amount of load was varied

[Chart: execution times with and without rescheduling for each load level. In one case the rescheduler decided not to reschedule – a wrong decision.]


Rescheduling Depending on Load Introduction Time

  • ScaLAPACK QR problem size – 12000

  • The same load introduced at different points of application execution

[Chart: execution times with and without rescheduling for each introduction time. In one case the rescheduler decided not to reschedule – a wrong decision.]


Experiments and Results – Opportunistic Rescheduling

  • Two problems:

    - 1st problem, size 14000, executing on 6 msc machines

    - 2nd problem of varying sizes

  • The 2nd problem was introduced 2 minutes after the start of the 1st problem

  • The initial set of resources for the 2nd problem consisted of 6 msc machines and 2 opus machines

  • Due to the presence of the 1st problem, the 2nd problem had to use both the msc and opus machines, hence involving Internet bandwidth

  • After the 1st problem completes, the 2nd problem can be rescheduled to use only the msc machines

[Chart: execution times of the 2nd problem with and without rescheduling, alongside the large problem.]


Dynamic Prediction of Rescheduling Cost

  • During a rescheduling decision, the rescheduler contacts the RSS and obtains the current distribution of the application’s data

  • Forms old and new data maps

  • Based on the maps and current NWS information, predicts the redistribution cost (a sketch follows)
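A sketch of that prediction under simple assumptions: the data maps reduce to a matrix of bytes to move between old and new processors, each source sends sequentially, and transfers from different sources overlap. The types and names are illustrative, not the GrADS implementation:

    #define NPROCS 64

    /* bytes[i][j]: data that must move from old processor i to new
       processor j, derived from the old and new data maps (assumed
       representation). */
    typedef struct {
        long bytes[NPROCS][NPROCS];
    } redist_map_t;

    /* bw[i][j]: NWS-measured bandwidth (bytes/sec) from host i to host j. */
    double predict_redist_cost(const redist_map_t *map,
                               const double bw[NPROCS][NPROCS],
                               int nprocs)
    {
        double worst = 0.0;
        for (int i = 0; i < nprocs; i++) {
            double t = 0.0;
            for (int j = 0; j < nprocs; j++)
                if (i != j && map->bytes[i][j] > 0)
                    t += map->bytes[i][j] / bw[i][j];
            /* Overall cost: the slowest source's total transfer time. */
            if (t > worst)
                worst = t;
        }
        return worst;
    }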


Dynamic Prediction of Rescheduling Cost (Continued)

[Chart: redistribution cost for an application started on 4 mscs and restarted on 8 opus machines.]


References / Sources / Credits

  • Gary Shao, Rich Wolski and Fran Berman. “Predicting the Cost of Redistribution in Scheduling”. Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing.

  • Vadhiyar, S. and Dongarra, J. “Performance Oriented Migration Framework for the Grid”. Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), pp. 130-137, May 2003, Tokyo, Japan.

  • L. V. Kale, Sameer Kumar, and J. DeSouza. “A Malleable-Job System for Timeshared Parallel Machines”. 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 21-24, 2002, Berlin, Germany.

  • See the Cactus migration thorn

  • See opportunistic migration by Huedo



GridWay

  • Migration:

    • When performance degradation happens

    • When “better” resources are discovered

    • When requirements change

    • Owner decision

    • Remote resource failure

  • Rescheduling is performed at each discovery interval

  • A performance degradation evaluator program is executed at each monitoring interval (a sketch follows)
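What such an evaluator might look like is sketched below, assuming the application appends per-iteration times to a progress file; the file name, its format, and the 1.5x threshold are illustrative assumptions, not GridWay's actual interface:

    #include <stdio.h>

    /* Hedged sketch of a performance degradation evaluator. GridWay
       treats this as an application-specific, pluggable program;
       everything below (progress.log, its format, the 1.5x threshold)
       is assumed for illustration. */
    int main(void)
    {
        double t, last = 0.0, sum = 0.0;
        long n = 0;

        FILE *f = fopen("progress.log", "r");
        if (!f)
            return 0;             /* no data yet: no verdict */
        while (fscanf(f, "%lf", &t) == 1) {
            last = t;             /* most recent per-iteration time */
            sum += t;
            n++;
        }
        fclose(f);

        /* Flag degradation when the latest iteration is much slower
           than the run's average so far. */
        if (n > 1 && last > 1.5 * (sum / n)) {
            printf("DEGRADATION\n");
            return 1;             /* nonzero exit: request migration */
        }
        return 0;
    }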


GridWay (Continued)

  • Components

    • Request manager

    • Dispatch manager

    • Submission manager – handles prolog, submission, cancellation, and epilog

    • Performance monitor

  • Application specific components

    • Resource selector

    • Performance degradation evaluator

    • Prolog

    • Wrapper

    • Epilog


Opportunistic Job Migration

  • Factors

    • Performance of new host

    • Remaining execution time of application

    • Proximity of new resource to the needed data


Dynamic Space Sharing on Clusters of Non-Dedicated Workstations (Chowdhury et al.)

  • Dynamic reconfiguration – an application-level approach for dynamic reconfiguration of grid-based iterative applications


SRS Overhead

Worst case overhead – 15%

Worst case SRS overhead of all results – 36%


SRS Data Redistribution Cost

Started on – 8 MSCs

Restarted on – 8 OPUS, 2 MSCs


Modified GrADS Architecture

[Diagram: components include the User, the Grid Routine / Application Manager, the Resource Selector (backed by MDS and NWS), the Permission Service, the App Launcher, the Contract Developer, the Database Manager, the Performance Modeler, the RSS, the Application, the Contract Monitor, the Contract Negotiator, and the Rescheduler.]


Another Approach: AMPI

  • AMPI – an MPI implementation on top of Charm++

  • Processes are implemented as user-level threads

  • Charm++ provides a load balancing framework that migrates threads

  • The load balancing framework accepts a processor map

  • The parallel job is started on all processors in the system

  • Work is allocated only to processors in the processor map, i.e. threads/objects are assigned only to processors in the processor map


Rescheduling

  • When the processor map changes:

    • Threads are migrated to the new set of processors in the processor map

    • Skeleton processes are left behind on the vacated processors

    • A skeleton forwards messages to the threads/objects previously housed on that processor

  • The new processor map is conveyed to the load balancing framework by the adaptive job scheduler


Overhead

  • Shrink or expand time depends on:

    • the per-process data that has to be transferred

    • the number of processors involved


Cost of Skeleton Process


CPU Utilization by 2 Jobs


Adaptive Job Scheduler

  • A variant of the dynamic equipartitioning strategy

  • Each job specifies the min. and max. number of processors it can run on

  • The scheduler recalculates the number of processors assigned to each running job

  • Running jobs and the new job are first assigned their minimum requirement

  • The leftover processors are divided equally among all the jobs

  • The new job is assigned to a queue if it cannot be allocated its minimum requirement (a sketch of the recalculation follows)
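A self-contained sketch of that recalculation; the two-pass structure follows the bullets above, while the round-robin hand-out of leftovers is an assumed tie-breaking detail:

    typedef struct { int minpe, maxpe, alloc; } job_t;

    /* Returns 0 on success; -1 if even the minimum requirements do not
       fit, in which case the new job should be queued instead. */
    int equipartition(job_t jobs[], int njobs, int nprocs)
    {
        int used = 0;

        /* Pass 1: every job gets its minimum requirement. */
        for (int i = 0; i < njobs; i++) {
            jobs[i].alloc = jobs[i].minpe;
            used += jobs[i].minpe;
        }
        if (used > nprocs)
            return -1;

        /* Pass 2: divide leftover processors equally, one at a time,
           respecting each job's maximum. */
        int left = nprocs - used;
        int progress = 1;
        while (left > 0 && progress) {
            progress = 0;
            for (int i = 0; i < njobs && left > 0; i++) {
                if (jobs[i].alloc < jobs[i].maxpe) {
                    jobs[i].alloc++;
                    left--;
                    progress = 1;
                }
            }
        }
        return 0;
    }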


Scheduling

  • The same strategy is followed when jobs complete

  • The scheduler conveys its decision to the jobs as a bit-vector (see the sketch below)

  • Jobs then perform thread migration
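How a job might interpret that bit-vector is sketched below; the one-bit-per-processor encoding is an assumption for illustration:

    #include <stdio.h>

    /* Assumed encoding: bit p of the vector is 1 if processor p is in
       this job's new processor map. */
    void apply_map(const unsigned char bitvec[], int nprocs)
    {
        for (int p = 0; p < nprocs; p++) {
            int in_map = (bitvec[p / 8] >> (p % 8)) & 1;
            if (in_map)
                printf("proc %d: in map - migrate threads here\n", p);
            else
                printf("proc %d: vacated - leave a skeleton process\n", p);
        }
    }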


Experiments

  • 32-processor Linux cluster

  • Job arrivals follow a Poisson process

  • Each job – a molecular dynamics (MD) program with 50,000 atoms, run for a different number of iterations

  • Number of iterations exponentially distributed

  • Minimum number of procs., minpe – uniformly distributed between 1 and 64

  • maxpe – 64

  • Each experiment – 50 job arrivals


Results

Load factor – mean arrival rate × (execution time on 64 processors)
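For example, under this definition a mean arrival rate of 0.5 jobs per hour and a 64-processor execution time of 1 hour give a load factor of 0.5 (hypothetical numbers, for illustration only).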


Dynamic Reconfiguration

  • Ability to change the number of processors during execution

  • Condor-like environment

    • Respect ownerships of workstations

    • Provide high performance for parallel applications

  • Dynamic reconfiguration also provides high throughput for the system