
Rescheduling

Sathish Vadhiyar

Presentation Transcript


  1. Rescheduling Sathish Vadhiyar

  2. Rescheduling Motivation • Heterogeneity and contention can cause an application's performance to vary over time • Rescheduling decisions in response to changes in resource performance are necessary • Rescheduling is triggered by performance degradation of the running application or by the availability of "better" resources

  3. Modeling the Cost of Redistribution • Cthreshold – the threshold the predicted benefit of redistribution must exceed before rescheduling is triggered • Cthreshold depends on: • Model accuracy • Load dynamics of the system

  4. Modeling the Cost of Redistribution

  5. Redistribution Cost Model for Jacobi 2D • Emax – average iteration time of the processor that is farthest behind • Cdev – processor performance deviation variable
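
The decision rule implied by the model on the preceding slides can be sketched as follows. The function names, the benefit formula (per-iteration saving times remaining iterations), and the sample numbers are illustrative assumptions, not the published model; Cthreshold acts here as the safety margin covering model accuracy and load dynamics.

    /* Sketch of a redistribution decision based on the model above.
     * All names and the exact benefit formula are illustrative assumptions. */
    #include <stdio.h>

    /* Decide whether redistributing work is expected to pay off.
     * e_max:        current average iteration time of the slowest processor
     * e_new:        predicted per-iteration time after redistribution
     * iters_left:   remaining iterations
     * c_redist:     predicted one-time cost of moving the data
     * c_threshold:  margin covering model inaccuracy and load dynamics */
    int should_redistribute(double e_max, double e_new, long iters_left,
                            double c_redist, double c_threshold)
    {
        double benefit = (e_max - e_new) * (double)iters_left;
        return benefit > c_redist + c_threshold;
    }

    int main(void)
    {
        /* Example: slowest processor takes 0.9 s/iter now, 0.5 s/iter after
         * redistribution, 400 iterations remain, moving the data costs 60 s. */
        if (should_redistribute(0.9, 0.5, 400, 60.0, 15.0))
            printf("redistribute\n");   /* 160 s saved > 75 s total cost */
        else
            printf("stay put\n");
        return 0;
    }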

  6. Redistribution Cost Model for Jacobi 2D

  7. Experiments • 8 processors were used • A loading event consisting of a parallel program was introduced 3 minutes after Jacobi started • The number of tasks in the loading event was varied • Cthreshold – 15 seconds

  8. Results

  9. Malleable Jobs • Parallel jobs fall into three classes: • Rigid – run on exactly one set of processors • Moldable – flexible at job start, but cannot be reconfigured during execution • Malleable – flexible at job start as well as during execution

  10. Rescheduling in GrADS • Performance-oriented migration framework • Tightly coupled policies for suspension and migration • Takes into account load characteristics and remaining execution times • Migration of an application depends on: • The amount of increase or decrease in load on the system • The point in the application's execution when the load is introduced • The performance benefit that can be obtained from migration • Components: Migrator, Contract Monitor, Rescheduler

  11. SRS Checkpointing Library • End application instrumented with a user-level checkpointing library • Enables reconfiguration of executing applications across distinct domains • Allows fault tolerance • Uses IBP (Internet Backplane Protocol) for storage and retrieval of checkpoints • Needs the Runtime Support System (RSS) – an auxiliary daemon started with the parallel application • Simple API: SRS_Init(), SRS_Restart_Value(), SRS_Register(), SRS_Check_Stop(), SRS_Read(), SRS_Finish(), SRS_StoreMap(), SRS_DistributeFunc_Create(), SRS_DistributeMap_Create()

  12. SRS Internals [Diagram: the SRS layer inside the MPI application polls the Runtime Support System (RSS) for a STOP signal; checkpoint data is stored to and read back from IBP depots, with possible redistribution of data between Stop and ReStart.]

  13. SRS API – Original vs. SRS Instrumented Code

  Original code:

      /* begin code */
      MPI_Init();
      /* initialize data */
      loop {
      }
      MPI_Finalize();

  SRS instrumented code:

      /* begin code */
      MPI_Init();
      SRS_Init();
      restart_value = SRS_Restart_Value();
      if (restart_value == 0) {
          /* initialize data */
      } else {
          SRS_Read("data", data, BLOCK, NULL);
      }
      SRS_Register("data", data, SRS_INT, data_size, BLOCK, NULL);
      loop {
          stop_value = SRS_Check_Stop();
          if (stop_value == 1) {
              exit();
          }
      }
      SRS_Finish();
      MPI_Finalize();

  14. SRS Example – Original Code

      MPI_Init(&argc, &argv);
      local_size = global_size / size;
      if (rank == 0) {
          for (i = 0; i < global_size; i++) {
              global_A[i] = i;
          }
      }
      MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size,
                  MPI_INT, 0, comm);
      iter_start = 0;
      for (i = iter_start; i < global_size; i++) {
          proc_number = i / local_size;
          local_index = i % local_size;
          if (rank == proc_number) {
              local_A[local_index] += 10;
          }
      }
      MPI_Finalize();

  15. SRS Example – Modified Code

      MPI_Init(&argc, &argv);
      SRS_Init();
      local_size = global_size / size;
      restart_value = SRS_Restart_Value();
      if (restart_value == 0) {
          if (rank == 0) {
              for (i = 0; i < global_size; i++) {
                  global_A[i] = i;
              }
          }
          MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size,
                      MPI_INT, 0, comm);
          iter_start = 0;
      } else {
          SRS_Read("A", local_A, BLOCK, NULL);
          SRS_Read("iterator", &iter_start, SAME, NULL);
      }
      SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
      SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);

  16. SRS Example – Modified Code (Contd.)

      for (i = iter_start; i < global_size; i++) {
          stop_value = SRS_Check_Stop();
          if (stop_value == 1) {
              MPI_Finalize();
              exit(0);
          }
          proc_number = i / local_size;
          local_index = i % local_size;
          if (rank == proc_number) {
              local_A[local_index] += 10;
          }
      }
      SRS_Finish();
      MPI_Finalize();

  17. Components (Continued) Contract Monitor: • Monitors the progress of the end application • Tolerance limits are specified to the contract monitor • Upper contract limit – 2.0 • Lower contract limit – 0.7 • When it receives the actual execution time for an iteration from the application, it: • calculates the ratio between actual and predicted times • adds the ratio to the running average ratio • adds the ratio to last_5_avg (the average over the last 5 iterations)

  18. Contract Monitor • If average ratio > upper contract limit • Contact rescheduler • Request for rescheduling • Receive reply • If reply is “SORRY. CANNOT RESCHEDULE” • Calculate new_predicted_time based on last_5_avg and orig_predicted_time • Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit • Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit • prev_predicted_time = new_predicted_time

  19. Contract Monitor • If average ratio < lower contract limit • Calculate new_predicted_time based on last_5_avg and orig_predicted_time • Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit • Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit • prev_predicted_time = new_predicted_time
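
A compact sketch of the contract-monitor logic on the two preceding slides. The slides name the inputs (last_5_avg, orig_predicted_time, the previous prediction and limits) but not the exact formulas; the form of new_predicted_time and the proportional scaling of the limits below are one plausible reading, not the published algorithm.

    /* Sketch of the contract monitor logic from slides 18-19.
     * The formulas marked "assumed" are interpretations of the slides. */
    typedef struct {
        double upper_limit;          /* e.g. 2.0 */
        double lower_limit;          /* e.g. 0.7 */
        double prev_predicted_time;  /* last accepted prediction */
    } Contract;

    /* Refit the contract around a new prediction derived from recent
     * behaviour (last_5_avg is the mean actual/predicted ratio of the
     * last five iterations). */
    static void refit_contract(Contract *c, double last_5_avg,
                               double orig_predicted_time)
    {
        double new_predicted = last_5_avg * orig_predicted_time; /* assumed */
        double scale = new_predicted / c->prev_predicted_time;
        c->upper_limit *= scale;   /* assumed proportional adjustment */
        c->lower_limit *= scale;
        c->prev_predicted_time = new_predicted;
    }

    /* Called once per iteration with the running average ratio.
     * Returns 1 when the rescheduler should be contacted; if it replies
     * "SORRY. CANNOT RESCHEDULE", the caller invokes refit_contract(). */
    static int check_contract(Contract *c, double avg_ratio,
                              double last_5_avg, double orig_predicted_time)
    {
        if (avg_ratio > c->upper_limit)
            return 1;                        /* request rescheduling */
        if (avg_ratio < c->lower_limit)      /* faster than predicted */
            refit_contract(c, last_5_avg, orig_predicted_time);
        return 0;
    }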

  20. Rescheduler • A metascheduling service • Operates in 2 modes: • When the contract monitor requests rescheduling – i.e., during performance degradation • Periodically queries the database manager for recently completed GrADS applications and migrates executing applications to make use of freed resources – i.e., opportunistic rescheduling

  21. Rescheduler Pseudo Code

  22. Rescheduler Pseudo Code

  23. Rescheduler Pseudo Code
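
The pseudocode on slides 21–23 was shown as images and is not in this transcript. As a stand-in, here is a minimal, hedged reconstruction of the decision it describes, pieced together from the rest of the deck (predicted gain from migration weighed against the redistribution cost, a STOP signal stored for the RSS to pick up, and a "SORRY. CANNOT RESCHEDULE" reply otherwise). All names, stubs, and the threshold value are assumptions.

    #include <stdio.h>

    /* Illustrative reconstruction of the rescheduler's decision step. */
    #define RESCHEDULING_THRESHOLD 30.0   /* seconds of required net gain */

    /* Stubbed predictions; a real rescheduler would query NWS, the
     * performance model, and the RSS for these numbers. */
    static double remaining_time_current(void)   { return 600.0; }
    static double remaining_time_candidate(void) { return 400.0; }
    static double redistribution_cost(void)      { return 90.0;  }

    /* Returns 1 if the application should be migrated. */
    static int decide_migration(void)
    {
        double gain = remaining_time_current()
                    - (remaining_time_candidate() + redistribution_cost());
        return gain > RESCHEDULING_THRESHOLD;
    }

    int main(void)
    {
        if (decide_migration())
            printf("store STOP signal; restart on candidate resources\n");
        else
            printf("SORRY. CANNOT RESCHEDULE\n");
        return 0;
    }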

  24. Application and Metascheduler Interactions [Flowchart:] The user supplies problem parameters to Resource Selection, which produces an initial list of machines. The Permission Service requests permission: if denied, new resource information is obtained (or the run is aborted); if granted, Application-Specific Scheduling produces an application-specific schedule, which goes to Contract Development with the Contract Negotiator. If the contract is not approved, new resource information is obtained and scheduling is retried; if approved, the problem parameters and final schedule are passed to Application Launching. On application completion the run exits; if the application was stopped instead, it waits for the restart signal and the cycle repeats with new resource information.

  25. Rescheduler Architecture [Diagram:] The Application Manager launches the application and, on completion, exits; if the application was stopped, it waits for the restart signal and obtains new resource information. The Contract Monitor receives application execution times and sends migration requests to the Rescheduler. The Rescheduler stores STOP and RESUME entries through the Database Manager; the Runtime Support System (RSS) queries for the STOP signal and delivers it to the application.

  26. Static Rescheduling Cost

  27. Experiments and Results – Rescheduling on Request • Different problem sizes of ScaLAPACK QR • msc – fast machines; opus – slow machines • The initial set of resources consisted of 4 msc and 8 opus machines • The performance model always chose the 4 msc machines for the application run • 5 minutes into the application run, artificial load was introduced on the 4 msc machines • The application migrated from UT to UIUC • For size 8000 the rescheduler decided not to reschedule – a wrong decision! [Chart compares rescheduling vs. no rescheduling]

  28. Rescheduling Depending on Amount of Load • ScaLAPACK QR, problem size 12000 • Load introduced 20 minutes after application start • The amount of load was varied • The rescheduler decided not to reschedule – a wrong decision! [Chart compares no rescheduling vs. rescheduling]

  29. Rescheduling Depending on Load Introduction Time • ScaLAPACK QR, problem size 12000 • The same load was introduced at different points of application execution • The rescheduler decided not to reschedule – a wrong decision! [Chart compares no rescheduling vs. rescheduling]

  30. Experiments and Results – Opportunistic Rescheduling • Two problems: the 1st, of size 14000, executing on 6 msc machines; the 2nd of varying sizes • The 2nd problem was introduced 2 minutes after the start of the 1st • The initial set of resources for the 2nd problem consisted of 6 msc machines and 2 opus machines • Due to the presence of the 1st problem, the 2nd problem had to use both the msc and opus machines, and hence involved Internet bandwidth • After the 1st problem completes, the 2nd problem can be rescheduled to use only the msc machines [Charts compare no rescheduling vs. rescheduling for the large problem]

  31. Dynamic Prediction of Rescheduling Cost • During a rescheduling decision, the rescheduler contacts the RSS and obtains the current distribution of the application's data • Forms old and new data maps • Based on the maps and current NWS information, predicts the redistribution cost
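
A sketch of how such a prediction might be computed from the old and new data maps plus NWS bandwidths. The matrix representation and the max-over-links timing model (assuming transfers on distinct links overlap) are assumptions, not the published algorithm.

    /* Sketch of redistribution-cost prediction from old/new data maps.
     * bytes_to_move[i][j] would be derived from the two data maps;
     * bandwidth[i][j] from NWS measurements. */
    double predict_redistribution_cost(int np_old, int np_new,
                                       double **bytes_to_move, /* [np_old][np_new] */
                                       double **bandwidth)     /* bytes/sec */
    {
        double cost = 0.0;
        for (int i = 0; i < np_old; i++) {
            for (int j = 0; j < np_new; j++) {
                if (bytes_to_move[i][j] <= 0.0)
                    continue;
                double t = bytes_to_move[i][j] / bandwidth[i][j];
                if (t > cost)
                    cost = t;   /* slowest link dominates (assumed overlap) */
            }
        }
        return cost;
    }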

  32. Dynamic Prediction of Rescheduling Cost [Results: application started on 4 mscs, restarted on 8 opus]

  33. References / Sources / Credits • Gary Shao, Rich Wolski and Fran Berman. "Predicting the Cost of Redistribution in Scheduling". Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing. • S. Vadhiyar and J. Dongarra. "Performance Oriented Migration Framework for the Grid". Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), pp. 130-137, May 2003, Tokyo, Japan. • L. V. Kale, Sameer Kumar and J. DeSouza. "A Malleable-Job System for Timeshared Parallel Machines". 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 21-24, 2002, Berlin, Germany. • See also the Cactus migration thorn • See also opportunistic migration by Huedo et al.

  34. JUNK !

  35. GridWay • Migration happens: • When performance degradation occurs • When "better" resources are discovered • When requirements change • On owner decision • On remote resource failure • Rescheduling is performed at the discovery interval • The performance degradation evaluator program is executed at the monitoring interval

  36. Components • Request manager • Dispatch manager • Submission manager – prologing, submitting, canceling, epiloging • Performance monitor • Application-specific components: • Resource selector • Performance degradation evaluator • Prolog • Wrapper • Epilog

  37. Opportunistic Job Migration • Factors: • Performance of the new host • Remaining execution time of the application • Proximity of the new resource to the needed data

  38. Dynamic Space Sharing on Clusters of Non-dedicated Workstations (Chowdhury et al.) • Dynamic reconfiguration – an application-level approach for dynamic reconfiguration of grid-based iterative applications

  39. SRS Overhead • Worst-case overhead – 15% • Worst-case SRS overhead across all results – 36%

  40. SRS Data Redistribution Cost [Results: started on 8 MSCs, restarted on 8 OPUS and 2 MSCs]

  41. Modified GrADS Architecture [Diagram components: User, Grid Routine / Application Manager, Resource Selector (fed by MDS and NWS), Permission Service, Performance Modeler, Contract Developer, Contract Negotiator, App Launcher, Contract Monitor, Rescheduler, Database Manager, RSS, and the Application]

  42. Another Approach: AMPI • AMPI – an MPI implementation on top of Charm++ • Processes are implemented as user-level threads • Charm++ provides the load balancing framework and migrates threads • The load balancing framework accepts a processor map • The parallel job is started on all processors in the system, but work is allocated only to processors in the processor map, i.e. threads/objects are assigned to processors in the map

  43. Rescheduling • When the processor map changes: • Threads are migrated to the new set of processors in the processor map • Skeleton processes are left behind on the vacated processors • A skeleton forwards messages to the threads/objects previously housed on that processor • The new processor map is conveyed to the load balancer framework by the adaptive job scheduler
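
For context, here is what a minimal AMPI-style iterative program looks like. Because each rank is a user-level thread, the only source change needed for migratability is a periodic call that yields to the load balancer; classic AMPI spells it MPI_Migrate(), newer releases use AMPI_Migrate(info), so treat the exact name as version-dependent. Such a job is typically launched with more virtual processors than physical ones, e.g. charmrun ./a.out +p8 +vp64, which is what lets the scheduler later shrink or expand the physical processor set.

    #include <mpi.h>

    /* Iterative AMPI application. Each MPI rank is a migratable
     * user-level thread; the periodic migration call lets the Charm++
     * load balancer move ranks according to the current processor map. */
    int main(int argc, char **argv)
    {
        int rank, iter;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (iter = 0; iter < 1000; iter++) {
            /* ... compute and exchange halos as in plain MPI ... */

            if (iter % 100 == 0) {
                MPI_Migrate();  /* AMPI extension: allow thread migration */
            }
        }

        MPI_Finalize();
        return 0;
    }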

  44. Overhead • Shrink or expand time depends on: • The per-process data that has to be transferred • The number of processors involved

  45. Cost of skeleton process

  46. CPU utilization by 2 Jobs

  47. Adaptive Job Scheduler • A variant of the dynamic equipartitioning strategy • Each job specifies the minimum and maximum number of processors it can run on • The scheduler recalculates the number of processors assigned to each running job • Running jobs and the new job are first assigned their minimum requirement • The leftover processors are divided equally among all the jobs • The new job is placed in a queue if it cannot be allocated its minimum requirement
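
A minimal sketch of the equipartitioning step just described: every running job plus the new job first gets its minimum, leftovers are handed out one at a time up to each job's maximum, and the new job is queued if the minimums alone do not fit. Struct layout and names are illustrative.

    #include <stdio.h>

    /* Sketch of dynamic equipartitioning; names are illustrative. */
    typedef struct { int min_procs, max_procs, alloc; } Job;

    /* Returns 1 on success; 0 if the new job (last entry in jobs[])
     * cannot get its minimum and must be queued. */
    int equipartition(Job *jobs, int njobs, int total_procs)
    {
        int used = 0;
        for (int i = 0; i < njobs; i++) {      /* 1. give each job its minimum */
            jobs[i].alloc = jobs[i].min_procs;
            used += jobs[i].alloc;
        }
        if (used > total_procs)
            return 0;                          /* queue the new job */

        int leftover = total_procs - used;
        while (leftover > 0) {                 /* 2. divide leftovers equally, */
            int given = 0;                     /*    respecting each job's max */
            for (int i = 0; i < njobs && leftover > 0; i++) {
                if (jobs[i].alloc < jobs[i].max_procs) {
                    jobs[i].alloc++;
                    leftover--;
                    given++;
                }
            }
            if (given == 0)
                break;                         /* every job is at its max */
        }
        return 1;
    }

    int main(void)
    {
        Job jobs[] = { {4, 16, 0}, {2, 8, 0}, {2, 4, 0} }; /* last: new job */
        if (equipartition(jobs, 3, 16))
            for (int i = 0; i < 3; i++)
                printf("job %d -> %d procs\n", i, jobs[i].alloc);
        else
            printf("new job queued\n");
        return 0;
    }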

  48. Scheduling • The same strategy is followed when jobs complete • The scheduler conveys its decision to the jobs as a bit-vector • The jobs then perform thread migration
