
Scheduling in HPC Resource Management System: Queuing vs. Planning

Scheduling in HPC Resource Management System: Queuing vs. Planning. Matthias Hovestadt, Odej Kao, Alex Keller, and Achim Streit 2003 Job Scheduling Strategies for Parallel Processing (JSSPP) Workshop Jerry Chou 8/29/2005. Outline. Background Queuing and Planning Systems


Presentation Transcript


  1. Scheduling in HPC Resource Management System: Queuing vs. Planning Matthias Hovestadt, Odej Kao, Alex Keller, and Achim Streit 2003 Job Scheduling Strategies for Parallel Processing (JSSPP) Workshop Jerry Chou 8/29/2005

  2. Outline • Background • Queuing and Planning Systems • Advanced Planning Functions • Example: Computing Center Software • Conclusion • Discussion

  3. Background • HPC systems are operated by resource management systems (RMS) based on the queuing approach • PBS, SGE, LoadLeveler, etc. • Grid middleware emerges between resource management systems and applications • Globus, vgES, etc. • High-level functions (co-allocation) need features from the RMS • Advanced reservation, quality of service • These features are hard to realize with a queuing-based RMS because it considers only present resource usage => This paper proposes planning systems to close the gap

  4. Big Picture (layer diagram, top to bottom) • Application • Co-allocation • Grid Middleware: Globus, vgES • Advanced Reservation, QoS • RMS (PBS), RMS (LoadLeveler), RMS (SGE), RMS (Condor) • Resources

  5. Queuing and Planning Systems • Queuing Systems • Planning Systems • Queuing vs. Planning Systems

  6. Queuing Systems • Queues have different limits on the resource requests • Number of resources requested • Execution time • Interactive/Batch jobs • Jobs in each queue are sorted by the scheduling policy • The highest-priority request is the queue head • If more than one queue head can be started, further criteria are needed, such as queue priority • If no queue head can be started, the idle resources may be utilized with backfilling
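
The queue-head and backfilling rules on this slide can be sketched as follows. This is a minimal illustration, not the scheduler of any of the systems named above: it assumes a single queue, a homogeneous pool of nodes, and the EASY backfilling variant (a waiting job may jump ahead only if it does not delay the queue head), which the slide does not specify. The names `Job` and `schedule_step` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    runtime: int   # estimated run time, needed only for backfilling
    priority: int


def schedule_step(queue, running, total_nodes, now):
    """Return the names of the jobs started at time `now`.

    `running` is a list of (end_time, nodes) pairs for jobs already running.
    """
    waiting = sorted(queue, key=lambda j: -j.priority)  # assumed policy: priority order
    running = list(running)
    free = total_nodes - sum(n for _, n in running)
    started = []

    # Start queue heads for as long as the head fits into the idle nodes.
    while waiting and waiting[0].nodes <= free:
        job = waiting.pop(0)
        free -= job.nodes
        running.append((now + job.runtime, job.nodes))
        started.append(job.name)
    if not waiting:
        return started

    # The head is blocked: find the earliest time at which it could start
    # (the "shadow time") and the nodes that would be left over at that time.
    head = waiting[0]
    avail, shadow = free, None
    for end, n in sorted(running):
        avail += n
        if avail >= head.nodes:
            shadow = end
            break
    if shadow is None:          # head needs more nodes than the machine has
        return started
    extra = avail - head.nodes

    # Backfill: a later job may start now if it fits into the idle nodes and
    # either finishes before the shadow time or uses only left-over nodes.
    for job in waiting[1:]:
        if job.nodes > free:
            continue
        if now + job.runtime <= shadow:
            pass
        elif job.nodes <= extra:
            extra -= job.nodes
        else:
            continue
        free -= job.nodes
        started.append(job.name)
    return started
```

With 8 nodes, 6 of them busy until time 10, and a blocked 8-node head, a 2-node job with run time 5 is backfilled, while a 2-node job with run time 20 is not, because it would delay the head.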

  7. Planning Systems - Replanning • Requested • Start time • Estimated run time • When • A new request is submitted • A running request ends before its estimated end time • How • Delete all non-reservations from the schedule • Sort non-reservations according to the scheduling policy • Arrange reservations in the schedule • Insert non-reservations into the schedule at the earliest possible start time
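
The four "How" steps can be sketched in a few lines. This is an illustration under stated assumptions, not the CCS implementation: the machine is modelled as a pool of identical nodes, the scheduling policy is assumed to be first-come-first-served by submission time, reservations are assumed not to conflict with each other, and already running jobs are ignored. `Request`, `replan`, and the helper functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    name: str
    nodes: int
    runtime: int                       # estimated run time
    submit: int                        # submission time, used by the FCFS policy
    fixed_start: Optional[int] = None  # set only for reservations


def used_nodes(schedule, t):
    """Nodes occupied at time t by the (request, start) pairs planned so far."""
    return sum(r.nodes for r, s in schedule if s <= t < s + r.runtime)


def fits(schedule, req, start, total_nodes):
    # Usage rises only when a planned request starts, so it is enough to check
    # the candidate start and every planned start inside the interval.
    points = [start] + [s for _, s in schedule if start < s < start + req.runtime]
    return all(used_nodes(schedule, t) + req.nodes <= total_nodes for t in points)


def replan(requests, total_nodes, now):
    # Steps 1 and 2: drop all non-reservations and sort them by policy.
    others = sorted((r for r in requests if r.fixed_start is None),
                    key=lambda r: r.submit)
    # Step 3: place the reservations at their fixed start times.
    schedule = [(r, r.fixed_start) for r in requests if r.fixed_start is not None]
    # Step 4: insert each non-reservation at its earliest possible start,
    # which is either `now` or the end of an already planned request.
    # Assumes no request asks for more than total_nodes.
    for r in others:
        candidates = sorted({now} | {s + q.runtime for q, s in schedule
                                     if s + q.runtime > now})
        start = next(t for t in candidates if fits(schedule, r, t, total_nodes))
        schedule.append((r, start))
    return {r.name: s for r, s in schedule}
```

On a 4-node machine with a full-machine reservation at time 10, a long 2-node request is pushed behind the reservation, while a short request submitted later still starts at time 0. Backfilling therefore falls out of the planning step without a separate mechanism.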

  8. Queuing vs. Planning Systems

  9. Advanced Planning Functions • Requesting Resources • Dynamic Aspects • Service Level Agreements

  10. Requesting Resources • Diffuse requests • Give a range: “need 32~128 CPUs” • Let the RMS optimize: “need as many nodes as possible” • Negotiation
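
A diffuse request with a range can be resolved in the simplest case as below. This sketch assumes that the optimization goal is "as many CPUs as possible" and that only the currently free CPUs matter. A real planning system would also weigh the start time. `resolve_diffuse` is a hypothetical helper name.

```python
from typing import Optional


def resolve_diffuse(min_cpus: int, max_cpus: int, free_cpus: int) -> Optional[int]:
    """Pick a concrete CPU count for a request such as "need 32~128 CPUs".

    Returns None when even the lower bound cannot be satisfied, which is the
    point where negotiation with the user would start.
    """
    if free_cpus < min_cpus:
        return None
    return min(max_cpus, free_cpus)
```

For the range 32 to 128, a machine with 100 free CPUs grants 100, one with 200 free CPUs grants 128, and one with 16 free CPUs grants nothing.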

  11. Dynamic Aspects • Variable Reservations • Make a reservation ASAP • Different from reserved jobs: • No fixed start time • Different from non-reserved jobs: • Never planned later than its first planned start time • Resource Reclaiming • Replace requested resources at run time • Automatic Duration Extension • Extend the runtime of jobs while they are running • How long can it be extended? • How many times can it be extended?
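
The two open questions about automatic duration extension (how long, how many times) can be treated as policy limits. The sketch below is one possible reading, not the paper's mechanism: it assumes an extension is granted only if it stays within both limits and does not run into the next request planned on the same nodes. `RunningJob`, `try_extend`, and the default limits are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningJob:
    name: str
    end: int              # currently planned end time
    extensions: int = 0   # extensions granted so far


def try_extend(job: RunningJob, extra: int, next_start: Optional[int],
               max_extra: int = 30, max_extensions: int = 2) -> bool:
    """Extend job.end by `extra` time units if policy and schedule allow it.

    `next_start` is the start of the next request planned on the same nodes,
    or None if nothing is planned there.
    """
    if job.extensions >= max_extensions or extra > max_extra:
        return False      # policy limit reached
    if next_start is not None and job.end + extra > next_start:
        return False      # would collide with a planned request
    job.end += extra
    job.extensions += 1
    return True
```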

  12. Dynamic Aspects (Cont.) • Automatic Restart • It can utilize short time slots in the schedule • Space sharing “Cycle Stealing” • Run as a background job to steal resources in a space-sharing system (like Condor) • Deployment Servers • RMS plans both the requested resources and the time to reconfigure the hardware

  13. Service Level Agreements (SLA) • An SLA has to be considered not only in the scheduling process but also during runtime • At runtime the scheduler is not responsible for measuring the fulfillment of the SLA, but for providing all granted resources

  14. Computing Center Software (CCS) • Architecture • User Interface (UI): provides a single access point to one or more systems • Access Manager (AM): manages the user interface and is responsible for authentication, authorization and accounting • Planning Manager (PM): plans the user requests onto the machine • Machine Manager (MM): provides machine-specific features • Island Manager (IM): provides CCS internal services and watchdog facilities to keep the island in a stable condition

  15. Process Flow (flowchart) • User: specifies the expected duration of each request • PM: re-plans the schedule • Fix-time request: reserves resources for a given time • Var-time request: can move to an earlier time slot when replanning • MM: maps the schedule to machines and verifies whether the schedule can be realized with the available hardware • Yes: done • No: find an alternative time and send a conflict list to the PM • Can the PM accept? (Yes / No)

  16. Conclusion • Classify and compare queuing systems with planning systems • Present possible advanced planning functionality • The aim of the paper is to show the benefit of planning systems for managing HPC machines

  17. Discussion • Do planning systems solve all the problems? • What if most jobs want to run ASAP? • What if the runtime is not estimated precisely? • How do queuing systems and planning systems compare in performance and utilization? • If you were a resource provider, would you use it? • What features could be provided by vgES? • Diffuse requests • Resource reclaiming • Variable reservation • Negotiation
