Scheduling Mixed Parallel Applications with Reservations


Presentation Transcript


  1. Scheduling Mixed Parallel Applications with Reservations Henri Casanova Information and Computer Science Dept. University of Hawai`i at Manoa henric@hawaii.edu

  2. Mixed Parallelism • Both task- and data-parallelism • “Malleable tasks with precedence constraints” [Figure: a schedule of malleable tasks in the processors × time plane]

  3. Mixed Parallelism • Mixed parallelism arises in many applications, many of them scientific workflows • Example: Image processing applications that apply a graph of data-parallel filters • e.g., [Hastings et al., 2003] • Many workflow toolkits support mixed-parallel applications • e.g., [Stef-Praun et al., 2007], [Kanazawa, 2005], [Hunold et al., 2003]

  4. Mixed-Parallel Scheduling • Mixed-parallel scheduling has been studied by several researchers • NP-hard, with guaranteed algorithms [Lepère et al., 2001] [Jansen et al., 2006] • Several heuristics have been proposed in the literature • One-step algorithms [Boudet et al., 2003] [Vydyanathan et al., 2006] • Task allocation and task mapping decisions are made concurrently • Two-step algorithms [Radulescu et al., 2001] [Bandala et al., 2006] [Rauber et al., 1998] [Suter et al., 2007] • First, compute task allocations • Second, map tasks to processors using some standard list-scheduling approach

  5. The Allocation Problem • We can give each task very few (perhaps just one) processors • Each task then runs for a long time • But many tasks can run in parallel • Or we can give each task many (perhaps all) processors • Each task then runs quickly, but typically with diminishing returns due to parallel efficiencies below 1 • But few tasks can run in parallel • Trade-off between parallelism and task execution times • Question: How do we achieve a good trade-off?

  6. Critical Path and Work • total work = sum of the rectangle surfaces (processors × time) • critical path length = execution time of the longest path in the DAG • Two constraints: • Makespan × #procs ≥ total work • Makespan ≥ critical path length [Figure: a schedule as task rectangles in the processors × time plane]
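To make the two constraints concrete, here is a minimal Python sketch (not the paper's code) that computes both lower bounds for a DAG; the task allocations, execution times, and successor map are assumed inputs.

```python
# Minimal sketch: the two lower bounds on makespan from slide 6.
# tasks maps name -> (procs, exec_time); successors maps name -> children.

def makespan_lower_bound(tasks, successors, num_procs):
    # Work bound: total work spread perfectly over all processors.
    total_work = sum(p * t for p, t in tasks.values())
    work_bound = total_work / num_procs

    # Critical-path bound: longest path length in the DAG (memoized DFS).
    memo = {}
    def bottom_path(v):
        if v not in memo:
            memo[v] = tasks[v][1] + max(
                (bottom_path(c) for c in successors.get(v, [])), default=0.0)
        return memo[v]
    cp_bound = max(bottom_path(v) for v in tasks)

    # The makespan can never beat either bound.
    return max(work_bound, cp_bound)
```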

  7. Work vs. CP Trade-off [Figure: best lower bound on makespan as task allocations grow from small to large; total work / #procs increases while critical path length decreases, so the best trade-off lies between the two extremes]

  8. The CPA 2-Step Algorithm • Original Algorithm [Radulescu et al., 2001] • For a homogeneous platform • Start by allocating 1 processor to all tasks • Then pick a task and increase its allocation by 1 processor • Picking the task that benefits the most from one extra processor, in terms of execution time • Repeat until the critical path length and the total work / # procs become approximately equal • Improved Algorithm [Suter et al., 2007] • Uses an empirically better stopping criterion
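As a rough illustration only, a simplified sketch of CPA's allocation phase follows; exec_time(v, p) is an assumed performance model (e.g., Amdahl's law), the real algorithm restricts the chosen task to the current critical path, and the stopping test shown is the original one, not the improved criterion of [Suter et al., 2007].

```python
# Simplified sketch of CPA's allocation phase on a homogeneous platform.

def cpa_allocate(tasks, successors, num_procs, exec_time):
    alloc = {v: 1 for v in tasks}  # start with 1 processor per task

    def critical_path():
        memo = {}
        def bl(v):
            if v not in memo:
                memo[v] = exec_time(v, alloc[v]) + max(
                    (bl(c) for c in successors.get(v, [])), default=0.0)
            return memo[v]
        return max(bl(v) for v in tasks)

    def avg_work():
        return sum(alloc[v] * exec_time(v, alloc[v]) for v in tasks) / num_procs

    # Stop when critical path and total work / #procs are roughly equal.
    while critical_path() > avg_work():
        growable = [v for v in tasks if alloc[v] < num_procs]
        if not growable:
            break
        # Give one extra processor to the task whose execution time
        # benefits the most from it.
        best = max(growable, key=lambda v:
                   exec_time(v, alloc[v]) - exec_time(v, alloc[v] + 1))
        alloc[best] += 1
    return alloc
```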

  9. Presentation Outline • Mixed-Parallel Scheduling • The Scheduling Problem with Reservations • Models and Assumptions • Algorithms for Minimizing Makespan • Algorithms for Meeting a Deadline • Conclusion

  10. Batch Scheduling and Reservations • Platforms are shared by users, today typically via batch schedulers • Batch schedulers have known drawbacks • non-deterministic queue waiting times • In many scenarios, one needs guarantees regarding application completion times • As a result, most batch schedulers today support advance reservations: • One can acquire a reservation for some number of processors and for some period of time

  11. Reservations • We have to schedule around the holes in the reservation schedule [Figure: reservations as filled regions in the processors × time plane, with free “holes” between them]

  12. Reservations • One reservation per task [Figure: each task occupying its own reservation in the processors × time plane]

  13. Complexity • The makespan minimization problem is NP-hard at several levels (and thus so is meeting a deadline) • Mixed-parallel scheduling is NP-hard • Guaranteed algorithms [Lepère et al., 2001] [Jansen et al., 2006] • Scheduling independent tasks with reservations is NP-hard and unapproximable in general [Eyraud-Dubois et al., 2007] • Guaranteed algorithms exist under restrictions • Whether guaranteed algorithms exist for mixed-parallel scheduling with reservations is an open question • In this work we focus on developing heuristics

  14. Presentation Outline • Mixed-Parallel Scheduling • The Scheduling Problem with Reservations • Models and Assumptions • Algorithms for Minimizing Makespan • Algorithms for Meeting a Deadline • Conclusion

  15. Models and Assumptions • Application • We assume that the application is fully specified and static • Conservative reservations can be used to be safe • Random DAGs are generated using the method in [Suter et al., 2007] • Data-parallelism is modeled based on Amdahl’s law • Platform • We assume that the reservation schedule does not change while we compute the schedule • We assume that we know the reservation schedule • Sometimes not enabled by cluster administrators • We ignore communication between tasks • Since a parent task may complete well before one of its children can start, data must be written to disk anyway • Can be modeled via task execution time and/or Amdahl’s law parameter
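For concreteness, here is a one-function sketch of the Amdahl's-law task model; the per-task serial fraction alpha is an assumed parameter (communication effects can be folded into it, as the slide notes).

```python
def exec_time(t_seq, alpha, p):
    """Amdahl's-law execution time on p processors.

    t_seq: time on one processor; alpha: non-parallelizable fraction.
    """
    return t_seq * (alpha + (1.0 - alpha) / p)

# Example: a 100 s task with alpha = 0.1 takes 32.5 s on 4 processors,
# i.e., parallel efficiency 100 / (4 * 32.5) ~= 0.77 (< 1, as on slide 5).
```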

  16. Minimizing Makespan • Natural approach: adapt the CPA algorithm • It’s a simple algorithm: • First phase: compute allocations • Second phase: list-scheduling • Problem: • Allocations are computed without considering reservations • Considering reservations would involve considering time, which is only done in the second phase • Greedy Approach: • Sort the tasks by decreasing bottom-level • For each task in this order, determine the best feasible processor allocation • i.e., the one that has the earliest completion time
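A sketch of this greedy step follows: tasks are taken in decreasing bottom-level order, and each gets the allocation with the earliest completion time. bottom_level(v) and earliest_finish(v, p, ready) -> (start, end) are hypothetical helpers; the latter would scan the reservation schedule for the first hole with p free processors after time ready.

```python
def greedy_schedule(tasks, parents, num_procs, bottom_level, earliest_finish):
    finish, schedule = {}, {}
    for v in sorted(tasks, key=bottom_level, reverse=True):
        # A task is ready once all of its parents have completed.
        ready = max((finish[u] for u in parents.get(v, [])), default=0.0)
        # Try every feasible allocation size; keep the earliest completion.
        start, end, p = min((earliest_finish(v, p, ready) + (p,)
                             for p in range(1, num_procs + 1)),
                            key=lambda slot: slot[1])
        schedule[v] = (start, p)
        finish[v] = end
    return schedule
```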

  17. Example D C A B B A C possible task configurations: D processors B time

  18. Computing Bottom-Levels • Problem: • Computing bottom levels (BL) requires that we know task execution times • Task execution times depend on allocations • But the allocations are only computed after the bottom levels have been used to order the tasks • We compare four ways to compute BLs • use 1-processor allocations • use “all”-processor allocations • use CPA-computed allocations, using all processors • use CPA-computed allocations, using the historical average number of non-reserved processors • We find that the 4th method is marginally better • wins in 78.4% of our simulations (more details on simulations later) • All results hereafter use this method for computing BLs
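For reference, once an allocation variant is chosen, bottom levels can be computed in one pass over a reverse topological order; this sketch uses Python's standard graphlib module.

```python
from graphlib import TopologicalSorter

def bottom_levels(successors, exec_time, alloc):
    # Passing the successor map as the "dependency" graph makes graphlib
    # yield children before their parents, which is the order we need.
    bl = {}
    for v in TopologicalSorter(successors).static_order():
        bl[v] = exec_time(v, alloc[v]) + max(
            (bl[c] for c in successors.get(v, [])), default=0.0)
    return bl
```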

  19. Bounding Allocations • A known problem with such a greedy approach is that allocations end up too large • the resulting loss of inter-task parallelism is detrimental to makespan • Let’s try to bound allocations • Three methods • BD_HALF: bound to half of the processors • BD_CPA: bound by the allocations in a CPA schedule computed using all processors • BD_CPAR: bound by the allocations in a CPA schedule computed using the historical average number of non-reserved processors
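These bounds can be applied as a simple cap on the allocation sizes tried by the greedy search; in this sketch, cpa_alloc and cpa_res_alloc stand for the two CPA runs described above (hypothetical names, not the paper's API).

```python
def allocation_cap(v, method, num_procs, cpa_alloc=None, cpa_res_alloc=None):
    if method == "BD_HALF":
        return max(1, num_procs // 2)      # half of the processors
    if method == "BD_CPA":
        return cpa_alloc[v]                # CPA run with all processors
    if method == "BD_CPAR":
        return cpa_res_alloc[v]            # CPA run with avg. non-reserved procs
    return num_procs                       # unbounded (BD_ALL)
```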

  20. Reservation Schedule Model? • We conduct our experiments in simulation • cheap, repeatable, controllable • We need to simulate environments for given reservation schedules • Question: what does a typical reservation schedule look like? • Answer: we don’t really know yet • There is no “reservation schedule” archive • Let’s look at what people have done in the past...

  21. Synthetic Reservation Schedules • We have schedules of batch jobs • e.g., the “parallel workload archive” by D. Feitelson • Typical approach, e.g., in [Smith et al., 2000] • Take a batch job schedule • Mark some jobs as “reserved” • Remove all other jobs • Problem: the amount of reservation is then approximately constant over time, while in the real world we expect it to decrease with time • And we do observe this decreasing behavior in a real-world 2.5-year trace from the Grid5K platform • We should therefore generate reservation schedules in which the amount of reservation decreases with time

  22. Synthetic Reservation Schedules • Three methods to “drop” reservations after the simulated application start time: • Linearly or exponentially • so that there are no reservations after 7 days • Based on job submission time • Preliminary evaluations indicate that the exponential method leads to schedules that are most correlated with the Grid5K data • For 4 logs from the “parallel workload archive” • But this is not conclusive because we have only one (good) data set at this point • We run simulations with the 4 logs, the 3 methods above, and the Grid5K data • Bottom line for this work: we observe no discrepancies in our results across any of the above
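As an illustration of the exponential method, the sketch below keeps a reservation with exponentially decaying probability in its start offset; the decay rate shown is an assumption, since the exact parameters are not given here.

```python
import math
import random

HORIZON = 7 * 24 * 3600          # no reservations after 7 days (slide 22)

def keep_reservation(start_offset, rate=5.0):
    """Decide whether a job marked as a reservation survives.

    start_offset: reservation start relative to the application start time.
    """
    if start_offset >= HORIZON:
        return False
    return random.random() < math.exp(-rate * start_offset / HORIZON)
```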

  23. Simulation Procedure • We use 40 application specifications • DAG size, width, regularity, etc. • 20 samples each • We use 36 reservation schedule specifications • batch log, generation method, etc. • 50 samples each • Total: (40 × 36) × (20 × 50) = 1,440 × 1,000 = 1,440,000 experiments • Two metrics: • Makespan • CPU-hour consumption

  24. Simulation Results • Similar results for Grid5K reservation schedules

  25. Presentation Outline • Mixed-Parallel Scheduling • The Scheduling Problem with Reservations • Models and Assumptions • Algorithms for Minimizing Makespan • Algorithms for Meeting a Deadline • Conclusion

  26. Meeting a Deadline • A natural approach for meeting a deadline is simply to schedule backwards from the deadline • Picking tasks by increasing bottom level • The safest option is to find, for each task, the feasible allocation that starts as late as possible given that: • The exit task must complete before the deadline • Each task must complete before any of its children begins • Let’s see this on a simple example
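A sketch of this backward pass follows. Here max_procs maps each task to its largest feasible allocation, and latest_slot(v, p, end_by) -> (start, end) is a hypothetical helper that scans the reservation schedule backwards for the latest hole with p free processors that completes v by end_by.

```python
def backward_schedule(max_procs, children, deadline, bottom_level, latest_slot):
    start, schedule = {}, {}
    # Increasing bottom level processes every child before its parents.
    for v in sorted(max_procs, key=bottom_level):
        # v must complete before its earliest-starting child (or the deadline).
        end_by = min((start[c] for c in children.get(v, [])), default=deadline)
        # Among feasible allocation sizes, keep the latest-starting one.
        s, e, p = max((latest_slot(v, p, end_by) + (p,)
                       for p in range(1, max_procs[v] + 1)),
                      key=lambda slot: slot[0])
        schedule[v] = (s, p)
        start[v] = s
    return schedule
```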

  27.-39. Meeting a Deadline Example [Figure sequence: Task 2 and Task 1 each have possible (processors × time) configurations A-E; working backwards from the deadline, each configuration is tried against the reservation schedule, and the feasible configuration that starts as late as possible is kept, first for Task 2, then for Task 1]

  40. Algorithms • We can employ the same techniques for bounding allocations as in the makespan minimization algorithms • BD_ALL, BD_HALF, BD_CPA, BD_CPAR • Problem: these algorithms do not consider the tightness of the deadline • If the deadline is loose, they consume unnecessarily many CPU-hours • For a very loose deadline there should be no data-parallelism at all, and thus no parallel efficiency loss due to Amdahl’s law • Question: How can we reason about deadline tightness?

  41. Deadline Tightness • For each task we have a choice of allocations: • Ones that use too many processors may be wasteful • Ones that use too few processors may be dangerous • Idea: • Consider the CPA-computed schedule assuming an empty reservation schedule • Using all processors, or the historical average number of non-reserved processors • Determine when the task would start in that schedule, i.e., at which fraction of the overall makespan • Pick the allocation that allows the task to start at the same fraction of the time interval between “now” and the deadline

  42.-44. Matching the CPA schedule [Figures: the CPA schedule on q processors, with its makespan split at the task’s start time into intervals a (before) and b (after); the schedule with reservations on p processors, with the interval between “now” and the task’s deadline split at the candidate allocation’s start time into c (before) and d (after)] • Pick the cheapest allocation such that: b / (a+b) > d / (c+d)
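A sketch of this resource-conservative pick follows; it applies the slide-44 condition verbatim. cpa_start(v), cpa_makespan, and candidate_allocs(v) (which yields the (procs, start) options feasible in the real reservation schedule) are assumed helper names, not the paper's API.

```python
def rc_pick(v, now, deadline, cpa_start, cpa_makespan, candidate_allocs):
    # a/b split the CPA makespan at v's CPA start time; c/d split the
    # interval [now, deadline] at each candidate allocation's start time.
    a = cpa_start(v)
    b = cpa_makespan - a
    matching = []
    for procs, start in candidate_allocs(v):
        c = start - now
        d = deadline - start
        if b / (a + b) > d / (c + d):   # the condition from slide 44
            matching.append((procs, start))
    # "Cheapest" = fewest processors among the matching allocations.
    return min(matching, default=None)
```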

  45. Simulation Experiments • We call this new approach “resource conservative” (RC) • We conduct simulations similar to those for the makespan minimization algorithms • Issue: the RC approach can be in trouble when it schedules the first tasks • if the reservation schedule is non-stationary and/or tight • this could be addressed via some tunable parameter (e.g., pick an allocation that starts at least x% after the scaled CPA start time) • We do not use such a parameter in our results • We use two metrics: • Tightest deadline achieved • Necessary because deadline tightness depends on the instance • Determined via binary search • CPU-hour consumption for a deadline that is 50% later than the tightest deadline
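The binary search for the tightest deadline can look like the sketch below; feasible(d) is a hypothetical predicate that runs a full RC scheduling pass and reports whether every task fits before deadline d.

```python
def tightest_deadline(lo, hi, feasible, eps=60.0):
    """Binary search for the tightest achievable deadline (a sketch).

    Assumes feasible(lo) is False and feasible(hi) is True; eps is the
    search resolution in seconds.
    """
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            hi = mid        # mid is achievable; try tighter
        else:
            lo = mid        # mid is too tight; relax
    return hi
```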

  46. Simulation Results

  47. Conclusions • Makespan minimization • Bounding task allocations based on the CPA schedule works well • Meeting a deadline • Using the CPA schedule to determine task start times works well, at least when the reservation schedule isn’t too tight • Some tuning parameter may help for tight schedules • Or, one can use the same approach as for makespan minimization, but backwards • In both cases, using the historical average number of non-reserved processors leads to marginal improvements

  48. Possible Future Directions • Use a recent one-step algorithm instead of CPA • iCASLB [Vydyanathan et al., 2006] • Experiments in a real-world setting • What kind of interface should a batch scheduler expose if the full reservation schedule must remain hidden? • A reservation schedule archive • Needs to be a community effort

  49. Scheduling Mixed-Parallel Applications with Advance Reservations, Kento Aida and Henri Casanova, to appear in Proc. of HPDC 2008 Questions?
