Designing Parallel Operating Systems using Modern Interconnects
Pitfalls in Parallel Job Scheduling Evaluation
Eitan Frachtenberg and Dror Feitelson
Computer and Computational Sciences Division, Los Alamos National Laboratory
Ideas that change the world
Scope • Numerous methodological issues arise in the evaluation of parallel job schedulers: • Experiment theory and design • Workloads and applications • Implementation issues and assumptions • Metrics and statistics • The paper covers 32 recurring pitfalls, organized by topic and sorted by severity • This talk describes a real case study, and the heroic attempts to avoid most such pitfalls …as well as the less-heroic oversight of several others
Evaluation Paths • Theoretical analysis (queuing theory): • Reproducible, rigorous, and resource-friendly • Hard for time slicing due to unknown parameters, application structure, and feedbacks • Simulation: • Relatively simple and flexible • Many assumptions, not all known/reported; hard to reproduce; rarely factors in application characteristics • Experiments with real sites and workloads: • Most representative (at least locally) • Largely impractical and irreproducible • Emulation: the approach taken in this work
Emulation Environment • Experimental platform consisting of three clusters with high-end network • Software: several job scheduling algorithms implemented on top of STORM: • Batch / space sharing, with optional EASY backfilling • Gang Scheduling, Implicit Coscheduling (SB), Flexible Coscheduling • Results described in [JSSPP’03] and [TPDS’05]
Step One: Choosing Workload • Static vs. Dynamic • Size of workload • How many different workloads are needed? • Use trace data? • Different sites have different workload characteristics • Inconvenient sizes may require imprecise scaling • “Polluted” data, flurries • Use model-generated data? • Several models exist, with different strengths • By trying to capture everything, may capture nothing
Static Workloads • We start with a synthetic application and static workloads • Simple enough to model, debug, and calibrate • Bulk-synchronous application • Can control: granularity, variability, and communication pattern
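The behavior of such a bulk-synchronous application can be sketched as follows. This is a minimal illustrative model, not the actual synthetic application from the study: each phase, every process computes for a controllable granularity perturbed by a variability knob, then all processes synchronize at a barrier. The function names and parameters are assumptions for illustration.

```python
import random

def bulk_synchronous_makespan(n_procs, n_phases, granularity_ms, variability, seed=0):
    """Model a bulk-synchronous app: in each phase, every process computes
    for granularity_ms perturbed by +/- variability (a fraction), then all
    synchronize. Returns total logical runtime in ms (sum of per-phase maxima)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_phases):
        phase_times = [granularity_ms * (1 + rng.uniform(-variability, variability))
                       for _ in range(n_procs)]
        # Barrier semantics: a phase ends when its slowest process finishes.
        total += max(phase_times)
    return total

# Higher variability inflates runtime, since the barrier waits for the slowest rank.
balanced = bulk_synchronous_makespan(32, 100, 10.0, 0.0)
imbalanced = bulk_synchronous_makespan(32, 100, 10.0, 0.5)
```

This captures why granularity and variability matter for time-slicing schedulers: load imbalance is amplified by every barrier.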
Synthetic Scenarios Balanced ComplementingImbalancedMixed
Dynamic Workloads • We chose Lublin's model [JPDC'03] • 1000 jobs per workload • Multiplying run times AND arrival times by a constant to "shrink" total run time (to 2-4 hours) • Shrinking too much is problematic (system constants) • Multiplying arrival times by a range of factors to modify load • Unrepresentative, since it deviates from the "real" correlations with run times and job sizes • A better solution is to use different workloads
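The two scaling operations above can be sketched in a few lines. This is an assumed representation (jobs as arrival/runtime/size tuples), not the format used in the study, and it illustrates why the second operation is the problematic one:

```python
def shrink(jobs, factor):
    """Scale run times AND arrival times by the same constant, so the whole
    workload 'shrinks' while preserving offered load and the correlations
    between arrival pattern, run time, and job size.
    jobs: list of (arrival_s, runtime_s, size) tuples."""
    return [(a * factor, r * factor, s) for a, r, s in jobs]

def change_load(jobs, arrival_factor):
    """Scale ONLY arrival times: compresses or stretches the arrival process
    to raise or lower offered load, but distorts the real correlations of
    arrivals with run times and job sizes."""
    return [(a * arrival_factor, r, s) for a, r, s in jobs]

jobs = [(0.0, 3600.0, 8), (600.0, 7200.0, 16), (900.0, 1800.0, 4)]
small = shrink(jobs, 0.1)        # everything 10x shorter; load unchanged
loaded = change_load(jobs, 0.5)  # arrivals twice as dense; offered load doubles
```

Uniform shrinking preserves the workload's shape but runs into system constants (scheduler timeslices, startup costs) when pushed too far, which is the pitfall noted above.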
Step Two: Choosing Applications • Synthetic applications are easy to control, but: • Some characteristics are ignored (e.g., I/O, memory) • Others may not be representative, in particular communication, which is a salient characteristic of parallel applications • Granularity, pattern, network performance • If unsure, conduct a sensitivity analysis • Applications might be assumed malleable, moldable, or to have linear speedup, which many MPI applications do not • Real applications carry no hidden assumptions • But they may also have limited generality
Application Choices • Synthetic applications in the first set • Allow control over more parameters • Allow testing unrealistic but interesting conditions (e.g., a high multiprogramming level) • LANL applications in the second set (Sweep3D, Sage) • Real memory and communication use (MPL=2) • Important applications for LANL's evaluations • But probably only for LANL… • Run-time estimates: f-model for batch, MPL for the others
Step Three: Choosing Parameters • What are reasonable input parameters to use in the evaluation? • Maximum multiprogramming level (MPL) • Timeslice quantum • Input load • Backfilling method and effect on multiprogramming • Run time estimate factor (not tested) • Algorithm constants, tuning, etc.
Example 1: MPL • Verified with different offered loads
Example 2: Timeslice • Dividing into quantiles allows analysis of the effect on different job types
Considerations for Parameters • Realistic MPLs • Scaling traces to different machine sizes • Scaling offered load • Artificial user estimates and multiprogramming estimates
Step Four: Choosing Metrics • Not all metrics are easily comparable: • Absolute times, slowdown with time slicing, etc. • Metrics may need to be limited to a relevant context • Use multiple metrics to understand characteristics • Pitfall: measuring utilization for an open model • Utilization is a direct measure of offered load until saturation, so it adds no information • The same goes for throughput and makespan • Better metrics: slowdown, response time, wait time • Pitfall: using the mean with asymmetric distributions • Pitfall: inferring scalability from O(1) node counts
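The per-job metrics named above can be computed as below. This is a hedged sketch: the bounded-slowdown threshold `tau` and the tuple layout are illustrative choices (a 10-second bound is a common convention, used so that very short jobs do not dominate the mean), not necessarily the exact values used in the study.

```python
def summarize(jobs, tau=10.0):
    """Mean response time, wait time, and bounded slowdown for a list of
    (wait_time, run_time) pairs, all in seconds.
    Bounded slowdown: max((wait + run) / max(run, tau), 1)."""
    n = len(jobs)
    resp = sum(w + r for w, r in jobs) / n
    wait = sum(w for w, r in jobs) / n
    bsld = sum(max((w + r) / max(r, tau), 1.0) for w, r in jobs) / n
    return resp, wait, bsld

# A long job with no wait, and a 1-second job that waited 50 seconds:
resp, wait, bsld = summarize([(0.0, 100.0), (50.0, 1.0)])
```

Without the bound (`tau`), the second job's raw slowdown would be 51x and would swamp the mean, which is exactly the asymmetric-distribution pitfall listed above.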
Step Five: Measurement • Never measure saturated workloads • When the arrival rate exceeds the service rate, queues grow without bound and all metrics become meaningless • …but finding the saturation point can be tricky • Discard warm-up and cool-down results • May need to measure subgroups separately (long/short, day/night, weekday/weekend, …) • Measurements should still have enough data points for statistical meaning, especially regarding workload length
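Discarding warm-up and cool-down can be sketched as trimming a fixed fraction of jobs from each end of the run, ordered by arrival. The 10% fractions here are an assumption for illustration; in practice they should be chosen per workload, e.g., by inspecting when queue lengths stabilize.

```python
def trim_steady_state(jobs, warmup_frac=0.1, cooldown_frac=0.1):
    """Keep only the steady-state portion of a measured run: drop the first
    warmup_frac and last cooldown_frac of jobs by arrival order, so that the
    initially-empty and finally-draining system does not bias the metrics."""
    jobs = sorted(jobs, key=lambda j: j["arrival"])
    lo = int(len(jobs) * warmup_frac)
    hi = len(jobs) - int(len(jobs) * cooldown_frac)
    return jobs[lo:hi]

measured = trim_steady_state([{"arrival": float(t)} for t in range(100)])
```

Note the tension flagged above: trimming shrinks the sample, so the workload must be long enough that what remains is still statistically meaningful.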
Conclusion • Parallel job scheduling evaluation is complex • …but we can avoid past mistakes • The paper can be used as a checklist when designing and executing evaluations • Additional information in the paper: • Pitfalls, examples, and scenarios • Suggestions on how to avoid the pitfalls • Open research questions (for the next JSSPP?) • Many references to positive examples • Be cognizant when choosing your compromises
References • Workload archive: http://www.cs.huji.ac.il/~feit/workload (contains several workload traces and models) • Dror's publication page: http://www.cs.huji.ac.il/~feit/pub.html • Eitan's publication page: http://www.cs.huji.ac.il/~etcs/pubs • Email: eitanf@lanl.gov