
Hybrid Preemptive Scheduling of MPI applications


Presentation Transcript


  1. Aurélien Bouteiller, Hinde Lilia Bouziane, Thomas Hérault, Pierre Lemarinier, Franck Cappello (MPICH-V team, INRIA Grand-Large, LRI, University Paris South). Hybrid Preemptive Scheduling of MPI applications.

  2. Problem definition
  • Context: clusters and grids (made of clusters) shared by many users, with fewer available resources than required at a given time. In this study: finite sets of MPI applications.
  • Time sharing of parallel applications is attractive to increase fairness between users, compared to batch scheduling.
  • It is very likely that several applications will reside in virtual memory at the same time, exceeding the total physical memory → out-of-core scheduling of parallel applications on clusters (scheduling parallel applications on a cluster under memory constraints).
  • Most of the proposed approaches try to avoid this situation by limiting job admission based on memory requirements, which delays some jobs unpredictably when job execution times are not known.
  Issue: a novel (out-of-core) approach that avoids delaying jobs?
  Constraint: no OS modification (no kernel patch).

  3. Outline • Introduction (related work) • A Hybrid approach dedicated to out-of-core • Evaluation • Concluding remarks 3

  4. Related work 1
  Scheduling parallel applications on distributed memory machines: a long history of research, still very active (5 papers in 2004 in the main conferences: IPDPS, Cluster, SC, Grid, Europar).
  • Co-scheduling: all processes of each application are scheduled independently (no coordination). Expected advantage: overlapping communication and computation.
  • Gang scheduling (sometimes also called "co-scheduling"): all processes of each application are executed simultaneously (coordination), within a time slice. Expected advantage: communicating processes are scheduled together, at the cost of a scheduling overhead per time slice.
  [Figure: timelines of three applications (Appl 1, Appl 2, Appl 3) on two processors (Proc 1, Proc 2), contrasting co-scheduling with gang scheduling and its time slices, communications and scheduling overhead.]

  5. Related work 2
  Comparison between gang scheduling and co-scheduling:
  Gang scheduling outperforms co-scheduling:
  • D. G. Feitelson and L. Rudolph. Gang Scheduling Performance Benefits for Fine-Grained Synchronization. Journal of Parallel and Distributed Computing, 16(4):306–318, December 1992.
  Co-scheduling outperforms gang scheduling:
  • Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini and Juan Fernandez, "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources", IPDPS 2003 (gang-schedules only the applications that take advantage of it, after classification).
  • Gyu Sang Choi, Jin-Ha Kim, Deniz Ersoz, Andy B. Yoo, Chita R. Das, "Coscheduling in Clusters: Is It a Viable Alternative?", SC2004 (increases the priority of processes during communications).
  • Peter Strazdins and John Uhlmann, "Local scheduling outperforms gang scheduling on a Beowulf cluster", Technical report, Department of Computer Science, Australian National University, January 2004; Cluster 2004 (Ethernet, SCore for gang scheduling, MatMul and Linpack).
  A multiple-parameter problem: the conclusion depends on the assumptions!

  6. Related work 3
  Metrics for measuring performance: Metrics and Benchmarking for Parallel Job Scheduling [Fe98]
  Performance:
  • Makespan
  • Throughput
  • Response time
  Fairness (not so much investigated, yet still very important):
  • Standard deviation of the response time for a set of homogeneous applications
  • The minimum is the best fairness
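As a side note (not part of the original slides), these metrics are easy to compute from per-application completion times; the sketch below assumes all applications are launched at the same instant, as in the experiments later in the talk.

```python
# Illustrative sketch: scheduling metrics of slide 6, computed from the
# termination times of a set of applications launched simultaneously at t = 0.
from statistics import pstdev

def makespan(end_times):
    """Time from the common launch until the last application terminates."""
    return max(end_times)

def throughput(end_times):
    """Applications completed per unit of time over the whole run."""
    return len(end_times) / makespan(end_times)

def fairness(end_times):
    """Standard deviation of the response times of homogeneous applications;
    the minimum (0) is the best fairness."""
    return pstdev(end_times)

# Example: three identical applications finishing at slightly different times
ends = [1210.0, 1214.0, 1218.0]
print(makespan(ends), throughput(ends), fairness(ends))
```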

  7. Outline • Introduction (related work) • A Hybrid approach dedicated to out-of-core • Evaluation • Concluding remarks 7

  8. Our approach 1/2: Hybrid
  Principle, for a given set of parallel applications to schedule:
  • Application subsets: each subset is a set of applications fitting together in memory
  • Co-scheduling of the applications within a subset
  • Gang scheduling of the subsets of applications
  Expected benefits:
  • Overlapping communication and I/O with computation inside a subset
  • No memory page miss/replacement during a subset's execution
  • Known co-scheduling optimizations remain possible within a subset
  Potential limitation:
  • High "subset context" switching overhead
  Example: one set of 6 applications split into 2 subsets of 3 applications; in-core → co-scheduling within the subset, out-of-core → gang scheduling between subsets. A sketch of the subset construction is given below.
  [Figure: timeline showing the time slice, the communications, and the application-subset context switches for this example.]
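A minimal sketch of the subset construction, assuming each application is characterized only by its per-node memory footprint and packed greedily (the slides do not prescribe a particular packing policy):

```python
# Hedged sketch: partition applications into subsets that each fit in the
# physical memory of a node, so co-scheduling inside a subset stays in-core
# while subsets are gang-scheduled against each other via checkpointing.
def build_subsets(app_memory, node_memory):
    """app_memory: dict app name -> per-node memory footprint (bytes).
    Greedy first-fit-decreasing packing; returns a list of subsets."""
    subsets, loads = [], []
    for app, mem in sorted(app_memory.items(), key=lambda kv: -kv[1]):
        if mem > node_memory:
            raise ValueError(f"{app} does not fit in physical memory on its own")
        for i in range(len(subsets)):
            if loads[i] + mem <= node_memory:
                subsets[i].append(app)
                loads[i] += mem
                break
        else:
            subsets.append([app])
            loads.append(mem)
    return subsets

# Example matching the slide: 6 applications of ~300 MB on 1 GB nodes
apps = {f"app{i}": 300 * 2**20 for i in range(6)}
print(build_subsets(apps, 2**30))   # -> 2 subsets of 3 applications each
```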

  9. Our approach 2/2: Checkpointing
  Basic OS virtual memory management:
  • Paging in pages on request
  • Paging out pages on replacement (LRU)
  • Interaction with the OS scheduler (some OSes are deliberately unfair in out-of-core situations)
  • Poor performance for HPC applications
  [Figure: pages migrating between memory and disk under demand paging.]
  Adaptive Memory Paging for Gang-Scheduling [Ry04]: the best performance for gang scheduling is obtained by
  1) selective paging out (swapping out only the pages of descheduled processes),
  2) aggressive paging out (evicting the pages of descheduled processes at once),
  3) adaptive paging in (swapping in the pages of the scheduled process at once).
  Good, but this requires deep kernel modifications.
  Our approach: user-level application subset checkpointing.
  • Checkpointing provides the same benefits as 1), 2) and 3)
  • Works for co-scheduling as well as for gang scheduling
  • Does not require any kernel modification!
  We need a parallel application (MPI) checkpoint mechanism.

  10. Implementation using the MPICH-V framework
  The MPICH-V framework is a set of components: Dispatcher, Checkpoint Scheduler, Checkpoint Servers, Event Loggers, Channel Memories, Fault Detector, and per-node daemons running the MPI processes.
  An MPICH-V protocol is a composition of a subset of these components.
  [Figure: architecture diagram connecting these components through the network to the daemons and MPI processes on the nodes.]

  11. Checkpoint protocol selection: coordinated or uncoordinated?
  Six protocols are implemented in MPICH-V:
  • 1 coordinated (Chandy-Lamport)
  • 2 uncoordinated with pessimistic message logging
  • 3 uncoordinated with causal message logging
  The coordinated one provides the best performance for fault-free execution.

  12. Coordinated checkpoint: 2 ways
  • Flushing the network (Chandy-Lamport): on the checkpoint signal, in-transit messages are flushed from the communication buffers and stored with the process state, so the checkpoint image of a process is its memory image plus the logged in-transit messages; these messages are delivered again at restart. Checkpoint and restart may last longer.
  • Checkpointing the communication stack (Parakeet, Meiosys, SCore): the communication buffers are saved as part of the system state.
  We expect only a minor performance difference between the two approaches, and checkpoint/restart of the communication stack requires OS modifications, so we implemented the Chandy-Lamport approach.
  [Figure: timelines of processes P0 and P1 with their buffers B0 and B1, showing the checkpoint signal, the flush of in-transit messages, and message delivery at restart for both approaches.]

  13. Coordinated checkpointing (Chandy-Lamport)
  MPICH-V/CL protocol, the reference protocol for coordinated checkpointing:
  1) When receiving a checkpoint tag, start the checkpoint and store any incoming message
  2) Store all in-transit incoming messages in the checkpoint image
  3) Send the checkpoint tag to all neighbors in the topology; the checkpoint is finished when a tag has been received from all neighbors
  4) After a crash, all nodes retrieve their checkpoint images
  5) Deliver the stored in-transit messages to the restarted processes
  A sketch of this control flow is given below.
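The steps above can be sketched as a small per-process state machine; this is only an illustration of the protocol logic, not the actual MPICH-V/CL code, and the transport/storage interfaces used here are assumptions.

```python
# Hedged sketch of the per-process Chandy-Lamport logic behind MPICH-V/CL.
# Message transport, process image capture and checkpoint storage are
# abstracted behind `transport` and `storage` objects (assumed interfaces).
class CLProcess:
    def __init__(self, rank, neighbors, transport, storage):
        self.rank = rank
        self.neighbors = set(neighbors)   # ranks we exchange messages with
        self.transport = transport
        self.storage = storage
        self.checkpointing = False
        self.waiting_tags = set()         # neighbors whose tag is still expected
        self.in_transit = []              # messages stored in the checkpoint image

    def start_checkpoint(self):
        """Called by the checkpoint scheduler, or on reception of the first tag."""
        if self.checkpointing:
            return
        self.checkpointing = True
        self.waiting_tags = set(self.neighbors)
        self.image = self.transport.snapshot_local_state()     # assumed call
        for n in self.neighbors:                 # send the tag to all neighbors
            self.transport.send(n, ("CKPT_TAG", self.rank))    # assumed call

    def on_message(self, src, msg):
        if msg[0] == "CKPT_TAG":
            self.start_checkpoint()              # first tag starts the checkpoint
            self.waiting_tags.discard(src)
            if not self.waiting_tags:            # tags received from all neighbors
                self.storage.save(self.rank, (self.image, self.in_transit))
                self.checkpointing = False       # checkpoint finished
        elif self.checkpointing and src in self.waiting_tags:
            self.in_transit.append((src, msg))   # in-transit message: store it
        else:
            self.deliver(src, msg)               # normal application delivery

    def deliver(self, src, msg):
        pass  # hand the message to the MPI application (omitted)
```

On restart after a crash, each process would reload its image and re-deliver the stored in-transit messages before resuming normal communication (steps 4 and 5 above).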

  14. Implementation details
  Co-scheduling: several Dispatchers (no master or checkpoint scheduler).
  Gang (and Hybrid) scheduling: a Master Scheduler plus several Checkpoint Schedulers.
  • The Master Scheduler issues a checkpoint order to the Checkpoint Scheduler(s) of the running application(s).
  • When receiving this order, a Checkpoint Scheduler launches a coordinated checkpoint: every running daemon computes the MPI process image and stores it on the local disk, then all daemons send a completion message to the Checkpoint Scheduler.
  • All running daemons then stop the MPI processes and their own execution.
  • The Master Scheduler selects the Checkpoint Scheduler(s) of the other application(s) and sends a restart order. Every Checkpoint Scheduler receiving this order spawns new daemons that restart the MPI processes from the local images.
  [Figure: Master Scheduler, Checkpoint Schedulers, Checkpoint Servers, Dispatchers and daemons connected over the network to the MPICH-V nodes.]
  A rough sketch of this subset context switch is shown after this list.
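The sketch below illustrates the gang context switch driven by the Master Scheduler; the Checkpoint Scheduler proxy methods used here (start, checkpoint, wait_completion, stop, restart) are placeholders, not the real MPICH-V interfaces.

```python
# Hedged sketch of the subset context switch of slide 14: the Master Scheduler
# checkpoints the running subset, stops it, then restarts the next one.
import time

def switch_subset(running, nxt, ckpt_sched):
    for app in running:
        ckpt_sched[app].checkpoint()        # coordinated checkpoint order
    for app in running:
        ckpt_sched[app].wait_completion()   # daemons stored images on local disks
        ckpt_sched[app].stop()              # stop MPI processes and daemons
    for app in nxt:
        ckpt_sched[app].restart()           # spawn daemons, restart from local images
    return nxt

def gang_loop(subsets, ckpt_sched, time_slice=600):
    """Rotate over the application subsets with a fixed time slice (e.g. 600 s)."""
    running = subsets[0]
    for app in running:
        ckpt_sched[app].start()             # initial launch of the first subset
    i = 0
    while True:
        time.sleep(time_slice)
        i = (i + 1) % len(subsets)
        running = switch_subset(running, subsets[i], ckpt_sched)
```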

  15. Outline • Introduction (related work) • A Hybrid approach dedicated to out-of-core • Evaluation • Concluding remarks 15

  16. Methodology
  • LRI Beowulf cluster: Athlon 1800+, 1 GB memory, IDE ATA100 disk, 100 Mb/s Ethernet, Linux 2.4.2
  • Benchmarks (MPI): NAS BT (computation bound), NAS CG (communication bound)
  • Time measurement: homogeneous applications, launched simultaneously (scripts); the time is measured between the first launch and the last termination; fairness is measured by the standard deviation of the response times
  • Gang scheduling time slice: 200 or 600 s
  • Gang scheduling is also implemented with checkpointing (not OS signals)

  17. Context switch overlap policy
  Several policies can be imagined to switch between subset contexts; which one is best in the in-core and out-of-core situations?
  A) Sequential store and load: 1x, one context in memory
  B) Store and load in parallel: 2x, two contexts in memory
  C) Load prefetch: 2x, two contexts in memory
  Results for the NAS benchmark BT C 25, in-core and near out-of-core:
  1) The overlapping policies do not provide substantial improvements in the in-core situation (less than 3%), and they need twice the memory capacity to stay in-core
  2) The sequential policy is therefore the best; we used it for the other experiments
  [Figure: timelines of execution, context storage and context load under the three policies, in-core and near out-of-core.]

  18. Co vs. Gang (checkpoint based)
  Which scheduling strategy is the best for communication-bound and compute-bound applications?
  Makespan: execution time of N applications with co-scheduling and gang scheduling, NAS benchmarks CG and BT.
  [Figure: makespan (in seconds) versus the number of CG-C-8 (1 to 24) and BT-B-9 (1 to 21) applications executed "simultaneously", for co-scheduling and checkpoint-based gang scheduling, spanning the in-core and out-of-core regions; the out-of-core co-scheduling point for CG exceeds 24,000 s.]
  • Co-scheduling is the best for in-core executions (but the advantage is small, roughly the checkpoint overhead plus a tiny communication/computation overlap)
  • Gang scheduling outperforms co-scheduling for out-of-core executions (checkpoint based)
  • The memory constraint is managed by checkpointing, not by delaying jobs

  19. Checkpoint-based Gang vs. checkpoint-based Hybrid
  Makespan: execution time of N applications with co-, gang and hybrid scheduling.
  [Figure: makespan (in minutes) versus the number of CG-C-8 and BT-B-9 applications executed "simultaneously", for co-scheduling, gang scheduling and hybrid scheduling (subsets of 5), spanning the in-core and out-of-core regions; one out-of-core co-scheduling point exceeds 3000 and goes off the scale. Annotations mark the checkpoint overhead and the communication/computation overlap.]
  • Gang and hybrid scheduling outperform co-scheduling for out-of-core executions
  • Hybrid scheduling compares favorably to gang scheduling on BT out-of-core, thanks to communication and computation overlap

  20. Overhead comparison
  What is the performance degradation due to time sharing?
  Relative slowdown: (total time / number of concurrent executions) / best sequential time.
  [Figure: relative slowdown (reference = 1) versus the number of concurrent BT-B-9 executions (6 to 21) and CG-C-8 executions (12 to 24), for co-, gang and hybrid scheduling.]
  • Gang and hybrid scheduling add no performance penalty to CG (and also no improvement)
  • Gang scheduling adds about a 10% performance penalty to BT
  • Hybrid scheduling improves BT performance by almost 10%
  • The difference is mostly due to communication/computation overlap
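For reference, the relative-slowdown metric used above is a one-line computation; the numbers in the example are purely illustrative.

```python
# Hedged sketch: relative slowdown as defined on slide 20.
def relative_slowdown(total_time, n_concurrent, best_sequential_time):
    """(total time / number of concurrent executions) / best sequential time.
    1.0 means time sharing costs nothing per application; values below 1.0
    mean time sharing helps (e.g. through communication/computation overlap)."""
    return (total_time / n_concurrent) / best_sequential_time

# Illustrative example: 12 concurrent runs finishing after 9600 s,
# compared with an 890 s best sequential run
print(relative_slowdown(9600.0, 12, 890.0))   # ~0.9, i.e. ~10% improvement
```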

  21. Co-scheduling fairness (Linux)
  How fair is co-scheduling in the in-core and out-of-core situations? Response time of BT B 9 with modified memory sizes:
  [Figure: per-application response times. In-core: mean = 1210 s, maximum difference = 8 s. Slightly out-of-core: mean = 2251 s, standard deviation = 298 s, maximum difference = 961 s.]
  → Co-scheduling is highly unfair in the out-of-core situation!
  Page miss statistics for 7 and 9 concurrent BT C 25 (out-of-core), in page misses per minute per node of each application (mean):
  | # appls. | app0 | app1 | app2 | app3 | app4 | … | app8 | mean over all apps | std. dev. |
  | 7 | 8.5 | 239.5 | 264.25 | 951 | 1145.25 | … | – | 704 | 474.9 |
  | 9 | 484.58 | 405.78 | 524.4 | 510.66 | 510.66 | … | 509.4 | 507.4 | 47.1 |
  1) The fairness deficiency in the slightly out-of-core case seems due to the virtual memory management.
  2) There should of course be a solution, but it would involve kernel modifications.

  22. Outline • Introduction (related work) • A Hybrid approach dedicated to out-of-core • Evaluation • Concluding remarks 22

  23. Concluding remarks
  • Checkpoint-based gang scheduling outperforms co-scheduling, and most likely classical (OS signal based) gang scheduling, in out-of-core situations, thanks to a better memory management
  • Compared to known approaches based on job admission control, the benefit of checkpointing is that it avoids delaying jobs
  • Hybrid scheduling, combining the two approaches with checkpointing, outperforms gang scheduling on BT, presumably thanks to overlapping communications and computations
  • More generally, hybrid scheduling can take advantage of advanced co-scheduling approaches within a gang subset
  Work in progress:
  • Tests with other applications / benchmarks
  • Comparison with traditional gang scheduling based on OS signals
  • Experiments with high-speed networks
  • Experiments on hybrid scheduling with co-scheduling optimizations

  24. Meet us at the INRIA booth 2345! Mail contact: bouteiller@mpich-v.net

  25. References
  [Ag03] S. Agarwal, G. Choi, C. R. Das, A. B. Yoo, and S. Nagar. "Co-ordinated Coscheduling in Time-Sharing Clusters through a Generic Framework". In Proceedings of the International Conference on Cluster Computing, December 2003.
  [Ar98] A. C. Arpaci-Dusseau, D. E. Culler, and A. M. Mainwaring. "Implicit Scheduling With Implicit Information in Distributed Systems". In Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 233–243, June 1998.
  [Ba00] Anat Batat and Dror G. Feitelson. "Gang Scheduling with Memory Considerations". In Proceedings of IPDPS 2000.
  [Bo03] Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, and Franck Cappello. "Coordinated Checkpoint versus Message Log for Fault Tolerant MPI". In IEEE International Conference on Cluster Computing (Cluster 2003), IEEE CS Press, December 2003.
  [Ch85] K. M. Chandy and L. Lamport. "Distributed Snapshots: Determining Global States of Distributed Systems". ACM Transactions on Computer Systems, 3(1):63–75, February 1985.
  [Fe98] D. G. Feitelson and L. Rudolph. "Metrics and Benchmarking for Parallel Job Scheduling". In Job Scheduling Strategies for Parallel Processing, LNCS vol. 1495, pages 1–24, Springer-Verlag, March 1998.
  [Fr03] Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini, and Juan Fernandez. "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources". IPDPS 2003.
  [Ho98] Atsushi Hori, Hiroshi Tezuka, and Yutaka Ishikawa. "Overhead Analysis of Preemptive Gang Scheduling". Lecture Notes in Computer Science, 1459:217–230, April 1998.
  [Ry04] Kyung Dong Ryu, Nimish Pachapurkar, and Liana L. Fong. "Adaptive Memory Paging for Efficient Gang Scheduling of Parallel Applications". In Proceedings of IPDPS 2004.
  [Na99] S. Nagar, A. Banerjee, A. Sivasubramaniam, and C. R. Das. "Alternatives to Coscheduling a Network of Workstations". Journal of Parallel and Distributed Computing, 59(2):302–327, November 1999.
  [Ni02] Dimitrios S. Nikolopoulos and Constantine D. Polychronopoulos. "Adaptive Scheduling under Memory Pressure on Multiprogrammed Clusters". CCGRID 2002.
  [Sa04] Gyu Sang Choi, Jin-Ha Kim, Deniz Ersoz, Andy B. Yoo, and Chita R. Das. "Coscheduling in Clusters: Is It a Viable Alternative?". To appear in SC2004.
  [Se99] S. Setia, M. S. Squillante, and V. K. Naik. "The Impact of Job Memory Requirements on Gang-Scheduling Performance". ACM SIGMETRICS Performance Evaluation Review, 26(4):30–39, 1999.
  [So98] P. G. Sobalvarro, S. Pakin, W. E. Weihl, and A. A. Chien. "Dynamic Coscheduling on Workstation Clusters". In Proceedings of the IPPS Workshop on Job Scheduling Strategies for Parallel Processing, pages 231–256, March 1998.
  [St04] Peter Strazdins and John Uhlmann. "Local Scheduling Outperforms Gang Scheduling on a Beowulf Cluster". Technical report, Department of Computer Science, Australian National University, January 2004; to appear in Cluster 2004.
  [Wi03] Yair Wiseman and Dror G. Feitelson. "Paired Gang Scheduling". IEEE TPDS, June 2003.

  26. Related work: optimizations
  Memory management (mainly based on job admission control):
  • The Impact of Job Memory Requirements on Gang-Scheduling Performance [Se99] (control of multiprogramming)
  • Gang Scheduling with Memory Considerations [Ba00] (job admission control to avoid swapping)
  • Memory-aware co-scheduling [Ch04] (job admission control to avoid swapping)
  • Adaptive Memory Paging for Gang-Scheduling [Ry04] (improving memory paging in and out)
  Communications (concerns co-scheduling):
  • ICS (Implicit Co-scheduling), SB (Spin Blocking), CC (Coordinated Co-scheduling): self-descheduling after a timeout on communication [Ar98][Na99]
  • DCS (Dynamic Co-scheduling): an incoming message triggers the receiver's scheduler [So98]
  • PB (Periodic Boost): schedule the receiver based on a periodic check of the receive buffers [Na99]

  27. Is the in-core result kernel dependent (Linux)?
  Kernel 2.4.2 was used in our experiments. How does time-sharing efficiency evolve with Linux kernel maturation (from 2.4 to 2.6)?
  [Figure: computation, communication and execution times (in-core) for 1 to 5 concurrent CG A 4 and BT A 9 runs, under kernels 2.4.2, 2.6.2 and 2.6.7.]
  Yes, the in-core performance of co-scheduling depends on the kernel:
  1) Kernel 2.6.2 is less efficient (much less so for CG)
  2) Kernels 2.6.7 and 2.4.2 provide overall similar performance
  → Careful selection of the kernel version, or restriction (deactivation) of co-scheduling.
