
Dynamic Multi Phase Scheduling for Heterogeneous Clusters

20th International Parallel and Distributed Processing Symposium, 25-29 April 2006. Florina M. Ciorba†, Theodore Andronikos†, Ioannis Riakiotakis†, Anthony T. Chronopoulos‡ and George Papakonstantinou†.


Presentation Transcript


  1. 20th International Parallel and Distributed Processing Symposium • 25-29 April 2006 Dynamic Multi Phase Scheduling for Heterogeneous Clusters Florina M. Ciorba†, Theodore Andronikos†, Ioannis Riakiotakis†, Anthony T. Chronopoulos‡ and George Papakonstantinou† • † National Technical University of Athens • Computing Systems Laboratory • ‡ University of Texas at San Antonio • cflorina@cslab.ece.ntua.gr • www.cslab.ece.ntua.gr

  2. Outline • Introduction • Notation • Some existing self-scheduling algorithms • Dynamic self-scheduling for dependence loops • Implementation and test results • Conclusions • Future work IPDPS 2006

  3. Introduction • Motivation for dynamically scheduling loops with dependencies: • Existing dynamic algorithms cannot cope with dependencies, because they lack inter-slave communication • Static algorithms are not always efficient • In their original form, if dynamic algorithms are applied to loops with dependencies, they yield a serial or invalid execution

  4. Outline • Introduction • Notation • Some existing self-scheduling algorithms • Dynamic self-scheduling for dependence loops • Implementation and test results • Conclusions • Future work

  5. Notation • Algorithmic model:
FOR (i1=l1; i1<=u1; i1++)
  FOR (i2=l2; i2<=u2; i2++)
    …
    FOR (in=ln; in<=un; in++)
      Loop Body
    ENDFOR
    …
  ENDFOR
ENDFOR
• Perfectly nested loops • Constant flow data dependencies • General program statements within the loop body • J – index space of an n-dimensional uniform dependence loop
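The algorithmic model above can be made concrete with a small C sketch (the dimensions, boundary values, and update rule are hypothetical, chosen only for illustration): a doubly nested loop whose body depends on iterations (i1-1, i2) and (i1, i2-1), i.e. constant flow dependence vectors (1,0) and (0,1).

```c
#include <assert.h>

#define U1 8
#define U2 8

/* 2D uniform-dependence loop: A[i1][i2] depends on A[i1-1][i2] and
   A[i1][i2-1], i.e. constant flow dependence vectors (1,0) and (0,1). */
double run_dependence_loop(void) {
    static double A[U1 + 1][U2 + 1];
    for (int i = 0; i <= U1; i++) A[i][0] = 1.0;   /* boundary values */
    for (int j = 0; j <= U2; j++) A[0][j] = 1.0;
    for (int i1 = 1; i1 <= U1; i1++)          /* outer loop, dimension u1 */
        for (int i2 = 1; i2 <= U2; i2++)      /* inner loop, dimension u2 */
            A[i1][i2] = 0.5 * (A[i1 - 1][i2] + A[i1][i2 - 1]);
    return A[U1][U2];
}
```

Each iteration reads only already-computed neighbours, which is exactly the property the dependence vectors encode.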

  6. Notation • u1 – synchronization dimension, un – scheduling dimension • – set of dependence vectors • PE – processing element • P1,...,Pm – slaves • N – number of scheduling steps • Ci – chunk size at the i-th scheduling step • Vi – size (iteration-wise) of Ci along scheduling dimension un • VPk – virtual computing power of slave Pk • Qk – number of processes in the run-queue of slave Pk • Ak – available computing power of slave Pk • A – total available computing power of the cluster

  7. Outline • Introduction • Notation • Some existing self-scheduling algorithms • Dynamic self-scheduling for dependence loops • Implementation and test results • Conclusions • Future work

  8. Some existing self-scheduling algorithms • 3 self-scheduling algorithms: • CSS – Chunk Self-Scheduling, Ci = constant • TSS – Trapezoid Self-Scheduling, Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×m) and the last chunk is L = 1 • DTSS – Distributed TSS, Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×A) and the last chunk is L = 1 • CSS and TSS are devised for homogeneous systems • DTSS improves on TSS for heterogeneous systems by selecting the chunk sizes according to: • the virtual computing power of the slaves, VPk • the number of processes in the run-queue of each PE, Qk
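As a sketch of the TSS rule above (F, L, N and D follow the slide; the integer arithmetic and the clamping of the final chunk are my assumptions):

```c
#include <assert.h>

/* Generate TSS chunk sizes for |J| iterations on m slaves:
   F = |J|/(2*m), L = 1, N = 2*|J|/(F+L), D = (F-L)/(N-1),
   C_i = C_{i-1} - D, clamped so the chunks cover J exactly. */
int tss_chunks(int J, int m, int chunks[], int max_chunks) {
    int F = J / (2 * m);
    int L = 1;
    int N = 2 * J / (F + L);
    int D = (F - L) / (N - 1);
    int remaining = J, c = F, n = 0;
    while (remaining > 0 && n < max_chunks) {
        if (c > remaining) c = remaining;
        if (c < L) c = L;
        chunks[n++] = c;
        remaining -= c;
        c -= D;
    }
    return n; /* number of scheduling steps actually taken */
}
```

With |J| = 5000 and m = 10 this yields F = 250 and a decreasing sequence of chunks that covers J exactly.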

  9. Some existing self-scheduling algorithms • |J| = 5000×10000 • m = 10 slaves • CSS and TSS give the same chunk sizes in both dedicated and non-dedicated systems • DTSS adjusts the chunk sizes to match the different Ak of the slaves
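The adjustment described above can be sketched with a simplified version of the DTSS weighting idea: each requesting slave receives roughly Ak times the current reference chunk. (The exact decrement bookkeeping of DTSS is not reproduced here; this is an illustration, not the paper's algorithm.)

```c
#include <assert.h>

/* Simplified DTSS-style weighting: a slave with available power Ak
   receives Ak times the current reference chunk, clamped to [1, remaining].
   Sketch only; the exact chunk formula follows the DTSS paper. */
int dtss_chunk(double Ak, int ref_chunk, int remaining) {
    int c = (int)(Ak * ref_chunk);
    if (c < 1) c = 1;
    if (c > remaining) c = remaining;
    return c;
}
```

A zealot-class slave (Ak = 1.5) thus gets three times the chunk of a loaded kid-class slave (Ak = 0.5), which is the behaviour the slide's comparison illustrates.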

  10. Outline • Introduction • Notation • Some existing self-scheduling algorithms • Dynamic self-scheduling for dependence loops • Implementation and test results • Conclusions • Future work

  11. More notation • SP – synchronization point • M – number of SPs inserted along synchronization dimension u1 • H – interval (iteration-wise) between two SPs along u1 • H is the same for every chunk • SCi,j – the set of iterations of Ci between SPj-1 and SPj • Ci = Vi × M × H • Current slave – the slave assigned chunk Ci • Previous slave – the slave assigned chunk Ci-1
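The relation Ci = Vi × M × H is simple arithmetic; a sketch, with hypothetical values used in the example below (u1 spanning 10000 iterations and H = 500 gives M = 20):

```c
#include <assert.h>

/* C_i = V_i * M * H: a chunk of width V_i along the scheduling dimension,
   crossed by M synchronization intervals of H iterations each along u1. */
long chunk_size(int Vi, int M, int H) {
    return (long)Vi * M * H;
}
```

For example, chunk_size(100, 20, 500) gives 1,000,000 iterations, i.e. Vi times the full extent of the synchronization dimension.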

  12. Self-scheduling with synchronization • Chunks are formed along the scheduling dimension, here u2 • SPs are inserted along the synchronization dimension, u1 • Phase 1: Apply a self-scheduling algorithm along the scheduling dimension • Phase 2: Insert synchronization points along the synchronization dimension

  13. The inter-slave communication scheme • [Figure: chunks Ci-1, Ci, Ci+1 crossed by SPj, SPj+1, SPj+2; the sets SCi-1,j+1 and SCi,j+1 mark the points computed at moments t and t+1, and arrows indicate communication] • Ci-1 is assigned to Pk-1, Ci to Pk and Ci+1 to Pk+1 • When Pk reaches SPj+1, it sends Pk+1 only the data Pk+1 requires (i.e., those iterations imposed by the existing dependence vectors) • Afterwards, Pk receives from Pk-1 the data required for its current computation • Slaves do not reach a SP at the same time, which leads to a wavefront execution fashion
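The wavefront claim in the last bullet can be illustrated with a toy timing model (unit compute cost and negligible communication cost are assumptions made for illustration only): slave k starts a segment only after finishing its previous segment and after slave k-1 finishes the matching segment.

```c
#include <assert.h>

#define SLAVES 3
#define SEGS   4

/* Toy timing model: finish[k][j] is the time slave k finishes its j-th
   inter-SP segment, assuming unit compute cost per segment. A segment
   starts after the slave's previous segment and after the upstream
   slave's matching segment, as imposed by the SP data exchanges. */
int wavefront_finish(int finish[SLAVES][SEGS]) {
    for (int k = 0; k < SLAVES; k++)
        for (int j = 0; j < SEGS; j++) {
            int prev_seg = (j > 0) ? finish[k][j - 1] : 0;
            int upstream = (k > 0) ? finish[k - 1][j] : 0;
            finish[k][j] = (prev_seg > upstream ? prev_seg : upstream) + 1;
        }
    return finish[SLAVES - 1][SEGS - 1];
}
```

With 3 slaves and 4 segments the last slave finishes at time 3 + 4 − 1 = 6: each slave lags its predecessor by one segment, the staggered pattern sketched on the slide.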

  14. Dynamic Multi-Phase Scheduling DMPS(x) INPUT: (a) An n-dimensional dependence nested loop. (b) The choice of algorithm: CSS, TSS or DTSS. (c) If CSS is chosen, the chunk size Ci. (d) The synchronization interval H. (e) The number of slaves m; in the case of DTSS, the virtual power VPk of every slave. Master: Initialization: (M.a) Register slaves. In the case of DTSS, slaves report their Ak. (M.b) Calculate F, L, N, D for TSS and DTSS. For CSS use the given Ci. While there are unassigned iterations do: (M.1) If a request arrives, put it in the queue. (M.2) Pick a request from the queue and compute the next chunk size using CSS, TSS or DTSS. (M.3) Update the current and previous slave ids. (M.4) Send the id of the current slave to the previous one.

  15. Dynamic Multi-Phase Scheduling DMPS(x) Slave Pk: Initialization: (S.a) Register with the master. In the case of DTSS, report Ak. (S.b) Compute M according to the given H. (S.1) Send a request to the master. (S.2) Wait for a reply; if a chunk was received from the master, go to step 3, else go to OUTPUT. (S.3) While the next SP is not reached, compute chunk i. (S.4) If the id of the send-to slave is known, go to step 5, else go to step 6. (S.5) Send computed data to the send-to slave. (S.6) Receive data from the receive-from slave and go to step 3. OUTPUT Master: If there are no more chunks to be assigned, terminate. Slave Pk: If no more tasks come from the master, terminate.
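A serial sketch of the master's loop above (the round-robin request order and the TSS-style decrement D = 10 are arbitrary illustrative choices; names are hypothetical):

```c
#include <assert.h>

/* Serial sketch of the DMPS master loop: serve one request per step,
   hand out a TSS-style shrinking chunk, and track the current/previous
   slave ids as in steps (M.1)-(M.4). Returns total iterations assigned. */
int master_assign(int J, int m) {
    int remaining = J, chunk = J / (2 * m), D = 10;
    int current = -1, previous, assigned = 0, step = 0;
    while (remaining > 0) {
        int requester = step % m;          /* simulated request queue (M.1) */
        int c = chunk > remaining ? remaining : chunk;
        if (c < 1) c = 1;                  /* next chunk size (M.2) */
        previous = current;                /* update slave ids (M.3) */
        current = requester;
        (void)previous;                    /* (M.4) the master would tell
                                              `previous` the id of `current` */
        assigned += c;
        remaining -= c;
        if (chunk > D) chunk -= D;         /* TSS-style decrement */
        step++;
    }
    return assigned;
}
```

The invariant the real master must maintain is visible here: every iteration of J is assigned exactly once, whatever chunk rule is plugged in.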

  16. Dynamic Multi-Phase Scheduling DMPS(x) • Advantages of DMPS(x) • Can take as input any self-scheduling algorithm, without any modifications • Phase 2 is independent of Phase 1 • Phase 1 deals with the heterogeneity & load variation in the system • Phase 2 deals with minimizing the inter-slave communication cost • Suitable for any type of heterogeneous system

  17. Outline • Introduction • Notation • Some existing self-scheduling algorithms • Dynamic self-scheduling for dependence loops • Implementation and test results • Conclusions • Future work

  18. Implementation and testing setup • The algorithms are implemented in C and C++ • MPI is used for master-slave and inter-slave communication • The heterogeneous system consists of 10 machines: • 4 Intel Pentium III machines, 1266 MHz with 1 GB RAM (called zealots), assumed to have VPk = 1.5 (one of them is the master) • 6 Intel Pentium III machines, 500 MHz with 512 MB RAM (called kids), assumed to have VPk = 0.5 • The interconnection network is Fast Ethernet at 100 Mbit/s • Dedicated system: all machines are dedicated to running the program and no other loads are interposed during execution • Non-dedicated system: at the beginning of the program's execution, a resource-expensive process is started on some of the slaves, halving their Ak

  19. Implementation and testing setup • System configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6 • Three series of experiments for both dedicated & non-dedicated systems, for m = 3, 4, 5, 6, 7, 8, 9 slaves: • DMPS(CSS) • DMPS(TSS) • DMPS(DTSS) • Two real-life applications: heat equation, Floyd-Steinberg computation • Speedup Sp is computed as Sp = min{TPi} / TPAR, where TPi is the serial execution time on slave Pi, 1 ≤ i ≤ m, and TPAR is the parallel execution time (on m slaves) • In the plots of Sp, VP is used instead of m on the x-axis
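With TPi and TPAR as defined on this slide, the speedup takes the best serial time over the parallel time; a sketch (the timing values in the example are hypothetical):

```c
#include <assert.h>

/* Speedup on a heterogeneous system: the fastest slave's serial time
   over the parallel time, S_p = min_i(T_Pi) / T_PAR. */
double speedup(const double TP[], int m, double TPAR) {
    double min_tp = TP[0];
    for (int i = 1; i < m; i++)
        if (TP[i] < min_tp) min_tp = TP[i];
    return min_tp / TPAR;
}
```

For instance, with serial times {100, 300, 300, 100} and a parallel time of 50, the speedup is 100/50 = 2.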

  20. Performance results – Heat equation

  21. Performance results – Heat equation

  22. Performance results – Floyd-Steinberg

  23. Performance results – Floyd-Steinberg

  24. Interpretation of the results • Dedicated system: • as expected, all algorithms perform better on a dedicated system than on a non-dedicated one • DMPS(TSS) slightly outperforms DMPS(CSS) for parallel loops, because it provides better load balancing • DMPS(DTSS) outperforms both other algorithms because it explicitly accounts for the system's heterogeneity • Non-dedicated system: • DMPS(DTSS) stands out even more, since the other algorithms cannot handle extra load variations • The speedup for DMPS(DTSS) increases in all cases • H must be chosen so as to keep the communication/computation ratio < 1 for every test case • Even then, small variations in the value of H do not significantly affect the overall performance
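The H-selection rule in the last bullets can be illustrated with a toy cost model (the coefficients and the linear cost shapes are my assumptions, not the paper's analysis): between two SPs a slave computes Vi×H iterations but communicates only O(Vi) boundary iterations, so the ratio falls as H grows.

```c
#include <assert.h>

/* Toy cost model: per synchronization interval, computation costs
   c_comp * Vi * H and communication costs c_comm * Vi, so
   comm/comp = c_comm / (c_comp * H), independent of Vi. */
double comm_comp_ratio(double c_comm, double c_comp, double H) {
    return c_comm / (c_comp * H);
}
```

Under this model, any H larger than c_comm/c_comp keeps the ratio below 1, which is why moderate variations of H barely change overall performance.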

  25. Outline • Introduction • Notation • Some existing self-scheduling algorithms • Dynamic self-scheduling for dependence loops • Implementation and test results • Conclusions • Future work

  26. Conclusions • Loops with dependencies can now be dynamically scheduled on heterogeneous dedicated & non-dedicated systems • Distributed algorithms efficiently compensate for the system's heterogeneity for loops with dependencies, especially in non-dedicated systems

  27. Outline • Introduction • Notation • Some existing self-scheduling algorithms • Dynamic self-scheduling for dependence loops • Implementation and test results • Conclusions • Future work

  28. Future work • Establish a model for predicting the optimal synchronization interval H and minimizing the communication cost • Extend other self-scheduling algorithms so that they can handle loops with dependencies and account for the system's heterogeneity

  29. Thank you Questions?
