
Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes

Maria Athanasaki, Evangelos Koukis, Nectarios Koziris. National Technical University of Athens, School of Electrical and Computer Engineering, Computing Systems Laboratory.


Presentation Transcript


  1. Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes
      Maria Athanasaki, Evangelos Koukis, Nectarios Koziris
      National Technical University of Athens, School of Electrical and Computer Engineering, Computing Systems Laboratory

  2. Previous work
      • M. Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris, "Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces," SuperComputing Conference on High Performance Networking and Computing (SC2002), Baltimore, Maryland, November 16-22, 2002.
      • G. Goumas, A. Sotiropoulos, N. Koziris, "Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping," Proceedings of the 2001 International Parallel and Distributed Processing Symposium (IPDPS2001), IEEE Press, San Francisco, California, April 2001.

  3. Overview
      • Tiling for parallelization
      • Non-overlapping vs. overlapping execution scheme
      • Grouping
      • Application on a cluster of SMPs with a fixed number of nodes
      • Experimental and simulation results

  4. Nested For-Loops
        for (i1=l1; i1<=u1; i1++)
          for (i2=l2; i2<=u2; i2++)
            ...
              for (in=ln; in<=un; in++) {
                Loop Body
              }

  5. Dependence Vectors
        for (i1=0; i1<=7; i1++)
          for (i2=0; i2<=7; i2++)
            A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
      Dependence vectors: (1,0) from A[i1-1][i2] and (0,1) from A[i1][i2-1].
      [Figure: the 8 × 8 iteration space (axes i1, i2) with its dependence vectors]
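The dependence vectors determine whether the loop nest can be tiled directly. The sketch below checks a sufficient condition for rectangular tiling: every dependence vector must be componentwise non-negative. The type `dep_vec` and the function `rectangular_tiling_legal` are hypothetical names introduced for illustration, and the check covers only 2-deep nests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical distance vector of a 2-deep loop nest (i1, i2).
   In slide 5, A[i1-1][i2] gives (1,0) and A[i1][i2-1] gives (0,1). */
typedef struct {
    int d1; /* distance along i1 */
    int d2; /* distance along i2 */
} dep_vec;

/* Sufficient legality test for rectangular tiling: if every dependence
   vector is componentwise non-negative, the loops are fully permutable and
   axis-aligned tiles can run atomically in lexicographic order. */
bool rectangular_tiling_legal(const dep_vec *deps, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        if (deps[k].d1 < 0 || deps[k].d2 < 0)
            return false; /* a negative component may require skewing first */
    }
    return true;
}
```

For the slide's vectors {(1,0), (0,1)} the function returns true. A vector such as (1,-1) makes it return false.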

  6. Tiling
      [Figure: the iteration space (axes i1, i2) partitioned into tiles]
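The slide shows tiling only as a picture. The following sketch makes it concrete for the slide-5 recurrence. Two outer loops walk the tile space and two inner loops walk the iterations inside one tile, and the tiled result is then compared with the untiled one. The assumptions are marked in the comments: a halo row and column of ones, index shifting by one so that `A[i1-1]` stays in bounds, and the helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N 8        /* points per dimension (0..7 in slide 5) */
#define SZ (N + 1) /* index 0 is an assumed halo; slide index i maps to i+1 */

/* Assumed boundary condition: halo row and column hold 1. */
static void init(long A[SZ][SZ])
{
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++)
            A[i][j] = (i == 0 || j == 0) ? 1 : 0;
}

/* The untiled loop nest of slide 5. */
static void run_untiled(long A[SZ][SZ])
{
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}

/* Rectangular T1 x T2 tiling: outer loops over tiles, inner loops over the
   points of one tile; partial border tiles are clipped by "<= N". */
static void run_tiled(long A[SZ][SZ], int T1, int T2)
{
    for (int t1 = 1; t1 <= N; t1 += T1)
        for (int t2 = 1; t2 <= N; t2 += T2)
            for (int i1 = t1; i1 < t1 + T1 && i1 <= N; i1++)
                for (int i2 = t2; i2 < t2 + T2 && i2 <= N; i2++)
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}

/* Hypothetical check: runs both versions and reports whether they agree;
   *corner receives the tiled result A[N][N]. */
bool tiling_preserves_result(int T1, int T2, long *corner)
{
    assert(T1 > 0 && T2 > 0);
    long ref[SZ][SZ], til[SZ][SZ];
    init(ref);
    init(til);
    run_untiled(ref);
    run_tiled(til, T1, T2);
    *corner = til[N][N];
    return memcmp(ref, til, sizeof ref) == 0;
}
```

With the assumed all-ones boundary, the recurrence builds Pascal's triangle. Both versions therefore end with A[N][N] = C(16, 8) = 12870, whatever tile sizes are chosen.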

  7. Tiling
      [Figure: tiles of the i1-i2 iteration space assigned to Processor 0 and Processor 1]

  8. Overview (same outline as slide 3)

  9. Non-Overlapping Scheme
      [Figure: tiles of the i1-i2 iteration space executed by Processor 0, Processor 1 and Processor 2 under the non-overlapping scheme]

  10. Non-Overlapping vs. Overlapping Scheme
      [Figure: execution timelines of processors P0 to P3, one panel per scheme]

  11. Overlapping Scheme
      [Figure: tiles of the i1-i2 iteration space executed by Processor 0, Processor 1 and Processor 2 under the overlapping scheme]
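A simplified cost model shows why overlapping can win. In the non-overlapping scheme a processor receives, computes and sends one after another. In the overlapping scheme communication proceeds while the CPU computes, so a step costs only the larger of the two. This per-step model and the function names are assumptions made for illustration. They are not the exact formulas of the talk, and startup and drain effects are ignored.

```c
#include <stdbool.h>

/* Assumed model: t_comp = time to compute one tile, t_comm = time to
   receive and send its boundary data. */
double step_cost(double t_comp, double t_comm, bool overlapping)
{
    if (overlapping)
        /* communication runs concurrently with computation */
        return t_comp > t_comm ? t_comp : t_comm;
    /* receive, compute and send strictly one after another */
    return t_comp + t_comm;
}

/* Completion time for a schedule of `steps` time steps. The two schemes may
   need different step counts, so the count is passed explicitly. */
double completion_time(long steps, double t_comp, double t_comm, bool overlapping)
{
    return (double)steps * step_cost(t_comp, t_comm, overlapping);
}
```

Take t_comp = 3 and t_comm = 2. Ten non-overlapping steps cost 50 time units. Twelve overlapping steps cost only 36.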

  12. Overview (same outline as slide 3)

  13. Generalization to SMPs – “Grouping”
      [Figure: four SMP nodes, SMP0 to SMP3, each with two CPUs (CPU0 and CPU1)]
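Grouping can be pictured as follows. Every run of consecutive tile rows that fits on one SMP node forms a group, so only dependences between groups need the network. The sketch below is a hypothetical illustration and rests on three assumptions: row q runs on global CPU q, each node holds `cpus_per_node` consecutive rows, and CPUs inside a node exchange data through shared memory.

```c
#include <stdbool.h>

/* Hypothetical grouping helper: tile row q runs on global CPU q, and every
   cpus_per_node consecutive rows form one group placed on one SMP node. */
int node_of_row(int q, int cpus_per_node)
{
    return q / cpus_per_node;
}

/* A dependence from row q to row q + 1 needs a network message only if the
   two rows belong to different groups (assumed: shared memory otherwise). */
bool crosses_network(int q, int cpus_per_node)
{
    return node_of_row(q, cpus_per_node) != node_of_row(q + 1, cpus_per_node);
}
```

With two CPUs per node, as in the figure, the dependences 0→1 and 2→3 stay inside a node. The dependences 1→2 and 3→4 cross the network.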

  14. Example: Grouping + Non-Overlapping Communication Scheme
      [Figure: tile space and the corresponding group space, with groups mapped to SMP node0 and SMP node1]
      Scheduling vector Π=(1,0)

  15. Example: Grouping + Overlapping Communication Scheme
      [Figure: tile space and the corresponding group space, with groups mapped to SMP node0 and SMP node1]
      Scheduling vector Π=(1,1)
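Under a linear schedule, group g executes at time step Π·g. The following minimal sketch counts the steps that the vectors Π=(1,0) and Π=(1,1) need. It assumes a rectangular group space starting at the origin and non-negative components of Π. The function names and the group-space extents in the usage are hypothetical.

```c
/* Time step of group g under the linear schedule pi: the dot product pi . g. */
long schedule_step(const long *pi, const long *g, int dims)
{
    long s = 0;
    for (int d = 0; d < dims; d++)
        s += pi[d] * g[d];
    return s;
}

/* Number of time steps for the group space [0, gmax[d]] in each dimension,
   assuming every component of pi is non-negative: the origin runs at step 0
   and gmax runs last. */
long schedule_length(const long *pi, const long *gmax, int dims)
{
    return schedule_step(pi, gmax, dims) + 1;
}
```

Take a hypothetical 10 × 4 group space (gmax = (9, 3)). Π=(1,0) needs 10 steps and Π=(1,1) needs 13. This compares step counts only. The total time also depends on how much each step costs in the corresponding scheme.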

  16. Overview (same outline as slide 3)

  17. Scheduling onto a Fixed Number of SMPs
      • Dynamic scheduling by the operating system
         – Run-time overhead for creating a large number of processes
         – Context switching slows down execution
      • Static scheduling at compile time

  18. Scheduling onto a Fixed Number of SMPs
      • Cyclic assignment schedule
      • Mirror assignment schedule
      • Cluster assignment schedule
      • Retiling

  19. Cyclic Assignment
      [Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; the four CPUs (CPU0 and CPU1 of SMP0 and SMP1) repeat in the same order across the tile rows]

  20. Cyclic Assignment
      [Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; the tile rows form chunks, and each chunk is spread over the four CPUs in the same order]
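The figures suggest a simple formula for cyclic assignment. Tile row t along the mapping dimension goes to global CPU t mod (nodes × cpus), and each run of nodes × cpus rows forms one chunk. This reading of the figures, the `placement` type and the function name are assumptions made for illustration.

```c
/* Hypothetical placement of a tile row: SMP node and CPU within that node. */
typedef struct {
    int node;
    int cpu;
} placement;

/* Cyclic assignment sketch: every chunk of nodes * cpus consecutive rows is
   distributed over the CPUs in the same order. */
placement cyclic_assign(int t, int nodes, int cpus)
{
    int p = t % (nodes * cpus);      /* global CPU index within the chunk */
    placement pl = { p / cpus, p % cpus };
    return pl;
}
```

On 2 nodes with 2 CPUs each, rows 0 to 3 go to SMP0 CPU0, SMP0 CPU1, SMP1 CPU0 and SMP1 CPU1. Row 4 starts the next chunk back on SMP0 CPU0.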

  21. Cyclic Assignment – Non-Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 under cyclic assignment, 2 SMP nodes with 2 CPUs each]

  22. Cyclic Assignment – Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 under cyclic assignment, 2 SMP nodes with 2 CPUs each]

  23. Cyclic Assignment – Communication
      [Figure: communication between CPUs and between chunks under cyclic assignment, 2 SMP nodes with 2 CPUs each]

  24. Scheduling onto a Fixed Number of SMPs (same list as slide 18)

  25. Mirror Assignment
      [Figure: mirror assignment on 2 SMP nodes with 2 CPUs each; successive chunks of tile rows visit the four CPUs in alternating (forward, then reversed) order]
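Mirror assignment can be read from the figure as a cyclic order that reverses on every other chunk. The sketch below encodes that reading. The `placement` type and the function name are hypothetical.

```c
/* Hypothetical placement of a tile row: SMP node and CPU within that node. */
typedef struct {
    int node;
    int cpu;
} placement;

/* Mirror assignment sketch: even chunks visit CPUs 0..P-1, odd chunks visit
   them in reverse, P-1..0 (P = nodes * cpus). */
placement mirror_assign(int t, int nodes, int cpus)
{
    int P = nodes * cpus;
    int chunk = t / P;
    int p = t % P;
    if (chunk % 2 == 1)
        p = P - 1 - p;               /* odd chunk: reversed CPU order */
    placement pl = { p / cpus, p % cpus };
    return pl;
}
```

On 2 nodes with 2 CPUs each, row 3 (the end of chunk 0) and row 4 (the start of chunk 1) both land on SMP1 CPU1. The dependence between them therefore stays on one CPU. A plain cyclic order would instead send row 4 back to SMP0 CPU0.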

  26. Mirror Assignment – Non-Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 under mirror assignment, 2 SMP nodes with 2 CPUs each]

  27. Mirror Assignment – Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 under mirror assignment, 2 SMP nodes with 2 CPUs each]

  28. Mirror Assignment – Communication
      [Figure: communication between CPUs under mirror assignment, 2 SMP nodes with 2 CPUs each]

  29. Scheduling onto a Fixed Number of SMPs (same list as slide 18)

  30. Cluster Assignment
      [Figure: cluster assignment on 2 SMP nodes with 2 CPUs each; several tiles are grouped into one larger “TILE”]

  31. Cluster Assignment
      [Figure: cluster assignment on 2 SMP nodes with 2 CPUs each, showing groups and the tiles they contain]
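One plausible reading of cluster assignment is that each CPU receives a contiguous run of tile rows. Those rows together act as one larger "TILE". The sketch below is an assumption drawn from the figures, not the talk's exact definition, and its names are hypothetical.

```c
/* Hypothetical placement of a tile row: SMP node and CPU within that node. */
typedef struct {
    int node;
    int cpu;
} placement;

/* Cluster assignment sketch: rows 0..total_rows-1 are split into contiguous
   blocks of ceil(total_rows / P) rows, one block per CPU (P = nodes * cpus).
   Assumes total_rows >= 1 and 0 <= t < total_rows. */
placement cluster_assign(int t, int total_rows, int nodes, int cpus)
{
    int P = nodes * cpus;
    int per_cpu = (total_rows + P - 1) / P;  /* ceil(total_rows / P) */
    int p = t / per_cpu;                     /* index of the block holding t */
    placement pl = { p / cpus, p % cpus };
    return pl;
}
```

With 8 rows on 2 nodes × 2 CPUs, each CPU gets two adjacent rows. Only the boundaries between blocks then need inter-CPU communication.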

  32. Cluster Assignment – Non-Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 under cluster assignment, 2 SMP nodes with 2 CPUs each]

  33. Cluster Assignment – Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 under cluster assignment, 2 SMP nodes with 2 CPUs each]

  34. Cluster Assignment – Communication
      [Figure: communication between groups of tiles under cluster assignment, 2 SMP nodes with 2 CPUs each]

  35. Scheduling onto a Fixed Number of SMPs (same list as slide 18)

  36. Retiling
      [Figure: old tiles and new tiles on 2 SMP nodes (SMP0, SMP1) with 2 CPUs each]

  37. Retiling
      [Figure: old tiles and new tiles on 2 SMP nodes (SMP0, SMP1) with 2 CPUs each; the new tiles retain the computation volume of an old tile]
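Retiling can be expressed as arithmetic on the tile shape. The mapped dimension is reshaped so that it splits into exactly one tile row per CPU. The other side of the tile is then rescaled so that the tile keeps roughly its old computation volume, as the slide states. The one-row-per-CPU target, the 2-D `tile_shape` type and the function name are assumptions made for illustration.

```c
/* Hypothetical 2-D tile shape: h along the mapped dimension, w along the other. */
typedef struct {
    long h;
    long w;
} tile_shape;

/* Retiling sketch: pick the new height so that the mapped dimension (extent H)
   yields one tile row per CPU, then pick the width that keeps h * w as close
   as possible to the old tile's computation volume. */
tile_shape retile(long H, int nodes, int cpus_per_node, tile_shape old)
{
    long P = (long)nodes * cpus_per_node;
    tile_shape nt;
    nt.h = (H + P - 1) / P;              /* ceil(H / P) */
    long volume = old.h * old.w;         /* iterations in one old tile */
    nt.w = (volume + nt.h / 2) / nt.h;   /* nearest integer width */
    if (nt.w < 1)
        nt.w = 1;
    return nt;
}
```

Take a height of 1024 on 2 nodes × 2 CPUs. A 64 × 64 tile becomes 256 × 16, and the volume stays at 4096 iterations.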

  38. Retiling – Non-Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 after retiling, 2 SMP nodes with 2 CPUs each]

  39. Retiling – Overlapping Communication
      [Figure: execution timeline (t) of CPU0 and CPU1 on SMP0 and SMP1 after retiling, 2 SMP nodes with 2 CPUs each]

  40. Retiling – Communication
      [Figure: communication between CPUs after retiling, 2 SMP nodes with 2 CPUs each]

  41. Overview (same outline as slide 3)

  42. Experimental Platform
      • Linux SMP (Symmetric Multiprocessor) cluster
      • 2 nodes
      • 1 GB RAM
      • 2 Pentium III 1266 MHz
      • Myrinet high-performance interconnect
      • GM low-level message-passing system

  43. The Myrinet Interconnect
      • User-level networking
      • Based on the GM message-passing interface
      • All message exchange uses DMA
         – Directly to/from pinned user-space buffers
      • Communication is offloaded to the NIC
      • Programmable NIC
         – LANai RISC processor @ 133-333 MHz
         – 2-8 MB SRAM
      • 2+2 Gbps full-duplex fiber links

  44. GM Architecture
      [Figure: GM software stack: Application and GM Library (user space), GM kernel module (kernel), GM firmware (NIC)]
      • Consists of three main parts
         – User library
         – Kernel driver
         – Firmware on the NIC
      • OS-bypass design
         – Regions of NIC memory are mapped into the virtual memory (VM) of a process

  45. Sending and Receiving Messages over Myrinet/GM
      [Figure: sending and receiving applications, each with a host buffer and event queue; on each NIC (LANai), send and receive queues, host DMA, and send/receive DMA engines]

  46. Initial Code
        for (i=1; i<=X; i++)
          for (j=1; j<=Y; j++)
            for (k=1; k<=Z; k++) {
              A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
            }
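The following sketch makes the initial code runnable and compares it with a version tiled in all three dimensions. The slide leaves `func` unspecified. Here it is a hypothetical three-way sum in unsigned arithmetic, where wraparound is well defined. The problem sizes, the all-ones boundary planes and the helper names are also assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define X 12 /* assumed problem sizes */
#define Y 10
#define Z 9

/* Hypothetical loop body; the slide does not define func. */
static unsigned func(unsigned a, unsigned b, unsigned c)
{
    return a + b + c; /* unsigned overflow wraps, so this is well defined */
}

/* Index 0 in each dimension is an assumed boundary plane holding 1. */
static unsigned A_ref[X + 1][Y + 1][Z + 1];
static unsigned A_til[X + 1][Y + 1][Z + 1];

static void init(unsigned A[X + 1][Y + 1][Z + 1])
{
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                A[i][j][k] = (i == 0 || j == 0 || k == 0) ? 1u : 0u;
}

static void body(unsigned A[X + 1][Y + 1][Z + 1], int i, int j, int k)
{
    A[i][j][k] = func(A[i - 1][j][k], A[i][j - 1][k], A[i][j][k - 1]);
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* The initial loop nest of slide 46. */
static void run_untiled(unsigned A[X + 1][Y + 1][Z + 1])
{
    for (int i = 1; i <= X; i++)
        for (int j = 1; j <= Y; j++)
            for (int k = 1; k <= Z; k++)
                body(A, i, j, k);
}

/* Rectangular ti x tj x tk tiling; legal because all three dependence
   vectors, (1,0,0), (0,1,0) and (0,0,1), are non-negative. */
static void run_tiled(unsigned A[X + 1][Y + 1][Z + 1], int ti, int tj, int tk)
{
    for (int ii = 1; ii <= X; ii += ti)
        for (int jj = 1; jj <= Y; jj += tj)
            for (int kk = 1; kk <= Z; kk += tk)
                for (int i = ii; i <= min_int(ii + ti - 1, X); i++)
                    for (int j = jj; j <= min_int(jj + tj - 1, Y); j++)
                        for (int k = kk; k <= min_int(kk + tk - 1, Z); k++)
                            body(A, i, j, k);
}

/* Hypothetical check: true if the tiled run reproduces the initial code. */
bool tiled_matches_initial(int ti, int tj, int tk)
{
    assert(ti > 0 && tj > 0 && tk > 0);
    init(A_ref);
    init(A_til);
    run_untiled(A_ref);
    run_tiled(A_til, ti, tj, tk);
    return memcmp(A_ref, A_til, sizeof A_ref) == 0;
}

/* Reads an element of the reference result after tiled_matches_initial. */
unsigned value_at(int i, int j, int k) { return A_ref[i][j][k]; }
```

With the assumed boundary, A[1][1][1] = 1 + 1 + 1 = 3 and A[1][1][2] = 1 + 1 + 3 = 5. Any positive tile sizes should reproduce the untiled array exactly.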

  47. Experimental results
      [Figure: two panels, "Non-Overlapping Execution Scheme" and "Overlapping Execution Scheme"; y-axis: Speedup / # processors (0.5 to 1); x-axis: Height of Iteration Space (500 to 3500); curves: cyclic, mirror, cluster, retile]

  48. Simulation results
      [Figure: two panels, "Non-Overlapping Execution Scheme" and "Overlapping Execution Scheme"; y-axis: Speedup / # processors (0.3 to 1); x-axis: Height of Iteration Space (0 to 20000); curves: cyclic, mirror, cluster, retile]

  49. Simulation results
      [Figure: two panels, "Non-Overlapping Execution Scheme" and "Overlapping Execution Scheme"; y-axis: Speedup / # processors (0.3 to 1); x-axis: Height of Iteration Space (0 to 20000); curves: cyclic, mirror, cluster, retile]

  50. Advantages - Disadvantages
