
Delivering High Performance to Parallel Applications Using Advanced Scheduling



Presentation Transcript


  1. Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris National Technical University of Athens Computing Systems Laboratory {ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr www.cslab.ece.ntua.gr

  2. Overview • Introduction • Background • Code Generation • Computation/Data Distribution • Communication Schemes • Summary • Experimental Results • Conclusions – Future Work Parallel Computing 2003

  3. Introduction • Motivation: • A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results! • There is no complete method for generating code for non-rectangular tiles Parallel Computing 2003

  4. Introduction • Contribution: • Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces • Simulation of blocking and non-blocking communication primitives • Experimental evaluation of proposed scheduling scheme Parallel Computing 2003

  5. Overview • Introduction • Background • Code Generation • Computation/Data Distribution • Communication Schemes • Summary • Experimental Results • Conclusions – Future Work Parallel Computing 2003

  6. Background
     Algorithmic Model:
         FOR j1 = min1 TO max1 DO
           ...
             FOR jn = minn TO maxn DO
               Computation(j1, ..., jn);
             ENDFOR
           ...
         ENDFOR
     • Perfectly nested loops
     • Constant flow data dependencies (D)
     Parallel Computing 2003

  7. Background Tiling: • Popular loop transformation • Groups iterations into atomic units • Enhances locality in uniprocessors • Enables coarse-grain parallelism in distributed memory systems • Valid tiling matrix H: Parallel Computing 2003

  8. Tiling Transformation
     Example:
         FOR j1 = 0 TO 11 DO
           FOR j2 = 0 TO 8 DO
             A[j1,j2] := A[j1-1,j2] + A[j1-1,j2-1];
           ENDFOR
         ENDFOR
     Parallel Computing 2003
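A minimal C sketch of the example above after rectangular 3x3 tiling (an illustration, not the generated code from the slides): the iteration space is shifted by one in both dimensions so the j1-1 / j2-1 reads stay inside the array, and the tile loops simply step over 3x3 blocks.

    #include <stdio.h>

    #define N1 12               /* j1 extent (0..11) */
    #define N2 9                /* j2 extent (0..8)  */
    #define TS 3                /* tile size in both dimensions */

    static double A[N1 + 1][N2 + 1];    /* row/column 0 holds boundary values */

    int main(void)
    {
        for (int t1 = 1; t1 <= N1; t1 += TS)                      /* tile origins */
            for (int t2 = 1; t2 <= N2; t2 += TS)
                for (int j1 = t1; j1 < t1 + TS && j1 <= N1; j1++) /* points in tile */
                    for (int j2 = t2; j2 < t2 + TS && j2 <= N2; j2++)
                        A[j1][j2] = A[j1 - 1][j2] + A[j1 - 1][j2 - 1];

        printf("A[%d][%d] = %g\n", N1, N2, A[N1][N2]);
        return 0;
    }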

  9. Rectangular Tiling Transformation
     (figure: the j1-j2 iteration space partitioned into 3x3 rectangular tiles)

         P = | 3  0 |        H = | 1/3   0  |
             | 0  3 |            |  0   1/3 |

     Parallel Computing 2003

  10. Non-rectangular Tiling Transformation
      (figure: the j1-j2 iteration space partitioned into parallelogram tiles)

          P = | 3  3 |        H = | 1/3  -1/3 |
              | 0  3 |            |  0    1/3 |

      Parallel Computing 2003
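As a hedged illustration of how the tiling matrix is applied, the snippet below maps an iteration point j to its tile as floor(H * j), with the non-rectangular H of slide 10 hard-coded; the chosen point and the program structure are purely illustrative.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* non-rectangular tiling matrix H from slide 10 */
        const double H[2][2] = { { 1.0 / 3, -1.0 / 3 },
                                 { 0.0,      1.0 / 3 } };
        const int j[2] = { 7, 4 };          /* an arbitrary iteration point */
        int tile[2];

        for (int r = 0; r < 2; r++)         /* tile coordinates = floor(H * j) */
            tile[r] = (int)floor(H[r][0] * j[0] + H[r][1] * j[1]);

        printf("iteration (%d,%d) belongs to tile (%d,%d)\n",
               j[0], j[1], tile[0], tile[1]);
        return 0;
    }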

  11. Why Non-rectangular Tiling? • Reduces communication: 8 communication points (rectangular) vs. 6 (non-rectangular) • Enables more efficient scheduling schemes: 6 time steps (rectangular) vs. 5 (non-rectangular) Parallel Computing 2003

  12. Overview • Introduction • Background • Code Generation • Computation/Data Distribution • Communication Schemes • Summary • Experimental Results • Conclusions – Future Work Parallel Computing 2003

  13. Computation Distribution • We map tiles along the longest dimension to the same processor because: • it reduces the number of processors required • it simplifies message passing • it reduces total execution time when computation is overlapped with communication (see the sketch below) Parallel Computing 2003
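The sketch below illustrates this mapping under two assumptions not spelled out on the slide: the tile space is rectangular, and processor ranks are numbered in row-major order over every tile dimension except the longest one. The helper name owner_rank is hypothetical.

    /* rank that owns a tile: all tiles sharing the same coordinates in every
       dimension except the longest one map to the same rank */
    int owner_rank(const int tile[], int ndims, int longest_dim,
                   const int tiles_per_dim[])
    {
        int rank = 0;
        for (int d = 0; d < ndims; d++) {
            if (d == longest_dim)
                continue;                 /* the longest dimension stays local */
            rank = rank * tiles_per_dim[d] + tile[d];
        }
        return rank;
    }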

  14. Computation Distribution (figure: rows of tiles along j1 assigned to processors P1, P2, P3 in the j1-j2 space) Parallel Computing 2003

  15. Data Distribution • Computer-owns rule: Each processor owns the data it computes • Arbitrary convex iteration space, arbitrary tiling • Rectangular local iteration and data spaces Parallel Computing 2003
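A rough sketch of the computer-owns idea, assuming the common layout in which each process keeps a rectangular local array covering the iterations it computes plus halo space for data arriving from its neighbours; the helper name and the padding scheme are illustrative, not the paper's exact local-space formulas.

    #include <stdlib.h>

    /* allocate a local_n1 x local_n2 block of owned points, padded by
       halo1 extra rows and halo2 extra columns for received data */
    double *alloc_local_space(int local_n1, int local_n2, int halo1, int halo2)
    {
        size_t rows = (size_t)local_n1 + (size_t)halo1;
        size_t cols = (size_t)local_n2 + (size_t)halo2;
        return calloc(rows * cols, sizeof(double));
    }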

  16. Data Distribution Parallel Computing 2003

  17. Data Distribution Parallel Computing 2003

  18. Data Distribution Parallel Computing 2003

  19. Overview • Introduction • Background • Code Generation • Computation/Data Distribution • Communication Schemes • Summary • Experimental Results • Conclusions – Future Work Parallel Computing 2003

  20. Communication Schemes • With whom do I communicate? (figure: neighbouring processors P1, P2, P3 in the j1-j2 tiled space) Parallel Computing 2003

  21. Communication Schemes • With whom do I communicate? (figure, continued: processors P1, P2, P3 in the j1-j2 tiled space) Parallel Computing 2003

  22. Communication Schemes • What do I send? Parallel Computing 2003
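The answer on this slide is given as a figure. As a hedged stand-in, the sketch below packs the locally computed rows that lie within the dependence distance of the tile border into a send buffer; packing a whole boundary slab is a simplification, since the generated code sends exactly the points the neighbouring processor needs.

    /* pack the last dep_len rows of an n1 x n2 local block (row-major)
       into sendbuf; returns the number of packed elements */
    int pack_boundary(const double *local, int n1, int n2,
                      int dep_len, double *sendbuf)
    {
        int k = 0;
        for (int i = n1 - dep_len; i < n1; i++)
            for (int j = 0; j < n2; j++)
                sendbuf[k++] = local[i * n2 + j];
        return k;
    }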

  23. Blocking Scheme: 12 time steps (figure: tile schedule across processors P1, P2, P3 in the j1-j2 space) Parallel Computing 2003

  24. Non-blocking Scheme: 6 time steps (figure: tile schedule across processors P1, P2, P3 in the j1-j2 space) Parallel Computing 2003

  25. Overview • Introduction • Background • Code Generation • Computation/Data Distribution • Communication Schemes • Summary • Experimental Results • Conclusions – Future Work Parallel Computing 2003

  26. Code Generation Summary: Sequential Code -> [Tiling, driven by Dependence Analysis and the Tiling Transformation] -> Sequential Tiled Code -> [Parallelization: Computation Distribution, Data Distribution, Communication Primitives] -> Parallel SPMD Code • Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme Parallel Computing 2003

  27. Code Summary – Blocking Scheme Parallel Computing 2003
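The slide's code listing is not reproduced in this transcript. The fragment below is a minimal MPI-style sketch of the blocking pattern it summarises: for each tile, receive the data the tile depends on, compute it, then send its boundary onward, so communication and computation never overlap. compute_tile stands in for the generated tile kernel.

    #include <mpi.h>

    void blocking_sweep(double *halo, double *boundary, int count,
                        int pred, int succ, int ntiles)
    {
        MPI_Status status;
        for (int t = 0; t < ntiles; t++) {
            if (pred != MPI_PROC_NULL)       /* wait for the data this tile needs */
                MPI_Recv(halo, count, MPI_DOUBLE, pred, t,
                         MPI_COMM_WORLD, &status);

            /* compute_tile(t);  -- hypothetical per-tile kernel */

            if (succ != MPI_PROC_NULL)       /* forward this tile's boundary data */
                MPI_Send(boundary, count, MPI_DOUBLE, succ, t, MPI_COMM_WORLD);
        }
    }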

  28. Code Summary – Non-blocking Scheme Parallel Computing 2003
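Likewise, only the slide title survives in the transcript; this is a hedged sketch of the non-blocking pattern: the receive for a later tile and the send of the previous tile's results are posted first, the current tile is computed while those transfers progress, and the pending requests are completed at the end of the step. Start-up and drain steps are simplified.

    #include <mpi.h>

    void nonblocking_sweep(double *halo, double *boundary, int count,
                           int pred, int succ, int ntiles)
    {
        for (int t = 0; t < ntiles; t++) {
            MPI_Request req[2];
            MPI_Status  stat[2];
            int nreq = 0;

            if (pred != MPI_PROC_NULL && t + 1 < ntiles)  /* prefetch data for tile t+1 */
                MPI_Irecv(halo, count, MPI_DOUBLE, pred, t + 1,
                          MPI_COMM_WORLD, &req[nreq++]);
            if (succ != MPI_PROC_NULL && t > 0)           /* ship results of tile t-1 */
                MPI_Isend(boundary, count, MPI_DOUBLE, succ, t - 1,
                          MPI_COMM_WORLD, &req[nreq++]);

            /* compute_tile(t);  -- hypothetical kernel, overlapped with the transfers */

            MPI_Waitall(nreq, req, stat);
        }
    }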

  29. Overview • Introduction • Background • Code Generation • Computation/Data Distribution • Communication Schemes • Summary • Experimental Results • Conclusions – Future Work Parallel Computing 2003

  30. Experimental Results • 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel 2.4.20) • MPICH v.1.2.5 (--with-device=p4, --with-comm=shared) • g++ compiler v.2.95.4 (-O3) • Fast Ethernet interconnection • 2 micro-kernel benchmarks (3D): • Gauss Successive Over-Relaxation (SOR) • Texture Smoothing Code (TSC) • Simulation of communication schemes Parallel Computing 2003

  31. SOR • Iteration space M x N x N • Dependence matrix: • Rectangular Tiling: • Non-rectangular Tiling: Parallel Computing 2003
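To make the M x N x N iteration space concrete, here is a textbook-style in-place SOR sweep in C; the relaxation factor w and the 5-point update are generic assumptions for illustration and are not taken from the slides, whose dependence and tiling matrices appear only as figures.

    /* M successive over-relaxation sweeps over an N x N grid u (row-major) */
    void sor(double *u, int N, int M, double w)
    {
        for (int t = 0; t < M; t++)                 /* outermost dimension: sweeps */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    u[i * N + j] = (1.0 - w) * u[i * N + j]
                                 + 0.25 * w * (u[(i - 1) * N + j] + u[(i + 1) * N + j]
                                             + u[i * N + j - 1]   + u[i * N + j + 1]);
    }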

  32. SOR Parallel Computing 2003

  33. SOR Parallel Computing 2003

  34. TSC • Iteration space T x N x N • Dependence matrix: • Rectangular Tiling: • Non-rectangular Tiling: Parallel Computing 2003

  35. TSC Parallel Computing 2003

  36. TSC Parallel Computing 2003

  37. Overview • Introduction • Background • Code Generation • Computation/Data Distribution • Communication Schemes • Summary • Experimental Results • Conclusions – Future Work Parallel Computing 2003

  38. Conclusions • Automatic code generation for arbitrary tiled spaces can be efficient • High performance can be achieved by means of • a suitable tiling transformation • overlapping computation with communication Parallel Computing 2003

  39. Future Work • Application of methodology to imperfectly nested loops and non-constant dependencies • Investigation of hybrid programming models (MPI+OpenMP) • Performance evaluation on advanced interconnection networks (SCI, Myrinet) Parallel Computing 2003

  40. Questions? http://www.cslab.ece.ntua.gr/~ndros Parallel Computing 2003
