
Presentation Transcript


  1. Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs Nikolaos Drosinos and Nectarios Koziris National Technical University of Athens Computing Systems Laboratory {ndros,nkoziris}@cslab.ece.ntua.gr www.cslab.ece.ntua.gr

  2. Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003

  3. Introduction • Motivation: • SMP clusters call for hybrid programming models • Existing hybrid approaches are mostly fine-grain MPI-OpenMP paradigms • Mostly limited to DOALL parallelization EuroPVM/MPI 2003

  4. Introduction • Contribution: • 3 programming models for the parallelization of nested loop algorithms: • pure MPI • fine-grain hybrid MPI-OpenMP • coarse-grain hybrid MPI-OpenMP • Advanced hyperplane scheduling: • minimizes synchronization needs • overlaps computation with communication EuroPVM/MPI 2003

  5. Introduction Algorithmic Model:

FOR j_0 = min_0 TO max_0 DO
  …
  FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
    Computation(j_0, …, j_{n-1});
  ENDFOR
  …
ENDFOR

• Perfectly nested loops • Constant flow data dependencies EuroPVM/MPI 2003

  6. Introduction Target Architecture: SMP clusters EuroPVM/MPI 2003

  7. Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003

  8. Pure MPI Model • Tiling transformation groups iterations into atomic execution units (tiles) • Pipelined execution • Overlapping computation with communication • Makes no distinction between inter-node and intra-node communication EuroPVM/MPI 2003

  9. Pure MPI Model Example:

FOR j1 = 0 TO 9 DO
  FOR j2 = 0 TO 7 DO
    A[j1,j2] := A[j1-1,j2] + A[j1,j2-1];
  ENDFOR
ENDFOR
EuroPVM/MPI 2003
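For reference, a plain sequential C version of this example could look like the sketch below. The zero-padded boundary row/column and the assumed boundary values are illustrative choices, not something specified in the slides.

#include <stdio.h>

#define N1 10   /* j1 = 0 .. 9 */
#define N2 8    /* j2 = 0 .. 7 */

/* Indices are shifted by one: row 0 and column 0 stand in for the
 * A[j1-1][.] and A[.][j2-1] accesses at the borders.                */
static double A[N1 + 1][N2 + 1];

int main(void)
{
    /* Assumed boundary values (the slides leave them unspecified).  */
    for (int j1 = 0; j1 <= N1; j1++) A[j1][0] = 1.0;
    for (int j2 = 0; j2 <= N2; j2++) A[0][j2] = 1.0;

    /* Constant flow dependencies: every point needs its "north" and
     * "west" neighbour, which is what forces pipelined execution.   */
    for (int j1 = 1; j1 <= N1; j1++)
        for (int j2 = 1; j2 <= N2; j2++)
            A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

    printf("A[%d][%d] = %g\n", N1, N2, A[N1][N2]);
    return 0;
}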

  10. Pure MPI Model [figure: the j1 × j2 tile space distributed over 4 MPI nodes (NODE0/NODE1, each with CPU0/CPU1)] EuroPVM/MPI 2003

  11. Pure MPI Model [figure: the j1 × j2 tile space distributed over 4 MPI nodes (NODE0/NODE1, each with CPU0/CPU1)] EuroPVM/MPI 2003

  12. Pure MPI Model

tile_0 = nod_0;
…
tile_{n-2} = nod_{n-2};
FOR tile_{n-1} = 0 TO … DO
  Pack(snd_buf, tile_{n-1} - 1, nod);
  MPI_Isend(snd_buf, dest(nod));
  MPI_Irecv(recv_buf, src(nod));
  Compute(tile);
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, nod);
END FOR
EuroPVM/MPI 2003
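As a concrete illustration of the pipelined tile execution, here is a hedged C/MPI sketch for the 2-D example of slide 9. It block-distributes j1 over the MPI processes and tiles j2; the constants X, Y and TILE, the zero boundary values and the file name are assumptions. For simplicity only the outgoing boundary row is overlapped with computation (via MPI_Isend), whereas the scheme in the slides also overlaps the receive side through the tiling transformation.

/* Build with something like: mpicc -O3 -o pure_mpi pure_mpi.c          */
#include <mpi.h>

#define X    1024    /* j1 extent (assumed divisible by #processes)     */
#define Y    1024    /* j2 extent (assumed divisible by TILE)           */
#define TILE 64
#define NT   (Y / TILE)

/* Full-size array on every rank keeps the sketch short; row 0 and
 * column 0 hold boundary values (zero by static initialization).       */
static double A[X + 1][Y + 1];

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = X / size;                    /* local j1 block: rows 1..rows */
    MPI_Request snd_req = MPI_REQUEST_NULL;

    for (int t = 0; t < NT; t++) {
        int j2lo = t * TILE + 1, j2hi = (t + 1) * TILE;

        /* Boundary row of this tile arrives from the upstream process.  */
        if (rank > 0)
            MPI_Recv(&A[0][j2lo], TILE, MPI_DOUBLE, rank - 1, t,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Compute the current tile.                                     */
        for (int j1 = 1; j1 <= rows; j1++)
            for (int j2 = j2lo; j2 <= j2hi; j2++)
                A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

        /* Forward this tile's last row downstream without blocking, so
         * the send overlaps with the computation of the next tile.      */
        if (rank + 1 < size) {
            MPI_Wait(&snd_req, MPI_STATUS_IGNORE);
            MPI_Isend(&A[rows][j2lo], TILE, MPI_DOUBLE, rank + 1, t,
                      MPI_COMM_WORLD, &snd_req);
        }
    }
    MPI_Wait(&snd_req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}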

  13. Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003

  14. Hyperplane Scheduling • Implements coarse-grain parallelism assuming inter-tile data dependencies • Tiles are organized into data-independent subsets (groups) • Tiles of the same group can be concurrently executed by multiple threads • Barrier synchronization between threads EuroPVM/MPI 2003

  15. Hyperplane Scheduling [figure: the j1 × j2 tile space mapped onto 2 MPI nodes (NODE0/NODE1) × 2 OpenMP threads (CPU0/CPU1)] EuroPVM/MPI 2003

  16. Hyperplane Scheduling [figure: the j1 × j2 tile space mapped onto 2 MPI nodes (NODE0/NODE1) × 2 OpenMP threads (CPU0/CPU1)] EuroPVM/MPI 2003

  17. Hyperplane Scheduling

#pragma omp parallel
{
  group_0 = nod_0;
  …
  group_{n-2} = nod_{n-2};
  tile_0 = nod_0 * m_0 + th_0;
  …
  tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
  FOR(group_{n-1}) {
    tile_{n-1} = group_{n-1} - …;
    if(0 <= tile_{n-1} <= …)
      compute(tile);
    #pragma omp barrier
  }
}
EuroPVM/MPI 2003
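A minimal OpenMP-only sketch of the hyperplane idea for a 2-D tile space follows. The grouping rule used here (all tiles whose coordinates sum to the group index form one group) is the standard wavefront choice for unit dependencies and is an assumption about the slides' exact formula; T1, T2 and compute_tile are placeholders.

/* Build with something like: gcc -fopenmp -o hyperplane hyperplane.c  */
#include <omp.h>
#include <stdio.h>

#define T1 4                        /* tiles along the first dimension  */
#define T2 6                        /* tiles along the second dimension */

static void compute_tile(int t1, int t2)  /* stand-in for the tile body */
{
    printf("thread %d computes tile (%d,%d)\n", omp_get_thread_num(), t1, t2);
}

int main(void)
{
    #pragma omp parallel num_threads(T1)
    {
        int t1 = omp_get_thread_num();      /* this thread's row of tiles */

        /* Groups are the diagonals t1 + t2 = const: tiles on the same    */
        /* diagonal have no mutual dependencies and may run concurrently. */
        for (int group = 0; group < T1 + T2 - 1; group++) {
            int t2 = group - t1;
            if (t2 >= 0 && t2 < T2)
                compute_tile(t1, t2);
            #pragma omp barrier             /* whole group finished       */
        }
    }
    return 0;
}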

  18. Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003

  19. Fine-grain Model • Incremental parallelization of the computationally intensive parts • Relatively straightforward to derive from the pure MPI version • Threads are (re)spawned for each computation phase • Inter-node communication kept outside of the multi-threaded part • Thread synchronization through the implicit barrier of the omp parallel directive EuroPVM/MPI 2003

  20. Fine-grain Model

FOR(group_{n-1}) {
  Pack(snd_buf, tile_{n-1} - 1, nod);
  MPI_Isend(snd_buf, dest(nod));
  MPI_Irecv(recv_buf, src(nod));
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    if(valid(tile, thread_id, group_{n-1}))
      Compute(tile);
  }
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, nod);
}
EuroPVM/MPI 2003
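A structural C sketch of the fine-grain model is given below: MPI calls stay outside the threaded region and a fresh parallel region (with its implicit barrier) is opened for every tile group. pack/unpack, valid() and compute_tile() are empty placeholders, and the neighbour topology and buffer sizes are assumptions; only the control structure is the point here.

/* Build with something like: mpicc -fopenmp -o fine_grain fine_grain.c */
#include <mpi.h>
#include <omp.h>

#define NGROUPS 8
#define BUFLEN  64

static double snd_buf[BUFLEN], recv_buf[BUFLEN];

static int  valid(int thread_id, int group)        { return thread_id <= group; }
static void compute_tile(int thread_id, int group) { /* tile body goes here */ }
static void pack(int group)                        { /* fill snd_buf        */ }
static void unpack(int group)                      { /* consume recv_buf    */ }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int group = 0; group < NGROUPS; group++) {
        MPI_Request req[2];
        int nreq = 0;

        /* Non-blocking boundary exchange with the pipeline neighbours.  */
        pack(group - 1);
        if (rank + 1 < size)
            MPI_Isend(snd_buf, BUFLEN, MPI_DOUBLE, rank + 1, group,
                      MPI_COMM_WORLD, &req[nreq++]);
        if (rank > 0)
            MPI_Irecv(recv_buf, BUFLEN, MPI_DOUBLE, rank - 1, group,
                      MPI_COMM_WORLD, &req[nreq++]);

        /* Threads are (re)spawned here; the implicit barrier at the end */
        /* of the parallel region synchronizes the tile group.           */
        #pragma omp parallel
        {
            int thread_id = omp_get_thread_num();
            if (valid(thread_id, group))
                compute_tile(thread_id, group);
        }

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        unpack(group + 1);
    }
    MPI_Finalize();
    return 0;
}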

  21. Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003

  22. Coarse-grain Model • SPMD paradigm • Requires more programming effort • Threads are only spawned once • Inter-node communication inside multi-threaded part (requires MPI_THREAD_MULTIPLE) • Thread synchronization through explicit barrier (omp barrier directive) EuroPVM/MPI 2003

  23. Coarse-grain Model

#pragma omp parallel
{
  thread_id = omp_get_thread_num();
  FOR(group_{n-1}) {
    #pragma omp master
    {
      Pack(snd_buf, tile_{n-1} - 1, nod);
      MPI_Isend(snd_buf, dest(nod));
      MPI_Irecv(recv_buf, src(nod));
    }
    if(valid(tile, thread_id, group_{n-1}))
      Compute(tile);
    #pragma omp master
    {
      MPI_Waitall;
      Unpack(recv_buf, tile_{n-1} + 1, nod);
    }
    #pragma omp barrier
  }
}
EuroPVM/MPI 2003
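For comparison, a structural C sketch of the coarse-grain model: one SPMD parallel region, MPI calls funnelled through the master thread inside that region, and an explicit omp barrier closing every tile group. As in the previous sketch, pack/unpack, valid() and compute_tile() are placeholders and the neighbour topology is an assumption.

/* Build with something like: mpicc -fopenmp -o coarse_grain coarse_grain.c */
#include <mpi.h>
#include <omp.h>

#define NGROUPS 8
#define BUFLEN  64

static double snd_buf[BUFLEN], recv_buf[BUFLEN];

static int  valid(int thread_id, int group)        { return thread_id <= group; }
static void compute_tile(int thread_id, int group) { /* tile body goes here */ }
static void pack(int group)                        { /* fill snd_buf        */ }
static void unpack(int group)                      { /* consume recv_buf    */ }

int main(int argc, char **argv)
{
    int rank, size, provided;
    /* The slides ask for MPI_THREAD_MULTIPLE; since all MPI calls below
     * run on the master thread, MPI_THREAD_FUNNELED would also suffice. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();

        for (int group = 0; group < NGROUPS; group++) {
            MPI_Request req[2];     /* private; only the master uses them */
            int nreq = 0;

            #pragma omp master
            {
                pack(group - 1);
                if (rank + 1 < size)
                    MPI_Isend(snd_buf, BUFLEN, MPI_DOUBLE, rank + 1, group,
                              MPI_COMM_WORLD, &req[nreq++]);
                if (rank > 0)
                    MPI_Irecv(recv_buf, BUFLEN, MPI_DOUBLE, rank - 1, group,
                              MPI_COMM_WORLD, &req[nreq++]);
            }

            if (valid(thread_id, group))
                compute_tile(thread_id, group);  /* overlaps communication */

            #pragma omp master
            {
                MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
                unpack(group + 1);
            }
            #pragma omp barrier                  /* end of this tile group */
        }
    }
    MPI_Finalize();
    return 0;
}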

  24. Summary: Fine-grain vs Coarse-grain EuroPVM/MPI 2003

  25. Overview • Introduction • Pure MPI model • Hybrid MPI-OpenMP models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003

  26. Experimental Results • 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20) • MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared) • Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static) • FastEthernet interconnection • ADI micro-kernel benchmark (3D) EuroPVM/MPI 2003

  27. Alternating Direction Implicit (ADI) • Unitary data dependencies • 3D Iteration Space (X x Y x Z) EuroPVM/MPI 2003
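The slides do not spell out the ADI update itself, so the short C sketch below is only meant to illustrate "unitary data dependencies over a 3-D iteration space"; the array sizes, the boundary plane and the right-hand side are illustrative assumptions, not the benchmark's actual formula.

#include <stdio.h>

#define NX 64
#define NY 64
#define NZ 128

static double A[NX][NY][NZ];

/* One sweep over the 3-D iteration space: every point depends on its
 * immediate predecessor along each axis (unit flow dependencies), the
 * dependence pattern targeted by the tiling and pipelining above.     */
static void adi_like_sweep(void)
{
    for (int i = 1; i < NX; i++)
        for (int j = 1; j < NY; j++)
            for (int k = 1; k < NZ; k++)
                A[i][j][k] += A[i - 1][j][k] + A[i][j - 1][k] + A[i][j][k - 1];
}

int main(void)
{
    /* Assumed non-zero boundary plane so the sweep produces visible data. */
    for (int j = 0; j < NY; j++)
        for (int k = 0; k < NZ; k++)
            A[0][j][k] = 1.0;

    adi_like_sweep();
    printf("A[%d][%d][%d] = %g\n", NX - 1, NY - 1, NZ - 1, A[NX-1][NY-1][NZ-1]);
    return 0;
}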

  28. ADI – 4 nodes EuroPVM/MPI 2003

  29. ADI – 4 nodes • X < Y • X > Y EuroPVM/MPI 2003

  30. ADI X=512 Y=512 Z=8192 – 4 nodes EuroPVM/MPI 2003

  31. ADI X=128 Y=512 Z=8192 – 4 nodes EuroPVM/MPI 2003

  32. ADI X=512 Y=128 Z=8192 – 4 nodes EuroPVM/MPI 2003

  33. ADI – 2 nodes EuroPVM/MPI 2003

  34. ADI – 2 nodes • X < Y • X > Y EuroPVM/MPI 2003

  35. ADI X=128 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003

  36. ADI X=256 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003

  37. ADI X=512 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003

  38. ADI X=512 Y=256 Z=8192 – 2 nodes EuroPVM/MPI 2003

  39. ADI X=512 Y=128 Z=8192 – 2 nodes EuroPVM/MPI 2003

  40. ADI X=128 Y=512 Z=8192 – 2 nodes [figure: computation vs. communication breakdown] EuroPVM/MPI 2003

  41. ADI X=512 Y=128 Z=8192 – 2 nodes [figure: computation vs. communication breakdown] EuroPVM/MPI 2003

  42. Overview • Introduction • Pure MPI model • Hybrid MPI-OpenMP models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003

  43. Conclusions • Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm • Hybrid models can be competitive with the pure MPI paradigm • The coarse-grain hybrid model can be more efficient than the fine-grain one, but it is also more complicated • Programming efficiently in OpenMP is not easier than programming efficiently in MPI EuroPVM/MPI 2003

  44. Future Work • Application of methodology to real applications and benchmarks • Work balancing for coarse-grain model • Performance evaluation on advanced interconnection networks (SCI, Myrinet) • Generalization as compiler technique EuroPVM/MPI 2003

  45. Questions? http://www.cslab.ece.ntua.gr/~ndros EuroPVM/MPI 2003
