
Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops

Nikolaos Drosinos and Nectarios Koziris, National Technical University of Athens, Computing Systems Laboratory. {ndros,nkoziris}@cslab.ece.ntua.gr, www.cslab.ece.ntua.gr



  1. Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops Nikolaos Drosinos and Nectarios Koziris National Technical University of Athens Computing Systems Laboratory {ndros,nkoziris}@cslab.ece.ntua.gr www.cslab.ece.ntua.gr

  2. Motivation
• fully permutable loops are a perennial computational challenge for HPC
• hybrid parallelization is attractive for DSM architectures
• currently, popular free message passing libraries provide limited multi-threading support
• SPMD hybrid parallelization suffers from intrinsic load imbalance
ICPP-HPSEC 2005

  3. Contribution
• two static thread load balancing schemes (constant and variable) for coarse-grain funneled hybrid parallelization of fully permutable loops
  • generic
  • simple to implement
• experimental evaluation against micro-kernel benchmarks of different programming models
  • message passing
  • fine-grain hybrid
  • coarse-grain hybrid (unbalanced, balanced)

  4. Algorithmic model
foracross tile1 do
  …
  foracross tilen-1 do
    for tilen do
      Receive(tile);
      Compute(A, tile);
      Send(tile);
Restrictions:
• fully permutable loops
• unitary inter-process dependencies

  5. Message passing parallelization
• tiling transformation
• (overlapped?) computation and communication phases
• pipelined execution
• portable
• scalable
• highly optimized

  6. Hybrid parallelization So… why bother?

  7. Hybrid parallelization: why bother? (I) shared-memory programming model vs. message-passing programming model on a shared-memory architecture

  8. Hybrid parallelization: why bother? (II) DSM architectures are popular!

  9. Fine-grain hybrid parallelization
• incremental parallelization of loops
• relatively easy to implement
• popular
• Amdahl’s law restricts parallel efficiency
• overhead of re-initializing thread structures
• restrictive programming model for many applications

  10. Coarse-grain hybrid parallelization
• generic SPMD programming style
• good parallelization efficiency
• no thread re-initialization overhead
• more difficult to implement
• intrinsic load imbalance, assuming a common funneled thread support level

  11. MPI thread support levels
• single
• masteronly
• funneled
• serialized
• multiple
[timeline figure: fine-grain hybrid alternates computation and communication phases (comp, comp, comm, comm, …, comp); coarse-grain hybrid interleaves communication inside a single parallel region (comp, comm, comp, …, comp)]
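A coarse-grain funneled scheme of the kind compared here can be outlined as follows. This is illustrative pseudocode, not code from the paper; it assumes MPI with OpenMP, where only the master thread performs communication under MPI_THREAD_FUNNELED:

```
MPI_Init_thread(..., MPI_THREAD_FUNNELED, &provided);
#pragma omp parallel               /* one SPMD region spans the whole tile pipeline */
for each tile of this process:
    #pragma omp master
        Receive(boundary of tile)  /* MPI call, master thread only */
    #pragma omp barrier            /* needed: omp master implies no barrier */
    Compute(this thread's share of tile)
    #pragma omp barrier
    #pragma omp master
        Send(boundary of tile)     /* MPI call, master thread only */
```

A fine-grain variant would instead open an `omp parallel for` inside the Compute phase of every tile, re-initializing the thread team each iteration; the coarse-grain form avoids that overhead but leaves the master's communication time as the intrinsic load imbalance that the balancing schemes address.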

  12. Load balancing. Idea: the master thread assumes a smaller fraction of the process tile computational load compared to the other threads.

  13. Load balancing (2). Notation: T … total number of threads, p … current process id. [the balancing formulas appear as slide images and are not recoverable from this transcript]
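Since the slide's formulas did not survive the transcript, here is a sketch consistent with the stated idea. Let W denote the computational load of one process tile and C the master thread's per-tile communication cost; both symbols are introduced for this sketch, not taken from the paper. Requiring all T threads of a process to finish a tile simultaneously while only the master communicates gives:

```latex
w_m + C = w_o, \qquad w_m + (T-1)\,w_o = W
\quad\Longrightarrow\quad
w_o = \frac{W + C}{T}, \qquad w_m = \frac{W - (T-1)\,C}{T}
```

where $w_m$ is the master thread's computational share and $w_o$ the share of each other thread. A constant scheme would fix the estimate of C for all tiles; a variable scheme could adjust it per tile.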

  14. Load balancing (3)

  15. Experimental Results
• 8-node dual-SMP Linux cluster (800 MHz Pentium III, 256 MB RAM, kernel 2.4.26)
• MPICH v1.2.6 (--with-device=ch_p4, --with-comm=shared, P4_SOCKBUFSIZE=104KB)
• Intel C++ compiler 8.1 (-O3 -static -mcpu=pentiumpro)
• Fast Ethernet interconnection network

  16. Alternating Direction Implicit (ADI)
• stencil computation used for solving partial differential equations
• unitary data dependencies
• 3D iteration space (X × Y × Z)

  17. ADI

  18. Synthetic benchmark

  19. Conclusions
• fine-grain hybrid parallelization is inefficient
• unbalanced coarse-grain hybrid parallelization is also inefficient
• balancing improves hybrid model performance
• the variable balanced coarse-grain hybrid model is the most efficient approach overall
• the relative performance improvement increases as communication needs grow relative to computation

  20. Thank You! Questions?

  21. ADI

  22. Synthetic benchmark
