On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon

D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser

San Diego Supercomputer Center


Overview

  • Blue Horizon Hardware

  • Motivation for this work

  • Two methods of hybrid programming

  • Fine grain results

  • A word on coarse grain techniques

  • Coarse grain results

  • Time variability

  • Effects of thread binding

  • Final Conclusions


Blue Horizon Hardware

  • 144 IBM SP High Nodes

  • Each node:

    • 8-way SMP

    • 4 GB memory

    • crossbar

  • Each processor:

    • Power3 222 MHz

    • 4 Flop/cycle

    • Aggregate peak 1.002 Tflop/s

  • Compilers:

    • IBM mpxlf_r, version 7.0.1

    • KAI guidef90, version 3.9


Blue Horizon Hardware (continued)

  • Interconnect (between nodes):

    • Currently:

      • 115 MB/s

      • 4 MPI tasks/node

        Must use OpenMP to utilize all processors

    • Soon:

      • 500 MB/s

      • 8 MPI tasks/node

        Can use OpenMP to supplement MPI (if it’s worth it)


Hybrid Programming: why use it?

  • Non-performance-related reasons

    • Avoid replication of data on the node

  • Performance-related reasons:

    • Avoid latency of MPI on the node

    • Avoid unnecessary data copies inside the node

    • Reduce latency of MPI calls between the nodes

  • Decrease the cost of global MPI operations (reductions, all-to-all), since fewer MPI tasks participate

  • The price to pay:

    • OpenMP Overheads

    • False sharing

      Is it really worth trying?


Hybrid Programming

  • Two methods of combining MPI and OpenMP in parallel programs

Fine grain:

  program main
! MPI initialization
  ....
! CPU-intensive loop
!$OMP PARALLEL DO
  do i = 1, n
!    work
  end do
  ....
  end

Coarse grain:

  program main
! MPI initialization
!$OMP PARALLEL
  ....
  do i = 1, n
!    work
  end do
  ....
!$OMP END PARALLEL
  end
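
To make the fine grain pattern concrete, here is a minimal runnable sketch (ours, not from the original slides; the problem size and the summation are illustrative stand-ins for real work). Each MPI task parallelizes its compute loop with OMP PARALLEL DO, then the per-task partial sums are combined with MPI_REDUCE:

  program fine_grain_example
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100000        ! illustrative problem size
    integer :: ierr, rank, i
    double precision :: local_sum, global_sum

    call MPI_INIT(ierr)                     ! MPI initialization
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    ! CPU-intensive loop, shared among the OpenMP threads on the node
    local_sum = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:local_sum)
    do i = 1, n
       local_sum = local_sum + 1.0d0 / dble(i + rank)  ! stand-in for work
    end do
!$OMP END PARALLEL DO

    ! combine the per-task results across nodes
    call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'global sum =', global_sum

    call MPI_FINALIZE(ierr)
  end program fine_grain_example

On Blue Horizon this style of code would be built with the thread-safe MPI compiler mentioned earlier (mpxlf_r with OpenMP enabled), with the thread count set through OMP_NUM_THREADS.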


Hybrid programming (continued)

Fine grain approach

  • Easy to implement

  • Performance: low, due to the overhead of OpenMP directives (an OMP PARALLEL DO around every loop)

Coarse grain approach

  • Time-consuming to implement

  • Performance: better; threads are created only once, so the thread-creation overhead is amortized


Hybrid NPB using fine grain parallelism

  • CG, MG, and FT suites of NAS Parallel Benchmarks (NPB).

    Suite name                 # loops parallelized
    CG - Conjugate Gradient    18
    MG - Multi-Grid            50
    FT - Fourier Transform      8

  • Results shown are the best of 5-10 runs

  • Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html







Hybrid NPB using coarse grain parallelism: MG suite

Overview of the method

(Diagram: four MPI tasks, Task 1 through Task 4, each spawning OpenMP threads, Thread 1 and Thread 2, within its node.)


Coarse grain programming methodology

  • Start with the MPI code

  • Each MPI task spawns its threads once, at the beginning

  • Serial work (initialization etc.) and MPI calls are done inside a MASTER or SINGLE region

  • Main arrays are global

  • Work distribution: each thread gets a chunk of the array based on its thread number (omp_get_thread_num()); in this work, one-dimensional blocking is used (see the sketch below)

  • Avoid using OMP DO

  • Be careful with variable scoping and synchronization
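
Putting the rules above together, a minimal coarse grain sketch looks like the following (ours, not from the original slides; the array names, sizes, and the final allreduce are illustrative). Threads are spawned once, each thread derives its own block of the global array from omp_get_thread_num(), and MPI is called only from the master thread, bracketed by barriers:

  program coarse_grain_example
    implicit none
    include 'mpif.h'
    integer, external :: omp_get_thread_num, omp_get_num_threads
    integer, parameter :: n = 8000            ! illustrative global array size
    double precision :: a(n), asum(n)         ! main arrays are global (shared)
    integer :: ierr, rank, tid, nthreads, chunk, lo, hi, i

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    ! threads are spawned once, at the beginning
!$OMP PARALLEL PRIVATE(tid, nthreads, chunk, lo, hi, i)
    tid      = omp_get_thread_num()
    nthreads = omp_get_num_threads()

    ! one-dimensional blocking: each thread owns a contiguous chunk of a,
    ! computed from its thread number instead of using OMP DO
    chunk = n / nthreads                      ! assumes nthreads divides n
    lo = tid * chunk + 1
    hi = lo + chunk - 1

    do i = lo, hi
       a(i) = dble(i + rank)                  ! stand-in for real work
    end do

    ! make sure every thread has finished before the master calls MPI
!$OMP BARRIER
!$OMP MASTER
    call MPI_ALLREDUCE(a, asum, n, MPI_DOUBLE_PRECISION, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)
!$OMP END MASTER
    ! keep the other threads from reading asum before it is ready
!$OMP BARRIER
!$OMP END PARALLEL

    if (rank == 0) print *, 'asum(1) =', asum(1)
    call MPI_FINALIZE(ierr)
  end program coarse_grain_example

Because all MPI traffic is funneled through the master thread, the two explicit barriers provide the synchronization that OMP DO would otherwise supply implicitly; this is exactly where the care with scoping (the PRIVATE list) and synchronization comes in.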




Coarse grain results - MG (C class)

  • Full node results

    MPI Tasks x OpenMP Threads   # of SMP Nodes   Max MOPS/CPU   Min MOPS/CPU
    4x2                           8               75.7           19.1
    2x4                           8               92.6           14.9
    1x8                           8               84.2           13.6
    4x2                          64               49.5           18.6
    4x2                          64               15.6            3.7
    2x4                          64               21.2            5.3
    1x8                          64               56.8           42.3
    1x8                          64               15.4            5.6
    2x4                          64                8.2            2.2

  • Note the wide spread between Max and Min, and the differing numbers for repeated 64-node configurations; this variability is examined next


Variability

  • Performance varies by a factor of 2 to 5 (on 64 nodes)

  • Seen mostly when the full node is used

  • Seen both in fine grain and coarse grain runs

  • Seen both with IBM and KAI compiler

  • Seen in runs on the same set of nodes as well as between different sets

  • On a large number of nodes, average performance suffers substantially

  • Confirmed in micro-study of OpenMP on 1 node


OpenMP on 1 node microbenchmark results

http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html


Thread binding

Question: is variability related to thread migration?

  • A study on 1 node:

    • Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds

    • Monitor processor id and run time for each thread

    • Repeat 100 times

    • Threads were run either bound or not bound (see the sketch below)
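
The slides do not show the test code; a rough sketch of the per-thread timing loop might look like this (the inner computation is our stand-in for the actual matrix inversion, and the platform-specific processor-id query is omitted):

  program binding_probe
    implicit none
    integer, external :: omp_get_thread_num
    double precision, external :: omp_get_wtime
    integer :: tid, i, j, rep
    double precision :: t0, t1, s

!$OMP PARALLEL PRIVATE(tid, rep, t0, t1, s, i, j)
    tid = omp_get_thread_num()
    do rep = 1, 100                  ! repeat the measurement 100 times
       t0 = omp_get_wtime()
       s = 0.0d0
       ! stand-in for the independent ~1.6 s matrix inversion
       do i = 1, 1000
          do j = 1, 100000
             s = s + 1.0d0 / (dble(i) + dble(j))
          end do
       end do
       t1 = omp_get_wtime()
!$OMP CRITICAL
       ! report who ran and for how long (s is printed so the work
       ! is not optimized away)
       print *, 'rep =', rep, 'thread =', tid, 'time =', t1 - t0, s
!$OMP END CRITICAL
    end do
!$OMP END PARALLEL
  end program binding_probe

Comparing the reported times across repetitions, with binding on and off, exposes both thread migration and the occasional slow thread.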


Thread binding (continued)

Results for OMP_NUM_THREADS=8

  • Without binding, threads migrate in about 15% of the runs

  • With thread binding turned on there was no migration

  • 3% of iterations had threads with runtimes > 2.0 sec., a 25% slowdown

  • Slowdown occurs with/without binding

  • Effect of a single thread slowdown

    • Probability that the complete calculation will be slowed:

      P = 1 - (1 - c)^M, with c = 0.03 (the 3% slow rate) and M = 144 nodes of Blue Horizon

      P = 1 - 0.97^144 ≈ 0.9876, so the overall result is almost certain to be slowed by 25% (verified in the snippet below)
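
The arithmetic is easy to verify (a throwaway snippet of ours, not from the slides):

  program slowdown_probability
    ! P = 1 - (1 - c)**M: the chance that at least one of M nodes
    ! hits the c = 3% slow case in a given run
    implicit none
    double precision, parameter :: c = 0.03d0
    integer, parameter :: m = 144
    print *, 'P =', 1.0d0 - (1.0d0 - c)**m   ! prints P ~ 0.9876
  end program slowdown_probability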


Thread binding (continued)

  • The calculation was rerun with OMP_NUM_THREADS = 7

    • a 12.5% reduction in computational power (7 of the 8 CPUs per node)

    • no thread showed a slowdown; all ran in about 1.6 seconds

  • Summary

    • OMP_NUM_THREADS = 7

      • yields a 12.5% reduction in computational power

    • OMP_NUM_THREADS = 8

      • 0.9876 probability that the overall result is slowed by 25%, independent of thread binding


Overall Conclusions

Based on our study of NPB on Blue Horizon:

  • The fine grain hybrid approach is generally worse than pure MPI

  • The coarse grain approach for MG is comparable with pure MPI, or slightly better

  • The coarse grain approach is time- and effort-consuming

  • Coarse grain techniques were presented (see the methodology slide)

  • There is large variability when using the full node; until this is fixed, we recommend using fewer than 8 threads per node

  • Thread binding does not influence performance