
On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon

D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser

San Diego Supercomputer Center


Overview

  • Blue Horizon Hardware

  • Motivation for this work

  • Two methods of hybrid programming

  • Fine grain results

  • A word on coarse grain techniques

  • Coarse grain results

  • Time variability

  • Effects of thread binding

  • Final Conclusions


Blue Horizon Hardware

  • 144 IBM SP High Nodes

  • Each node:

    • 8-way SMP

    • 4 GB memory

    • crossbar

  • Each processor:

    • Power3, 222 MHz

    • 4 flops/cycle

  • Aggregate peak: 1.002 Tflop/s

  • Compilers:

    • IBM mpxlf_r, version 7.0.1

    • KAI guidef90, version 3.9


Blue Horizon Hardware (continued)

  • Interconnect (between nodes):

    • Currently:

      • 115 MB/s

      • at most 4 MPI tasks/node

        Must use OpenMP to utilize all 8 processors

    • Soon:

      • 500 MB/s

      • 8 MPI tasks/node

        Can use OpenMP to supplement MPI (if it's worth it)


Hybrid Programming: Why Use It?

  • Non-performance-related reasons

    • Avoid replication of data on the node

  • Performance-related reasons:

    • Avoid latency of MPI on the node

    • Avoid unnecessary data copies inside the node

    • Reduce latency of MPI calls between the nodes

  • Shrink global MPI operations (reduction, all-to-all) by involving fewer MPI tasks

  • The price to pay:

    • OpenMP Overheads

    • False sharing

      Is it really worth trying?


Hybrid Programming

  • Two methods of combining MPI and OpenMP in parallel programs

    Fine grain:

      main program
      ! MPI initialization
      ....
      ! cpu-intensive loop
      !$OMP PARALLEL DO
      do i = 1, n
         ! work
      end do
      ....
      end

    Coarse grain:

      main program
      ! MPI initialization
      !$OMP PARALLEL
      ....
      do i = 1, n
         ! work
      end do
      ....
      !$OMP END PARALLEL
      end
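As a concrete illustration of the fine grain pattern, here is a minimal, self-contained sketch (not taken from the NPB codes; the array, the loop body, and the reduction are illustrative). MPI splits the work across tasks, and OMP PARALLEL DO spreads the CPU-intensive loop over the threads within a node:

      program fine_grain_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      integer :: ierr, rank, i
      real(8) :: a(n), local_sum, global_sum

      ! MPI initialization (one MPI task per node, or per part of a node)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      ! CPU-intensive loop: a team of threads is created and joined here,
      ! every time the directive is encountered (the fine grain overhead)
      local_sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:local_sum)
      do i = 1, n
         a(i) = sqrt(dble(i + rank))
         local_sum = local_sum + a(i)
      end do
!$OMP END PARALLEL DO

      ! Communication between nodes remains plain MPI
      call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION,  &
                      MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'global sum =', global_sum

      call MPI_FINALIZE(ierr)
      end program fine_grain_sketch

Every parallel loop pays the thread fork/join cost, which is exactly the overhead discussed on the next slide.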


Hybrid Programming (continued)

Fine grain approach

  • Easy to implement

  • Performance: low, due to the overhead of the OpenMP directives (an OMP PARALLEL DO around every parallelized loop)

Coarse grain approach

  • Implementation is time-consuming

  • Performance: better, since the overhead of repeated thread creation is avoided


Hybrid NPB Using Fine Grain Parallelism

  • CG, MG, and FT suites of NAS Parallel Benchmarks (NPB).

    Suite name                  # loops parallelized
    CG - Conjugate Gradient     18
    MG - Multi-Grid             50
    FT - Fourier Transform       8

  • Results shown are the best of 5-10 runs

  • Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html







Hybrid NPB Using Coarse Grain Parallelism: MG Suite

Overview of the method

[Diagram: four MPI tasks (Task 1 - Task 4), each running its own OpenMP threads (Thread 1, Thread 2, ...)]


Coarse Grain Programming Methodology

  • Start with MPI code

  • Each MPI task spawns its threads once, at the beginning of the run

  • Serial work (initialization, etc.) and MPI calls are done inside a MASTER or SINGLE region

  • Main arrays are global (shared)

  • Work distribution: each thread gets a chunk of each array based on its thread number (omp_get_thread_num()); in this work, one-dimensional blocking is used (see the sketch after this list)

  • Avoid using OMP DO

  • Be careful with variable scoping and synchronization
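A minimal sketch of this methodology follows (illustrative only, not the actual MG code; the array, the summation, and the 16-thread bound on the partial-sum array are assumptions). Threads are spawned once, work is divided by thread number with one-dimensional blocking, and MPI calls are funneled through the master thread:

      program coarse_grain_sketch
      use omp_lib
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000, maxthr = 16
      real(8) :: a(n)               ! main array is global (shared)
      real(8) :: partial(maxthr)    ! one partial sum per thread
      real(8) :: local_sum, global_sum
      integer :: ierr, rank, nthr, tid, chunk, ilo, ihi, i

      call MPI_INIT(ierr)           ! MPI initialization (still serial)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      ! Spawn the threads once, at the beginning
!$OMP PARALLEL PRIVATE(tid, nthr, chunk, ilo, ihi, i)
      tid  = omp_get_thread_num()
      nthr = omp_get_num_threads()

      ! Work distribution by thread number: one-dimensional blocking
      chunk = (n + nthr - 1)/nthr
      ilo   = tid*chunk + 1
      ihi   = min((tid + 1)*chunk, n)

      partial(tid + 1) = 0.0d0      ! (padding partial() would avoid false sharing)
      do i = ilo, ihi               ! the work itself, without OMP DO
         a(i) = dble(i + rank)
         partial(tid + 1) = partial(tid + 1) + a(i)
      end do

!$OMP BARRIER
!$OMP MASTER
      ! Serial work and MPI calls are done by the master thread only
      local_sum = sum(partial(1:nthr))
      call MPI_ALLREDUCE(local_sum, global_sum, 1,                     &
                         MPI_DOUBLE_PRECISION, MPI_SUM,                &
                         MPI_COMM_WORLD, ierr)
!$OMP END MASTER
!$OMP BARRIER
      ! ... all threads can now use global_sum ...
!$OMP END PARALLEL

      if (rank == 0) print *, 'global sum =', global_sum
      call MPI_FINALIZE(ierr)
      end program coarse_grain_sketch

Avoiding OMP DO and doing the blocking by hand keeps the thread-to-data mapping fixed for the lifetime of the threads, which is where the reduced overhead comes from.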




Coarse Grain Results - MG (C Class)

  • Full node results:

    MPI Tasks x OpenMP Threads   # of SMP Nodes   Max MOPS/CPU   Min MOPS/CPU
    4x2                           8               75.7           19.1
    2x4                           8               92.6           14.9
    1x8                           8               84.2           13.6
    4x2                          64               49.5           18.6
    4x2                          64               15.6            3.7

  • Further 64-node runs in the 2x4 and 1x8 configurations gave max/min MOPS/CPU of 21.2/5.3, 56.8/42.3, 15.4/5.6, and 8.2/2.2.


Variability

  • Run times vary by a factor of 2 to 5 (on 64 nodes)

  • Seen mostly when the full node is used

  • Seen both in fine grain and coarse grain runs

  • Seen both with IBM and KAI compiler

  • Seen in runs on the same set of nodes as well as between different sets

  • On a large number of nodes, the average performance suffers significantly

  • Confirmed in micro-study of OpenMP on 1 node


OpenMP on 1 Node: Microbenchmark Results

http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html


Thread Binding

Question: is variability related to thread migration?

  • A study on 1 node (sketched in code after this list):

    • Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds

    • The processor id and run time of each thread are monitored

    • This is repeated 100 times

    • Runs are made with threads either bound or not bound
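A minimal sketch of such a study (illustrative: a matrix-matrix multiply stands in for the matrix inversion, and the matrix size and repetition count are assumptions; recording each thread's processor id is platform specific and is left out here):

      program thread_timing_study
      use omp_lib
      implicit none
      integer, parameter :: nrep = 100, n = 200
      real(8) :: a(n,n), b(n,n), c(n,n)
      real(8) :: t0, t1
      integer :: rep, tid

      do rep = 1, nrep
         ! Each thread performs the same independent computation;
         ! its run time is reported so that outliers (and, with the
         ! processor id added, thread migration) can be detected.
!$OMP PARALLEL PRIVATE(a, b, c, t0, t1, tid)
         tid = omp_get_thread_num()
         call random_number(a)
         call random_number(b)

         t0 = omp_get_wtime()
         c  = matmul(a, b)          ! stand-in for the matrix inversion
         t1 = omp_get_wtime()

         print '(a,i4,a,i3,a,f8.3,a,es10.3)', 'iter', rep,             &
               '  thread', tid, '  time', t1 - t0, ' s  chk', c(1,1)
!$OMP END PARALLEL
      end do
      end program thread_timing_study

Binding can then be toggled through the system's thread binding facilities and the measurement repeated, separating migration effects from other sources of slowdown.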


Thread Binding (continued)

Results for OMP_NUM_THREADS=8

  • Without binding, threads migrated in about 15% of the runs

  • With thread binding turned on, there was no migration

  • In 3% of the iterations some thread had a run time > 2.0 sec., i.e. a 25% slowdown

  • This slowdown occurred both with and without binding

  • Effect of a single slow thread

    • Probability that the complete calculation is slowed:

      P = 1 - (1 - c)^M, with c = 3% and M = 144 nodes of Blue Horizon

      P = 0.9876: the overall result is almost certain to be slowed by 25% (the arithmetic is sketched below)
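As a small check of this arithmetic (the 3% rate and the 144 nodes are the numbers quoted above):

      program slowdown_probability
      implicit none
      real(8) :: c, p
      integer :: m

      c = 0.03d0       ! fraction of iterations with a slowed thread (1-node study)
      m = 144          ! number of Blue Horizon nodes
      p = 1.0d0 - (1.0d0 - c)**m
      print '(a,f6.4)', 'P(overall run slowed) = ', p    ! prints 0.9876
      end program slowdown_probability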


Thread Binding (continued)

  • The calculation was rerun with OMP_NUM_THREADS = 7

    • This gives up one processor per node, a 12.5% reduction in computational power

    • No thread showed a slowdown; all ran in about 1.6 seconds

  • Summary

    • OMP_NUM_THREADS = 7

      • yields a 12.5% reduction in computational power

    • OMP_NUM_THREADS = 8

      • yields a 0.9876 probability that the overall result is slowed by 25%, independent of thread binding


Overall Conclusions

Based on our study of NPB on Blue Horizon:

  • The fine grain hybrid approach is generally worse than pure MPI

  • The coarse grain approach for MG is comparable with pure MPI, or slightly better

  • The coarse grain approach is time- and effort-consuming to implement

  • Coarse grain programming techniques are given above (see the methodology slide)

  • There is large run-to-run variability when the full node is used; until this is fixed, we recommend using fewer than 8 threads per node

  • Thread binding does not influence performance

