
On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon

D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser

San Diego Supercomputer Center

Overview
  • Blue Horizon Hardware
  • Motivation for this work
  • Two methods of hybrid programming
  • Fine grain results
  • A word on coarse grain techniques
  • Coarse grain results
  • Time variability
  • Effects of thread binding
  • Final Conclusions
Blue Horizon Hardware
  • 144 IBM SP High Nodes
  • Each node:
    • 8-way SMP
    • 4 GB memory
    • crossbar
  • Each processor:
    • Power3, 222 MHz
    • 4 flops/cycle
  • Aggregate peak: 1.002 Tflop/s
  • Compilers:
    • IBM mpxlf_r, version 7.0.1
    • KAI guidef90, version 3.9
Blue Horizon Hardware
  • Interconnect (between nodes):
    • Currently:
      • 115 MB/s
      • 4 MPI tasks/node
      • Must use OpenMP to utilize all 8 processors
    • Soon:
      • 500 MB/s
      • 8 MPI tasks/node
      • Can use OpenMP to supplement MPI (if it is worth it)

Hybrid Programming: why use it?
  • Non-performance-related reasons
    • Avoid replication of data on the node
  • Performance-related reasons:
    • Avoid latency of MPI on the node
    • Avoid unnecessary data copies inside the node
    • Reduce latency of MPI calls between the nodes
    • Cheaper global MPI operations (reduction, all-to-all), since fewer MPI tasks participate; a node-level reduction sketch follows this slide
  • The price to pay:
    • OpenMP Overheads
    • False sharing

Is it really worth trying?
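
One way to picture the "cheaper global MPI operations" argument is a two-level reduction: each node first sums its share of the work with an OpenMP reduction, and only one MPI task per node then takes part in the collective. The sketch below is ours, not from the slides; it assumes free-form Fortran, one MPI task per 8-way node, and a dummy loop standing in for real work.

    program node_sum
      ! Illustrative sketch (not from the slides): OpenMP reduction inside
      ! the node, then a single MPI_ALLREDUCE between nodes. Assumes one
      ! MPI task per 8-way node with OMP_NUM_THREADS=8.
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000000
      integer :: ierr, i
      double precision :: local_sum, global_sum

      call MPI_INIT(ierr)

      ! Sum this node's share of the work with an OpenMP reduction
      local_sum = 0.0d0
      !$OMP PARALLEL DO REDUCTION(+:local_sum)
      do i = 1, n
         local_sum = local_sum + dble(i)    ! stand-in for real work
      end do
      !$OMP END PARALLEL DO

      ! One collective call per node instead of one per processor
      call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(ierr)
    end program node_sum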

Hybrid Programming
  • Two methods of combining MPI and OpenMP in parallel programs

Fine grain:

    main program
      ! MPI initialization
      ....
      ! CPU-intensive loop
      !$OMP PARALLEL DO
      do i = 1, n
         ! work
      end do
      ....
    end

Coarse grain:

    main program
      ! MPI initialization
      !$OMP PARALLEL
      ....
      do i = 1, n
         ! work
      end do
      ....
      !$OMP END PARALLEL
    end

Hybrid programming

Fine grain approach

  • Easy to implement
  • Performance: low, due to the overhead of OpenMP directives (OMP PARALLEL DO on every parallelized loop)

Coarse grain approach

  • Time-consuming to implement
  • Performance: better, since the thread-creation overhead is paid only once
Hybrid NPB using fine grain parallelism
  • CG, MG, and FT suites of the NAS Parallel Benchmarks (NPB); a representative fine-grain loop is sketched after this slide

    Suite name                # loops parallelized
    CG - Conjugate Gradient   18
    MG - Multi-Grid           50
    FT - Fourier Transform    8

  • Results shown are the best of 5-10 runs
  • Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html
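
For reference, the sketch below shows what a fine-grain parallelized loop typically looks like. It is a generic dense matrix-vector product written by us for illustration, not actual NPB source; the array sizes, PRIVATE clause, and loop body are assumptions.

    program matvec_fine
      ! Generic illustration of fine-grain parallelization (not NPB source):
      ! one compute-intensive loop wrapped in an OMP PARALLEL DO directive.
      implicit none
      integer, parameter :: n = 300
      integer :: i, j
      double precision :: a(n, n), x(n), y(n), s

      a = 1.0d0
      x = 2.0d0

      !$OMP PARALLEL DO PRIVATE(j, s)
      do i = 1, n
         s = 0.0d0
         do j = 1, n
            s = s + a(i, j) * x(j)
         end do
         y(i) = s
      end do
      !$OMP END PARALLEL DO

      print *, 'y(1) =', y(1)
    end program matvec_fine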
Hybrid NPB using coarse grain parallelism: MG suite

Overview of the method

[Diagram: four MPI tasks (Task 1-4), each spawning its own team of OpenMP threads (Thread 1, Thread 2, ...)]

Coarse grain programming methodology
  • Start with the MPI code
  • Each MPI task spawns its threads once, at the beginning
  • Serial work (initialization, etc.) and MPI calls are done inside MASTER or SINGLE regions
  • Main arrays are global (shared)
  • Work distribution: each thread gets a chunk of the array based on its thread number (omp_get_thread_num()); in this work, one-dimensional blocking is used
  • Avoid using OMP DO
  • Be careful with variable scoping and synchronization (a minimal sketch follows this list)
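
A minimal sketch of this methodology, written by us (it is not the actual MG benchmark code): threads are spawned once, MPI calls and serial work stay on the master thread, and each thread derives its own 1-D block from omp_get_thread_num(). The array size, iteration count, and update are placeholders.

    program coarse_sketch
      ! Minimal coarse-grain sketch (illustrative; not the NPB MG source).
      ! Would be built with a thread-safe compiler, e.g. mpxlf_r -qsmp=omp.
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      double precision :: u(n), usum, globalsum
      integer :: ierr, rank, nthreads, tid, chunk, ilo, ihi, i, iter
      integer :: omp_get_num_threads, omp_get_thread_num

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      u = 1.0d0

      ! Threads are spawned once; the main array u is global (shared)
      !$OMP PARALLEL PRIVATE(tid, nthreads, chunk, ilo, ihi, i, iter)
      tid      = omp_get_thread_num()
      nthreads = omp_get_num_threads()

      ! 1-D blocking: thread tid owns elements ilo..ihi (no OMP DO used)
      chunk = (n + nthreads - 1) / nthreads
      ilo   = tid * chunk + 1
      ihi   = min(n, (tid + 1) * chunk)

      do iter = 1, 10
         ! Each thread updates only its own block of the shared array
         do i = ilo, ihi
            u(i) = 0.5d0 * (u(i) + 1.0d0)    ! placeholder for real work
         end do
         !$OMP BARRIER

         ! Serial work and MPI calls are confined to the MASTER region
         !$OMP MASTER
         usum = sum(u)
         call MPI_ALLREDUCE(usum, globalsum, 1, MPI_DOUBLE_PRECISION, &
                            MPI_SUM, MPI_COMM_WORLD, ierr)
         !$OMP END MASTER
         !$OMP BARRIER
      end do
      !$OMP END PARALLEL

      if (rank == 0) print *, 'global sum =', globalsum
      call MPI_FINALIZE(ierr)
    end program coarse_sketch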
Coarse grain results - MG (C class)

  • Full node results

    MPI Tasks x OpenMP Threads    Max MOPS/CPU    Min MOPS/CPU    # of SMP Nodes
    4x2                           75.7            19.1            8
    2x4                           92.6            14.9            8
    1x8                           84.2            13.6            8
    4x2                           49.5            18.6            64
    4x2                           15.6            3.7             64
    2x4                           21.2            5.3             64
    1x8                           56.8            42.3            64
    1x8                           15.4            5.6             64
    2x4                           8.2             2.2             64

Variability
  • Run times vary by a factor of 2 to 5 (on 64 nodes)
  • Seen mostly when the full node is used
  • Seen in both fine grain and coarse grain runs
  • Seen with both the IBM and the KAI compiler
  • Seen in runs on the same set of nodes as well as between different sets
  • On a large number of nodes, the average performance suffers significantly
  • Confirmed in a micro-study of OpenMP on 1 node
OpenMP on 1 node microbenchmark results

http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html

Thread binding

Question: is variability related to thread migration?

  • A study on 1 node (a sketch of such a test follows this slide):
    • Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds
    • Monitor the processor id and run time for each thread
    • Repeat 100 times
    • Run with threads bound OR not bound
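
A sketch of the kind of single-node test described above, as we would write it (not the original code): each thread times an independent piece of dense work with omp_get_wtime() and reports its own run time. A matrix multiply stands in for the matrix inversion, and the processor-id query, which needs a platform-specific call, is omitted.

    program thread_timing
      ! Illustrative 1-node test (not the original code): each thread times
      ! its own independent work; slow iterations appear as per-thread outliers.
      implicit none
      integer, parameter :: n = 200, nrep = 100
      integer :: tid, rep, i, j, k
      double precision :: a(n, n), c(n, n), t0, t1
      integer :: omp_get_thread_num
      double precision :: omp_get_wtime

      do rep = 1, nrep
         !$OMP PARALLEL PRIVATE(tid, a, c, t0, t1, i, j, k)
         tid = omp_get_thread_num()
         a   = 1.0d0 / dble(n)
         t0  = omp_get_wtime()

         ! Independent dense work per thread (stand-in for the matrix inversion)
         c = 0.0d0
         do j = 1, n
            do k = 1, n
               do i = 1, n
                  c(i, j) = c(i, j) + a(i, k) * a(k, j)
               end do
            end do
         end do

         t1 = omp_get_wtime()
         print *, 'rep', rep, 'thread', tid, 'time', t1 - t0, c(1, 1)
         !$OMP END PARALLEL
      end do
    end program thread_timing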
Thread binding

Results for OMP_NUM_THREADS=8

  • Without binding, threads migrated in about 15% of the runs
  • With thread binding turned on there was no migration
  • 3% of iterations had a thread with a run time > 2.0 sec., a 25% slowdown
  • The slowdown occurs with or without binding
  • Effect of a single slow thread:
    • Probability that the complete calculation will be slowed:

P = 1 - (1 - c)^M, with c = 3% and M = 144 nodes of Blue Horizon

P = 0.9876: probability that the overall result is slowed by 25%
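
Spelling out that arithmetic (assuming the 3% slow-thread events are independent across nodes, and that one slow node delays the whole synchronized run):

P = 1 - (1 - 0.03)^144 = 1 - 0.97^144 ≈ 1 - 0.0124 ≈ 0.9876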

Thread binding
  • The calculation was rerun with OMP_NUM_THREADS = 7
    • 12.5% reduction in available computational power (7 of the 8 processors per node)
    • No thread showed a slowdown; all ran in about 1.6 seconds
  • Summary
    • OMP_NUM_THREADS = 7: yields a 12.5% reduction in computational power
    • OMP_NUM_THREADS = 8: 0.9876 probability that the overall result is slowed by 25%, independent of thread binding
Overall Conclusions

Based on our study of NPB on Blue Horizon:

  • The fine grain hybrid approach is generally worse than pure MPI
  • The coarse grain approach for MG is comparable with pure MPI or slightly better
  • The coarse grain approach is time- and effort-consuming
  • Coarse grain techniques are given (see the methodology slide above)
  • There is large run-time variability when the full node is used; until this is fixed, we recommend using fewer than 8 threads per node
  • Thread binding does not influence performance