
OpenMP Lecture #4 Performance Tuning


Presentation Transcript


  1. OpenMP Lecture #4: Performance Tuning. Josh Ladd, Grad 511, February 16, 2011

  2. OpenMP and Performance
  • OpenMP is great, it’s so simple to use!!!
  • Easy to implement and get running
  • Getting good performance is another story…
  • Shared memory programming introduces overheads, pitfalls and traps not encountered in serial programming
  • OpenMP itself has many “knobs” to consider in the tuning process
  • Not just OpenMP directives and clauses to consider, but also system parameters (see the sketch after this slide):
    • Memory locality and coherency issues
    • Process and memory affinity options
    • Compiler options - different OpenMP implementations support different “features” and/or code analysis tools
    • Environment variables
    • Runtime options
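
A minimal sketch of a few of these “knobs” (not from the slides): most runtime settings can be made either through environment variables such as OMP_NUM_THREADS, OMP_DYNAMIC and OMP_SCHEDULE, or through the equivalent runtime library calls used below.

     /* Sketch: setting OpenMP runtime knobs from code.  The same effect can be
      * had with OMP_NUM_THREADS=8 and OMP_DYNAMIC=false in the environment. */
     #include <stdio.h>
     #include <omp.h>

     int main(void)
     {
         omp_set_num_threads(8);     /* equivalent to OMP_NUM_THREADS=8 */
         omp_set_dynamic(0);         /* equivalent to OMP_DYNAMIC=false */

         #pragma omp parallel
         {
             #pragma omp single
             printf("running with %d threads\n", omp_get_num_threads());
         }
         return 0;
     }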

  3. Shared Memory Overheads and Considerations
  • Thread creation and scheduling, i.e. forking a team of threads
  • Memory bandwidth scalability - the Von Neumann bottleneck
  • Synchronization and memory consistency costs
  • False sharing, and cache utilization in general - requires the programmer to think long and hard about data layout and parallel access patterns

  4. Measured overhead of forking a team of OpenMP threads (figure)
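
One way to get a number like this yourself (a sketch, not the code behind the slide’s figure): time a nearly empty parallel region many times with omp_get_wtime() and average.

     /* Sketch: rough fork/join overhead of a parallel region. */
     #include <stdio.h>
     #include <omp.h>

     int main(void)
     {
         const int trials = 1000;
         volatile int sink = 0;

         double t0 = omp_get_wtime();
         for (int t = 0; t < trials; t++) {
             #pragma omp parallel
             {
                 if (omp_get_thread_num() < 0) sink = 1;   /* never true; keeps the region non-empty */
             }
         }
         double t1 = omp_get_wtime();

         printf("average fork/join cost: %g seconds\n", (t1 - t0) / trials);
         return 0;
     }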

  5. Case Study - Dot Product
     int i;
     double A[N], B[N], sum;
     for (i = 0; i < N; i++) {
        sum = sum + A[i]*B[i];
     }
  • Parallelize this using a parallel for-loop directive
     int i;
     double A[N], B[N], sum;
     #pragma omp parallel for default(…) private(…) schedule(…[,chunksize])
     for (i = 0; i < N; i++) {
        sum = sum + A[i]*B[i];
     }
  • Hooray, parallel programming is easy !!!

  6. Parallel do/for loop - Choosing a schedule to fit your problem
     #pragma omp for schedule( type [, chunk] )
  • static - problem can be divided into equal quanta of work (dot product)
  • dynamic - problem cannot be broken up into equal chunks; use dynamic scheduling to load balance the computation on the fly
  • guided - chunk size decreases exponentially
  • auto - compiler and runtime decision
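
For reference, minimal sketches of these schedule kinds in C (the loop bodies and chunk values are illustrative, not tuned):

     #include <math.h>
     #include <omp.h>

     void schedule_examples(int N, double *x)
     {
         int i;

         /* static: iterations split into (near-)equal chunks up front */
         #pragma omp parallel for schedule(static)
         for (i = 0; i < N; i++) x[i] = sin((double)i);

         /* dynamic: threads grab chunks of 100 iterations as they finish */
         #pragma omp parallel for schedule(dynamic, 100)
         for (i = 0; i < N; i++) x[i] += cos((double)i);

         /* guided: chunks start large and shrink toward the given minimum */
         #pragma omp parallel for schedule(guided, 10)
         for (i = 0; i < N; i++) x[i] *= 0.5;

         /* auto: the compiler and runtime choose */
         #pragma omp parallel for schedule(auto)
         for (i = 0; i < N; i++) x[i] -= 1.0;

         /* runtime: kind and chunk taken from the OMP_SCHEDULE environment variable */
         #pragma omp parallel for schedule(runtime)
         for (i = 0; i < N; i++) x[i] += 1.0;
     }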

  7. Performance (figure). Ref: Chapman, Jost and Van Der Pas, Using OpenMP, The MIT Press, Cambridge, MA, 2008.

  8. Choosing a chunksize
  • On Jaguar compute nodes, cache lines are 64 bytes; chunksizes below this may result in false sharing or cache thrashing
  • For best results, the chunksize should be a statically defined constant, i.e. #define CHUNKSIZE X in C or parameter( chunksize = X ) in Fortran
  • Need to give threads enough work to amortize the cost of scheduling
  • Need to give them even more work when using a dynamic schedule
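
A small sketch of this advice in C (the value 64 is illustrative, not tuned; at 8 bytes per double a chunk of 64 elements spans eight cache lines, comfortably above the 64-byte line size):

     #include <omp.h>

     #define CHUNKSIZE 64   /* compile-time constant, as suggested above */

     void scaled_add(int N, const double *a, const double *b, double *c)
     {
         int i;
         #pragma omp parallel for schedule(static, CHUNKSIZE)
         for (i = 0; i < N; i++)
             c[i] = a[i] + 2.0*b[i];
     }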

  9. What’s wrong with this code?
     real*8 sum(1:NTHREADS), a(1:N), b(1:N)
     !$omp parallel default(shared) private(My_rank)
     My_rank = omp_get_thread_num()
     sum = 0.0
     !$omp do schedule(static)
     do n = 1, N
        sum(My_rank) = sum(My_rank) + a(n)*b(n)
     end do
     !$omp end do
     dotp = 0.0
     do n = 0, Num_threads - 1
        dotp = dotp + sum(n)
     end do
     !$omp end parallel

  10. Observed Performance (figure)

  11. Solution
  • False sharing in the read/modify/write array sum
  • Each time a thread writes to the sum array, it invalidates the cache lines of neighboring threads - adjusting the chunksize won’t help!!
  • Variable dotp should be declared private - again false sharing, and bordering on a race condition
  • False sharing can be VERY detrimental to shared memory performance
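
One common workaround (an illustrative sketch, not the lecture’s fix - the reduction clause on the next slide is cleaner) is to pad each thread’s partial sum out to its own cache line so that neighboring threads never write the same line:

     #include <omp.h>

     #define CACHE_LINE  64   /* bytes, matching the Jaguar nodes mentioned earlier */
     #define MAX_THREADS 64   /* assumed upper bound on the team size */

     struct padded { double val; char pad[CACHE_LINE - sizeof(double)]; };

     double dot_padded(int N, const double *a, const double *b)
     {
         struct padded partial[MAX_THREADS];
         int nthreads = omp_get_max_threads();
         double total = 0.0;

         for (int t = 0; t < MAX_THREADS; t++)
             partial[t].val = 0.0;

         #pragma omp parallel
         {
             int me = omp_get_thread_num();
             #pragma omp for schedule(static)
             for (int n = 0; n < N; n++)
                 partial[me].val += a[n] * b[n];   /* each thread writes only its own line */
         }

         for (int t = 0; t < nthreads; t++)
             total += partial[t].val;
         return total;
     }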

  12. A Better Solution
     real*8 my_sum, a(1:N), b(1:N)
     my_sum = 0.0
     !$omp parallel default(shared) private(My_rank)
     My_rank = omp_get_thread_num()
     !$omp do schedule(static) reduction( + : my_sum )
     do n = 1, N
        my_sum = my_sum + a(n)*b(n)
     end do
     !$omp end do
     !$omp end parallel
  • Rule of thumb - use library collectives (here, the reduction clause) before implementing your own…
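
A C counterpart of the same reduction (a minimal sketch, not code from the slides):

     #include <stdio.h>
     #include <omp.h>

     #define N 1000000

     int main(void)
     {
         static double a[N], b[N];
         double my_sum = 0.0;

         for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

         #pragma omp parallel for schedule(static) reduction(+:my_sum)
         for (int i = 0; i < N; i++)
             my_sum += a[i] * b[i];

         printf("dot product = %f\n", my_sum);
         return 0;
     }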

  13. Synchronization Points
  • Warning!! There is an implied barrier at the end of a parallel do/for loop
  • Barriers are expensive and should only be used when necessary for consistency - but use them generously when debugging
     !$omp do
     do n = 1, N
        sum(My_rank) = sum(My_rank) + a(n)*b(n)
     end do
     !$omp end do   ! threads wait here until everyone is finished
  • Question - is the barrier necessary in this case??

  14. The “nowait” clause
  • Allow asynchronous (non-blocking) progress with the nowait clause - note that it attaches to the worksharing do/for construct, not to the combined parallel do/for
  • Fortran
     !$omp parallel
     !$omp do schedule(static) reduction(+:sum)
     do n = 1, N
        sum = sum + a(n)*b(n)
     end do
     !$omp end do nowait
     !$omp end parallel
  • C/C++
     #pragma omp parallel
     {
        #pragma omp for schedule(static) reduction(+:sum) nowait
        for (i = 0; i < N; i++) {
           sum = sum + a[i]*b[i];
        }
     }

  15. Critical/atomic regions, flush
  • critical/atomic - used to guard against race conditions
  • flush - memory consistency
  • Question - What’s wrong with this code??
     #pragma omp parallel for
     for (i = 0; i < N; i++) {
        #pragma omp critical
        {
           if (arr[i] > max) max = arr[i];
        }
     }
  • The critical region is on the critical path and you incur its cost N times; you have also completely lost any concurrency in this particular example.

  16. Slightly Better Solution
     #pragma omp parallel for
     for (i = 0; i < N; i++) {
        #pragma omp flush(max)
        if (arr[i] > max) {
           #pragma omp critical
           {
              if (arr[i] > max) max = arr[i];
           }
        }
     }
  • Typically, a flush is less expensive than the overhead associated with a critical region - this also increases concurrency
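
Another common pattern (a sketch, not from the slides): let each thread track a private maximum and enter the critical region only once, after its share of the loop. Later OpenMP versions (3.1 onward) also offer reduction(max: …) for C/C++.

     #include <omp.h>

     double array_max(int N, const double *arr)   /* assumes N >= 1 */
     {
         double max = arr[0];

         #pragma omp parallel
         {
             double my_max = arr[0];

             #pragma omp for schedule(static) nowait
             for (int i = 1; i < N; i++)
                 if (arr[i] > my_max) my_max = arr[i];

             #pragma omp critical
             {
                 if (my_max > max) max = my_max;   /* one critical entry per thread */
             }
         }
         return max;
     }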

  17. Flush - “Message Passing”
  • Explicit memory fence in OpenMP
  • Question - what is the output of P1’s print?
     #pragma omp parallel shared(flag, y) private(x)
     {
        P1:                       P2:
        while (!flag) ;           y = 3;
        x = y;                    flag = 1;
        print x
     }
  • Without a flush, there is no guarantee of the order in which y and flag are written back to memory
  • P1 may “see” flag = 1 before it “sees” y = 3.

  18. Memory Consistency
  • The only way to guarantee “message passing” consistency is to flush before reading and flush after writing a shared variable accessed by multiple threads
     #pragma omp parallel shared(flag, y) private(x)
     {
        P1:                       P2:
        flush(flag, y)
        while (!flag)             y = 3
           flush(flag, y)         flag = 1
        x = y;                    flush(flag, y)
        print x
     }
  • Now, x is guaranteed to be 3 at the print - though this can harm performance
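
A self-contained sketch of the same handshake (a teaching example assuming at least two threads; production code should prefer higher-level synchronization or atomics over a flush-based spin loop):

     #include <stdio.h>
     #include <omp.h>

     int main(void)
     {
         int flag = 0, y = 0;

         #pragma omp parallel num_threads(2) shared(flag, y)
         {
             if (omp_get_thread_num() == 1) {       /* "P2": the producer */
                 y = 3;
                 #pragma omp flush(y)
                 flag = 1;
                 #pragma omp flush(flag)
             } else {                               /* "P1": the consumer */
                 int x;
                 #pragma omp flush(flag)
                 while (!flag) {
                     #pragma omp flush(flag)
                 }
                 #pragma omp flush(y)
                 x = y;
                 printf("x = %d\n", x);             /* prints x = 3 */
             }
         }
         return 0;
     }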

  19. Simple Performance Metrics - …how good is “good”?
  • Parallel efficiency and speedup
  • Two ways to visualize parallel performance data
  • Measure how close you come to perfect parallelism and to peak theoretical performance
  • Very difficult/impossible to get perfect speedup or achieve 100% parallel efficiency

  20. Parallel Performance Metrics
  • Parallel efficiency
  • Speedup
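
The formulas shown on the original slide did not survive the transcript; the standard definitions, with T(p) the run time on p threads, are

     S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}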

  21. Timing and Profiling
  • Measure with a thread-safe, wall-clock timer
  • CPU-time calls such as clock() accumulate time across all threads: the execution time reported for a parallel region can exceed the elapsed real time
    • Crude approximation = measured_time / num_threads
  • omp_get_wtime() is a portable, thread-safe wall-clock timer
  • Per-thread timing

  22. Timing example
     tic = omp_get_wtime();
     #pragma omp parallel for schedule(static) reduction(+:sum)
     for (i = 0; i < N; i++) {
        sum = sum + a[i]*b[i];
     }
     tic = omp_get_wtime() - tic;
     fprintf(stdout, "Time for parallel for loop %f\n", tic);

  23. Code Profiling
  • Amount of time spent in individual functions
  • Hardware counters (PAPI)
    • cache misses
    • cpu cycles
    • flops
    • etc.
  • CrayPAT is the native profiling and code instrumentation tool on the Cray
    • Use with OpenMP, MPI, and OpenMP + MPI applications
    • Easy to use and pretty informative

  24. Dot Product cont…
  • Question - what’s wrong with this code?
     parameter( CHUNK = 1 )
     real*8 my_sum, a(1:N), b(1:N)
     my_sum = 0.0
     !$omp parallel default(shared) private(My_rank)
     My_rank = omp_get_thread_num()
     !$omp do schedule(static, CHUNK) reduction( + : my_sum )
     do n = 1, N
        my_sum = my_sum + a(n)*b(n)
     end do
     !$omp end do
     !$omp end parallel

  25. Observed Performance (figure)

  26. CrayPAT Example - 8 threads
  • lfs> module load xt-craypat
  • Rebuild your code with the craypat module loaded
    • lfs> ftn -o my_program -mp ./my_program.f90
  • Instrument the code; this produces the instrumented executable my_program+pat
    • lfs> pat_build -O apa my_program
  • Run the instrumented code
    • lfs> aprun -n 1 -d 12 ./my_program+pat
  • Use pat_report to process the files
    • lfs> pat_report -T -o report1.txt my_program+pat+PID-nodesdt.xf
  • Rebuild the program
    • lfs> pat_build -O my_program+pat+PID-nodesdt.apa
  • Run the new instrumented executable
    • lfs> aprun -n 1 -d 12 ./my_program+apa
  • Use pat_report to process the new data file
    • lfs> pat_report -T -o report2.txt my_program+apa+PID-nodesdt.xf

  27. Chunksize = 1
  Table 1: Profile by Function Group and Function

     Time % |      Time | Imb. Time |  Imb.   | Calls | Group
            |           |           | Time %  |       |  Function
            |           |           |         |       |   Thread='HIDE'
     100.0% | 11.067847 |        -- |      -- |  40.0 | Total
    |---------------------------------------------------------------
    | 100.0% | 11.066535 |       -- |      -- |  23.0 | USER
    ||--------------------------------------------------------------
    ||  87.3% |  9.662574 | 0.377535 |    4.5% |  10.0 | MAIN_.LOOP@li.47
    ||  12.6% |  1.396741 | 1.222148 |  100.0% |   1.0 | MAIN_
    ||   0.1% |  0.007096 | 0.006209 |  100.0% |   1.0 | exit
    ||   0.0% |  0.000112 | 0.000017 |   17.0% |  10.0 | MAIN_.REGION@li.47
    ||   0.0% |  0.000013 | 0.000011 |  100.0% |   1.0 | main
    ||==============================================================
    |    0.0% |  0.001058 | 0.000926 |  100.0% |  10.0 | OMP
    |         |           |          |         |       |  MAIN_.REGION@li.47(ovhd)
    |    0.0% |  0.000253 | 0.000222 |  100.0% |   7.0 | PTHREAD
    |         |           |          |         |       |  pthread_create
    |===============================================================

  28. Chunksize = 1 - schedule(static, Chunksize)
  USER / MAIN_.LOOP@li.47
  ------------------------------------------------------------------------
  Time%                              87.3%
  Time                               9.662574 secs
  Imb.Time                           0.377535 secs
  Imb.Time%                          5.1%
  Calls                              1.0 /sec        10.0 calls
  PAPI_L1_DCM                        5.647M/sec      54567370 misses
  PAPI_TLB_DM                        0.364M/sec      3515491 misses
  PAPI_L1_DCA                        99.942M/sec     965695643 refs
  PAPI_FP_OPS                        23.286M/sec     225000010 ops
  User time (approx)                 9.663 secs      25122777671 cycles   100.0%Time
  Average Time per Call              0.966257 sec
  CrayPat Overhead : Time            0.0%
  HW FP Ops / User time              23.286M/sec     225000010 ops        0.2%peak(DP)
  HW FP Ops / WCT                    23.286M/sec
  Computational intensity            0.01 ops/cycle  0.23 ops/ref
  MFLOPS (aggregate)                 23.29M/sec
  TLB utilization                    274.70 refs/miss  0.537 avg uses
  D1 cache hit,miss ratios           94.3% hits      5.7% misses
  D1 cache utilization (misses)      17.70 refs/miss 2.212 avg hits
  ========================================================================

  29. Chunksize = default - schedule(static)
  Table 1: Profile by Function Group and Function

     Time % |     Time | Imb. Time |  Imb.   | Calls | Group
            |          |           | Time %  |       |  Function
            |          |           |         |       |   Thread='HIDE'
     100.0% | 2.575582 |        -- |      -- |  40.0 | Total
    |--------------------------------------------------------------
    |  99.9% | 2.572372 |        -- |      -- |  23.0 | USER
    ||-------------------------------------------------------------
    ||  52.0% | 1.340469 | 1.172911 |  100.0% |   1.0 | MAIN_
    ||  47.6% | 1.224797 | 0.004662 |    0.4% |  10.0 | MAIN_.LOOP@li.45
    ||   0.3% | 0.007000 | 0.006125 |  100.0% |   1.0 | exit
    ||   0.0% | 0.000091 | 0.000015 |   19.3% |  10.0 | MAIN_.REGION@li.45
    ||   0.0% | 0.000015 | 0.000013 |  100.0% |   1.0 | main
    ||=============================================================
    |    0.1% | 0.002983 | 0.002610 |  100.0% |  10.0 | OMP
    |         |          |          |         |       |  MAIN_.REGION@li.45(ovhd)
    |    0.0% | 0.000227 | 0.000199 |  100.0% |   7.0 | PTHREAD
    |         |          |          |         |       |  pthread_create
    |==============================================================

  30. Chunksize = default
  USER / MAIN_.LOOP@li.45
  ------------------------------------------------------------------------
  Time%                              47.6%
  Time                               1.224797 secs
  Imb.Time                           0.004662 secs
  Imb.Time%                          0.5%
  Calls                              8.2 /sec        10.0 calls
  PAPI_L1_DCM                        0.846M/sec      1035963 misses
  PAPI_TLB_DM                        0.359M/sec      439124 misses
  PAPI_L1_DCA                        734.818M/sec    900031012 refs
  PAPI_FP_OPS                        183.698M/sec    225000010 ops
  User time (approx)                 1.225 secs      3184570214 cycles    100.0%Time
  Average Time per Call              0.122480 sec
  CrayPat Overhead : Time            0.0%
  HW FP Ops / User time              183.698M/sec    225000010 ops        1.8%peak(DP)
  HW FP Ops / WCT                    183.698M/sec
  Computational intensity            0.07 ops/cycle  0.25 ops/ref
  MFLOPS (aggregate)                 183.70M/sec
  TLB utilization                    2049.61 refs/miss  4.003 avg uses
  D1 cache hit,miss ratios           99.9% hits      0.1% misses
  D1 cache utilization (misses)      868.79 refs/miss 108.598 avg hits
  ========================================================================

  31. Conclusion
  • Clearly chunksize = 1 is a bad choice for this problem
  • Cache lines are 64 bytes
  • With eight threads and schedule(static, 1), each thread accesses A and B with a stride of 8 elements = 64 bytes, so nearly every reference lands on a new cache line
  • The performance degradation is due to cache thrashing, not false sharing

  32. Parallel efficiency – Jaguar PF (figure)

  33. Speedup – Jaguar PF (figure)

  34. aprun process and memory affinity flags
  • Remote NUMA-node memory references and process migration can reduce performance
  • Keep physical memory as close as possible to the threads that use it
  • aprun provides memory affinity and CPU affinity options (a combined example follows below):
    • -cc cpu_list | cpu | numa_node - binds PEs to CPUs
    • -d depth - specifies the number of CPUs for each PE and its threads
    • -S pes_per_numa_node - specifies the number of PEs to allocate per NUMA node; may only apply to the XT5, see the man page…
    • -ss - specifies strict memory containment per NUMA node; when -ss is set, a PE can allocate only memory local to its assigned NUMA node - might be XT5 only, see the man page
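
A hedged example of how these flags might be combined for one PE running 12 OpenMP threads (the values are illustrative; check the aprun man page for your system):

     lfs> export OMP_NUM_THREADS=12
     lfs> aprun -n 1 -d 12 -cc cpu -ss ./my_program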

  35. Performance Killers - things that will degrade the performance of your code
  • Fine-grain parallelism - threads need a lot of work to amortize the cost of creating, scheduling and executing a parallel region
  • Synchronization - careful, sometimes synchronization is implicit in OpenMP, e.g. at the end of a parallel do/for loop!
    • Use a liberal dose when debugging, but in performance tuning the goal is to remove as much of it as possible
  • False sharing - more than one thread accessing the same cache line on a cache-coherent system, resulting in cache-line invalidations => cache misses
  • Flushing variables for memory consistency - the compiler cannot optimize these variables
  • Serialization - too many atomic updates and/or critical regions, and critical regions with too much work
  • Amdahl’s law - “The speedup of a program is limited by the time needed for the sequential fraction of the program” (formula below)
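
Writing f for the sequential fraction of the run time and p for the number of threads, the usual quantitative form of Amdahl’s law is

     S(p) = \frac{1}{f + \frac{1-f}{p}} \le \frac{1}{f}

so even with unlimited threads the speedup cannot exceed 1/f.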
