OpenMP Lecture #4: Performance Tuning
Josh Ladd
Grad 511
February 16, 2011
OpenMP lec #2

OpenMP and Performance
• OpenMP is great, it’s so simple to use!!!
• Easy to implement and get running
• Getting good performance is another story…
• Shared memory programming introduces overheads, pitfalls and traps not encountered in serial programming
• OpenMP itself has many “knobs” to consider in the tuning process
• Not just OpenMP directives and clauses to consider, but also system parameters:
    • Memory locality and coherency issues
    • Process and memory affinity options
    • Compiler options – different OpenMP implementations support different “features” and/or code analysis tools
    • Environment variables
    • Runtime options

Shared Memory Overheads and Considerations
• Thread creation and scheduling, i.e., forking a team of threads
• Memory bandwidth scalability - von Neumann bottleneck
• Synchronization and memory consistency costs
• False sharing and cache utilization in general - requires the programmer to think long and hard about data layout and parallel access patterns

Measured overhead of forking a team of OpenMP threads (figure not reproduced)

Case Study - Dot Product
    int i;
    double A[N], B[N], sum = 0.0;
    for (i = 0; i < N; i++) {
        sum = sum + A[i]*B[i];
    }
• Parallelize this using a parallel for-loop directive
    int i;
    double A[N], B[N], sum = 0.0;
    #pragma omp parallel for default(…) private(…) schedule(…[, chunksize])
    for (i = 0; i < N; i++) {
        sum = sum + A[i]*B[i];
    }
• Hooray, parallel programming is easy!!!

Parallel do/for loop - Choosing a schedule to fit your problem
    C/C++:   #pragma omp for schedule(type[, chunk])
    Fortran: !$omp do schedule(type[, chunk])
• static - problem can be divided into equal quanta of work (dot product)
• dynamic - problem cannot be broken up into equal chunks; use dynamic scheduling to load balance the computation on the fly
• guided - chunks are handed out on demand and the chunk size decreases (roughly exponentially) as the loop proceeds
• auto - compiler and runtime decision

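To make the static/dynamic distinction concrete, here is a minimal C sketch. The workload `uneven_work` and the function names are hypothetical and not from the lecture; the only point is that the cost of an iteration grows with `i`, which is the situation where a dynamic schedule is expected to balance load better than equal static blocks.

```c
/* Hypothetical workload: iteration i costs roughly i operations, so
   equal-sized static blocks give the last thread far more work. */
static double uneven_work(int i)
{
    double x = 0.0;
    for (int k = 0; k <= i; k++)
        x += 1.0 / (double)(k + 1);
    return x;
}

/* Sum uneven_work(i) for i = 0..n-1; chunks of `chunk` iterations are
   handed to whichever thread becomes free next. */
double sum_dynamic(int n, int chunk)
{
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+ : total)
    for (int i = 0; i < n; i++)
        total += uneven_work(i);
    return total;
}

/* Same sum with one contiguous block of iterations per thread. */
double sum_static(int n)
{
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : total)
    for (int i = 0; i < n; i++)
        total += uneven_work(i);
    return total;
}
```

Both functions return the same value up to floating-point rounding; only the assignment of iterations to threads differs, so timing the two is one way to see whether the extra scheduling cost of `dynamic` pays off for a given loop.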
Performance (figure not reproduced)
Ref: Chapman, Jost and Van Der Pas, Using OpenMP, The MIT Press, Cambridge, MA, 2008.

Choosing a chunksize
• On Jaguar compute nodes, cache lines are 64 bytes (eight doubles); chunks that cover less than one cache line may result in false sharing or cache thrashing
• For best results, chunksize should be a compile-time constant, i.e. #define chunksize X or parameter( chunksize = X )
• Need to give threads enough work to amortize the cost of scheduling
• Need to find even more work when using a dynamic schedule

What’s wrong with this code?
    real*8 sum(1:NTHREADS), a(1:N), b(1:N)
    !$omp parallel default(shared) private(My_rank)
    My_rank = omp_get_thread_num()
    sum = 0.0
    !$omp do schedule(static)
    do n = 1, N
       sum(My_rank) = sum(My_rank) + a(n)*b(n)
    end do
    !$omp end do
    dotp = 0.0
    do n = 0, Num_threads - 1
       dotp = dotp + sum(n)
    end do
    !$omp end parallel

Observed Performance (figure not reproduced)

Solution
• False sharing in the read/modify/write of the array sum
• Each time a thread writes to its element of sum, it invalidates the copies of that cache line held by neighboring threads - adjusting the chunksize won’t help!!
• Variable dotp should be declared private - again false sharing, and bordering on a race condition
• False sharing can be VERY detrimental to shared memory performance

A Better Solution
    real*8 my_sum, a(1:N), b(1:N)
    my_sum = 0.
    !$omp parallel default(shared)
    !$omp do schedule(static) reduction(+ : my_sum)
    do i = 1, N
       my_sum = my_sum + a(i)*b(i)
    end do
    !$omp end do
    !$omp end parallel
• Rule of thumb - use library collectives before implementing your own…

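For C programmers, the same reduction-based approach might look like the sketch below. The function name `dot_product` and its signature are assumptions made for illustration. Each thread accumulates into a private copy of `sum`, so no two threads write to the same cache line inside the loop.

```c
#include <stddef.h>

/* Dot product of a and b (both of length n) using an OpenMP reduction.
   Each thread gets a private copy of sum initialised to 0; the runtime
   adds the copies together at the end of the loop. */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    long i;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (i = 0; i < (long)n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Compared with the hand-rolled array of per-thread partial sums, the reduction clause leaves placement of the partial results to the runtime, which is what avoids the false sharing described above.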
Synchronization Points
• Warning!! There is an implied barrier at the end of a parallel do/for loop
• Barriers are expensive and should only be used if necessary for consistency - but use them generously when debugging
    !$omp do
    do n = 1, l
       sum(My_rank) = sum(My_rank) + a(n)*b(n)
    end do
    !$omp end do   ! threads wait here until everyone is finished
• Question - is the barrier necessary in this case??

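The “nowait” clause
• Allow for asynchronous (non-blocking) progress with the nowait clause
• nowait applies to the work-sharing do/for construct inside a parallel region; it is not permitted on the combined parallel do/for directive
• With nowait, the reduced value of sum is only guaranteed after the next barrier (here, the end of the parallel region)
• Fortran
    !$omp parallel
    !$omp do schedule(static) reduction(+ : sum)
    do n = 1, l
       sum = sum + a(n)*b(n)
    end do
    !$omp end do nowait
    !$omp end parallel
• C/C++
    #pragma omp parallel
    {
        #pragma omp for schedule(static) reduction(+ : sum) nowait
        for (i = 0; i < N; i++) {
            sum = sum + a[i]*b[i];
        }
    }
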
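A case where dropping the barrier is clearly safe is two loops that touch unrelated data. The following C sketch uses a hypothetical function `scale_both`, which is not from the lecture: threads that finish their share of the first loop start on the second without waiting for the rest of the team.

```c
/* Multiply every element of x and y by s. The two loops are independent,
   so no thread has to wait at the end of the first one. */
void scale_both(double *x, double *y, int n, double s)
{
    #pragma omp parallel
    {
        /* nowait removes the implied barrier after this loop */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++)
            x[i] *= s;

        /* the implied barrier here (and at the end of the parallel
           region) still guarantees both arrays are complete on return */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            y[i] *= s;
    }
}
```

If the second loop read `x[]`, the barrier would be needed again unless both loops used the same static schedule and iteration count, so the clause is best added only after checking the data dependences.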
Critical/atomic regions, flush
• critical/atomic - used to guard against race conditions
• flush - memory consistency
• Question - What’s wrong with this code??
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        #pragma omp critical
        {
            if (arr[i] > max) max = arr[i];
        }
    }
• The critical region is in the critical path and you incur its cost N times; you’ve also completely lost any concurrency in this particular example.

Slightly Better Solution
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        #pragma omp flush(max)
        if (arr[i] > max) {
            #pragma omp critical
            {
                if (arr[i] > max) max = arr[i];
            }
        }
    }
• Typically, a flush is less expensive than the overhead associated with a critical region - also increases concurrency

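A further variation, not shown in the slides, is to keep a private running maximum per thread and enter the critical region only once per thread. This is a minimal sketch under that assumption; the name `array_max` is hypothetical.

```c
/* Return the largest element of arr[0..n-1]; n must be at least 1. */
double array_max(const double *arr, int n)
{
    double max = arr[0];
    #pragma omp parallel
    {
        double local_max = arr[0];      /* private to each thread */

        #pragma omp for schedule(static) nowait
        for (int i = 1; i < n; i++)
            if (arr[i] > local_max)
                local_max = arr[i];

        /* one critical section per thread instead of one per element */
        #pragma omp critical
        {
            if (local_max > max)
                max = local_max;
        }
    }
    return max;
}
```

The loop body now contains no synchronization at all, so the cost of the critical region is paid once per thread rather than up to N times. Compilers that implement OpenMP 3.1 or later also accept `reduction(max : max)` on the loop in C/C++, which expresses the same idea in one clause.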
Flush - “Message Passing”
• Explicit memory fence in OpenMP
• Question - what is the output of P1’s print?
    #pragma omp parallel shared(flag, y) \
                         private(x)
    {
        P1                  P2
        while(!flag)        y = 3
        x = y;              flag = 1
        print x
    }
• Without a flush, there is no guarantee of the order in which y and flag are written back to memory
• P1 may “see” flag = 1 before it “sees” y = 3.

Memory Consistency
• The only way to guarantee “message passing” consistency is to flush before reading and flush after writing a shared variable accessed by multiple threads
    #pragma omp parallel shared(flag, y) \
                         private(x)
    {
        P1                  P2
        flush(flag, y)
        while(!flag)        y = 3
        flush(flag,y)       flag = 1
        x = y;              flush(flag,y)
        print x
    }
• Now, x is guaranteed to be 3 at the print - can harm performance though

Simple Performance Metrics - …how good is “good”?
• Parallel efficiency and speedup
• Two ways to visualize parallel data
• Both measure how close you are to perfect parallelism, i.e. peak theoretical performance
• Very difficult/impossible to get perfect speedup or achieve 100% parallel efficiency

Parallel Performance Metrics
• Parallel efficiency
• Speedup

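The formulas on this slide did not survive in the transcript. The conventional definitions are given here as an assumption about what was shown, with $T(p)$ denoting the wall-clock time on $p$ threads (the symbols are chosen for illustration):

```latex
\[
  S(p) = \frac{T(1)}{T(p)}, \qquad
  E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}
\]
```

Perfect parallelism corresponds to $S(p) = p$ and $E(p) = 1$ (100% parallel efficiency). For example, a loop that takes 8 s on one thread and 2 s on eight threads has $S(8) = 4$ and $E(8) = 0.5$.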
Timing and Profiling
• Measure with a thread-safe wall-clock timer
• CPU-time timers such as clock() are a poor fit for parallel regions: the timing register accumulates values for all threads, so the reported execution time will be more than the elapsed real time
    • Crude approximation = measured_time/num_threads
• omp_get_wtime() is a portable, thread-safe timer
    • Per thread timing

Timing example
    tic = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (i = 0; i < N; i++) {
        sum = sum + a[i]*b[i];
    }
    tic = omp_get_wtime() - tic;
    fprintf(stdout, "Time for parallel do loop %f\n", tic);

Code Profiling
• Amount of time spent in individual functions
• Hardware counters (PAPI)
    • cache misses
    • CPU cycles
    • flops
    • etc.
• CrayPAT is the native profiling and code instrumentation tool on the Cray
• Use with OpenMP, MPI, and OpenMP + MPI applications
• Easy to use and pretty informative

Dot Product cont…
• Question - what’s wrong with this code?
    parameter( CHUNK = 1 )
    real*8 my_sum, a(1:N), b(1:N)
    my_sum = 0.
    !$omp parallel default(shared) private(My_rank)
    My_rank = omp_get_thread_num()
    !$omp do schedule(static, CHUNK) reduction(+ : my_sum)
    do i = 1, N
       my_sum = my_sum + a(i)*b(i)
    end do
    !$omp end do
    !$omp end parallel

Observed Performance (figure not reproduced)

CrayPAT Example - 8 threads
• Load the CrayPAT module
    lfs> module load xt-craypat
• Rebuild your code with the craypat module loaded
    lfs> ftn -o my_program -mp ./my_program.f90
• Instrument the code; produces the instrumented executable my_program+pat
    lfs> pat_build -O apa my_program
• Run the instrumented code
    lfs> aprun -n 1 -d 12 ./my_program+pat
• Use pat_report to process the files
    lfs> pat_report -T -o report1.txt my_program+pat+PID-nodesdt.xf
• Rebuild the program
    lfs> pat_build -O my_program+pat+PID-nodesdt.apa
• Run the new instrumented executable
    lfs> aprun -n 1 -d 12 ./my_program+apa
• Use pat_report to process the new data file
    lfs> pat_report -T -o report2.txt my_program+apa+PID-nodesdt.xf

Chunksize = 1
Table 1: Profile by Function Group and Function
     Time % |      Time | Imb. Time |   Imb. | Calls |Group
            |           |           | Time % |       | Function
            |           |           |        |       |  Thread='HIDE'
     100.0% | 11.067847 |        -- |     -- |  40.0 |Total
    |---------------------------------------------------------------
    | 100.0% | 11.066535 |        -- |     -- |  23.0 |USER
    ||--------------------------------------------------------------
    ||  87.3% |  9.662574 |  0.377535 |   4.5% |  10.0 |MAIN_.LOOP@li.47
    ||  12.6% |  1.396741 |  1.222148 | 100.0% |   1.0 |MAIN_
    ||   0.1% |  0.007096 |  0.006209 | 100.0% |   1.0 |exit
    ||   0.0% |  0.000112 |  0.000017 |  17.0% |  10.0 |MAIN_.REGION@li.47
    ||   0.0% |  0.000013 |  0.000011 | 100.0% |   1.0 |main
    ||==============================================================
    |   0.0% |  0.001058 |  0.000926 | 100.0% |  10.0 |OMP
    |        |           |           |        |       | MAIN_.REGION@li.47(ovhd)
    |   0.0% |  0.000253 |  0.000222 | 100.0% |   7.0 |PTHREAD
    |        |           |           |        |       | pthread_create
    |===============================================================

Chunksize = 1 - schedule(static, Chunksize)
    USER / MAIN_.LOOP@li.47
    ------------------------------------------------------------------------
      Time%                                 87.3%
      Time                               9.662574 secs
      Imb.Time                           0.377535 secs
      Imb.Time%                              5.1%
      Calls                        1.0 /sec          10.0 calls
      PAPI_L1_DCM                5.647M/sec      54567370 misses
      PAPI_TLB_DM                0.364M/sec       3515491 misses
      PAPI_L1_DCA               99.942M/sec     965695643 refs
      PAPI_FP_OPS               23.286M/sec     225000010 ops
      User time (approx)         9.663 secs   25122777671 cycles  100.0%Time
      Average Time per Call      0.966257 sec
      CrayPat Overhead : Time    0.0%
      HW FP Ops / User time     23.286M/sec     225000010 ops  0.2%peak(DP)
      HW FP Ops / WCT           23.286M/sec
      Computational intensity     0.01 ops/cycle     0.23 ops/ref
      MFLOPS (aggregate)         23.29M/sec
      TLB utilization           274.70 refs/miss    0.537 avg uses
      D1 cache hit,miss ratios    94.3% hits          5.7% misses
      D1 cache utilization (misses)  17.70 refs/miss  2.212 avg hits
    ========================================================================

Chunksize = default - schedule(static)
Table 1: Profile by Function Group and Function
     Time % |      Time | Imb. Time |   Imb. | Calls |Group
            |           |           | Time % |       | Function
            |           |           |        |       |  Thread='HIDE'
     100.0% |  2.575582 |        -- |     -- |  40.0 |Total
    |--------------------------------------------------------------
    |  99.9% |  2.572372 |        -- |     -- |  23.0 |USER
    ||-------------------------------------------------------------
    ||  52.0% |  1.340469 |  1.172911 | 100.0% |   1.0 |MAIN_
    ||  47.6% |  1.224797 |  0.004662 |   0.4% |  10.0 |MAIN_.LOOP@li.45
    ||   0.3% |  0.007000 |  0.006125 | 100.0% |   1.0 |exit
    ||   0.0% |  0.000091 |  0.000015 |  19.3% |  10.0 |MAIN_.REGION@li.45
    ||   0.0% |  0.000015 |  0.000013 | 100.0% |   1.0 |main
    ||=============================================================
    |   0.1% |  0.002983 |  0.002610 | 100.0% |  10.0 |OMP
    |        |           |           |        |       | MAIN_.REGION@li.45(ovhd)
    |   0.0% |  0.000227 |  0.000199 | 100.0% |   7.0 |PTHREAD
    |        |           |           |        |       | pthread_create
    |==============================================================

Chunksize = default
    USER / MAIN_.LOOP@li.45
    ------------------------------------------------------------------------
      Time%                                 47.6%
      Time                               1.224797 secs
      Imb.Time                           0.004662 secs
      Imb.Time%                              0.5%
      Calls                        8.2 /sec          10.0 calls
      PAPI_L1_DCM                0.846M/sec       1035963 misses
      PAPI_TLB_DM                0.359M/sec        439124 misses
      PAPI_L1_DCA              734.818M/sec     900031012 refs
      PAPI_FP_OPS              183.698M/sec     225000010 ops
      User time (approx)         1.225 secs    3184570214 cycles  100.0%Time
      Average Time per Call      0.122480 sec
      CrayPat Overhead : Time    0.0%
      HW FP Ops / User time    183.698M/sec     225000010 ops  1.8%peak(DP)
      HW FP Ops / WCT          183.698M/sec
      Computational intensity     0.07 ops/cycle     0.25 ops/ref
      MFLOPS (aggregate)        183.70M/sec
      TLB utilization          2049.61 refs/miss    4.003 avg uses
      D1 cache hit,miss ratios    99.9% hits          0.1% misses
      D1 cache utilization (misses) 868.79 refs/miss  108.598 avg hits
    ========================================================================

Conclusion
• Clearly chunksize = 1 is a bad choice for this problem
• 64 byte cache line
• Eight threads with chunksize = 1 means each thread accesses A and B with a stride of 8 elements = 64 bytes, so it uses only one double from every cache line it loads
• Performance degradation is due to cache thrashing, not false sharing

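The stride claim can be checked with a little arithmetic. The sketch below uses hypothetical helper names and relies on the round-robin assignment of chunks to threads that OpenMP specifies for `schedule(static, chunk)`: it computes which thread owns an iteration and how far apart one thread's consecutive chunks lie in memory.

```c
#include <stddef.h>

/* Thread that executes iteration i (counted from 0) under
   schedule(static, chunk) with nthreads threads: chunks are dealt out
   round-robin in order of thread number. */
int static_owner(long i, long chunk, int nthreads)
{
    return (int)((i / chunk) % nthreads);
}

/* Distance in bytes between the starts of two consecutive chunks handled
   by the same thread when the loop walks an array of doubles. */
size_t chunk_stride_bytes(long chunk, int nthreads)
{
    return (size_t)chunk * (size_t)nthreads * sizeof(double);
}
```

With chunk = 1 and 8 threads, thread 0 owns iterations 0, 8, 16, … and `chunk_stride_bytes(1, 8)` is 64 on a platform with 8-byte doubles: every access by a thread lands on a new 64-byte cache line, which is consistent with the much higher L1 miss count in the chunksize = 1 profile.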
Parallel efficiency – Jaguar PF (figure not reproduced)

Speedup – Jaguar PF (figure not reproduced)

aprun process and memory affinity flags
• Remote NUMA-node memory references and process migration can reduce performance
• Keep physical memory as close as possible to threads
• aprun provides memory affinity and CPU affinity options
    • -cc cpu_list | cpu | numa_node
        • Binds PEs to CPUs
    • -d depth
        • Specifies the number of CPUs for each PE and its threads
    • -S pes_per_numa_node
        • Specifies the number of PEs to allocate per NUMA node; may only apply to XT5, see man page…
    • -ss
        • Specifies strict memory containment per NUMA node; when -ss is set, a PE can allocate only the memory local to its assigned NUMA node - might be XT5 only, see man page

Performance Killers - Things that will degrade the performance of your code
• Fine grain parallelism - threads need a lot of work in order to amortize the cost to create, schedule and execute a parallel region
• Synchronization - careful, sometimes synchronization is implicit in OpenMP, e.g. at the end of a parallel do/for loop!
    • Use a liberal dose when debugging, but the goal in performance tuning is to figure out how to remove as much as possible
• False sharing – more than a single thread accessing the same cache line on a cache coherent system, resulting in cache line invalidations => cache misses
• Flushing variables for memory consistency - the compiler cannot optimize these variables
• Serialization – too many atomic updates and/or critical regions, and critical regions with too much work
• Amdahl’s law - “The speedup of a program is limited by the time needed for the sequential fraction of the program”

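Amdahl’s law is usually written as follows, with $f$ the fraction of the serial run time that can be parallelized and $p$ the number of threads (the symbols are chosen here for illustration and do not appear in the slides):

```latex
\[
  S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}},
  \qquad
  \lim_{p \to \infty} S(p) = \frac{1}{1 - f}
\]
```

As a worked example, a code that is 90% parallel ($f = 0.9$) reaches $S(8) = 1/(0.1 + 0.1125) \approx 4.7$ on eight threads and can never exceed a speedup of 10, however many threads are added.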