
Effectively Addressing Memory David Skinner, NERSC Division, Berkeley Lab






Presentation Transcript


1. Effectively Addressing Memory
David Skinner, NERSC Division, Berkeley Lab

2. Abstract: A demonstration of how loop control structures impact memory bandwidth and program performance is presented. The performance of various loops and memory-addressing idioms in different languages, with varying strides and access patterns, is examined. The focus is on making clean-looking code perform well.

3. CPU and Memory Imbalance
• John McCalpin's STREAM benchmark shows that CPU speed is fast outpacing memory speed.
• The megahertz war in commodity computing impacts HPC.
• Main memory access is often the bottleneck to performance.

4. The STREAM Benchmark
• Should you expect to see the STREAM memory bandwidth in your own code?
• How drastically are you willing to modify your code to improve memory access rates?
• Good main memory bandwidth at the machine level is a necessary but not sufficient condition for good program performance. You also need good algorithms and compilers.

5. SMPs themselves are increasingly unbalanced

  6. Hardware Overview

7. Performance at what cost?

Simple version:

  SUBROUTINE stream_triad (a,b,c,scalar,n)
  REAL*8 c(n),a(n),b(n),scalar
  DO i=1,n
     a(i) = b(i) + scalar*c(i)
  END DO
  END

Hand-tuned version:

  SUBROUTINE stream_triad (a,b,c,scalar,n)
  REAL*8 c(n),a(n),b(n),scalar
  INTEGER AHEAD
  PARAMETER (AHEAD=128)
  IF (n.LE.AHEAD+1) THEN
     DO j = 1,n
        a(j) = b(j) + scalar*c(j)
     END DO
  ELSE
     DO j = 1,n-AHEAD,16
!IBM*   CACHE_ZERO (a(j+AHEAD))
        DO i=0,15
           a(j+i) = b(j+i) + scalar*c(j+i)
        END DO
     END DO
     DO j = n-AHEAD+1,n
        a(j) = b(j) + scalar*c(j)
     END DO
  END IF
  END

Will improving memory bandwidth make my code unintelligible, not portable, more prone to breaking?

8. Scientific Applications
• Linear algebra
• Bioinformatics
• PDEs
• Sparse/mesh-free methods
All share certain language constructs. All address data structures of appreciable size. In what follows we'll focus on kernels which represent these stanzas, in a variety of programming languages.

9. Schematic of a Scientific Code
[Diagram: four loops, Loop1 through Loop4, built from the basic kernels fill, scale, copy, add, and daxpy]
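For concreteness, here is a minimal C++ sketch of those five kernels (function names and signatures are illustrative, not from the slides). Each does trivial arithmetic per element, so its speed is governed almost entirely by memory traffic:

  #include <cstddef>

  // The five memory-bound kernels from the schematic. Each does O(1)
  // arithmetic per element, so memory bandwidth sets the pace.
  void fill (double* a, double s, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) a[i] = s;
  }
  void scale(double* a, const double* b, double s, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) a[i] = s * b[i];
  }
  void copy (double* a, const double* b, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) a[i] = b[i];
  }
  void add  (double* a, const double* b, const double* c, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + c[i];
  }
  // daxpy: y = a*x + y, matching the BLAS routine of the same name.
  void daxpy(double* y, double a, const double* x, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
  }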

10. Case Study: MCTDH

  %    cumulative   self             self     total
 time    seconds  seconds    calls  ms/call  ms/call  name
 28.9      20.03    20.03      229    87.47    87.47  .qtxxzz [9]
 20.8      34.39    14.36      200    71.80    71.80  .qtxxzza [12]
 15.3      44.95    10.56     8534     1.24     1.24  .cpvxz [20]
  5.8      48.99     4.04     4574     0.88     0.88  .qtxxdz [32]
  4.4      52.01     3.02     1038     2.91     2.91  .zerovxz [36]
  4.0      54.80     2.79       42    66.43    66.43  .rmmxxxzz [38]
  2.6      56.57     1.77       92    19.24    19.24  .xvxxzza [42]
  2.0      57.96     1.39      408     3.41     3.41  .mmaxzz [47]
  1.7      59.17     1.21      176     6.88     6.88  .mattens [49]
  1.3      60.08     0.91       24    37.92    37.92  .rm1hxxxzz [53]
  1.3      60.99     0.91        3   303.33  1884.19  .csilstep [31]
  0.9      61.61     0.62      176     3.52     3.52  .mqxxzz [61]
  0.9      62.22     0.61      176     3.47     3.47  .mqxtzz [62]
  0.8      62.80     0.58     2741     0.21     0.21  .vvaxzz [65]
  0.8      63.34     0.54     1919     0.28     0.28  .addmxxzo [67]
  0.7      63.84     0.50      790     0.63     0.63  .zeromxz [68]
  0.5      64.17     0.33      184     1.79     1.79  .overmxz [74]
  0.5      64.49     0.32                             .IOWrite [75]

Memory-wise, the hot routines reduce to the basic kernels: fill, copy, scale, add, triad.

11. Loop Constructs

Direct:
  for(i=0;i<n;i++) { a[i] = b[i]; }

  do i=1,n
     a(i) = b(i)
  enddo

Strided:
  for(i=0;i<n;i+=stride) { a[i] = b[i]; }

  do i=1,n,stride
     a(i) = b(i)
  enddo

Indirect:
  Gather:  a[i] = b[ib[i]]
  Scatter: a[ia[i]] = b[i]
  Gascat:  a[ia[i]] = b[ib[i]]

Multidimensional:
  do k = 1,m
     do j = 1,n
        do i = 1,o
           a(i,j,k) = b(i,j,k)
  ...
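The indirect idioms are easy to state but hard on the memory system; a self-contained C++ sketch (the index-array type is an assumption):

  #include <cstddef>

  // Gather: the read side is indirect; each element of b is reached
  // through a dependent load of ib[i].
  void gather(double* a, const double* b, const std::size_t* ib, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) a[i] = b[ib[i]];
  }

  // Scatter: the write side is indirect.
  void scatter(double* a, const double* b, const std::size_t* ia, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) a[ia[i]] = b[i];
  }

  // Gascat: indirect on both sides; random index order destroys
  // spatial locality in both arrays.
  void gascat(double* a, const double* b,
              const std::size_t* ia, const std::size_t* ib, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) a[ia[i]] = b[ib[i]];
  }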

12. Alternatives to Loops

libc <string.h>:
  fill: memset, bzero
  copy: memmove, memcpy

BLAS:
  copy:  dcopy
  scale: dcopy, dscal
  add:   dcopy, daxpy
  triad: dcopy, daxpy

Fortran 90 intrinsics:
  copy:  a(1:n:stride) = b(1:n:stride)
  triad: a = b + s*c

STL:
  vector<double> a,b,c;
  fill: fill(a.begin(), a.end(), scalar)
  copy: copy(b.begin(), b.end(), a.begin())
  add:  transform(b.begin(), b.end(), c.begin(), a.begin(), plus<double>())

There's definitely more than one way to do it.
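Assembled into a compilable unit, the STL fragments above look roughly like this (the sizes and scalar value are illustrative):

  #include <algorithm>
  #include <cstddef>
  #include <functional>
  #include <vector>

  int main() {
      const std::size_t n = 65536;
      const double scalar = 3.0;
      std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

      std::fill(a.begin(), a.end(), scalar);              // fill
      std::copy(b.begin(), b.end(), a.begin());           // copy
      std::transform(b.begin(), b.end(), c.begin(),       // add: a = b + c
                     a.begin(), std::plus<double>());
      return 0;
  }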

13. Implementation of Tests

s00513> ./xtream -h
usage: xtream [options] ndim dim1 [dim2 ...] [inc1 inc2 ...]
  -stride n   uses stride n
  -libc       1d bzero, memset, memcpy, memmove ops
  -blas       1d blas ops dcopy, dscal, daxpy
  -stl        STL algorithms
  -scatter    a(ia(i)) = b(i)
  -gather     a(i) = b(ib(i))
  -gascat     a(ia(i)) = b(ib(i))
  -null       check timers with a null op
  -nit n      average results over n iterations
  -rate       show MB/sec results instead of times
  -scan       increase sizes as dim(i) += inc(i)
  -scanx      increase sizes as dim(i) += dim(i)/inc(i)+1

14. Example

s00513> ./xtream 2 256 256
host s00513 006006564C00/AIX word=8 (Mon Dec 16 08:16:03 2002)
cmd "./xtream 2 256 256"
dimension 65536 2 { 256 256 } nit = 128 mb = 5.000e-01

construct    N      t_fill     t_copy     t_scale    t_add      t_triad
c_for,1d     65536  4.749e-04  5.260e-04  4.758e-04  1.896e-03  1.617e-03
f_do,1d      65536  4.745e-04  5.258e-04  4.777e-04  1.846e-03  1.846e-03
c++_stl,1d   65536  5.621e-04  4.840e-04  4.864e-04  7.353e-03  7.834e-03
c_blas       65536  0.000e+00  3.851e-04  3.472e-04  8.185e-04  8.151e-04
c_for        65536  4.752e-04  5.209e-04  4.543e-04  1.870e-03  1.861e-03
c++_for      65536  4.748e-04  5.207e-04  4.494e-04  1.858e-03  1.874e-03
f_do         65536  4.728e-04  5.228e-04  4.734e-04  5.228e-04  1.848e-03
f90_intr     65536  4.733e-04  5.241e-04  4.752e-04  1.830e-03  1.835e-03
wallclock    65536  6.323e+00 sec

A lot of information for the programmer!
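To turn these times into the MB/sec figures that the -rate option reports, divide bytes moved by elapsed time. A sketch of the arithmetic (xtream's actual implementation isn't shown in the slides; STREAM-style traffic accounting is assumed):

  #include <cstddef>
  #include <cstdio>

  // Bytes moved / seconds elapsed, in MB/s. STREAM-style accounting:
  // copy and scale touch 2 arrays per element, add and triad touch 3,
  // fill touches 1 (ignoring write-allocate traffic).
  double rate_mb_per_s(std::size_t n, int arrays_touched, double seconds) {
      double bytes = static_cast<double>(n) * arrays_touched * sizeof(double);
      return bytes / seconds / 1.0e6;
  }

  int main() {
      // t_copy for f_do,1d in the table above: 5.258e-04 s over 65536 words.
      std::printf("copy: %.0f MB/s\n", rate_mb_per_s(65536, 2, 5.258e-04));
      return 0;
  }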

  15. Scanning over problem size (copy)

16. Compiler Options
• By default IBM's compilers provide no optimization.
• By default you get only one 256 MB memory segment (use -bmaxdata:0x70000000 to get more).
• Optimization levels tested:
  - none
  - -O2
  - -O3 -qstrict -qarch=auto -qtune=auto

  17. -O2

18. -O3 -qstrict -qtune=pwr3 -qarch=pwr3

19. Misalignment

Has xlf ever told you?
  1514-008 (W) Variable mres is misaligned. This may affect the efficiency of the code.

2000x2000 ESSL-SMP DGEMMs:

Mflip/s  Efficiency  Misalignment
5304     100%        none; just real*8 arrays in common
4482      85%        4 bytes; integer as first item in common
4317      81%        4 bytes; character*4 as first item in common
 397       7%        1 byte; character*1 as first item in common
 397       7%        2 bytes; character*2 as first item in common
 397       7%        3 bytes; character*3 as first item in common
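The same penalty can be provoked in C++: packing a small field ahead of doubles misaligns every subsequent element, just as character*1 at the head of the common block does (Fortran lays common blocks out sequentially, with no padding). A hedged illustration, not from the slides:

  #include <cstddef>
  #include <cstdio>

  // Analogue of the misaligned common block: a 1-byte field ahead of
  // the doubles. #pragma pack (widely supported, compiler-specific)
  // forces the padding-free layout a common block gets by default.
  #pragma pack(push, 1)
  struct Misaligned {
      char tag;        // like character*1 as the first item in common
      double data[4];  // now starts at offset 1 instead of 8
  };
  #pragma pack(pop)

  int main() {
      std::printf("data starts at offset %zu; doubles want alignment %zu\n",
                  offsetof(Misaligned, data), alignof(double));
      return 0;
  }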

  20. Data Locality
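The chart for this slide is not in the transcript, but the effect it presumably illustrates can be sketched: unit-stride traversal uses every byte of each cache line it fetches, while a large stride wastes most of each line. A hedged C++ illustration:

  #include <cstddef>

  // Unit stride: consecutive doubles share cache lines, so each
  // 8-byte load costs roughly 1/8 of a 64-byte line fetch.
  double sum_unit(const double* a, std::size_t n) {
      double s = 0.0;
      for (std::size_t i = 0; i < n; ++i) s += a[i];
      return s;
  }

  // Stride of 8 doubles (64 bytes): every access lands on a fresh
  // cache line, so each element touched drags in a full line and
  // effective bandwidth per useful byte drops by roughly 8x.
  double sum_strided(const double* a, std::size_t n) {
      double s = 0.0;
      for (std::size_t i = 0; i < n; i += 8) s += a[i];
      return s;
  }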

  21. Indirect Addressing

  22. Multidimensional Loops and Loop Overhead
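Two issues interact here: the order of the nested loops determines whether the innermost index is the contiguous one, and a short inner trip count means loop startup cost is paid often. A hedged C++ sketch of both orderings (flat indexing assumed; C/C++ is row-major, the mirror image of the Fortran loop on slide 11):

  #include <cstddef>

  // Good ordering: the innermost loop runs over the contiguous
  // (last) index, so accesses are unit-stride and per-iteration
  // loop overhead is amortized over a long trip count.
  void copy3d_good(double* a, const double* b,
                   std::size_t m, std::size_t n, std::size_t o) {
      for (std::size_t k = 0; k < m; ++k)
          for (std::size_t j = 0; j < n; ++j)
              for (std::size_t i = 0; i < o; ++i)
                  a[(k*n + j)*o + i] = b[(k*n + j)*o + i];
  }

  // Bad ordering: the contiguous index is in the outer loop, so the
  // innermost loop strides by n*o doubles and misses cache on nearly
  // every access.
  void copy3d_bad(double* a, const double* b,
                  std::size_t m, std::size_t n, std::size_t o) {
      for (std::size_t i = 0; i < o; ++i)
          for (std::size_t j = 0; j < n; ++j)
              for (std::size_t k = 0; k < m; ++k)
                  a[(k*n + j)*o + i] = b[(k*n + j)*o + i];
  }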

  23. Multidimensional Loops

24. References
• STREAM Benchmark
• OOPack Benchmark
• Stepanov C/C++ Benchmarks
• Haney Kernels
