
Applying Data Copy To Improve Memory Performance of General Array Computations


Presentation Transcript


  1. Applying Data Copy To Improve Memory Performance of General Array Computations
     Qing Yi, University of Texas at San Antonio

  2. Improving Memory Performance
     [Diagram: two processors, each with registers, a cache, and memory, connected to shared memory over a network. Locality optimizations target the register/cache/memory hierarchy; parallelization targets shared memory; communication optimizations target the network connection.]

  3. Compiler Optimizations For Locality
     • Computation optimizations
       • Loop blocking, fusion, unroll-and-jam, interchange, unrolling (a blocking sketch follows this slide)
       • Rearrange computations for better spatial and temporal locality
     • Data-layout optimizations
       • Static layout transformation
         • A single layout for the entire application
         • No additional overhead; tradeoff between different layout choices
       • Dynamic layout transformation
         • A different layout for each computation phase
         • Flexible, but could be expensive
     • Combining computation and data transformations
       • Static layout transformation: transform the layout first, then the computation
       • Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
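Below is a minimal sketch of loop blocking, applied to the column-major matrix multiply used later in this talk; the tile size BS, the function name, and the loop order are illustrative assumptions, not taken from the slides.

     /* Loop blocking (tiling) of C = C + alpha*A*B, column-major,
      * square matrices of order n for simplicity. BS is an assumed
      * tile size; real tuning would pick it per cache level. */
     #define BS 64
     void matmul_blocked(int n, double alpha,
                         const double *A, const double *B, double *C)
     {
         for (int jj = 0; jj < n; jj += BS)
             for (int kk = 0; kk < n; kk += BS)
                 /* The BSxBS tiles of A and B now fit in cache and are
                  * reused across the inner loops (temporal locality). */
                 for (int j = jj; j < jj + BS && j < n; ++j)
                     for (int k = kk; k < kk + BS && k < n; ++k)
                         for (int i = 0; i < n; ++i)
                             C[i + j*n] += alpha * A[i + k*n] * B[k + j*n];
     }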

  4. Array Copying --- Related Work
     • Dynamic layout transformation for arrays
       • Copy arrays into local buffers before the computation
       • Copy modified local buffers back to the arrays
     • Previous work
       • Lam, Rothberg and Wolf; Temam, Granston and Jalby
       • Copy arrays after loop blocking
       • Assume arrays are accessed through affine expressions
       • Array copy is then always safe
     • Static data layout transformations
       • A single layout throughout the application --- always safe
     • Optimizing irregular applications
       • Data access patterns not known until runtime
       • Dynamic layout transformation --- through libraries
     • Scalar replacement
       • Equivalent to copying a single array element into scalars (see the sketch below)
       • Carr and Kennedy: applied to inner loops
     • Question: how expensive is array copy? How difficult is it?
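To make the scalar-replacement connection concrete, here is a minimal sketch; the kernel and the names axpy_rows and yj are my own illustration, not an example from the talk.

     /* Scalar replacement: the degenerate case of array copying where
      * the buffer is a single scalar. y[j] is copied into yj once,
      * reused from a register across the i loop, and copied back once. */
     void axpy_rows(int n, int m, double *y, const double *A, const double *x)
     {
         for (int j = 0; j < n; ++j) {
             double yj = y[j];              /* copy in */
             for (int i = 0; i < m; ++i)
                 yj += A[j + i*n] * x[i];   /* no load/store of y[j] per iteration */
             y[j] = yj;                     /* copy back */
         }
     }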

  5. What is new
     • Array copy for arbitrary loop computations
       • Stand-alone optimization, independent of blocking
       • Works for computations with non-affine array accesses
     • General array copy algorithm
       • Works on arbitrarily shaped loops
       • Automatically inserts copy operations to ensure safety
       • Heuristics to reduce buffer size and copy cost
       • Performs scalar replacement as a special case
     • Applications
       • Improve cache and register locality
       • Both in combination with blocking and without it
     • Future work
       • Communication and parallelization
       • Empirical tuning of optimizations
       • An interface that allows different application heuristics

  6. Apply Array Copy: Matrix Multiplication

     for (j=0; j<n; ++j)
       for (k=0; k<l; ++k)
         for (i=0; i<m; ++i)
           C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

     • Step 1: build the dependence graph (spelled out in the annotated copy below)
       • True, output, anti and input deps between array accesses
       • Is each dependence precise?
     [Graph: dependence edges among C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], and B[k+j*l]]
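The same kernel with the dependence graph written out as comments; this enumeration is my reading of the slide's figure, not output from the tool.

     for (j=0; j<n; ++j)
       for (k=0; k<l; ++k)
         for (i=0; i<m; ++i)
           /* True, output, and anti deps: the write and read of C[i+j*m]
            *   touch the same element on every k iteration (k-carried).
            * Input deps: A[i+k*m] is re-read across the j loop, and
            *   B[k+j*l] is re-read across the i loop.
            * All subscripts are affine in (i,j,k), so each dep is precise. */
           C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];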

  7. Apply Array Copy: Matrix Multiplication

     for (j=0; j<n; ++j)
       for (k=0; k<l; ++k)
         for (i=0; i<m; ++i)
           C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

     • Steps 2-3: separate array references connected by imprecise deps
       • Impose an order on all array references:
         C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
       • Remove all back edges
       • Apply typed fusion
     [Graph: the same dependence graph among C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], and B[k+j*l]]

  8. Array Copy with Imprecise Dependences

     for (j=0; j<n; ++j)
       for (k=0; k<l; ++k)
         for (i=0; i<m; ++i)
           C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

     • Array references connected by imprecise deps
       • Cannot precisely determine a mapping between their subscripts
       • May refer to the same location on some iterations but not others
       • Not safe to copy into a single buffer
     • Apply the typed-fusion algorithm to separate them (a concrete case follows this slide)
     [Graph: C[f(i,j,m)] and C[i+j*m] connected by an imprecise dep; A[i+k*m] and B[k+j*l] unaffected]
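A concrete instance of the imprecise case: the indirection array perm below is a hypothetical stand-in for the slide's f(i,j,m), chosen to show why one shared buffer would be unsafe.

     /* Whether C[perm[i]+j*m] and C[i+j*m] alias depends on the runtime
      * contents of perm, so the dependence between the two C references
      * is imprecise. Copying both into one buffer indexed by i would be
      * wrong whenever perm[i] != i; typed fusion instead places them in
      * separate groups so each can be buffered (or left alone) safely. */
     for (j = 0; j < n; ++j)
       for (k = 0; k < l; ++k)
         for (i = 0; i < m; ++i)
           C[perm[i] + j*m] = C[i + j*m] + alpha * A[i + k*m] * B[k + j*l];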

  9. Profitability of Applying Array Copy

     // location to copy A: outside the j loop
     for (j=0; j<n; ++j)        // location to copy C (and copy it back)
       for (k=0; k<l; ++k)      // location to copy B
         for (i=0; i<m; ++i)
           C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

     • Each buffer should be copied at most twice (splitting groups)
     • Determine the outermost location to perform each copy
       • No interference with other groups
       • Enforce a size limit on the buffer
         • Constant size => scalar replacement
       • Ensure reuse of the copied elements
         • Lower the copy position if no extra reuse is gained

  10. Array Copy Result: Matrix Multiplication

      Copy A[0:m, 0:l] to A_buf;
      for (j=0; j<n; ++j) {
        Copy C[0:m, j*m] to C_buf;
        for (k=0; k<l; ++k) {
          Copy B[k+j*l] to B_buf;
          for (i=0; i<m; ++i)
            C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m] * B_buf;
        }
        Copy C_buf[0:m] back to C[0:m, j*m];
      }
      (A plain-C rendering of this pseudocode follows this slide.)

      • Dimensionality of each buffer is enforced by command-line options
      • Can be applied to arbitrarily shaped loop structures
      • Can be applied independently or after blocking
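For readers who want something compilable, here is one hand-written C rendering of the slide's pseudocode; the buffer management and the name dgemm_copied are my additions, and the compiler-generated code in the talk would differ in detail.

     #include <stdlib.h>
     #include <string.h>

     /* One possible C rendering of the copy-optimized kernel above.
      * A (m x l) is copied once, column j of C is copied in and back on
      * each j iteration, and B[k+j*l] is scalar-replaced into B_buf. */
     void dgemm_copied(int m, int n, int l, double alpha,
                       const double *A, const double *B, double *C)
     {
         double *A_buf = malloc((size_t)m * l * sizeof *A_buf);
         double *C_buf = malloc((size_t)m * sizeof *C_buf);

         memcpy(A_buf, A, (size_t)m * l * sizeof *A_buf);       /* copy all of A once */
         for (int j = 0; j < n; ++j) {
             memcpy(C_buf, C + (size_t)j * m, (size_t)m * sizeof *C_buf); /* copy column j of C */
             for (int k = 0; k < l; ++k) {
                 double B_buf = B[k + j*l];                     /* scalar replacement of B */
                 for (int i = 0; i < m; ++i)
                     C_buf[i] += alpha * A_buf[i + k*m] * B_buf;
             }
             memcpy(C + (size_t)j * m, C_buf, (size_t)m * sizeof *C_buf); /* copy back */
         }
         free(A_buf);
         free(C_buf);
     }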

  11. Experiments
      • Implementation
        • Loop transformation framework by Yi, Kennedy and Adve
        • ROSE compiler infrastructure at LLNL (Quinlan et al.)
      • Benchmarks
        • dgemm (matrix multiplication, LAPACK)
        • dgetrf (LU factorization with partial pivoting, LAPACK)
        • tomcatv (mesh generation with Thompson solver, SPEC95)
      • Machines
        • A Dell PC with two 2.2GHz Intel XEON processors
          • 512KB cache per processor, 2GB memory
        • An SGI workstation with a 195MHz R10000 processor
          • 32KB 2-way L1, 1MB 4-way L2, 256MB memory
        • A single 8-way P655+ node of an IBM terascale machine
          • 32KB 2-way L1, 0.75MB 4-way L2, 16MB 8-way L3, 16GB memory

  12. DGEMM on Dell PC
      [Performance chart]

  13. DGEMM on SGI Workstation
      [Performance chart]

  14. DGEMM on IBM
      [Performance chart]

  15. Summary
      • When should we apply scalar replacement?
        • Profitable unless register pressure is too high
        • 3-12% improvement observed
        • No overhead
      • When should we apply array copy?
        • When regular conflict misses occur
        • When prefetching of array elements is needed
        • 10-40% improvement observed
        • Overhead is 0.5-8% when not beneficial
      • Optimizations not beneficial on the IBM node
        • Neither blocking nor 2-dim copying was profitable
        • Integer-operation overhead too high
        • Will be investigated further
