
Data Locality & Its Optimization Techniques




Presented by

Preethi Rajaram

CSS 548 Introduction to Compilers

Professor Carol Zander

Fall 2012

- Processor speed is increasing at a faster rate than memory speed
- Computer architectures respond with more levels of cache memory
- The cache takes advantage of data locality
- Good data locality leads to good application performance
- Poor data locality reduces the effectiveness of the cache

- Data locality is the property that references to the same memory location, or to adjacent locations, are reused within a short period of time
- Temporal locality: the same location is reused soon
- Spatial locality: nearby locations (e.g., on the same cache line) are used soon
Fig: Program to find the squares of the differences (a) without loop fusion (b) with loop fusion

[Image from: The Dragon book, 2nd edition]
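The loop-fusion transformation in the figure can be sketched in Python; the array names and helper functions below are illustrative, not taken from the slides:

```python
# Squares of differences, (a) without and (b) with loop fusion.

def squares_unfused(a, b):
    n = len(a)
    diff = [0] * n
    sq = [0] * n
    # First loop writes every diff[i]; by the time the second loop
    # reads it back, its cache line may already have been evicted.
    for i in range(n):
        diff[i] = a[i] - b[i]
    for i in range(n):
        sq[i] = diff[i] * diff[i]
    return sq

def squares_fused(a, b):
    n = len(a)
    sq = [0] * n
    # Fused loop: each difference is squared while it is still "hot",
    # so the intermediate value never needs to round-trip to memory.
    for i in range(n):
        d = a[i] - b[i]
        sq[i] = d * d
    return sq
```

Fusion also lets the temporary array `diff` shrink to the scalar `d`, a point the deck returns to later under array contraction.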

Fig: Basic Matrix Multiplication Algorithm

[Image from: The Dragon book, 2nd edition]
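A plain-Python version of the textbook triple loop, assuming row-major nested lists (names are illustrative):

```python
def matmul_naive(X, Y):
    """Z = X * Y with the basic i-j-k loop nest.

    With row-major storage, the inner loop walks DOWN a column of Y,
    so consecutive accesses to Y are a whole row apart -- the
    poor-locality pattern the slides discuss."""
    n = len(X)
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += X[i][k] * Y[k][j]
            Z[i][j] = s
    return Z
```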

- Poor data locality
- N^2 multiply-add operations separate successive reuses of the same data element of matrix Y
- N operations separate successive reuses of the same cache line of Y
- Solutions
- Changing the layout of the data structures
- Blocking

- Changing the data structure layout
- Store Y in column-major order
- Improves reuse of cache lines of matrix Y
- Limited Applicability
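One way to sketch the layout change in Python, where nested lists are effectively row-major, is to copy Y into its transpose as a stand-in for column-major storage; the inner loop then scans both operands with unit stride (function names are illustrative):

```python
def matmul_transposed(X, Y):
    """Same product as the naive loop, but Y is first copied into
    "column-major" form: YT[j] holds column j of Y."""
    n = len(X)
    YT = [[Y[k][j] for k in range(n)] for j in range(n)]  # transpose
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            row, col = X[i], YT[j]
            # Both row and col are now scanned with unit stride,
            # improving reuse of Y's cache lines.
            Z[i][j] = sum(row[k] * col[k] for k in range(n))
    return Z
```

The up-front transpose costs O(n^2) extra work and space, which is one reason the layout-change approach has limited applicability.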

- Blocking
- Changes the execution order of instructions
- Divide the matrix into submatrices or blocks
- Order the operations so that an entire block is used over a short period of time
- Choose the block size B such that one block from each of the matrices fits into the cache

- Image from: The Dragon book 2nd edition
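A minimal sketch of blocking with the block size B as a parameter (illustrative, not the book's exact code):

```python
def matmul_blocked(X, Y, B):
    """Blocked (tiled) matrix multiply.

    The three outer loops pick a B x B block of each matrix; the three
    inner loops do the multiply-adds within those blocks, so each block
    is fully reused while it still fits in cache."""
    n = len(X)
    Z = [[0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for kk in range(0, n, B):
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        s = Z[i][j]
                        for k in range(kk, min(kk + B, n)):
                            s += X[i][k] * Y[k][j]
                        Z[i][j] = s
    return Z
```

Blocking changes only the execution order, not the set of operations, so the result is identical for any B.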

- Locality Optimization
- Identify set of iterations that access the same data or same cache line
- Static access: an instruction in the program, e.g., x = z[i,j]
- Dynamic access: one execution of that instruction; a static access in a loop nest generates many dynamic accesses
- Types of Reuse
- Self
- Iterations using same data come from same static access

- Group
- Iterations using same data come from different static access

- Temporal
- If the same exact location is referenced

- Spatial
- If the same cache line is referenced

- Self

- Exploiting self reuse can save a substantial number of memory accesses
- Data of dimension k accessed in a loop nest of depth d is reused n^(d-k) times
e.g., if a 3-deep loop nest accesses one column of an array, there is a potential saving of n^2 accesses

- Dimensionality of access- Rank of the matrix in access
- Iterations referring to the same location – Null Space of a matrix
- Rank of a Matrix
- The maximum number of linearly independent rows (or columns)

- Null Space of a matrix
- A reference in a d-deep loop nest whose access matrix has rank r touches O(n^r) data elements in O(n^d) iterations, so on average O(n^(d-r)) iterations refer to the same array element

Example: a 4×3 access matrix for a 3-deep loop nest

- 2nd row = 1st row + 3rd row
- 4th row = 3rd row - 2 × 1st row
- Rank = dimensionality = 2
- Loop depth = 3
- Nullity = 3 - 2 = 1
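The rank and nullity can be checked mechanically. The deck's actual matrix is in a figure and not reproduced here, so the 4×3 matrix below is a hypothetical one constructed to satisfy the stated row relations:

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank by Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    nrows, ncols = len(m), len(m[0])
    rank, col = 0, 0
    while rank < nrows and col < ncols:
        # Find a pivot in this column at or below the current rank row.
        pivot = next((r for r in range(rank, nrows) if m[r][col] != 0), None)
        if pivot is None:
            col += 1
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the rest of the column below the pivot.
        for r in range(rank + 1, nrows):
            f = m[r][col] / m[rank][col]
            for c in range(col, ncols):
                m[r][c] -= f * m[rank][c]
        rank += 1
        col += 1
    return rank

# Hypothetical access matrix with 2nd row = 1st + 3rd
# and 4th row = 3rd - 2 * 1st:
F = [[ 1, 0, 0],
     [ 1, 0, 1],
     [ 0, 0, 1],
     [-2, 0, 1]]

rank = matrix_rank(F)      # only 2 linearly independent rows
depth = 3                  # columns = loop-nest depth d
nullity = depth - rank     # dimension of the null space
```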

- Spatial reuse depends on the data layout of the matrix, e.g., row-major order
- In a d-dimensional array, elements can share a cache line only if they differ in the last dimension alone
e.g., two elements of a 2-D array can share a cache line only if they lie in the same row

- The truncated matrix is obtained by dropping the last row of the access matrix
- If the truncated matrix has rank r less than the loop depth d, spatial reuse is assured

Truncated matrix: r = 1, d = 2

r < d, so spatial reuse is assured

- Group reuse occurs only among accesses in a loop that share the same coefficient matrix
Fig: 2-deep loop nest

[Image from: The Dragon book, 2nd edition]

- z[i,j] and z[i-1,j] access almost the same set of array elements
- The data read by access z[i-1,j] is the same as the data written by z[i,j], except when i = 1

Rank = 2, so there is no self-temporal reuse

Truncated matrix: rank = 1, so there is self-spatial reuse

- Temporal Locality of data
Use the results as soon as they are generated

Fig: Code excerpt for a multigrid algorithm (a) before partition (b) after partition

[Image from: The Dragon book, 2nd edition]

- Array Contraction
Reduce the dimension of the array and reduce the number of memory locations accessed

Fig: Code excerpt for a multigrid algorithm after partition and after array contraction

Image from: The Dragon book 2nd edition
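The real multigrid kernel is in the figure; the hypothetical pair below only illustrates the contraction itself, using a generic stand-in computation:

```python
def step_before_contraction(a, b):
    """Before contraction: a full temporary array t holds every
    intermediate value, even though each t[i] is consumed exactly
    once by the second loop."""
    n = len(a)
    t = [0] * n
    for i in range(n):
        t[i] = a[i] + b[i]
    out = [0] * n
    for i in range(n):
        out[i] = t[i] * t[i]
    return out

def step_after_contraction(a, b):
    """After fusion + array contraction: the array t shrinks to a
    scalar, replacing n memory locations with one register-resident
    value and eliminating the intermediate loads and stores."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        t = a[i] + b[i]   # contracted: scalar instead of t[i]
        out[i] = t * t
    return out
```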

- Instead of executing the partitions one after another, we interleave a number of them so that reuses across partitions occur close together in time
- Interleaving Inner Loops in a Parallel Loop
- Interleaving Statements in a Parallel Loop

Fig: Interleaving four instances of the inner loop

[Image from: The Dragon book, 2nd edition]

Fig: The statement interleaving transformation

[Image from: The Dragon book, 2nd edition]

- Wolf, Michael E., and Monica S. Lam. "A data locality optimizing algorithm." ACM Sigplan Notices 26.6 (1991): 30-44.
- McKinley, Kathryn S., Steve Carr, and Chau-Wen Tseng. "Improving data locality with loop transformations." ACM Transactions on Programming Languages and Systems (TOPLAS) 18.4 (1996): 424-453.
- Bodin, François, et al. "A quantitative algorithm for data locality optimization." Code Generation: Concepts, Tools, Techniques (1992): 119-145.
- Kennedy, Ken, and Kathryn S. McKinley. "Optimizing for parallelism and data locality." Proceedings of the 6th international conference on Supercomputing. ACM, 1992.
- Aho, Alfred V., Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. 2nd edition. Addison-Wesley.

Thank You!

Questions?