
Data Locality & Its Optimization Techniques



Presented by

Preethi Rajaram

CSS 548 Introduction to Compilers

Professor Carol Zander

Fall 2012

- Processor Speed - increasing at a faster rate than the memory speed
- Computer Architectures - more levels of cache memory
- Cache - takes advantage of data locality
- Good Data Locality - good application performance
- Poor Data Locality - reduces the effectiveness of the cache

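- Data locality is the property that the same memory location, or adjacent locations, are referenced again within a short period of time
- Temporal locality - the same location is reused within a short period of time
- Spatial locality - adjacent locations (e.g. the same cache line) are used within a short period of time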
Fig: Program to find the squares of the differences (a) without loop fusion (b) with loop fusion

[Image from: The Dragon book 2nd edition]
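Since the figure itself is not reproduced here, the following is a minimal sketch of the two variants named in the caption. The function names and the use of Python lists are assumptions for illustration; the point is only the order in which `z[i]` is written and then reused.

```python
def squares_unfused(x, y):
    """(a) Without fusion: z is written in full, then read back in full."""
    n = len(x)
    z = [0] * n
    for i in range(n):          # pass 1: differences
        z[i] = x[i] - y[i]
    for i in range(n):          # pass 2: squares; for large n, z[i] may have left the cache
        z[i] = z[i] * z[i]
    return z


def squares_fused(x, y):
    """(b) With fusion: each difference is squared right after it is produced."""
    n = len(x)
    z = [0] * n
    for i in range(n):
        z[i] = x[i] - y[i]
        z[i] = z[i] * z[i]      # reuse of z[i] happens immediately
    return z
```

Both versions compute the same result; fusion only shortens the distance between the write and the reuse of each `z[i]`.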

Fig: Basic Matrix Multiplication Algorithm

[Image from: The Dragon book 2nd edition]

- Poor data locality
- $N^2$ multiply-add operations separate reuses of the same data element in matrix Y
- N operations separate reuses of the same cache line in Y
- Solutions
- Changing the layout of the data structures
- Blocking
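The two reuse distances quoted above can be checked by walking the loop nest symbolically. This sketch assumes the usual `i, j, k` loop order with body `Z[i][j] += X[i][k] * Y[k][j]`, a row-major Y, and a cache line holding at least two elements; `reuse_distances` is a hypothetical helper name.

```python
def reuse_distances(n):
    """Return (element_gap, line_gap) for matrix Y, with n >= 2.

    element_gap: multiply-add operations between two consecutive uses of Y[0][0].
    line_gap:    operations between the first use of Y[0][0] and the first use
                 of its row neighbour Y[0][1] (same cache line, by assumption).
    """
    op = 0
    uses = {}                                   # (k, j) -> operation indices using Y[k][j]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                uses.setdefault((k, j), []).append(op)
                op += 1
    element_gap = uses[(0, 0)][1] - uses[(0, 0)][0]
    line_gap = uses[(0, 1)][0] - uses[(0, 0)][0]
    return element_gap, line_gap
```

For example, `reuse_distances(4)` gives a gap of 16 operations for the same element and 4 for the same cache line.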

- Changing the data structure layout
- Store Y in column-major order
- Improves reuse of cache lines of matrix Y
- Limited Applicability
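One way to see why the layout change helps is to compare the stride of the innermost loop over Y under both layouts. The sketch below assumes Y is stored as a flat array of n*n elements and that the innermost loop steps `k` in `Y[k][j]`; the helper names are hypothetical.

```python
def row_major(k, j, n):
    """Flat offset of Y[k][j] when rows are contiguous."""
    return k * n + j


def col_major(k, j, n):
    """Flat offset of Y[k][j] when columns are contiguous."""
    return j * n + k


def inner_loop_stride(offset, n, j=0):
    """Distance in elements between Y[k][j] and Y[k+1][j],
    the two consecutive accesses made by the innermost k loop."""
    return offset(1, j, n) - offset(0, j, n)
```

With row-major storage the stride is n elements, so each access likely lands on a different cache line; with column-major storage the stride is 1.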

- Blocking
- Changes the execution order of instructions
- Divide the matrix into submatrices or blocks
- Order the operations so that an entire block is used over a short period of time
- Choose the block size B such that one block from each of the matrices fits into the cache

- Image from: The Dragon book 2nd edition
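A minimal sketch of blocking for matrix multiplication, assuming square matrices stored as lists of lists and a block size `b` picked by the caller. It illustrates the reordering only and is not the exact code from the figure.

```python
def matmul_basic(x, y):
    """Reference i, j, k loop nest."""
    n = len(x)
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                z[i][j] += x[i][k] * y[k][j]
    return z


def matmul_blocked(x, y, b):
    """Same operations, reordered so each b-by-b block is reused while still cached."""
    n = len(x)
    z = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):                   # the outer three loops pick one block
        for jj in range(0, n, b):               # of each matrix
            for kk in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        for k in range(kk, min(kk + b, n)):
                            z[i][j] += x[i][k] * y[k][j]
    return z
```

The `min(...)` bounds handle the case where `b` does not divide n evenly.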

- Locality Optimization
- Identify set of iterations that access the same data or same cache line
- Static access - an instruction in a program, e.g. x = z[i,j]
- Dynamic access - one execution of that instruction; a static access inside a loop nest generates many dynamic accesses
- Types of Reuse
- Self
- Iterations using same data come from same static access

- Group
- Iterations using same data come from different static access

- Temporal
- If the same exact location is referenced

- Spatial
- If the same cache line is referenced

- Self

- Save substantial memory by exploiting self reuse
- Data with 'k' dimensions accessed in a loop nest of depth 'd' is reused $n^{d-k}$ times
e.g. if a 3-deep loop nest accesses one column of an array, each element is accessed $n^{3-1} = n^2$ times, so there is a potential savings factor of $n^2$ accesses
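The count in the example can be reproduced by brute force. This sketch assumes a hypothetical access `a[m][0]` (one column, so k = 1) inside a 3-deep loop nest (d = 3) with n iterations per loop.

```python
from collections import Counter


def reuse_counts(n):
    """Count how often each array element is touched by the loop nest."""
    touched = Counter()
    for i in range(n):
        for j in range(n):
            for m in range(n):
                touched[(m, 0)] += 1            # hypothetical access a[m][0]
    return touched
```

For n = 5 the nest touches 5 distinct elements, each 25 times, matching $n^{d-k} = 5^2$.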

- Dimensionality of an access - the rank of the coefficient matrix of the access
- Iterations referring to the same location - described by the null space of that matrix
- Rank of a Matrix
- Number of rows (equivalently, columns) that are linearly independent

- Null Space of a matrix
- A reference of rank 'r' in a 'd'-deep loop nest accesses $O(n^r)$ data elements in $O(n^d)$ iterations, so on average $O(n^{d-r})$ iterations must refer to the same array element

Nullity = 3-2 = 1

Loop depth = 3

Rank = 2

Rank = Dimensionality = 2

2nd row = 1st + 3rd

4th row = 3rd – 2* 1st
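The annotations above can be checked numerically. The matrix in the original figure is not reproduced here, so the sketch below uses an assumed 4x3 matrix `M` chosen to satisfy both row relations (2nd row = 1st + 3rd, 4th row = 3rd - 2 * 1st); `rank` is a small hand-written helper, not a library call.

```python
from fractions import Fraction


def rank(matrix):
    """Rank by Gaussian elimination over exact rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    ncols = len(rows[0]) if rows else 0
    r = 0                                       # number of pivots found so far
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


# Assumed example: one column per loop index, so loop depth = 3.
M = [[1, 2, 3],
     [5, 7, 9],
     [4, 5, 6],
     [2, 1, 0]]
depth = len(M[0])
nullity = depth - rank(M)                       # 3 - 2 = 1
null_vector = (1, -2, 1)                        # M applied to this vector is zero
```

Under this assumption, iterations that differ by a multiple of `null_vector` refer to the same array element.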

- Depends on data layout of the matrix – e.g. Row major order
- In a 'd'-dimensional array, elements share a cache line if they differ only in the last dimension
e.g. as an approximation, two elements of a 2-D array share a cache line if and only if they are in the same row

- The truncated matrix is obtained by dropping the last row of the coefficient matrix
- If the truncated matrix has a rank 'r' that is less than the loop depth 'd', the access has an opportunity for spatial reuse

Truncated Matrix, r = 1, d = 2

r < d, so there is spatial reuse
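A small sketch of the truncation test, assuming the access is `z[i][j]` in a 2-deep loop nest, so the coefficient matrix is the 2x2 identity (the original figure is not available, so this matrix is an assumption consistent with r = 1, d = 2). `apply_access` is a hypothetical helper.

```python
def apply_access(matrix, iteration):
    """Map an iteration vector (i, j) to array indices via the coefficient matrix."""
    return tuple(sum(c * v for c, v in zip(row, iteration)) for row in matrix)


F = [[1, 0],
     [0, 1]]                # assumed access z[i][j]
F_truncated = F[:-1]        # drop the last row: only the row index remains


def distinct_rows(n):
    """Number of distinct array rows touched by an n-by-n iteration space."""
    return len({apply_access(F_truncated, (i, j)) for i in range(n) for j in range(n)})
```

Stepping `j` changes the element but not the row, so n*n iterations touch only n rows; that is the spatial reuse indicated by r < d.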

- Group reuse is considered only among accesses in a loop that share the same coefficient matrix
Fig: 2-deep loop nest

[Image from: The Dragon book 2nd edition]

- z[i,j] and z[i-1,j] access almost the same set of array elements
- The data read by access z[i-1,j] is the same as the data written by z[i,j], except for i = 1
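The overlap between the two accesses can be enumerated directly. This sketch assumes the loop body is `z[i][j] = z[i-1][j]` with both indices running from 1 to n, which is consistent with the bullets above but is not taken verbatim from the figure.

```python
def group_reuse(n):
    """Return the sets of elements read and written by the 2-deep loop nest."""
    read, written = set(), set()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            read.add((i - 1, j))                # access z[i-1][j]
            written.add((i, j))                 # access z[i][j]
    return read, written
```

The only elements read but never written are those in row 0, which are exactly the reads made when i = 1.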

Rank = 2, no self-temporal reuse

Truncated Matrix, Rank = 1, self-spatial reuse

- Temporal Locality of data
Use the results as soon as they are generated

Fig: Code excerpt for a multigrid algorithm (a) before partition (b) after partition

[Image from: The Dragon book 2nd edition]

- Array Contraction
Reduce the dimension of the array and reduce the number of memory locations accessed

Fig: Code excerpt for a multigrid algorithm after partition and after array contraction

Image from: The Dragon book 2nd edition
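Array contraction can be sketched on a much smaller example than the multigrid code. The two functions below are a generic, hypothetical illustration (not the code from the figure): once the producer and consumer loops are fused, the temporary array `t` is only live within one iteration and can shrink to a scalar.

```python
def with_temp_array(a):
    """Before contraction: intermediate values occupy a full n-element array."""
    n = len(a)
    t = [0] * n
    out = [0] * n
    for i in range(n):
        t[i] = a[i] * 2
    for i in range(n):
        out[i] = t[i] + 1
    return out


def contracted(a):
    """After fusion and contraction: the temporary is a single scalar."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        t = a[i] * 2                            # produced ...
        out[i] = t + 1                          # ... and consumed immediately
    return out
```

The result is unchanged, while the number of memory locations used for the intermediate drops from n to 1.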

- Instead of executing the partitions one after the other, we interleave a number of partitions so that reuse among partitions occurs close together in time
- Interleaving Inner Loops in a Parallel Loop
- Interleaving Statements in a Parallel Loop

Fig: Interleaving four instances of the inner loop

[Image from: The Dragon book 2nd edition]
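A sketch of interleaving four instances of the inner loop, assuming a 2-deep nest whose outer loop is parallel (no dependences between different values of `i`). The functions record the execution order of iterations `(i, j)` rather than running a real loop body; the names and the `width` parameter are assumptions.

```python
def original_order(n):
    """Each instance of the inner j loop runs to completion before the next."""
    return [(i, j) for i in range(n) for j in range(n)]


def interleaved_order(n, width=4):
    """Strip the outer loop by `width` and move the strip loop innermost."""
    order = []
    for ii in range(0, n, width):                       # one strip of outer iterations
        for j in range(n):                              # shared inner index
            for i in range(ii, min(n, ii + width)):     # interleave the strip
                order.append((i, j))
    return order
```

With n = 8, the interleaved order starts (0,0), (1,0), (2,0), (3,0), (0,1), so data indexed by `j` alone is reused by four consecutive iterations.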

Fig: The statement interleaving transformation

[Image from: The Dragon book 2nd edition]
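One reading of statement interleaving, sketched under the assumption that the parallel loop body consists of two statements S1 and S2: the loop is stripped by a fixed width and the strip is distributed over the statements. The traces record which statement runs for which `i`; the names are hypothetical and the figure's exact code is not reproduced.

```python
def original_trace(n):
    """S1 and S2 alternate for every iteration i."""
    trace = []
    for i in range(n):
        trace.append(("S1", i))
        trace.append(("S2", i))
    return trace


def statement_interleaved_trace(n, width=4):
    """Within each strip, run S1 for all i, then S2 for all i."""
    trace = []
    for ii in range(0, n, width):
        hi = min(n, ii + width)
        for i in range(ii, hi):
            trace.append(("S1", i))
        for i in range(ii, hi):
            trace.append(("S2", i))
    return trace
```

This reordering is valid only because the loop is parallel; within an iteration, S1 still runs before S2.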

- Wolf, Michael E., and Monica S. Lam. "A data locality optimizing algorithm." ACM Sigplan Notices 26.6 (1991): 30-44.
- McKinley, Kathryn S., Steve Carr, and Chau-Wen Tseng. "Improving data locality with loop transformations." ACM Transactions on Programming Languages and Systems (TOPLAS) 18.4 (1996): 424-453.
- Bodin, François, et al. "A quantitative algorithm for data locality optimization." Code Generation: Concepts, Tools, Techniques (1992): 119-145.
- Kennedy, Ken, and Kathryn S. McKinley. "Optimizing for parallelism and data locality." Proceedings of the 6th international conference on Supercomputing. ACM, 1992.
- Aho, Alfred V., Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. "Compilers: Principles, Techniques, and Tools." 2nd edition. Addison-Wesley.

Thank You!

Questions??