
Data Locality



Presentation Transcript


  1. Data Locality CS 524 – High-Performance Computing

  2. Motivating Example: Vector Multiplication
  • 1 GHz processor (1 ns clock), 4 instructions per cycle, 2 multiply-add units (4 FPUs)
  • 100 ns memory latency, no cache, 1 word transfer size
  • Peak performance: 4 FLOPs / 1 ns = 4 GFLOPS
  • Performance on matrix-vector multiply:
    • Every 200 ns, the two operands of a multiply-add arrive from memory and 2 floating-point operations occur
    • Performance = 2 FLOPs / 200 ns = 10 MFLOPS
  CS 524 (Au 05-06) - Asim Karim @ LUMS

  3. Motivating Example: Vector Multiplication
  • Assume there is a 32 K cache with a 1 ns latency, 1-word cache lines, and an optimal replacement policy
  • Multiply two 1 K vectors (dot product)
  • No. of floating-point operations = 2 K (2n, n = size of vector)
  • Latency to fill cache = 2 K x 100 ns = 200 microseconds
  • Time to execute = 2 K / 4 x 1 ns = 0.5 microseconds
  • Performance = 2 K FLOPs / 200.5 microseconds ≈ 10 MFLOPS
  • No increase in performance! Why? Each word is fetched from memory once and used only once, so memory latency still dominates

  4. Motivating Example: Matrix-Matrix Multiply
  • Assume the same memory system characteristics as on the previous two slides
  • Multiply two 32 x 32 matrices
  • No. of floating-point operations = 64 K (2n^3, n = dimension of matrix)
  • Latency to fill cache = 2 K x 100 ns = 200 microseconds
  • Time to execute = 64 K / 4 x 1 ns = 16 microseconds
  • Performance = 64 K FLOPs / 216 microseconds ≈ 300 MFLOPS
  • Repeated use of data in cache improves performance: temporal locality

  5. Motivating Example: Vector Multiply
  • Assume the same memory system characteristics as on the first slide (no cache)
  • Assume a 4-word transfer size (cache line size)
  • Multiply two vectors
  • Each access brings in four words of a vector
  • In 200 ns, four words of each vector are brought in
  • 8 floating-point operations can be performed per 200 ns
  • Performance: 8 FLOPs / 200 ns = 40 MFLOPS
  • Performance increased by the effective increase in memory bandwidth: spatial locality

  6. Locality of Reference
  • Application data space is often too large to fit entirely into cache
  • Key is locality of reference
  • Types of locality:
    • Register locality
    • Cache-temporal locality
    • Cache-spatial locality
  • Attempt (in order of preference) to:
    • Keep all data in registers
    • Order algorithms to do all operations on a data element before it is overwritten in cache
    • Order algorithms so that successive operations are on physically contiguous data

  7. Single Processor Performance Enhancement
  • Two fundamental issues:
    • Adequate operation-level parallelism (to keep superscalar, pipelined processor units busy)
    • Minimize memory access costs (locality in cache)
  • Three useful loop transformations to improve locality of reference:
    • Loop permutation
    • Loop blocking (tiling)
    • Loop unrolling

  8. Access Stride and Spatial Locality
  • Access stride: separation between successively accessed memory locations
  • Unit access stride maximizes spatial locality (only one miss per cache line)
  • 2-D arrays have different linearized representations in C and FORTRAN:
    • C's row-major representation favors row-wise access of data
    • FORTRAN's column-major representation favors column-wise access of data
  [Figure: a 3 x 3 array with elements a b c / d e f / g h i, linearized as a b c d e f g h i in row-major order (C) and as a d g b e h c f i in column-major order (FORTRAN)]

  9. Matrix-Vector Multiply (MVM)
  [Figure: y = A x, with i indexing rows and j indexing columns]
  do i = 1, N
    do j = 1, N
      y(i) = y(i) + A(i,j)*x(j)
    end do
  end do
  • Total no. of references = 4N^2
  • Two questions:
    • Can we predict the miss ratio of different variations of this program for different cache models?
    • What transformations can we do to improve performance?

  10. Cache Model
  • Fully associative cache (so no conflict misses)
  • LRU replacement algorithm (ensures temporal locality)
  • Cache size:
    • Large cache model: no capacity misses
    • Small cache model: misses depend on problem size
  • Cache block size:
    • 1 floating-point number
    • b floating-point numbers

  11. MVM: Dot-Product Form (Scenario 1)
  • Loop order: i-j
  • Cache model:
    • Small cache: assume cache can hold fewer than (2N+1) numbers
    • Cache block size: 1 floating-point number
  • Cache misses:
    • Matrix A: N^2 cold misses
    • Vector x: N cold misses + N(N-1) capacity misses
    • Vector y: N cold misses
  • Miss ratio = (2N^2 + N)/4N^2 → 0.5

  12. MVM: Dot-Product Form (Scenario 2)
  • Cache model:
    • Large cache: cache can hold more than (2N+1) numbers
    • Cache block size: 1 floating-point number
  • Cache misses:
    • Matrix A: N^2 cold misses
    • Vector x: N cold misses
    • Vector y: N cold misses
  • Miss ratio = (N^2 + 2N)/4N^2 → 0.25

  13. MVM: Dot-Product Form (Scenario 3)
  • Cache block size: b floating-point numbers
  • Matrix access order: row-major access (C)
  • Cache misses (small cache):
    • Matrix A: N^2/b cold misses
    • Vector x: N/b cold misses + N(N-1)/b capacity misses
    • Vector y: N cold misses
    • Miss ratio = 1/2b + 1/4N → 1/2b
  • Cache misses (large cache):
    • Matrix A: N^2/b cold misses
    • Vector x: N/b cold misses
    • Vector y: N/b cold misses
    • Miss ratio = (1/4 + 1/2N)(1/b) → 1/4b

  14. MVM: SAXPY Form
  do j = 1, N
    do i = 1, N
      y(i) = y(i) + A(i,j)*x(j)
    end do
  end do
  • Total no. of references = 4N^2
  [Figure: y = A x, with A accessed column by column]

  15. MVM: SAXPY Form (Scenario 4)
  • Cache block size: b floating-point numbers
  • Matrix access order: row-major access (C)
  • Cache misses (small cache):
    • Matrix A: N^2 cold misses
    • Vector x: N cold misses
    • Vector y: N/b cold misses + N(N-1)/b capacity misses
    • Miss ratio = 0.25(1 + 1/b) + 1/4N → 0.25(1 + 1/b)
  • Cache misses (large cache):
    • Matrix A: N^2 cold misses
    • Vector x: N/b cold misses
    • Vector y: N/b cold misses
    • Miss ratio = 1/4 + 1/2Nb → 1/4

  16. Loop Blocking (Tiling)
  • Loop blocking enhances temporal locality by ensuring that data remain in cache until all operations are performed on them
  • Concept:
    • Divide (tile) the loop computation into chunks or blocks that fit into cache
    • Perform all operations on a block before moving to the next block
  • Procedure:
    • Add outer loop(s) to tile the inner loop computation
    • Select the block size so that you have a large cache model (no capacity misses, temporal locality maximized)

  17. MVM: Blocked
  do bi = 1, N/B
    do bj = 1, N/B
      do i = (bi-1)*B+1, bi*B
        do j = (bj-1)*B+1, bj*B
          y(i) = y(i) + A(i,j)*x(j)
        end do
      end do
    end do
  end do
  [Figure: y = A x with A tiled into B x B blocks]

  18. MVM: Blocked (Scenario 5)
  • Cache size greater than (2B+1) floating-point numbers
  • Cache block size: b floating-point numbers
  • Matrix access order: row-major access (C)
  • Cache misses per block:
    • Matrix A: B^2/b cold misses
    • Vector x: B/b cold misses
    • Vector y: B/b cold misses
  • Miss ratio = (1/4 + 1/2B)(1/b) → 1/4b
  • Lesson: proper loop blocking eliminates capacity misses; performance becomes essentially independent of cache size (as long as cache size > 2B+1 floating-point numbers)

  19. Access Stride and Spatial Locality (repeated from slide 8)
  • Access stride: separation between successively accessed memory locations
  • Unit access stride maximizes spatial locality (only one miss per cache line)
  • 2-D arrays have different linearized representations in C and FORTRAN:
    • C's row-major representation favors row-wise access of data
    • FORTRAN's column-major representation favors column-wise access of data
  [Figure: a 3 x 3 array with elements a b c / d e f / g h i, linearized as a b c d e f g h i in row-major order (C) and as a d g b e h c f i in column-major order (FORTRAN)]

  20. Access Stride Analysis of MVM
  c Dot-product form
  do i = 1, N
    do j = 1, N
      y(i) = y(i) + A(i,j)*x(j)
    end do
  end do

  c SAXPY form
  do j = 1, N
    do i = 1, N
      y(i) = y(i) + A(i,j)*x(j)
    end do
  end do

  21. Loop Permutation
  • Changing the order of nested loops can improve spatial locality:
    • Choose the loop ordering that minimizes the miss ratio, or
    • Choose the loop ordering that minimizes array access strides
  • Loop permutation issues:
    • Validity: loop reordering should not change the meaning of the code
    • Performance: does the reordering improve performance?
    • Array bounds: what should the new array bounds be?
    • Procedure: is there a mechanical way to perform loop permutation?
  • These issues are of primary concern in compiler design. We will discuss some of them when we cover dependences, from an application software developer's perspective.

  22. Matrix-Matrix Multiply (MMM)
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      for (k = 0; k < N; k++)
        C[i][j] = C[i][j] + A[k][i]*B[k][j];
    }
  }

  23. Loop Unrolling
  • Reduce the number of loop iterations, but add statement(s) to the loop body to do the work of the missing iterations
  • Increases the amount of operation-level parallelism in the loop body
  • Outer-loop unrolling changes the order of access of memory elements and can reduce the number of memory accesses and cache misses

  Original:
  do i = 1, 2n
    do j = 1, m
      loop-body(i,j)
    end do
  end do

  Unrolled by 2:
  do i = 1, 2n, 2
    do j = 1, m
      loop-body(i,j)
      loop-body(i+1,j)
    end do
  end do

  24. MVM: Dot-Product 4-Outer Unrolled Form
  /* Assume n is a multiple of 4 */
  for (i = 0; i < n; i += 4) {
    for (j = 0; j < n; j++) {
      y[i]   = y[i]   + A[i][j]*x[j];
      y[i+1] = y[i+1] + A[i+1][j]*x[j];
      y[i+2] = y[i+2] + A[i+2][j]*x[j];
      y[i+3] = y[i+3] + A[i+3][j]*x[j];
    }
  }

  25. MVM: SAXPY 4-Outer Unrolled Form
  /* Assume n is a multiple of 4 */
  for (j = 0; j < n; j += 4) {
    for (i = 0; i < n; i++) {
      y[i] = y[i] + A[i][j]*x[j] + A[i][j+1]*x[j+1]
                  + A[i][j+2]*x[j+2] + A[i][j+3]*x[j+3];
    }
  }

  26. Summary
  • Loop permutation:
    • Enhances spatial locality
    • C and FORTRAN store matrices in different orders, so the same loop nest performs differently in each language
    • Performance increases when the inner loop index has the minimum access stride
  • Loop blocking:
    • Enhances temporal locality
    • Ensure that 2B + 1 floating-point numbers fit in cache
  • Loop unrolling:
    • Enhances superscalar, pipelined operation
