1 / 30

Cache Performance Analysis

Hitting for performance. Cache Performance Analysis. Standard Matrix Multiplication. for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1  load(a[i,k]); temp2  load(b[k,j]);

Download Presentation

Cache Performance Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hitting for performance Cache Performance Analysis CMPUT 229

  2. CMPUT 229 Standard Matrix Multiplication for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1  load(a[i,k]); temp2  load(b[k,j]); sum  sum + temp1*temp2; } store(c[i,j]) sum; } } for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ c[i,j] = 0.0; for(k = 0; k<n ; k++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?

  3. CMPUT 229 Data Access Pattern A B

  4. CMPUT 229 32K-byte cache = 256 lines/cache 128-byte cache line Cache Access Analysis Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?

  5. CMPUT 229 32K-byte cache = 256 lines/cache 128-byte cache lines 128-byte cache line 8-byte element = 16 elements/line Cache Access Analysis Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?

  6. CMPUT 229 # hits 960 hits Hit ratio = = # of accesses 2048 accesses 256 lines/cache 16 elements/line Cache Data Access Pattern If we ignore conflict misses, then: Every 16th access of A is a miss; Every access to B is a miss; How many hits and misses will occur to compute one element of C? In A there will be 1024/16 = 64 misses and 1024-64 = 960 hits. In B there will be 1024 misses. Thus, what is the hit ratio? = 0.47 = 47%

  7. CMPUT 229 31 15 14 7 6 0 Tag Index Offset 7 bits 17 bits 8 bits 256 lines/cache 16 elements/line Address anatomy The data cache has 32 Kbytes and 128-byte cache lines; 128 = 27 256 = 28

  8. CMPUT 229 256 lines/cache 16 elements/line Conflict Misses Cache Access Address Index Outcome A[0,0] $80000000 0 miss B[0,0] $80800000 0 miss A[0,1] $80000004 0 miss B[1,0] $80801000 32 miss A[0,2] $80000008 0 hit B[2,0] $80802000 64 miss A[0,3] $8000000C 0 hit B[3,0] $80803000 96 miss A[0,4] $80000010 0 hit B[4,0] $80804000 128 miss A[0,5] $80000014 0 hit B[5,0] $80805000 160 miss A[0,6] $80000018 0 hit B[6,0] $80806000 192 miss A[0,7] $8000001C 0 hit B[7,0] $80807000 244 miss A[0,8] $80000020 0 hit B[8,0] $80808000 0 miss A[0,9] $80000024 0 miss B[9,0] $80809000 32 miss 0  32  64 In General: A 1024-element row of A Occupies 64 16-element cache lines. There will be 2 conflict misses in two of these rows. A total of 4 conflict misses per row. Thus the accesses of A will result in 68 misses and 986 hits for each 1024 accesses. The conflict misses are not significant and can be ignored.  96  128  160  192  244 

  9. CMPUT 229 Matrix Multiplication with Transpose for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ b1[i,j] = b[j,i]; } } for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ for(k = 0; k<n ; k++){ c[i,j] = c[i,j] + a[i,k] * b1[j,k]; } } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program? Where in memory should we place matrix b1 to reduce conflict misses?

  10. CMPUT 229 31 15 14 7 6 0 Tag Index Offset Where to place matrix b1? Intuitively the index of b1[0][0] should be away from the index of a[0][0]. The index of a[0][0] is 0. Thus we could aim to place b1 at an address whose index is 128.

  11. CMPUT 229 # hits 960 hits # of accesses 2048 accesses Hit ratio = = Cache Access Pattern for the Transpose for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ b1[i,j] = b[j,i]; } } If we ignore conflict misses, then: Every 16th access of b1 is a miss; Every access to b is a miss; The transpose’s inner loop yields: 2048 accesses 960 hits. And the inner loop is repeated 1024 times: 1024  2048 accesses 1024 960 hits Thus, the hit ratio is: = 0.47 = 47%

  12. CMPUT 229 Cache Access Pattern for the Multiplication for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1  load(a[i,k]); temp2  load(b1[j,k]); sum  sum + temp1*temp2; } store(c[i,j]) sum; } } If we ignore conflict misses, then: Every 16th access of a is a miss; Every 16th access to b1 is a miss; Thus the inner loop yields 2048 accesses and 1920 hits. The inner loop is executed n2 times. The total number of accesses (ignoring accesses to c) in the multiplication is: 1024  1024  2048 accesses 1024  1024  1920 hits

  13. CMPUT 229 1024  960+ 1024  1024  1920 hits Hit ratio = 2048  1024 + 1024  1024  2048 accesses 960+ 1024  1920 hits Hit ratio = 1025  2048 accesses Hit Ratio for Multiplication with Transpose The total number of accesses (ignoring accesses to c) in the multiplication is: 1024  1024  2048 accesses 1024  1024  1920 hits The transpose yields: 1024  2048 accesses 1024  960 hits. = 0.937 = 93.7%

  14. CMPUT 229 Blocked Matrix Multiplication* for (i0 = 0; i0<n ; i0 = i0 + b){ for(j0 = 0; j0<n ; j0 = j0 + b){ for(k0 = 0; k0<n ; k0 = k0 + b){ for(i = i0; i< min(i0+b-1,n) ; i++){ for(j = j0; j< min(j0+b-1,n) ; j++){ for(k = k0; j< min(k0+b-1,n) ; j++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } } } } } • Code adapted from http://www.netlib.org/utk/papers/autoblock/node2.html Assumes that all elements of matrix c were initialized to zero beforehand

  15. CMPUT 229 miss 2 Data Access Pattern hit 0 A B

  16. CMPUT 229 miss 3 Data Access Pattern hit 1 A B

  17. CMPUT 229 miss 4 Data Access Pattern hit 2 A B

  18. CMPUT 229 miss 4 Data Access Pattern hit 4 A B

  19. CMPUT 229 miss 4 Data Access Pattern hit 6 A B

  20. CMPUT 229 miss 4 Data Access Pattern hit 8 A B

  21. CMPUT 229 miss 4 Data Access Pattern hit 10 A B

  22. CMPUT 229 miss 4 Data Access Pattern hit 12 A B

  23. CMPUT 229 miss 4 Data Access Pattern hit 14 A B Multiplying the first row of the block of A by the block of B required 18 accesses that resulted in 4 misses. How many of the 18 accesses required to multiply the second row of the block of A by the block of B will be misses?

  24. CMPUT 229 miss 4|1 Data Access Pattern hit 14|1 A B

  25. CMPUT 229 miss 4|1 Data Access Pattern hit 14|17 A B

  26. CMPUT 229 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B

  27. CMPUT 229 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B What is the hit ratio for the next block multiplication? 2b3 - b Hit ratio = 3 hits and 48 references 2b3 In general, there are b misses and 2b3 accesses

  28. CMPUT 229 2b2 - 1 Hit ratio = 2b2 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B What is the hit ratio for the next block multiplication? 3 hits and 48 references In general, there are b misses and 2b3 accesses

  29. CMPUT 229 miss Data Access Pattern hit A B Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; What should be the value of b? Do the memory locations of A and B matter?

  30. CMPUT 229 2b2 - 1 2(16)2 - 1 Hit ratio = Hit ratio = 2b2 2(16)2 Cache Usage for Blocked Matrix Multiplication for (i0 = 0; i0<n ; i0 = i0 + b){ for(j0 = 0; j0<n ; j0 = j0 + b){ for(k0 = 0; k0<n ; k0 = k0 + b){ for(i = i0; i< min(i0+b-1,n) ; i++){ for(j = j0; j< min(j0+b-1,n) ; j++){ for(k = k0; j< min(k0+b-1,n) ; j++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } }} } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; Ignore conflict misses. Estimate the hit ratio for the block computation if b=16. Hit ratio = 99.8%

More Related