Cache Performance Analysis of Matrix Multiplication in CMPUT.229

Hitting for performance Cache Performance Analysis CMPUT 229

CMPUT 229 Standard Matrix Multiplication for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1  load(a[i,k]); temp2  load(b[k,j]); sum  sum + temp1*temp2; } store(c[i,j]) sum; } } for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ c[i,j] = 0.0; for(k = 0; k<n ; k++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?

CMPUT 229 Data Access Pattern A B

CMPUT 229 32K-byte cache = 256 lines/cache 128-byte cache line Cache Access Analysis Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?

CMPUT 229 32K-byte cache = 256 lines/cache 128-byte cache lines 128-byte cache line 8-byte element = 16 elements/line Cache Access Analysis Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?

CMPUT 229 # hits 960 hits Hit ratio = = # of accesses 2048 accesses 256 lines/cache 16 elements/line Cache Data Access Pattern If we ignore conflict misses, then: Every 16th access of A is a miss; Every access to B is a miss; How many hits and misses will occur to compute one element of C? In A there will be 1024/16 = 64 misses and 1024-64 = 960 hits. In B there will be 1024 misses. Thus, what is the hit ratio? = 0.47 = 47%

CMPUT 229 31 15 14 7 6 0 Tag Index Offset 7 bits 17 bits 8 bits 256 lines/cache 16 elements/line Address anatomy The data cache has 32 Kbytes and 128-byte cache lines; 128 = 27 256 = 28

CMPUT 229 256 lines/cache 16 elements/line Conflict Misses Cache Access Address Index Outcome A[0,0] $80000000 0 miss B[0,0] $80800000 0 miss A[0,1] $80000004 0 miss B[1,0] $80801000 32 miss A[0,2] $80000008 0 hit B[2,0] $80802000 64 miss A[0,3] $8000000C 0 hit B[3,0] $80803000 96 miss A[0,4] $80000010 0 hit B[4,0] $80804000 128 miss A[0,5] $80000014 0 hit B[5,0] $80805000 160 miss A[0,6] $80000018 0 hit B[6,0] $80806000 192 miss A[0,7] $8000001C 0 hit B[7,0] $80807000 244 miss A[0,8] $80000020 0 hit B[8,0] $80808000 0 miss A[0,9] $80000024 0 miss B[9,0] $80809000 32 miss 0  32  64 In General: A 1024-element row of A Occupies 64 16-element cache lines. There will be 2 conflict misses in two of these rows. A total of 4 conflict misses per row. Thus the accesses of A will result in 68 misses and 986 hits for each 1024 accesses. The conflict misses are not significant and can be ignored.  96  128  160  192  244 

CMPUT 229 Matrix Multiplication with Transpose for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ b1[i,j] = b[j,i]; } } for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ for(k = 0; k<n ; k++){ c[i,j] = c[i,j] + a[i,k] * b1[j,k]; } } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program? Where in memory should we place matrix b1 to reduce conflict misses?

CMPUT 229 31 15 14 7 6 0 Tag Index Offset Where to place matrix b1? Intuitively the index of b1[0][0] should be away from the index of a[0][0]. The index of a[0][0] is 0. Thus we could aim to place b1 at an address whose index is 128.

CMPUT 229 # hits 960 hits # of accesses 2048 accesses Hit ratio = = Cache Access Pattern for the Transpose for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ b1[i,j] = b[j,i]; } } If we ignore conflict misses, then: Every 16th access of b1 is a miss; Every access to b is a miss; The transpose’s inner loop yields: 2048 accesses 960 hits. And the inner loop is repeated 1024 times: 1024  2048 accesses 1024 960 hits Thus, the hit ratio is: = 0.47 = 47%

CMPUT 229 Cache Access Pattern for the Multiplication for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1  load(a[i,k]); temp2  load(b1[j,k]); sum  sum + temp1*temp2; } store(c[i,j]) sum; } } If we ignore conflict misses, then: Every 16th access of a is a miss; Every 16th access to b1 is a miss; Thus the inner loop yields 2048 accesses and 1920 hits. The inner loop is executed n2 times. The total number of accesses (ignoring accesses to c) in the multiplication is: 1024  1024  2048 accesses 1024  1024  1920 hits

CMPUT 229 1024  960+ 1024  1024  1920 hits Hit ratio = 2048  1024 + 1024  1024  2048 accesses 960+ 1024  1920 hits Hit ratio = 1025  2048 accesses Hit Ratio for Multiplication with Transpose The total number of accesses (ignoring accesses to c) in the multiplication is: 1024  1024  2048 accesses 1024  1024  1920 hits The transpose yields: 1024  2048 accesses 1024  960 hits. = 0.937 = 93.7%

CMPUT 229 Blocked Matrix Multiplication* for (i0 = 0; i0<n ; i0 = i0 + b){ for(j0 = 0; j0<n ; j0 = j0 + b){ for(k0 = 0; k0<n ; k0 = k0 + b){ for(i = i0; i< min(i0+b-1,n) ; i++){ for(j = j0; j< min(j0+b-1,n) ; j++){ for(k = k0; j< min(k0+b-1,n) ; j++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } } } } } • Code adapted from http://www.netlib.org/utk/papers/autoblock/node2.html Assumes that all elements of matrix c were initialized to zero beforehand

CMPUT 229 miss 2 Data Access Pattern hit 0 A B

CMPUT 229 miss 4 Data Access Pattern hit 14 A B Multiplying the first row of the block of A by the block of B required 18 accesses that resulted in 4 misses. How many of the 18 accesses required to multiply the second row of the block of A by the block of B will be misses?

CMPUT 229 miss 4|1 Data Access Pattern hit 14|1 A B

CMPUT 229 miss 4|1 Data Access Pattern hit 14|17 A B

CMPUT 229 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B

CMPUT 229 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B What is the hit ratio for the next block multiplication? 2b3 - b Hit ratio = 3 hits and 48 references 2b3 In general, there are b misses and 2b3 accesses

CMPUT 229 2b2 - 1 Hit ratio = 2b2 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B What is the hit ratio for the next block multiplication? 3 hits and 48 references In general, there are b misses and 2b3 accesses

CMPUT 229 miss Data Access Pattern hit A B Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; What should be the value of b? Do the memory locations of A and B matter?

CMPUT 229 2b2 - 1 2(16)2 - 1 Hit ratio = Hit ratio = 2b2 2(16)2 Cache Usage for Blocked Matrix Multiplication for (i0 = 0; i0<n ; i0 = i0 + b){ for(j0 = 0; j0<n ; j0 = j0 + b){ for(k0 = 0; k0<n ; k0 = k0 + b){ for(i = i0; i< min(i0+b-1,n) ; i++){ for(j = j0; j< min(j0+b-1,n) ; j++){ for(k = k0; j< min(k0+b-1,n) ; j++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } }} } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; Ignore conflict misses. Estimate the hit ratio for the block computation if b=16. Hit ratio = 99.8%

Cache Performance Analysis of Matrix Multiplication in CMPUT.229