Cache Memory


- Cache mountain
- Matrix multiplication
- Suggested Reading: 6.6, 6.7

6.6 Putting it Together: The Impact of Caches on Program Performance

6.6.1 The Memory Mountain

- Read throughput (read bandwidth)
- The rate at which a program reads data from the memory system (MB/s)

- Memory mountain
- A two-dimensional function of read bandwidth versus temporal and spatial locality
- Characterizes the capabilities of the memory system for each computer

/* mountain.c - Generate the memory mountain. */
#include <stdio.h>
#include <stdlib.h>
#include "fcyc2.h" /* fcyc2() timing routine from the textbook's support code */
#include "clock.h" /* mhz() clock estimate from the textbook's support code */

#define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23) /* ... up to 8 MB */
#define MAXSTRIDE 16 /* Strides range from 1 to 16 */
#define MAXELEMS (MAXBYTES/sizeof(int))

int data[MAXELEMS]; /* The array we'll be traversing */

void init_data(int *data, int n); /* defined in the support code */
void test(int elems, int stride);
double run(int size, int stride, double Mhz);

int main()
{
    int size; /* Working set size (in bytes) */
    int stride; /* Stride (in array elements) */
    double Mhz; /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
    Mhz = mhz(0); /* Estimate the clock frequency */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride); /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0); /* call test(elems, stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}

- Data
  - Size: MAXBYTES (8 MB), or MAXELEMS (2M) words
  - Partially accessed
    - Working set: from 8 MB down to 1 KB
    - Stride: from 1 to 16

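- Slice through the memory mountain with stride=1
- Illuminates the read throughputs of the different caches and of main memory
- Ridges: the crests of the mountain, where the working set fits within one level of the memory hierarchy

- Slice through the memory mountain with size=256KB
- Shows the cache block size: throughput falls as the stride grows, then levels off once every access touches a new block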

6.6.2 Rearranging Loops to Increase Spatial Locality

Matrix Multiplication (p. 517)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        c[i][j] = 0.0;
        for (k=0; k<n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

- O(n^3) adds and multiplies
- Each of the n^2 elements of A and B is read n times

- Assumptions:
- Each array is an n x n array of double, with sizeof(double) == 8
- There is a single cache with a 32-byte block size (B = 32)
- The array size n is so large that a single matrix row does not fit in the L1 cache
- The compiler stores local variables in registers, so references to local variables inside loops do not require any load or store instructions

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0; /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed

- Misses per Inner Loop Iteration:
- A: 0.25
- B: 1.0
- C: 0.0

1) (AB), Figure 6.46, p. 519

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed

- Misses per Inner Loop Iteration:
- A: 0.25
- B: 1.0
- C: 0.0

1) (AB), Figure 6.46, p. 519

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise

- Misses per Inner Loop Iteration:
- A: 0.0
- B: 0.25
- C: 0.25

3) (BC), Figure 6.46, p. 519

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise

- Misses per Inner Loop Iteration:
- A: 0.0
- B: 0.25
- C: 0.25

3) (BC), Figure 6.46, p. 519

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise

- Misses per Inner Loop Iteration:
- A: 1.0
- B: 0.0
- C: 1.0

2) (AC), Figure 6.46, p. 519

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise

- Misses per Inner Loop Iteration:
- A: 1.0
- B: 0.0
- C: 1.0

2) (AC), Figure 6.46, p. 519

[Figure: performance of the three classes of versions: 1) (AB) ijk & jik, 2) (AC) jki & kji, 3) (BC) kij & ikj]

- Notice that miss rates are helpful but not perfect predictors.
- Code scheduling matters, too.

- ijk (& jik):
- 2 loads, 0 stores
- misses/iter = 1.25

- kij (& ikj):
- 2 loads, 1 store
- misses/iter = 0.5

- jki (& kji):
- 2 loads, 1 store
- misses/iter = 2.0

/* 1) (AB): ijk (& jik) */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

/* 3) (BC): kij (& ikj) */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

/* 2) (AC): jki (& kji) */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

6.6.3 Using Blocking to Increase Temporal Locality

- Example: Blocked matrix multiplication
- “block” (in this context) does not mean “cache block”.
- Instead, it means a sub-block within the matrix.
- Example: N = 8; sub-block size = 4

[ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
[ A21 A22 ] X [ B21 B22 ] = [ C21 C22 ]

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

C11 = A11B11 + A12B21
C12 = A11B12 + A12B22
C21 = A21B11 + A22B21
C22 = A21B12 + A22B22

for (jj=0; jj<n; jj+=bsize) {
    /* clear the current column strip of C */
    for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
            c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
            for (j=jj; j < min(jj+bsize,n); j++) {
                sum = 0.0;
                for (k=kk; k < min(kk+bsize,n); k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] += sum;
            }
        }
    }
}

- Innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates into 1 X bsize sliver of C
- Loop over i steps through n row slivers of A & C, using same B

Sliver: a long, narrow strip

Innermost Loop Pair

for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
    }
}

- A: row sliver (row i, columns starting at kk) accessed bsize times
- B: block (rows starting at kk, columns starting at jj) reused n times in succession
- C: update successive elements of sliver (row i, columns starting at jj)

Figure 6.49, p. 522


6.7 Putting it Together: Exploiting Locality in Your Programs

- Focus your attention on the inner loops
- Try to maximize the spatial locality in your programs by reading data objects sequentially, in the order they are stored in memory
- Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory
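- Remember that the miss rate is only one factor in performance; the number of memory accesses also matters, and sometimes the two must be traded off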