CS 201 The Memory Hierarchy I


Presentation Transcript


  1. CS 201 The Memory Hierarchy I

  2. Random-Access Memory (RAM). Non-persistent program code and data are stored in two types of memory. Static RAM (SRAM): • Each cell stores a bit with a six-transistor circuit. • Retains its value as long as it is kept powered. • Faster and more expensive than DRAM. Dynamic RAM (DRAM): • Each cell stores a bit with a capacitor and one transistor. • The value must be refreshed every 10-100 ms. • Slower and cheaper than SRAM.

  3. SRAM vs DRAM Summary
     SRAM: 6 transistors per bit, 1X access time, persistent (while powered), not sensitive to disturbance, 100X cost. Applications: cache memories.
     DRAM: 1 transistor per bit, 10X access time, not persistent, sensitive to disturbance, 1X cost. Applications: main memories, frame buffers.

  4. CPU-Memory Gap (culled from back issues of Byte and PC Magazine)
     Metric                   1980    1985   1990   1995   2000   2000:1980
     SRAM access time (ns)    300     150    35     15     2      100
     DRAM access time (ns)    375     200    100    70     60     6
     CPU processor            8080    286    386    Pent   P-III
     CPU clock rate (MHz)     1       6      20     150    750    750
     CPU cycle time (ns)      1,000   166    50     6      1.6    750

  5. Addressing the CPU-Memory Gap. Observation: • Well-written programs tend to exhibit good locality. • Fast memory can keep up with CPU speeds, but it costs more per byte and has less capacity. These observations suggest an approach to organizing memory and storage systems known as a memory hierarchy: • Keep the most often used data in fast memory near the CPU.

  6. An Example Memory Hierarchy. Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom.
     L0: registers. CPU registers hold words retrieved from the L1 cache.
     L1: on-chip cache (SRAM). Holds cache lines retrieved from the L2 cache.
     L2: off-chip cache (SRAM). Holds cache lines retrieved from main memory.
     L3: main memory (DRAM). Holds disk blocks retrieved from local disks.
     L4: local secondary storage (local disks). Holds files retrieved from disks on remote network servers.
     L5: remote secondary storage (distributed file systems, Web servers).

  7. Intel Core i7 cache hierarchy. The processor package contains Core 0 through Core 3. Each core has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache. An L3 unified cache is shared by all cores and sits between the cores and main memory.

  8. Intel Core i7 cache hierarchy

  9. Memory hierarchy. What makes memory hierarchies work? Fundamental idea: • For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1. • Programs tend to access the data at level k more often than they access the data at level k+1.

  10. Locality Principle of Locality: • Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves. • Temporal locality: Recently referenced items are likely to be referenced in the near future. • Spatial locality: Items with nearby addresses tend to be referenced close together in time.

  11. Locality example • Give an example of spatial locality for data access • Give an example of temporal locality for data access • Give an example of spatial locality for instruction access • Give an example of temporal locality for instruction access sum = 0; for (i = 0; i < n; i++) sum += a[i]; return sum;

  12. Locality example • Give an example of spatial locality for data access: reference array elements in succession (stride-1 pattern). • Give an example of temporal locality for data access: reference sum on each iteration. • Give an example of spatial locality for instruction access: reference instructions in sequence. • Give an example of temporal locality for instruction access: cycle through the loop repeatedly. sum = 0; for (i = 0; i < n; i++) sum += a[i]; return sum;

  13. Locality example Question: Which function has better locality? int sumarrayrows(int a[M][N]) { int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum; } int sumarraycols(int a[M][N]) { int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum; }
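
In C, two-dimensional arrays are stored in row-major order, so sumarrayrows walks memory with stride 1 while sumarraycols jumps N ints at a time. As a rough check, a timing harness along the following lines can make the difference visible; the sizes M = N = 2000 and the use of clock() are illustrative choices, not part of the original slides.

    /* Hypothetical timing harness for the two functions on this slide. */
    #include <stdio.h>
    #include <time.h>

    #define M 2000
    #define N 2000

    static int a[M][N];

    int sumarrayrows(int a[M][N]) {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)          /* row-major order: stride-1 */
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sumarraycols(int a[M][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)          /* column-wise: stride-N */
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        clock_t t0;

        t0 = clock();
        int r = sumarrayrows(a);
        printf("rows: sum=%d, %.3f s\n", r,
               (double)(clock() - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        int c = sumarraycols(a);
        printf("cols: sum=%d, %.3f s\n", c,
               (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }

On typical hardware the row-wise version runs several times faster even though both compute the same sum; only the locality of the accesses differs.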

  14. Practice problem 6.8 Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)? int sumarray3d(int a[M][N][N]) { int i, j, k, sum = 0; for (i = 0; i < N; i++) for (j = 0; j < N; j++) for (k = 0; k < M; k++) sum += a[k][i][j]; return sum; }

  15. Practice problem 6.8 Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)? int sumarray3d(int a[M][N][N]) { int i, j, k, sum = 0; for (k = 0; k < M; k++) for (i = 0; i < N; i++) for (j = 0; j < N; j++) sum += a[k][i][j]; return sum; }
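
A quick way to convince yourself that the permuted loop order is stride-1 is to print the element offsets in visit order; with k outermost, then i, then j, they should come out as 0, 1, 2, and so on. The tiny array sizes below are illustrative.

    /* Print the offsets of a[k][i][j] from the start of the array in the
       order the permuted loops visit them; the output is 0 1 2 ... 17. */
    #include <stdio.h>

    #define M 2
    #define N 3

    int main(void) {
        int a[M][N][N];
        int *base = &a[0][0][0];
        for (int k = 0; k < M; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    printf("%td ", &a[k][i][j] - base);
        printf("\n");
        return 0;
    }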

  16. Practice problem 6.9 Rank-order the functions with respect to the spatial locality enjoyed by each. #define N 1000 typedef struct { int vel[3]; int acc[3]; } point; point p[N]; void clear1(point *p, int n) { int i, j; for (i = 0; i < n; i++) { for (j = 0; j < 3; j++) p[i].vel[j] = 0; for (j = 0; j < 3; j++) p[i].acc[j] = 0; } } void clear2(point *p, int n) { int i, j; for (i = 0; i < n; i++) { for (j = 0; j < 3; j++) { p[i].vel[j] = 0; p[i].acc[j] = 0; } } } void clear3(point *p, int n) { int i, j; for (j = 0; j < 3; j++) { for (i = 0; i < n; i++) p[i].vel[j] = 0; for (i = 0; i < n; i++) p[i].acc[j] = 0; } }

  17. Practice problem 6.9 Rank-order the functions with respect to the spatial locality enjoyed by each. #define N 1000 typedef struct { int vel[3]; int acc[3]; } point; point p[N]; Best: void clear1(point *p, int n) { int i, j; for (i = 0; i < n; i++) { for (j = 0; j < 3; j++) p[i].vel[j] = 0; for (j = 0; j < 3; j++) p[i].acc[j] = 0; } } void clear2(point *p, int n) { int i, j; for (i = 0; i < n; i++) { for (j = 0; j < 3; j++) { p[i].vel[j] = 0; p[i].acc[j] = 0; } } } Worst: void clear3(point *p, int n) { int i, j; for (j = 0; j < 3; j++) { for (i = 0; i < n; i++) p[i].vel[j] = 0; for (i = 0; i < n; i++) p[i].acc[j] = 0; } }

  18. Caches A cache is a smaller, faster memory device that acts as a staging area for a subset of the data in a larger, slower device. • Typically, "cache" refers to a fast SRAM-based memory managed automatically in hardware. • Holds frequently accessed blocks of main memory. • The CPU looks first for data in L1, then in L2, then in main memory. [Figure: CPU chip containing the register file, ALU, and L1 cache; a cache bus connects to the off-chip L2 cache; the bus interface connects through the system bus, I/O bridge, and memory bus to main memory.]

  19. General Caching Concepts The program needs object d, which is stored in some block b. Cache hit • The program finds b in the cache at level k (e.g., block 14). Cache miss • b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12). • If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"? • Placement policy: where can the new block go? E.g., b mod 4. • Replacement policy: which block should be evicted? E.g., LRU. [Figure: a 4-slot level-k cache holding blocks from a level-k+1 store of blocks 0-15; requests for blocks 14 and 12 illustrate a hit and a miss with eviction.]
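
As a toy illustration of the placement and replacement ideas above, the sketch below models a 4-slot level-k cache with the "block b goes in slot (b mod 4)" placement policy and replays the slide's two requests. The initial contents (blocks 8, 9, 14, 3) are assumed from the usual version of this figure, since the cache state is only partly legible in the transcript.

    /* Toy level-k cache with 4 slots and placement policy (b mod 4).
       Initial contents are an assumption: blocks 8, 9, 14, 3. */
    #include <stdio.h>

    #define SLOTS 4

    int main(void) {
        int cache[SLOTS] = { 8, 9, 14, 3 };   /* slot i holds one block */
        int requests[]   = { 14, 12 };

        for (int i = 0; i < 2; i++) {
            int b    = requests[i];
            int slot = b % SLOTS;             /* placement policy */
            if (cache[slot] == b) {
                printf("block %d: hit in slot %d\n", b, slot);
            } else {
                printf("block %d: miss, evicting block %d from slot %d\n",
                       b, cache[slot], slot);
                cache[slot] = b;              /* fetch from level k+1 */
            }
        }
        return 0;
    }

With the assumed contents, block 14 hits in slot 2 and block 12 misses, evicting the block currently in slot 0.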

  20. General Caching Concepts Types of cache misses: • Cold (compulsory) miss • Cold misses occur because the cache is empty. • Program start-up or context switch • Conflict miss • Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. • E.g. Block i at level k+1 must be placed in block (i mod 4) at level k. • Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. • E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time. • Capacity miss • Occurs when the set of active cache blocks (working set) is larger than the cache.

  21. Examples of Caching in the Hierarchy (cache type: what is cached, where it is cached, latency in cycles, managed by)
     Registers: 4-byte words, CPU registers, 0 cycles, compiler.
     TLB: address translations, on-chip TLB, 0 cycles, hardware.
     L1 cache: 32-byte blocks, on-chip L1, 1 cycle, hardware.
     L2 cache: 32-byte blocks, off-chip L2, 10 cycles, hardware.
     Virtual memory: 4-KB pages, main memory, 100 cycles, hardware + OS.
     Buffer cache: parts of files, main memory, 100 cycles, OS.
     Network buffer cache: parts of files, local disk, 10,000,000 cycles, AFS/NFS client.
     Browser cache: Web pages, local disk, 10,000,000 cycles, Web browser.
     Web cache: Web pages, remote server disks, 1,000,000,000 cycles, Web proxy server.

  22. Generic Cache Organization The cache is organized into S sets; each set consists of E cache lines; each line holds a memory block of B bytes. The cache is defined by the parameters (S, E, B, m). • C = total size of the cache = S * E * B. • M = total size of main memory = 2^m addresses, where m = # of main memory address bits (32 for IA32). • The address is split into 3 parts to access the cache: • t = tag = unique identifier for the data. • s = set index = identifies the set the data will be placed in. • b = block offset = the byte within the block being accessed.

  23. Generic Cache Organization Set index bits (s) • S sets. • The index determines where (which set) in the cache a memory block will be located. • # of bits needed to represent the index: s = log2(S). Each set consists of E cache lines • from 1 (direct-mapped) to all (fully associative). Block offset bits (b) • Each line holds a memory block of B = 2^b bytes. • The offset determines which byte within the block is being accessed. • # of bits needed to represent the offset: b = log2(B). Tag bits (t) • t = m - (s + b). • The tag uniquely identifies each memory block. The cache is defined by the parameters (S, E, B, m); size of cache C = S * E * B, plus overhead for valid and tag bits.
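
The following sketch shows how the t/s/b split above is applied to a concrete address. The parameter values (m = 32, s = 8, b = 5) and the example address are illustrative assumptions chosen only to demonstrate the bit arithmetic.

    /* Split an m-bit address into tag / set index / block offset using
       s = log2(S), b = log2(B), t = m - (s + b). Values are assumed. */
    #include <stdio.h>

    int main(void) {
        const unsigned m = 32;            /* address bits (IA32 per slide 22) */
        const unsigned s = 8;             /* set-index bits -> S = 256 sets   */
        const unsigned b = 5;             /* offset bits    -> B = 32 bytes   */
        const unsigned t = m - (s + b);   /* remaining 19 bits form the tag   */

        unsigned addr   = 0x12345678;     /* arbitrary example address */
        unsigned offset = addr & ((1u << b) - 1);
        unsigned set    = (addr >> b) & ((1u << s) - 1);
        unsigned tag    = addr >> (s + b);

        printf("t = %u tag bits\n", t);
        printf("address 0x%08x -> tag 0x%x, set %u, offset %u\n",
               addr, tag, set, offset);
        return 0;
    }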

  24. Practice problem 6.10 The table gives the parameters for a number of different caches; for each cache, derive the missing values. Cache size: C = S * E * B. Number of tag bits: t = m - (s + b).

  25. Practice problem 6.10 The table gives the parameters for a number of different caches; for each cache, derive the missing values. Cache size: C = S * E * B. Number of tag bits: t = m - (s + b).
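
The problem's table did not survive transcription, but the derivation it asks for is mechanical: compute S from C = S*E*B, take logs for s and b, and subtract for t. The helper below applies those formulas to one assumed row (m = 32, C = 1024, E = 1, B = 4); the values are placeholders, not entries from the actual table.

    /* Derive S, s, b, and t from C = S*E*B and t = m - (s + b). */
    #include <stdio.h>

    static unsigned lg2(unsigned x) {   /* log2 for exact powers of two */
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned m = 32, C = 1024, E = 1, B = 4;   /* assumed example row */
        unsigned S = C / (E * B);                  /* number of sets      */
        unsigned s = lg2(S);
        unsigned b = lg2(B);
        unsigned t = m - (s + b);

        printf("S = %u, s = %u, b = %u, t = %u\n", S, s, b, t);
        return 0;
    }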

  26. Direct-mapped caches The simplest cache to implement: • One line per set (E = 1). • Each location in memory maps to a single cache entry. • If multiple locations map to the same entry, their accesses will cause the cache to "thrash" due to conflict misses. • Recall the cache parameters (S, E, B, m); C = S*E*B = S*B for a direct-mapped cache.

  27. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4): S = 4 sets, E = 1 block per set, B = 1 byte per line/block, m = 4 bits of address = 16 bytes of memory. Derive: s = ?, b = ?, t = ? [Figure: main memory and cache contents (Index, Tag, Data).]

  28. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4): S = 4 sets, E = 1 block per set, B = 1 byte per line/block, m = 4 bits of address = 16 bytes of memory. s = log2(S) = log2(4) = 2 (log2 of the # of cache lines; the index is the 2 LSBs of the address). b = log2(B) = 0. t = 4 - (s + b) = 2 (the tag is the 2 MSBs of the address). Address fields: t (2 bits), s (2 bits). [Figure: main memory and cache contents (Index, Tag, Data).]

  29. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0011 1000 0011 1000. [Figure: main memory and cache contents (Index, Tag, Data); address fields t (2 bits), s (2 bits).]

  30. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0011 1000 0011 1000. Cache miss. [Figure: cache contents; address fields t (2 bits), s (2 bits).]

  31. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0011 1000 0011 1000. Cache miss. [Figure: cache contents; address fields t (2 bits), s (2 bits).]

  32. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0011 1000 0011 1000. Cache miss (tag mismatch). [Figure: cache contents; address fields t (2 bits), s (2 bits).]

  33. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0011 1000 0011 1000. Cache miss (tag mismatch). [Figure: cache contents; address fields t (2 bits), s (2 bits).]

  34. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0011 1000 0011 1000. Cache hit. [Figure: cache contents; address fields t (2 bits), s (2 bits).]

  35. Simple direct-mapped cache example, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0011 1000 0011 1000. Cache hit. Miss rate = 3/5 = 60%. [Figure: cache contents; address fields t (2 bits), s (2 bits).]
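
The 3/5 miss rate can be checked with a few lines of C. The sketch below pushes the slide's access pattern through a (4,1,1,4) direct-mapped cache (2 index bits, 2 tag bits, no offset bits); it is an illustration written for this transcript, not code from the deck.

    /* Simulate the (S,E,B,m) = (4,1,1,4) direct-mapped cache on the
       slide's access pattern; the slide reports 3/5 = 60% misses. */
    #include <stdio.h>

    int main(void) {
        int valid[4] = { 0 };
        unsigned tag[4] = { 0 };
        unsigned pattern[] = { 0x0, 0x3, 0x8, 0x3, 0x8 };
        int n = 5, misses = 0;

        for (int i = 0; i < n; i++) {
            unsigned addr = pattern[i];
            unsigned set  = addr & 0x3;   /* s = 2 low bits (b = 0) */
            unsigned t    = addr >> 2;    /* t = 2 high bits        */
            if (valid[set] && tag[set] == t) {
                printf("0x%x: hit\n", addr);
            } else {
                printf("0x%x: miss\n", addr);
                valid[set] = 1;
                tag[set]   = t;
                misses++;
            }
        }
        printf("miss rate = %d/%d\n", misses, n);
        return 0;
    }

Applied to the stride-1 pattern on slide 36 below (0000, 0001, 0010, ...), the same loop should miss on every access, since each 1-byte block is referenced only once.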

  36. Issue #1: Simple direct-mapped cache with no spatial locality, (S,E,B,m) = (4,1,1,4). Access pattern: 0000 0001 0010 0011 0100 0101 … Miss rate = ? [Figure: cache contents; address fields t (2 bits), s (2 bits).]

  37. Block size issues Larger block sizes can take advantage of spatial locality by loading data not just from one address but also from nearby addresses into the cache; this increases the hit rate for sequential access. • Use the least significant bits of the address as the block offset: sequential bytes in memory are stored together in a block (or cache line). • Use the next significant bits (the middle bits) as the set index, which determines the set of the cache a given block maps into; number of bits = log2(# of sets in cache). • Use the most significant bits as a tag to uniquely identify blocks in the cache.

  38. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). Partition the address to identify which pieces are used for the tag, the set index, and the block offset: b = ?, s = ?, t = ? [Figure: main memory and cache contents (Index, Tag, Data1, Data0).]

  39. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). b = 1, s = 1, t = 2. Tag = 2 MSBs of the address. Set index = bit 1 of the address; # of index bits = log2(# of cache lines). Block offset = LSB of the address; # of offset bits = log2(# of bytes per block). Address fields: t (2 bits), s (1 bit), b (1 bit). [Figure: main memory and cache contents (Index, Tag, Data1, Data0).]

  40. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). Access pattern: 0000 0001 0010 0011 0100 0101 … Cache miss. [Figure: cache contents (Index, Tag, Data1, Data0); address fields t (2 bits), s (1 bit), b (1 bit).]

  41. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). Access pattern: 0000 0001 0010 0011 0100 0101 … Cache hit. [Figure: cache contents (Index, Tag, Data1, Data0); address fields t (2 bits), s (1 bit), b (1 bit).]

  42. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). Access pattern: 0000 0001 0010 0011 0100 0101 … Cache miss. [Figure: cache contents (Index, Tag, Data1, Data0); address fields t (2 bits), s (1 bit), b (1 bit).]

  43. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). Access pattern: 0000 0001 0010 0011 0100 0101 … Cache hit. [Figure: cache contents (Index, Tag, Data1, Data0); address fields t (2 bits), s (1 bit), b (1 bit).]

  44. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). Access pattern: 0000 0001 0010 0011 0100 0101 … Cache miss. [Figure: cache contents (Index, Tag, Data1, Data0); address fields t (2 bits), s (1 bit), b (1 bit).]

  45. Direct-mapped cache with B = 2, (S,E,B,m) = (2,1,2,4). Access pattern: 0000 0001 0010 0011 0100 0101 … Cache hit. Miss rate = ? [Figure: cache contents (Index, Tag, Data1, Data0); address fields t (2 bits), s (1 bit), b (1 bit).]
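
The same kind of sketch, adapted to the (2,1,2,4) cache of slides 38 through 45 (1 offset bit, 1 index bit, 2 tag bits), reproduces the alternating miss/hit sequence shown above, which works out to a 3/6 = 50% miss rate for the first six accesses.

    /* Simulate the (S,E,B,m) = (2,1,2,4) direct-mapped cache on the
       sequential pattern 0000 .. 0101 from these slides. */
    #include <stdio.h>

    int main(void) {
        int valid[2] = { 0, 0 };
        unsigned tag[2] = { 0, 0 };
        int misses = 0;

        for (unsigned addr = 0x0; addr <= 0x5; addr++) {
            unsigned set = (addr >> 1) & 0x1;   /* s = 1 bit (bit 1) */
            unsigned t   = addr >> 2;           /* t = 2 high bits   */
            if (valid[set] && tag[set] == t) {
                printf("0x%x: hit\n", addr);
            } else {
                printf("0x%x: miss\n", addr);
                valid[set] = 1;
                tag[set]   = t;
                misses++;
            }
        }
        printf("miss rate = %d/6\n", misses);
        return 0;
    }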

  46. Exercise Consider the following cache: (S,E,B,m) = (2,1,4,5). • Derive values for the number of address bits used for the tag (t), the set index (s), and the block offset (b): s = ?, b = ?, t = ? • Draw a diagram showing which bits of the address are used for the tag, the set index, and the block offset. • Draw a diagram of the cache.

  47. Exercise Consider the following cache: (S,E,B,m) = (2,1,4,5). • s = 1, b = 2, t = 2. • Address layout: t (2 bits), s (1 bit), b (2 bits). • Cache diagram: set 0 has a tag (2 bits) and data (4 bytes); set 1 has a tag (2 bits) and data (4 bytes).

  48. Exercise Consider the same cache, (S,E,B,m) = (2,1,4,5). • Derive the miss rate for the following access pattern: 00000 00001 00011 00100 00111 01000 01001 10000 10011 00000. Address layout: t (2 bits), s (1 bit), b (2 bits). Cache: set 0 has a tag (2 bits) and data (4 bytes); set 1 has a tag (2 bits) and data (4 bytes).

  49. Exercise Consider the same cache, (S,E,B,m) = (2,1,4,5). • Derive the miss rate for the following access pattern: 00000 (Miss) 00001 00011 00100 00111 01000 01001 10000 10011 00000. Cache state: set 0 holds tag 00 with data from addresses 00000 to 00011; set 1 not yet filled. Address layout: t (2 bits), s (1 bit), b (2 bits).

  50. Exercise Consider the same cache, (S,E,B,m) = (2,1,4,5). • Derive the miss rate for the following access pattern: 00000 (Miss) 00001 (Hit) 00011 00100 00111 01000 01001 10000 10011 00000. Cache state: set 0 holds tag 00 with data from addresses 00000 to 00011; set 1 not yet filled. Address layout: t (2 bits), s (1 bit), b (2 bits).
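
The transcript ends partway through this walkthrough. A sketch that runs the whole exercise pattern through the (2,1,4,5) cache (2 offset bits, 1 index bit at bit 2, 2 tag bits) is given below; its first two results agree with the Miss and Hit shown above, and the remaining lines it prints are computed by the simulation rather than taken from the missing slides.

    /* Simulate the exercise's (S,E,B,m) = (2,1,4,5) direct-mapped cache
       on the full access pattern from slide 48 (b = 2, s = 1, t = 2). */
    #include <stdio.h>

    int main(void) {
        int valid[2] = { 0, 0 };
        unsigned tag[2] = { 0, 0 };
        unsigned pattern[] = { 0x00, 0x01, 0x03, 0x04, 0x07,
                               0x08, 0x09, 0x10, 0x13, 0x00 };
        int n = (int)(sizeof pattern / sizeof pattern[0]);
        int misses = 0;

        for (int i = 0; i < n; i++) {
            unsigned addr = pattern[i];
            unsigned set  = (addr >> 2) & 0x1;   /* s = 1 bit (bit 2) */
            unsigned t    = addr >> 3;           /* t = 2 high bits   */
            if (valid[set] && tag[set] == t) {
                printf("0x%02x: hit\n", addr);
            } else {
                printf("0x%02x: miss\n", addr);
                valid[set] = 1;
                tag[set]   = t;
                misses++;
            }
        }
        printf("miss rate = %d/%d\n", misses, n);
        return 0;
    }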
