
Large and Fast: Exploiting Memory Hierarchy



  1. Large and Fast: Exploiting Memory Hierarchy

  2. Memory Technology
  • Static RAM (SRAM): 0.5 ns – 2.5 ns, $2000 – $5000 per GB
  • Dynamic RAM (DRAM): 50 ns – 70 ns, $20 – $75 per GB
  • Magnetic disk: 5 ms – 20 ms, $0.20 – $2 per GB
  • Ideal memory: the access time of SRAM with the capacity and cost/GB of disk

  3. Principle of Locality
  • Programs access a small proportion of their address space at any time
  • Temporal locality: items accessed recently are likely to be accessed again soon (e.g., instructions in a loop, induction variables)
  • Spatial locality: items near those accessed recently are likely to be accessed soon (e.g., sequential instruction access, array data)
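
A small sketch in Python (illustrative, not from the slides) showing both kinds of locality: the accumulator s is reused on every iteration (temporal), and the array is read in consecutive order (spatial), so each cache block fetched serves several iterations.

    # Temporal locality: s and i are touched on every iteration.
    # Spatial locality: a[0], a[1], ... are adjacent in memory.
    a = list(range(1024))
    s = 0
    for i in range(len(a)):
        s += a[i]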

  4. Taking Advantage of Locality
  • Memory hierarchy: store everything on disk
  • Copy recently accessed (and nearby) items from disk to a smaller DRAM memory (main memory)
  • Copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory (cache memory attached to the CPU)

  5. Mapping Function
  • Cache size = 64 KB = 2^16 bytes
  • Cache block = 4 B = 2^2 bytes = 1 word
  • Number of lines = 2^16 / 2^2 = 2^14
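
The same arithmetic as a quick Python sketch, using the sizes from the slide:

    cache_size = 64 * 1024        # 64 KB = 2**16 bytes
    block_size = 4                # 4 B = 2**2 bytes = 1 word
    num_lines = cache_size // block_size
    assert num_lines == 2**14     # 16K cache lines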

  6. [Figure: address divided into tag (t bits), line, and byte-offset (b bits) fields]

  7. [Figure: diagram built up in steps 1-3]

  8. Direct-mapped example
  • Address = 0001 0110 0011 0011 1001 1100
  • Tag = 0001 0110 = 16 (hex)
  • Line# = 00110011100111 = 0CE7 (hex)
  • Byte offset = 00
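
A hedged check of the same split in Python, assuming the slide's 24-bit address with an 8-bit tag, 14-bit line number, and 2-bit byte offset:

    addr = 0b000101100011001110011100   # 24-bit address from the slide
    byte = addr & 0b11                  # low 2 bits: byte within word
    line = (addr >> 2) & 0x3FFF         # next 14 bits: line number
    tag  = addr >> 16                   # top 8 bits: tag
    assert line == 0x0CE7 and tag == 0x16 and byte == 0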

  9. [Figure: diagram built up in steps 1-3]

  10. XNOR = NOT(XOR): a per-bit XNOR outputs 1 when its two inputs are equal, so ANDing the XNORs of all tag bits signals a tag match
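
A tiny sketch of that comparator idea in Python (the gate-level structure is only mimicked here): two tags match exactly when their XOR is all zeros.

    def tags_match(a: int, b: int, width: int = 8) -> bool:
        # (a ^ b) has a 0 wherever the bits agree; a full match
        # means the XOR is all zeros (equivalently, all XNORs are 1).
        return (a ^ b) & ((1 << width) - 1) == 0

    assert tags_match(0x16, 0x16)
    assert not tags_match(0x16, 0x17)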

  11. Next slide

  12. Fully associative example
  • Address = 000101100011001110011100
  • Tag = top 22 bits = 0001011000110011100111 = 058CE7 (hex); byte offset = 00
  • (For comparison, the direct-mapped Line# = 00110011100111 = 0CE7)
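
The fully associative split for the same address, sketched under the slide's assumption of a 22-bit tag and 2-bit byte offset:

    addr = 0b000101100011001110011100
    byte = addr & 0b11       # low 2 bits: byte within word
    tag  = addr >> 2         # remaining 22 bits are all tag
    assert tag == 0x058CE7 and byte == 0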

  13. Set Associative Cache
  • A hybrid between a fully associative cache and a direct-mapped cache
  • Considered a reasonable compromise between the complex hardware needed for fully associative caches (parallel searches of all slots) and the simplistic direct-mapped scheme, which can cause collisions of addresses mapping to the same slot

  14. Set Associative Cache
  • Parking lot analogy: 1000 parking spots (0-999), so 3 digits are needed to find a spot
  • Instead, use 2 digits (say, the last 2 digits of your SSN) to hash into 100 sets of 10 spots
  • Once you find your set, you can park in any of its 10 spots

  15. Finding a Set
  • There are 2^13 sets (8K sets), giving a 2-way set-associative cache
  • Use the set bits (b14 - b2) to index a set, then compare the 9-bit tag against each slot in the set in parallel
  • If found (hit), get the word and select the byte with b1 - b0
  • Otherwise (miss), decide which slot in the set to fill with the data from memory
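
A minimal 2-way set-associative lookup in Python, assuming the slide's geometry (9-bit tag, 13-bit set index, 2-bit byte offset); the names and the trivial replacement policy are illustrative only.

    NUM_SETS = 2**13
    sets = [[None, None] for _ in range(NUM_SETS)]   # each way holds a tag

    def lookup(addr: int) -> bool:
        set_idx = (addr >> 2) & (NUM_SETS - 1)   # 13 set bits (b14-b2)
        tag = addr >> 15                         # 9 tag bits
        slots = sets[set_idx]
        if tag in slots:          # compare both ways "in parallel"
            return True           # hit
        slots[0] = tag            # miss: fill a slot (real caches use LRU)
        return False

    addr = 0b000101100011001110011100
    assert not lookup(addr)   # first access misses
    assert lookup(addr)       # second access hits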

  16. [Figure: diagram built up in steps 1-3]

  17. [Figure: set-associative cache organization with Tag, Data, and Set columns]

  18. 2-way set-associative example (Set#0 and Set#1 are the two ways of each set)
  • Address fields: tag (9 bits), Set# (13 bits), byte offset (2 bits); the 14-bit Line# drops its top bit to become the 13-bit Set#
  • Address = 000101100011001110011100
  • Tag = 000101100 = 02C (hex)
  • Set# = 0110011100111 = 0CE7 (hex); for comparison, Line# = 00110011100111 = 0CE7
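
The same field split, checked in Python under the slide's 9/13/2 bit layout:

    addr = 0b000101100011001110011100
    byte    = addr & 0b11              # 2-bit byte offset
    set_idx = (addr >> 2) & 0x1FFF     # 13 set bits
    tag     = addr >> 15               # 9 tag bits
    assert tag == 0x02C and set_idx == 0x0CE7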

  19. Measuring Cache Performance
  • Components of CPU time:
  • Program execution cycles (includes cache hit time)
  • Memory stall cycles (mainly from cache misses)
  • With simplifying assumptions:
    Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                        = (Instructions / Program) × (Misses / Instruction) × Miss penalty

  20. Cache Performance Example
  • Given: I-cache miss rate = 2%, D-cache miss rate = 4%, miss penalty = 100 cycles, base CPI (ideal cache) = 2, loads & stores are 36% of instructions
  • Miss cycles per instruction:
  • I-cache: 0.02 × 100 = 2
  • D-cache: 0.36 × 0.04 × 100 = 1.44
  • Actual CPI = 2 + 2 + 1.44 = 5.44
  • The ideal CPU is 5.44/2 = 2.72 times faster
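
The example's arithmetic as a short sketch:

    base_cpi = 2.0
    miss_penalty = 100
    i_miss_rate, d_miss_rate = 0.02, 0.04
    mem_frac = 0.36                                   # loads/stores per instruction

    i_stalls = i_miss_rate * miss_penalty             # 2.0 stall cycles/instruction
    d_stalls = mem_frac * d_miss_rate * miss_penalty  # 1.44 stall cycles/instruction
    cpi = base_cpi + i_stalls + d_stalls              # 5.44
    print(cpi, cpi / base_cpi)                        # 5.44, 2.72x faster with an ideal cache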
