
Computer Architecture Cache Memory


Presentation Transcript


  1. Computer Architecture: Cache Memory. By Dan Tsafrir, 26/3/2012, 2/4/2012. Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz

  2. In the olden days… • The predecessor of ENIAC (the first general-purpose electronic computer) • Designed & built in 1944-1949 by Eckert & Mauchly (who also invented ENIAC), with John Von Neumann • Unlike ENIAC, binary rather than decimal, and a “stored program” machine • Operational until 1961 • EDVAC (Electronic Discrete Variable Automatic Computer)

  3. In the olden days… • In 1945, Von Neumann wrote: “…This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: at the memory.” • Von Neumann & EDVAC

  4. In the olden days… • Later, in 1946, he wrote: “…Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available… …We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible” • Von Neumann & EDVAC

  5. Not so long ago… • In 1994, in their paper “Hitting the Memory Wall: Implications of the Obvious”, William Wulf and Sally McKee said:“We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.”

  6. Not so long ago… • [Chart: processor vs. DRAM performance, 1980–2000, log scale; the performance gap grew ~50% per year] • CPU: 60% per year (2× in 1.5 years) • DRAM: 9% per year (2× in 10 years)

  7. More recently (2008)… • The memory wall in the multicore era • [Chart: performance in seconds (lower = slower) vs. number of processor cores, for a conventional architecture]

  8. Memory Trade-Offs • Large (dense) memories are slow • Fast memories are small, expensive, and consume high power • Goal: give the processor the feeling that it has a memory that is large (dense), fast, cheap, and consumes low power • Solution: a hierarchy of memories • [Diagram: CPU → L1 Cache → L2 Cache → L3 Cache → Memory (DRAM); moving away from the CPU — speed: fastest → slowest, size: smallest → biggest, cost: highest → lowest, power: highest → lowest]

  9. Typical levels in mem hierarchy

  10. Why Hierarchy Works: Locality • Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon • Example: code and variables in loops ⇒ keep recently accessed data closer to the processor • Spatial Locality (Locality in Space): if an item is referenced, nearby items tend to be referenced soon • Example: scanning an array ⇒ move contiguous blocks closer to the processor • Due to locality, a memory hierarchy is a good idea • We’re going to use what we’ve just recently used • And we’re going to use its immediate neighborhood • (See the sketch below.)
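To make this concrete, here is a small illustrative C sketch (not from the slides): summing a matrix row by row walks memory contiguously (good spatial locality), while summing it column by column strides across cache lines (poor spatial locality).

```c
#include <stdio.h>
#include <stddef.h>

#define N 512

static int a[N][N];

/* Good spatial locality: rows of a C array are contiguous in memory,
 * so consecutive accesses hit the same cache line before moving on. */
static long sum_row_major(void)
{
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: each access jumps N*sizeof(int) bytes,
 * touching a different cache line almost every time. */
static long sum_col_major(void)
{
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    /* Same result, very different cache behavior on large arrays. */
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}
```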

  11. Bad locality behavior • [Plot: memory address (one dot per access) vs. time, with regions exhibiting temporal locality, spatial locality, and bad locality] • Programs with locality cache well... • Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

  12. 2012-04-02

  13. Memory Hierarchy: Terminology • For each memory level define the following • Hit: the data appears in the memory level • Hit Rate: the fraction of accesses found in that level • Hit Time: time to access the memory level • Also includes the time to determine hit/miss • Miss: need to retrieve the data from the next level • Miss Rate: 1 − (Hit Rate) • Miss Penalty: time to bring in the missing info (replace a block) + time to deliver the info to the accessor • Average memory access time = t_effective = (Hit Time × Hit Rate) + (Miss Time × Miss Rate) = (Hit Time × Hit Rate) + (Miss Time × (1 − Hit Rate)) • If the hit rate is close to 1, t_effective is close to the hit time, which is generally what we want

  14. Effective Memory Access Time • Cache – holds a subset of the memory • Hopefully – the subset that is being used now • Known as “the working set” • Effective memory access time • t_effective = (t_cache × Hit Rate) + (t_mem × (1 − Hit Rate)) • t_mem includes the time it takes to detect a cache miss • Example • Assume: t_cache = 10 ns, t_mem = 100 ns

  Hit Rate (%) | t_eff (ns)
  0            | 100
  50           | 55
  90           | 19
  99           | 10.9
  99.9         | 10.1

  • As t_mem/t_cache goes up, it becomes more important that the hit rate be close to 1
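The table is just the formula above evaluated for the slide's numbers (t_cache = 10 ns, t_mem = 100 ns); a few lines of C reproduce it:

```c
#include <stdio.h>

/* t_effective = t_cache * hit_rate + t_mem * (1 - hit_rate)
 * (t_mem here already includes the time to detect the miss, as on the slide) */
static double t_effective(double t_cache, double t_mem, double hit_rate)
{
    return t_cache * hit_rate + t_mem * (1.0 - hit_rate);
}

int main(void)
{
    const double t_cache = 10.0, t_mem = 100.0;   /* ns, as in the example */
    const double hit_rates[] = { 0.0, 0.50, 0.90, 0.99, 0.999 };

    for (size_t i = 0; i < sizeof hit_rates / sizeof hit_rates[0]; i++)
        printf("hit rate %.1f%% -> t_eff = %.1f ns\n",
               100.0 * hit_rates[i],
               t_effective(t_cache, t_mem, hit_rates[i]));
    return 0;
}
```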

  15. Cache – main idea • [Diagram: main memory blocks 0, 1, 2, 3, …, 90, 91, 92, 93, …; a small cache currently holding, e.g., blocks 2, 4, 90, 92] • The cache holds a small part of the entire memory • Need to map parts of the memory into the cache • Main memory is (logically) partitioned into “blocks” or “lines” or, when the info is cached, “cachelines” • Typical block size is 32, 64, 128 bytes • Blocks are “aligned” in memory • The cache is partitioned into cache lines • Each cache line holds a block • Only a subset of the blocks is mapped to the cache at a given time • The cache views an address as a ⟨block #, offset⟩ pair • Why use lines/blocks rather than words?

  16. Cache Lookup • [Diagram: as before, memory blocks and a cache holding blocks 2, 4, 90, 92] • Cache hit • The block is mapped to the cache – return data according to the block’s offset • Cache miss • The block is not mapped to the cache ⇒ do a cacheline fill • Fetch the block into a fill buffer • May require a few bus cycles • Write the fill buffer into the cache • May need to evict another block from the cache • Make room for the new block

  17. Checking valid bit & tag • [Diagram: the address is split into Tag (= block #, bits 31..5) and Offset (bits 4..0); each line in the tag array holds a valid bit (v) and a tag; comparators match the address tag against every stored tag to produce hit/data] • Initially the cache is empty • Need to have a “line valid” indication – a line valid bit • A line may also be invalidated

  18. Cache organization • We will go over • Direct-mapped cache • 2-way set associative cache • N-way set associative cache • Fully-associative cache

  19. Direct Map Cache • [Diagram: the address splits into Tag (bits 31..14), Set (bits 13..5), Offset (bits 4..0); the Set field indexes the tag array and the data array, which hold 2^9 = 512 sets] • Offset • Byte within the cache-line • Set • The index into the “cache array” and the “tag array” • For a given set (an index), only one of the cache lines that has this set can reside in the cache • Tag • The rest of the block bits are used as the tag • This is how we identify the individual cache line • Namely, we compare the tag of the address to the tag stored in the cache’s tag array

  20. Direct Map Cache (cont) • [Diagram: memory partitioned into cache-size slices; block x in every slice maps to the same set X] • Partition memory into slices • Slice size = cache size • Partition each slice into blocks • Block size = cache line size • The distance of a block from the start of its slice indicates its position in the cache (the set) • Advantages • Easy & fast hit/miss resolution • Easy & fast replacement algorithm • Lowest power • Disadvantage • A line has only “one chance” • Lines are replaced due to “set conflict misses” • The organization with the highest miss-rate

  21. Direct Map Cache – Example • Address fields: Tag = bits 31..14, Set = bits 13..5, Offset = bits 4..0 • Line size: 32 bytes ⇒ 5 offset bits • Cache size: 16KB = 2^14 bytes • #lines = cache size / line size = 2^14 / 2^5 = 2^9 = 512 • #sets = #lines = 512 ⇒ #set bits = 9 (bits 5…13) • #tag bits = 32 − (#set bits + #offset bits) = 32 − (9+5) = 18 (bits 14…31) • Lookup address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000 • tag = 0x048D1, set = 0x0B3, offset = 0x18
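A small C sketch of this address split, assuming the example's parameters (32-byte lines, 16 KB direct-mapped cache, 32-bit addresses); the constants and names are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE   32u                       /* bytes -> 5 offset bits      */
#define CACHE_SIZE  (16u * 1024)              /* bytes                       */
#define NUM_SETS    (CACHE_SIZE / LINE_SIZE)  /* 512  -> 9 set bits          */

#define OFFSET_BITS 5u
#define SET_BITS    9u

int main(void)
{
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & (LINE_SIZE - 1);                 /* bits 4..0   */
    uint32_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* bits 13..5  */
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);       /* bits 31..14 */

    printf("tag=0x%05X set=0x%03X offset=0x%02X\n",
           (unsigned)tag, (unsigned)set, (unsigned)offset);
    /* prints: tag=0x048D1 set=0x0B3 offset=0x18 */
    return 0;
}
```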

  22. Direct map (tiny example) • Assume • Memory size is 2^5 = 32 bytes • For this, we need a 5-bit address • A block is comprised of 4 bytes • Thus, there are exactly 8 blocks • Note • We need only 3 bits to identify a block • The offset is exclusively used within the cache lines • The offset is not used to locate the cache line • [Diagram: 5-bit addresses such as 00001, 01110, 11111, split into a block index (upper 3 bits) and an offset within the block (lower 2 bits)]

  23. Direct map (tiny example) • Further assume • The size of our cache is 2 cache-lines (⇒ we need 2 = 5 − 2 − 1 tag bits) • The address divides like so • b4 b3 | b2 | b1 b0 • tag | set | offset • [Diagram: blocks with an even block index map to cache line 0, blocks with an odd index to cache line 1; shown: tag array (bits), cache array (bytes), memory array (bytes)]

  24. Direct map (tiny example) • Accessing address 00 0 10 (= the byte marked “C” in the figure) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (00) | set (0) | offset (10) • [Diagram: tag array (bits), cache array (bytes), memory array (bytes)]

  25. Direct map (tiny example) • Accessing address 01 0 10 (= Y) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (01) | set (0) | offset (10) • [Diagram as above]

  26. Direct map (tiny example) • Accessing address 10 0 10 (= Q) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (10) | set (0) | offset (10) • [Diagram as above]

  27. Direct map (tiny example) • Accessing address 11 0 10 (= J) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (11) | set (0) | offset (10) • [Diagram as above]

  28. Direct map (tiny example) • Accessing address 00 1 10 (= B) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (00) | set (1) | offset (10) • [Diagram as above]

  29. Direct map (tiny example) • Accessing address 01 1 10 (= Y) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (01) | set (1) | offset (10) • [Diagram as above]

  30. Direct map (tiny example) • Now assume • The size of our cache is 4 cache-lines • The address divides like so • b4 | b3 b2 | b1 b0 • tag | set | offset • [Diagram: tag array (bits), cache array (bytes), memory array (bytes)]

  31. Direct map (tiny example) • Now assume • The size of our cache is 4 cache-lines • The address divides like so • b4 | b3 b2 | b1 b0 • tag | set | offset • [Diagram as above, after the mapping]

  32. 2-Way Set Associative Cache • Address fields: Tag = bits 31..13, Set = bits 12..5, Offset = bits 4..0 • [Diagram: two ways (WAY #0, WAY #1), each with its own tag array and cache storage, indexed by the same Set#] • Each set holds two lines (way 0 and way 1) • Each block can be mapped into one of two lines in the appropriate set (HW checks both ways in parallel) • The cache is effectively partitioned into two • Example: • Line size: 32 bytes • Cache size: 16KB • #lines: 512 • #sets: 256 • Offset bits: 5 • Set bits: 8 • Tag bits: 19 • Address: 0001 0010 0011 0100 0101 0110 0111 1000 (= 0x12345678) • Offset: 1 1000 = 0x18 = 24 • Set: 1011 0011 = 0x0B3 = 179 • Tag: 000 1001 0001 1010 0010 = 0x091A2

  33. 2-Way Cache – Hit Decision • [Diagram: the address splits into Tag (31..13), Set (12..5), Offset (4..0); the Set field indexes both ways; each way’s stored tag is compared (=) with the address tag; a match drives Hit/Miss and selects that way’s data through a MUX to Data Out]

  34. 2-Way Set Associative Cache (cont) • [Diagram: memory partitioned into way-size slices; block x in every slice maps to set X] • Partition memory into “slices” or “ways” • Slice size = way size = ½ cache size • Partition each slice into blocks • Block size = cache line size • The distance of a block from the start of its slice indicates its position in the cache (the set) • Compared to a direct map cache • Half-size slices ⇒ 2× the number of slices ⇒ 2× the number of blocks mapped to each set in the cache • But in each set we can hold 2 blocks at a given time • More logic, warmer, more power consuming, but fewer collisions/evictions • (A lookup sketch follows below.)
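A minimal C sketch of a 2-way set-associative lookup under the example's parameters (16 KB, 32-byte lines, 256 sets); the structures and names are illustrative, and the line fill on a miss is left to the caller:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SETS    256u   /* 16 KB / 32-byte lines / 2 ways, as in the example */
#define NUM_WAYS    2u
#define OFFSET_BITS 5u
#define SET_BITS    8u

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[32];
};

static struct line cache[NUM_SETS][NUM_WAYS];

/* Returns a pointer to the requested byte on a hit, NULL on a miss.
 * Hardware checks both ways in parallel; software just loops over them. */
static uint8_t *lookup(uint32_t addr)
{
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    for (uint32_t way = 0; way < NUM_WAYS; way++) {
        struct line *l = &cache[set][way];
        if (l->valid && l->tag == tag)
            return &l->data[offset];   /* hit */
    }
    return NULL;                       /* miss: caller performs a line fill */
}

int main(void)
{
    uint8_t *p = lookup(0x12345678);   /* empty cache -> miss */
    printf("%s\n", p ? "hit" : "miss");
    return 0;
}
```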

  35. N-way set associative cache • Similar to the 2-way case • At the extreme, every cache line is a way (fully associative)…

  36. Fully Associative Cache • [Diagram: the address splits into Tag (= block #, bits 31..5) and Offset (bits 4..0); every line’s tag is compared in parallel against the address tag to produce hit/data] • An address is partitioned into • offset within the block • block number • Each block may be mapped to each of the cache lines • Look up the block in all lines • Each cache line has a tag • All tags are compared to the block # in parallel • Need a comparator per line • If one of the tags matches the block #, we have a hit • Supply data according to the offset • Best hit rate, but most wasteful • Must be relatively small

  37. Fully Associative Cache • [Diagram as on the previous slide] • Is said to be a “CAM” • Content Addressable Memory

  38. Cache organization summary • Increasing set associativity • Improves hit rate • Increases power consumption • Increases access time • Strike a balance

  39. Cache Read Miss • On a read miss – perform a cache line fill • Fetch the entire block that contains the missing data from memory • The block is fetched into the cache line fill buffer • May take a few bus cycles to complete the fetch • e.g., 64-bit (8-byte) data bus, 32-byte cache line ⇒ 4 bus cycles • Can stream (forward) the critical chunk into the core before the line fill ends • Once the entire block is fetched into the fill buffer • It is moved into the cache • (See the sketch below.)
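A small illustrative sketch of the arithmetic above and of fetching the critical chunk first (a wrap-around burst order); the address and constants are assumptions for the example:

```c
#include <stdio.h>

#define LINE_SIZE 32u   /* bytes */
#define BUS_WIDTH  8u   /* bytes per bus cycle (64-bit data bus) */

int main(void)
{
    unsigned miss_addr = 0x12345678u;               /* address that missed  */
    unsigned line_base = miss_addr & ~(LINE_SIZE - 1);
    unsigned chunks    = LINE_SIZE / BUS_WIDTH;     /* 32/8 = 4 bus cycles  */

    /* Fetch the chunk that contains the missing word first, then wrap
     * around, so the critical chunk can be forwarded to the core early. */
    unsigned first = (miss_addr & (LINE_SIZE - 1)) / BUS_WIDTH;
    for (unsigned i = 0; i < chunks; i++) {
        unsigned chunk = (first + i) % chunks;
        printf("bus cycle %u: bytes 0x%08X..0x%08X\n",
               i, line_base + chunk * BUS_WIDTH,
               line_base + chunk * BUS_WIDTH + BUS_WIDTH - 1);
    }
    return 0;
}
```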

  40. Cache Replacement • Direct map cache – easy • A new block is mapped to a single line in the cache • The old line is evicted (re-written to memory if needed) • N-way set associative cache – harder • Choose a victim from all the ways in the appropriate set • But which? To determine, use a replacement algorithm • Replacement algorithms • FIFO (First In First Out) • Random • LRU (Least Recently Used) • Optimum (theoretical, postmortem, called “Belady”) • Aside from the theoretical optimum, of the above, LRU is the best • But benchmarks show it is not that much better than random…

  41. 16-Apr-2012

  42. LRU Implementation • 2 ways • 1 bit per set to mark the latest way accessed in the set • Evict the way not pointed to by the bit • k-way set associative LRU • Requires a full ordering of way accesses • Algorithm: when way i is accessed:

  x = counter[i]
  counter[i] = k-1
  for (j = 0 to k-1)
      if ((j != i) && (counter[j] > x))
          counter[j]--

  • When replacement is needed • Evict the way with counter = 0 • Expensive even for small k’s • Because it is invoked for every load/store • Need a log2(k)-bit counter per line • Example (k = 4):

  Initial state:  way 0 1 2 3   count 0 1 2 3
  Access way 2:   way 0 1 2 3   count 0 1 3 2
  Access way 0:   way 0 1 2 3   count 3 0 2 1

  • (A runnable version follows below.)
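The counter algorithm above, written out as a runnable C sketch for one set with k = 4; it reproduces the state sequence of the example:

```c
#include <stdio.h>

#define K 4   /* associativity */

/* counter[i] == K-1 -> way i is the most recently used
 * counter[i] == 0   -> way i is the LRU victim            */
static int counter[K] = { 0, 1, 2, 3 };   /* initial state from the slide */

static void access_way(int i)
{
    int x = counter[i];
    counter[i] = K - 1;
    for (int j = 0; j < K; j++)
        if (j != i && counter[j] > x)
            counter[j]--;
}

static int victim(void)
{
    for (int i = 0; i < K; i++)
        if (counter[i] == 0)
            return i;
    return 0;   /* unreachable: exactly one counter is always 0 */
}

int main(void)
{
    access_way(2);   /* counters become 0 1 3 2 */
    access_way(0);   /* counters become 3 0 2 1 */
    printf("next victim: way %d\n", victim());   /* way 1 */
    return 0;
}
```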

  43. Pseudo LRU (PLRU) • In practice, it’s sufficient to efficiently approximate LRU • Maintain k−1 bits, instead of k ∙ log2(k) bits • Assume k=4, and let’s enumerate the ways’ cache lines • We need 2 bits: cache line 00, cl-01, cl-10, and cl-11 • Use a binary search tree to represent the 4 cache lines • Set each of the 3 (= k−1) internal nodes to hold a bit variable: B0, B1, and B2 • Whenever accessing a cache line b1b0 • Set each bit variable Bj along the path to the corresponding cache line bit bk • Can think of the bit value of Bj as “the right side was referenced more recently” • Need to evict? Walk the tree as follows: • Go left if Bj = 1; go right if Bj = 0 • Evict the leaf you’ve reached (= the opposite direction relative to previous accesses) • [Figure: a 3-node binary tree (B0 at the root, B1 and B2 below it) whose leaves are the cache lines 00, 01, 10, 11]

  44. Pseudo LRU (PLRU) – Example • Access 3 (11), 0 (00), 2 (10), 1 (01) ⇒ next victim is 3 (11), as expected • [Figure: the state of the tree bits B0, B1, B2 after each access, and the eviction walk that reaches leaf 11]
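A C sketch of the tree-PLRU scheme from the previous two slides, for k = 4 ways and 3 bits (B0, B1, B2). The left/right and bit conventions here are one consistent choice (possibly mirrored relative to the figure), and the sketch reproduces the example's outcome:

```c
#include <stdbool.h>
#include <stdio.h>

/* Tree-PLRU for a 4-way set: 3 bits instead of 4*log2(4) = 8 bits.
 * b0 records the high bit (b1) of the last accessed way; b1/b2 record
 * the low bit within each half. The victim is found by walking the
 * tree in the opposite direction of the recorded accesses. */
struct plru { bool b0, b1, b2; };

static void plru_access(struct plru *p, unsigned way)   /* way in 0..3 */
{
    bool hi = (way >> 1) & 1, lo = way & 1;
    p->b0 = hi;
    if (hi == 0) p->b1 = lo; else p->b2 = lo;
}

static unsigned plru_victim(const struct plru *p)
{
    bool hi = !p->b0;                      /* opposite of the last access */
    bool lo = (hi == 0) ? !p->b1 : !p->b2;
    return (unsigned)((hi << 1) | lo);
}

int main(void)
{
    struct plru p = { 0, 0, 0 };
    unsigned accesses[] = { 3, 0, 2, 1 };   /* the slide's example */
    for (unsigned i = 0; i < 4; i++)
        plru_access(&p, accesses[i]);
    printf("next victim: way %u\n", plru_victim(&p));   /* prints 3 */
    return 0;
}
```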

  45. LRU vs. Random vs. FIFO • LRU: hardest • FIFO: easier, approximates LRU (evicts the oldest line rather than the least recently used) • Random: easiest • Results: • Misses per 1000 instructions in L1-d, on average • Averaged across ten SPECint2000 / SPECfp2000 benchmarks • PLRU turns out rather similar to LRU

  46. Effect of Cache on Performance • MPI (misses per instruction) • The fraction of instructions (out of the total) that experience a miss • (Memory accesses per instruction = the fraction of instructions that access memory) • MPI = Memory accesses per instruction × Miss rate • Memory stall cycles = |Memory accesses| × Miss rate × Miss penalty cycles = IC × MPI × Miss penalty cycles • CPU time = (CPU execution cycles + Memory stall cycles) × cycle time = IC × (CPI_execution + MPI × Miss penalty cycles) × cycle time
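Plugging illustrative numbers (assumed for this sketch, not taken from the lecture) into the formulas above:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed example values -- not from the slides. */
    double IC            = 1e9;    /* instructions                          */
    double CPI_exec      = 1.0;    /* cycles per instruction, ignoring misses */
    double mem_per_instr = 0.4;    /* fraction of instructions accessing memory */
    double miss_rate     = 0.05;
    double miss_penalty  = 100.0;  /* cycles                                */
    double cycle_time    = 1e-9;   /* seconds (1 GHz clock)                 */

    double MPI      = mem_per_instr * miss_rate;              /* misses/instr */
    double cpu_time = IC * (CPI_exec + MPI * miss_penalty) * cycle_time;

    printf("MPI = %.3f, CPU time = %.2f s\n", MPI, cpu_time);
    /* MPI = 0.020, CPU time = 1e9 * (1.0 + 2.0) * 1e-9 = 3.00 s */
    return 0;
}
```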

  47. Memory Update Policy on Writes • Write back (lazy writes to DRAM; prefer cache) • Write through (immediately writing to DRAM)

  48. Write Back • Store operations that hit the cache • Write only to cache; main memory not accessed • Line marked as “modified” or “dirty” • When evicted, line written to memory only if dirty • Pros: • Saves memory accesses when line updated more than once • Attractive for multicore/multiprocessor • Cons: • On eviction, the entire line must be written to memory (there’s no indication which bytes within the line were modified) • Read miss might require writing to memory (evicted line is dirty) • More susceptible to “soft errors” • Transient errors; in some designs detectable but unrecoverable; especially problematic for supercomputers
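A minimal write-back sketch in C (hypothetical structures; the allocation policy and the real DRAM interface are omitted): a store hit only updates the cache and sets the dirty bit, and memory is written only when a dirty line is evicted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 32u

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
};

/* Stand-in for the path to DRAM -- assumed, not from the slides. */
static void dram_write_line(uint32_t tag, const uint8_t *data)
{
    (void)data;
    printf("writing back dirty line with tag 0x%X\n", (unsigned)tag);
}

/* Store hit: update only the cache and mark the line dirty;
 * main memory is not accessed and becomes stale for this line. */
static void store_hit(struct line *l, uint32_t offset, uint8_t value)
{
    l->data[offset] = value;
    l->dirty = true;
}

/* Eviction: the entire line goes back to memory, but only if dirty
 * (there is no record of which bytes within the line changed). */
static void evict(struct line *l)
{
    if (l->valid && l->dirty)
        dram_write_line(l->tag, l->data);
    l->valid = l->dirty = false;
}

int main(void)
{
    struct line l = { .valid = true, .tag = 0x048D1 };
    store_hit(&l, 0x18, 42);   /* hit: cache only, line marked dirty */
    evict(&l);                 /* dirty -> written back to DRAM      */
    return 0;
}
```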

  49. Write Through • Stores that hit the cache • Write to cache, and • Write to memory • Need to write only the bytes that were changed • Not entire line • Less work • When evicting, no need to write to DRAM • Never dirty, so don’t need to be written • Still need to throw stuff out, though • Use write buffers • To mask waiting for lower level memory

  50. Write through: need a write buffer • [Diagram: Processor → Cache; stores also go into a Write Buffer that drains to DRAM] • A write buffer sits between the cache & memory • Processor core: writes data into the cache & the write buffer • Memory controller: writes the contents of the buffer to memory • Works OK as long as stores arrive slowly relative to the DRAM write cycle (the buffer drains faster than it fills) • Otherwise the store buffer overflows no matter how big it is • Write combining • Combine adjacent writes to the same location in the write buffer • Note: on a cache miss, need to look up the write buffer (or drain it) • (A sketch follows below.)
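A sketch of a small write buffer with write combining (sizes, structures, and policy are illustrative assumptions): a new store merges into an existing entry for the same line when possible, otherwise it takes a free slot, and the processor must stall when the buffer is full.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4u
#define LINE_SIZE  32u

struct wb_entry {
    bool     valid;
    uint32_t line_addr;       /* line-aligned address                    */
    uint8_t  data[LINE_SIZE];
    uint32_t byte_mask;       /* which of the 32 bytes hold valid data   */
};

static struct wb_entry wbuf[WB_ENTRIES];

/* Returns false if the buffer is full: the store must stall until the
 * memory controller drains an entry to DRAM. */
static bool wbuf_store(uint32_t addr, uint8_t value)
{
    uint32_t line   = addr & ~(LINE_SIZE - 1);
    uint32_t offset = addr &  (LINE_SIZE - 1);

    /* Write combining: merge with an existing entry for the same line. */
    for (uint32_t i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].line_addr == line) {
            wbuf[i].data[offset] = value;
            wbuf[i].byte_mask |= 1u << offset;
            return true;
        }

    /* Otherwise allocate a free entry. */
    for (uint32_t i = 0; i < WB_ENTRIES; i++)
        if (!wbuf[i].valid) {
            wbuf[i] = (struct wb_entry){ .valid = true, .line_addr = line };
            wbuf[i].data[offset] = value;
            wbuf[i].byte_mask = 1u << offset;
            return true;
        }

    return false;   /* buffer full */
}

int main(void)
{
    /* Two stores to the same line combine into one buffer entry. */
    printf("%d %d\n", wbuf_store(0x1000, 1), wbuf_store(0x1004, 2));
    return 0;
}
```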
