
Computer Architecture Cache Memory


Presentation Transcript


  1. Computer Architecture: Cache Memory. By Dan Tsafrir, 26/3/2012, 2/4/2012. Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz

  2. In the olden days… • The predecessor of ENIAC (the first general-purpose electronic computer) • Designed & built in 1944-1949 by Eckert & Mauchly (who also invented ENIAC), with John Von Neumann • Unlike ENIAC, binary rather than decimal, and a “stored program” machine • Operational until 1961 • EDVAC (Electronic Discrete Variable Automatic Computer)

  3. In the olden days… • In 1945, Von Neumann wrote: “…This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: at the memory.” • Von Neumann & EDVAC

  4. In the olden days… • Later, in 1946, he wrote: “…Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available… …We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible” • Von Neumann & EDVAC

  5. Not so long ago… • In 1994, in their paper “Hitting the Memory Wall: Implications of the Obvious”, William Wulf and Sally McKee said:“We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.”

  6. Not so long ago… • [Chart: processor vs. DRAM performance, 1980–2000, log scale; the performance gap grew ~50% per year] • CPU: 60% per year (2× in 1.5 years) • DRAM: 9% per year (2× in 10 years)

  7. More recently (2008)… • The memory wall in the multicore era • [Chart: performance in seconds (lower = slower) vs. number of processor cores, for a conventional architecture]

  8. Memory Trade-Offs • Large (dense) memories are slow • Fast memories are small, expensive, and consume high power • Goal: give the processor the feeling that it has a memory that is large (dense), fast, cheap, and consumes low power • Solution: a hierarchy of memories • [Diagram: CPU → L1 Cache → L2 Cache → L3 Cache → Memory (DRAM); moving away from the CPU — speed: fastest → slowest, size: smallest → biggest, cost: highest → lowest, power: highest → lowest]

  9. Typical levels in mem hierarchy

  10. Why Hierarchy Works: Locality • Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon • Example: code and variables in loops ⇒ keep recently accessed data closer to the processor • Spatial Locality (Locality in Space): if an item is referenced, nearby items tend to be referenced soon • Example: scanning an array ⇒ move contiguous blocks closer to the processor • Due to locality, a memory hierarchy is a good idea • We’re going to use what we’ve just recently used • And we’re going to use its immediate neighborhood • (See the sketch below.)
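To make this concrete, here is a small illustrative C sketch (not from the slides): summing a matrix row by row walks memory contiguously (good spatial locality), while summing it column by column strides across cache lines (poor spatial locality).

```c
#include <stdio.h>
#include <stddef.h>

#define N 512

static int a[N][N];

/* Good spatial locality: rows of a C array are contiguous in memory,
 * so consecutive accesses hit the same cache line before moving on. */
static long sum_row_major(void)
{
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: each access jumps N*sizeof(int) bytes,
 * touching a different cache line almost every time. */
static long sum_col_major(void)
{
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    /* Same result, very different cache behavior on large arrays. */
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}
```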

  11. Bad locality behavior • [Plot: memory address (one dot per access) vs. time, with regions exhibiting temporal locality, spatial locality, and bad locality] • Programs with locality cache well... • Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

  12. 2012-04-02

  13. Memory Hierarchy: Terminology • For each memory level define the following • Hit: the data appears in the memory level • Hit Rate: the fraction of accesses found in that level • Hit Time: time to access the memory level • Also includes the time to determine hit/miss • Miss: need to retrieve the data from the next level • Miss Rate: 1 − (Hit Rate) • Miss Penalty: time to bring in the missing info (replace a block) + time to deliver the info to the accessor • Average memory access time = t_effective = (Hit Time × Hit Rate) + (Miss Time × Miss Rate) = (Hit Time × Hit Rate) + (Miss Time × (1 − Hit Rate)) • If the hit rate is close to 1, t_effective is close to the hit time, which is generally what we want

  14. Effective Memory Access Time • Cache – holds a subset of the memory • Hopefully – the subset that is being used now • Known as “the working set” • Effective memory access time • t_effective = (t_cache × Hit Rate) + (t_mem × (1 − Hit Rate)) • t_mem includes the time it takes to detect a cache miss • Example • Assume: t_cache = 10 ns, t_mem = 100 ns

  Hit Rate (%) | t_eff (ns)
  0            | 100
  50           | 55
  90           | 19
  99           | 10.9
  99.9         | 10.1

  • As t_mem/t_cache goes up, it becomes more important that the hit rate be close to 1
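The table is just the formula above evaluated for the slide's numbers (t_cache = 10 ns, t_mem = 100 ns); a few lines of C reproduce it:

```c
#include <stdio.h>

/* t_effective = t_cache * hit_rate + t_mem * (1 - hit_rate)
 * (t_mem here already includes the time to detect the miss, as on the slide) */
static double t_effective(double t_cache, double t_mem, double hit_rate)
{
    return t_cache * hit_rate + t_mem * (1.0 - hit_rate);
}

int main(void)
{
    const double t_cache = 10.0, t_mem = 100.0;   /* ns, as in the example */
    const double hit_rates[] = { 0.0, 0.50, 0.90, 0.99, 0.999 };

    for (size_t i = 0; i < sizeof hit_rates / sizeof hit_rates[0]; i++)
        printf("hit rate %.1f%% -> t_eff = %.1f ns\n",
               100.0 * hit_rates[i],
               t_effective(t_cache, t_mem, hit_rates[i]));
    return 0;
}
```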

  15. Cache – main idea • [Diagram: main memory blocks 0, 1, 2, 3, …, 90, 91, 92, 93, …; a small cache currently holding, e.g., blocks 2, 4, 90, 92] • The cache holds a small part of the entire memory • Need to map parts of the memory into the cache • Main memory is (logically) partitioned into “blocks” or “lines” or, when the info is cached, “cachelines” • Typical block size is 32, 64, 128 bytes • Blocks are “aligned” in memory • The cache is partitioned into cache lines • Each cache line holds a block • Only a subset of the blocks is mapped to the cache at a given time • The cache views an address as a ⟨block #, offset⟩ pair • Why use lines/blocks rather than words?

  16. Cache Lookup • [Diagram: as before, memory blocks and a cache holding blocks 2, 4, 90, 92] • Cache hit • The block is mapped to the cache – return data according to the block’s offset • Cache miss • The block is not mapped to the cache ⇒ do a cacheline fill • Fetch the block into a fill buffer • May require a few bus cycles • Write the fill buffer into the cache • May need to evict another block from the cache • Make room for the new block

  17. Checking valid bit & tag • [Diagram: the address is split into Tag (= block #, bits 31..5) and Offset (bits 4..0); each line in the tag array holds a valid bit (v) and a tag; comparators match the address tag against every stored tag to produce hit/data] • Initially the cache is empty • Need to have a “line valid” indication – a line valid bit • A line may also be invalidated

  18. Cache organization • We will go over • Direct-mapped cache • 2-way set associative cache • N-way set associative cache • Fully-associative cache

  19. Direct Map Cache • [Diagram: the address splits into Tag (bits 31..14), Set (bits 13..5), Offset (bits 4..0); the Set field indexes the tag array and the data array, which hold 2^9 = 512 sets] • Offset • Byte within the cache-line • Set • The index into the “cache array” and the “tag array” • For a given set (an index), only one of the cache lines that has this set can reside in the cache • Tag • The rest of the block bits are used as the tag • This is how we identify the individual cache line • Namely, we compare the tag of the address to the tag stored in the cache’s tag array

  20. Direct Map Cache (cont) • [Diagram: memory partitioned into cache-size slices; block x in every slice maps to the same set X] • Partition memory into slices • Slice size = cache size • Partition each slice into blocks • Block size = cache line size • The distance of a block from the start of its slice indicates its position in the cache (the set) • Advantages • Easy & fast hit/miss resolution • Easy & fast replacement algorithm • Lowest power • Disadvantage • A line has only “one chance” • Lines are replaced due to “set conflict misses” • The organization with the highest miss-rate

  21. Direct Map Cache – Example • Address fields: Tag = bits 31..14, Set = bits 13..5, Offset = bits 4..0 • Line size: 32 bytes ⇒ 5 offset bits • Cache size: 16KB = 2^14 bytes • #lines = cache size / line size = 2^14 / 2^5 = 2^9 = 512 • #sets = #lines = 512 ⇒ #set bits = 9 (bits 5…13) • #tag bits = 32 − (#set bits + #offset bits) = 32 − (9+5) = 18 (bits 14…31) • Lookup address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000 • tag = 0x048D1, set = 0x0B3, offset = 0x18
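A small C sketch of this address split, assuming the example's parameters (32-byte lines, 16 KB direct-mapped cache, 32-bit addresses); the constants and names are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE   32u                       /* bytes -> 5 offset bits      */
#define CACHE_SIZE  (16u * 1024)              /* bytes                       */
#define NUM_SETS    (CACHE_SIZE / LINE_SIZE)  /* 512  -> 9 set bits          */

#define OFFSET_BITS 5u
#define SET_BITS    9u

int main(void)
{
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & (LINE_SIZE - 1);                 /* bits 4..0   */
    uint32_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* bits 13..5  */
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);       /* bits 31..14 */

    printf("tag=0x%05X set=0x%03X offset=0x%02X\n",
           (unsigned)tag, (unsigned)set, (unsigned)offset);
    /* prints: tag=0x048D1 set=0x0B3 offset=0x18 */
    return 0;
}
```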

  22. Direct map (tiny example) • Assume • Memory size is 2^5 = 32 bytes • For this, we need a 5-bit address • A block is comprised of 4 bytes • Thus, there are exactly 8 blocks • Note • We need only 3 bits to identify a block • The offset is exclusively used within the cache lines • The offset is not used to locate the cache line • [Diagram: 5-bit addresses such as 00001, 01110, 11111, split into a block index (upper 3 bits) and an offset within the block (lower 2 bits)]

  23. Direct map (tiny example) • Further assume • The size of our cache is 2 cache-lines (⇒ we need 2 = 5 − 2 − 1 tag bits) • The address divides like so • b4 b3 | b2 | b1 b0 • tag | set | offset • [Diagram: blocks with an even block index map to cache line 0, blocks with an odd index to cache line 1; shown: tag array (bits), cache array (bytes), memory array (bytes)]

  24. Direct map (tiny example) • Accessing address 00 0 10 (= the byte marked “C” in the figure) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (00) | set (0) | offset (10) • [Diagram: tag array (bits), cache array (bytes), memory array (bytes)]

  25. Direct map (tiny example) • Accessing address 01 0 10 (= Y) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (01) | set (0) | offset (10) • [Diagram as above]

  26. Direct map (tiny example) • Accessing address 10 0 10 (= Q) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (10) | set (0) | offset (10) • [Diagram as above]

  27. Direct map (tiny example) • Accessing address 11 0 10 (= J) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (11) | set (0) | offset (10) • [Diagram as above]

  28. Direct map (tiny example) • Accessing address 00 1 10 (= B) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (00) | set (1) | offset (10) • [Diagram as above]

  29. Direct map (tiny example) • Accessing address 01 1 10 (= Y) • The address divides like so • b4 b3 | b2 | b1 b0 • tag (01) | set (1) | offset (10) • [Diagram as above]

  30. Direct map (tiny example) • Now assume • The size of our cache is 4 cache-lines • The address divides like so • b4 | b3 b2 | b1 b0 • tag | set | offset • [Diagram: tag array (bits), cache array (bytes), memory array (bytes)]

  31. Direct map (tiny example) • Now assume • The size of our cache is 4 cache-lines • The address divides like so • b4 | b3 b2 | b1 b0 • tag | set | offset • [Diagram as above, after the mapping]

  32. 2-Way Set Associative Cache • Address fields: Tag = bits 31..13, Set = bits 12..5, Offset = bits 4..0 • [Diagram: two ways (WAY #0, WAY #1), each with its own tag array and cache storage, indexed by the same Set#] • Each set holds two lines (way 0 and way 1) • Each block can be mapped into one of two lines in the appropriate set (HW checks both ways in parallel) • The cache is effectively partitioned into two • Example: • Line size: 32 bytes • Cache size: 16KB • #lines: 512 • #sets: 256 • Offset bits: 5 • Set bits: 8 • Tag bits: 19 • Address: 0001 0010 0011 0100 0101 0110 0111 1000 (= 0x12345678) • Offset: 1 1000 = 0x18 = 24 • Set: 1011 0011 = 0x0B3 = 179 • Tag: 000 1001 0001 1010 0010 = 0x091A2

  33. 2-Way Cache – Hit Decision • [Diagram: the address splits into Tag (31..13), Set (12..5), Offset (4..0); the Set field indexes both ways; each way’s stored tag is compared (=) with the address tag; a match drives Hit/Miss and selects that way’s data through a MUX to Data Out]

  34. 2-Way Set Associative Cache (cont) • [Diagram: memory partitioned into way-size slices; block x in every slice maps to set X] • Partition memory into “slices” or “ways” • Slice size = way size = ½ cache size • Partition each slice into blocks • Block size = cache line size • The distance of a block from the start of its slice indicates its position in the cache (the set) • Compared to a direct map cache • Half-size slices ⇒ 2× the number of slices ⇒ 2× the number of blocks mapped to each set in the cache • But in each set we can hold 2 blocks at a given time • More logic, warmer, more power consuming, but fewer collisions/evictions • (A lookup sketch follows below.)
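A minimal C sketch of a 2-way set-associative lookup under the example's parameters (16 KB, 32-byte lines, 256 sets); the structures and names are illustrative, and the line fill on a miss is left to the caller:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SETS    256u   /* 16 KB / 32-byte lines / 2 ways, as in the example */
#define NUM_WAYS    2u
#define OFFSET_BITS 5u
#define SET_BITS    8u

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[32];
};

static struct line cache[NUM_SETS][NUM_WAYS];

/* Returns a pointer to the requested byte on a hit, NULL on a miss.
 * Hardware checks both ways in parallel; software just loops over them. */
static uint8_t *lookup(uint32_t addr)
{
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    for (uint32_t way = 0; way < NUM_WAYS; way++) {
        struct line *l = &cache[set][way];
        if (l->valid && l->tag == tag)
            return &l->data[offset];   /* hit */
    }
    return NULL;                       /* miss: caller performs a line fill */
}

int main(void)
{
    uint8_t *p = lookup(0x12345678);   /* empty cache -> miss */
    printf("%s\n", p ? "hit" : "miss");
    return 0;
}
```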

  35. N-way set associative cache • Similar to the 2-way case • At the extreme, every cache line is a way (fully associative)…

  36. Fully Associative Cache • [Diagram: the address splits into Tag (= block #, bits 31..5) and Offset (bits 4..0); every line’s tag is compared in parallel against the address tag to produce hit/data] • An address is partitioned into • offset within the block • block number • Each block may be mapped to each of the cache lines • Look up the block in all lines • Each cache line has a tag • All tags are compared to the block # in parallel • Need a comparator per line • If one of the tags matches the block #, we have a hit • Supply data according to the offset • Best hit rate, but most wasteful • Must be relatively small

  37. Fully Associative Cache • [Diagram as on the previous slide] • Is said to be a “CAM” • Content Addressable Memory

  38. Cache organization summary • Increasing set associativity • Improves hit rate • Increases power consumption • Increases access time • Strike a balance

  39. Cache Read Miss • On a read miss – perform a cache line fill • Fetch the entire block that contains the missing data from memory • The block is fetched into the cache line fill buffer • May take a few bus cycles to complete the fetch • e.g., 64-bit (8-byte) data bus, 32-byte cache line ⇒ 4 bus cycles • Can stream (forward) the critical chunk into the core before the line fill ends • Once the entire block is fetched into the fill buffer • It is moved into the cache • (See the sketch below.)
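A small illustrative sketch of the arithmetic above and of fetching the critical chunk first (a wrap-around burst order); the address and constants are assumptions for the example:

```c
#include <stdio.h>

#define LINE_SIZE 32u   /* bytes */
#define BUS_WIDTH  8u   /* bytes per bus cycle (64-bit data bus) */

int main(void)
{
    unsigned miss_addr = 0x12345678u;               /* address that missed  */
    unsigned line_base = miss_addr & ~(LINE_SIZE - 1);
    unsigned chunks    = LINE_SIZE / BUS_WIDTH;     /* 32/8 = 4 bus cycles  */

    /* Fetch the chunk that contains the missing word first, then wrap
     * around, so the critical chunk can be forwarded to the core early. */
    unsigned first = (miss_addr & (LINE_SIZE - 1)) / BUS_WIDTH;
    for (unsigned i = 0; i < chunks; i++) {
        unsigned chunk = (first + i) % chunks;
        printf("bus cycle %u: bytes 0x%08X..0x%08X\n",
               i, line_base + chunk * BUS_WIDTH,
               line_base + chunk * BUS_WIDTH + BUS_WIDTH - 1);
    }
    return 0;
}
```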

  40. Cache Replacement • Direct map cache – easy • A new block is mapped to a single line in the cache • The old line is evicted (re-written to memory if needed) • N-way set associative cache – harder • Choose a victim from all the ways in the appropriate set • But which? To determine, use a replacement algorithm • Replacement algorithms • FIFO (First In First Out) • Random • LRU (Least Recently Used) • Optimum (theoretical, postmortem, called “Belady”) • Aside from the theoretical optimum, of the above, LRU is the best • But benchmarks show it is not that much better than random…

  41. 16-Apr-2012

  42. LRU Implementation • 2 ways • 1 bit per set to mark the latest way accessed in the set • Evict the way not pointed to by the bit • k-way set associative LRU • Requires a full ordering of way accesses • Algorithm: when way i is accessed:

  x = counter[i]
  counter[i] = k-1
  for (j = 0 to k-1)
      if ((j != i) && (counter[j] > x))
          counter[j]--

  • When replacement is needed • Evict the way with counter = 0 • Expensive even for small k’s • Because it is invoked for every load/store • Need a log2(k)-bit counter per line • Example (k = 4):

  Initial state:  way 0 1 2 3   count 0 1 2 3
  Access way 2:   way 0 1 2 3   count 0 1 3 2
  Access way 0:   way 0 1 2 3   count 3 0 2 1

  • (A runnable version follows below.)
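The counter algorithm above, written out as a runnable C sketch for one set with k = 4; it reproduces the state sequence of the example:

```c
#include <stdio.h>

#define K 4   /* associativity */

/* counter[i] == K-1 -> way i is the most recently used
 * counter[i] == 0   -> way i is the LRU victim            */
static int counter[K] = { 0, 1, 2, 3 };   /* initial state from the slide */

static void access_way(int i)
{
    int x = counter[i];
    counter[i] = K - 1;
    for (int j = 0; j < K; j++)
        if (j != i && counter[j] > x)
            counter[j]--;
}

static int victim(void)
{
    for (int i = 0; i < K; i++)
        if (counter[i] == 0)
            return i;
    return 0;   /* unreachable: exactly one counter is always 0 */
}

int main(void)
{
    access_way(2);   /* counters become 0 1 3 2 */
    access_way(0);   /* counters become 3 0 2 1 */
    printf("next victim: way %d\n", victim());   /* way 1 */
    return 0;
}
```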

  43. Pseudo LRU (PLRU) • In practice, it’s sufficient to efficiently approximate LRU • Maintain k−1 bits, instead of k ∙ log2(k) bits • Assume k=4, and let’s enumerate the ways’ cache lines • We need 2 bits: cache line 00, cl-01, cl-10, and cl-11 • Use a binary search tree to represent the 4 cache lines • Set each of the 3 (= k−1) internal nodes to hold a bit variable: B0, B1, and B2 • Whenever accessing a cache line b1b0 • Set each bit variable Bj along the path to the corresponding cache line bit bk • Can think of the bit value of Bj as “the right side was referenced more recently” • Need to evict? Walk the tree as follows: • Go left if Bj = 1; go right if Bj = 0 • Evict the leaf you’ve reached (= the opposite direction relative to previous accesses) • [Figure: a 3-node binary tree (B0 at the root, B1 and B2 below it) whose leaves are the cache lines 00, 01, 10, 11]

  44. Pseudo LRU (PLRU) – Example • Access 3 (11), 0 (00), 2 (10), 1 (01) ⇒ next victim is 3 (11), as expected • [Figure: the state of the tree bits B0, B1, B2 after each access, and the eviction walk that reaches leaf 11]
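A C sketch of the tree-PLRU scheme from the previous two slides, for k = 4 ways and 3 bits (B0, B1, B2). The left/right and bit conventions here are one consistent choice (possibly mirrored relative to the figure), and the sketch reproduces the example's outcome:

```c
#include <stdbool.h>
#include <stdio.h>

/* Tree-PLRU for a 4-way set: 3 bits instead of 4*log2(4) = 8 bits.
 * b0 records the high bit (b1) of the last accessed way; b1/b2 record
 * the low bit within each half. The victim is found by walking the
 * tree in the opposite direction of the recorded accesses. */
struct plru { bool b0, b1, b2; };

static void plru_access(struct plru *p, unsigned way)   /* way in 0..3 */
{
    bool hi = (way >> 1) & 1, lo = way & 1;
    p->b0 = hi;
    if (hi == 0) p->b1 = lo; else p->b2 = lo;
}

static unsigned plru_victim(const struct plru *p)
{
    bool hi = !p->b0;                      /* opposite of the last access */
    bool lo = (hi == 0) ? !p->b1 : !p->b2;
    return (unsigned)((hi << 1) | lo);
}

int main(void)
{
    struct plru p = { 0, 0, 0 };
    unsigned accesses[] = { 3, 0, 2, 1 };   /* the slide's example */
    for (unsigned i = 0; i < 4; i++)
        plru_access(&p, accesses[i]);
    printf("next victim: way %u\n", plru_victim(&p));   /* prints 3 */
    return 0;
}
```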

  45. LRU vs. Random vs. FIFO • LRU: hardest • FIFO: easier, approximates LRU (evicts the oldest line rather than the least recently used) • Random: easiest • Results: • Misses per 1000 instructions in L1-d, on average • Averaged across ten SPECint2000 / SPECfp2000 benchmarks • PLRU turns out rather similar to LRU

  46. Effect of Cache on Performance • MPI (misses per instruction) • The fraction of instructions (out of the total) that experience a miss • (Memory accesses per instruction = the fraction of instructions that access memory) • MPI = Memory accesses per instruction × Miss rate • Memory stall cycles = |Memory accesses| × Miss rate × Miss penalty cycles = IC × MPI × Miss penalty cycles • CPU time = (CPU execution cycles + Memory stall cycles) × cycle time = IC × (CPI_execution + MPI × Miss penalty cycles) × cycle time
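Plugging illustrative numbers (assumed for this sketch, not taken from the lecture) into the formulas above:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed example values -- not from the slides. */
    double IC            = 1e9;    /* instructions                          */
    double CPI_exec      = 1.0;    /* cycles per instruction, ignoring misses */
    double mem_per_instr = 0.4;    /* fraction of instructions accessing memory */
    double miss_rate     = 0.05;
    double miss_penalty  = 100.0;  /* cycles                                */
    double cycle_time    = 1e-9;   /* seconds (1 GHz clock)                 */

    double MPI      = mem_per_instr * miss_rate;              /* misses/instr */
    double cpu_time = IC * (CPI_exec + MPI * miss_penalty) * cycle_time;

    printf("MPI = %.3f, CPU time = %.2f s\n", MPI, cpu_time);
    /* MPI = 0.020, CPU time = 1e9 * (1.0 + 2.0) * 1e-9 = 3.00 s */
    return 0;
}
```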

  47. Memory Update Policy on Writes • Write back (lazy writes to DRAM; prefer cache) • Write through (immediately writing to DRAM)

  48. Write Back • Store operations that hit the cache • Write only to cache; main memory not accessed • Line marked as “modified” or “dirty” • When evicted, line written to memory only if dirty • Pros: • Saves memory accesses when line updated more than once • Attractive for multicore/multiprocessor • Cons: • On eviction, the entire line must be written to memory (there’s no indication which bytes within the line were modified) • Read miss might require writing to memory (evicted line is dirty) • More susceptible to “soft errors” • Transient errors; in some designs detectable but unrecoverable; especially problematic for supercomputers
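A minimal write-back sketch in C (hypothetical structures; the allocation policy and the real DRAM interface are omitted): a store hit only updates the cache and sets the dirty bit, and memory is written only when a dirty line is evicted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 32u

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
};

/* Stand-in for the path to DRAM -- assumed, not from the slides. */
static void dram_write_line(uint32_t tag, const uint8_t *data)
{
    (void)data;
    printf("writing back dirty line with tag 0x%X\n", (unsigned)tag);
}

/* Store hit: update only the cache and mark the line dirty;
 * main memory is not accessed and becomes stale for this line. */
static void store_hit(struct line *l, uint32_t offset, uint8_t value)
{
    l->data[offset] = value;
    l->dirty = true;
}

/* Eviction: the entire line goes back to memory, but only if dirty
 * (there is no record of which bytes within the line changed). */
static void evict(struct line *l)
{
    if (l->valid && l->dirty)
        dram_write_line(l->tag, l->data);
    l->valid = l->dirty = false;
}

int main(void)
{
    struct line l = { .valid = true, .tag = 0x048D1 };
    store_hit(&l, 0x18, 42);   /* hit: cache only, line marked dirty */
    evict(&l);                 /* dirty -> written back to DRAM      */
    return 0;
}
```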

  49. Write Through • Stores that hit the cache • Write to cache, and • Write to memory • Need to write only the bytes that were changed • Not entire line • Less work • When evicting, no need to write to DRAM • Never dirty, so don’t need to be written • Still need to throw stuff out, though • Use write buffers • To mask waiting for lower level memory

  50. Write through: need a write buffer • [Diagram: Processor → Cache; stores also go into a Write Buffer that drains to DRAM] • A write buffer sits between the cache & memory • Processor core: writes data into the cache & the write buffer • Memory controller: writes the contents of the buffer to memory • Works OK as long as stores arrive slowly relative to the DRAM write cycle (the buffer drains faster than it fills) • Otherwise the store buffer overflows no matter how big it is • Write combining • Combine adjacent writes to the same location in the write buffer • Note: on a cache miss, need to look up the write buffer (or drain it) • (A sketch follows below.)
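A sketch of a small write buffer with write combining (sizes, structures, and policy are illustrative assumptions): a new store merges into an existing entry for the same line when possible, otherwise it takes a free slot, and the processor must stall when the buffer is full.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4u
#define LINE_SIZE  32u

struct wb_entry {
    bool     valid;
    uint32_t line_addr;       /* line-aligned address                    */
    uint8_t  data[LINE_SIZE];
    uint32_t byte_mask;       /* which of the 32 bytes hold valid data   */
};

static struct wb_entry wbuf[WB_ENTRIES];

/* Returns false if the buffer is full: the store must stall until the
 * memory controller drains an entry to DRAM. */
static bool wbuf_store(uint32_t addr, uint8_t value)
{
    uint32_t line   = addr & ~(LINE_SIZE - 1);
    uint32_t offset = addr &  (LINE_SIZE - 1);

    /* Write combining: merge with an existing entry for the same line. */
    for (uint32_t i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].line_addr == line) {
            wbuf[i].data[offset] = value;
            wbuf[i].byte_mask |= 1u << offset;
            return true;
        }

    /* Otherwise allocate a free entry. */
    for (uint32_t i = 0; i < WB_ENTRIES; i++)
        if (!wbuf[i].valid) {
            wbuf[i] = (struct wb_entry){ .valid = true, .line_addr = line };
            wbuf[i].data[offset] = value;
            wbuf[i].byte_mask = 1u << offset;
            return true;
        }

    return false;   /* buffer full */
}

int main(void)
{
    /* Two stores to the same line combine into one buffer entry. */
    printf("%d %d\n", wbuf_store(0x1000, 1), wbuf_store(0x1004, 2));
    return 0;
}
```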
