
CS 201 The Memory Hierarchy II



  1. CS 201 The Memory Hierarchy II

  2. Last class
Two extremes in organizing caches
• Direct mapped
  • E = 1 (one block per set)
  • Each main memory address can be placed in exactly one place in the cache
  • Conflict misses occur if two addresses map to the same place in a direct-mapped cache
• Fully associative
  • E = C/B (one set containing all blocks)
  • Each main memory address can be placed anywhere in the cache
  • Eliminates conflict misses, but at a significant hardware cost
• Set associative
  • A hybrid between the two approaches

  3. Set associative caches
Each memory address is assigned to a particular set in the cache, but not to a specific block within the set.
• Set sizes range from 1 (direct mapped) to all blocks in the cache (fully associative).
• Larger sets and higher associativity lead to fewer cache conflicts and lower miss rates, but they also increase the hardware cost.

  4. Set associative caches
Cache is divided into groups of blocks, called sets.
• Each memory address maps to exactly one set in the cache, but data may be placed in any block within that set.
• If each set has 2^x blocks (E), the cache is a 2^x-way associative cache.
[Figure: organizations of an eight-block cache — 1-way associativity (direct mapped): 8 sets, 1 block each; 2-way associativity: 4 sets, 2 blocks each; 4-way associativity: 2 sets, 4 blocks each; 8-way (full) associativity: 1 set of 8 blocks]

  5. Locating a set associative block
We can determine where a memory address belongs in an associative cache in a similar way as for direct-mapped caches. If a cache has 2^s sets and each block holds 2^b bytes, an m-bit memory address can be partitioned as follows.
[Figure: address (m bits) split into Tag (t bits), Index (s bits), Block offset (b bits)]
The set index selects the set in the cache where the block is located.

  6. Accessing set associative caches
Set selection: the s set-index bits of the address select one set; each set contains E lines, each with a valid bit, a tag, and a cache block.
[Figure: sets 0 through S-1, each holding several (valid, tag, cache block) lines; address split into t tag bits, s set index bits, b block offset bits]

  7. Accessing set associative caches
Line matching and word selection:
• Like a fully associative cache, the cache must compare the tags in each valid block of the selected set.
• (1) The valid bit must be set.
• (2) The tag bits in one of the cache lines must match the tag bits in the address.
• (3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
[Figure: selected set i with two valid lines, tags 1001 and 0110; the address tag 0110 matches the second line, and block offset 100 selects the starting byte among words w0–w3]

  8. 2-way set associative cache implementation
Only two comparators are needed, compared to a fully associative cache.
[Figure: address split into Tag (t bits), Index (s bits), and Block offset; the index selects one of 2^s sets, each holding two (Valid, Tag, Data) entries; two tag comparators feed a 2-to-1 mux that produces the Hit signal and the selected data]

  9. Examples
Assume an 8-block cache with 16 bytes per block.
• Where would data from address 110000 0110011 go?
• B = 16 bytes, so the lowest 4 bits are the block offset.
[Figure: 1-way associativity (8 sets, 1 block each), 2-way associativity (4 sets, 2 blocks each), 4-way associativity (2 sets, 4 blocks each)]

  10. Examples
Assume an 8-block cache with 16 bytes per block.
• Where would data from address 110000 0110011 go?
• B = 16 bytes, so the lowest 4 bits are the block offset.
• Set index:
  • 1-way (i.e., direct-mapped) cache: 8 sets, so the next three bits (011) are the set index.
  • 2-way cache: 4 sets, so the next two bits (11) are the set index.
  • 4-way cache: 2 sets, so the next one bit (1) is the set index.
• Data may go in any block (shown in green) within the correct set.
• The top bits are always the tag.
[Figure: the candidate sets highlighted in the 1-way, 2-way, and 4-way organizations]

  11. Block replacement
Any empty block in the correct set may be used for storing data. If there are no empty blocks, the cache will attempt to replace one.
• Common replacement policy: least recently used (LRU)
• Assumption: if a block hasn't been used in a while, it won't be needed again anytime soon.
[Figure: 4-way associativity, 2 sets of 4 blocks each]

  12. LRU example
Assume a fully associative cache with two blocks. Which of the following memory references miss in the cache? (Assume distinct addresses go to distinct blocks.)
Reference string: A B A C B A B
• A: miss (cache: A, –)
• B: miss (cache: A, B)
• A: hit (LRU block is now B)
• C: miss, replaces B (cache: A, C)
• B: miss, replaces A (cache: B, C)
• A: miss, replaces C (cache: B, A)
• B: hit

  13. 2-way set associative cache
Consider the following set associative cache: (S,E,B,m) = (2,2,1,4)
• Derive the number of address bits used for the tag (t), the set index (s), and the block offset (b):
  s = ?  b = ?  t = ?
• Draw a diagram of which bits of the address are used for the tag, the set index, and the block offset
• Draw a diagram of the cache

  14. 2-way set associative cache
Consider the following set associative cache: (S,E,B,m) = (2,2,1,4)
• Derive the number of address bits used for the tag (t), the set index (s), and the block offset (b):
  s = 1  b = 0  t = 3
• Address diagram: Tag (t, 3 bits) | Set index (s, 1 bit); no block offset bits, since B = 1.
• Cache diagram: two sets (s = 0 and s = 1), each with two (Tag, Data) entries.

  15. 2-way set associative LRU cache
(S,E,B,m) = (2,2,1,4); address = Tag (3 bits) | Set index (1 bit)
Access pattern: 0010 0110 0010 0000 0010 0110 …
[Figure: main memory alongside a cache with two sets, each holding two (Tag, Data) entries]

  16. 2-way set associative LRU cache
Access 0010: cache miss

  17. 2-way set associative LRU cache
Access 0110: cache miss

  18. 2-way set associative LRU cache
Access 0010: cache hit

  19. 2-way set associative LRU cache
Access 0000: cache miss (replace LRU block)

  20. 2-way set associative LRU cache
Access 0010: cache hit

  21. 2-way set associative LRU cache
Access 0110: cache miss

  22. Practice problem 6.13
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Excluding the overhead of tags and valid bits, what is the capacity of this cache?
• Draw a figure of this cache
• Draw a diagram that shows the parts of the address that are used to determine the cache tag, cache set index, and cache block offset

  23. Practice problem 6.13
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Excluding the overhead of tags and valid bits, the capacity is S × E × B = 8 × 2 × 4 = 64 bytes
• Cache figure: 8 sets (0–7), each with two (Tag, Data0-3) entries
• Address diagram: Tag (t, 8 bits) | Set index (s, 3 bits) | Block offset (b, 2 bits)

  24. Practice problem 6.14
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Note: invalid cache lines are left blank
Consider an access to 0x0E34.
• What is the block offset of this address in hex?
• What is the set index of this address in hex?
• What is the cache tag of this address in hex?
• Does this access hit or miss in the cache?
• What value is returned if it is a hit?
[Figure: cache contents — 8 sets, each with two (Tag, Data0-3) entries; address = Tag (8 bits) | Set index (3 bits) | Block offset (2 bits)]

  25. Practice problem 6.14
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Note: invalid cache lines are left blank
Consider an access to 0x0E34.
• Block offset: 00 = 0x0
• Set index: 101 = 0x5
• Cache tag: 01110001 = 0x71
• Does this access hit or miss? Hit
• Value returned: 0x0B

  26. Practice problem 6.15
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Note: invalid cache lines are left blank
Consider an access to 0x0DD5.
• What is the block offset of this address in hex?
• What is the set index of this address in hex?
• What is the cache tag of this address in hex?
• Does this access hit or miss in the cache?
• What value is returned if it is a hit?

  27. Practice problem 6.15
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Note: invalid cache lines are left blank
Consider an access to 0x0DD5.
• Block offset: 01 = 0x1
• Set index: 101 = 0x5
• Cache tag: 01101110 = 0x6E
• Does this access hit or miss? Miss (invalid block)
• Value returned: none (miss)

  28. Practice problem 6.17
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Note: invalid cache lines are left blank
List all hex memory addresses that will hit in Set 3.

  29. Practice problem 6.17
Consider a 2-way set associative cache: (S,E,B,m) = (8,2,4,13)
• Note: invalid cache lines are left blank
List all hex memory addresses that will hit in Set 3:
• 00110010 011 xx = 0x64C, 0x64D, 0x64E, 0x64F

  30. Cache design
Design parameters:
• Size
• Associativity
• Block size
• Replacement algorithm
• Write policy
• Number of caches

  31. Block (cache line) size
• 8 to 64 bytes is typical
• Larger blocks decrease the number of blocks in a given cache size, while including words that might not be accessed

  32. Replacement algorithms
Most common: LRU (least recently used)
Others:
• First-in first-out (FIFO): replace the block that has been in the cache longest
• Least frequently used (LFU): replace the block that has had the fewest hits
• Random

  33. Write policy
Governs how data is written to memory through the cache.
• Must not overwrite a cache block unless main memory is up to date
  • When do you update main memory?
• Must work in a multi-CPU, multi-core system
  • Multiple CPUs with individual caches

  34. Write through
Cache policy where all writes go to main memory as well as to the cache.
• Multiple CPUs can monitor main memory traffic to keep their local caches up to date
• Generates lots of memory traffic
• Slows down writes

  35. Write back
Cache policy where updates are initially made in the cache only.
• An update bit for the cache slot is set when an update occurs
• If a block is to be replaced, it is written to main memory only if the update bit is set
• Other caches can get out of sync
• Some CPUs allow one to set the cache policy for a particular program or memory page

  36. Number of caches
As logic density increases, it has become advantageous and practical to create multi-level caches:
• on chip
• off chip
Examples of a deepening hierarchy:
• Pentium II: L1 (on-chip) & L2 (off-chip)
• Pentium III: L1 and L2 (on-chip), L3 (off-chip)
• Intel Core: L1 and L2 per core (on-chip), unified L3 (on-chip)

  37. Number of caches
Another way to add caches is to split them based on what they store:
• Data cache
• Program (instruction) cache
Advantage:
• Likely increased hit rates, since data and program accesses display different behavior
Disadvantage:
• Complexity

  38. Split cache example
Separate data and instruction caches vs. a unified cache.
Intel Pentium cache hierarchy:
• L1 Data (on-chip): 16 KB, 4-way associative, write-through, 32 B lines, 1-cycle latency
• L1 Instruction (on-chip): 16 KB, 4-way associative, 32 B lines
• L2 Unified: 128 KB – 2 MB, 4-way associative, write-back, write-allocate, 32 B lines
• Main memory: up to 4 GB

  39. Querying cache parameters
Exposed via the CPUID x86 instruction.
• getconf command in Linux
• grep . /sys/devices/system/cpu/cpu0/cache/index*/*

  40. The Memory Mountain
Measuring the memory performance of a system.
Metric used: read throughput (read bandwidth)
• Number of bytes read from memory per second (MB/s)
• Measured read throughput is a function of spatial and temporal locality.
• Compact way to characterize memory system performance.

  41. Memory Mountain Test Function
/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}

  42. Memory Mountain Main Routine
/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10)   /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)   /* ... up to 8 MB */
#define MAXSTRIDE 16         /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];          /* The array we'll be traversing */

int main()
{
    int size;     /* Working set size (in bytes) */
    int stride;   /* Stride (in array elements) */
    double Mhz;   /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
    Mhz = mhz(0);              /* Estimate the clock frequency */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

  43. The Memory Mountain
[Figure: read throughput surface over working-set size and stride, showing L1, L2, and L3 ridges of temporal locality and slopes of spatial locality]
Intel Core i7 2.67 GHz, 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache

  44. Ridges of Temporal Locality
Slice through the memory mountain with stride = 16
• illuminates the read throughputs of the different caches and memory

  45. A Slope of Spatial Locality
Slice through the memory mountain with size = 4 MB
• shows the cache block size.

  46. Matrix Multiplication Example
Writing memory efficient code. Cache effects to consider:
• Total cache size
  • Exploit temporal locality and keep the working set small (e.g., by using blocking)
• Block size
  • Exploit spatial locality
Description:
• Multiply N x N matrices
• O(N^3) total operations
• Accesses
  • N reads per source element
  • N values summed per destination (but may be able to hold in a register)
Variable sum held in register:

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

  47. Miss Rate Analysis for Matrix Multiply
Assume:
• Double-precision floating point numbers
• Line size = 32 bytes (big enough for four 8-byte doubles)
• Matrix dimension (N) is very large
  • Approximate 1/N as 0.0
• Cache is not even big enough to hold multiple rows
Analysis method:
• Look at the access pattern of the inner loop over A, B, and C

  48. Layout of C Arrays in Memory (review)
C arrays are allocated in row-major order
• each row in contiguous memory locations
Stepping through columns in one row:
• for (i = 0; i < N; i++) sum += a[0][i];
• accesses successive elements
Stepping through rows in one column:
• for (i = 0; i < n; i++) sum += a[i][0];
• accesses distant elements
• no spatial locality!
• compulsory miss rate = 1 (i.e., 100%)

  49. Matrix Multiplication (ijk)
/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

  50. Matrix Multiplication (jik)
/* jik */
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
