
Machine Structures Lecture 25 – Caches III






Presentation Transcript


  1. Machine Structures Lecture 25 – Caches III • 100 Gig Eth?! Yesterday Infinera demoed a system that took a 100 GigE signal, broke it up into ten 10 GigE signals, and sent it over existing optical networks. Fast! gigaom.com/2006/11/14/100gbe

  2. What to do on a write hit? • Write-through • update the word in the cache block and the corresponding word in memory at the same time • Write-back • update only the word in the cache block • allow the word in memory to be “stale”  add a ‘dirty’ bit to each block indicating that memory needs to be updated when the block is replaced  the OS flushes the cache before I/O… • Performance trade-offs?

  3. Block Size Tradeoff (1/3) • Pros of a larger block size • Spatial locality: if we access a given word, we’re likely to access other nearby words soon • Works well with the stored-program concept: if we execute a given instruction, it’s very likely we’ll execute the next few as well • Also works nicely for sequential accesses to linear arrays

  4. Block Size Tradeoff (2/3) • Cons of a larger block size • a larger block means a larger miss penalty • on a miss, it takes longer to load the new block from memory • if the block is too large, there are too few blocks in the cache • result: the miss rate goes up • Overall, we want to minimize the average memory access time: Average Memory Access Time (AMAT) = Hit Time + Miss Penalty x Miss Rate

  5. Block Size Tradeoff (3/3) • Hit Time = time to find and retrieve data in the current-level cache • Miss Penalty = average time to retrieve data on a miss at the current level (including the possibility of misses at the next level) • Hit Rate = % of requests that are found in the current-level cache • Miss Rate = 1 - Hit Rate
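The AMAT formula above is easy to check numerically. Here is a minimal sketch in Python (the function name `amat` is mine, not from the slides); the numbers match the worked example in the bonus slide at the end:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% miss rate, 20-cycle miss penalty:
print(amat(1, 0.05, 20))  # 2.0 cycles
```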

  6. Extreme Example: One Big Block [figure: a single cache row with Valid bit, Tag, and data bytes B3 B2 B1 B0] • Cache Size = 4 bytes, Block Size = 4 bytes • Only ONE entry (row) in the cache! • If an item is accessed, it is likely to be accessed again soon • But it is unlikely to be accessed again immediately! • The next access will likely be a miss again • We continually load data into the cache but discard it (force it out) before we use it again • Nightmare for a cache designer: the Ping-Pong Effect

  7. Block Size Tradeoff Conclusions [figure: three plots vs. Block Size — Miss Rate (drops as spatial locality is exploited, then rises as fewer blocks compromise temporal locality), Miss Penalty (grows with block size), and Average Access Time (rises at large block sizes due to increased miss penalty & miss rate)]

  8. Types of Cache Misses (1/2) • “Three Cs” Model of Misses • 1st C: Compulsory Misses • occur when a program is first started • cache does not contain any of that program’s data yet, so misses are bound to occur • can’t be avoided easily, so won’t focus on these in this course

  9. Types of Cache Misses (2/2) • 2nd C: Conflict Misses • misses that occur because two distinct memory addresses map to the same cache location • two blocks (which unfortunately map to the same location) can keep overwriting each other • big problem in direct-mapped caches • how do we lessen this effect? • Dealing with Conflict Misses • Solution 1: make the cache bigger • fails at some point • Solution 2: let multiple distinct blocks map to the same cache index?

  10. Fully Associative Cache (1/3) • Memory address fields: • Tag: same as before • Offset: same as before • Index: nonexistent • What does this mean? • no “rows” to index into: any block can go in any row of the cache • must compare against all rows of the entire cache to see whether the data is in the cache

  11. Fully Associative Cache (2/3) • Fully Associative Cache (e.g., 32 B blocks) • compare tags in parallel [figure: address bits 31..4 = Cache Tag (27 bits long), bits 3..0 = Byte Offset; each row holds Valid, Cache Tag, and Cache Data B31 … B1 B0, with a comparator (=) on every row]

  12. Fully Associative Cache (3/3) • Pros of a fully associative cache • no conflict misses (since data can go anywhere) • Cons of a fully associative cache • need a hardware comparator for every entry: if we have 64KB of data in the cache with 4B entries, we need 16K comparators: infeasible

  13. Accessing data in a fully associative cache • 6 Addresses: • 0x00000014, 0x0000001C, 0x00000034, 0x00008014, 0x00000030, 0x0000001C • the 6 addresses divided (for convenience) into Tag and Byte Offset fields: 0000000000000000000000000001 0100 / 0000000000000000000000000001 1100 / 0000000000000000000000000011 0100 / 0000000000000000100000000001 0100 / 0000000000000000000000000011 0000 / 0000000000000000000000000001 1100 (Tag | Offset)
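The tag/offset split shown above can be sketched in Python. For these slides' 16-byte blocks the low 4 bits are the byte offset and the remaining 28 bits are the tag; there is no index field in a fully associative cache. (The function name is mine, not from the slides.)

```python
def split_fa_address(addr, offset_bits=4):
    """Split a 32-bit address into (tag, byte offset) for a fully
    associative cache with 16-byte blocks: 4 offset bits, no index."""
    offset = addr & ((1 << offset_bits) - 1)  # low bits: byte within block
    tag = addr >> offset_bits                 # everything else: tag
    return tag, offset

for a in [0x00000014, 0x0000001C, 0x00000034,
          0x00008014, 0x00000030, 0x0000001C]:
    tag, off = split_fa_address(a)
    print(f"addr {a:#010x}: tag {tag:#x}, offset {off:#06b}")
```

For example, 0x00008014 splits into tag 0x801 and offset 0b0100, matching the fourth breakdown above.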

  14. 16 KB Fully Associative Cache, 16B blocks [figure: 1024 rows (0–1023), each with a Valid bit, Tag, and data bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f; all Valid bits 0] • Valid bit: determines whether this row holds data (when the computer is initialized, all rows are invalid)

  15. 1. Read 0x00000014 • Tag field: 0000000000000000000000000001, Offset: 0100 [figure: cache with all rows invalid]

  16. Find a block matching the tag (00…001) • Tag field: 0000000000000000000000000001, Offset: 0100 [figure: tag compared against every row; all rows invalid]

  17. No match, so we find a block to fill • Tag field: 0000000000000000000000000001, Offset: 0100 [figure: an empty (invalid) row selected]

  18. So load that data into the cache, setting the tag and valid bit • Tag field: 0000000000000000000000000001, Offset: 0100 [figure: one row now valid with Tag = 1 and data a b c d]

  19. Get the data at the offset (0100) • Tag field: 0000000000000000000000000001, Offset: 0100 [figure: word at bytes 0x4-7 selected: value b]

  20. 2. Read 0x0000001C = 0…00 0…001 1100 • Tag field: 0000000000000000000000000001, Offset: 1100 [figure: one row valid with Tag = 1 and data a b c d]

  21. Find the tag in the cache • Tag field: 0000000000000000000000000001, Offset: 1100 [figure: tag compared against every row]

  22. The data is found in the cache (a hit) • Tag field: 0000000000000000000000000001, Offset: 1100 [figure: the valid row with Tag = 1 matches]

  23. Get the data at the offset (1100) • Tag field: 0000000000000000000000000001, Offset: 1100 [figure: word at bytes 0xc-f selected: value d]

  24. 3. Read 0x00000034 = 0…00 0…011 0100 • Tag field: 0000000000000000000000000011, Offset: 0100 [figure: one row valid with Tag = 1 and data a b c d]

  25. Find the tag 3 (0011) • Tag field: 0000000000000000000000000011, Offset: 0100 [figure: tag compared against every row]

  26. Tag (0011) not found: a miss • Tag field: 0000000000000000000000000011, Offset: 0100

  27. So load that data into the cache, setting the tag and valid bit • Tag field: 0000000000000000000000000011, Offset: 0100 [figure: a second row now valid with Tag = 3 and data e f g h]

  28. Get the data at the offset (0100) • Tag field: 0000000000000000000000000011, Offset: 0100 [figure: word at bytes 0x4-7 selected: value f]

  29. 4. Read 0x00008014 = 0…10 0…001 0100 • Tag field: 0000000000000000100000000001, Offset: 0100 [figure: rows valid with Tag = 1 (a b c d) and Tag = 3 (e f g h)]

  30. Find tag (0x801): not found, a miss • Tag field: 0000000000000000100000000001, Offset: 0100

  31. So load that data into the cache, setting the tag and valid bit • Tag field: 0000000000000000100000000001, Offset: 0100 [figure: a third row now valid with Tag = 0x801 and data i j k l]

  32. Get the data from the cache at the offset (0100) • Tag field: 0000000000000000100000000001, Offset: 0100 [figure: word at bytes 0x4-7 selected: value j]

  33. Do an example yourself. What happens? • Choose from: Cache: Hit, Miss, Miss w. replace; Values returned: a, b, c, d, e, …, k, l • Read address 0x00000030? 000000000000000000 0000000011 0000 • Read address 0x0000001c? 000000000000000000 0000000001 1100 [figure: cache with valid rows Tag = 1 (a b c d), Tag = 3 (e f g h), Tag = 0x801 (i j k l); all other rows invalid]

  34. Answers • Memory (Address: Value of Word): 0x00000010: a, 0x00000014: b, 0x00000018: c, 0x0000001c: d, 0x00000030: e, 0x00000034: f, 0x00000038: g, 0x0000003c: h, 0x00008010: i, 0x00008014: j, 0x00008018: k, 0x0000801c: l • 0x00000030: a hit — tag = 3, tag found, offset = 0, value = e • 0x0000001c: a hit — tag = 1, tag found, offset = 0xc, value = d • Since these are reads, the values must equal the memory values whether or not they are cached: • 0x00000030 = e • 0x0000001c = d
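The whole six-address walkthrough above can be replayed with a toy simulator. Since a fully associative cache has no index field, tracking the set of resident tags is enough; this sketch (mine, not from the slides) assumes the cache never fills, which holds for these six accesses:

```python
def simulate_fa_cache(addresses, offset_bits=4):
    """Fully associative cache simulation: keep the set of resident tags
    and record a hit or miss for each access."""
    resident = set()               # tags currently in the cache
    outcomes = []
    for addr in addresses:
        tag = addr >> offset_bits  # 16B blocks: drop the 4 offset bits
        if tag in resident:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            resident.add(tag)      # plenty of rows free: no eviction needed
    return outcomes

accesses = [0x00000014, 0x0000001C, 0x00000034,
            0x00008014, 0x00000030, 0x0000001C]
print(simulate_fa_cache(accesses))
# ['miss', 'hit', 'miss', 'miss', 'hit', 'hit']
```

The outcome sequence matches the slides: the reads of 0x1C, 0x30, and the second 0x1C hit because their blocks were brought in by earlier misses.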

  35. Third Type of Cache Miss • Capacity Misses • miss that occurs because the cache has a limited size • miss that would not occur if we increase the size of the cache • sketchy definition, so just get the general idea • This is the primary type of miss for Fully Associative caches.

  36. N-Way Set Associative Cache (1/3) • Memory address fields: • Tag: same as before • Offset: same as before • Index: points to the correct “row” (called a set here) • So what’s the difference? • each set contains multiple blocks • once we’ve found the correct set, we must compare all the tags in that set to find the data

  37. Associative Cache Example • Here’s a simple 2-way set associative cache. [figure: a 16-word memory (addresses 0–F) mapped into a cache with 2 sets (Cache Index 0 and 1), each holding 2 blocks]

  38. N-Way Set Associative Cache (2/3) • Basic Idea • the cache is direct-mapped with respect to sets • each set is fully associative • basically N direct-mapped caches working in parallel: each has its own valid bit and data • Given a memory address: • use the Index to find the correct set • compare all the Tags in that set • if one matches, a hit!, otherwise a miss • finally, use the offset field as before to find the desired data within the block
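The address-decoding step described above can be sketched as follows; the bit widths are parameters, and the function name is mine (not from the slides). Note how the index field reappears between the offset and the tag:

```python
def split_sa_address(addr, offset_bits, index_bits):
    """Decode an address for a set-associative cache into
    (tag, set index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)                 # byte in block
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # which set
    tag = addr >> (offset_bits + index_bits)                 # the rest
    return tag, index, offset

# e.g. 16B blocks (4 offset bits) and 4 sets (2 index bits):
print(split_sa_address(0x00000034, 4, 2))
```

With a lookup, the tag would then be compared against all N blocks in the selected set; in hardware those N comparisons happen in parallel.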

  39. N-Way Set Associative Cache (3/3) • What’s so great about this? • even a 2-way set assoc cache avoids a lot of conflict misses • hardware cost isn’t that bad: only need N comparators • In fact, for a cache with M blocks, • it’s Direct-Mapped if it’s 1-way set assoc • it’s Fully Assoc if it’s M-way set assoc • so these two are just special cases of the more general set associative design

  40. 4-Way Set Associative Cache Circuit [figure: circuit diagram showing the tag and index fields of the address driving four parallel tag comparisons]

  41. Block Replacement Policy • Direct-Mapped Cache: the index completely specifies which position a block can go in on a miss • N-Way Set Assoc: the index specifies a set, but the block can occupy any position within the set on a miss • Fully Associative: the block can be written into any position • Question: if we have multiple choices, where should we write? • If there is a vacant spot, we usually write to the first one. • If all possible locations already have a valid block, we must pick a replacement policy: a rule by which we determine which block gets “cached out” on a miss.

  42. Block Replacement Policy: LRU • LRU (Least Recently Used) • Idea: cache out the block which has been accessed (read or write) least recently • Pro: temporal locality  recent past use implies likely future use: in fact, this is a very effective policy • Con: with 2-way set assoc, it’s easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and much time to keep track of this

  43. Block Replacement Example • We have a 2-way set associative cache with a four word total capacity and one word blocks. We perform the following word accesses (ignore bytes for this problem): 0, 2, 0, 1, 4, 0, 2, 3, 5, 4 How many hits and how many misses will there be for the LRU block replacement policy?

  44. Block Replacement Example: LRU • Addresses 0, 2, 0, 1, 4, 0, … • 0: miss, bring into set 0 (loc 0) • 2: miss, bring into set 0 (loc 1) • 0: hit • 1: miss, bring into set 1 (loc 0) • 4: miss, bring into set 0 (loc 1, replace 2) • 0: hit [figure: contents of set 0 and set 1 with LRU markers after each access]
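The full access sequence from slide 43 (0, 2, 0, 1, 4, 0, 2, 3, 5, 4) can be simulated with a small LRU model. For a 4-word, 2-way set associative cache with 1-word blocks there are 2 sets, so the set index is just the address mod 2. This sketch is mine, not from the slides:

```python
from collections import OrderedDict

def lru_cache_sim(accesses, num_sets=2, ways=2):
    """2-way set associative cache with 1-word blocks and LRU replacement.
    Each set is an OrderedDict used as an LRU list (oldest entry first)."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in accesses:
        s = sets[addr % num_sets]
        if addr in s:
            hits += 1
            s.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)  # evict the least recently used block
            s[addr] = True
    return hits, misses

print(lru_cache_sim([0, 2, 0, 1, 4, 0, 2, 3, 5, 4]))  # (2, 8)
```

Only the two repeated accesses to word 0 hit; every other access misses, including the final 4, which was evicted when 2 was reloaded.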

  45. Big Idea • How do we choose between associativity, block size, replacement & write policy? • Design against a performance model • Minimize: Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate • influenced by technology & program behavior • Create the illusion of a memory that is large, cheap, and fast - on average • How can we improve the miss penalty?

  46. Improving Miss Penalty • When caches first became popular, Miss Penalty ~ 10 processor clock cycles • Today: a 2400 MHz processor (0.4 ns per clock cycle) and 80 ns to go to DRAM  200 processor clock cycles! • Solution: another cache between memory and the processor cache: a Second Level (L2) Cache [figure: Proc  $ (L1)  $2 (L2)  DRAM]

  47. Analyzing a Multi-level Cache Hierarchy [figure: Proc  $ (L1 hit time, L1 Miss Rate, L1 Miss Penalty)  $2 (L2 hit time, L2 Miss Rate, L2 Miss Penalty)  DRAM] • Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * L1 Miss Penalty • L1 Miss Penalty = L2 Hit Time + L2 Miss Rate * L2 Miss Penalty • Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * (L2 Hit Time + L2 Miss Rate * L2 Miss Penalty)
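The two-level AMAT formula above can be wrapped in a small helper and checked with some hypothetical numbers (the 5% / 20% / 200-cycle figures below are illustrative, not from the slides):

```python
def multilevel_amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    """AMAT = L1 hit time + L1 miss rate * (L2 hit time
              + L2 miss rate * L2 miss penalty)."""
    l1_miss_penalty = l2_hit + l2_miss_rate * l2_penalty
    return l1_hit + l1_miss_rate * l1_miss_penalty

# Hypothetical: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 20% local L2 miss rate, 200 cycles to DRAM.
print(multilevel_amat(1, 0.05, 10, 0.20, 200))  # 3.5 cycles
```

Note how the L2 cache shrinks the effective L1 miss penalty from 200 cycles (straight to DRAM) to 10 + 0.20 × 200 = 50 cycles.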

  48. And in Conclusion… • We’ve discussed memory caching in detail. Caching in general shows up over and over in computer systems • Filesystem cache • Web page cache • Game databases / tablebases • Software memoization • Others? • Big idea: if something is expensive but we want to do it repeatedly, do it once and cache the result. • Cache design choices: • Write through v. write back • size of cache: speed v. capacity • direct-mapped v. associative • for N-way set assoc: choice of N • block replacement policy • 2nd level cache? • 3rd level cache? • Use performance model to pick between choices, depending on programs, technology, budget, ...

  49. BONUS SLIDES

  50. Example • Assume • Hit Time = 1 cycle • Miss rate = 5% • Miss penalty = 20 cycles • Calculate AMAT… • Avg mem access time = 1 + 0.05 x 20 = 1 + 1 cycles = 2 cycles
