
CS 7810 Lecture 17


Presentation Transcript


  1. CS 7810 Lecture 17: Managing Wire Delay in Large CMP Caches, B. Beckmann and D. Wood, Proceedings of MICRO-37, December 2004

  2. Cache Design [Diagram: the address drives decoders for the tag array and data array; a comparator checks the tag, and sense amps plus mux+driver deliver the data.]
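The datapath on the slide can be sketched in Python. Everything here (set count, associativity, block size) is an assumed geometry for illustration, not a value from the lecture:

```python
# Hypothetical sketch of a set-associative lookup mirroring the slide's
# datapath: decoder selects a set, comparator checks tags, mux picks a way.
NUM_SETS = 128     # assumed geometry, not from the slide
WAYS = 4
BLOCK_BITS = 6     # 64-byte blocks (assumed)
INDEX_BITS = 7     # log2(NUM_SETS)

def split_address(addr):
    """Decoder input: the index selects a row in both tag and data arrays."""
    index = (addr >> BLOCK_BITS) & (NUM_SETS - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index

def lookup(tag_array, data_array, addr):
    """Comparator + mux: return the matching way's data, or None on a miss."""
    tag, index = split_address(addr)
    for way in range(WAYS):
        if tag_array[index][way] == tag:
            return data_array[index][way]   # mux+driver selects this way
    return None
```

The tag and data arrays are indexed in parallel, as in the slide's figure; only the way whose tag matches drives the output.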

  3. Capacity vs. Latency: 8 KB: 1 cycle; 32 KB: 2 cycles; 128 KB: 3 cycles

  4. Large L2 Caches • Issues to be addressed for Non-Uniform Cache Access: • Mapping • Searching • Movement
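The "mapping" question can be made concrete with a minimal sketch of static, address-interleaved bank mapping; the bank count and block size are assumptions, not values from the slide:

```python
# Minimal sketch of static NUCA mapping: each block address maps to a
# fixed bank by interleaving on low-order block-address bits.
NUM_BANKS = 16    # assumed bank count
BLOCK_BITS = 6    # 64-byte blocks (assumed)

def static_bank(addr):
    """Bank holding this block; fixed for the block's lifetime."""
    return (addr >> BLOCK_BITS) % NUM_BANKS
```

With a static map, search is trivial (one bank to probe) but movement is impossible; dynamic schemes relax the mapping and must then answer the search question.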

  5. Dynamic NUCA • Frequently accessed blocks are moved closer to the CPU – reduces average latency • Partial (6-bit) tags are maintained close to the CPU – tag look-up can identify the potential location of a block or quickly signal a miss • Without partial tags, every possible location would have to be searched serially or in parallel • What if you optimize for power?
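The partial-tag idea can be illustrated with a short sketch: a 6-bit partial-tag mismatch proves the block is absent, while a match only suggests which banks to search. The per-bank table below is a made-up structure for illustration (real designs keep partial tags per set and way):

```python
# Sketch of partial (6-bit) tags kept near the CPU: an empty candidate
# list is a guaranteed miss; matches must still be verified at the bank.
PARTIAL_BITS = 6

def partial(tag):
    """Keep only the low 6 bits of the full tag."""
    return tag & ((1 << PARTIAL_BITS) - 1)

def candidate_banks(partial_tags, tag):
    """partial_tags: {bank: partial tag of the block cached there}."""
    p = partial(tag)
    return [bank for bank, pt in partial_tags.items() if pt == p]
```

Because only 6 bits are compared, false positives are possible, but a miss can often be declared without touching any remote bank.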

  6. DNUCA – CMP • Allocation: static, based on the block's address • Migration: r.l → r.i → r.c → m.c → m.i → m.l • Search: multicast to 6 banks, then multicast to 10 • False misses can occur while blocks migrate • [Figure labels: bank latencies range from 13-17 cycles up to 65 cycles]
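The migration step can be sketched abstractly: on a hit, a block swaps one position closer to the requester along an ordered chain of bank positions. The chain abstraction here is mine, not the paper's bank topology:

```python
# Sketch of gradual DNUCA migration: each hit moves the block one bank
# closer to the CPU; a concurrent search can miss it in flight ("false miss").
def migrate_on_hit(bank_chain, block):
    """bank_chain holds one block per bank, ordered farthest-to-closest;
    swap the hit block one step closer to the CPU."""
    i = bank_chain.index(block)
    if i + 1 < len(bank_chain):
        bank_chain[i], bank_chain[i + 1] = bank_chain[i + 1], bank_chain[i]
    return bank_chain
```

The swap keeps both blocks resident, but during the exchange a search probing the old location can fail to find the block, which is the source of false misses.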

  7. Alternative Layout From Huh et al., ICS’05

  8. Block Sharing

  9. Hit Distribution

  10. Block Migration Results While block migration reduces avg. distance, it complicates search.

  11. CMP-TLC • Pros: fast wires enable uniform low-latency access • Cons: low-bandwidth interconnect; high implementation cost; more latency/complexity at the L2 interface

  12. Stride Prefetching • Prefetching algorithm: detect at least 4 uniform-stride accesses, then allocate an entry in the stream buffer • The stream buffer has 8 entries, and each stream stays 6 (L1) or 25 (L2) accesses ahead
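The detection rule on the slide (at least 4 uniform-stride accesses before allocating a stream) can be sketched as follows; the class and its interface are illustrative, not the paper's hardware:

```python
# Sketch of stride detection: after 4 accesses with a uniform stride,
# report the stride so a stream-buffer entry can be allocated.
DETECT_COUNT = 4

class StrideDetector:
    def __init__(self):
        self.last = None      # previous address seen
        self.stride = None    # current candidate stride
        self.count = 0        # accesses confirming the stride

    def access(self, addr):
        """Return the stride once DETECT_COUNT uniform-stride accesses occur."""
        if self.last is not None:
            s = addr - self.last
            if s == self.stride:
                self.count += 1
            else:
                self.stride, self.count = s, 1
        self.last = addr
        if self.count + 1 >= DETECT_COUNT:
            return self.stride  # caller would allocate a stream-buffer entry
        return None
```

Once a stream is allocated, the real mechanism prefetches far enough ahead (6 accesses for L1, 25 for L2 per the slide) to hide the respective miss latencies.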

  13. Combination of Techniques

