
An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches




Presentation Transcript


  1. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches ASPLOS’02 Presented by Kim, Sun-Hee

  2. Introduction • Technology trends • The rate of frequency scaling is slowing down • Performance must come from exploiting concurrency • Increasing global on-chip wire delay problem • Architectures must be partitioned • NUCA (Non-Uniform Cache Architecture) • Composable on-chip memories • Addresses the increasing wire-delay problem in future large caches • Array of fine-grained memory banks connected by a switched network

  3. Level-2 Cache Architectures (1/5) • UCA (Uniform Cache Access) • Traditional cache • Poor performance • Internal wire delays • Restricted number of ports

  4. Level-2 Cache Architectures (2/5) • ML-UCA (Multi-level Cache) • L2 and L3 • Aggressively banked • Multiple parallel accesses • Inclusion requires replicating data across levels

  5. Level-2 Cache Architectures (3/5) • S-NUCA-1 (Static Non-Uniform Cache) • Non-uniform access without inclusion • Mapping is predetermined • Based on the block index • Each block maps to only one bank of the cache • Private, two-way, pipelined transmission channel per bank

  6. Level-2 Cache Architectures (4/5) • S-NUCA-2 • 2D switched network • Permits a larger number of smaller, faster banks • Circumvents the wire and decoder area overhead of private channels

  7. Level-2 Cache Architectures (5/5) • D-NUCA (Dynamic NUCA) • Migrating cache lines • Data may be mapped to any of many banks • Most requests are serviced by the fastest banks • Fewer misses • By adapting to the working set

  8. UCA • Experimental Methodology • CACTI to derive cache access times • sim-alpha to simulate cache performance • UCA Evaluation

  9. S-NUCA • Mappings of data to banks are static • Low-order bits of the index determine the bank • Four-way set associative • Advantages • Access time varies in proportion to the distance of the bank • Accesses to different banks may proceed in parallel • Reduced contention
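The static mapping above can be sketched in a few lines of Python. This is an illustrative model, not the paper's simulator; the bank count and line size are assumptions.

```python
# Hypothetical sketch of S-NUCA's static block-to-bank mapping:
# the low-order bits of the set index select the bank, so each
# address has a fixed home bank determined at design time.

NUM_BANKS = 16      # assumed bank count
BLOCK_BYTES = 64    # assumed cache-line size

def snuca_bank(addr: int) -> int:
    """Return the bank that statically owns this address."""
    set_index = addr // BLOCK_BYTES   # drop the block offset
    return set_index % NUM_BANKS      # low-order index bits pick the bank

# Consecutive blocks interleave across banks, so independent
# accesses tend to land in different banks and can proceed in parallel.
print(snuca_bank(0x0000))  # block 0 -> bank 0
print(snuca_bank(0x0040))  # block 1 -> bank 1
```

Because the mapping is a pure function of the address, no search is needed: the controller routes each request directly to its one possible bank.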

  10. S-NUCA-1 (Private Channel) • 2 private, per-bank 128-bit channels • Each bank is accessed independently at maximum speed • Small-bank advantages vs. area overheads • Bank conflict contention model • Conservative policy : b+2d+3 cycles • Aggressive pipelining policy : b+3 cycles
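The two contention policies can be compared with a small worked example, where b is the bank access time and d the one-way channel delay in cycles. The numeric values below are illustrative assumptions, not figures from the paper.

```python
# Sketch of the slide's two S-NUCA-1 channel-contention policies.

def conservative_latency(b: int, d: int) -> int:
    # conservative: the channel is held for the whole round trip,
    # giving b + 2d + 3 cycles per request
    return b + 2 * d + 3

def pipelined_latency(b: int) -> int:
    # aggressive pipelining frees the channel early: b + 3 cycles
    return b + 3

b, d = 3, 4  # assumed bank access time and channel delay (cycles)
print(conservative_latency(b, d))  # 3 + 2*4 + 3 = 14
print(pipelined_latency(b))        # 3 + 3 = 6
```

The gap grows with d, which is why pipelining the channels matters most for the distant banks.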

  11. S-NUCA-2 (Switched Channel) • Lightweight, wormhole-routed 2-D mesh • Centralized tag store or broadcasting the tags to all of the banks

  12. D-NUCA : Mapping • Spread sets • The multibanked cache is treated as a set-associative structure • Bank set : each way of a set resides in a different bank (e.g., 4-way) • Simple mapping : rows# may not match ways; different latencies across bank sets • Fair mapping : equal average latencies, but complex paths within a set, potentially longer latencies, more contention • Shared mapping : bank sets share the fastest banks for fastest access

  13. D-NUCA : Locating • Incremental search • From the closest bank outward • Minimizes messages and energy, at lower performance • Multicast search • Multicast the address to all banks in a set • Higher performance at the cost of more energy and contention • Limited multicast • Search the first M banks in parallel, then incrementally • Partitioned multicast • Subsets of the bank set are searched iteratively
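The trade-off between the two basic lookup policies can be sketched as follows. The data structures and function names are assumptions for illustration; each bank is modeled as a set of resident tags, ordered closest first.

```python
# Illustrative sketch of D-NUCA lookup policies over one bank set.

def incremental_search(bank_set, tag):
    """Probe banks one at a time, closest first; return the probe count."""
    for probes, bank in enumerate(bank_set, start=1):
        if tag in bank:
            return probes   # hit after `probes` sequential lookups
    return None             # miss: every bank had to be searched

def multicast_search(bank_set, tag):
    """Send the address to all banks at once: len(bank_set) lookups,
    but they proceed in parallel, so the hit latency is lower."""
    hit = any(tag in bank for bank in bank_set)
    return len(bank_set) if hit else None

banks = [{"A"}, {"B"}, {"C"}, {"D"}]   # 4-way bank set, closest first
print(incremental_search(banks, "B"))  # 2 probes, one after another
print(multicast_search(banks, "B"))    # 4 probes, issued in parallel
```

Limited and partitioned multicast sit between these extremes: they multicast to a prefix or a partition of the bank set and fall back to the rest only on a miss.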

  14. D-NUCA : Searching • Challenges in a distributed cache array • Many banks may need to be searched • Miss resolution time grows as associativity increases • Partial tag comparison • Reduces bank lookups and miss resolution time • Smart search • Stores the partial tag bits in the cache controller • ss-performance : enough partial-tag bits to reduce false hits • ss-energy : serialized search from the closest bank
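The partial-tag idea can be sketched as below: the controller keeps a few low-order tag bits per resident line, so most misses are resolved without probing any bank. The bit width and store layout are assumptions for illustration.

```python
# Sketch of partial-tag "smart search": compare a few low-order tag
# bits at the controller before touching the banks.

PARTIAL_BITS = 6   # assumed number of stored partial-tag bits

def partial(tag: int) -> int:
    """Keep only the low-order PARTIAL_BITS of a tag."""
    return tag & ((1 << PARTIAL_BITS) - 1)

def smart_search(partial_tag_store, tag):
    """Return the banks whose stored partial tag matches; an empty
    list means the miss is resolved at the controller with no lookups."""
    return [bank for bank, p in partial_tag_store.items()
            if p == partial(tag)]

store = {0: partial(0x123), 1: partial(0x456)}  # bank -> partial tag
print(smart_search(store, 0x123))  # [0]: bank 0 must still be probed
print(smart_search(store, 0x789))  # []: fast miss, no bank lookups
```

A match is only a candidate (a "false hit" is possible, since several full tags share the same low bits), so matching banks must still be probed; a non-match, however, is a guaranteed miss.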

  15. D-NUCA : Movement • Maximize the hit ratio in the closest bank • MRU line is kept in the closest bank • Generational promotion • Approximates an LRU mapping • Reduces the copying required by pure LRU • On a hit, the line is swapped with the line in the next closest bank • Zero-copy policy, one-copy policy
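Generational promotion reduces to a single swap per hit, which can be sketched as below. The list models one bank set with index 0 as the closest bank; this layout is an assumption for illustration.

```python
# Sketch of generational promotion: on a hit, the line swaps with the
# line in the next closer bank, so hot lines migrate toward the
# controller one bank per hit instead of jumping straight to MRU.

def promote_on_hit(bank_set, tag):
    """Swap the hit line one bank closer; return the bank it was found in."""
    for i, line in enumerate(bank_set):
        if line == tag:
            if i > 0:  # already in the closest bank? nothing to move
                bank_set[i - 1], bank_set[i] = bank_set[i], bank_set[i - 1]
            return i
    return None  # miss: the tag is not in this bank set

ways = ["A", "B", "C", "D"]   # closest ... farthest
promote_on_hit(ways, "C")     # "C" moves from bank 2 to bank 1
print(ways)                   # ['A', 'C', 'B', 'D']
```

Compared with moving a hit line directly to the closest bank (pure LRU ordering), this one-step swap copies only two lines per hit, which is the copy reduction the slide refers to.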

  16. D-NUCA : Policies • Mapping • Simple or shared • Search • Multicast, incremental, or a combination • Promotion • Promotion distance (1 bank), promotion trigger (1 hit) • Insertion • Location (slowest bank) and replacement (zero copy) • Compared to pure LRU

  17. Evaluations (1/2) • Chart values from the slide: UCA 67.7, ML-UCA 22.3, S-NUCA 30.4 • UCA 0.41, S-NUCA 0.65

  18. Evaluations (2/2) • Comparison to ML-UCA • Like D-NUCA, frequently used data is kept closer • Differs when the working set exceeds 2MB

  19. Summary and Conclusions • Low latency access • Technology scalability • Performance stability • Flattening the memory hierarchy


  21. Evaluations (3/3) • Cache Design Comparison
