
Non-Uniform Cache Architectures for Wire Delay Dominated Caches


Presentation Transcript


  1. Non-Uniform Cache Architectures for Wire Delay Dominated Caches Abhishek Desai, Bhavesh Mehta, Devang Sachdev, Gilles Muller

  2. Plan • Motivation • What is NUCA • UCA and ML-UCA • Static NUCA • Dynamic NUCA • Simulation Results

  3. Motivation • Bigger L2 and L3 caches are needed • Programs are larger • SMT requires a large cache for spatial locality • Bandwidth demands on the package have increased • Smaller process technologies permit more bits per mm² • Wire delays dominate in large caches • The bulk of the access time goes to routing to and from the banks, not the bank accesses themselves

  4. What is NUCA? Data residing closer to the processor is accessed much faster than data that resides physically farther from the processor. Example: the closest bank in a 16MB on-chip L2 cache built in 50nm process technology could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles.
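The distance-dependent latencies above can be illustrated with a toy model in which a bank's access time grows with its Manhattan distance from the cache controller. The base and per-hop latencies below are assumptions for illustration, not figures from the talk.

```python
# Toy NUCA latency model: latency = bank access time plus round-trip
# routing delay proportional to the bank's distance from the controller.

BASE_LATENCY = 3   # bank access time in cycles (assumed)
HOP_LATENCY = 1    # routing delay per hop in cycles (assumed)

def access_latency(bank_row, bank_col):
    """Cycles to access a bank at (row, col); controller sits at (0, 0)."""
    hops = bank_row + bank_col                    # Manhattan distance
    return BASE_LATENCY + 2 * hops * HOP_LATENCY  # round trip to the bank

print(access_latency(0, 0))    # closest bank
print(access_latency(15, 15))  # farthest bank in a 16x16 array
```

With these assumed constants the closest bank answers in 3 cycles and the farthest in 63, the same kind of near/far spread (4 vs. 47 cycles) the slide describes.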

  5. UCA and ML-UCA [figure: a monolithic UCA array beside a two-level ML-UCA whose L2 banks are reachable in about 10 cycles and L3 banks in about 41 cycles] • UCA: avg. access time 255 cycles, 1 bank, 16MB, 50nm • ML-UCA: avg. access time 11/41 cycles, 8/32 banks, 16MB, 50nm

  6. Static-NUCA-1 [figure: bank array with per-bank access latencies ranging from 17 to 41 cycles] • S-NUCA-1: avg. access time 34 cycles, 32 banks, 16MB, 50nm, wire area overhead 20.9%

  7. S-NUCA-1 cache design [figure: a bank divided into sub-banks, with predecoder, tag array, wordline driver and decoder, sense amplifier, and dedicated address and data buses]

  8. Static-NUCA-2 [figure: switched bank network with per-bank access latencies ranging from 9 to 32 cycles] • S-NUCA-2: avg. access time 24 cycles, 32 banks, 16MB, 50nm, channel area overhead 5.9%

  9. S-NUCA-2 cache design [figure: banks connected through switches, with predecoder, tag array, wordline driver and decoder, sense amplifier, and shared address and data buses]

  10. Dynamic-NUCA [figure: data migrating among banks, with per-bank access latencies ranging from 4 to 47 cycles] • D-NUCA: avg. access time 18 cycles, 256 banks, 16MB, 50nm

  11. Management of Data in D-NUCA • Mapping: how is data mapped to the banks, and in which banks can a datum reside? • Search: how is the set of possible locations searched to find a line? • Movement: under what conditions should data be migrated from one bank to another?

  12. Simple Mapping (implemented) [figure: 8 bank sets arranged as columns in front of the memory controller; the 4 banks in each column, nearest to farthest, serve as ways 1 to 4 of one set]
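A minimal sketch of the simple mapping, assuming the 8-column, 4-way layout from the slide and assuming the bank set is chosen by the low bits of a line's set index:

```python
# Simple mapping sketch: a set index selects one column (bank set); the
# column's banks, ordered nearest to farthest, are the ways of that set.

NUM_BANK_SETS = 8   # columns of banks (from the slide)
WAYS = 4            # banks per column = associativity (from the slide)

def bank_set(set_index):
    """Column that may hold this set; low bits of the set index (assumed)."""
    return set_index % NUM_BANK_SETS

def candidate_banks(set_index):
    """Banks to search for a line, ordered nearest (way 1) to farthest (way 4)."""
    col = bank_set(set_index)
    return [(way, col) for way in range(WAYS)]

print(candidate_banks(13))  # set 13 maps to column 5
```

Only these 4 banks ever need to be searched for a given line, which is what makes the search policies on the next slides tractable.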

  13. Fair and Shared Mapping [figure: fair mapping and shared mapping bank layouts, each shown relative to the memory controller]

  14. Searching Cached Lines • Incremental search • Multicast search (implemented) • Limited multicast • Partitioned multicast • Smart search: ss-performance, ss-energy

  15. Dynamic Movement of Lines • LRU line furthest and MRU line closest • One-bank promotion on a hit (implemented) Policy on miss: • Which line is evicted? • Line in the furthest (slowest) bank -- (implemented) • Where is the new line placed? • Closest (fastest) bank • Furthest (slowest) bank -- (implemented) • What happens to the victim line? • Zero copy policy (implemented) • One copy policy
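A minimal sketch of the implemented movement policies above (one-bank promotion on a hit, placement and eviction in the farthest bank, zero-copy victims), assuming one line per bank for simplicity; the class and names are illustrative.

```python
# One bank set of a D-NUCA cache: index 0 is the closest (fastest) bank,
# the last index is the farthest (slowest) bank.

class BankColumn:
    def __init__(self, ways=4):
        self.banks = [None] * ways  # one line per bank (simplification)

    def access(self, tag):
        """Return True on a hit; apply the implemented policies."""
        if tag in self.banks:
            i = self.banks.index(tag)
            if i > 0:  # one-bank promotion: swap with the next-closer bank
                self.banks[i - 1], self.banks[i] = self.banks[i], self.banks[i - 1]
            return True
        # Miss: evict the line in the farthest bank (zero copy: it is simply
        # dropped) and place the incoming line there.
        self.banks[-1] = tag
        return False

col = BankColumn()
col.access("A")   # miss: A lands in the farthest bank
col.access("A")   # hit: A is promoted one bank closer
print(col.banks)
```

Repeated hits gradually migrate a hot line toward the fastest bank, which is how D-NUCA keeps the MRU data closest without any global LRU bookkeeping.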

  16. Advantages of D-NUCA over ML-UCA • D-NUCA does not enforce inclusion, which avoids redundant copies of the same line • In ML-UCA, the faster level may not match an application's working-set size: either it is too large (and thus slow) or too small (and thus incurs misses)

  17. Configuration for simulation • Used the sim-alpha simulator and the Cacti cache model • Simple mapping • Multicast search • One-bank promotion on each hit • Replacement policy that chooses the block in the slowest bank as the victim on a miss

  18. Hit Rate Distribution for D-NUCA

  19. Simulation results – integer benchmarks

  20. Simulation results – FP benchmarks

  21. Summary D-NUCA's advantages: • Low access latency • Technology scalability • Performance stability • Flattens the memory hierarchy
