
Reducing DRAM Latencies with an Integrated Memory Hierarchy Design


Presentation Transcript


  1. Reducing DRAM Latencies with an Integrated Memory Hierarchy Design. Authors: Wei-fen Lin and Steven K. Reinhardt, University of Michigan; Doug Burger, University of Texas at Austin. Presentation by Pravin Dalale.

  2. OUTLINE: Motivation; Main idea in the paper (Analysis, Main idea); Prefetch engine (Insertion policy, Prefetch scheduling); Results; Conclusion.

  3. Motivation: Memory density and capacity have grown along with CPU power and complexity, but memory speed has not kept pace.

  4. Solutions: Multithreading; Multiple levels of caches; Prefetching.

  5. OUTLINE: Motivation; Main idea in the paper (Analysis, Main idea); Prefetch engine (Insertion policy, Prefetch scheduling); Results; Conclusion.

  6. Analysis (1): IPC_Real – instructions per cycle with the real memory system. IPC_PerfectL2 – instructions per cycle with a real L1 cache but a perfect L2 cache. IPC_PerfectMem – instructions per cycle with perfect L1 and L2 caches.

  7. Analysis (2): Fraction of performance lost due to imperfect L1 and L2 = (IPC_PerfectMem - IPC_Real) / IPC_PerfectMem. Fraction of performance lost due to imperfect L2 = (IPC_PerfectL2 - IPC_Real) / IPC_PerfectL2.
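
A quick worked example of the two fractions above, written in C. The IPC values are hypothetical placeholders chosen only for illustration; the paper reports per-benchmark measurements, not these numbers.

#include <stdio.h>

int main(void) {
    double ipc_real        = 0.8; /* hypothetical: real memory system  */
    double ipc_perfect_l2  = 1.5; /* hypothetical: real L1, perfect L2 */
    double ipc_perfect_mem = 2.0; /* hypothetical: perfect L1 and L2   */

    /* Fraction of performance lost due to imperfect L1 and L2 */
    double lost_l1_l2 = (ipc_perfect_mem - ipc_real) / ipc_perfect_mem;
    /* Fraction of performance lost due to imperfect L2 alone */
    double lost_l2 = (ipc_perfect_l2 - ipc_real) / ipc_perfect_l2;

    printf("lost to imperfect L1+L2: %.1f%%\n", 100.0 * lost_l1_l2); /* 60.0% */
    printf("lost to imperfect L2:    %.1f%%\n", 100.0 * lost_l2);    /* 46.7% */
    return 0;
}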

  8. Analysis (3): Simulated system: a 1.6 GHz out-of-order core, 64 KB L1, 1 MB L2, and a Direct Rambus memory system with four 1.6 GB/s channels. The 26 SPEC CPU2000 benchmarks were run on this system to obtain IPC_Real, IPC_PerfectL2, and IPC_PerfectMem.

  9. Analysis (4): The L2 stall fraction is 80% for the mcf benchmark; the average stall fraction caused by L2 misses is 57% across the 26 SPEC CPU2000 benchmarks.

  10. Main idea: The paper describes a technique to reduce L2 miss latencies. It introduces a prefetch engine that prefetches data into the L2 cache upon an L2 demand miss.

  11. OUTLINE: Motivation; Main idea in the paper (Analysis, Main idea); Prefetch engine (Insertion policy, Prefetch scheduling); Results; Conclusion.

  12. Prefetch Engine [figure: block diagram of the prefetch engine; the numbered components (1)-(3) are described on the next slide]

  13. Prefetch Engine: (1) The prefetch queue maintains a list of n region entries not present in the L2 cache. (2) The prefetch prioritizer uses the bank state and the region age to determine which prefetch to issue next. (3) The access prioritizer selects a prefetch only when no demand misses are pending.
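
The three components can be sketched in C roughly as follows. This is a minimal illustration under assumed details: the queue size, the bank-idle model, the replacement of the youngest queue entry when full, and all names are my choices, not the paper's implementation.

#include <stdbool.h>
#include <stdint.h>

#define QUEUE_ENTRIES 16   /* the "n" region entries (assumed size) */
#define NUM_BANKS     16   /* assumed DRAM bank count */

typedef struct {
    bool     valid;
    uint64_t region_base;  /* base address of the region to prefetch */
    unsigned bank;         /* DRAM bank the region maps to */
    unsigned age;          /* cycles since insertion; older = higher priority */
} PrefetchEntry;

static PrefetchEntry queue[QUEUE_ENTRIES];
static bool bank_busy[NUM_BANKS];

/* (1) On an L2 demand miss, record the missing region, reusing an empty
 * slot or replacing the youngest entry when the queue is full. */
void enqueue_region(uint64_t region_base, unsigned bank) {
    int victim = 0;
    for (int i = 0; i < QUEUE_ENTRIES; i++) {
        if (!queue[i].valid) { victim = i; break; }
        if (queue[i].age < queue[victim].age) victim = i;
    }
    queue[victim] = (PrefetchEntry){ true, region_base, bank, 0 };
}

/* (2) Prefetch prioritizer: among entries whose bank is idle, pick the
 * oldest region. Returns the queue index, or -1 if none is ready. */
int pick_prefetch(void) {
    int best = -1;
    for (int i = 0; i < QUEUE_ENTRIES; i++) {
        if (!queue[i].valid || bank_busy[queue[i].bank]) continue;
        if (best < 0 || queue[i].age > queue[best].age) best = i;
    }
    return best;
}

/* (3) Access prioritizer: demand misses always win; a prefetch is
 * considered only when no demand miss is waiting. */
int pick_access(bool demand_miss_pending) {
    return demand_miss_pending ? -1 : pick_prefetch();
}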

  14. Insertion policy (1): The prefetched block may be loaded into L2 with one of four priorities: 1. most-recently-used (MRU), 2. second-most-recently-used (SMRU), 3. second-least-recently-used (SLRU), 4. least-recently-used (LRU).
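
As a sketch of what these four insertion points mean, the following C fragment places a prefetched tag at a chosen recency position in one set of a 4-way cache. The recency-stack array layout is an illustrative assumption, not the paper's structure; note that inserting at the LRU position makes an unused prefetch the first block to be evicted.

#include <stdint.h>

#define WAYS 4

/* Insertion priorities from the slide, as positions in the recency stack. */
typedef enum { INS_MRU = 0, INS_SMRU = 1, INS_SLRU = 2, INS_LRU = 3 } InsertPos;

/* set[0] is the MRU tag, set[WAYS-1] the LRU tag. The block in the LRU
 * slot is evicted; tags at or below `pos` shift down one slot to make
 * room for the prefetched tag. */
void insert_prefetched(uint64_t set[WAYS], uint64_t tag, InsertPos pos) {
    for (int i = WAYS - 1; i > (int)pos; i--)
        set[i] = set[i - 1];
    set[pos] = tag;
}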

  15. Insertion policy (2): Benchmarks were divided into two classes: high (above 20%) prefetch accuracy and low (below 20%) prefetch accuracy. All benchmarks were tested with the four possible insertion policies; the LRU insertion policy gives the best results in both categories.

  16. Prefetch Scheduling: Simple aggressive prefetching can consume a large amount of bandwidth and cause channel contention. This contention can be avoided by scheduling prefetch accesses only when the Rambus channels are idle.
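
A minimal sketch of this gating rule in C, assuming a simple per-channel busy-until-cycle model; the model and names are mine, not the paper's.

#include <stdbool.h>
#include <stdint.h>

#define NUM_CHANNELS 4  /* four Rambus channels, per the simulated system */

static uint64_t busy_until[NUM_CHANNELS]; /* cycle when each channel frees */

/* A prefetch to `channel` may issue at cycle `now` only if no demand
 * access is waiting and that channel is idle; demand misses are never
 * delayed by prefetches. */
bool may_issue_prefetch(unsigned channel, uint64_t now, bool demand_pending) {
    return !demand_pending && now >= busy_until[channel];
}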

  17. OUTLINE: Motivation; Main idea in the paper (Analysis, Main idea); Prefetch engine (Insertion policy, Prefetch scheduling); Results; Conclusion.

  18. Results (1/3) - Overall performance improvement: The performance with prefetching is very close to that of a perfect L2.

  19. Results (2/3) - Sensitivity of the prefetch scheme to DRAM latencies: The base DRDRAM has a 40 ns latency and an 800 MHz data transfer rate. If the latency is increased to 50 ns, the mean performance of the prefetch scheme drops by less than 1% compared to the base system; if the latency is reduced to 34 ns, it again drops by less than 2%.

  20. Results (3/3) - Interaction with software prefetching: When the proposed scheme is coupled with software prefetching, none of the benchmarks improved significantly (at most 2%). Thus the proposed prefetch scheme largely subsumes the benefits of software prefetching.

  21. OUTLINE: Motivation; Main idea in the paper (Analysis, Main idea); Prefetch engine (Insertion policy, Prefetch scheduling); Results; Conclusion.

  22. Conclusions: The authors proposed and evaluated a prefetch architecture integrated with the on-chip L2 cache. The architecture aggressively prefetches large regions of data into L2 on demand misses. By scheduling these prefetches only during idle channel cycles and inserting them into the cache with low replacement priority, a significant improvement is obtained in 10 of the 26 SPEC benchmarks.

  23. QUESTIONS
