
Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers. Chang Joo Lee, Onur Mutlu*, Veynu Narasiman, Yale N. Patt. Electrical and Computer Engineering, The University of Texas at Austin. *Microsoft Research and Carnegie Mellon University. Outline: Motivation, Mechanism, Experimental Evaluation, Conclusion.


Presentation Transcript


  1. Prefetch-Aware DRAM Controllers • Chang Joo Lee, Onur Mutlu*, Veynu Narasiman, Yale N. Patt • Electrical and Computer Engineering, The University of Texas at Austin • *Microsoft Research and Carnegie Mellon University

  2. Outline • Motivation • Mechanism • Experimental Evaluation • Conclusion

  3. Modern DRAM Systems • DRAM bank: • Rows and columns of DRAM cells • A row buffer in each bank • Non-uniform access latency: • Row-hit: data is in the row buffer • Row-conflict: data is not in the row buffer, so the DRAM cells must be accessed • Row-hit latency < row-conflict latency • Prioritize row-hit accesses to increase DRAM throughput [Rixner et al. ISCA2000] [Figure: a DRAM bank with Row A, Row B, the row buffer, and the data bus, showing a row-hit and a row-conflict for processor accesses to Row A and Row B]

  4. Problems of Prefetch Handling • How to schedule prefetches vs demands? • Demand-first: Always prioritizes demands over prefetch requests • Demand-prefetch-equal: Always treats them the same • Neither of these performs best • Neither takes into account both: 1. Non-uniform access latency of DRAM systems 2. Usefulness of prefetches

  5. When Prefetches are Useful • Requests in the DRAM controller: Pref Row A : X, Dem Row B : Y, Pref Row A : Z • Processor needs Y, X, and Z • Demand-first: 2 row-conflicts, 1 row-hit (processor sees Miss Y, Miss X, Miss Z) [Figure: DRAM timeline (row-conflict, row-hit) and processor timeline (stall, execution) for Row A, Row B, and the row buffer]

  6. When Prefetches are Useful • Requests in the DRAM controller: Pref Row A : X, Dem Row B : Y, Pref Row A : Z • Processor needs Y, X, and Z • Demand-first: 2 row-conflicts, 1 row-hit (processor sees Miss Y, Miss X, Miss Z) • Demand-pref-equal: 2 row-hits, 1 row-conflict (processor sees Miss Y, Hit X, Hit Z) • Demand-pref-equal outperforms demand-first (saved cycles) [Figure: DRAM timeline (row-conflict, row-hit) and processor timeline (stall, execution) under each policy]
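The example on the two slides above can be replayed with a small scheduler model. This is a minimal sketch, not the authors' simulator. It assumes that Row A is initially open in the row buffer (which is consistent with the hit and conflict counts on the slide), that the requests arrived in the order X, Y, Z, and that both policies break ties row-hit first, then oldest first. `Request` and `schedule` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str        # cache line the processor will ask for
    row: str         # DRAM row holding that line
    is_demand: bool  # False means the request is a prefetch
    arrival: int     # smaller value = older request


def schedule(requests, open_row, demand_first):
    """Return (service order, row-hits, row-conflicts) for one bank."""
    pending = list(requests)
    order, hits, conflicts = [], 0, 0
    while pending:
        def priority(req):
            # Lower tuples are served first: row-hit, then oldest.
            key = (req.row != open_row, req.arrival)
            # Demand-first puts every demand ahead of every prefetch.
            return ((not req.is_demand,) + key) if demand_first else key

        chosen = min(pending, key=priority)
        pending.remove(chosen)
        if chosen.row == open_row:
            hits += 1
        else:
            conflicts += 1
            open_row = chosen.row  # a conflict opens the new row
        order.append(chosen.name)
    return order, hits, conflicts


# Assumed arrival order of the three requests from the slide.
queue = [
    Request("X", "A", is_demand=False, arrival=0),
    Request("Y", "B", is_demand=True, arrival=1),
    Request("Z", "A", is_demand=False, arrival=2),
]

print(schedule(queue, open_row="A", demand_first=True))   # demand-first
print(schedule(queue, open_row="A", demand_first=False))  # demand-pref-equal
```

Under these assumptions the model reproduces the counts on the slide: demand-first serves Y, X, Z with one row-hit and two row-conflicts, while demand-pref-equal serves X, Z, Y with two row-hits and one row-conflict.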

  7. When Prefetches are Useless • Requests in the DRAM controller: Pref Row A : X, Dem Row B : Y, Pref Row A : Z • Processor needs ONLY Y • Demand-first: DRAM services Y, X, Z • Demand-pref-equal: DRAM services X, Z, Y • Demand-first outperforms demand-pref-equal (saved cycles) [Figure: DRAM and processor timelines under each policy; the processor stalls on Miss Y]
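The cost of serving useless prefetches first can be made concrete by timing when the only needed request, Y, completes under each service order from the slide. This is a rough sketch under stated assumptions: Row A is initially open, and the latencies `T_HIT` and `T_CONFLICT` are hypothetical values in arbitrary time units, not figures from the presentation. The function name `demand_finish_time` is also hypothetical.

```python
# Hypothetical latencies in arbitrary time units (not from the slides).
T_HIT = 1
T_CONFLICT = 3

# Row that holds each requested line, as on the slide.
ROWS = {"X": "A", "Y": "B", "Z": "A"}


def demand_finish_time(service_order, demand, open_row="A"):
    """Time at which `demand` has been serviced by the single DRAM bank."""
    time = 0
    for name in service_order:
        row = ROWS[name]
        time += T_HIT if row == open_row else T_CONFLICT
        open_row = row  # the accessed row stays in the row buffer
        if name == demand:
            return time
    raise ValueError(f"{demand} is not in the service order")


demand_first = demand_finish_time(["Y", "X", "Z"], "Y")
demand_pref_equal = demand_finish_time(["X", "Z", "Y"], "Y")
print(demand_first, demand_pref_equal, demand_pref_equal - demand_first)
```

With these assumed latencies, Y completes after one row-conflict under demand-first, but only after two row-hits plus a row-conflict under demand-pref-equal. The difference corresponds to the "saved cycles" on the slide.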

  8. Demand-first vs. Demand-pref-equal policy • Stream prefetcher enabled • Demand-pref-equal is better in some cases, demand-first is better in others • Useless prefetches consume: • Off-chip bandwidth • Queue resources • Cache space (cache pollution) • Goal 1: Adaptively schedule prefetches based on prefetch usefulness • Goal 2: Eliminate useless prefetches [Figure: performance of the two policies]

  9. Goals 1. Maximize the benefits of prefetching: Increase DRAM throughput by adaptively scheduling requests based on prefetch usefulness → increase timeliness of useful prefetches 2. Minimize the harm of prefetching: Adaptively delay the service of useless prefetches and remove useless prefetches → increase efficiency of resource utilization • Achieve higher performance and efficiency

  10. Outline • Motivation • Mechanism • Experimental Evaluation • Conclusion

  11. Prefetch-Aware DRAM Controllers (PADC) • Adaptive Prefetch Scheduling (APS): Prioritizes prefetch and demand requests based on prefetch accuracy estimation • Adaptive Prefetch Dropping (APD): Cancels likely-useless prefetches from the memory request buffer based on prefetch accuracy [Figure: PADC block diagram. Prefetch accuracy from each core feeds APS and APD; APS updates the request priority in the memory request buffer, which sends requests to DRAM; APD uses request info to drop requests]

  12. Prefetch Accuracy Estimation • Prefetch accuracy = #Prefetches used / #Prefetches sent • Hardware support: • Prefetch bit (per L2 cache line, MSHR entry): Indicates whether it is a prefetch or demand • Prefetch sent counter (per core) • Prefetch used counter (per core) • Prefetch accuracy register (per core) • Estimated every 100K cycles
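The counters and accuracy register listed above could be modeled roughly as follows. This is only a sketch: the class and method names are hypothetical, and two behaviors are assumptions that the slide does not state, namely that the counters are reset at the end of each interval and that a core which sent no prefetches keeps its previous estimate.

```python
class PrefetchAccuracyEstimator:
    """Per-core prefetch accuracy, re-estimated once per interval."""

    def __init__(self, num_cores, interval_cycles=100_000):
        self.interval_cycles = interval_cycles  # 100K cycles per the slide
        self.sent = [0] * num_cores             # prefetch sent counters
        self.used = [0] * num_cores             # prefetch used counters
        self.accuracy = [0.0] * num_cores       # prefetch accuracy registers

    def on_prefetch_sent(self, core):
        self.sent[core] += 1

    def on_prefetch_used(self, core):
        # Called when a demand hits a line whose prefetch bit is set.
        self.used[core] += 1

    def tick(self, cycle):
        """Call once per cycle; updates the registers at interval ends."""
        if cycle > 0 and cycle % self.interval_cycles == 0:
            self.end_interval()

    def end_interval(self):
        for core in range(len(self.sent)):
            if self.sent[core] > 0:
                self.accuracy[core] = self.used[core] / self.sent[core]
            # Assumption: with no prefetches sent, keep the old estimate.
            # Assumption: counters restart from zero every interval.
            self.sent[core] = 0
            self.used[core] = 0


est = PrefetchAccuracyEstimator(num_cores=2)
for _ in range(4):
    est.on_prefetch_sent(core=0)
for _ in range(3):
    est.on_prefetch_used(core=0)
est.tick(100_000)
print(est.accuracy)
```

Here core 0 sent four prefetches and three were used, so its register holds 0.75 after the interval boundary, while core 1 keeps its initial value.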

  13. Adaptive Prefetch Scheduling (APS) 1. Adaptively change the priority of prefetch requests • Low prefetch accuracy → prioritize demands from the core • High prefetch accuracy → treat demands and prefetches equally 2. In a CMP system: prioritize demand requests from a core that has many useless prefetches • To avoid starving demand requests from a core with low prefetch accuracy → improves system performance [Figure: PADC block diagram, as on slide 11]

  14. Adaptive Prefetch Scheduling (APS) 1. Critical requests • All demand requests • Prefetch requests from cores whose prefetch accuracy ≥ promotion threshold 2. Urgent requests • Demand requests from cores whose prefetch accuracy < promotion threshold

  15. Adaptive Prefetch Scheduling (APS) • Each memory request buffer entry has priority fields: C, RH, U, FCFS • Prioritization order: 1. Critical request (C) 2. Row-hit request (RH) 3. Urgent request (U) 4. Oldest request (FCFS)
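The prioritization order above, combined with the definitions of critical and urgent requests on the previous slide, can be expressed as a sort key. This is an illustrative sketch, not the hardware logic: `Request`, `aps_key`, and the example accuracies are hypothetical, and the 85% promotion threshold is taken from the PADC configuration given later in the deck.

```python
from dataclasses import dataclass

PROMOTION_THRESHOLD = 0.85  # PADC configuration value from the deck


@dataclass(frozen=True)
class Request:
    name: str
    core: int
    is_demand: bool  # False means the request is a prefetch
    row: int
    arrival: int     # smaller value = older request


def aps_key(req, open_row, accuracy, threshold=PROMOTION_THRESHOLD):
    """Sort key: smaller tuples are scheduled first (C, RH, U, FCFS)."""
    accurate_core = accuracy[req.core] >= threshold
    critical = req.is_demand or accurate_core     # C field
    row_hit = req.row == open_row                 # RH field
    urgent = req.is_demand and not accurate_core  # U field
    return (not critical, not row_hit, not urgent, req.arrival)


# Hypothetical estimates: core 0 prefetches accurately, core 1 does not.
accuracy = {0: 0.90, 1: 0.20}
buffer = [
    Request("a", core=1, is_demand=False, row=7, arrival=0),
    Request("b", core=0, is_demand=False, row=3, arrival=1),
    Request("c", core=1, is_demand=True, row=3, arrival=2),
    Request("d", core=0, is_demand=True, row=7, arrival=3),
]

ordered = sorted(buffer, key=lambda r: aps_key(r, open_row=7, accuracy=accuracy))
print([r.name for r in ordered])
```

In this example the critical row-hit demand `d` goes first, the urgent demand `c` from the inaccurate core precedes the older critical prefetch `b`, and the non-critical prefetch `a` is served last even though it is the oldest request and a row-hit.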

  16. Adaptive Prefetch Dropping (APD) • Proactively drops old prefetches based on prefetch accuracy estimation • Old requests are likely useless • APS prioritizes demand requests when prefetch accuracy is low • A prefetch that is hit by a demand is promoted to a demand • Dropping old, useless prefetches saves resources (bandwidth, queues, caches) • Saved resources can be used by useful requests [Figure: PADC block diagram, as on slide 11]

  17. Adaptive Prefetch Dropping (APD) • Each memory request buffer entry holds drop information: • Prefetch bit (P) • Core ID field (ID) • Age field (AGE) • Drop prefetch requests whose AGE > drop threshold • Drop threshold is dynamically determined based on prefetch accuracy estimation • Lower accuracy → lower threshold
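The dropping rule above can be sketched as a filter over the memory request buffer. The names `Entry`, `drop_threshold`, and `apply_apd` are hypothetical, and the threshold table is a placeholder invented for illustration only: the transcript does not preserve the actual drop threshold values, so the numbers below should not be read as the authors' configuration.

```python
from dataclasses import dataclass

# HYPOTHETICAL table: (accuracy upper bound, drop threshold in cycles).
# Only the trend "lower accuracy -> lower threshold" comes from the slide.
HYPOTHETICAL_THRESHOLDS = [(0.10, 100), (0.50, 10_000), (float("inf"), 100_000)]


@dataclass(frozen=True)
class Entry:
    name: str
    is_prefetch: bool  # P field
    core_id: int       # ID field
    age: int           # AGE field, in cycles


def drop_threshold(accuracy, table=HYPOTHETICAL_THRESHOLDS):
    """Pick the drop threshold for a core from its estimated accuracy."""
    for upper_bound, max_age in table:
        if accuracy < upper_bound:
            return max_age
    return table[-1][1]


def apply_apd(buffer, accuracy_by_core, table=HYPOTHETICAL_THRESHOLDS):
    """Split the buffer into (kept, dropped); demands are never dropped."""
    kept, dropped = [], []
    for entry in buffer:
        limit = drop_threshold(accuracy_by_core[entry.core_id], table)
        if entry.is_prefetch and entry.age > limit:
            dropped.append(entry)
        else:
            kept.append(entry)
    return kept, dropped


accuracy_by_core = {0: 0.05, 1: 0.90}
buffer = [
    Entry("p0", is_prefetch=True, core_id=0, age=500),
    Entry("p1", is_prefetch=True, core_id=1, age=500),
    Entry("d0", is_prefetch=False, core_id=0, age=500),
]
kept, dropped = apply_apd(buffer, accuracy_by_core)
print([e.name for e in kept], [e.name for e in dropped])
```

With these placeholder values, the 500-cycle-old prefetch from the inaccurate core 0 is dropped, while the equally old prefetch from the accurate core 1 and the demand from core 0 stay in the buffer.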

  18. Hardware Cost for 4-core CMP • Total storage: 34,720 bits (~4.25KB) are needed • ~ 4KB are prefetch bits in each cache line • If prefetch bits are already implemented: ~228B • Logic is not on the critical path • Scheduling and dropping decisions are made every DRAM bus cycle

  19. Outline • Motivation • Mechanism • Experimental Evaluation • Conclusion

  20. Simulation Methodology • x86 cycle accurate simulator • Baseline processor configuration • Per core • 4-wide issue, out-of-order, 256-entry ROB • 512KB, 8-way unified L2 cache (1MB for single core processor) • Stream prefetcher (Lookahead, prefetch degree: 4, prefetch distance: 64) • Shared • On-chip, demand-first FR-FCFS memory controller • 64, 128, 256 L2 MSHRs, memory request buffer for 1-, 4-, 8-core • DDR3 1333, 15-15-15ns, 4KB row buffer • PADC configuration • Promotion threshold: 85% • Drop threshold:

  21. Workloads for Evaluation • Single-core processor: All 55 SPEC 2000/2006 benchmarks • Single-threaded • 38 prefetch sensitive benchmarks • 17 prefetch insensitive benchmarks • CMP: Randomly chosen multiprogrammed workloads from 55 benchmarks: • 4-core CMP: 32 workloads • 8-core CMP: 21 workloads

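  22. Performance of PADC [Chart; annotated values: 4.3%, 8.2%, 9.9%]

  23. Bus Traffic of PADC [Chart; annotated values: -10.4%, -10.7%, -9.4%]

  24. Performance with Other Prefetchers • 4-core CMP [Chart; annotated values: 6.0%, 6.6%, 2.2%]

  25. Bus Traffic with Other Prefetchers • 4-core CMP [Chart; annotated values: -5.7%, -6.8%, -10.3%]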

  26. Outline • Motivation • Mechanism • Experimental Evaluation • Conclusion

  27. Conclusions • Prefetch-Aware DRAM Controllers (PADC) • Adaptive Prefetch Scheduling • Increase DRAM throughput by exploiting row-buffer locality when prefetches are useful • Delay service of prefetches when they are useless • Adaptive Prefetch Dropping • With APS, remove useless prefetches effectively while keeping the benefits of useful prefetches • Improve performance and bandwidth efficiency for both single-core and CMP systems • Low cost and easily implementable

  28. Questions?

  29. Performance Detail • Single-core: • 38 prefetch-sensitive: 6.2% • Prefetch-friendly: 29 benchmarks • Prefetch-unfriendly: 9 benchmarks • 17 out of 38 are memory intensive (MPKI > 10) : 11.8% • 17 prefetch-insensitive

  30. Two Channel Memory Performance [Chart; annotated values: 16%, 31%, 5.9%, 5.5%]

  31. Two Channel Memory Bus Traffic [Chart; annotated values: -12.9%, -13.2%]

  32. Comparison with Feedback Directed Prefetching • 4-core CMP [Chart; annotated value: 6.4%]

  33. Performance on Single-Core

  34. Prefetch Friendly Application • libquantum

  35. Prefetch Unfriendly Application • art

  36. Average Performance on Single-Core • All 55 SPEC 2000/2006 CPU benchmarks

  37. System Performance on 4-Core CMP • 32 randomly chosen 4-core workloads

  38. System Performance on 8-core CMP • 21 randomly chosen 8-core workloads

  39. Prefetch Friendly Application • leslie3d

  40. Prefetch Unfriendly Application • ammp

  41. Performance on 4-Core • omnetpp, libquantum, galgel, and GemsFDTD on 4-core CMP

  42. Performance on 4-Core • omnetpp, libquantum, galgel, and GemsFDTD on 4-core CMP

  43. System Performance on 4-Core • omnetpp, libquantum, galgel, and GemsFDTD
