
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)

High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer. Intel Corporation, VSSAD. International Symposium on Computer Architecture (ISCA 2010).


Presentation Transcript


  1. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP) Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer Intel Corporation, VSSAD International Symposium on Computer Architecture (ISCA 2010) HIgh PErformance Computing LAB

  2. Motivation • Factors making caching important • The increasing ratio of CPU speed to memory speed • Multi-core processors make shared cache management more challenging • LRU has been the standard replacement policy at the LLC • However, LRU has problems http://hipe.korea.ac.kr

  3. Problems with LRU Replacement

  4. Desired Behavior from Cache Replacement

  5. Prior Solutions • Working set larger than the cache → Preserve some of the working set in the cache • Dynamic Insertion Policy (DIP) → Thrash-resistance with minimal changes to HW • Recurring scans → Preserve the frequently referenced working set in the cache • Least Frequently Used (LFU) → addresses scans • LFU adds complexity and also performs poorly for recency-friendly workloads • GOAL: Design a high-performing, scan-resistant policy that requires minimal changes to HW

  6. Belady’s Optimal (OPT) Replacement Policy • Makes replacement decisions using perfect knowledge of the future reference order • Victim Selection Policy • Replaces the block that will be re-referenced furthest in the future
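OPT's victim selection can be sketched in a few lines. This is an illustrative model only (the function and variable names are mine, not from the paper), assuming the full future reference trace is available:

```python
# Belady's OPT victim selection: with perfect knowledge of the future
# reference stream, evict the resident block whose next re-reference is
# furthest away (or that is never referenced again).
def opt_victim(blocks, future_refs):
    def next_use(block):
        try:
            return future_refs.index(block)   # distance to next re-reference
        except ValueError:
            return float('inf')               # never used again: ideal victim
    return max(blocks, key=next_use)

# 'a' is re-referenced furthest in the future, so it is the victim.
print(opt_victim(['a', 'b', 'c'], ['b', 'c', 'b', 'a']))  # -> a
```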

  7. Practical Cache Replacement Policies • Make replacement decisions by predicting the future reference order • Victim Selection Policy • Replaces the block predicted to be re-referenced furthest in the future • Continually update predictions of the future reference order • Natural update opportunities are on cache fills and cache hits

  8. LRU Replacement in the Prediction Framework • The LRU chain maintains the re-reference prediction • The head of the chain (MRU) is predicted to be re-referenced soon • The tail of the chain (LRU) is predicted to be re-referenced far in the future • LRU predicts that blocks are re-referenced in the reverse order of reference • Rename the LRU chain to the “Re-Reference Prediction (RRP) Chain”

  9. Practicality of Chain-Based Replacement • Problem: Chain-based replacement is too expensive • log2(associativity) bits required per cache block • Solution: LRU chain positions can be quantized into different buckets • Each bucket corresponds to a predicted re-reference interval • The value of a bucket is called the Re-Reference Prediction Value (RRPV) • Hardware cost: ‘n’ bits per block
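The storage comparison above can be made concrete with a quick calculation (a sketch of the slide's arithmetic, using n = 2 as in the paper's running example):

```python
import math

# An exact RRP-chain position needs log2(associativity) bits per block,
# while a quantized RRPV needs only n bits regardless of associativity.
associativity = 16
chain_bits = math.ceil(math.log2(associativity))  # 4 bits per block
rrpv_bits = 2                                     # n = 2 buckets per block
print(chain_bits, rrpv_bits)  # -> 4 2
```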

  10. Representation of Quantized Replacement (n = 2)

  11. Emulating LRU with Quantized Buckets (n = 2) • Victim Selection Policy: Evict a block with the distant RRPV • If no distant RRPV (2^n - 1 = ‘3’) is found, increment all RRPVs and repeat • If multiple are found, a tie breaker is needed; this paper searches upward from physical way ‘0’ • Insertion Policy: Insert new blocks with RRPV = ‘0’ • Update Policy: A cache hit updates the block’s RRPV to ‘0’ • Example state: ways 0–7 hold tags f g c h d e a b with RRPVs 1 0 2 0 1 1 3 3; the victim is the first way with RRPV 3, the new block ‘s’ is inserted with RRPV ‘0’, and a hit resets the hit block’s RRPV to ‘0’
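The victim search described above can be sketched as follows (the structure and names are mine; n = 2, so the "distant" RRPV is 2^n - 1 = 3):

```python
# If no way holds a distant RRPV, all RRPVs are incremented and the search
# repeats; ties are broken by taking the lowest physical way, as in the slide.
DISTANT = 3

def select_victim(rrpvs):
    while DISTANT not in rrpvs:
        rrpvs[:] = [r + 1 for r in rrpvs]  # age every block, then rescan
    return rrpvs.index(DISTANT)            # lowest way wins ties

rrpvs = [1, 0, 2, 0, 1, 1, 3, 3]  # the slide's example state
print(select_victim(rrpvs))        # -> 6
```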

  12. Re-Reference Interval Prediction (RRIP) • The framework enables re-reference predictions to be tuned at insertion/update • Unlike LRU, can use a non-zero RRPV on insertion • Unlike LRU, can use a non-zero RRPV on cache hits • Static Re-Reference Interval Prediction (SRRIP) • Determines the best insertion/update prediction using profiling • Dynamic Re-Reference Interval Prediction (DRRIP) • Dynamically determines the best re-reference prediction at insertion

  13. Static RRIP Insertion Policy • Key idea: Do not give new blocks too much (or too little) time in the cache • Predict that a new cache block will not be re-referenced soon • Insert the new block with some RRPV other than ‘0’ • Similar to inserting in the “middle” of the RRP chain (however, it is not identical to a fixed insertion position on the RRP chain) • Example state: ways 0–7 hold tags b c e a g d f h with RRPVs 3 2 1 3 0 1 1 0; the new block ‘s’ replaces the victim and is inserted with RRPV ‘2’
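A minimal sketch of the SRRIP fill path (names are mine): the victim is chosen as on the previous slide, but the new block enters with a long re-reference prediction (RRPV = 2^n - 2 = 2) rather than 0, so it must earn a hit before it outlasts resident blocks.

```python
DISTANT, LONG = 3, 2  # 2^n - 1 and 2^n - 2 for n = 2

def srrip_insert(tags, rrpvs, new_tag):
    while DISTANT not in rrpvs:
        rrpvs[:] = [r + 1 for r in rrpvs]  # age all blocks until one is distant
    way = rrpvs.index(DISTANT)             # lowest distant way is the victim
    tags[way], rrpvs[way] = new_tag, LONG  # SRRIP: insert long, not near
    return way

tags = ['b', 'c', 'e', 'a', 'g', 'd', 'f', 'h']
rrpvs = [3, 2, 1, 3, 0, 1, 1, 0]          # the slide's example state
print(srrip_insert(tags, rrpvs, 's'))      # -> 0 (evicts 'b', inserts 's' at RRPV 2)
```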

  15. Static RRIP Update Policy on Cache Hits • Hit Priority (HP) • Like LRU, always update RRPV = ‘0’ on a cache hit • Intuition: predicts that blocks receiving hits after insertion will be re-referenced soon • Example state: ways 0–7 hold tags f g c h d e a b with RRPVs 1 0 2 0 1 1 3 3; the hit block’s RRPV is reset to ‘0’
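The Hit Priority update, sketched (the function name is mine): a hit promotes the block all the way to RRPV = 0, so a block that has hit once is not displaced by a later scan whose blocks sit at a higher RRPV.

```python
# HP update: a hit predicts the block will be re-referenced soon.
def on_hit(rrpvs, way):
    rrpvs[way] = 0

rrpvs = [1, 0, 2, 0, 1, 1, 3, 3]  # the slide's example state
on_hit(rrpvs, 0)                   # hit on way 0
print(rrpvs)                       # -> [0, 0, 2, 0, 1, 1, 3, 3]
```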

  16. Evaluation • Simulator • CMP$im • Baseline Processor • 4-way OOO with a 128-entry reorder buffer • L1: 4-way 32KB each for instruction and data caches (1 cycle) • L2: 8-way 256KB (10 cycles) • L3: 16-way 2MB (single core) / 8MB (4-core) (24 cycles) • Line size: 64B • 250-cycle penalty to main memory

  18. Evaluation • Workload • 250M instructions

  19. SRRIP Hit Priority Sensitivity to Cache Insertion Prediction at LLC

  21. Why does an insertion RRPV of 2^n - 2 work best? • Before a scan, the re-reference prediction of the active working set is ‘0’ • Recall, NRU (n = 1) is not scan-resistant • For scan resistance, the insertion RRPV MUST be different from the RRPV of the working-set blocks • A larger insertion RRPV tolerates larger scans • The maximum insertion prediction works best
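The scan-resistance argument can be checked with a toy simulation (entirely my own construction, not from the paper): a 4-way set whose working set {a, b, c} is re-referenced between scan misses {s1, s2, s3}. With SRRIP insertion (RRPV = 2) the scan blocks age to the distant RRPV first and evict each other; with LRU-like insertion (RRPV = 0) the scan displaces the working set and costs extra misses.

```python
DISTANT = 3  # 2^n - 1 for n = 2

def run(trace, insert_rrpv):
    # One 4-way set; 'x' is a stale block already predicted distant.
    tags, rrpvs, hits = ['a', 'b', 'c', 'x'], [0, 0, 0, 3], 0
    for ref in trace:
        if ref in tags:                       # hit: Hit Priority update
            hits += 1
            rrpvs[tags.index(ref)] = 0
        else:                                 # miss: select victim, then fill
            while DISTANT not in rrpvs:
                rrpvs[:] = [r + 1 for r in rrpvs]
            way = rrpvs.index(DISTANT)
            tags[way], rrpvs[way] = ref, insert_rrpv
    return hits

trace = ['a', 'b', 'c', 's1', 'a', 'b', 'c', 's2', 'a', 'b', 'c', 's3']
print(run(trace, insert_rrpv=2))  # -> 9 (every working-set reference hits)
print(run(trace, insert_rrpv=0))  # -> 6 (the scan evicts the working set)
```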

  22. DRRIP: Extending Scan-Resistant SRRIP to be Thrash-Resistant • Always using the same prediction for all insertions will thrash the cache • Need to preserve some fraction of the working set in the cache • Dynamic Re-Reference Interval Prediction • Dynamically selects between inserting blocks with 2^n - 1 and 2^n - 2 using Set Dueling • Inserting blocks with 2^n - 1 is the same as a “no update insertion”
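A sketch of the set-dueling choice (the class, counter width, and counter polarity are my assumptions, not the paper's exact design): dedicated SRRIP leader sets always insert at 2^n - 2, dedicated distant-insertion leader sets always insert at 2^n - 1, and a saturating counter PSEL tracks which group of leaders misses less so that all follower sets copy the winner.

```python
class SetDueling:
    def __init__(self, srrip_leaders, distant_leaders, bits=10):
        self.srrip = set(srrip_leaders)      # leaders fixed at RRPV = 2
        self.distant = set(distant_leaders)  # leaders fixed at RRPV = 3
        self.max = (1 << bits) - 1
        self.psel = self.max // 2            # start undecided

    def insertion_rrpv(self, set_index, miss):
        if set_index in self.srrip:
            if miss:                          # SRRIP leader missed: penalize it
                self.psel = min(self.psel + 1, self.max)
            return 2
        if set_index in self.distant:
            if miss:                          # distant leader missed: penalize it
                self.psel = max(self.psel - 1, 0)
            return 3
        # Followers: low PSEL means the SRRIP leaders missed less often.
        return 2 if self.psel < self.max // 2 else 3

duel = SetDueling(srrip_leaders={0}, distant_leaders={1})
duel.insertion_rrpv(1, miss=True)         # a distant-insertion leader misses
print(duel.insertion_rrpv(7, miss=True))  # -> 2 (followers now favor SRRIP)
```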

  23. Performance Comparison of Replacement Policies

  24. Total Storage Overhead (16-way Set-Associative Cache) • LRU: 4 bits / cache block • NRU: 1 bit / cache block • DRRIP-3: 3 bits / cache block • DRRIP outperforms LRU while requiring less storage

  25. Summary • Scan-resistance is an important problem in commercial workloads • State-of-the-art policies do not address scan-resistance • Propose a simple and practical replacement policy • Static RRIP (SRRIP) for scan-resistance • Dynamic RRIP (DRRIP) for thrash-resistance and scan-resistance • DRRIP requires ONLY 3 bits per block • In fact, it incurs less storage than LRU
