
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)

High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer. Intel Corporation, VSSAD. International Symposium on Computer Architecture (ISCA 2010).


Presentation Transcript


  1. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP) Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer Intel Corporation, VSSAD International Symposium on Computer Architecture (ISCA 2010) HIgh PErformance Computing LAB

  2. Motivation • Factors making caching important • The increasing ratio of CPU speed to memory speed • Multi-core processors make shared cache management more challenging • LRU has been the standard replacement policy at the LLC • However, LRU has problems http://hipe.korea.ac.kr

  3. Problems with LRU Replacement

  4. Desired Behavior from Cache Replacement

  5. Prior Solutions • Working set larger than the cache → Preserve some of the working set in the cache • Dynamic Insertion Policy (DIP) → Thrash-resistance with minimal changes to HW • Recurring scans → Preserve the frequently referenced working set in the cache • Least Frequently Used (LFU) → addresses scans • LFU adds complexity and also performs poorly for recency-friendly workloads • GOAL: Design a high-performing, scan-resistant policy that requires minimal changes to HW

  6. Belady’s Optimal (OPT) Replacement Policy • Makes replacement decisions using perfect knowledge of the future reference order • Victim Selection Policy • Replaces the block that will be re-referenced furthest in the future
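OPT's victim selection can be sketched in a few lines. This is an illustrative model only (the function and variable names are mine, not from the paper), assuming the full future reference trace is available:

```python
# Belady's OPT victim selection: with perfect knowledge of the future
# reference stream, evict the resident block whose next re-reference is
# furthest away (or that is never referenced again).
def opt_victim(blocks, future_refs):
    def next_use(block):
        try:
            return future_refs.index(block)   # distance to next re-reference
        except ValueError:
            return float('inf')               # never used again: ideal victim
    return max(blocks, key=next_use)

# 'a' is re-referenced furthest in the future, so it is the victim.
print(opt_victim(['a', 'b', 'c'], ['b', 'c', 'b', 'a']))  # -> a
```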

  7. Practical Cache Replacement Policies • Make replacement decisions by predicting the future reference order • Victim Selection Policy • Replaces the block predicted to be re-referenced furthest in the future • Continually update predictions of the future reference order • Natural update opportunities are on cache fills and cache hits

  8. LRU Replacement in the Prediction Framework • The LRU chain maintains the re-reference prediction • The head of the chain (MRU) is predicted to be re-referenced soon • The tail of the chain (LRU) is predicted to be re-referenced far in the future • LRU predicts that blocks are re-referenced in the reverse order of reference • Rename the LRU chain to the “Re-Reference Prediction (RRP) Chain”

  9. Practicality of Chain-Based Replacement • Problem: Chain-based replacement is too expensive • log2(associativity) bits required per cache block • Solution: LRU chain positions can be quantized into different buckets • Each bucket corresponds to a predicted re-reference interval • The value of a bucket is called the Re-Reference Prediction Value (RRPV) • Hardware cost: ‘n’ bits per block
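The storage comparison above can be made concrete with a quick calculation (a sketch of the slide's arithmetic, using n = 2 as in the paper's running example):

```python
import math

# An exact RRP-chain position needs log2(associativity) bits per block,
# while a quantized RRPV needs only n bits regardless of associativity.
associativity = 16
chain_bits = math.ceil(math.log2(associativity))  # 4 bits per block
rrpv_bits = 2                                     # n = 2 buckets per block
print(chain_bits, rrpv_bits)  # -> 4 2
```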

  10. Representation of Quantized Replacement (n = 2)

  11. Emulating LRU with Quantized Buckets (n = 2) • Victim Selection Policy: Evict a block with the distant RRPV • If no distant RRPV (2^n - 1 = ‘3’) is found, increment all RRPVs and repeat • If multiple are found, a tie breaker is needed; this paper searches upward from physical way ‘0’ • Insertion Policy: Insert new blocks with RRPV = ‘0’ • Update Policy: A cache hit updates the block’s RRPV to ‘0’ • Example state: ways 0–7 hold tags f g c h d e a b with RRPVs 1 0 2 0 1 1 3 3; the victim is the first way with RRPV 3, the new block ‘s’ is inserted with RRPV ‘0’, and a hit resets the hit block’s RRPV to ‘0’
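The victim search described above can be sketched as follows (the structure and names are mine; n = 2, so the "distant" RRPV is 2^n - 1 = 3):

```python
# If no way holds a distant RRPV, all RRPVs are incremented and the search
# repeats; ties are broken by taking the lowest physical way, as in the slide.
DISTANT = 3

def select_victim(rrpvs):
    while DISTANT not in rrpvs:
        rrpvs[:] = [r + 1 for r in rrpvs]  # age every block, then rescan
    return rrpvs.index(DISTANT)            # lowest way wins ties

rrpvs = [1, 0, 2, 0, 1, 1, 3, 3]  # the slide's example state
print(select_victim(rrpvs))        # -> 6
```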

  12. Re-Reference Interval Prediction (RRIP) • The framework enables re-reference predictions to be tuned at insertion/update • Unlike LRU, can use a non-zero RRPV on insertion • Unlike LRU, can use a non-zero RRPV on cache hits • Static Re-Reference Interval Prediction (SRRIP) • Determines the best insertion/update prediction using profiling • Dynamic Re-Reference Interval Prediction (DRRIP) • Dynamically determines the best re-reference prediction at insertion

  13. Static RRIP Insertion Policy • Key idea: Do not give new blocks too much (or too little) time in the cache • Predict that a new cache block will not be re-referenced soon • Insert the new block with some RRPV other than ‘0’ • Similar to inserting in the “middle” of the RRP chain (however, it is not identical to a fixed insertion position on the RRP chain) • Example state: ways 0–7 hold tags b c e a g d f h with RRPVs 3 2 1 3 0 1 1 0; the new block ‘s’ replaces the victim and is inserted with RRPV ‘2’
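A minimal sketch of the SRRIP fill path (names are mine): the victim is chosen as on the previous slide, but the new block enters with a long re-reference prediction (RRPV = 2^n - 2 = 2) rather than 0, so it must earn a hit before it outlasts resident blocks.

```python
DISTANT, LONG = 3, 2  # 2^n - 1 and 2^n - 2 for n = 2

def srrip_insert(tags, rrpvs, new_tag):
    while DISTANT not in rrpvs:
        rrpvs[:] = [r + 1 for r in rrpvs]  # age all blocks until one is distant
    way = rrpvs.index(DISTANT)             # lowest distant way is the victim
    tags[way], rrpvs[way] = new_tag, LONG  # SRRIP: insert long, not near
    return way

tags = ['b', 'c', 'e', 'a', 'g', 'd', 'f', 'h']
rrpvs = [3, 2, 1, 3, 0, 1, 1, 0]          # the slide's example state
print(srrip_insert(tags, rrpvs, 's'))      # -> 0 (evicts 'b', inserts 's' at RRPV 2)
```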

  15. Static RRIP Update Policy on Cache Hits • Hit Priority (HP) • Like LRU, always update RRPV = ‘0’ on a cache hit • Intuition: predicts that blocks receiving hits after insertion will be re-referenced soon • Example state: ways 0–7 hold tags f g c h d e a b with RRPVs 1 0 2 0 1 1 3 3; the hit block’s RRPV is reset to ‘0’
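The Hit Priority update, sketched (the function name is mine): a hit promotes the block all the way to RRPV = 0, so a block that has hit once is not displaced by a later scan whose blocks sit at a higher RRPV.

```python
# HP update: a hit predicts the block will be re-referenced soon.
def on_hit(rrpvs, way):
    rrpvs[way] = 0

rrpvs = [1, 0, 2, 0, 1, 1, 3, 3]  # the slide's example state
on_hit(rrpvs, 0)                   # hit on way 0
print(rrpvs)                       # -> [0, 0, 2, 0, 1, 1, 3, 3]
```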

  16. Evaluation • Simulator • CMP$im • Baseline Processor • 4-way OOO with a 128-entry reorder buffer • L1: 4-way 32KB each for instruction and data caches (1 cycle) • L2: 8-way 256KB (10 cycles) • L3: 16-way 2MB (single core) / 8MB (4-core) (24 cycles) • Line size: 64B • 250-cycle penalty to main memory

  18. Evaluation • Workload • 250M instructions

  19. SRRIP Hit Priority Sensitivity to Cache Insertion Prediction at LLC

  21. Why does an insertion RRPV of 2^n - 2 work best? • Before a scan, the re-reference prediction of the active working set is ‘0’ • Recall, NRU (n = 1) is not scan-resistant • For scan resistance, the insertion RRPV MUST be different from the RRPV of the working-set blocks • A larger insertion RRPV tolerates larger scans • The maximum insertion prediction works best
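The scan-resistance argument can be checked with a toy simulation (entirely my own construction, not from the paper): a 4-way set whose working set {a, b, c} is re-referenced between scan misses {s1, s2, s3}. With SRRIP insertion (RRPV = 2) the scan blocks age to the distant RRPV first and evict each other; with LRU-like insertion (RRPV = 0) the scan displaces the working set and costs extra misses.

```python
DISTANT = 3  # 2^n - 1 for n = 2

def run(trace, insert_rrpv):
    # One 4-way set; 'x' is a stale block already predicted distant.
    tags, rrpvs, hits = ['a', 'b', 'c', 'x'], [0, 0, 0, 3], 0
    for ref in trace:
        if ref in tags:                       # hit: Hit Priority update
            hits += 1
            rrpvs[tags.index(ref)] = 0
        else:                                 # miss: select victim, then fill
            while DISTANT not in rrpvs:
                rrpvs[:] = [r + 1 for r in rrpvs]
            way = rrpvs.index(DISTANT)
            tags[way], rrpvs[way] = ref, insert_rrpv
    return hits

trace = ['a', 'b', 'c', 's1', 'a', 'b', 'c', 's2', 'a', 'b', 'c', 's3']
print(run(trace, insert_rrpv=2))  # -> 9 (every working-set reference hits)
print(run(trace, insert_rrpv=0))  # -> 6 (the scan evicts the working set)
```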

  22. DRRIP: Extending Scan-Resistant SRRIP to be Thrash-Resistant • Always using the same prediction for all insertions will thrash the cache • Need to preserve some fraction of the working set in the cache • Dynamic Re-Reference Interval Prediction • Dynamically selects between inserting blocks with 2^n - 1 and 2^n - 2 using Set Dueling • Inserting blocks with 2^n - 1 is the same as a “no update insertion”
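A sketch of the set-dueling choice (the class, counter width, and counter polarity are my assumptions, not the paper's exact design): dedicated SRRIP leader sets always insert at 2^n - 2, dedicated distant-insertion leader sets always insert at 2^n - 1, and a saturating counter PSEL tracks which group of leaders misses less so that all follower sets copy the winner.

```python
class SetDueling:
    def __init__(self, srrip_leaders, distant_leaders, bits=10):
        self.srrip = set(srrip_leaders)      # leaders fixed at RRPV = 2
        self.distant = set(distant_leaders)  # leaders fixed at RRPV = 3
        self.max = (1 << bits) - 1
        self.psel = self.max // 2            # start undecided

    def insertion_rrpv(self, set_index, miss):
        if set_index in self.srrip:
            if miss:                          # SRRIP leader missed: penalize it
                self.psel = min(self.psel + 1, self.max)
            return 2
        if set_index in self.distant:
            if miss:                          # distant leader missed: penalize it
                self.psel = max(self.psel - 1, 0)
            return 3
        # Followers: low PSEL means the SRRIP leaders missed less often.
        return 2 if self.psel < self.max // 2 else 3

duel = SetDueling(srrip_leaders={0}, distant_leaders={1})
duel.insertion_rrpv(1, miss=True)         # a distant-insertion leader misses
print(duel.insertion_rrpv(7, miss=True))  # -> 2 (followers now favor SRRIP)
```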

  23. Performance Comparison of Replacement Policies

  24. Total Storage Overhead (16-way Set-Associative Cache) • LRU: 4 bits / cache block • NRU: 1 bit / cache block • DRRIP-3: 3 bits / cache block • DRRIP outperforms LRU while requiring less storage

  25. Summary • Scan-resistance is an important problem in commercial workloads • State-of-the-art policies do not address scan-resistance • Propose a simple and practical replacement policy • Static RRIP (SRRIP) for scan-resistance • Dynamic RRIP (DRRIP) for thrash-resistance and scan-resistance • DRRIP requires ONLY 3 bits per block • In fact, it incurs less storage than LRU
