
Adaptive Insertion Policies for High-Performance Caching


Presentation Transcript


  1. Adaptive Insertion Policies for High-Performance Caching. Aamer Jaleel, Simon C. Steely Jr., Joel Emer, Moinuddin K. Qureshi, Yale N. Patt. International Symposium on Computer Architecture (ISCA) 2007

  2. Background [Diagram: Proc → L1 → L2 → Memory; an L2 miss goes out to memory] Fast processor + slow memory → cache hierarchy: L1 (~2 cycles), L2 (~10 cycles), memory (~300 cycles). L1 misses → short latency, can be hidden. L2 misses → long latency, hurt performance. Important to reduce last-level (L2) cache misses

  3. Motivation • L1 for latency, L2 for capacity • Traditionally L2 is managed like L1 (typically LRU) • L1 filters temporal locality → poor locality at L2 • LRU causes thrashing when working set > cache size. Most lines remain unused between insertion and eviction

  4. Dead on Arrival (DoA) Lines. DoA lines: lines unused between insertion and eviction. [Chart: % DoA lines per benchmark] • For the 1MB 16-way L2, 60% of lines are DoA → ineffective use of cache space

  5. Why DoA Lines? [Plots: misses per 1000 instructions vs. cache size in MB, for art and mcf] • Streaming data → never reused; L2 caches don't help. • Working set of the application is greater than the cache size. Solution: if working set > cache size, retain some of the working set

  6. Overview Problem: LRU replacement is inefficient for L2 caches. Goal: A replacement policy with: 1. Low hardware overhead 2. Low complexity 3. High performance 4. Robustness across workloads. Proposal: A mechanism that reduces misses by 21% and has a total storage overhead of less than two bytes

  7. Outline • Introduction • Static Insertion Policies • Dynamic Insertion Policies • Summary

  8. Cache Insertion Policy • Two components of cache replacement: • Victim Selection: Which line to replace for the incoming line? (E.g. LRU, Random, FIFO, LFU) • Insertion Policy: Where is the incoming line placed in the replacement list? (E.g. insert incoming line at MRU position) Simple changes to the insertion policy can greatly improve cache performance for memory-intensive workloads
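
To make the two components concrete, here is a minimal sketch in C of a set's recency stack with victim selection and insertion separated; the structure and names (way_t, select_victim, insert_line) are illustrative, not taken from the paper.

    // Recency stack for one set: position 0 = MRU, ASSOC-1 = LRU.
    #define ASSOC 16
    typedef struct { int tag; } way_t;
    typedef struct { way_t stack[ASSOC]; } cache_set_t;

    // Victim selection: conventional LRU evicts the line at the LRU position.
    int select_victim(const cache_set_t *set) {
        (void)set;
        return ASSOC - 1;                       // index into the recency stack
    }

    // Insertion policy: where the incoming line enters the recency stack.
    // Traditional LRU inserts at MRU (position 0); LIP/BIP change only this step.
    void insert_line(cache_set_t *set, int tag, int insert_pos) {
        // Shift entries from insert_pos down to the victim slot by one,
        // then place the new line at insert_pos.
        for (int i = ASSOC - 1; i > insert_pos; i--)
            set->stack[i] = set->stack[i - 1];
        set->stack[insert_pos].tag = tag;
    }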

  9. LRU-Insertion Policy (LIP): choose the victim as usual, but do NOT promote the incoming line to MRU. Example recency stack (MRU → LRU): a b c d e f g h. Miss on 'i' with the traditional LRU policy: i a b c d e f g. Miss on 'i' with LIP: a b c d e f g i. Lines do not enter non-LRU positions unless reused

  10. Bimodal-Insertion Policy (BIP) LIP does not age older lines, so infrequently insert lines at the MRU position. Let e = bimodal throttle parameter: if ( rand() < e ) insert at MRU position; else insert at LRU position; For small e, BIP retains the thrashing protection of LIP while responding to changes in the working set
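
A minimal sketch of how LIP and BIP pick the insertion position, usable with the hypothetical insert_line() helper above; the throttle of 1/32 follows the configuration evaluated later in the talk, while the function names and RNG choice are illustrative.

    #include <stdlib.h>

    #define ASSOC   16
    #define MRU_POS 0
    #define LRU_POS (ASSOC - 1)

    // LIP: always insert at the LRU position; a line reaches MRU only if
    // it is reused while still in the cache.
    int lip_insert_pos(void) {
        return LRU_POS;
    }

    // BIP: with small probability e (here 1/32) insert at MRU so that old
    // lines eventually age out; otherwise behave like LIP.
    int bip_insert_pos(void) {
        const int throttle = 32;                 // e = 1/32
        return (rand() % throttle == 0) ? MRU_POS : LRU_POS;
    }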

  11. Circular Reference Model [Smith & Goodman, ISCA '84] Reference stream has T blocks and repeats N times. Cache has K blocks (K < T and N >> T). For small e, BIP retains the thrashing protection of LIP while adapting to changes in the working set
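
The slide's comparison table did not survive extraction, but the steady-state argument behind it can be sketched; the expressions below are a reconstruction of the standard analysis for this model, not numbers quoted from the talk.

    % Cyclic stream (a_1, ..., a_T) repeated N times, cache of K blocks, K < T.
    \[
      \mathrm{HitRate}_{\mathrm{LRU}} = 0, \qquad
      \mathrm{HitRate}_{\mathrm{LIP}} \approx \frac{K-1}{T}
    \]
    % LRU: every block is evicted before it recurs, so nothing ever hits.
    % LIP: new blocks enter at the LRU position, so after the first pass the
    % cache retains a_1, ..., a_{K-1} plus one churning way; every later pass
    % hits on those K-1 blocks, which is close to the best achievable here,
    % since at most K-1 of the T blocks can still be resident when they recur.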

  12. Results for LIP and BIP(e=1/32) [Chart: % reduction in L2 MPKI for LIP and BIP across benchmarks] Changes to the insertion policy increase misses for LRU-friendly workloads

  13. Outline • Introduction • Static Insertion Policies • Dynamic Insertion Policies • Summary

  14. Dynamic-Insertion Policy (DIP) • Two types of workloads: LRU-friendly or BIP-friendly • DIP can be implemented by: • Monitoring both policies (LRU and BIP) • Choosing the best-performing policy • Applying that policy to the cache Need a cost-effective implementation → "Set Dueling"

  15. DIP via "Set Dueling" [Diagram: misses to LRU sets increment an n-bit saturating counter, misses to BIP sets decrement it; the counter's MSB steers the follower sets] Divide the cache sets into three groups: • Dedicated LRU sets • Dedicated BIP sets • Follower sets (use the winner of LRU vs. BIP) An n-bit saturating counter: misses to LRU sets → counter++, misses to BIP sets → counter--. The counter decides the policy for the follower sets: • MSB = 0: use LRU • MSB = 1: use BIP. Monitor → choose → apply, using a single counter
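
A minimal sketch of set dueling in C, assuming the 1MB 16-way / 1024-set cache used in the talk; the leader-set assignment rule and the names (psel, is_lru_leader, dip_insert_pos) are illustrative rather than the paper's exact scheme.

    #include <stdlib.h>

    #define ASSOC           16
    #define MRU_POS         0
    #define LRU_POS         (ASSOC - 1)
    #define NUM_SETS        1024                  // 1MB, 16-way, 64B lines
    #define PSEL_BITS       10
    #define PSEL_MAX        ((1 << PSEL_BITS) - 1)
    #define LEADER_SPACING  (NUM_SETS / 32)       // 32 dedicated sets per policy

    unsigned psel = PSEL_MAX / 2;                 // n-bit saturating counter

    // BIP insertion (as in the earlier sketch): MRU with probability 1/32.
    int bip_insert_pos(void) { return (rand() % 32 == 0) ? MRU_POS : LRU_POS; }

    // Simple, illustrative assignment of the dedicated ("leader") sets.
    int is_lru_leader(int set) { return set % LEADER_SPACING == 0; }
    int is_bip_leader(int set) { return set % LEADER_SPACING == 1; }

    // Called on every L2 miss: misses in LRU leaders push the counter up,
    // misses in BIP leaders push it down.
    void psel_on_miss(int set) {
        if (is_lru_leader(set) && psel < PSEL_MAX) psel++;
        else if (is_bip_leader(set) && psel > 0)   psel--;
    }

    // Insertion position for the incoming line in a given set.
    int dip_insert_pos(int set) {
        if (is_lru_leader(set)) return MRU_POS;            // dedicated LRU sets
        if (is_bip_leader(set)) return bip_insert_pos();   // dedicated BIP sets
        int msb = (psel >> (PSEL_BITS - 1)) & 1;           // follower sets
        return msb ? bip_insert_pos() : MRU_POS;           // MSB = 1 -> use BIP
    }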

  16. Bounds on Dedicated Sets How many dedicated sets are required for "Set Dueling"? Let μ_LRU, σ_LRU, μ_BIP, σ_BIP be the average and standard deviation of misses for LRU and BIP, and P(Best) the probability of selecting the best policy. Then P(Best) = P(Z < r√n), where n = number of dedicated sets, Z = standard Gaussian variable, and r = |μ_LRU − μ_BIP| / √(σ_LRU² + σ_BIP²). (For the majority of workloads r > 0.2.) 32-64 dedicated sets are sufficient
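
A quick numerical check of this bound, assuming the standard-normal CDF Φ(x) = (1 + erf(x/√2))/2; the program and the values in its comment are illustrative arithmetic, not figures quoted from the paper.

    #include <math.h>
    #include <stdio.h>

    // P(Best) = P(Z < r * sqrt(n)) for a standard Gaussian Z.
    double p_best(double r, int n) {
        double x = r * sqrt((double)n);
        return 0.5 * (1.0 + erf(x / sqrt(2.0)));
    }

    int main(void) {
        // With r = 0.2 (a typical lower bound for most workloads):
        // n = 32 gives roughly 0.87 and n = 64 roughly 0.95, so a few dozen
        // dedicated sets per policy are enough to pick the better one.
        for (int n = 16; n <= 128; n *= 2)
            printf("n = %3d  P(Best) = %.3f\n", n, p_best(0.2, n));
        return 0;
    }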

  17. Results for DIP (32 dedicated sets) [Chart: % reduction in L2 MPKI for BIP and DIP across benchmarks] DIP reduces average MPKI by 21% and requires less than two bytes of storage overhead

  18. DIP vs. Other Policies [Chart: % reduction in L2 MPKI for (LRU+RND), (LRU+LFU), (LRU+MRU), DIP, Double(2MB), and OPT] DIP bridges two-thirds of the gap between LRU and OPT

  19. IPC Improvement Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB 16-way LRU. [Chart: % IPC improvement with DIP across benchmarks] DIP improves IPC by 9.3% on average

  20. Outline • Introduction • Static Insertion Policies • Dynamic Insertion Policies • Summary

  21. Summary LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction. The proposed change to the cache insertion policy (DIP) has: 1. Low hardware overhead: requires less than two bytes of storage 2. Low complexity: trivial to implement, no changes to the cache structure 3. High performance: reduces misses by 21%, two-thirds as good as OPT 4. Robustness across workloads: almost as good as LRU for LRU-friendly workloads

  22. Questions. Source code: www.ece.utexas.edu/~qk/dip

  23. DIP vs. LRU Across Cache Sizes [Chart: MPKI relative to 1MB LRU (%), smaller is better, for DIP and LRU at 1MB, 2MB, 4MB, and 8MB; benchmarks: art, mcf, swim, health, equake, Avg_16] MPKI reduces until the workload fits in the cache

  24. DIP with 1MB 8-way L2 Cache [Chart: % reduction in L2 MPKI] MPKI reduction with 8-way (19%) is similar to 16-way (21%)

  25. Interaction with Prefetching (PC-based stride prefetcher) [Chart: % reduction in L2 MPKI for DIP-NoPref, LRU-Pref, and DIP-Pref] DIP also works well in the presence of prefetching

  26. mcf snippet

  27. art snippet

  28. health mpki

  29. swim mpki

  30. DIP Bypass

  31. DIP (design and implementation)

  32. Random Replacement (Success Function) Cache contains K blocks and the reference stream contains T blocks. Probability that a block in the cache survives one eviction = (1 - 1/K). Expected number of evictions between successive references to a block = (T-1)·P_miss. Therefore P_hit = (1 - 1/K)^((T-1)·P_miss) = (1 - 1/K)^((T-1)·(1 - P_hit)). Iterative solution: start at P_hit = 0, which gives P_hit = (1 - 1/K)^(T-1) on the first step, and iterate to a fixed point.
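
A small sketch of the fixed-point iteration described above; the example values of K and T in main() are arbitrary placeholders, not from the talk.

    #include <math.h>
    #include <stdio.h>

    // Solve P_hit = (1 - 1/K)^((T-1) * (1 - P_hit)) by fixed-point iteration.
    double random_repl_hit_rate(double K, double T) {
        double p_hit = 0.0;                        // start at P_hit = 0
        for (int i = 0; i < 1000; i++) {
            double next = pow(1.0 - 1.0 / K, (T - 1.0) * (1.0 - p_hit));
            if (fabs(next - p_hit) < 1e-12) return next;
            p_hit = next;
        }
        return p_hit;
    }

    int main(void) {
        // Example: 1024-block cache, cyclic stream of 2048 distinct blocks.
        printf("P_hit = %.4f\n", random_repl_hit_rate(1024.0, 2048.0));
        return 0;
    }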
