Computer Architecture: Probabilistic L1 Cache Filtering

By Dan Tsafrir, 7/5/2012. Presentation based on slides by Yoav Etsion

Lecture is based on…
  • Paper titled
    • “L1 cache filtering through random selection of memory references”
  • Authors
    • Yoav Etsion and Dror G. Feitelson (from the Hebrew U.)
  • Published in
    • PACT 2007: the international conference on parallel architectures and compilation techniques
  • Can be downloaded from
    • http://www.cs.technion.ac.il/~yetsion/papers/CorePact.pdf
A Case for more efficient caches
  • CPUs get more and more cache dependent
    • Growing gap between CPU and memory
    • Growing popularity of multi-core chips
  • Common solution: larger caches, but…
    • Larger caches consume more power
    • Longer wire delays yield longer latencies
Efficiency through insertion policies
  • Need smarter caches
  • We focus on insertion policies
    • Currently everything goes into the cache
    • Need to predict which blocks will be evicted quickly…
    • … and prevent them from polluting the caches
  • Reducing pollution may enable use of low-latency, low-power, direct-mapped caches
    • Less pollution yields fewer conflicts
  • The Problem:
    • “It is difficult to make predictions, especially about the future” (attributed to many)
PDF (probability distribution function)
  • In statistics, a PDF is, roughly, a function f describing
    • The likelihood to get some value in some domain
  • For example, f can specify how many students have a first name comprised of exactly k Hebrew letters
    • f(1) = 0%
      f(2) = 22% (דן, רם, שי, שי, שי, שי, גל, חן, חן, בן, גד, טל, לי, ...)
      f(3) = 24% (גיל, גיל, רון, שיר, שיר, שיר, שיר, נגה, משה, חיה, רחל, חנה, ...)
      f(4) = 25% (יואב, אחמד, אביב, מיכל, מיכל, נועה, נועה, נועה, נועה, נועה, ...)
      f(5) = 13% (אביטל, ירוחם, עירית, יהודה, חנניה, אביבה, אביתר, אביעד, ...)
      f(6) = 9% (יחזקאל, אביבית, אבינעם, אביגיל, שלומית, אבשלום, אדמונד, ...)
      f(7) = 6% (אביגדור, אבינועם, מתיתיהו, עמנואלה, אנסטסיה, ...)
      f(8) = 0.6% (אלכסנדרה, ...)
      f(9) = 0.4% (קונסטנטין, ...)
      f(10) = 0%
  • Note that Σ_k f(k) = 100%
CDF (cumulative distribution function)
  • In statistics, a CDF is, roughly, a function F describing
    • The likelihood to get some value in some domain, or less
  • For example, F can specify how many students have a first name comprised of k Hebrew letters or fewer
    • F(1) = 0% = f(1)
      F(2) = 22% = f(1)+f(2) = 0%+22%
      F(3) = 46% = f(1)+f(2)+f(3) = 0%+22%+24%
      F(4) = 71% = f(1)+f(2)+f(3)+f(4) = 0%+22%+24%+25%
      F(5) = 84% = F(4)+f(5) = 71%+13%
      F(6) = 93% = F(5)+f(6) = 84%+9%
      F(7) = 99% = F(6)+f(7) = 93%+6%
      F(8) = 99.6% = F(7)+f(8) = 99%+0.6%
      F(9) = 100% = F(8)+f(9) = 99.6%+0.4%
      F(10) = 100% = F(9)+f(10) = 100%+0%
  • Generally, F(x) = Σ_{k ≤ x} f(k), and F is monotonically non-decreasing
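To make the F(k) = F(k-1) + f(k) relation concrete, here is a minimal C++ sketch (an illustration of mine, not material from the lecture) that recomputes the CDF above as a running prefix sum of the PDF values:

#include <cstdio>

int main() {
    // f[k] = fraction of students whose first name has exactly k Hebrew letters
    // (the PDF values from the previous slide; index 0 is unused).
    double f[11] = {0, 0.00, 0.22, 0.24, 0.25, 0.13, 0.09, 0.06, 0.006, 0.004, 0.00};
    double F = 0.0;                              // running prefix sum = the CDF
    for (int k = 1; k <= 10; ++k) {
        F += f[k];                               // F(k) = F(k-1) + f(k)
        std::printf("F(%d) = %.1f%%\n", k, 100.0 * F);
    }
    return 0;
}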
Cache residency
  • A “residency”
    • Is what we call a block of memory
      • From the time it was inserted into the cache
      • Until the time it was evicted
    • Each memory block can be associated with many residencies during a single execution
  • “Residency length”
    • Number of memory references (= load/store operations) served by the residency
  • “The mass of residency length=k”
    • Percent of memory references (throughout the entire program execution) that were served by residencies of length k
Computing residency length on-the-fly
  • At runtime, residency length is generated like so (assume C++):

#include <iostream>
using std::cout;
using std::endl;

class Cache_Line_Residency {

private:

    int counter; // the residency length

public:

    Cache_Line_Residency() {      // ctor: a new object is allocated when a cache line
        counter = 0;              // is allocated for a newly inserted memory block
    }

    ~Cache_Line_Residency() {     // dtor: called when the block is evicted from the
        cout << counter << endl;  // cache (or when the program ends)
    }

    void do_reference() {         // invoked whenever the cache line is referenced
        counter++;                // (read from or written to)
    }
};
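For illustration only (this driver is not part of the paper), the class above can be exercised by a tiny direct-mapped cache model; replacing a line's residency object destroys the old one, which prints its length:

#include <memory>
#include <vector>

struct Line {
    long tag = -1;                                // which memory block currently occupies the line
    std::unique_ptr<Cache_Line_Residency> res;    // its residency (prints its length when destroyed)
};

void access(std::vector<Line>& cache, long addr, int line_size) {
    long block = addr / line_size;                // memory block number
    Line& line = cache[block % cache.size()];     // direct-mapped placement
    if (line.tag != block) {                      // miss: the old residency ends here (its dtor prints)
        line.res = std::make_unique<Cache_Line_Residency>();
        line.tag = block;
    }
    line.res->do_reference();                     // count this reference toward the residency length
}

Feeding it the address stream of the next slide (two lines of 2 bytes each, i.e., line_size = 2) should print the residency lengths 3, 2, 3, 90, 2 used below.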

Example
  • Assume:
    • Size of cache: 4 bytes
    • Size of cache line: 2 bytes (namely, there are two lines)
    • Cache is direct-mapped => address x maps into cache byte x % 4
    • A program references memory (order: top to bottom):

Addresses referenced (in order): 0 1 3 3 0 4 7 4 4 6 0 0 … 0

[Slide diagram: the reference stream creates five residencies (#1–#5); each residency's counter is printed when its block is evicted or when the program ends, yielding the lengths 3, 2, 3, 90, and 2.]

Example – CDF of residency lengths

So printed residency lengths are: 3, 2, 3, 90, 2

  • Thus, CDF of residency length is:
  • 40% of residencies have length <= 2 = |[2,2]| / |[3,2,3,90,2]|
  • 80% of residencies have length <= 3 = |[2,2,3,3]| / |[3,2,3,90,2]|
  • 100% of residencies have length <= 90 = |[2,2,3,3,90]| / |[3,2,3,90,2]|

[Slide: the reference-stream diagram repeated, plus a plot of the CDF of residency lengths; the curve steps to 40% at length 2, 80% at length 3, and 100% at length 90.]

Example – CDF of mass of references

So printed residency lengths are: 3, 2, 3, 90, 2

  • Thus, CDF of mass of references (“refs”) is:
  • 4% of refs are to residencies with length <= 2 = (2+2) / (3+2+3+90+2)
  • 10% of refs are to residencies with length <= 3 = (2+2+3+3) / (3+2+3+90+2)
  • 100% of refs are to residencies with length <= 90 = (2+2+3+3+90) / (3+2+3+90+2)

[Slide: the reference-stream diagram repeated, plus a plot of the CDF of the mass of references; the curve rises to 4% at length 2, 10% at length 3, and 100% at length 90.]
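As an illustration (mine, not the paper's), both CDFs of this toy example can be recomputed from the printed lengths 3, 2, 3, 90, 2 with a short C++ sketch:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Printed residency lengths from the example: 3, 2, 3, 90, 2.
    std::vector<long> len = {3, 2, 3, 90, 2};
    std::sort(len.begin(), len.end());              // 2, 2, 3, 3, 90
    long total_refs = 0;
    for (long l : len) total_refs += l;             // 100 references in total

    long count_so_far = 0, mass_so_far = 0;
    for (std::size_t i = 0; i < len.size(); ++i) {
        ++count_so_far;                             // one more residency
        mass_so_far += len[i];                      // its references join the mass
        // Emit a CDF point after the last residency of each distinct length.
        if (i + 1 == len.size() || len[i + 1] != len[i])
            std::printf("length <= %3ld : count CDF %3.0f%%, mass CDF %3.0f%%\n",
                        len[i],
                        100.0 * count_so_far / len.size(),
                        100.0 * mass_so_far / total_refs);
    }
    return 0;
}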

Superimposing graphs

[Slide plot: the two example CDFs superimposed; the residency-length curve (the “counters”) rises early, while the memory-reference curve (the “mass”) stays low until the long residency of length 90.]

  • “Mass-count disparity” is the term describing the phenomenon shown in the graph, whereby:
  • most of the mass resides in very few counters, and
  • most of the counters count very little mass
Methodology
  • Using all benchmarks from the SPEC CPU2000 benchmark suite
    • In this presentation we show only four
    • But we include all the rest in the averages
  • The benchmarks were compiled for
    • The Alpha AXP architecture
  • All benchmarks were fast-forwarded 15 billion instructions (to skip any initialization code) and were then executed for another 2 billion instructions
  • Unless otherwise stated, all simulated runs utilize a 16KB direct-mapped cache
Vast majority of residencies are relatively short

Which likely means they are transient

Small fraction of residencies are extremely long

CDF of residency length (of 4 SPEC benchmark apps)

[Slide plots: CDFs of residency length for Crafty, Vortex, Facerec, and Spsl, with separate panels for data and instruction references; x-axis: length of residency, y-axis: CDF.]

Fraction of memory references serviced by each length

Most references target residencies longer than, say, 10

CDF of mass of residencies (of 4 SPEC benchmark apps)

[Slide plots: CDFs of the mass of residencies for the same four benchmarks, with separate panels for data and instruction references; x-axis: length of residency, y-axis: CDF.]

Superimposing graphs reveals mass-count disparity

[Slide plots: count and mass CDFs superimposed per benchmark (Crafty, Vortex, Facerec, Spsl), for data and instruction references; x-axis: length of residency, y-axis: CDF.]

  • Every x value along the curves reveals how many of the residencies account for how much of the mass
  • For example, in Crafty, 55% of the (shortest) residencies account for only 5% of the mass
  • Which means the other 45% (longer) residencies account for 95% of the mass
The divergence between the distributions (= the mass-count disparity) can be quantified by the “joint ratio”

It’s a generalization of the proverbial 20/80 principle

Definition: the joint ratio is the unique point in the graphs where the sum of the two CDFs is 1

Example: in the case of Vortex, the joint ratio is 13/87 (blue arrow in middle of plot), meaning 13% of the (longest) residencies hold 87% of the mass of the memory references, while the remaining 87% of the residencies hold only 13% of the mass

The joint-ratio mass-disparity metric

[Slide plots: the same per-benchmark count/mass CDFs, with the joint ratio marked at the point where the two CDFs sum to 100%.]
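A hedged sketch of how the joint ratio could be located numerically; the CdfPoint samples below are invented for illustration (only loosely shaped like the Vortex curves) and are not data from the paper:

#include <cmath>
#include <cstdio>
#include <vector>

// Given matching samples of the two CDFs, find the "joint ratio" point where
// count-CDF(x) + mass-CDF(x) is closest to 1 (per the definition above).
struct CdfPoint { double length, count_cdf, mass_cdf; };   // CDFs as fractions in [0,1]

CdfPoint joint_ratio(const std::vector<CdfPoint>& cdf) {
    CdfPoint best = cdf.front();
    for (const CdfPoint& p : cdf)
        if (std::fabs(p.count_cdf + p.mass_cdf - 1.0) <
            std::fabs(best.count_cdf + best.mass_cdf - 1.0))
            best = p;
    return best;
}

int main() {
    // Hypothetical samples, used only to demonstrate the computation.
    std::vector<CdfPoint> vortex = {
        {4, 0.55, 0.02}, {16, 0.75, 0.06}, {64, 0.87, 0.13}, {256, 0.95, 0.40}};
    CdfPoint jr = joint_ratio(vortex);
    // Prints a 13/87 joint ratio: 13% of the (longest) residencies hold 87% of the mass.
    std::printf("joint ratio: %.0f/%.0f at length %.0f\n",
                100 * (1 - jr.count_cdf), 100 * jr.count_cdf, jr.length);
    return 0;
}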

The W1/2 mass-disparity metric

[Slide plots: the per-benchmark count/mass CDFs with W1/2 marked, i.e., the mass accounted for by the shorter half of the residencies.]

  • Definition: overall mass (in %) of the shorter half of the residencies
  • Example: in Vortex and Facerec W1/2 is less than 5% of the references
  • Average W1/2 across all benchmarks is < 10% (median of W1/2 is < 5%)
The N1/2 mass-disparity metric

[Slide plots: the per-benchmark count/mass CDFs with N1/2 marked, i.e., the fraction of the longest residencies that accounts for half of the mass.]

  • Definition: % of longer residencies accounting for half of the mass
  • Example: in Vortex and Facerec N1/2 is less than 1% of the residencies
  • Average N1/2 across all benchmarks is < 5% (median of N1/2 is < 1%)
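A minimal sketch (mine, not the paper's tooling) computing W1/2 and N1/2 directly from their definitions, applied to the residency lengths of the earlier toy example:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<long> len = {3, 2, 3, 90, 2};       // lengths from the earlier example
    std::sort(len.begin(), len.end());
    long total = 0;
    for (long l : len) total += l;

    // W1/2: overall mass (in %) held by the shorter (integer) half of the residencies.
    long w_mass = 0;
    for (std::size_t i = 0; i < len.size() / 2; ++i) w_mass += len[i];
    std::printf("W1/2 = %.0f%% of the references\n", 100.0 * w_mass / total);

    // N1/2: % of the longest residencies needed to accumulate half of the mass.
    long n_mass = 0;
    std::size_t n = 0;
    for (std::size_t i = len.size(); i-- > 0 && 2 * n_mass < total; ) {
        n_mass += len[i];
        ++n;
    }
    std::printf("N1/2 = %.0f%% of the residencies\n", 100.0 * n / len.size());
    return 0;
}

On the toy data this prints W1/2 = 4% and N1/2 = 20%; the benchmark figures quoted above, of course, come from the full SPEC runs.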
Probabilistic insertion?
  • The mass-disparity we’ve identified means
    • A small number of long residencies account for most memory references; but still most residencies are short
  • So when randomly selecting a residency
    • It would likely be a short residency
  • Which means we have a way to approximate the future:
    • Given a block about to be inserted into the cache, we know, probabilistically speaking, with a high degree of certainty that it would be disadvantageous to actually insert it…
    • So we won’t! Instead, we’ll flip a coin…
      • Heads = insert the block into the cache (small probability)
      • Tails = insert the block into a small filter (high probability)
  • Rationale
    • Long residencies will enjoy many coin-flips, so chances are they’ll eventually get into the cache
    • Conversely, short residencies have little chance to get in
L1 with random filter
  • Design
    • Direct-mapped L1 + small fully-associative filter w/ CAM
    • Insertion policy for lines not in L1: for each mem ref, flip biased coin to decide if line goes into filter or into L1
  • The filter's data array is SRAM (i.e., cache memory)
    • Not to be confused with DRAM
    • It holds the blocks that, per the coin flip, should not be inserted into L1
  • Usage
    • First, search for the data in L1
    • If not found, search the filter
    • If not found, go to L2, and then apply the insertion policy above
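A structural sketch of this access and insertion flow. The class name, the sizes, and the plain array / hash map standing in for the L1 tags and the filter's CAM+SRAM pair are my assumptions, not the paper's design; in particular, flipping the coin on every reference that misses L1 (even on filter hits) is my reading of the insertion policy above:

#include <cstdlib>
#include <unordered_map>

struct Block { long tag = -1; };

class FilteredL1 {
public:
    // p_insert is the coin bias: the probability that a block goes into L1.
    explicit FilteredL1(double p_insert) : p_(p_insert) {}

    // Returns true on an L1 or filter hit; otherwise the block is (conceptually)
    // fetched from L2. In either case, a line not in L1 gets a biased coin flip.
    bool access(long block) {
        if (l1_[block % kL1Lines].tag == block) return true;   // 1) direct-mapped L1 lookup
        bool hit = filter_.count(block) > 0;                    // 2) fully-associative filter lookup
        if (biased_coin()) {                                    // 3) line is not in L1: flip the coin
            l1_[block % kL1Lines].tag = block;                  //    heads (probability p): into L1
            filter_.erase(block);
        } else if (!hit) {
            filter_[block].tag = block;                         //    tails: into the filter
        }                                                       //    (filter eviction omitted)
        return hit;
    }

private:
    static const int kL1Lines = 256;                 // e.g., 16KB with 64B lines (assumed)
    bool biased_coin() const { return std::rand() < p_ * RAND_MAX; }
    Block l1_[kL1Lines];                             // the direct-mapped L1 tags
    std::unordered_map<long, Block> filter_;         // stands in for the filter's CAM + SRAM
    double p_;
};

Long residencies therefore see many coin flips and almost surely migrate into L1 eventually, while short residencies usually live and die in the filter.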
L1 with random filter (cont.)
  • Result
    • Long residencies end up in L1
    • Short residencies tend to end up in filter
  • Benefit of randomness
    • Filtering is purely statistical, eliminating the need to save any state or reuse information!
  • Explored filter sizes
    • 1KB, 2KB, and 4KB
    • Consisting of 16, 32, and 64 lines, respectively
    • Results presented in these slides were obtained using a 2KB filter
Find the probability minimizing the miss-rate

High probability swamps cache

Low probability swamps filter

Constant selection probabilities seem sufficient

Data miss-rate reduced by ~25% for P = 5/100

Inst. miss-rate reduced by >60% for P = 1/1000

Exploring coin bias

[Slide plots: reduction in miss rate as a function of the selection probability P, shown separately for the data and instruction caches.]

Random sampling with probability P turned out to be equivalent to periodic sampling at a rate of ~1/P

Do not need real randomness

Majority of memory refs serviced by L1 cache, whereas majority of blocks remain in the filter; specifically:

L1 services 80% - 90% of refs

With only ~35% of the blocks

Exploring coin bias (cont.)

[Slide plots: the same miss-rate-reduction curves for data and instruction caches, repeated from the previous slide.]
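A minimal sketch of the implied simplification (my illustration, not the paper's hardware): a modulo counter that selects roughly every (1/P)-th insertion candidate in place of a true random coin:

// PeriodicSelector stands in for the biased coin: with period ~ 1/P it picks
// one out of every 'period' candidates for insertion into L1.
class PeriodicSelector {
public:
    explicit PeriodicSelector(unsigned period) : period_(period) {}

    // Returns true for one candidate out of every 'period_' candidates.
    bool select() {
        if (++count_ < period_) return false;
        count_ = 0;
        return true;
    }

private:
    unsigned period_;
    unsigned count_ = 0;
};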

Problem – CAM is wasteful & slow
  • Fully-associative filter uses CAM (content addressable memory)
    • Input = address; output (on a hit) = a “pointer” into the SRAM indicating where the associated data resides
    • CAM lookup done in parallel
  • Parallel lookup drawbacks
    • Wastes energy
    • Is slower (relative to direct-mapped)
  • Possible solution
    • Introducing the “WLB”…
Wordline Look-aside Buffer (WLB)
  • WLB is a small direct-mapped lookup table caching the most recent CAM lookups
    • (Recall: given an address, the CAM returns a pointer into the SRAM; it is a lookup like any other, so its results can be cached)
    • Fast, low-power lookups
  • Filter usage when adding the WLB to it
    • First, search for the data in L1
    • In parallel, search for its address in the WLB
    • If data not in L1 but WLB hits
      • Access the SRAM without CAM
    • If data not in L1 and WLB misses
      • Only then use the slower / wasteful CAM
    • If not found, go to L2 as usual
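A small sketch of the WLB idea; the 8-entry size comes from the next slide, but the code and names (Wlb, WlbEntry, fill) are my assumptions rather than the paper's hardware:

#include <array>

// A tiny direct-mapped table mapping a block address to the SRAM slot the CAM
// last reported for it, so repeated filter accesses can skip the CAM lookup.
struct WlbEntry { long tag = -1; int sram_slot = -1; };

class Wlb {
public:
    // Returns the cached SRAM slot for 'block', or -1 if the WLB misses.
    int lookup(long block) const {
        const WlbEntry& e = entries_[block % entries_.size()];
        return e.tag == block ? e.sram_slot : -1;
    }
    // Called after a (slow) CAM lookup to remember its result.
    void fill(long block, int sram_slot) {
        WlbEntry& e = entries_[block % entries_.size()];
        e.tag = block;
        e.sram_slot = sram_slot;
    }

private:
    std::array<WlbEntry, 8> entries_;   // 8 entries, as evaluated on the next slide
};

On a filter access the WLB is probed first (in parallel with the L1 lookup); only on a WLB miss does the slow CAM search run, after which fill() records its result for next time.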
Effectiveness of WLB?
  • WLB is quite effective with only 8 entries (for both I and D)
    • Eliminates 77% of CAM data lookups
    • Eliminates 98% of CAM instruction lookups
  • Since the WLB is so small and simple (direct-mapped)
    • It’s fast and consumes extremely low power
    • Therefore, it can be looked up in parallel with main L1 cache

[Slide plot: fraction of CAM lookups eliminated vs. size of WLB (number of entries), for instruction and data references.]

Methodology
  • 4-wide, out-of-order microarchitecture (SimpleScalar)
    • (You’ll understand this when we learn out-of-order execution)
  • Simulated L1
    • 16K, 32K, 64K, with several set-associative configurations; latency:
      • Direct-mapped: 1 cycle
      • Set-associative: 2 cycles
  • Simulated filter
    • 2K, fully-associative, with 8-entry WLB; latency: 5 cycles
      • 1 cycle = for WLB (in parallel to accessing the cache)
      • 3 cycles = for CAM lookup
      • 1 cycle = for SRAM access
  • Simulated L2
    • 512K; latency: 16 cycles
  • Simulated main-memory
    • Latency: 350 cycles
Comparing random sampling filter cache to other common cache designs

Outperforms a 4-way cache double its size!

Interesting: DM’s low-latency compensates for conflict misses

Results – runtime

[Slide plot: average relative improvement [%] of the 16K DM + filter and 32K DM + filter configurations over the compared designs.]

Results – power consumption
  • As expected, the filtered DM design loses to a plain DM cache here, because it is more complex
  • The direct-mapped cache reduces dynamic power, but the filter adds ~15% more leakage over a 4-way cache
  • Same size: 60%-80% reduction in dynamic power
  • Double size: ~40% reduction in leakage
Conclusions
  • The Mass-Count disparity phenomenon can be leveraged for caching policies
  • Random Sampling effectively identifies frequently used blocks
    • Adding just a 2KB filter is better than doubling the cache size, both in terms of IPC and power
  • The WLB is effective at eliminating costly CAM lookups
    • Offering fast, low-power access while maintaining the benefits of full associativity