Computer Architecture Probabilistic L1 Cache Filtering


By Dan Tsafrir, 7/5/2012. Presentation based on slides by Yoav Etsion.

- Paper titled
- “L1 cache filtering through random selection of memory references”

- Authors
- Yoav Etsion and Dror G. Feitelson (from the Hebrew U.)

- Published in
- PACT 2007: the International Conference on Parallel Architectures and Compilation Techniques

- Can be downloaded from
- http://www.cs.technion.ac.il/~yetsion/papers/CorePact.pdf

- CPUs get more and more cache dependent
- Growing gap between CPU and memory
- Growing popularity of multi-core chips

- Common solution: larger caches, but…
- Larger caches consume more power
- Longer wire delays yield longer latencies

- Need smarter caches
- We focus on insertion policies
- Currently everything goes into the cache
- Need to predict which blocks will be evicted quickly…
- … and prevent them from polluting the caches

- Reducing pollution may enable use of low-latency, low-power, direct-mapped caches
- Less pollution yields fewer conflicts

- The Problem:
- “It is difficult to make predictions, especially about the future” (attributed to many)

background

- In statistics, a PDF is, roughly, a function f describing
- The likelihood to get some value in some domain

- For example, f can specify how many students have a first name comprised of exactly k Hebrew letters
- f(1) = 0%
- f(2) = 22% (דן, רם, שי, שי, שי, שי, גל, חן, חן, בן, גד, טל, לי, ...)
- f(3) = 24% (גיל, גיל, רון, שיר, שיר, שיר, שיר, נגה, משה, חיה, רחל, חנה, ...)
- f(4) = 25% (יואב, אחמד, אביב, מיכל, מיכל, נועה, נועה, נועה, נועה, נועה, ...)
- f(5) = 13% (אביטל, ירוחם, עירית, יהודה, חנניה, אביבה, אביתר, אביעד, ...)
- f(6) = 9% (יחזקאל, אביבית, אבינעם, אביגיל, שלומית, אבשלום, אדמונד, ...)
- f(7) = 6% (אביגדור, אבינועם, מתיתיהו, עמנואלה, אנסטסיה, ...)
- f(8) = 0.6% (אלכסנדרה, ...)
- f(9) = 0.4% (קונסטנטין, ...)
- f(10) = 0%

- Note that $\sum_k f(k) = 100\%$

- In statistics, a CDF is, roughly, a function F describing
- The likelihood to get some value in some domain, or less

- For example, F can specify how many students have a first name comprised of k Hebrew letters or fewer
- F(1) = 0% = f(1)
- F(2) = 22% = f(1)+f(2) = 0% + 22%
- F(3) = 46% = f(1)+f(2)+f(3) = 0% + 22% + 24%
- F(4) = 71% = f(1)+f(2)+f(3)+f(4) = 0% + 22% + 24% + 25%
- F(5) = 84% = F(4)+f(5) = 71% + 13%
- F(6) = 93% = F(5)+f(6) = 84% + 9%
- F(7) = 99% = F(6)+f(7) = 93% + 6%
- F(8) = 99.6% = F(7)+f(8) = 99% + 0.6%
- F(9) = 100% = F(8)+f(9) = 99.6% + 0.4%
- F(10) = 100%

- Generally, $F(x) = \sum_{k \le x} f(k)$, which is monotonically non-decreasing
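The relation between f and F can be checked with a few lines of C++. This is only a minimal sketch: the function names `cdf_from_pdf` and `name_length_pdf` are made up for illustration, and the percentages are stored in tenths of a percent (so 220 means 22.0%) purely to keep the arithmetic exact.

```cpp
#include <cassert>
#include <vector>

// Turn a PDF (one weight per value) into a CDF by keeping a running sum.
// Weights are in tenths of a percent, e.g. 220 stands for 22.0%.
std::vector<int> cdf_from_pdf(const std::vector<int>& pdf) {
    std::vector<int> cdf;
    cdf.reserve(pdf.size());
    int running = 0;
    for (int p : pdf) {
        assert(p >= 0);       // a PDF has no negative entries
        running += p;         // F(k) = F(k-1) + f(k)
        cdf.push_back(running);
    }
    return cdf;
}

// f(1)..f(10) from the name-length example, in tenths of a percent.
std::vector<int> name_length_pdf() {
    return {0, 220, 240, 250, 130, 90, 60, 6, 4, 0};
}
```

Calling `cdf_from_pdf(name_length_pdf())` yields 0, 220, 460, 710, 840, 930, 990, 996, 1000, 1000, which matches F(1) through F(10) in the example.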

- A “residency”
- Is what we call a block of memory
- From the time it was inserted to the cache
- Until the time it was evicted

- Each memory block can be associated with many residencies during a single execution

- “Residency length”
- Number of memory references (= load/store operation) served by the residency

- “The mass of residency length=k”
- Percent of memory references (throughout the entire program execution) that were served by residencies of length k

- At runtime, residency length is generated like so (assume C++):

#include <iostream>

class Cache_Line_Residency {
private:
    int counter;  // the residency length

public:
    // Constructor: a new object is allocated when a cache line is
    // allocated for a newly inserted memory block.
    Cache_Line_Residency() : counter(0) {}

    // Destructor: called when the block is evicted from the cache
    // (or when the program ends).
    ~Cache_Line_Residency() { std::cout << counter << std::endl; }

    // Invoked whenever the cache line is referenced
    // (read from or written to).
    void do_reference() { counter++; }
};

[Figure: residency lifecycle. ctor on insertion, do_reference() on each memory access, dtor on eviction]

- Assume:
- Size of cache: 4 bytes
- Size of cache line: 2 bytes (namely, there are two lines)
- Cache is direct-mapped => address x maps to cache byte x % 4 (line 0 holds bytes 0-1, line 1 holds bytes 2-3)
- A program references memory (order: top to bottom):

[Figure: timeline of the reference stream 0 1 3 3 0 4 7 4 4 6 0 0 … 0 through the two cache lines, showing residencies #1 to #5 and the value printed when each one ends: 3, 2, 3, and, at program end, 90 and 2]

So printed residency lengths are: 3, 2, 3, 90, 2

- Thus, CDF of residency length is:
- 40% of residencies have length <= 2 (= |[2,2]| / |[3,2,3,90,2]| = 2/5)
- 80% of residencies have length <= 3 (= |[2,2,3,3]| / |[3,2,3,90,2]| = 4/5)
- 100% of residencies have length <= 90 (= |[2,2,3,3,90]| / |[3,2,3,90,2]| = 5/5)
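The walk-through can be replayed in code. The sketch below simulates the toy direct-mapped cache (two lines of two bytes) and collects the residency lengths. The function names are hypothetical, and the number of trailing references to address 0 (elided as "…" on the slide) is assumed to be whatever makes the last residency of line 0 exactly 90 references long, as printed on the slide.

```cpp
#include <cassert>
#include <vector>

// Simulate a direct-mapped cache and return residency lengths in the order
// the residencies end: evictions first, then the lines still cached at
// program end (in line order).
std::vector<int> residency_lengths(const std::vector<unsigned>& trace,
                                   unsigned line_bytes, unsigned num_lines) {
    assert(line_bytes > 0 && num_lines > 0);
    struct Line { bool valid = false; unsigned block = 0; int counter = 0; };
    std::vector<Line> cache(num_lines);
    std::vector<int> printed;
    for (unsigned addr : trace) {
        unsigned block = addr / line_bytes;       // which memory block
        Line& line = cache[block % num_lines];    // direct mapping
        if (!line.valid || line.block != block) {
            if (line.valid) printed.push_back(line.counter);  // eviction
            line.valid = true;
            line.block = block;
            line.counter = 0;                     // a new residency starts
        }
        ++line.counter;                           // one more reference served
    }
    for (const Line& line : cache)                // program end
        if (line.valid) printed.push_back(line.counter);
    return printed;
}

// Reference stream from the slide; 90 trailing zeros is an assumption that
// reproduces the printed length of 90.
std::vector<unsigned> example_trace() {
    std::vector<unsigned> t = {0, 1, 3, 3, 0, 4, 7, 4, 4, 6};
    t.insert(t.end(), 90, 0u);
    return t;
}

// Count CDF: percent of residencies whose length is <= k.
int count_cdf_percent(const std::vector<int>& lengths, int k) {
    assert(!lengths.empty());
    int below = 0;
    for (int len : lengths) if (len <= k) ++below;
    return below * 100 / static_cast<int>(lengths.size());
}
```

With `residency_lengths(example_trace(), 2, 2)` the result is 3, 2, 3, 90, 2, and `count_cdf_percent` gives 40, 80, and 100 for k = 2, 3, and 90.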

[Figure: timeline of the example reference stream, with the CDF of residency lengths plotted beside it; y-axis: CDF (20% to 100%), x-axis: lengths (2, 3, 90)]

So printed residency lengths are: 3, 2, 3, 90, 2

- Thus, CDF of mass of references (“refs”) is:
- 4% of refs are to residencies with length <= 2 (= (2+2) / (3+2+3+90+2) = 4/100)
- 10% of refs are to residencies with length <= 3 (= (2+2+3+3) / (3+2+3+90+2) = 10/100)
- 100% of refs are to residencies with length <= 90 (= (2+2+3+3+90) / (3+2+3+90+2) = 100/100)
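The same arithmetic, as a small sketch (the function name `mass_cdf_percent` is hypothetical): instead of counting residencies, each residency is weighted by the number of references it served.

```cpp
#include <cassert>
#include <vector>

// Mass CDF: percent of all memory references that were served by
// residencies whose length is <= k.
int mass_cdf_percent(const std::vector<int>& lengths, int k) {
    long total = 0;   // all references
    long below = 0;   // references served by residencies of length <= k
    for (int len : lengths) {
        total += len;
        if (len <= k) below += len;
    }
    assert(total > 0);
    return static_cast<int>(below * 100 / total);
}
```

For the lengths 3, 2, 3, 90, 2 this returns 4, 10, and 100 for k = 2, 3, and 90.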

[Figure: CDF vs. residency length for the example, with two curves: residency lengths (the “counters”) and memory references (the “mass”); y-axis: CDF (20% to 100%), x-axis: lengths (2, 3, 90)]

- “Mass-count disparity” (disparity = difference, dissimilarity) is the term describing the phenomenon shown in the graph, whereby:
- most of the mass resides in very few counters, and
- most of the counters count very little mass

- Using all benchmarks from the SPEC-CPU 2000 benchmark suite
- In this presentation we show only four
- But we include all the rest in the averages

- The benchmarks were compiled for
- The Alpha AXP architecture

- All benchmarks were fast-forwarded 15 billion instructions (to skip any initialization code) and were then executed for another 2 billion instructions
- Unless otherwise stated, all simulated runs utilize a 16KB direct-mapped cache

Vast majority of residencies are relatively short

Which likely means they are transient

Small fraction of residencies are extremely long

[Figure: CDF of residency lengths for Crafty, Vortex, Facerec, and Spsl; curves: data, instructions; x-axis: length of residency]

Fraction of memory references serviced by each length

Most references target residencies longer than, say, 10

[Figure: CDF of memory references by residency length for Crafty, Vortex, Facerec, and Spsl; curves: data, instructions; x-axis: length of residency]

[Figure: count and mass CDFs for Crafty, Vortex, Facerec, and Spsl; curves: data, instructions; x-axis: length of residency]

- Every x value along the curves reveals how many of the residencies account for how much of the mass
- For example, in Crafty, 55% of the (shortest) residencies account for only 5% of the mass
- Which means the other 45% (longer) residencies account for 95% of the mass

The divergence between the distributions (= the mass-count disparity) can be quantified by the “joint ratio”

It’s a generalization of the proverbial 20/80 principle

Definition: the joint ratio is the unique point in the graphs where the sum of the two CDFs is 1

Example: in the case of Vortex, the joint ratio is 13/87 (blue arrow in middle of plot), meaning 13% of the (longest) residencies hold 87% of the mass of the memory references, while the remaining 87% of the residencies hold only 13% of the mass
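On measured data the two CDFs are step functions, so their sum rarely hits 1 exactly. One way to estimate the joint ratio, shown here only as a rough sketch and not as the procedure used in the paper, is to walk the residencies from shortest to longest and keep the point where the count fraction plus the mass fraction comes closest to 1. The names `JointRatio` and `approx_joint_ratio` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <numeric>
#include <vector>

struct JointRatio {
    double count_percent;  // percent of (shortest) residencies at the point
    double mass_percent;   // percent of references they account for
};

// Approximate joint ratio: the prefix of the sorted residencies for which
// count fraction + mass fraction is closest to 1.
JointRatio approx_joint_ratio(std::vector<int> lengths) {
    assert(!lengths.empty());
    std::sort(lengths.begin(), lengths.end());
    const long n = static_cast<long>(lengths.size());
    const long total = std::accumulate(lengths.begin(), lengths.end(), 0L);
    assert(total > 0);
    long prefix = 0, best_gap = -1, best_i = 0, best_prefix = 0;
    for (long i = 1; i <= n; ++i) {
        prefix += lengths[static_cast<std::size_t>(i - 1)];
        // |i/n + prefix/total - 1|, scaled by n*total to stay in integers
        long gap = std::labs(i * total + prefix * n - n * total);
        if (best_gap < 0 || gap < best_gap) {
            best_gap = gap;
            best_i = i;
            best_prefix = prefix;
        }
    }
    return {100.0 * best_i / n, 100.0 * best_prefix / total};
}
```

On the five-residency toy example (3, 2, 3, 90, 2) this stops at 80% of the residencies holding 10% of the mass. The two numbers do not sum to exactly 100 there because the sample is tiny; with millions of residencies the steps become fine enough that they nearly do.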

[Figure: count and mass CDFs with the joint ratio marked, for Crafty, Vortex, Facerec, and Spsl; curves: data, instructions; x-axis: length of residency]

[Figure: count and mass CDFs with W1/2 marked, for Crafty, Vortex, Facerec, and Spsl; curves: data, instructions; x-axis: length of residency]

- Definition: W1/2 is the overall mass (in %) of the shorter half of the residencies
- Example: in Vortex and Facerec, W1/2 is less than 5% of the references
- Average W1/2 across all benchmarks is < 10% (median of W1/2 is < 5%)

[Figure: count and mass CDFs with N1/2 marked, for Crafty, Vortex, Facerec, and Spsl; curves: data, instructions; x-axis: length of residency]

- Definition: N1/2 is the percentage of (longest) residencies that account for half of the mass
- Example: in Vortex and Facerec, N1/2 is less than 1% of the residencies
- Average N1/2 across all benchmarks is < 5% (median of N1/2 is < 1%)
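Both metrics are straightforward to compute from a list of residency lengths. The following is a sketch under two assumptions that the slides do not spell out: for an odd number of residencies the "shorter half" is rounded down, and N1/2 counts whole residencies until at least half of the mass is covered. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// W1/2: percent of all references served by the shorter half of the
// residencies (half rounded down for odd counts, an assumption).
double w_half_percent(std::vector<int> lengths) {
    assert(!lengths.empty());
    std::sort(lengths.begin(), lengths.end());
    const long total = std::accumulate(lengths.begin(), lengths.end(), 0L);
    assert(total > 0);
    const long half_mass = std::accumulate(
        lengths.begin(), lengths.begin() + lengths.size() / 2, 0L);
    return 100.0 * half_mass / total;
}

// N1/2: percent of residencies, taken longest first, needed to cover at
// least half of all references.
double n_half_percent(std::vector<int> lengths) {
    assert(!lengths.empty());
    std::sort(lengths.begin(), lengths.end(), std::greater<int>());
    const long total = std::accumulate(lengths.begin(), lengths.end(), 0L);
    assert(total > 0);
    long covered = 0;
    std::size_t used = 0;
    while (2 * covered < total) covered += lengths[used++];
    return 100.0 * used / lengths.size();
}
```

For the toy lengths 3, 2, 3, 90, 2, `w_half_percent` gives 4 (the two shortest residencies serve 4 of 100 references) and `n_half_percent` gives 20 (the single 90-reference residency, one of five, already covers half the mass).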

Let us utilize our understandings for…

- The mass-count disparity we’ve identified means
- A small number of long residencies account for most memory references, yet most residencies are short

- So when randomly selecting a residency
- It would likely be a short residency

- Which means we have a way to approximate the future:
- Given a block about to be inserted into cache, probabilistically speaking, we know with high degree of certainty that it’d be disadvantageous to actually insert it…
- So we won’t! Instead, we’ll flip a coin…
- Heads = insert block to cache (small probability)
- Tails = insert block to a small filter (high probability)

- Rationale
- Long residencies will enjoy many coin-flips, so chances are they’ll eventually get into the cache
- Conversely, short residencies have little chance to get in

- Design
- Direct-mapped L1 + small fully-associative filter w/ CAM
- Insertion policy for lines not in L1: for each mem ref, flip biased coin to decide if line goes into filter or into L1

- The filter’s data is stored in SRAM
- SRAM is cache memory, not to be confused with DRAM
- It holds blocks that, by the coin flip, shouldn’t be inserted into L1

- Usage
- First, search data in L1
- If not found, search in filter
- If not found, go to L2, and then use above insertion policy

- Result
- Long residencies end up in L1
- Short residencies tend to end up in filter

- Benefit of randomness
- Filtering is purely statistical, eliminating the need to save any state or reuse information!
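The lookup order and the coin flip can be put together in a short behavioral model. This is a sketch under stated assumptions, not the hardware design: the class name `FilteredCache` is invented, the filter replaces its oldest entry (FIFO) because the slides do not name a replacement policy, blocks evicted from L1 are simply dropped (they would be refetched from L2), and addresses are assumed to be already divided by the line size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <random>
#include <vector>

// Marker for an empty L1 line (assumes no real block has this number).
constexpr std::uint64_t kEmptyLine = ~std::uint64_t{0};

enum class Outcome { L1Hit, FilterHit, Miss };

class FilteredCache {
public:
    FilteredCache(std::size_t l1_lines, std::size_t filter_lines,
                  double p_l1, unsigned seed = 1)
        : l1_(l1_lines, kEmptyLine), filter_capacity_(filter_lines),
          p_l1_(p_l1), rng_(seed) {
        assert(l1_lines > 0 && filter_lines > 0);
        assert(p_l1 >= 0.0 && p_l1 <= 1.0);
    }

    // One memory reference to `block` (= address / line size).
    Outcome access(std::uint64_t block) {
        std::uint64_t& l1_slot = l1_[block % l1_.size()];   // direct-mapped
        if (l1_slot == block) return Outcome::L1Hit;        // 1) search L1

        auto in_filter = std::find(filter_.begin(), filter_.end(), block);
        const bool filter_hit = in_filter != filter_.end(); // 2) search filter

        // 3) Line is not in L1: flip the biased coin on this reference.
        if (coin_(rng_) < p_l1_) {                 // heads: goes into L1
            if (filter_hit) filter_.erase(in_filter);
            l1_slot = block;                       // old occupant is dropped
        } else if (!filter_hit) {                  // tails: goes into filter
            if (filter_.size() == filter_capacity_) filter_.pop_front();
            filter_.push_back(block);
        }
        return filter_hit ? Outcome::FilterHit : Outcome::Miss;
    }

private:
    std::vector<std::uint64_t> l1_;
    std::deque<std::uint64_t> filter_;             // FIFO order (assumption)
    std::size_t filter_capacity_;
    double p_l1_;
    std::uniform_real_distribution<double> coin_{0.0, 1.0};
    std::mt19937 rng_;
};
```

Setting `p_l1` to 1 turns the model into a plain direct-mapped cache, and setting it to 0 keeps every block in the filter forever; the interesting behavior lies in between, where a block must be referenced repeatedly before it is likely to win a coin flip and move into L1.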

- Explored filter sizes
- 1KB, 2KB, and 4KB
- Consisting of 16, 32, and 64 lines, respectively
- Results presented in these slides were achieved using a 2KB filter

Find the probability minimizing the miss-rate

High probability swamps cache

Low probability swamps filter

Constant selection probabilities seem sufficient

Data miss-rate reduced by ~25% for P = 5/100

Inst. miss-rate reduced by >60% for P = 1/1000

[Figure: reduction in miss rate; panels: Data, Instruction]

Random sampling with probability P turned out equivalent to periodic sampling at a rate of ~1/P

Do not need real randomness
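Periodic sampling needs nothing more than a counter. A minimal sketch, with the class name `PeriodicSelector` made up for illustration: it selects one reference out of every `period`, so a period of 20 stands in for P = 5/100 and a period of 1000 for P = 1/1000.

```cpp
#include <cassert>

// Selects one out of every `period` calls, replacing a coin with
// probability 1/period.
class PeriodicSelector {
public:
    explicit PeriodicSelector(unsigned period) : period_(period) {
        assert(period > 0);
    }

    // Returns true on every period-th call, false otherwise.
    bool select() {
        if (++count_ == period_) {
            count_ = 0;
            return true;
        }
        return false;
    }

private:
    unsigned period_;
    unsigned count_ = 0;
};
```

Over 1000 calls, a selector built with period 20 returns true exactly 50 times, the same rate a coin with P = 5/100 would give on average.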

Majority of memory refs serviced by L1 cache, whereas majority of blocks remain in the filter; specifically:

L1 services 80% - 90% of refs

With only ~35% of the blocks

[Figure: reduction in miss rate; panels: Data, Instruction]

- Fully-associative filter uses CAM (content addressable memory)
- Input = address; output (on a hit) = “pointer” into SRAM saying where’s the associated data
- CAM lookup done in parallel

- Parallel lookup drawbacks
- Wastes energy
- Is slower (relative to direct-mapped)

- Possible solution
- Introducing the “WLB”…

- WLB is a small direct-mapped lookup table caching the most recent CAM lookups
- (Recall: given an address, CAM returns a pointer into SRAM; it’s a search like any search and therefore can be cached)
- Fast, low-power lookups

- Filter usage once the WLB is added
- First, search for the data in L1
- In parallel, search for its address in the WLB
- If the data is not in L1 but the WLB hits: access the SRAM without the CAM
- If the data is not in L1 and the WLB misses: only then use the slower / wasteful CAM
- If still not found, go to L2 as usual
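The filter-side part of this flow can be modeled as follows. This is a software sketch with assumptions flagged: the names `FilterWithWlb`, `fill`, and `lookup` are invented, the CAM's parallel match is stood in for by a linear scan, the WLB is indexed by block number modulo 8, and a WLB entry is invalidated whenever the SRAM slot it points to is refilled (the slides do not describe how staleness is handled).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Lookup {
    bool hit;                // was the block found in the filter?
    std::size_t sram_index;  // where its data lives (valid only on a hit)
    bool used_cam;           // did this lookup have to search the CAM?
};

class FilterWithWlb {
public:
    explicit FilterWithWlb(std::size_t lines)
        : cam_(lines, 0), valid_(lines, false) {
        assert(lines > 0);
    }

    // Store `block` in SRAM slot `index`; drop WLB entries that pointed there.
    void fill(std::size_t index, std::uint64_t block) {
        assert(index < cam_.size());
        for (WlbEntry& e : wlb_)
            if (e.valid && e.sram_index == index) e.valid = false;
        cam_[index] = block;
        valid_[index] = true;
    }

    Lookup lookup(std::uint64_t block) {
        WlbEntry& e = wlb_[block % wlb_.size()];     // direct-mapped WLB
        if (e.valid && e.block == block)
            return {true, e.sram_index, false};      // WLB hit: skip the CAM
        for (std::size_t i = 0; i < cam_.size(); ++i) {
            if (valid_[i] && cam_[i] == block) {     // CAM match
                e = WlbEntry{true, block, i};        // cache the CAM result
                return {true, i, true};
            }
        }
        return {false, 0, true};                     // not in filter: go to L2
    }

private:
    struct WlbEntry { bool valid; std::uint64_t block; std::size_t sram_index; };
    std::array<WlbEntry, 8> wlb_{};                  // 8 entries, all invalid
    std::vector<std::uint64_t> cam_;                 // block tag per SRAM slot
    std::vector<bool> valid_;
};
```

The first `lookup` of a block that sits in the filter reports `used_cam == true`; an immediate second lookup of the same block reports `used_cam == false`, which is the saving the WLB is after.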

- WLB is quite effective with only 8 entries (for both I and D)
- Eliminates 77% of data CAM lookups
- Eliminates 98% of instruction CAM lookups

- Since the WLB is so small and simple (direct-mapped)
- It’s fast and consumes extremely low power
- Therefore, it can be looked up in parallel with the main L1 cache

[Figure: panels: Instruction, Data; x-axis: size of WLB (number of entries)]

- 4-wide, out-of-order micro-architecture (SimpleScalar)
- (You’ll understand this when we learn out-of-order execution)

- Simulated L1
- 16K, 32K, 64K, with several set-associative configurations; latency:
- Direct-mapped: 1 cycle
- Set-associative: 2 cycles

- Simulated filter
- 2K, fully-associative, with 8-entry WLB; latency: 5 cycles
- 1 cycle for the WLB (in parallel with accessing the cache)
- 3 cycles for the CAM lookup
- 1 cycle for the SRAM access

- Simulated L2
- 512K; latency: 16 cycles

- Simulated main-memory
- Latency: 350 cycles

Comparing random sampling filter cache to other common cache designs

Outperforms a 4-way cache double its size!

Interesting: DM’s low-latency compensates for conflict misses

[Figure: average relative improvement [%] of the 16K DM + filter and 32K DM + filter configurations compared with the other cache designs]

- Expectedly, DM-filtered loses to DM, because it’s more complex
- Direct mapped cache reduces dynamic power, but filter adds ~15% more leakage over 4-way
- Same size: 60%-80% reduction in dynamic power
- Double size: ~40% reduction in leakage

- The Mass-Count disparity phenomenon can be leveraged for caching policies
- Random Sampling effectively identifies frequently used blocks
- Adding just a 2K filter is better than doubling the cache size, both in terms of IPC and power

- The WLB is effective at eliminating costly CAM lookups
- Offering fast, low-power access while keeping the benefits of full associativity