Computer Architecture: Probabilistic L1 Cache Filtering

By Dan Tsafrir, 7/5/2012. Presentation based on slides by Yoav Etsion

Lecture is based on…
  • Paper titled
    • “L1 cache filtering through random selection of memory references”
  • Authors
    • Yoav Etsion and Dror G. Feitelson (from the Hebrew U.)
  • Published in
    • PACT 2007: the international conference on parallel architectures and compilation techniques
  • Can be downloaded from
    • http://www.cs.technion.ac.il/~yetsion/papers/CorePact.pdf
A Case for more efficient caches
  • CPUs get more and more cache dependent
    • Growing gap between CPU and memory
    • Growing popularity of multi-core chips
  • Common solution: larger caches, but…
    • Larger caches consume more power
    • Longer wire delays yield longer latencies
Efficiency through insertion policies
  • Need smarter caches
  • We focus on insertion policies
    • Currently everything goes into the cache
    • Need to predict which blocks will be evicted quickly…
    • … and prevent them from polluting the caches
  • Reducing pollution may enable use of low-latency, low-power, direct-mapped caches
    • Less pollution yields fewer conflicts
  • The Problem:
    • “It is difficult to make predictions, especially about the future” (attributed to many)
PDF (probability distribution function)
  • In statistics, a PDF is, roughly, a function f describing
    • The likelihood to get some value in some domain
  • For example, f can specify how many students have a first name comprised of exactly k Hebrew letters
    • f(1) = 0%
      f(2) = 22% (דן, רם, שי, שי, שי, שי, גל, חן, חן, בן, גד, טל, לי, ...)
      f(3) = 24% (גיל, גיל, רון, שיר, שיר, שיר, שיר, נגה, משה, חיה, רחל, חנה, ...)
      f(4) = 25% (יואב, אחמד, אביב, מיכל, מיכל, נועה, נועה, נועה, נועה, נועה, ...)
      f(5) = 13% (אביטל, ירוחם, עירית, יהודה, חנניה, אביבה, אביתר, אביעד, ...)
      f(6) = 9% (יחזקאל, אביבית, אבינעם, אביגיל, שלומית, אבשלום, אדמונד, ...)
      f(7) = 6% (אביגדור, אבינועם, מתיתיהו, עמנואלה, אנסטסיה, ...)
      f(8) = 0.6% (אלכסנדרה, ...)
      f(9) = 0.4% (קונסטנטין, ...)
      f(10) = 0%
  • Note that Σ_k f(k) = 100%
CDF (cumulative distribution function)
  • In statistics, a CDF is, roughly, a function F describing
    • The likelihood to get some value in some domain, or less
  • For example, F can specify how many students have a first name comprised of k Hebrew letters or fewer
    • F(1) = 0% = f(1)
      F(2) = 22% = f(1)+f(2) = 0%+22%
      F(3) = 46% = f(1)+f(2)+f(3) = 0%+22%+24%
      F(4) = 71% = f(1)+f(2)+f(3)+f(4) = 0%+22%+24%+25%
      F(5) = 84% = F(4)+f(5) = 71%+13%
      F(6) = 93% = F(5)+f(6) = 84%+9%
      F(7) = 99% = F(6)+f(7) = 93%+6%
      F(8) = 99.6% = F(7)+f(8) = 99%+0.6%
      F(9) = 100% = F(8)+f(9) = 99.6%+0.4%
      F(10) = 100% = F(9)+f(10) = 100%+0%
  • Generally, F(x) = Σ_{k ≤ x} f(k), and F is monotonically non-decreasing
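To make the F(k) = F(k-1) + f(k) relation concrete, here is a minimal C++ sketch (an illustration of mine, not material from the lecture) that recomputes the CDF above as a running prefix sum of the PDF values:

#include <cstdio>

int main() {
    // f[k] = fraction of students whose first name has exactly k Hebrew letters
    // (the PDF values from the previous slide; index 0 is unused).
    double f[11] = {0, 0.00, 0.22, 0.24, 0.25, 0.13, 0.09, 0.06, 0.006, 0.004, 0.00};
    double F = 0.0;                              // running prefix sum = the CDF
    for (int k = 1; k <= 10; ++k) {
        F += f[k];                               // F(k) = F(k-1) + f(k)
        std::printf("F(%d) = %.1f%%\n", k, 100.0 * F);
    }
    return 0;
}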
Cache residency
  • A “residency”
    • Is what we call a block of memory
      • From the time it was inserted into the cache
      • Until the time it was evicted
    • Each memory block can be associated with many residencies during a single execution
  • “Residency length”
    • Number of memory references (= load/store operations) served by the residency
  • “The mass of residency length=k”
    • Percent of memory references (throughout the entire program execution) that were served by residencies of length k
Computing residency length on-the-fly
  • At runtime, residency length is generated like so (assume C++):

#include <iostream>
using std::cout;
using std::endl;

class Cache_Line_Residency {

private:

    int counter; // the residency length

public:

    Cache_Line_Residency() {      // ctor: a new object is allocated when a cache line
        counter = 0;              // is allocated for a newly inserted memory block
    }

    ~Cache_Line_Residency() {     // dtor: called when the block is evicted from the
        cout << counter << endl;  // cache (or when the program ends)
    }

    void do_reference() {         // invoked whenever the cache line is referenced
        counter++;                // (read from or written to)
    }
};
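For illustration only (this driver is not part of the paper), the class above can be exercised by a tiny direct-mapped cache model; replacing a line's residency object destroys the old one, which prints its length:

#include <memory>
#include <vector>

struct Line {
    long tag = -1;                                // which memory block currently occupies the line
    std::unique_ptr<Cache_Line_Residency> res;    // its residency (prints its length when destroyed)
};

void access(std::vector<Line>& cache, long addr, int line_size) {
    long block = addr / line_size;                // memory block number
    Line& line = cache[block % cache.size()];     // direct-mapped placement
    if (line.tag != block) {                      // miss: the old residency ends here (its dtor prints)
        line.res = std::make_unique<Cache_Line_Residency>();
        line.tag = block;
    }
    line.res->do_reference();                     // count this reference toward the residency length
}

Feeding it the address stream of the next slide (two lines of 2 bytes each, i.e., line_size = 2) should print the residency lengths 3, 2, 3, 90, 2 used below.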

Example
  • Assume:
    • Size of cache: 4 bytes
    • Size of cache line: 2 bytes (namely, there are two lines)
    • Cache is direct-mapped => address x maps into cache byte x % 4
    • A program references memory (order: top to bottom):

Addresses referenced (in order): 0 1 3 3 0 4 7 4 4 6 0 0 … 0

[Slide diagram: the reference stream creates five residencies (#1–#5); each residency's counter is printed when its block is evicted or when the program ends, yielding the lengths 3, 2, 3, 90, and 2.]

Example – CDF of residency lengths

So printed residency lengths are: 3, 2, 3, 90, 2

  • Thus, CDF of residency length is:
  • 40% of residencies have length <= 2 = |[2,2]| / |[3,2,3,90,2]|
  • 80% of residencies have length <= 3 = |[2,2,3,3]| / |[3,2,3,90,2]|
  • 100% of residencies have length <= 90 = |[2,2,3,3,90]| / |[3,2,3,90,2]|

[Slide: the reference-stream diagram repeated, plus a plot of the CDF of residency lengths; the curve steps to 40% at length 2, 80% at length 3, and 100% at length 90.]

Example – CDF of mass of references

So printed residency lengths are: 3, 2, 3, 90, 2

  • Thus, CDF of mass of references (“refs”) is:
  • 4% of refs are to residencies with length <= 2 = (2+2) / (3+2+3+90+2)
  • 10% of refs are to residencies with length <= 3 = (2+2+3+3) / (3+2+3+90+2)
  • 100% of refs are to residencies with length <= 90 = (2+2+3+3+90) / (3+2+3+90+2)

[Slide: the reference-stream diagram repeated, plus a plot of the CDF of the mass of references; the curve rises to 4% at length 2, 10% at length 3, and 100% at length 90.]
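As an illustration (mine, not the paper's), both CDFs of this toy example can be recomputed from the printed lengths 3, 2, 3, 90, 2 with a short C++ sketch:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Printed residency lengths from the example: 3, 2, 3, 90, 2.
    std::vector<long> len = {3, 2, 3, 90, 2};
    std::sort(len.begin(), len.end());              // 2, 2, 3, 3, 90
    long total_refs = 0;
    for (long l : len) total_refs += l;             // 100 references in total

    long count_so_far = 0, mass_so_far = 0;
    for (std::size_t i = 0; i < len.size(); ++i) {
        ++count_so_far;                             // one more residency
        mass_so_far += len[i];                      // its references join the mass
        // Emit a CDF point after the last residency of each distinct length.
        if (i + 1 == len.size() || len[i + 1] != len[i])
            std::printf("length <= %3ld : count CDF %3.0f%%, mass CDF %3.0f%%\n",
                        len[i],
                        100.0 * count_so_far / len.size(),
                        100.0 * mass_so_far / total_refs);
    }
    return 0;
}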

Superimposing graphs

[Slide plot: the two example CDFs superimposed; the residency-length curve (the “counters”) rises early, while the memory-reference curve (the “mass”) stays low until the long residency of length 90.]

  • “Mass-count disparity” is the term describing the phenomenon shown in the graph, whereby:
  • most of the mass resides in very few counters, and
  • most of the counters count very little mass
Methodology
  • Using all benchmarks from the SPEC CPU2000 benchmark suite
    • In this presentation we show only four
    • But we include all the rest in the averages
  • The benchmarks were compiled for
    • The Alpha AXP architecture
  • All benchmarks were fast-forwarded 15 billion instructions (to skip any initialization code) and were then executed for another 2 billion instructions
  • Unless otherwise stated, all simulated runs utilize a 16KB direct-mapped cache
Vast majority of residencies are relatively short

Which likely means they are transient

Small fraction of residencies are extremely long

CDF of residency length (of 4 SPEC benchmark apps)

[Slide plots: CDFs of residency length for Crafty, Vortex, Facerec, and Spsl, with separate panels for data and instruction references; x-axis: length of residency, y-axis: CDF.]

Fraction of memory references serviced by each length

Most references target residencies longer than, say, 10

CDF of mass of residencies (of 4 SPEC benchmark apps)

[Slide plots: CDFs of the mass of residencies for the same four benchmarks, with separate panels for data and instruction references; x-axis: length of residency, y-axis: CDF.]

Superimposing graphs reveals mass-count disparity

[Slide plots: count and mass CDFs superimposed per benchmark (Crafty, Vortex, Facerec, Spsl), for data and instruction references; x-axis: length of residency, y-axis: CDF.]

  • Every x value along the curves reveals how many of the residencies account for how much of the mass
  • For example, in Crafty, 55% of the (shortest) residencies account for only 5% of the mass
  • Which means the other 45% (longer) residencies account for 95% of the mass
The divergence between the distributions (= the mass-count disparity) can be quantified by the “joint ratio”

It’s a generalization of the proverbial 20/80 principle

Definition: the joint ratio is the unique point in the graphs where the sum of the two CDFs is 1

Example: in the case of Vortex, the joint ratio is 13/87 (blue arrow in middle of plot), meaning 13% of the (longest) residencies hold 87% of the mass of the memory references, while the remaining 87% of the residencies hold only 13% of the mass

The joint-ratio mass-disparity metric

[Slide plots: the same per-benchmark count/mass CDFs, with the joint ratio marked at the point where the two CDFs sum to 100%.]
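A hedged sketch of how the joint ratio could be located numerically; the CdfPoint samples below are invented for illustration (only loosely shaped like the Vortex curves) and are not data from the paper:

#include <cmath>
#include <cstdio>
#include <vector>

// Given matching samples of the two CDFs, find the "joint ratio" point where
// count-CDF(x) + mass-CDF(x) is closest to 1 (per the definition above).
struct CdfPoint { double length, count_cdf, mass_cdf; };   // CDFs as fractions in [0,1]

CdfPoint joint_ratio(const std::vector<CdfPoint>& cdf) {
    CdfPoint best = cdf.front();
    for (const CdfPoint& p : cdf)
        if (std::fabs(p.count_cdf + p.mass_cdf - 1.0) <
            std::fabs(best.count_cdf + best.mass_cdf - 1.0))
            best = p;
    return best;
}

int main() {
    // Hypothetical samples, used only to demonstrate the computation.
    std::vector<CdfPoint> vortex = {
        {4, 0.55, 0.02}, {16, 0.75, 0.06}, {64, 0.87, 0.13}, {256, 0.95, 0.40}};
    CdfPoint jr = joint_ratio(vortex);
    // Prints a 13/87 joint ratio: 13% of the (longest) residencies hold 87% of the mass.
    std::printf("joint ratio: %.0f/%.0f at length %.0f\n",
                100 * (1 - jr.count_cdf), 100 * jr.count_cdf, jr.length);
    return 0;
}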

The W1/2 mass-disparity metric

[Slide plots: the per-benchmark count/mass CDFs with W1/2 marked, i.e., the mass accounted for by the shorter half of the residencies.]

  • Definition: overall mass (in %) of the shorter half of the residencies
  • Example: in Vortex and Facerec W1/2 is less than 5% of the references
  • Average W1/2 across all benchmarks is < 10% (median of W1/2 is < 5%)
The N1/2 mass-disparity metric

[Slide plots: the per-benchmark count/mass CDFs with N1/2 marked, i.e., the fraction of the longest residencies that accounts for half of the mass.]

  • Definition: % of longer residencies accounting for half of the mass
  • Example: in Vortex and Facerec N1/2 is less than 1% of the residencies
  • Average N1/2 across all benchmarks is < 5% (median of N1/2 is < 1%)
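A minimal sketch (mine, not the paper's tooling) computing W1/2 and N1/2 directly from their definitions, applied to the residency lengths of the earlier toy example:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<long> len = {3, 2, 3, 90, 2};       // lengths from the earlier example
    std::sort(len.begin(), len.end());
    long total = 0;
    for (long l : len) total += l;

    // W1/2: overall mass (in %) held by the shorter (integer) half of the residencies.
    long w_mass = 0;
    for (std::size_t i = 0; i < len.size() / 2; ++i) w_mass += len[i];
    std::printf("W1/2 = %.0f%% of the references\n", 100.0 * w_mass / total);

    // N1/2: % of the longest residencies needed to accumulate half of the mass.
    long n_mass = 0;
    std::size_t n = 0;
    for (std::size_t i = len.size(); i-- > 0 && 2 * n_mass < total; ) {
        n_mass += len[i];
        ++n;
    }
    std::printf("N1/2 = %.0f%% of the residencies\n", 100.0 * n / len.size());
    return 0;
}

On the toy data this prints W1/2 = 4% and N1/2 = 20%; the benchmark figures quoted above, of course, come from the full SPEC runs.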
Probabilistic insertion?
  • The mass-disparity we’ve identified means
    • A small number of long residencies account for most memory references; but still most residencies are short
  • So when randomly selecting a residency
    • It would likely be a short residency
  • Which means we have a way to approximate the future:
    • Given a block about to be inserted into the cache, we know, probabilistically speaking, with a high degree of certainty that it would be disadvantageous to actually insert it…
    • So we won’t! Instead, we’ll flip a coin…
      • Heads = insert the block into the cache (small probability)
      • Tails = insert the block into a small filter (high probability)
  • Rationale
    • Long residencies will enjoy many coin-flips, so chances are they’ll eventually get into the cache
    • Conversely, short residencies have little chance to get in
L1 with random filter
  • Design
    • Direct-mapped L1 + small fully-associative filter w/ CAM
    • Insertion policy for lines not in L1: for each mem ref, flip biased coin to decide if line goes into filter or into L1
  • The filter's data array is SRAM (i.e., cache memory)
    • Not to be confused with DRAM
    • It holds the blocks that, per the coin flip, should not be inserted into L1
  • Usage
    • First, search for the data in L1
    • If not found, search the filter
    • If not found, go to L2, and then apply the insertion policy above
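A structural sketch of this access and insertion flow. The class name, the sizes, and the plain array / hash map standing in for the L1 tags and the filter's CAM+SRAM pair are my assumptions, not the paper's design; in particular, flipping the coin on every reference that misses L1 (even on filter hits) is my reading of the insertion policy above:

#include <cstdlib>
#include <unordered_map>

struct Block { long tag = -1; };

class FilteredL1 {
public:
    // p_insert is the coin bias: the probability that a block goes into L1.
    explicit FilteredL1(double p_insert) : p_(p_insert) {}

    // Returns true on an L1 or filter hit; otherwise the block is (conceptually)
    // fetched from L2. In either case, a line not in L1 gets a biased coin flip.
    bool access(long block) {
        if (l1_[block % kL1Lines].tag == block) return true;   // 1) direct-mapped L1 lookup
        bool hit = filter_.count(block) > 0;                    // 2) fully-associative filter lookup
        if (biased_coin()) {                                    // 3) line is not in L1: flip the coin
            l1_[block % kL1Lines].tag = block;                  //    heads (probability p): into L1
            filter_.erase(block);
        } else if (!hit) {
            filter_[block].tag = block;                         //    tails: into the filter
        }                                                       //    (filter eviction omitted)
        return hit;
    }

private:
    static const int kL1Lines = 256;                 // e.g., 16KB with 64B lines (assumed)
    bool biased_coin() const { return std::rand() < p_ * RAND_MAX; }
    Block l1_[kL1Lines];                             // the direct-mapped L1 tags
    std::unordered_map<long, Block> filter_;         // stands in for the filter's CAM + SRAM
    double p_;
};

Long residencies therefore see many coin flips and almost surely migrate into L1 eventually, while short residencies usually live and die in the filter.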
L1 with random filter (cont.)
  • Result
    • Long residencies end up in L1
    • Short residencies tend to end up in filter
  • Benefit of randomness
    • Filtering is purely statistical, eliminating the need to save any state or reuse information!
  • Explored filter sizes
    • 1KB, 2KB, and 4KB
    • Consisting of 16, 32, and 64 lines, respectively
    • Results presented in these slides were obtained using a 2KB filter
Find the probability minimizing the miss-rate

High probability swamps cache

Low probability swamps filter

Constant selection probabilities seem sufficient

Data miss-rate reduced by ~25% for P = 5/100

Inst. miss-rate reduced by >60% for P = 1/1000

Exploring coin bias

[Slide plots: reduction in miss rate as a function of the selection probability P, shown separately for the data and instruction caches.]

Random sampling with probability P turned out to be equivalent to periodic sampling at a rate of ~1/P

Do not need real randomness

Majority of memory refs serviced by L1 cache, whereas majority of blocks remain in the filter; specifically:

L1 services 80% - 90% of refs

With only ~35% of the blocks

Exploring coin bias (cont.)

[Slide plots: the same miss-rate-reduction curves for data and instruction caches, repeated from the previous slide.]
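A minimal sketch of the implied simplification (my illustration, not the paper's hardware): a modulo counter that selects roughly every (1/P)-th insertion candidate in place of a true random coin:

// PeriodicSelector stands in for the biased coin: with period ~ 1/P it picks
// one out of every 'period' candidates for insertion into L1.
class PeriodicSelector {
public:
    explicit PeriodicSelector(unsigned period) : period_(period) {}

    // Returns true for one candidate out of every 'period_' candidates.
    bool select() {
        if (++count_ < period_) return false;
        count_ = 0;
        return true;
    }

private:
    unsigned period_;
    unsigned count_ = 0;
};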

Problem – CAM is wasteful & slow
  • Fully-associative filter uses CAM (content addressable memory)
    • Input = address; output (on a hit) = a “pointer” into the SRAM indicating where the associated data resides
    • CAM lookup done in parallel
  • Parallel lookup drawbacks
    • Wastes energy
    • Is slower (relative to direct-mapped)
  • Possible solution
    • Introducing the “WLB”…
Wordline Look-aside Buffer (WLB)
  • WLB is a small direct-mapped lookup table caching the most recent CAM lookups
    • (Recall: given an address, the CAM returns a pointer into the SRAM; it is a lookup like any other, so its results can be cached)
    • Fast, low-power lookups
  • Filter usage when adding the WLB to it
    • First, search for the data in L1
    • In parallel, search for its address in the WLB
    • If data not in L1 but WLB hits
      • Access the SRAM without CAM
    • If data not in L1 and WLB misses
      • Only then use the slower / wasteful CAM
    • If not found, go to L2 as usual
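A small sketch of the WLB idea; the 8-entry size comes from the next slide, but the code and names (Wlb, WlbEntry, fill) are my assumptions rather than the paper's hardware:

#include <array>

// A tiny direct-mapped table mapping a block address to the SRAM slot the CAM
// last reported for it, so repeated filter accesses can skip the CAM lookup.
struct WlbEntry { long tag = -1; int sram_slot = -1; };

class Wlb {
public:
    // Returns the cached SRAM slot for 'block', or -1 if the WLB misses.
    int lookup(long block) const {
        const WlbEntry& e = entries_[block % entries_.size()];
        return e.tag == block ? e.sram_slot : -1;
    }
    // Called after a (slow) CAM lookup to remember its result.
    void fill(long block, int sram_slot) {
        WlbEntry& e = entries_[block % entries_.size()];
        e.tag = block;
        e.sram_slot = sram_slot;
    }

private:
    std::array<WlbEntry, 8> entries_;   // 8 entries, as evaluated on the next slide
};

On a filter access the WLB is probed first (in parallel with the L1 lookup); only on a WLB miss does the slow CAM search run, after which fill() records its result for next time.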
Effectiveness of WLB?
  • WLB is quite effective with only 8 entries (for both I and D)
    • Eliminates 77% of CAM data lookups
    • Eliminates 98% of CAM instruction lookups
  • Since the WLB is so small and simple (direct-mapped)
    • It’s fast and consumes extremely low power
    • Therefore, it can be looked up in parallel with main L1 cache

[Slide plot: fraction of CAM lookups eliminated vs. size of WLB (number of entries), for instruction and data references.]

Methodology
  • 4-wide, out-of-order microarchitecture (SimpleScalar)
    • (You’ll understand this when we learn out-of-order execution)
  • Simulated L1
    • 16K, 32K, 64K, with several set-associative configurations; latency:
      • Direct-mapped: 1 cycle
      • Set-associative: 2 cycles
  • Simulated filter
    • 2K, fully-associative, with 8-entry WLB; latency: 5 cycles
      • 1 cycle = for WLB (in parallel to accessing the cache)
      • 3 cycles = for CAM lookup
      • 1 cycle = for SRAM access
  • Simulated L2
    • 512K; latency: 16 cycles
  • Simulated main-memory
    • Latency: 350 cycles
Comparing random sampling filter cache to other common cache designs

Outperforms a 4-way cache double its size!

Interesting: DM’s low-latency compensates for conflict misses

Results – runtime

[Slide plot: average relative improvement [%] of the 16K DM + filter and 32K DM + filter configurations over the compared designs.]

Results – power consumption
  • As expected, the filtered DM design loses to a plain DM cache here, because it is more complex
  • The direct-mapped cache reduces dynamic power, but the filter adds ~15% more leakage over a 4-way cache
  • Same size: 60%-80% reduction in dynamic power
  • Double size: ~40% reduction in leakage
Conclusions
  • The Mass-Count disparity phenomenon can be leveraged for caching policies
  • Random Sampling effectively identifies frequently used blocks
    • Adding just a 2KB filter is better than doubling the cache size, both in terms of IPC and power
  • The WLB is effective at eliminating costly CAM lookups
    • Offering fast, low-power access while maintaining the benefits of full associativity