
Computer Architecture: Probabilistic L1 Cache Filtering

By Dan Tsafrir, 7/5/2012. Presentation based on slides by Yoav Etsion.



Lecture is based on…

  • Paper titled

    • “L1 cache filtering through random selection of memory references”

  • Authors

    • Yoav Etsion and Dror G. Feitelson (from the Hebrew U.)

  • Published in

    • PACT 2007: the International Conference on Parallel Architectures and Compilation Techniques

  • Can be downloaded from

    • http://www.cs.technion.ac.il/~yetsion/papers/CorePact.pdf



Motivation



A Case for more efficient caches

  • CPUs are becoming more and more cache-dependent

    • Growing gap between CPU and memory

    • Growing popularity of multi-core chips

  • Common solution: larger caches, but…

    • Larger caches consume more power

    • Longer wire delays yield longer latencies



Efficiency through insertion policies

  • Need smarter caches

  • We focus on insertion policies

    • Currently everything goes into the cache

    • Need to predict which blocks will be evicted quickly…

    • … and prevent them from polluting the caches

  • Reducing pollution may enable use of low-latency, low-power, direct-mapped caches

    • Less pollution yields fewer conflicts

  • The Problem:

    • “It is difficult to make predictions, especially about the future” (attributed to many)



Background

CDF, residency lengths, mass-count disparity



PDF (probability distribution function)

  • In statistics, a PDF is, roughly, a function f describing

    • The likelihood to get some value in some domain

  • For example, f can specify how many students have a first name comprised of exactly k Hebrew letters

    • f(1) = 0%

    • f(2) = 22% (דן, רם, שי, שי, שי, שי, גל, חן, חן, בן, גד, טל, לי, ...)

    • f(3) = 24% (גיל, גיל, רון, שיר, שיר, שיר, שיר, נגה, משה, חיה, רחל, חנה, ...)

    • f(4) = 25% (יואב, אחמד, אביב, מיכל, מיכל, נועה, נועה, נועה, נועה, נועה, ...)

    • f(5) = 13% (אביטל, ירוחם, עירית, יהודה, חנניה, אביבה, אביתר, אביעד, ...)

    • f(6) = 9% (יחזקאל, אביבית, אבינעם, אביגיל, שלומית, אבשלום, אדמונד, ...)

    • f(7) = 6% (אביגדור, אבינועם, מתיתיהו, עמנואלה, אנסטסיה, ...)

    • f(8) = 0.6% (אלכסנדרה, ...)

    • f(9) = 0.4% (קונסטנטין, ...)

    • f(10) = 0%

  • Note that Σ_k f(k) = 100%



CDF (cumulative distribution function)

  • In statistics, a CDF is, roughly, a function F describing

    • The likelihood to get some value in some domain, or less

  • For example, F can specify how many students have a first name comprised of at most k Hebrew letters

    • F(1) = 0% = f(1)

    • F(2) = 22% = F(1) + f(2) = 0% + 22%

    • F(3) = 46% = F(2) + f(3) = 22% + 24%

    • F(4) = 71% = F(3) + f(4) = 46% + 25%

    • F(5) = 84% = F(4) + f(5) = 71% + 13%

    • F(6) = 93% = F(5) + f(6) = 84% + 9%

    • F(7) = 99% = F(6) + f(7) = 93% + 6%

    • F(8) = 99.6% = F(7) + f(8) = 99% + 0.6%

    • F(9) = 100% = F(8) + f(9) = 99.6% + 0.4%

    • F(10) = 100% = F(9) + f(10) = 100% + 0%

  • Generally, F(x) = Σ_{k ≤ x} f(k), which is monotonically non-decreasing (see the sketch below)
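To make the PDF-to-CDF relationship concrete, here is a minimal C++ sketch (not from the slides; the name-length percentages are the ones quoted above) that accumulates the PDF f into the CDF F:

    #include <cstdio>

    int main() {
        // f[k] = percentage of students whose first name has exactly k Hebrew letters
        // (the PDF values quoted in the slide above; index 0 is unused)
        double f[11] = {0, 0, 22, 24, 25, 13, 9, 6, 0.6, 0.4, 0};

        double F = 0;  // running cumulative sum: F(k) = f(1) + ... + f(k)
        for (int k = 1; k <= 10; ++k) {
            F += f[k];
            std::printf("F(%d) = %.1f%%\n", k, F);
        }
        // the last value printed is 100%, i.e., the sum over k of f(k) is 100%
    }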



Cache residency

  • A “residency”

    • Is what we call a block of memory

      • From the time it was inserted to the cache

      • Until the time it was evicted

    • Each memory block can be associated with many residencies during a single execution

  • “Residency length”

    • Number of memory references (= load/store operations) served by the residency

  • “The mass of residency length=k”

    • Percent of memory references (throughout the entire program execution) that were served by residencies of length k



Computing residency length on-the-fly

  • At runtime, residency length is generated like so (assume C++):

    #include <iostream>

    class Cache_Line_Residency {
    private:
        int counter;                    // the residency length

    public:
        Cache_Line_Residency() {
            // constructor: a new object is allocated when a cache line
            // is allocated for a newly inserted memory block
            counter = 0;
        }

        ~Cache_Line_Residency() {
            // destructor: called when the block is evicted from the cache
            // (or when the program ends)
            std::cout << counter << std::endl;
        }

        void do_reference() {
            // invoked whenever the cache line is referenced
            // (read from or written to)
            counter++;
        }
    };




Example

  • Assume:

    • Size of cache: 4 bytes

    • Size of cache line: 2 bytes (namely, there are two lines)

    • Cache is direct-mapped => address x maps into cache byte x % 4 (i.e., line (x % 4) / 2)

    • A program references memory (order: top to bottom):

      0 1 3 3 0 4 7 4 4 6 0 0 … 0

  [Slide diagram: the trace above is walked reference by reference; each address either hits the residency currently occupying its cache line or evicts it and starts a new one. Five residencies (#1–#5) are created, and the lengths printed on eviction / at program end are 3, 2, 3, 90, 2. A simulation sketch of this example follows.]
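The following is a minimal, self-contained C++ sketch (hypothetical code, not from the slides) that simulates this exact example — a 4-byte direct-mapped cache with 2-byte lines — and prints each residency length on eviction and at program end, reproducing the values 3, 2, 3, 90, 2:

    #include <iostream>
    #include <vector>

    int main() {
        const int NUM_LINES = 2;       // 4-byte cache, 2-byte lines
        int  tag[NUM_LINES];           // block number currently cached in each line
        int  counter[NUM_LINES];       // residency length of the block in each line
        bool valid[NUM_LINES];
        for (int i = 0; i < NUM_LINES; ++i) { valid[i] = false; tag[i] = 0; counter[i] = 0; }

        // the reference trace from the slide: 0 1 3 3 0 4 7 4 4 6, then 0 repeated 90 times
        std::vector<int> trace = {0, 1, 3, 3, 0, 4, 7, 4, 4, 6};
        for (int i = 0; i < 90; ++i) trace.push_back(0);

        for (int addr : trace) {
            int block = addr / 2;             // 2-byte lines => block number
            int line  = block % NUM_LINES;    // direct mapping (same as (addr % 4) / 2)
            if (!valid[line] || tag[line] != block) {                      // miss
                if (valid[line]) std::cout << counter[line] << std::endl;  // evict: print length
                valid[line] = true; tag[line] = block; counter[line] = 0;  // new residency
            }
            counter[line]++;                  // every reference is served by some residency
        }
        for (int i = 0; i < NUM_LINES; ++i)   // program end: remaining residencies also print
            if (valid[i]) std::cout << counter[i] << std::endl;
        // output: 3 2 3 90 2
    }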



Example – CDF of residency lengths

So printed residency lengths are: 3, 2, 3, 90, 2

  • Thus, CDF of residency length is:

  • 40% of residencies have length ≤ 2 (= |[2, 2]| / |[3, 2, 3, 90, 2]| = 2/5)

  • 80% of residencies have length ≤ 3 (= |[2, 2, 3, 3]| / |[3, 2, 3, 90, 2]| = 4/5)

  • 100% of residencies have length ≤ 90 (= |[2, 2, 3, 3, 90]| / |[3, 2, 3, 90, 2]| = 5/5)

[Slide figure: the same trace/residency diagram as above, together with a plot of the CDF of residency lengths — x-axis: lengths (2, 3, 90); y-axis: CDF (20%–100%).]



Example – CDF of mass of references

So printed residency lengths are: 3, 2, 3, 90, 2

  • Thus, CDF of mass of references (“refs”) is:

  • 4% of refs are to residencies with length ≤ 2 (= (2+2) / (3+2+3+90+2) = 4/100)

  • 10% of refs are to residencies with length ≤ 3 (= (2+2+3+3) / (3+2+3+90+2) = 10/100)

  • 100% of refs are to residencies with length ≤ 90 (= (2+2+3+3+90) / (3+2+3+90+2) = 100/100), as the sketch below verifies
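A small C++ sketch (hypothetical, not from the slides) that computes both CDFs — the count of residencies and the mass of references — from the printed lengths {3, 2, 3, 90, 2}:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> lengths = {3, 2, 3, 90, 2};   // printed residency lengths
        std::sort(lengths.begin(), lengths.end());

        long total_refs = 0;                           // total number of memory references
        for (int len : lengths) total_refs += len;     // = 100 in this example

        long refs_so_far = 0;
        for (size_t i = 0; i < lengths.size(); ++i) {
            refs_so_far += lengths[i];
            // only print at the last occurrence of each distinct length
            if (i + 1 < lengths.size() && lengths[i + 1] == lengths[i]) continue;
            double count_cdf = 100.0 * (i + 1) / lengths.size();   // % of residencies
            double mass_cdf  = 100.0 * refs_so_far / total_refs;   // % of references
            std::printf("length <= %2d: count CDF = %5.1f%%, mass CDF = %5.1f%%\n",
                        lengths[i], count_cdf, mass_cdf);
        }
        // prints 40%/4%, 80%/10%, and 100%/100%, matching the two slides above
    }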

[Slide figure: the same trace/residency diagram, with the CDF of the mass of references plotted against residency length.]



Superimposing graphs

[Slide plot: the two CDFs superimposed — residency lengths (“the counters”) vs. memory references (“the mass”) — over the lengths 2, 3, 90.]

  • “Mass-count disparity” (disparity = שׁוֹנוּת, שֹׁנִי, נִבְדָּלוּת) is the term describing the phenomenon shown in the graph, whereby:

  • most of the mass resides in very few counters, and

  • most of the counters count very little mass



Results from real benchmarks



Methodology

  • Using all benchmarks from the SPEC CPU2000 benchmark suite

    • In this presentation we show only four

    • But we include all the rest in the averages

  • The benchmarks were compiled for

    • The Alpha AXP architecture

  • All benchmarks were fast-forwarded 15 billion instructions (to skip any initialization code) and were then executed for another 2 billion instructions

  • Unless otherwise stated, all simulated runs utilize a 16KB direct-mapped cache


CDF of residency length (of 4 SPEC benchmark apps)

  • Vast majority of residencies are relatively short

    • Which likely means they are transient

  • Small fraction of residencies are extremely long

[Figure: CDF of residency length for Crafty, Vortex, Facerec, and Spsl — data and instruction streams; x-axis: length of residency.]


CDF of mass of residencies (of 4 SPEC benchmark apps)

  • Fraction of memory references serviced by each length

  • Most references target residencies longer than, say, 10

[Figure: CDF of the mass of references for Crafty, Vortex, Facerec, and Spsl — data and instruction streams; x-axis: length of residency.]



Superimposing graphs reveals mass-count disparity

[Figure: the count CDF and mass CDF superimposed for Crafty, Vortex, Facerec, and Spsl — data and instruction streams; x-axis: length of residency.]

  • Every x value along the curves reveals how many of the residencies account for how much of the mass

  • For example, in Crafty, 55% of the (shortest) residencies account for only 5% of the mass

  • Which means the other 45% (longer) residencies account for 95% of the mass


The joint-ratio mass-disparity metric

  • The divergence between the distributions (= the mass-count disparity) can be quantified by the “joint ratio”

    • It’s a generalization of the proverbial 20/80 principle

  • Definition: the joint ratio is the unique point in the graphs where the sum of the two CDFs is 1

  • Example: in the case of Vortex, the joint ratio is 13/87 (blue arrow in middle of plot), meaning 13% of the (longest) residencies hold 87% of the mass of the memory references, while the remaining 87% of the residencies hold only 13% of the mass

[Figure: the superimposed CDFs with the joint ratio marked for Crafty, Vortex, Facerec, and Spsl — data and instruction streams; x-axis: length of residency.]



The W1/2 mass-disparity metric

[Figure: the superimposed CDFs with W1/2 marked — data and instruction streams; x-axis: length of residency.]

  • Definition: overall mass (in %) of the shorter half of the residencies

  • Example: in Vortex and Facerec W1/2 is less than 5% of the references

  • Average W1/2 across all benchmarks is < 10% (median of W1/2 is < 5%)



The N1/2 mass-disparity metric

[Figure: the superimposed CDFs with N1/2 marked — data and instruction streams; x-axis: length of residency.]

  • Definition: % of longer residencies accounting for half of the mass

  • Example: in Vortex and Facerec N1/2 is less than 1% of the residencies

  • Average N1/2 across all benchmarks is < 5% (median of N1/2 is < 1%); a sketch computing all three metrics follows
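The following hypothetical C++ sketch (not from the paper) computes the three disparity metrics defined above — W1/2, N1/2, and the joint ratio — from a list of residency lengths; it uses the toy lengths from the earlier example, so the joint ratio degenerates, but on real benchmark data it yields values like the 13/87 quoted for Vortex:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // Residency lengths; here the toy values from the earlier example. In practice
        // this would be the full list collected from a benchmark run.
        std::vector<long> len = {3, 2, 3, 90, 2};
        std::sort(len.begin(), len.end());                     // shortest first
        const size_t n = len.size();

        double total = 0;
        for (long l : len) total += l;                         // total mass (memory references)

        // W1/2: overall mass of the shorter half of the residencies
        double mass = 0;
        for (size_t i = 0; i < n / 2; ++i) mass += len[i];
        std::printf("W1/2 = %.1f%% of the references\n", 100.0 * mass / total);

        // N1/2: % of the longest residencies needed to account for half of the mass
        mass = 0;
        size_t k = 0;
        for (size_t i = n; i > 0 && mass < total / 2; --i) { mass += len[i - 1]; ++k; }
        std::printf("N1/2 = %.1f%% of the residencies\n", 100.0 * k / n);

        // Joint ratio: the point where count-CDF + mass-CDF first reaches 1
        mass = 0;
        for (size_t i = 0; i < n; ++i) {
            mass += len[i];
            double count_cdf = (i + 1.0) / n, mass_cdf = mass / total;
            if (count_cdf + mass_cdf >= 1.0) {
                // reported as "X/Y": X% of the longest residencies hold Y% of the mass
                std::printf("joint ratio ~ %.0f/%.0f\n",
                            100.0 * (1.0 - count_cdf), 100.0 * count_cdf);
                break;
            }
        }
    }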



Let us utilize this understanding for…

Designing a new insertion policy



Probabilistic insertion?

  • The mass-disparity we’ve identified means

    • A small number of long residencies account for most memory references; but still most residencies are short

  • So when randomly selecting a residency

    • It would likely be a short residency

  • Which means we have a way to approximate the future:

    • Given a block about to be inserted into the cache, probabilistically speaking, we know with a high degree of certainty that it’d be disadvantageous to actually insert it…

    • So we won’t! Instead, we’ll flip a coin (see the sketch after this list)…

      • Heads = insert block to cache (small probability)

      • Tails = insert block to a small filter (high probability)

  • Rationale

    • Long residencies will enjoy many coin-flips, so chances are they’ll eventually get into the cache

    • Conversely, short residencies have little chance to get in
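A minimal sketch of the biased coin flip (a hypothetical helper, not the paper's code), using C++'s standard Bernoulli distribution; the probability 5/100 is the data-cache value explored later in the deck:

    #include <cstdio>
    #include <random>

    // returns true ("heads") with probability p; heads => insert into L1
    bool insert_into_l1(double p) {
        static std::mt19937 gen(std::random_device{}());   // pseudo-random generator
        std::bernoulli_distribution coin(p);                // biased coin
        return coin(gen);
    }

    int main() {
        int heads = 0;
        for (int i = 0; i < 1000; ++i)
            heads += insert_into_l1(0.05);   // P = 5/100
        std::printf("inserted into L1: %d of 1000 candidates (expected ~50)\n", heads);
    }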



L1 with random filter

  • Design

    • Direct-mapped L1 + small fully-associative filter w/ CAM

    • Insertion policy for lines not in L1: for each mem ref, flip biased coin to decide if line goes into filter or into L1

  • The filter’s storage is SRAM (i.e., cache memory)

    • Not to be confused with DRAM

    • It holds blocks that, per the coin flip, should not be inserted into L1

  • Usage

    • First, search data in L1

    • If not found, search in filter

    • If not found, go to L2, and then apply the insertion policy above (see the sketch below)
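Below is a toy, runnable sketch (hypothetical code and structures, not the paper's) of the whole access path: L1 first, then the filter, then L2, with the biased coin deciding placement for every reference to a line not yet in L1 — which is what lets long residencies eventually win promotion:

    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <unordered_map>

    // Hypothetical toy model: tag-only caches, no data arrays.
    struct DirectMappedL1 {
        static constexpr int LINES = 256;                    // e.g., 16KB / 64B lines
        uint64_t tag[LINES] = {};
        bool valid[LINES] = {};
        bool lookup(uint64_t blk) { int i = blk % LINES; return valid[i] && tag[i] == blk; }
        void insert(uint64_t blk) { int i = blk % LINES; valid[i] = true; tag[i] = blk; }
    };

    struct FullyAssocFilter {                                // CAM modeled as a hash map
        std::unordered_map<uint64_t, bool> lines;
        bool lookup(uint64_t blk) { return lines.count(blk) != 0; }
        void insert(uint64_t blk) { lines[blk] = true; }     // (capacity/eviction omitted)
    };

    bool insert_into_l1(double p) {                          // the biased coin from before
        static std::mt19937 gen(42);                         // fixed seed for reproducibility
        return std::bernoulli_distribution(p)(gen);
    }

    // L1 first, then the filter, then L2. Every reference to a line not in L1 flips the
    // coin, so frequently referenced (long-residency) lines are eventually promoted to L1.
    bool access(uint64_t blk, DirectMappedL1& l1, FullyAssocFilter& filter) {
        if (l1.lookup(blk)) return true;               // 1) search the data in L1
        bool hit = filter.lookup(blk);                 // 2) if not found, search the filter
        if (!hit) { /* 3) fetch from L2 / memory (not modeled here) */ }
        if (insert_into_l1(0.05)) l1.insert(blk);      // heads (low prob.): goes into L1
        else if (!hit)            filter.insert(blk);  // tails: goes into / stays in the filter
        return hit;
    }

    int main() {
        DirectMappedL1 l1; FullyAssocFilter filter;
        int hits = 0;
        for (int i = 0; i < 1000; ++i)                 // one hot block, referenced repeatedly:
            hits += access(7, l1, filter);             // it soon wins a coin flip and moves to L1
        std::printf("hits: %d of 1000\n", hits);
    }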



L1 with random filter (cont.)

  • Result

    • Long residencies end up in L1

    • Short residencies tend to end up in filter

  • Benefit of randomness

    • Filtering is purely statistical, eliminating the need to save any state or reuse information!

  • Explored filter sizes

    • 1KB, 2KB, and 4KB

    • Consisting of 16, 32, and 64 lines, respectively

    • Results presented in the slides were achieved using a 2KB filter


Exploring coin bias

  • Find the probability minimizing the miss-rate

    • High probability swamps cache

    • Low probability swamps filter

  • Constant selection probabilities seem sufficient

    • Data miss-rate reduced by ~25% for P = 5/100

    • Inst. miss-rate reduced by >60% for P = 1/1000

[Figure: reduction in miss rate vs. selection probability, shown separately for data and instruction caches.]


Exploring coin bias (cont.)

  • Random sampling with probability P turned out equivalent to periodic sampling at a rate of ~1/P

    • So we do not need real randomness (see the sketch below)

  • Majority of memory refs are serviced by the L1 cache, whereas the majority of blocks remain in the filter; specifically:

    • L1 services 80%–90% of refs

    • With only ~35% of the blocks

[Figure: reduction in miss rate vs. selection probability, shown separately for data and instruction caches.]
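A sketch of that observation (hypothetical, not from the paper): replacing the random coin with a counter that selects every N-th candidate, N ≈ 1/P, gives deterministic periodic sampling with the same average selection rate:

    #include <cstdio>

    // Deterministic alternative to the biased coin: select every Nth candidate,
    // where N ~ 1/P (e.g., P = 5/100 => N = 20). No random state is needed.
    struct PeriodicSampler {
        int period;     // ~1/P
        int count = 0;  // references seen since the last selection

        explicit PeriodicSampler(int n) : period(n) {}

        bool insert_into_l1() {          // call once per reference to a line not in L1
            if (++count >= period) {     // every Nth call returns true ...
                count = 0;
                return true;             // ... meaning: place the line in L1
            }
            return false;                // otherwise: place/keep it in the filter
        }
    };

    int main() {
        PeriodicSampler s(20);           // N = 20, i.e., P ~ 5/100
        int selected = 0;
        for (int i = 0; i < 1000; ++i) selected += s.insert_into_l1();
        std::printf("selected %d of 1000 (exactly 1000/20 = 50)\n", selected);
    }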



Problem – CAM is wasteful & slow

  • Fully-associative filter uses CAM (content addressable memory)

    • Input = address; output (on a hit) = “pointer” into the SRAM saying where the associated data is

    • CAM lookup done in parallel

  • Parallel lookup drawbacks

    • Wastes energy

    • Is slower (relative to direct-mapped)

  • Possible solution

    • Introducing the “WLB”…



Wordline Look-aside Buffer (WLB)

  • WLB is a small direct-mapped lookup table caching the most recent CAM lookups

    • (Recall: given an address, CAM returns a pointer into SRAM; it’s a search like any search and therefore can be cached)

    • Fast, low-power lookups

  • Filter usage when adding the WLB to it (summarized in the sketch after this list)

    • First, search data in L1

    • In parallel search its address in WLB

    • If data not in L1 but WLB hits

      • Access the SRAM without CAM

    • If data not in L1 and WLB misses

      • Only then use the slower / wasteful CAM

    • If not found, go to L2 as usual
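A toy, runnable sketch (hypothetical structure, not the paper's code) of the WLB-augmented lookup path; the WLB is modeled as a tiny direct-mapped table that maps a block address to the SRAM slot the CAM returned last time:

    #include <cstdint>
    #include <cstdio>

    const int FILTER_LINES = 32;                 // e.g., a 2KB filter with 64B lines
    const int WLB_ENTRIES  = 8;                  // per the slide, 8 entries suffice

    uint64_t filter_tag[FILTER_LINES];           // the CAM: tags of blocks held by the filter
    bool     filter_valid[FILTER_LINES];

    int cam_lookup(uint64_t blk) {               // slow, power-hungry parallel compare
        for (int s = 0; s < FILTER_LINES; ++s)
            if (filter_valid[s] && filter_tag[s] == blk) return s;   // slot in the SRAM
        return -1;
    }

    struct WLB {                                 // tiny direct-mapped cache of CAM results
        uint64_t tag[WLB_ENTRIES] = {};
        int      slot[WLB_ENTRIES] = {};
        bool     valid[WLB_ENTRIES] = {};
        int lookup(uint64_t blk) {               // fast; done in parallel with the L1 lookup
            int i = blk % WLB_ENTRIES;
            return (valid[i] && tag[i] == blk) ? slot[i] : -1;
        }
        void fill(uint64_t blk, int s) {         // remember the latest CAM result
            int i = blk % WLB_ENTRIES;
            valid[i] = true; tag[i] = blk; slot[i] = s;
        }
    };

    // Returns which structure served the reference, following the slide's lookup order.
    const char* access(uint64_t blk, bool l1_hit, WLB& wlb) {
        int slot = wlb.lookup(blk);              // searched in parallel with L1
        if (l1_hit)    return "L1";              // data found in L1: done
        if (slot >= 0) return "filter (via WLB, no CAM)";
        slot = cam_lookup(blk);                  // WLB missed: use the slower CAM
        if (slot >= 0) { wlb.fill(blk, slot); return "filter (via CAM)"; }
        return "L2";                             // not in L1 or filter: go to L2 as usual
    }

    int main() {
        WLB wlb;
        filter_valid[3] = true; filter_tag[3] = 42;     // pretend block 42 is in the filter
        std::printf("%s\n", access(42, false, wlb));    // filter (via CAM)
        std::printf("%s\n", access(42, false, wlb));    // filter (via WLB, no CAM)
        std::printf("%s\n", access(7,  true,  wlb));    // L1
    }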



Effectiveness of WLB?

  • WLB is quite effective with only 8 entries (for both I and D)

    • Eliminates 77% of CAM data lookups

    • Eliminates 98% of CAM instruction lookups

  • Since the WLB is so small and simple (direct-mapped)

    • It’s fast and consumes extremely low power

    • Therefore, it can be looked up in parallel with main L1 cache

[Figure: fraction of CAM lookups eliminated vs. size of WLB (number of entries), for instruction and data references.]



Performance evaluation



Methodology

  • 4-wide, out-of-order micro-architecture (SimpleScalar)

    • (You’ll understand this when we learn out-of-order execution)

  • Simulated L1

    • 16K, 32K, 64K, with several set-associative configurations; latency:

      • Direct-mapped: 1 cycle

      • Set-associative: 2 cycles

  • Simulated filter

    • 2K, fully-associative, with 8-entry WLB; latency: 5 cycles

      • 1 cycle = for WLB (in parallel to accessing the cache)

      • 3 cycles = for CAM lookup

      • 1 cycle = for SRAM access

  • Simulated L2

    • 512K; latency: 16 cycles

  • Simulated main-memory

    • Latency: 350 cycles


Results – runtime

  • Comparing the random-sampling filter cache to other common cache designs

    • Outperforms a 4-way cache double its size!

    • Interesting: DM’s low latency compensates for conflict misses

[Figure: average relative improvement [%] for the 16K DM + filter and 32K DM + filter configurations.]



Results – power consumption

  • Expectedly, DM-filtered loses to DM, because it’s more complex

  • Direct mapped cache reduces dynamic power, but filter adds ~15% more leakage over 4-way

  • Same size: 60%-80% reduction in dynamic power

  • Double size: ~40% reduction in leakage



Conclusions

  • The Mass-Count disparity phenomenon can be leveraged for caching policies

  • Random Sampling effectively identifies frequently used blocks

    • Adding just a 2KB filter is better than doubling the cache size, both in terms of IPC and power

  • The WLB is effective at eliminating costly CAM lookups

    • Offering fast, low-power access while maintaining the benefits of full associativity

