Utility-Based Partitioning of Shared Caches
Download
1 / 32

Utility-Based Partitioning of Shared Caches - PowerPoint PPT Presentation


  • 73 Views
  • Uploaded on

Utility-Based Partitioning of Shared Caches. Moinuddin K. Qureshi Yale N. Patt. International Symposium on Microarchitecture (MICRO) 2006. Introduction. CMP and shared caches are common Applications compete for the shared cache Partitioning policies critical for high performance

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Utility-Based Partitioning of Shared Caches' - ronia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Utility-Based Partitioning of Shared Caches

Moinuddin K. QureshiYale N. Patt

International Symposium on Microarchitecture (MICRO) 2006


Introduction
Introduction

  • CMP and shared caches are common

  • Applications compete for the shared cache

  • Partitioning policies critical for high performance

  • Traditional policies:

    • Equal (half-and-half)  Performance isolation. No adaptation

    • LRU  Demand based. Demand ≠ benefit (e.g. streaming)


Background
Background

Utility Uab = Misses with a ways – Misses with b ways

Low Utility

Misses per 1000 instructions

High Utility

Saturating Utility

Num ways from 16-way 1MB L2


Motivation
Motivation

equake

vpr

Improve performance by giving more cache to the application that benefits more from cache

UTIL

Misses per 1000 instructions (MPKI)

LRU

Num ways from 16-way 1MB L2


Outline
Outline

  • Introduction and Motivation

  • Utility-Based Cache Partitioning

  • Evaluation

  • Scalable Partitioning Algorithm

  • Related Work and Summary


Framework for ucp

PA

UMON2

UMON1

Framework for UCP

Shared

L2 cache

I$

I$

Core1

Core2

D$

D$

Main Memory

Three components:

  • Utility Monitors (UMON) per core

  • Partitioning Algorithm (PA)

  • Replacement support to enforce partitions


Utility monitors umon

(MRU)H0 H1 H2…H15(LRU)

+

+

+

+

Set A

Set A

Set B

Set B

Set C

Set C

ATD

Set D

Set D

Set E

Set E

Set F

Set F

Set G

Set G

Set H

Set H

Utility Monitors (UMON)

  • For each core, simulate LRU policy using ATD

  • Hit counters in ATD to count hits per recency position

  • LRU is a stack algorithm: hit counts  utility E.g. hits(2 ways) = H0+H1

MTD


Dynamic set sampling dss

Set A

Set B

Set B

Set E

Set C

Set G

Set D

Set E

Set F

Set G

Set H

(MRU)H0 H1 H2…H15(LRU)

+

+

+

+

Set A

Set A

Set B

Set B

Set C

Set C

Set D

Set D

Set E

Set E

Set F

Set F

Set G

Set G

Set H

Set H

Dynamic Set Sampling (DSS)

  • Extra tags incurhardware and power overhead

  • DSS reduces overhead [Qureshi+ ISCA’06]

  • 32 sets sufficient (analytical bounds)

  • Storage < 2kB/UMON

MTD

ATD

UMON (DSS)


Partitioning algorithm
Partitioning algorithm

  • Evaluate all possible partitions and select the best

  • With aways to core1 and (16-a) ways to core2:

    Hitscore1 = (H0 + H1 + … + Ha-1) ---- from UMON1 Hitscore2 = (H0 + H1 + … + H16-a-1) ---- from UMON2

  • Select a that maximizes (Hitscore1 + Hitscore2)

  • Partitioning done once every 5 million cycles


Way partitioning

ways_occupied < ways_given

Yes

No

Victim is the LRU line from miss-causing app

Victim is the LRU line from other app

Way Partitioning

Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04]

  • Each line has core-id bits

  • On a miss, count ways_occupied in set by miss-causing app


Outline1
Outline

  • Introduction and Motivation

  • Utility-Based Cache Partitioning

  • Evaluation

  • Scalable Partitioning Algorithm

  • Related Work and Summary


Methodology

Benchmarks: Two-threaded workloads divided into 5 categories

1.0

1.2

1.6

1.4

1.8

2.0

Weighted speedup for the baseline

Used 20 workloads (four from each type)

Methodology

Configuration: Two cores: 8-wide, 128-entry window, private L1s L2: Shared, unified, 1MB, 16-way, LRU-based Memory: 400 cycles, 32 banks


Metrics
Metrics Two-threaded workloads divided into 5 categories

  • Three metrics for performance:

  • Weighted Speedup (default metric) perf = IPC1/SingleIPC1+ IPC2/SingleIPC2 correlates with reduction in execution time

  • Throughput  perf = IPC1 +IPC2 can be unfair to low-IPC application

  • Hmean-fairness  perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2)  balances fairness and performance


Results for weighted speedup
Results for weighted speedup Two-threaded workloads divided into 5 categories

UCP improves average weighted speedup by 11%


Results for throughput
Results for throughput Two-threaded workloads divided into 5 categories

UCP improves average throughput by 17%


Results for hmean fairness
Results for hmean-fairness Two-threaded workloads divided into 5 categories

UCP improves average hmean-fairness by 11%


Effect of number of sampled sets
Effect of Number of Sampled Sets Two-threaded workloads divided into 5 categories

8 sets

16 sets

32 sets

All sets

Dynamic Set Sampling (DSS) reduces overhead, not benefits


Outline2
Outline Two-threaded workloads divided into 5 categories

  • Introduction and Motivation

  • Utility-Based Cache Partitioning

  • Evaluation

  • Scalable Partitioning Algorithm

  • Related Work and Summary


Scalability issues
Scalability issues Two-threaded workloads divided into 5 categories

  • Time complexity of partitioning low for two cores(number of possible partitions ≈ number of ways)

  • Possible partitions increase exponentially with cores

  • For a 32-way cache, possible partitions:

    • 4 cores  6545

    • 8 cores  15.4 million

  • Problem NP hard  need scalable partitioning algorithm


Greedy algorithm stone toc 92
Greedy Algorithm Two-threaded workloads divided into 5 categories [Stone+ ToC ’92]

  • GA allocates 1 block to the app that has the max utility for one block. Repeat till all blocks allocated

  • Optimal partitioning when utility curves are convex

  • Pathological behavior for non-convex curves

Misses per 100 instructions

Num ways from a 32-way 2MB L2


Problem with greedy algorithm
Problem with Greedy Algorithm Two-threaded workloads divided into 5 categories

In each iteration, the utility for 1 block:

U(A) = 10 misses U(B) = 0 misses

Misses

All blocks assigned to A, even if B has same miss reduction with fewer blocks

Blocks assigned

Problem: GA considers benefit only from the immediate block. Hence it fails to exploit huge gains from ahead


Lookahead algorithm
Lookahead Algorithm Two-threaded workloads divided into 5 categories

  • Marginal Utility (MU) = Utility per cache resource

    MUab = Uab/(b-a)

  • GA considers MU for 1 block. LA considers MU for all possible allocations

  • Select the app that has the max value for MU. Allocate it as many blocks required to get max MU

  • Repeat till all blocks assigned


Lookahead algorithm example
Lookahead Algorithm (example) Two-threaded workloads divided into 5 categories

Iteration 1:

MU(A) = 10/1 block MU(B) = 80/3 blocks

B gets 3 blocks

Time complexity ≈ways2/2 (512 ops for 32-ways)

Misses

Next five iterations:

MU(A) = 10/1 block

MU(B) = 0

A gets 1 block

Blocks assigned

Result: A gets 5 blocks and B gets 3 blocks (Optimal)


Results for partitioning algorithms
Results for partitioning algorithms Two-threaded workloads divided into 5 categories

Four cores sharing a 2MB 32-way L2

LRU

UCP(Greedy)

UCP(Lookahead)

UCP(EvalAll)

Mix1

(gap-applu-apsi-gzp)

Mix4

(mcf-art-eqk-wupw)

Mix3

(mcf-applu-art-vrtx)

Mix2

(swm-glg-mesa-prl)

LA performs similar to EvalAll, with low time-complexity


Outline3
Outline Two-threaded workloads divided into 5 categories

  • Introduction and Motivation

  • Utility-Based Cache Partitioning

  • Evaluation

  • Scalable Partitioning Algorithm

  • Related Work and Summary


Related work
Related work Two-threaded workloads divided into 5 categories

Performance

Low

High

Suh+ [HPCA’02] Perf += 4% Storage += 32B/core

UCP Perf += 11% Storage += 2kB/core

Low

Overhead

X

Zhou+ [ASPLOS’04] Perf += 11% Storage += 64kB/core

High

UCP is both high-performance and low-overhead


Summary
Summary Two-threaded workloads divided into 5 categories

  • CMP and shared caches are common

  • Partition shared caches based on utility, not demand

  • UMON estimates utility at runtime with low overhead

  • UCP improves performance:

    • Weighted speedup by 11%

    • Throughput by 17%

    • Hmean-fairness by 11%

  • Lookahead algorithm is scalable to many cores sharing a highly associative cache


  • Questions
    Questions Two-threaded workloads divided into 5 categories


    Dss bounds with analytical model
    DSS Bounds with Analytical Model Two-threaded workloads divided into 5 categories

    Us = Sampled mean (Num ways allocated by DSS)

    Ug = Global mean (Num ways allocated by Global)

    P = P(Us within 1 way of Ug)

    By Cheb. inequality:

    P ≥ 1 – variance/n

    n = number of sampled sets

    In general, variance ≤ 3

    back


    Phase based adapt of ucp
    Phase-Based Adapt of UCP Two-threaded workloads divided into 5 categories


    Galgel concave utility
    Galgel – concave utility Two-threaded workloads divided into 5 categories

    galgel

    twolf

    parser


    Lru as a stack algorithm
    LRU as a stack algorithm Two-threaded workloads divided into 5 categories


    ad