Presentation Transcript



Cooperative Caching for Chip Multiprocessors

Jichuan Chang

Guri Sohi

University of Wisconsin-Madison

ISCA-33, June 2006



Motivation

  • Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs

  • Two important demands for CMP caching

    • Capacity: reduce off-chip accesses

    • Latency: reduce remote on-chip references

  • Need to combine the strengths of both private and shared cache designs



Yet Another Hybrid CMP Cache - Why?

  • Private cache based design

    • Lower latency and per-cache associativity

    • Lower cross-chip bandwidth requirement

    • Self-contained for resource management

    • Easier to support QoS, fairness, and priority

  • Need a unified framework

    • Manage the aggregate on-chip cache resources

    • Can be adopted by different coherence protocols


CMP Cooperative Caching

[Slide figure: four cores, each with split L1I/L1D caches and a private L2, connected by an on-chip network.]

  • Form an aggregate global cache via cooperative private caches

    • Use private caches to attract data for fast reuse

    • Share capacity through cooperative policies

    • Throttle cooperation to find an optimal sharing point

  • Inspired by cooperative file/web caches

    • Similar latency tradeoff

    • Similar algorithms



Outline

  • Introduction

  • CMP Cooperative Caching

  • Hardware Implementation

  • Performance Evaluation

  • Conclusion



Policies to Reduce Off-chip Accesses

  • Cooperation policies for capacity sharing

    • (1) Cache-to-cache transfers of clean data

    • (2) Replication-aware replacement

    • (3) Global replacement of inactive data

  • Implemented by two unified techniques

    • Policies enforced by cache replacement/placement

    • Information/data exchange supported by modifying the coherence protocol



Policy (1) - Make use of all on-chip data

  • Don’t go off-chip if on-chip (clean) data exist

  • Beneficial and practical for CMPs

    • Peer cache is much closer than next-level storage

    • Affordable implementations of “clean ownership”

  • Important for all workloads

    • Multi-threaded: (mostly) read-only shared data

    • Single-threaded: spill into peer caches for later reuse
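
As a minimal sketch of this policy (not the paper's implementation, and with illustrative names such as PrivateCache and handle_l2_miss), an L2 miss handler can consult the peer caches before paying the off-chip latency; the real design locates on-chip copies through the CCE directory rather than probing every peer.

```python
# Sketch of Policy (1): serve misses from peer caches whenever any on-chip
# copy exists, even a clean one. All names here (PrivateCache, peer_caches,
# off_chip_memory) are illustrative; the real design finds copies via the
# CCE directory instead of probing every peer cache.

class PrivateCache:
    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.blocks = {}                      # address -> (data, coherence state)

    def lookup(self, addr):
        return self.blocks.get(addr)

def handle_l2_miss(addr, requester, peer_caches, off_chip_memory):
    """Return (data, source) for an L2 miss, preferring on-chip copies."""
    for peer in peer_caches:
        if peer is requester:
            continue
        hit = peer.lookup(addr)
        if hit is not None:
            data, _state = hit                # clean or dirty: both can supply the data
            return data, "cache-to-cache"     # a peer cache is much closer than memory
    return off_chip_memory[addr], "off-chip"  # no on-chip copy anywhere
```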



Policy (2) – Control replication

  • Intuition – increase # of unique on-chip data

  • Latency/capacity tradeoff

    • Evict singlets only when no replicas exist

    • Modify the default cache replacement policy

  • “Spill” an evicted singlet into a peer cache

    • Can further reduce on-chip replication

    • Randomly choose a recipient cache for spilling

(“Singlets” are blocks whose data has only one on-chip copy.)
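
A minimal sketch of replication-aware replacement and spilling, assuming illustrative names (CacheBlock, choose_victim) and a per-block singlet flag supplied by the directory; this is not the paper's actual replacement code.

```python
import random

# Sketch of Policy (2): replication-aware replacement plus spilling of evicted
# singlets. Peer caches are modeled as plain lists; the real design folds this
# into the normal cache replacement logic, driven by singlet hints from the directory.

class CacheBlock:
    def __init__(self, addr, is_singlet, lru_rank):
        self.addr = addr
        self.is_singlet = is_singlet          # True if this is the only on-chip copy
        self.lru_rank = lru_rank              # larger = less recently used

def choose_victim(cache_set):
    """Prefer to evict replicated blocks; touch a singlet only if no replicas exist in the set."""
    replicas = [b for b in cache_set if not b.is_singlet]
    candidates = replicas if replicas else cache_set
    return max(candidates, key=lambda b: b.lru_rank)

def evict_and_maybe_spill(cache_set, peer_caches):
    victim = choose_victim(cache_set)
    cache_set.remove(victim)
    if victim.is_singlet and peer_caches:
        # Spill the only on-chip copy into a randomly chosen peer for later reuse.
        random.choice(peer_caches).append(victim)
    return victim
```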



Policy (3) - Global cache management

  • Approximate global-LRU replacement

    • Combine global spill/reuse history with local LRU

  • Identify and replace globally inactive data

    • A block first becomes the LRU entry in its local cache

    • It is set as MRU if spilled into a peer cache

    • When it becomes the LRU entry again, it is evicted from the chip

  • 1-chance forwarding (1-Fwd)

    • Blocks can only be spilled once if not reused
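
A minimal sketch of 1-Fwd, with illustrative names and two per-block bits standing in for the spill/reuse history; the recipient interface (insert_as_mru) is assumed, not the paper's mechanism.

```python
# Sketch of Policy (3): approximate global LRU with 1-chance forwarding (1-Fwd).
# A block that becomes LRU locally is spilled once; if it becomes LRU again
# without having been reused, it is evicted from the chip.

class Block:
    def __init__(self, addr):
        self.addr = addr
        self.spilled = False                  # has the block already used its one chance?
        self.reused = False                   # hit since the last spill?

def on_local_lru(block, peer_caches, pick_recipient):
    """Called when `block` is chosen as the LRU victim of its local cache."""
    if block.spilled and not block.reused:
        return "evict-globally"               # globally inactive: drop it from the chip
    block.spilled = True
    block.reused = False
    recipient = pick_recipient(peer_caches)
    recipient.insert_as_mru(block)            # spilled blocks enter the recipient as MRU
    return "spilled"

def on_hit(block):
    block.reused = True                       # reuse earns the block another chance
```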


Cooperation Throttling

  • Why throttling?

    • Further tradeoff between capacity/latency

  • Two probabilities to help make decisions

    • Cooperation probability: control replication

    • Spill probability: throttle spilling

[Slide figure: a spectrum from private caches to a shared cache, with cooperative caching spanning CC 0% to CC 100%; Policy (1) is annotated on the diagram.]
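
A minimal sketch of how the two probabilities could gate the cooperative policies; function names are illustrative, and the victim-selection callbacks stand in for whatever replacement logic the cache already uses.

```python
import random

# Sketch of cooperation throttling with the two knobs named on the slide:
# a cooperation probability that gates replication-aware replacement and a
# spill probability that gates spilling of evicted singlets.

def throttled_victim(cache_set, cooperation_prob, coop_choose, baseline_choose):
    """With probability cooperation_prob apply the cooperative (replication-aware)
    victim selection; otherwise fall back to the baseline (e.g., plain LRU) choice."""
    if random.random() < cooperation_prob:
        return coop_choose(cache_set)
    return baseline_choose(cache_set)

def throttled_spill(victim, peer_caches, spill_prob):
    """Spill an evicted singlet only with probability spill_prob.
    At 0% the scheme degenerates to plain private caches; at 100% it behaves
    most like a shared aggregate cache."""
    if victim.is_singlet and peer_caches and random.random() < spill_prob:
        random.choice(peer_caches).append(victim)
        return True
    return False
```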



Outline

  • Introduction

  • CMP Cooperative Caching

  • Hardware Implementation

  • Performance Evaluation

  • Conclusion



Hardware Implementation

  • Requirements

    • Information: singlet, spill/reuse history

    • Cache replacement policy

    • Coherence protocol: clean owner and spilling

  • Can modify an existing implementation

  • Proposed implementation

    • Central Coherence Engine (CCE)

    • On-chip directory by duplicating tag arrays
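
A minimal sketch of a directory built from duplicated tag arrays, modeling only L2 tags and singlet detection; class and method names are illustrative and do not reflect the CCE's actual structure, which also duplicates L1 tags.

```python
# Sketch of an on-chip directory built from duplicated tag arrays, in the
# spirit of the CCE. Only L2 tags are modeled; all names are illustrative.

class DuplicateTagDirectory:
    def __init__(self, num_cores):
        self.l2_tags = [set() for _ in range(num_cores)]   # one duplicated tag array per core

    def record_fill(self, core, addr):
        self.l2_tags[core].add(addr)

    def record_eviction(self, core, addr):
        self.l2_tags[core].discard(addr)

    def presence_vector(self, addr):
        """Per-core presence bits for the block (the 'state vector')."""
        return [addr in tags for tags in self.l2_tags]

    def is_singlet(self, addr):
        """A block is a singlet when exactly one core holds it on chip."""
        return sum(self.presence_vector(addr)) == 1
```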



Information and Data Exchange

  • Singlet information

    • Directory detects and notifies the block owner

  • Sharing of clean data

    • PUTS: notify directory of clean data replacement

    • Directory sends forward request to the first sharer

  • Spilling

    • Currently implemented as a 2-step data transfer

    • Can be implemented as recipient-issued prefetch
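
A minimal sketch of the directory's reaction to a PUTS, reusing the duplicated-tag directory sketched above; the message names beyond PUTS (FWD_CLEAN_OWNER, SINGLET_HINT) and the exact ordering are assumptions, not the paper's protocol.

```python
# Sketch of directory-side handling of PUTS (a clean-replacement notification).
# `send` is a caller-supplied callback that delivers a message to a core.

def handle_puts(directory, core, addr, send):
    """A core replacing clean data sends PUTS; the directory updates its tags and,
    if other sharers remain, forwards a request to the first one so a clean
    on-chip copy can keep serving later misses."""
    directory.record_eviction(core, addr)
    sharers = [c for c, present in enumerate(directory.presence_vector(addr)) if present]
    if sharers:
        send(dest=sharers[0], msg=("FWD_CLEAN_OWNER", addr))
        if len(sharers) == 1:
            send(dest=sharers[0], msg=("SINGLET_HINT", addr))   # now the only on-chip copy
    # Otherwise the block has left the chip entirely and nothing needs forwarding.
```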



Outline

  • Introduction

  • CMP Cooperative Caching

  • Hardware Implementation

  • Performance Evaluation

  • Conclusion



Performance Evaluation

  • Full system simulator

    • Modified GEMS Ruby to simulate memory hierarchy

    • Simics MAI-based OoO processor simulator

  • Workloads

    • Multithreaded commercial benchmarks (8-core)

      • OLTP, Apache, JBB, Zeus

    • Multiprogrammed SPEC2000 benchmarks (4-core)

      • 4 heterogeneous and 2 homogeneous mixes

  • Private / shared / cooperative schemes

    • Same total capacity/associativity



Multithreaded Workloads - Throughput

  • CC throttling - 0%, 30%, 70% and 100%

  • Ideal – Shared cache with local bank latency



Multithreaded Workloads - Avg. Latency

  • Low off-chip miss rate

  • High hit ratio to local L2

  • Lower bandwidth needed than a shared cache


Multiprogrammed Workloads

[Slide figure: access breakdown into L1, local L2, remote L2, and off-chip for the multiprogrammed workloads.]



Sensitivity - Varied Configurations

  • In-order processor with blocking caches

[Slide figure: performance normalized to the shared-cache design.]


Comparison with Victim Replication

[Slide figure: results for single-threaded and SPECOMP workloads.]


A Spectrum of Related Research

[Slide figure: a design spectrum from private caches (best latency) to shared caches (best capacity). Between the two extremes lie hybrid schemes (CMP-NUCA proposals, victim replication, Speight et al. ISCA’05, CMP-NuRAPID, Fast & Fair, synergistic caching, etc.), cooperative caching, and configurable/malleable caches (Liu et al. HPCA’04, Huh et al. ICS’05, cache partitioning proposals, etc.).]

  • Speight et al. ISCA’05

    • Private caches

    • Writeback into peer caches



Conclusion

  • CMP cooperative caching

    • Exploit benefits of private cache based design

    • Capacity sharing through explicit cooperation

    • Cache replacement/placement policies for replication control and global management

    • Robust performance improvement

Thank you!



Backup Slides


Shared vs. Private Caches for CMP

  • Private cache – best latency

    • High local hit ratio, no interference from other cores

  • Shared cache – best capacity

    • No replication, dynamic capacity sharing

  • Need to combine the strengths of both

[Slide figure: private, shared, and desired cache organizations (duplication annotated).]



CMP Caching Challenges/Opportunities

  • Future hardware

    • More cores, limited on-chip cache capacity

    • On-chip wire-delay, costly off-chip accesses

  • Future software

    • Diverse sharing patterns, working set sizes

    • Interference due to capacity sharing

  • Opportunities

    • Freedom in designing on-chip structures/interfaces

    • High-bandwidth, low-latency on-chip communication


Multi-cluster CMPs (Scalability)

[Slide figure: a 16-core CMP built from four 4-core clusters; each core has split L1 caches and a private L2, and each cluster has its own CCE.]


Central Coherence Engine (CCE)

[Slide figure: the CCE holds duplicated L1 and L2 tag arrays for cores P1 through P8, gathers per-core state vectors through 4-to-1 muxes, and connects a directory, a spilling buffer, and network input/output queues between the on-chip network and memory.]



Cache/Network Configurations

  • Parameters

    • 128-byte block size; 300-cycle memory latency

    • L1 I/D split caches: 32KB, 2-way, 2-cycle

    • L2 caches: Sequential tag/data access, 15-cycle

    • On-chip network: 5-cycle per-hop

  • Configurations

    • 4-core: 2X2 mesh network

    • 8-core: 3X3 or 4X2 mesh network

    • Same total capacity/associativity for private/shared/cooperative caching schemes
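
For reference, the same parameters gathered into a single illustrative configuration; this dictionary is only a readable summary of the slide, not a GEMS or Simics input format.

```python
# Simulated cache/network parameters from this slide, collected in one place.

SIM_CONFIG = {
    "block_size_bytes": 128,
    "memory_latency_cycles": 300,
    "l1": {"split_i_d": True, "size_kb": 32, "assoc": 2, "latency_cycles": 2},
    "l2": {"tag_data_access": "sequential", "latency_cycles": 15},
    "network": {
        "per_hop_latency_cycles": 5,
        "mesh_4_core": "2x2",
        "mesh_8_core": ["3x3", "4x2"],
    },
    # Private, shared, and cooperative schemes use the same total capacity and associativity.
}
```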

