Cooperative Caching for Chip Multiprocessors

Jichuan Chang and Guri Sohi
University of Wisconsin-Madison
ISCA-33, June 2006

Motivation
  • Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs
  • Two important demands for CMP caching
    • Capacity: reduce off-chip accesses
    • Latency: reduce remote on-chip references
  • Need to combine the strengths of both private and shared cache designs
Yet Another Hybrid CMP Cache - Why?
  • Private cache based design
    • Lower latency and per-cache associativity
    • Lower cross-chip bandwidth requirement
    • Self-contained for resource management
    • Easier to support QoS, fairness, and priority
  • Need a unified framework
    • Manage the aggregate on-chip cache resources
    • Can be adopted by different coherence protocols
CMP Cooperative Caching
  • Form an aggregate global cache via cooperative private caches
    • Use private caches to attract data for fast reuse
    • Share capacity through cooperative policies
    • Throttle cooperation to find an optimal sharing point
  • Inspired by cooperative file/web caches
    • Similar latency tradeoff
    • Similar algorithms

[Figure: a CMP with four cores, each with private L1 I/D caches and a private L2, connected by an on-chip network.]
Outline
  • Introduction
  • CMP Cooperative Caching
  • Hardware Implementation
  • Performance Evaluation
  • Conclusion
Policies to Reduce Off-chip Accesses
  • Cooperation policies for capacity sharing
    • (1) Cache-to-cache transfers of clean data
    • (2) Replication-aware replacement
    • (3) Global replacement of inactive data
  • Implemented by two unified techniques
    • Policies enforced by cache replacement/placement
    • Information/data exchange supported by modifying the coherence protocol
Policy (1) - Make use of all on-chip data
  • Don’t go off-chip if on-chip (clean) data exist
  • Beneficial and practical for CMPs
    • Peer cache is much closer than next-level storage
    • Affordable implementations of “clean ownership”
  • Important for all workloads
    • Multi-threaded: (mostly) read-only shared data
    • Single-threaded: spill into peer caches for later reuse
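
As a concrete illustration of policy (1), here is a minimal C++ sketch of how a simulator might resolve a local L2 miss, assuming a hypothetical directory entry that records on-chip sharers and a designated clean owner; the names and structure are illustrative, not taken from the paper.

```cpp
#include <bitset>
#include <optional>

constexpr int kCores = 8;

// Hypothetical directory entry: which private caches hold the block,
// and which one (if any) is the designated clean owner.
struct DirEntry {
    std::bitset<kCores> sharers;
    std::optional<int>  cleanOwner;
};

enum class Source { PeerCache, OffChip };

// Policy (1): on a local miss, prefer a cache-to-cache transfer of
// clean on-chip data over an off-chip access.
Source resolveMiss(const DirEntry& e, int requester) {
    if (e.cleanOwner && *e.cleanOwner != requester)
        return Source::PeerCache;                 // forward to the clean owner
    for (int c = 0; c < kCores; ++c)              // otherwise any other sharer
        if (c != requester && e.sharers.test(c))
            return Source::PeerCache;
    return Source::OffChip;                       // no on-chip copy: go off-chip
}
```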
Policy (2) – Control replication
  • Intuition – increase # of unique on-chip data
  • Latency/capacity tradeoff
    • Evict singlets only when no replicas exist
    • Modify the default cache replacement policy
  • “Spill” an evicted singlet into a peer cache
    • Can further reduce on-chip replication
    • Randomly choose a recipient cache for spilling
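
A compact sketch of how replication-aware replacement and spilling could be modeled, assuming a per-block singlet bit and LRU ages kept by each private L2; the structure and field names are illustrative, not the paper's.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical per-block state kept by each private L2.
struct Block {
    long tag;
    bool singlet;   // true if this is the only on-chip copy
    int  lruAge;    // larger = older
};

// Policy (2): among the blocks of a set, evict the oldest replicated
// block; fall back to the LRU singlet only when every block is a singlet.
int pickVictim(const std::vector<Block>& set) {
    int victim = -1, oldest = -1;
    for (int i = 0; i < (int)set.size(); ++i)        // prefer replicas
        if (!set[i].singlet && set[i].lruAge > oldest) {
            oldest = set[i].lruAge; victim = i;
        }
    if (victim >= 0) return victim;
    for (int i = 0; i < (int)set.size(); ++i)        // all singlets: plain LRU
        if (set[i].lruAge > oldest) { oldest = set[i].lruAge; victim = i; }
    return victim;
}

// When a singlet is evicted, spill it into a randomly chosen peer cache.
int chooseSpillRecipient(int self, int numCores) {
    int r;
    do { r = std::rand() % numCores; } while (r == self);
    return r;
}
```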


Policy (3) - Global cache management
  • Approximate global-LRU replacement
    • Combine global spill/reuse history with local LRU
  • Identify and replace globally inactive data
    • First become the LRU entry in the local cache
    • Set as MRU if spilled into a peer cache
    • Later become LRU entry again: evict globally
  • 1-chance forwarding (1-Fwd)
    • Blocks can only be spilled once if not reused
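
The following sketch is one illustrative reading of 1-chance forwarding, assuming each block carries a "spilled once" and a "reused" flag; it is a simplified interpretation of the slide, not the paper's exact state machine.

```cpp
// Hypothetical metadata for 1-chance forwarding (1-Fwd).
struct BlockMeta {
    bool spilledOnce = false;  // block arrived in this cache via a spill
    bool reused      = false;  // block was hit again after the spill
};

enum class EvictAction { SpillToPeer, EvictGlobally };

// Called when a block becomes the LRU victim in its local cache.
EvictAction onLocalEviction(BlockMeta& m, bool isSinglet) {
    if (!isSinglet)
        return EvictAction::EvictGlobally;   // a replica can simply be dropped
    if (m.spilledOnce && !m.reused)
        return EvictAction::EvictGlobally;   // already used its one chance: evict globally
    m.spilledOnce = true;                    // spill once; recipient inserts it at MRU
    m.reused = false;
    return EvictAction::SpillToPeer;
}
```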
Cooperation Throttling
  • Why throttling?
    • Further tradeoff between capacity/latency
  • Two probabilities to help make decisions
    • Cooperation probability: control replication
    • Spill probability: throttle spilling
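
The two probabilities can be modeled as simple knobs in a simulator; a minimal sketch, assuming Bernoulli draws and illustrative member names:

```cpp
#include <random>

// Hypothetical throttling knobs: with probability coopProb the replication-
// aware victim selection is used (otherwise plain LRU), and with probability
// spillProb an evicted singlet is actually spilled into a peer cache.
struct Throttle {
    double coopProb;   // e.g. 0.0, 0.3, 0.7, 1.0 as in the evaluation
    double spillProb;
    std::mt19937 rng{12345};

    bool useCooperativeReplacement() {
        return std::bernoulli_distribution(coopProb)(rng);
    }
    bool shouldSpill() {
        return std::bernoulli_distribution(spillProb)(rng);
    }
};
```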

[Figure: throttling spectrum, from a private-cache organization at CC 0% (policy (1) only) through cooperative caching at intermediate settings up to CC 100%, approaching a shared-cache organization.]
Outline
  • Introduction
  • CMP Cooperative Caching
  • Hardware Implementation
  • Performance Evaluation
  • Conclusion
Hardware Implementation
  • Requirements
    • Information: singlet, spill/reuse history
    • Cache replacement policy
    • Coherence protocol: clean owner and spilling
  • Can modify an existing implementation
  • Proposed implementation
    • Central Coherence Engine (CCE)
    • On-chip directory by duplicating tag arrays
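
As a rough sketch of the duplicated-tag-array idea, the CCE can be modeled as holding a copy of every core's L1 and L2 tags and probing them to build a sharing vector; the geometry and names below are illustrative, not the paper's exact design.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int kCores = 8;

// Illustrative copy of one core's tag array (flattened over sets and ways).
struct TagArrayCopy {
    std::vector<uint64_t> tags;   // 0 = invalid entry
    bool holds(uint64_t tag) const {
        for (uint64_t t : tags)
            if (t == tag) return true;
        return false;
    }
};

// The CCE answers "who has this block on chip?" by probing the duplicated
// L1 and L2 tag arrays and gathering a per-core sharing state vector.
struct CCE {
    TagArrayCopy l1[kCores], l2[kCores];

    std::bitset<kCores> sharingVector(uint64_t tag) const {
        std::bitset<kCores> v;
        for (int c = 0; c < kCores; ++c)
            v[c] = l1[c].holds(tag) || l2[c].holds(tag);
        return v;
    }
};
```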
Information and Data Exchange
  • Singlet information
    • Directory detects and notifies the block owner
  • Sharing of clean data
    • PUTS: notify directory of clean data replacement
    • Directory sends forward request to the first sharer
  • Spilling
    • Currently implemented as a 2-step data transfer
    • Can be implemented as recipient-issued prefetch
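
A minimal sketch of the directory-side handling of these exchanges, assuming a simple bit-vector sharer list; apart from PUTS, the message and function names are illustrative.

```cpp
#include <bitset>
#include <optional>

constexpr int kCores = 8;

// Illustrative message types for the protocol extensions described above.
enum class Msg {
    PUTS,        // notify the directory of a clean-data replacement
    FwdRequest,  // directory forwards a request to an on-chip sharer
    SpillData    // second step of a spill: data pushed into the recipient
};

struct DirState {
    std::bitset<kCores> sharers;
};

// On a PUTS, the directory clears the sharer bit so it stays precise about
// which caches still hold (clean) copies of the block.
void handlePUTS(DirState& d, int core) { d.sharers.reset(core); }

// On a read miss, the directory sends a forward request to the first
// remaining sharer so the data is served on chip rather than off chip.
std::optional<int> handleReadMiss(const DirState& d, int requester) {
    for (int c = 0; c < kCores; ++c)
        if (c != requester && d.sharers.test(c)) return c;
    return std::nullopt;   // no on-chip copy: go off-chip
}
```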
Outline
  • Introduction
  • CMP Cooperative Caching
  • Hardware Implementation
  • Performance Evaluation
  • Conclusion
Performance Evaluation
  • Full system simulator
    • Modified GEMS Ruby to simulate memory hierarchy
    • Simics MAI-based OoO processor simulator
  • Workloads
    • Multithreaded commercial benchmarks (8-core)
      • OLTP, Apache, JBB, Zeus
    • Multiprogrammed SPEC2000 benchmarks (4-core)
      • 4 heterogeneous, 2 homogeneous
  • Private / shared / cooperative schemes
    • Same total capacity/associativity
Multithreaded Workloads - Throughput
  • CC throttling - 0%, 30%, 70% and 100%
  • Ideal – Shared cache with local bank latency
Multithreaded Workloads - Avg. Latency
  • Low off-chip miss rate
  • High hit ratio to local L2
  • Lower bandwidth needed than a shared cache
Multiprogrammed Workloads
[Chart: average access latency broken down into L1, local L2, remote L2, and off-chip components.]
Sensitivity - Varied Configurations
  • In-order processor with blocking caches

[Chart: performance normalized to the shared cache.]

A Spectrum of Related Research
[Figure: related work laid out on a spectrum from private caches (best latency) to shared caches (best capacity).]
  • Configurable/malleable caches (Liu et al. HPCA’04, Huh et al. ICS’05, cache partitioning proposals, etc.)
  • Cooperative Caching
  • Hybrid schemes (CMP-NUCA proposals, victim replication, Speight et al. ISCA’05, CMP-NuRAPID, Fast & Fair, synergistic caching, etc.)
  • Speight et al. ISCA’05: private caches, with writeback into peer caches
Conclusion
  • CMP cooperative caching
    • Exploit benefits of private cache based design
    • Capacity sharing through explicit cooperation
    • Cache replacement/placement policies for replication control and global management
    • Robust performance improvement

Thank you!

Shared vs. Private Caches for CMP
  • Shared cache – best capacity
    • No replication, dynamic capacity sharing
  • Private cache – best latency
    • High local hit ratio, no interference from other cores
  • Need to combine the strengths of both
[Figure: shared, private, and desired cache organizations; private caches lose capacity to duplication.]
CMP Caching Challenges/Opportunities
  • Future hardware
    • More cores, limited on-chip cache capacity
    • On-chip wire-delay, costly off-chip accesses
  • Future software
    • Diverse sharing patterns, working set sizes
    • Interference due to capacity sharing
  • Opportunities
    • Freedom in designing on-chip structures/interfaces
    • High-bandwidth, low-latency on-chip communication
Multi-cluster CMPs (Scalability)
[Figure: a multi-cluster CMP; each cluster groups cores with private L1 I/D and L2 caches around its own CCE, and multiple clusters are tiled to scale the design.]
Central Coherence Engine (CCE)
[Figure: CCE organization; duplicated L1 and L2 tag arrays for P1 through P8, accessed through 4-to-1 muxes that gather a sharing state vector, plus the directory, a spilling buffer, and input/output queues to memory and the on-chip network.]
Cache/Network Configurations
  • Parameters
    • 128-byte block size; 300-cycle memory latency
    • L1 I/D split caches: 32KB, 2-way, 2-cycle
    • L2 caches: Sequential tag/data access, 15-cycle
    • On-chip network: 5-cycle per-hop
  • Configurations
    • 4-core: 2X2 mesh network
    • 8-core: 3X3 or 4X2 mesh network
    • Same total capacity/associativity for private/shared/cooperative caching schemes
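
Collected into one place, the parameters above might look like the following in a simulator configuration; the field names are illustrative, while the numbers follow the bullets above.

```cpp
// Simulated cache/network parameters (values from the configuration slide).
struct CacheConfig {
    int blockBytes = 128;   // 128-byte block size
    int memLatency = 300;   // cycles to off-chip memory
    int l1SizeKB   = 32;    // per-core split I/D L1
    int l1Assoc    = 2;     // 2-way
    int l1Latency  = 2;     // cycles
    int l2Latency  = 15;    // sequential tag/data access
    int hopLatency = 5;     // on-chip network, per hop
};

// Mesh geometry: 2x2 for the 4-core runs, 3x3 or 4x2 for the 8-core runs.
struct MeshConfig {
    int cols, rows;
};
```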