
Cooperative Caching for Chip Multiprocessors


Presentation Transcript


  1. Cooperative Caching for Chip Multiprocessors Jichuan Chang†, Enric Herrero‡, Ramon Canal‡ and Gurindar S. Sohi* HP Labs†, Universitat Politècnica de Catalunya‡, University of Wisconsin-Madison* M. S. Obaidat and S. Misra (Editors), Chapter 13, Cooperative Networking (Wiley)

  2. Outline • Motivation • Cooperative Caching for CMPs • Applications of CMP Cooperative Caching • Latency Reduction • Adaptive Repartitioning • Performance Isolation • Conclusions

  3. Motivation - Background • Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs • Critical for CMPs • Processor/memory gap • Limited pin bandwidth • Current designs • Shared cache: sharing can lead to capacity contention • Private caches: isolation can waste capacity (Figure: a 4-core CMP behind a narrow, slow path to off-chip memory)

  4. Motivation – Challenges • Key challenges • Growing on-chip wire delay • Expensive off-chip accesses • Destructive inter-thread interference • Diverse workload characteristics • Three important demands for CMP caching • Capacity: reduce off-chip accesses • Latency: reduce remote on-chip references • Isolation: reduce inter-thread interference • Need to combine the strengths of both private and shared cache designs

  5. Outline • Motivation • Cooperative Caching for CMPs • Applications of CMP Cooperative Caching • Latency Reduction • Adaptive Repartitioning • Performance Isolation • Conclusions

  6. CMP Cooperative Caching • Form an aggregate global cache via cooperative private caches • Use private caches to attract data for fast reuse • Share capacity through cooperative policies • Throttle cooperation to find an optimal sharing point • Inspired by cooperative file/web caches • Similar latency tradeoff • Similar algorithms (Figure: four cores, each with private L1I/L1D and L2 caches, connected by an on-chip network)

  7. CMP Cooperative Caching • “Private” L2 caches to reduce access latency • A centralized directory with duplicated tags (the CCE) maintains on-chip coherence • Spilling: evicted blocks are forwarded to other caches for more efficient use of cache space (N-Chance forwarding mechanism) (Figure: cores with private L1 and L2 caches and a central CCE on the interconnect, attached to the main memory bus)
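
A minimal sketch of the N-Chance forwarding idea in Python (the Block fields, the N_CHANCES value, and the list-based caches are illustrative assumptions, not the chapter's implementation):

    import random
    from dataclasses import dataclass

    N_CHANCES = 2  # illustrative: max times a clean block may be spilled

    @dataclass
    class Block:
        addr: int
        dirty: bool = False
        recirc: int = 0  # how many times this block has been spilled

    def on_evict(block, peer_caches, memory):
        """N-Chance forwarding: instead of dropping an evicted block,
        forward it to a randomly chosen peer cache, up to N_CHANCES
        times, so its capacity is not lost to the chip."""
        if block.dirty:
            memory[block.addr] = block  # dirty data must be written back
        elif block.recirc < N_CHANCES and peer_caches:
            block.recirc += 1
            random.choice(peer_caches).append(block)  # spill for possible reuse
        # else: drop silently; memory still holds a clean copy

Each spill increments the block's recirculation count, so a block that is never reused eventually leaves the chip instead of bouncing between peers forever.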

  8. Distributed Cooperative Caching • Objective: keep the benefits of Cooperative Caching while improving scalability and energy consumption • Distributed directory (DCEs) with a different tag-allocation mechanism (Figure: cores with private L1 and L2 caches and distributed coherence engines (DCEs) on the interconnect, attached to the main memory bus)

  9. Tag Structure Comparison

  10. Outline • Motivation • Cooperative Caching for CMPs • Applications of CMP Cooperative Caching • Latency Reduction • Adaptive Repartitioning • Performance Isolation • Conclusions

  11. 3. Applications of CMP Cooperative Caching • Several techniques have appeared that take advantage of Cooperative Caching for CMPs • For latency reduction: Cooperation Throttling • For adaptive repartitioning: Elastic Cooperative Caching • For performance isolation: Cooperative Cache Partitioning

  12. Outline • Motivation • Cooperative Caching for CMPs • Applications of CMP Cooperative Caching • Latency Reduction • Adaptive Repartitioning • Performance Isolation • Conclusions

  13. Policies to Reduce Off-chip Accesses • Cooperation policies for capacity sharing • (1) Cache-to-cache transfers of clean data • (2) Replication-aware replacement • (3) Global replacement of inactive data • Implemented by two unified techniques • Policies enforced by cache replacement/placement • Information/data exchange supported by modifying the coherence protocol

  14. Policy (1) - Make use of all on-chip data • Don’t go off-chip if on-chip (clean) data exist • Beneficial and practical for CMPs • Peer cache is much closer than next-level storage • Affordable implementations of “clean ownership” • Important for all workloads • Multi-threaded: (mostly) read-only shared data • Single-threaded: spill into peer caches for later reuse
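
The first policy amounts to a directory lookup before any off-chip request. A hedged sketch (the dict-based directory and cache structures are stand-ins for the duplicated-tag hardware):

    def fetch_on_miss(addr, directory, caches, memory):
        """Policy (1) sketch: on a local miss, consult the on-chip
        directory for any peer copy, clean or dirty, before paying
        for an off-chip access."""
        holder = directory.get(addr)  # addr -> id of the cache holding it
        if holder is not None:
            return caches[holder][addr]  # cache-to-cache transfer of clean data
        return memory[addr]  # off-chip only as a last resort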

  15. Policy (2) – Control replication • Intuition: increase the number of unique on-chip data blocks • Latency/capacity tradeoff • Evict singlets only when no replicas exist • Modify the default cache replacement policy • “Spill” an evicted singlet into a peer cache • Can further reduce on-chip replication • Randomly choose a recipient cache for spilling “singlets”
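
A sketch of replication-aware replacement (the (tag, is_singlet) encoding is illustrative; the singlet bit is assumed to be maintained by the coherence protocol):

    import random

    def choose_victim(lru_ordered_set, peer_caches):
        """Policy (2) sketch: prefer evicting a replica (a line that
        also lives in a peer cache) over a singlet (the only on-chip
        copy); an evicted singlet is spilled to a random peer."""
        for i, (tag, is_singlet) in enumerate(lru_ordered_set):
            if not is_singlet:
                return lru_ordered_set.pop(i)  # oldest replica goes first
        victim = lru_ordered_set.pop(0)  # all singlets: evict the local LRU
        if peer_caches:
            random.choice(peer_caches).append(victim)  # keep the unique copy on chip
        return victim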

  16. Policy (3) - Global cache management • Approximate global-LRU replacement • Combine global spill/reuse history with local LRU • Identify and replace globally inactive data • First become the LRU entry in the local cache • Set as MRU if spilled into a peer cache • Later become LRU entry again: evict globally • 1-chance forwarding (1-Fwd) • Blocks can only be spilled once if not reused
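
A sketch of the 1-chance forwarding rule (the dict-based entries and MRU-first list convention are illustrative):

    import random

    def on_local_lru(entry, peer_caches):
        """Policy (3) sketch: a block reaching the local LRU position
        is spilled once and installed as MRU at its host; if it drifts
        back to LRU without being reused, it is evicted from the chip."""
        if not entry['spilled'] and peer_caches:
            entry['spilled'] = True
            host = random.choice(peer_caches)
            host.insert(0, entry)  # host list is MRU-first: one more chance
            return 'spilled'
        return 'evicted globally'  # second LRU arrival: globally inactive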

  17. Cooperation Throttling • Why throttling? • Further tradeoff between capacity/latency • Two probabilities to help make decisions • Cooperation probability: control replication • Spill probability: throttle spilling (Figure: a design spectrum from Private at CC 0% to Shared at CC 100%, with Cooperative Caching spanning the points in between)
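
A sketch of how the two probabilities might gate the policies (the probability values and dict-based entries are illustrative assumptions):

    import random

    COOP_PROB = 0.5   # illustrative: how often replication-aware replacement runs
    SPILL_PROB = 0.5  # illustrative: how often an evicted singlet is spilled

    def throttled_evict(lru_ordered_set, peer_caches):
        """Throttling sketch: apply each cooperation policy only with
        some probability, sliding behavior between private caches
        (both 0%) and full cooperative caching (both 100%)."""
        replicas = [e for e in lru_ordered_set if not e['singlet']]
        if replicas and random.random() < COOP_PROB:
            victim = replicas[0]  # policy (2): evict a replica
        else:
            victim = lru_ordered_set[0]  # default local LRU victim
        lru_ordered_set.remove(victim)
        if victim['singlet'] and peer_caches and random.random() < SPILL_PROB:
            random.choice(peer_caches).append(victim)  # throttled spill
        return victim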

  18. Performance Evaluation • Single-threaded and SPECOMP workloads (charts omitted)

  19. Outline • Motivation • Cooperative Caching for CMPs • Applications of CMP Cooperative Caching • Latency Reduction • Adaptive Repartitioning • Performance Isolation • Conclusions

  20. CC for Adaptive Repartitioning • Tradeoff between minimizing off-chip misses and avoiding inter-thread interference. • Elastic Cooperative Caching adapts caches according to application requirements. • High cache requirements -> Big local private cache • Low data reuse -> Cache space reassigned to increase the size of the global shared cache for spilled blocks.

  21. Elastic Cooperative Caching structure • Each node has a local cache split into shared and private partitions, with an independent repartitioning unit • Distributed Coherence Engines maintain coherence • The shared partition allocates evicted blocks from all private regions; only the local core can allocate into the private partition • Every N cycles, the repartitioning unit resizes the partitions based on LRU hits in the shared and private partitions • A spilled-block allocator distributes blocks evicted from the private partition among nodes

  22. Repartitioning Unit Working Example • Independent partitioning, distributed structure • Request hits near the LRU positions update a counter over each repartition cycle • Counter > HT (high threshold): private partition grows (private++) • Counter < LT (low threshold): shared partition grows (shared++)
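
One plausible reading of the counter logic as code (the threshold values and the single-counter encoding are assumptions for illustration):

    HIGH_T, LOW_T = 64, 16  # illustrative hit-count thresholds (HT, LT)

    def repartition(lru_hit_counter, private_ways, total_ways):
        """Repartitioning-unit sketch, run once per repartition cycle:
        many hits near the private partition's LRU ways suggest the
        core wants more private space; few suggest a way can go back
        to the shared pool."""
        if lru_hit_counter > HIGH_T and private_ways < total_ways:
            private_ways += 1  # Counter > HT: private++
        elif lru_hit_counter < LOW_T and private_ways > 0:
            private_ways -= 1  # Counter < LT: shared++
        return private_ways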

  23. Spilled-Block Allocator Working Example • Only data from the private region can be spilled • Spilled blocks are placed in peers' shared ways, with candidates found via broadcast • No need for perfectly up-to-date information: the allocation decision is out of the critical path
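
A sketch of the allocator's target choice under stale information (the per-node way counts and the largest-shared-partition selection rule are illustrative):

    def pick_spill_target(shared_ways_by_node, my_id):
        """Spilled-block-allocator sketch: choose the peer whose shared
        partition currently looks largest, using way counts that nodes
        broadcast periodically.  Slightly stale counts are acceptable
        because the decision is out of the critical path."""
        peers = {n: w for n, w in shared_ways_by_node.items() if n != my_id}
        return max(peers, key=peers.get)  # node id with the most shared ways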

  24. Performance and energy-efficiency evaluation • Improvements of +24% and +12% over ASR (charts omitted)

  25. Outline • Motivation • Cooperative Caching for CMPs • Applications of CMP Cooperative Caching • Latency Reduction • Adaptive Repartitioning • Performance Isolation • Conclusions

  26. Time-share Based Partitioning • Throughput-fairness dilemma • Cooperation: taking turns to speed up • Multiple time-sharing partitions (MTP) • QoS guarantee • Cooperatively shrink/expand across MTPs • Bound average slowdown over the long term • Fairness improvement and QoS guarantee are reflected by a higher FS and bounded QoS values (Figure: two four-thread schedules with identical throughput, IPC = 0.52 and WS = 2.42, where MTP raises FS from 1.22 to 1.97 and bounds QoS at 0 instead of -0.52)
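
For reference, a sketch of the metrics quoted above under commonly used definitions (the exact formulas are an assumption, not quoted from the chapter): WS as the sum of per-thread speedups, FS as their harmonic mean, and QoS as the summed slowdowns below baseline.

    def schedule_metrics(speedups):
        """Assumed definitions: WS = sum of per-thread speedups,
        FS = harmonic mean of speedups, QoS = total slowdown below
        the baseline (0 when no thread is slowed down)."""
        n = len(speedups)
        ws = sum(speedups)
        fs = n / sum(1.0 / s for s in speedups)
        qos = sum(min(0.0, s - 1.0) for s in speedups)
        return ws, fs, qos

Under these definitions, two schedules can have identical WS (throughput) yet differ sharply in FS and QoS, which is exactly the contrast the slide's figure draws.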

  27. MTP Benefits • Better than a single spatial partition (SSP) • MTP with long-term QoS performs almost the same as MTP with no QoS • Offline analysis based on profile information, 210 workloads (job mixes) (Figure: percentage of workloads achieving various Fair Speedup values)

  28. MTP Issues and Cooperative Cache Partitioning (CCP) • MTP issues: not needed when LRU performs better (LRU is often near-optimal [Stone et al., IEEE TOC ’92]); its partitioning is more complex than SSP • CCP: integration with Cooperative Caching (CC) • Exploits CC’s latency and LRU-based sharing benefits • Simplifies the partitioning algorithm • Total execution time = Epochs(CC) + Epochs(MTP), weighted by the number of threads benefiting from CC vs. MTP • Better than MTP

  29. Partitioning Heuristic • When is MTP better than CC? • QoS: ∑speedup > ∑slowdown (over N partitions) • Speedup should be large, since CC is already good at fine-grained tuning • thrashing_test: Speedup > (N-1) × Slowdown (Figure: normalized throughput vs. allocated cache ways, 16-way total, 4 cores, marking the baseline, C_shrink, and C_expand operating points)
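
The thrashing test is small enough to state directly as code:

    def thrashing_test(speedup, slowdown, n):
        """With N time-shared partitions, a thread expands in one epoch
        and is shrunk in the other N-1, so expansion pays off only if
        its speedup outweighs the accumulated slowdown."""
        return speedup > (n - 1) * slowdown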

  30. Partitioning Algorithm • S = All threads - supplier threads (e.g., gcc, swim) • Allocate them with gPar (guaranteed partition, or min. capacity needed for QoS) [Yeh/Reinman CASES ’05] • For threads in S, init their C_expand and C_shrink • Do thrashing_test iteratively for each thread in S • If thread t fails, allocate t with gPar, remove t from S • Update C_expand and C_shrink for other threads in S • Repeat until S is empty or all threads in S pass the test
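
A sketch of the algorithm above (thread fields and helpers such as is_supplier, gpar, init_bounds, speedup_at, and slowdown_at are hypothetical stand-ins for the chapter's machinery):

    def ccp_partition(threads, n_partitions):
        """Allocate suppliers their gPar up front, then iteratively run
        the thrashing test over the remaining set S, demoting any
        thread that fails to its gPar until S is empty or stable."""
        s = [t for t in threads if not t.is_supplier]
        for t in threads:
            if t.is_supplier:
                t.alloc = t.gpar  # gPar: min capacity needed for QoS
        for t in s:
            t.c_expand, t.c_shrink = t.init_bounds()
        progress = True
        while s and progress:
            progress = False
            for t in list(s):
                gain = t.speedup_at(t.c_expand)
                loss = t.slowdown_at(t.c_shrink)
                if gain <= (n_partitions - 1) * loss:  # t fails the test
                    t.alloc = t.gpar
                    s.remove(t)
                    for u in s:  # freed capacity shifts the others' bounds
                        u.c_expand, u.c_shrink = u.init_bounds()
                    progress = True
        return threads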

  31. Fair Speedup Results • Two groups of workloads • PAR: MTP better than CC, partitioning helps (67 out of 210 workloads) • LRU: CC better than MTP, partitioning hurts (143 out of 210 workloads) (Figure: percentage of workloads achieving various Fair Speedup values for each group)

  32. Performance and QoS evaluation (charts omitted)

  33. Outline • Motivation • Cooperative Caching for CMPs • Applications of CMP Cooperative Caching • Latency Reduction • Adaptive Repartitioning • Performance Isolation • Conclusions

  34. 4. Conclusions • The on-chip cache hierarchy plays an important role in CMPs and must provide fast and fair data accesses for multiple, competing processor cores. • Cooperative Caching for CMPs is an effective framework to manage caches in such an environment. • Cooperative sharing mechanisms, and the philosophy of using cooperation for conflict resolution, can be applied to many other resource management problems.
