
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture



Presentation Transcript


  1. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture Seongbeom Kim, Dhruba Chandra, and Yan Solihin Dept. of Electrical and Computer Engineering North Carolina State University {skim16, dchandr, solihin}@ncsu.edu

  2–5. Cache Sharing in CMP [figure: two processor cores, each with a private L1 cache, sharing one L2 cache; thread t1 runs on Processor Core 1 and thread t2 on Processor Core 2] When t1 and t2 co-run, t2’s throughput is significantly reduced due to unfair cache sharing. Seongbeom Kim, NCSU

  6–7. Shared L2 Cache Space Contention [figure: shared L2 cache space contention between co-scheduled threads]

  8. Impact of Unfair Cache Sharing [figure: scheduling timelines; with uniprocessor scheduling, time slices rotate t1, t2, t3, t1, t4; with 2-core CMP scheduling, P1 runs t1 in every slice while P2 runs t3, t3, t2, t2, t4] • Problems of unfair cache sharing: sub-optimal throughput, thread starvation, priority inversion, thread-mix dependent throughput • Fairness: uniform slowdown for co-scheduled threads

  9. Contributions • Cache fairness metrics: easy to measure; approximate uniform slowdown well • Fair caching algorithms: static/dynamic cache partitioning optimizing fairness, with simple hardware modifications • Simulation results: fairness improved 4x; throughput improved 15%, comparable to the cache-miss-minimization approach

  10. Related Work • Cache miss minimization in CMP: G. Suh, S. Devadas, L. Rudolph, HPCA 2002 • Balancing throughput and fairness in SMT: K. Luo, J. Gummaraju, M. Franklin, ISPASS 2001; A. Snavely and D. Tullsen, ASPLOS 2000 • …

  11. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  12–13. Fairness Metrics • Uniform slowdown. Let Talone_i be the execution time of thread ti when it runs alone, and Tshared_i its execution time when it shares the cache with others.

  14–16. Fairness Metrics • Uniform slowdown • Ideally, co-scheduled threads slow down uniformly: Tshared_i / Talone_i = Tshared_j / Talone_j for every pair of threads ti, tj. • We want to minimize M0 = Σ_{i,j} |X_i − X_j|, where X_i = Tshared_i / Talone_i. Since execution times are hard to measure online, X_i is approximated with L2 miss statistics, e.g. X_i = Miss_shared_i / Miss_alone_i (metric M1) or X_i = MissRate_shared_i / MissRate_alone_i (metric M3).
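The metrics above share one shape: a per-thread slowdown ratio X_i, and a fairness score that sums pairwise differences of those ratios (0 means perfectly uniform slowdown). A minimal sketch, with hypothetical helper names not taken from the paper:

```python
# Sketch of the fairness metrics above (illustrative helpers, not the
# authors' code). X_i is each thread's slowdown ratio; the metric sums
# |X_i - X_j| over all thread pairs, so 0 means uniform slowdown.
from itertools import combinations

def slowdown_ratios(shared, alone):
    """X_i = (measurement when sharing) / (measurement when alone)."""
    return [s / a for s, a in zip(shared, alone)]

def fairness_metric(shared, alone):
    """Sum of |X_i - X_j| over all thread pairs (M0-style metric)."""
    x = slowdown_ratios(shared, alone)
    return sum(abs(xi - xj) for xi, xj in combinations(x, 2))

# Fed execution times this is M0; fed miss counts or miss rates, the
# same formula yields the M1 / M3 approximations.
print(fairness_metric([2.0, 3.0], [1.0, 1.0]))  # -> 1.0 (unfair)
print(fairness_metric([2.0, 2.0], [1.0, 1.0]))  # -> 0.0 (uniform)
```

The second call shows why uniform slowdown, not equal miss rates, is the target: both threads slow down by 2x, and the metric is zero.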

  17. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  18–19. Partitionable Cache Hardware • Modified LRU cache replacement policy (G. Suh et al., HPCA 2002) • Example: the current partition is P1: 448B, P2: 576B; the target partition is P1: 384B, P2: 640B. On a P2 miss, the policy evicts an LRU line belonging to P1, which is over its target, so the current partition converges to the target: P1: 384B, P2: 640B.
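A minimal software model of this modified LRU policy for one fully associative set (a hypothetical Python sketch, not the paper's hardware; tags are assumed unique across threads for simplicity): each line records its owner thread, and on a miss the victim is the LRU line of a thread that exceeds its target allocation, so the actual partition drifts toward the target one line per miss.

```python
# Toy model of partition-aware LRU replacement for a single set.
# self.lines is ordered from LRU (index 0) to MRU (last).
class PartitionedSet:
    def __init__(self, num_lines, target):
        self.num_lines = num_lines   # associativity of the set
        self.target = dict(target)   # thread id -> target line count
        self.lines = []              # list of (owner, tag) pairs

    def usage(self, thread):
        """Lines currently owned by this thread."""
        return sum(1 for owner, _ in self.lines if owner == thread)

    def access(self, thread, tag):
        """Returns True on hit; on a miss, allocates per modified LRU."""
        for i, (owner, t) in enumerate(self.lines):
            if t == tag:
                self.lines.append(self.lines.pop(i))  # move to MRU
                return True
        if len(self.lines) == self.num_lines:
            # Prefer the LRU line of a thread that is over its target.
            victim = next((i for i, (o, _) in enumerate(self.lines)
                           if self.usage(o) > self.target.get(o, 0)), None)
            if victim is None:
                # Everyone is at target: evict the requester's own LRU line.
                victim = next((i for i, (o, _) in enumerate(self.lines)
                               if o == thread), 0)
            self.lines.pop(victim)
        self.lines.append((thread, tag))
        return False
```

For example, with a 4-line set and targets {thread 1: 1 line, thread 2: 3 lines}, letting thread 1 fill the set and then streaming thread 2 through it ends with exactly 1 line owned by thread 1 and 3 by thread 2, mirroring the slide's convergence of current partition to target.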

  20. Dynamic Fair Caching Algorithm (example: optimizing the M3 metric) • State kept per thread: MissRate alone (profiled; here P1: 20%, P2: 5%), MissRate shared (measured during each repartitioning interval), and the target partition.

  21–22. 1st interval (target partition P1: 256KB, P2: 256KB): measured MissRate shared is P1: 20%, P2: 15%. Evaluate M3: X_P1 = 20% / 20% = 1.0, X_P2 = 15% / 5% = 3.0. Repartition! P2 is slowed more, so with a partition granularity of 64KB the target becomes P1: 192KB, P2: 320KB.

  23–24. 2nd interval: MissRate shared becomes P1: 20%, P2: 10%. Evaluate M3: X_P1 = 20% / 20%, X_P2 = 10% / 5%; P2 is still slowed more, so the target becomes P1: 128KB, P2: 384KB.

  25–26. 3rd interval: MissRate shared becomes P1: 25%, P2: 9%. Rollback check for P2, the thread that received extra space: Δ = MRold − MRnew = 10% − 9% = 1%. Do rollback if Δ < Trollback: the extra 64KB did not pay off, so the target partition reverts to P1: 192KB, P2: 320KB.
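The interval-by-interval walkthrough above can be sketched as follows (illustrative Python with hypothetical names, not the hardware implementation; treating Trollback as relative to the old miss rate is an assumption here):

```python
# Sketch of one repartitioning step of the dynamic fair caching loop.
# At each interval boundary, the thread with the larger slowdown ratio
# X_i = MissRate_shared / MissRate_alone gains one 64KB granule from
# the thread with the smaller ratio.
GRANULE_KB = 64

def repartition(partition, miss_shared, miss_alone):
    """partition: {tid: size_kb}; returns the new target partition."""
    x = {tid: miss_shared[tid] / miss_alone[tid] for tid in partition}
    loser = max(x, key=x.get)    # most-slowed thread: grow it
    winner = min(x, key=x.get)   # least-slowed thread: shrink it
    new = dict(partition)
    if loser != winner and new[winner] > GRANULE_KB:
        new[winner] -= GRANULE_KB
        new[loser] += GRANULE_KB
    return new

def maybe_rollback(old_part, new_part, mr_old, mr_new, t_rollback=0.20):
    """Keep new_part only if the grown thread's miss rate improved by
    at least t_rollback (modeled here as a fraction of mr_old)."""
    delta = mr_old - mr_new
    return new_part if delta >= t_rollback * mr_old else old_part
```

Replaying the slides' numbers: with MissRate alone P1: 20%, P2: 5% and measured MissRate shared P1: 20%, P2: 15%, `repartition({1: 256, 2: 256}, ...)` moves 64KB from P1 to P2, giving {1: 192, 2: 320}; the next interval's 10% for P2 moves another granule, giving {1: 128, 2: 384}, matching the walkthrough.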

  27. Fair Caching Overhead • Partitionable cache hardware • Profiling: static profiling for M1, M3; dynamic profiling for M1, M3, M4 • Storage: per-thread registers holding miss rate/count for the "alone" case and the "shared" case • Repartitioning algorithm: < 100 cycles overhead in a 2-core CMP, invoked at every repartitioning interval

  28. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  29. Evaluation Environment • UIUC's SESC simulator (cycle-accurate)

  30. Evaluation Environment • 18 benchmark pairs • Algorithm parameters • Static algorithm: FairM1 • Dynamic algorithms: FairM1Dyn, FairM3Dyn, FairM4Dyn

  31. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Correlation results • Static fair caching results • Dynamic fair caching results • Impact of rollback threshold • Impact of time interval • Conclusions

  32–33. Correlation Results [figure: correlation of the candidate metrics with M0] M1 and M3 show the best correlation with M0.

  34–36. Static Fair Caching Results [figure: fairness and throughput of the static algorithms] • FairM1 achieves throughput comparable to MinMiss, with better fairness. • Opt confirms that better fairness is achieved without throughput loss.

  37–40. Dynamic Fair Caching Results [figure: fairness and throughput of the dynamic algorithms] • FairM1Dyn and FairM3Dyn show the best fairness and throughput. • Improvement in fairness results in throughput gain. • Fair caching sometimes degrades throughput (2 out of 18 pairs).

  41–42. Impact of Rollback Threshold in FairM1Dyn [figure: fairness and throughput across Trollback values] A Trollback of 20% shows the best fairness and throughput.

  43–44. Impact of Repartitioning Interval in FairM1Dyn [figure: fairness and throughput across interval lengths] An interval of 10K L2 accesses shows the best fairness and throughput.

  45. Outline • Fairness Metrics • Static Fair Caching Algorithms (See Paper) • Dynamic Fair Caching Algorithms • Evaluation Environment • Evaluation • Conclusions

  46. Conclusions • Problems of unfair cache sharing: sub-optimal throughput, thread starvation, priority inversion, thread-mix dependent throughput • Contributions: cache fairness metrics; static/dynamic fair caching algorithms • Benefits of fair caching: fairness improved 4x; throughput improved 15%, comparable to the cache-miss-minimization approach; fair caching simplifies scheduler design; simple hardware support

  47–48. Partitioning Histogram [figure: distribution of chosen partitions over time] • Mostly oscillating between two partitioning choices. • A Trollback of 35% can still find a better partition.

  49. Impact of Partition Granularity in FairM1Dyn A granularity of 64KB shows the best fairness and throughput.

  50. Impact of Initial Partition in FairM1Dyn Differences across various initial partitions are tolerable.
