Cooperative Caching for Chip Multiprocessors

Presentation Transcript


  1. Cooperative Caching for Chip Multiprocessors Jichuan Chang and Gurindar S. Sohi University of Wisconsin-Madison

  2. Review of CPU caching • Block size • Direct vs. n-way set associative vs. fully associative • Multiple layers (L1, L2, L3, …) • Harvard Architecture (Data/Instruction L1) • Three C’s (Compulsory, Capacity, Conflict) • Write policy (write through, write back) • Use of virtual vs. physical addresses • Replacement policies (LRU, random, etc.) • Special types (TLB, Victim) • Prefetching
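
As a concrete refresher (not part of the original slides), here is a minimal sketch of how a physical address is split into tag, set index, and block offset for a set-associative cache; the cache size, block size, and associativity below are illustrative assumptions.

```python
# Illustrative only: split a physical address into (tag, set index, block offset)
# for a set-associative cache. Sizes are made-up example parameters.

def split_address(addr, cache_bytes=256 * 1024, block_bytes=64, ways=8):
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways        # ways == 1 -> direct mapped
                                         # ways == num_blocks -> fully associative
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1

    block_offset = addr & (block_bytes - 1)
    set_index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, block_offset

if __name__ == "__main__":
    # 256 KB, 64 B blocks, 8-way: 512 sets -> 6 offset bits, 9 set-index bits.
    print(split_address(0x12345678))
```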

  3. Review of Cache Coherency • Snooping vs. directory-based • MSI, MESI, MOSI, MOESI • Can often transfer dirty data from cache to cache • Clean data is more difficult because it does not have an “owner” • Inclusion vs Exclusion • Performance issues related to coherency • Mutex example
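
For reference, a simplified sketch of MESI state transitions for a single block, written as a lookup table. This is my own illustration of the review material, not the protocol used in the paper; bus actions (data supply, invalidations) appear only as comments.

```python
# Simplified MESI transitions for one cache block (review sketch, not the paper's protocol).
# "read"/"write" are local accesses; "bus_read"/"bus_write" are requests snooped
# from another cache.

MESI = {
    ("M", "read"): "M",      ("M", "write"): "M",
    ("M", "bus_read"): "S",  # supply the dirty data, then keep a shared copy
    ("M", "bus_write"): "I",
    ("E", "read"): "E",      ("E", "write"): "M",
    ("E", "bus_read"): "S",  ("E", "bus_write"): "I",
    ("S", "read"): "S",      ("S", "write"): "M",   # write must invalidate other sharers
    ("S", "bus_read"): "S",  ("S", "bus_write"): "I",
    ("I", "bus_read"): "I",  ("I", "bus_write"): "I",
}

def next_state(state, event, others_have_copy=False):
    if state == "I" and event == "read":          # read miss
        return "S" if others_have_copy else "E"
    if state == "I" and event == "write":         # write miss, invalidate others
        return "M"
    return MESI[(state, event)]
```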

  4. Chip Multiprocessors (CMP) • Multiple CPU cores on a single chip • Different from hardware multithreading (MT) • Fine-grained, coarse-grained, SMT • Becoming popular in industry with Intel Core 2 Duo, AMD X2, UltraSPARC T1, IBM Xenon, Cell • A common memory architecture is an L1 cache per core and a shared L2 for all cores on the chip • Each core can use the entire L2 cache • Another organization is a private L2 cache per core • Lower latency to the L2 cache and simpler design • L2 cache contention can become a problem for memory-bound threads when the L2 is shared

  5. Goals of CMP Caching • Must be scalable!!! • Reduce off-chip transactions • Expensive and getting worse • Reduce side effects between cores • Cores run different computations and should not severely affect their neighbors when one becomes memory bound • Reduce latency • The main goal of caching • The latency of a shared on-chip cache becomes a problem at high clock speeds

  6. Cooperative Caching • Each core has its own private L2 cache, but additional logic in the cache controller protocol lets the private L2 caches act as an aggregate cache for the chip • Goal is to achieve both the low latency of private L2 caches and the low off-chip miss rate of shared L2 caches • Adapted from cooperative caching in file servers and web caches (where remote operations are expensive)

  7. Methods of Cooperative Caching • Private L2 caches are the baseline • Reduce off-chip accesses • Victim data does not get written off chip • It is placed in a neighbor's cache (capacity stealing) • Did not apply to old SMP systems • Talking to a neighbor's cache was as expensive as talking to memory • Not true for CMP • Can dynamically control the amount of cooperation

  8. Reducing off-chip accesses • Cache-to-cache transfers of clean data • Most cache coherence protocols do not allow this • Dirty data can be transferred cache-to-cache because it has a known owner • Clean data may be in more than one place, so assigning it an owner complicates the coherence protocol • The result is that clean-data transfers for coherence often go through the next level down (SLOW!) • The extra complexity is worth it in a CMP because going to the next level in the memory hierarchy is expensive • They claim sharing clean data can result in a 10-50% reduction in off-chip misses

  9. Reducing off-chip accesses • Replication-aware data replacement • The private-cache method results in multiple copies of the same data • When selecting a victim, picking something that has a duplicate on chip (a "replicate") is better than picking something that is unique on the chip (a "singlet") • The reason is that if the singlet is needed again in the future, an off-chip access is required to get it back • The coherence protocol must be extended to keep track of replicates and singlets • Once again, they claim it is worth it for CMP • If all potential victims are singlets, they use LRU • Victims can spill to a neighbor cache using a weighted random selection algorithm that favors nearest neighbors (sketched below)
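
The selection policy can be pictured roughly as follows. This is a sketch of the idea as described on the slide, not the paper's implementation; the line fields (`singlet`, `lru_rank`) and the ring-distance weighting are assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    singlet: bool    # True if this is the only on-chip copy
    lru_rank: int    # higher = less recently used

def pick_victim(set_lines):
    """Replication-aware selection: prefer evicting a replicate; otherwise use LRU."""
    replicates = [l for l in set_lines if not l.singlet]
    pool = replicates if replicates else set_lines
    return max(pool, key=lambda l: l.lru_rank)

def pick_spill_host(my_core, num_cores):
    """Weighted random choice of a host cache that favors nearby cores (ring distance)."""
    others = [c for c in range(num_cores) if c != my_core]
    dist = [min(abs(c - my_core), num_cores - abs(c - my_core)) for c in others]
    weights = [1.0 / d for d in dist]
    return random.choices(others, weights=weights, k=1)[0]
```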

  10. Reducing off-chip accesses • Global replacement of inactive data • Want something like LRU for the aggregate cache • Difficult to implement because each cache is technically private, leading to synchronization problems • They use N-chance forwarding to handle the global replacement policy (bottom of page 4) • Each block has a recirculation count • When a singlet block is selected as a victim, its recirculation count is set to N • Each time that block is evicted again, its recirculation count is decremented • When the count reaches 0, the block goes to main memory • When the block is accessed, its recirculation count is reset • They use N=1 for the CMP cooperative caching simulations
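
A small sketch of N-chance forwarding as summarized above (field and function names are my own, not the paper's): a singlet victim is given N chances to stay on chip, losing one each time it is evicted from a host cache, and a hit resets the count.

```python
from dataclasses import dataclass
from typing import Optional

N = 1  # the paper uses N = 1 for its CMP cooperative caching simulations

@dataclass
class Block:
    tag: int
    dirty: bool = False
    recirc: Optional[int] = None   # None until the block is spilled for the first time

def on_evict_singlet(block, spill, write_back):
    """N-chance forwarding for a singlet victim; spill() forwards it to another
    private L2, write_back() sends it off chip (or drops it if clean)."""
    if block.recirc is None:
        block.recirc = N           # first eviction: grant N chances to stay on chip
    else:
        block.recirc -= 1          # evicted again from a host cache
    if block.recirc > 0:
        spill(block)
    else:
        write_back(block)

def on_access(block):
    """A hit makes the block active again and resets its recirculation count."""
    block.recirc = None
```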

  11. Cooperative throttling • A cooperation probability parameter in the spill algorithm can be adjusted to throttle the amount of cooperation • It picks between the replication-aware method and the basic LRU method • A lower cooperation probability means the caches act more like private L2s • A higher cooperation probability means the caches act more like a shared L2
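
The throttling knob can be sketched as a single probability check; this illustration (not the paper's code) falls back to plain LRU whenever cooperation is not chosen.

```python
import random

def lru_victim(lines):
    return max(lines, key=lambda l: l["lru_rank"])

def replication_aware_victim(lines):
    replicates = [l for l in lines if not l["singlet"]]
    return lru_victim(replicates) if replicates else lru_victim(lines)

def choose_victim(lines, cooperation_probability):
    """p = 0.0 behaves like ordinary private L2s; p = 1.0 behaves more like a shared L2."""
    if random.random() < cooperation_probability:
        return replication_aware_victim(lines)   # cooperative, replication-aware policy
    return lru_victim(lines)                     # plain private-cache LRU
```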

  12. Hardware implementation • Need extra bits to keep track of the state described on the previous slides • Singlet bit • Recirculation count • Spilling can be push- or pull-based • Push sends victim data directly to the other cache • Pull sends a request to the other cache, which then performs a read • Snooping requires too much overhead for monitoring the private caches • They choose a centralized directory-based protocol similar to MOSI • Might have scaling issues • They speculate that clusters of directories could scale to hundreds of cores, but do not go any deeper
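
The extra per-block state might be kept alongside the coherence state roughly as below; this dataclass only illustrates which bits are added, not the actual hardware tag format.

```python
from dataclasses import dataclass

@dataclass
class TagEntry:
    """Per-block tag state with the extra cooperative-caching bits (illustrative layout)."""
    tag: int
    coherence_state: str    # e.g. one of the MOSI-like directory states
    singlet: bool = True    # set when this is the only on-chip copy
    recirculation: int = 0  # small counter used by N-chance forwarding
```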

  13. Central Coherence Engine (CCE) • Holds the directory and other centralized coherence logic • Every read miss is sent to the directory, which says which private cache has the data • Must keep track of both L1 and L2 tags due to non-inclusion between each core's L1 and its local L2 • Inclusion is instead maintained between the L1s and the aggregate cache

  14. CCE (Continued) • Picks a clean owner for a block to handle cache-to-cache transfers of clean data • The CCE must be updated when a clean copy is evicted from a private L2 • Implements push-based spilling by working with the private caches • A write-back from one private L2 goes through the CCE, which then picks a new host cache for the data and transfers it there
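
A minimal sketch of the directory role described on this and the previous slide: tracking which private caches hold a block, designating a holder as the clean owner on a read miss, and being informed when a clean copy is evicted. The structure and names are assumptions, not the paper's CCE design.

```python
class Directory:
    """Tracks which private caches hold each block so clean data can be
    forwarded cache-to-cache instead of being refetched from off chip."""

    def __init__(self):
        self.sharers = {}   # block address -> set of core ids holding the block

    def read_miss(self, addr, requester):
        """Return the core chosen to supply the block, or None for an off-chip fetch."""
        holders = self.sharers.get(addr, set())
        owner = next(iter(holders), None)   # designate any current holder as the clean owner
        self.sharers.setdefault(addr, set()).add(requester)
        return owner

    def evict_clean(self, addr, core):
        """The directory must be told when a clean copy leaves a private L2."""
        holders = self.sharers.get(addr, set())
        holders.discard(core)
        if not holders:
            self.sharers.pop(addr, None)
```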

  15. Performance evaluation • Go over section 4 of paper

  16. Related Work • CMP-NUCA • CMP-NuRAPID • Victim Replication

  17. Conclusion • Cooperative caching can reduce the runtime of simulated workloads by 4-38%, and performs at worst 2.2% slower than the best of private and shared caches in extreme cases. • Power utilization (by turning off private L2s) and performance isolation (reducing side effects between cores) are left as future work
