Cache coherence for CMPs

Cache coherence for CMPs Miodrag Bolic

Private cache
• Each cache bank is private to a particular core
• Cache coherence is maintained at the L2 cache level
• Examples: Intel Montecito [81], AMD Opteron [56], and IBM POWER6 [63]

Private cache
Advantages
• Short L2 cache access latency
• Small amount of network traffic: since the local L2 cache bank filters most memory requests, few coherence messages are injected into the interconnection network
Disadvantages
• Data blocks can be duplicated across the private caches
• If the working set accessed by the different cores is not well balanced, some caches can be over-utilized while others are under-utilized

Shared cache
• Cache coherence is maintained at the L1 level
• The bits chosen to map a block to a particular bank are usually the least significant ones
• Examples: Piranha [16], Hydra [47], Sun UltraSPARC T2 [105], and Intel Merom [104]
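The bank mapping described above can be sketched as follows. This is only an illustration: the function name, the 64-byte block size, and the 16 banks are assumptions, not values from the slides, and both sizes are assumed to be powers of two.

```python
def home_bank(address: int, block_size: int = 64, num_banks: int = 16) -> int:
    """Return the shared-L2 bank that holds the block containing `address`.

    Hypothetical helper: block_size and num_banks are assumed values.
    """
    block_number = address // block_size   # drop the block-offset bits
    return block_number % num_banks        # keep the lowest block-address bits


# Consecutive blocks land on consecutive banks and wrap around,
# which spreads blocks over the banks in a round-robin fashion.
banks = [home_bank(block * 64) for block in range(18)]
```

Because the bank is picked from the low-order bits, neighbouring blocks never compete for the same bank, regardless of which core touches them.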

Shared caches
Advantages
• Single copy of each block
• Workload balancing: the utilization of each cache bank does not depend on the working set accessed by each core, because blocks are distributed uniformly among the cache banks in a round-robin fashion. As a result, the aggregate cache capacity is increased
Disadvantages
• Many requests will be serviced by remote banks (L2 NUCA architecture)

Hammer protocol
• Used in AMD Opteron systems
• Relies on broadcasting requests to all tiles to resolve cache misses
• Targets systems that use unordered point-to-point interconnection networks
• On every cache miss, Hammer sends a request to the home tile. If the memory block is present on-chip, the request is forwarded to the rest of the tiles to obtain the requested block
• All tiles answer the forwarded request by sending either an acknowledgement or the data message to the requesting core
• The requesting core must wait until it has received a response from every other tile. Once it has received all the responses, it sends an unblock message to the home tile
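The message flow of a single miss can be traced with a small sketch. It is a simplified model under stated assumptions, not the actual Opteron implementation: the block is assumed to be on-chip at one `owner` tile, the requester is assumed not to be the home tile, the home tile checks its own cache without a network message, and the function and message names are hypothetical.

```python
def hammer_miss(num_tiles: int, requester: int, home: int, owner: int):
    """Return the (kind, source, destination) messages for one cache miss.

    Simplified model: requester != home, and the block is held by `owner`.
    """
    assert requester != home, "sketch only covers a remote home tile"
    messages = [("request", requester, home)]

    others = [t for t in range(num_tiles) if t != requester]

    # The home tile broadcasts the request to the rest of the tiles.
    for tile in others:
        if tile != home:  # the home tile looks up its own cache locally
            messages.append(("forward", home, tile))

    # Every other tile answers the requester with data or an acknowledgement.
    for tile in others:
        kind = "data" if tile == owner else "ack"
        messages.append((kind, tile, requester))

    # Only after collecting all responses does the requester unblock the home.
    messages.append(("unblock", requester, home))
    return messages


trace = hammer_miss(num_tiles=4, requester=0, home=1, owner=3)
```

In this model the number of messages per miss grows linearly with the number of tiles, even though only one response carries data.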

Hammer protocol
Disadvantages
• Requires three hops in the critical path before the requested data block is obtained
• Broadcasting invalidation messages considerably increases the traffic injected into the interconnection network and, therefore, its power consumption

Directory protocol
• To accelerate cache misses, the directory information is not stored in main memory. Instead, it is usually stored on-chip at the home tile of each block
• In tiled CMPs, the directory structure is split into banks that are distributed across the tiles
• Each directory bank tracks a particular range of memory blocks
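A distributed directory of this kind might be sketched as below. The class and field names are hypothetical, and the way blocks are assigned to banks (low-order block-address bits) is an assumption; the slides only say that each bank tracks a particular range of blocks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class DirectoryEntry:
    owner: Optional[int] = None                      # tile holding the block exclusively
    sharers: Set[int] = field(default_factory=set)   # tiles holding read copies


class BankedDirectory:
    """One directory bank per tile; each bank tracks its own subset of blocks."""

    def __init__(self, num_tiles: int, block_size: int = 64):
        self.num_tiles = num_tiles
        self.block_size = block_size
        self.banks: List[Dict[int, DirectoryEntry]] = [{} for _ in range(num_tiles)]

    def home_tile(self, address: int) -> int:
        # Assumed mapping: interleave blocks across tiles by block number.
        return (address // self.block_size) % self.num_tiles

    def entry(self, address: int) -> DirectoryEntry:
        block = address // self.block_size
        bank = self.banks[self.home_tile(address)]
        return bank.setdefault(block, DirectoryEntry())

    def record_read(self, address: int, tile: int) -> None:
        self.entry(address).sharers.add(tile)


directory = BankedDirectory(num_tiles=4)
directory.record_read(0x40, tile=2)
directory.record_read(0x40, tile=3)
```

Only the bank at the home tile of address `0x40` holds state for that block; the other banks are untouched.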

Directory protocol
• The indirection problem
  • Every cache miss must reach the home tile before any coherence action can be performed
  • This adds unnecessary hops to the critical path of cache misses
• The directory memory overhead needed to keep track of the sharers of each memory block can be intolerable for large-scale configurations
  • Example: block size 16 bytes, 64 tiles
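The example can be worked through with a short calculation. It assumes a full-map sharing code (one presence bit per tile for every block) and ignores state bits; the slides give only the block size and tile count, so the encoding and the function name are assumptions.

```python
def full_map_overhead(block_bytes: int, num_tiles: int) -> float:
    """Directory bits per data bit, assuming one presence bit per tile per block."""
    sharer_bits = num_tiles        # full-map bit vector, state bits ignored
    data_bits = block_bytes * 8
    return sharer_bits / data_bits


overhead = full_map_overhead(block_bytes=16, num_tiles=64)   # 64 / 128 = 0.5
```

Under these assumptions the sharer information alone costs half as much storage as the data it tracks, and the ratio doubles each time the number of tiles doubles.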

Comparison of protocols

Interleaving

Mapping between cache entries and directory entries
• One way to keep the size of the directory entries constant is to store duplicate tags
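A duplicate-tag directory can be sketched as follows. This is a minimal model, not a description of any particular chip: the class name, the set/way geometry, and the FIFO replacement used to mirror the tracked caches are all assumptions.

```python
class DuplicateTagDirectory:
    """Keeps a copy of every tracked cache's tags instead of a sharer vector."""

    def __init__(self, num_tiles: int, sets_per_cache: int, ways: int):
        self.sets = sets_per_cache
        self.ways = ways
        # tags[tile][set_index] holds up to `ways` tags, mirroring that cache.
        self.tags = [[[] for _ in range(sets_per_cache)] for _ in range(num_tiles)]

    def _locate(self, block: int):
        return block % self.sets, block // self.sets   # (set index, tag)

    def fill(self, tile: int, block: int) -> None:
        set_index, tag = self._locate(block)
        entries = self.tags[tile][set_index]
        if tag not in entries:
            if len(entries) == self.ways:
                entries.pop(0)          # assumed FIFO eviction in the tracked cache
            entries.append(tag)

    def sharers(self, block: int):
        # Sharers are found by comparing the tag against every tile's copy.
        set_index, tag = self._locate(block)
        return [t for t, cache in enumerate(self.tags) if tag in cache[set_index]]


dup = DuplicateTagDirectory(num_tiles=4, sets_per_cache=8, ways=2)
dup.fill(0, 17)
dup.fill(2, 17)
dup.fill(2, 25)
```

Each directory entry here is a single tag, so its width does not depend on the number of tiles; the cost moves to the lookup, which has to compare against one tag set per tile.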
