Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John Carter
University of Utah

• CMPs are ubiquitous
• Require coherence among multiple cores
• Coherence operations entail frequent communication
• Messages have different latency and bandwidth needs
• Heterogeneous wires: 11% better performance, 22.5% lower wire power

Motivation: Coherence Traffic
[Figure: cores C1, C2 and C3, each with an L1 cache, above a shared L2. Message labels: Read Req, Ex Req, Fwd to owner, Inval, Inv Ack, Data. The legend distinguishes messages related to a read miss from messages related to a write miss.]

Exclusive request for a shared copy
• Rd-Ex request from processor 1
• Directory sends clean copy to processor 1
• Directory sends invalidate message to processor 2
• Cache 2 sends acknowledgement back to processor 1
[Figure: Processor 1 / Cache 1, Processor 2 / Cache 2, and L2 & Directory, with arrows numbered 1 to 4 and paths labeled Critical and Non-Critical.]

Wire Characteristics • Wire Resistance and capacitance per unit length

Design Space Exploration
• Tuning wire width and spacing
• Base case: B wires
• L wires: fast but low bandwidth
• (Width & Spacing) ↑ ⇒ Delay ↓, Bandwidth ↓

Design Space Exploration
• Tuning repeater size and spacing
• Traditional wires: large repeaters, optimum spacing
• Power-optimal wires: smaller repeaters, increased spacing
[Figure: axes labeled Delay and Power.]

Design Space Exploration
• Base case B wires (8x plane): latency 1x, power 1x, area 1x
• Base case W wires (4x plane): latency 1.6x, power 0.9x, area 0.5x
• Power-optimized PW wires (4x plane): latency 3.2x, power 0.3x, area 0.5x
• Fast, low-bandwidth L wires (8x plane): latency 0.5x, power 0.5x, area 4x

Outline • Overview • Wire Design Space Exploration • Protocol-dependent Techniques • Protocol-independent Techniques • Results • Conclusions

Directory Based Protocol (Write-Invalidate)
• Map critical/small messages on L wires and non-critical messages on PW wires
• Read exclusive request for block in shared state
• Read request for block in exclusive state
• Negative Ack (NACK) messages
[Callout: exploit hop imbalance]

Read to an Exclusive Block
[Figure: Proc 1 / L1, Proc 2 / L1, and L2 & Directory. Message labels: Read Req, Req, Fwd Dirty Copy (critical), Spec Reply (non-critical), WB Data (non-critical), ACK.]

NACK Messages
• NACK: negative acknowledgement generated when the directory state is busy
• Can employ MSHR id of the request instead of full address
• Directory load is low
  • Requests can be served at next try
  • Sending NACK on L-wires can improve performance
• Directory load is high
  • Frequent back-off and retry cycles
  • Sending NACK on PW-wires can reduce power consumption
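The load-dependent NACK mapping above could be sketched as follows. This is an illustration only: the sliding window, the 0.5 threshold, and the class and method names are assumptions made for the example and do not come from the slides.

```python
from collections import deque


class NackWireSelector:
    """Choose a wire class for NACKs from recent directory behaviour."""

    def __init__(self, window=64, busy_threshold=0.5):
        # Both parameters are illustrative assumptions, not values from the talk.
        self.outcomes = deque(maxlen=window)  # True = request found the directory busy
        self.busy_threshold = busy_threshold

    def record(self, was_busy):
        """Remember whether the latest request hit a busy directory entry."""
        self.outcomes.append(bool(was_busy))

    def busy_fraction(self):
        """Fraction of recent requests that found the directory busy."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def wire_for_nack(self):
        # Low load: a retry is likely to succeed, so deliver the NACK quickly (L).
        # High load: retries mostly back off again, so save power instead (PW).
        return "PW" if self.busy_fraction() > self.busy_threshold else "L"
```

A real controller would track load in hardware counters; the deque here only stands in for that bookkeeping.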

Snoop Bus Based Protocol
• Similar to bus-based SMP system
• Signal wires and voting wires
• Signal wires
  • To find the state of the block
• Voting wires
  • To vote for owner of the shared data

Protocol-Independent Techniques
• Narrow bit-width operands for synchronization variables
  • Locks and barriers use small integers
• Writeback data to PW-wires
  • Writeback messages are rarely on the critical path
• Narrow messages to L-wires
  • Only contain src, dst, operand and MSHR_id
  • For example: reply for upgrade message
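The narrow-operand check can be illustrated with a small sketch. The 8-bit operand budget and the function names are hypothetical; the slides do not say how many bits of an L-wire message are left for the operand.

```python
# Hypothetical operand budget for an L-wire message (assumption, not from the slides).
L_WIRE_OPERAND_BITS = 8


def operand_width_bits(value):
    """Minimum number of bits needed for a non-negative integer (at least 1)."""
    if value < 0:
        raise ValueError("sketch handles non-negative operands only")
    return max(1, value.bit_length())


def is_narrow(value, budget=L_WIRE_OPERAND_BITS):
    """True if the operand fits in the assumed L-wire operand field."""
    return operand_width_bits(value) <= budget
```

A lock value of 0 or 1 needs a single bit and qualifies as narrow, whereas a full pointer-sized operand would not.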

Implementation Complexity
• Heterogeneous interconnect incurs additional complexity
  • Cache coherence protocols
    • Robust enough to handle message re-ordering
  • Decision process
  • Interconnect implementation

Complexity in the Decision Process
• In the directory based system
  • Optimizations that exploit hop imbalance
    • Check directory state
  • Dynamic mapping of NACK messages
    • Track directory load
  • Narrow messages
    • Compute the width of an operand

Overhead in Interconnect Implementation
• Additional multiplexing/de-multiplexing at sender and receiver side
• Additional latches required for power-optimized wires
  • Power savings in PW-wires go down by 5%
• Wire area overhead
  • Zero: equal metal area for base and heterogeneous case

Router Complexity: Base Model
[Figure labels: Physical Channel 1, VC 1, VC 2, Crossbar, Out 1, Out 2.]

Router Complexity
• Each physical channel is split into 3 channels (L, PW & B)
[Figure: crossbar with outputs Out 1 and Out 2; each link carries L, PW and B channels. Width labels: L 24 bits, PW 64 bytes, B 32 bytes.]
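As a rough check on metal area, the channel widths on this slide can be combined with the relative per-wire areas from the design space slide (B 1x, PW 0.5x, L 4x). The sketch below assumes the width labels read as L = 24 bits, PW = 64 bytes and B = 32 bytes; the base link width is not stated on the slides, so only the heterogeneous side is computed.

```python
# Relative area per wire, taken from the design space slide (B wire = 1.0).
AREA_PER_WIRE = {"B": 1.0, "PW": 0.5, "L": 4.0}

# Wires per link, assuming the figure labels map to L, PW and B in this way.
LINK_WIRES = {"L": 24, "PW": 64 * 8, "B": 32 * 8}


def b_wire_equivalents(link):
    """Total metal area of a link, expressed in units of one B wire."""
    return sum(AREA_PER_WIRE[kind] * count for kind, count in link.items())


if __name__ == "__main__":
    # 24 * 4.0 + 512 * 0.5 + 256 * 1.0
    print(b_wire_equivalents(LINK_WIRES))
```

Under these assumptions the heterogeneous link occupies the area of 608 B wires, the figure to compare against a homogeneous B-wire link when checking the equal-area claim.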


Evaluation Platform & Simulation Methodology
• Virtutech Simics Simulator
  • Sixteen-core CMP
• Ruby timing model (GEMS)
  • NUCA cache architecture
  • MOESI directory protocol
• Benchmarks
  • SPLASH2
• Opal timing model (GEMS)
  • Out-of-order processor
  • Multiple outstanding requests
[Figure labels: Processor, L2.]

Wire Model
[Figure: wire RC model with capacitance labels Cside-wall and Cadj. Ref: Banerjee et al.]
• 65nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X and 8X planes

Heterogeneous Interconnects
• B-wires
  • Requests carrying an address
  • Responses that are on the critical path
• L-wires (latency optimized)
  • Narrow messages
  • Unblock & write-control messages
  • NACK
• PW-wires (power optimized)
  • Writeback data
  • Response to read request for an exclusive block
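The static mapping listed above can be written down as a lookup table. The enum and function names below are invented for illustration; only the message-to-wire assignments come from the slide.

```python
from enum import Enum


class Wire(Enum):
    B = "baseline"
    L = "latency optimized"
    PW = "power optimized"


class Msg(Enum):
    # Message categories as named on the slide; identifiers are hypothetical.
    REQUEST_WITH_ADDRESS = 1
    CRITICAL_RESPONSE = 2
    NARROW = 3
    UNBLOCK = 4
    WRITE_CONTROL = 5
    NACK = 6
    WRITEBACK_DATA = 7
    READ_REPLY_FOR_EXCLUSIVE_BLOCK = 8


WIRE_FOR = {
    Msg.REQUEST_WITH_ADDRESS: Wire.B,
    Msg.CRITICAL_RESPONSE: Wire.B,
    Msg.NARROW: Wire.L,
    Msg.UNBLOCK: Wire.L,
    Msg.WRITE_CONTROL: Wire.L,
    Msg.NACK: Wire.L,
    Msg.WRITEBACK_DATA: Wire.PW,
    Msg.READ_REPLY_FOR_EXCLUSIVE_BLOCK: Wire.PW,
}


def select_wire(msg):
    """Return the wire class the slide assigns to this message category."""
    return WIRE_FOR[msg]
```

For example, `select_wire(Msg.WRITEBACK_DATA)` returns `Wire.PW`. A table like this only captures the static assignment; the load-dependent NACK policy from the earlier slide would override the `Msg.NACK` entry at run time.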

Performance Improvements
• Average improvement: 11%

Percentage of Critical/Noncritical Messages
• PW wire traffic: 13%
• L wire traffic: 40%
• Performance: 11%
• Power saving in wires: 22.5%

Power Savings in Wires

L-Message Distribution
[Figure: L-wire messages broken down into Narrow Msgs, Unblock & Ctrl, and Hop Imbalance.]

Sensitivity Analysis
• Impact of out-of-order core
  • Average speedup 9.3%
  • Partial simulation (only 100M instructions)
  • OOO core is more tolerant to long latency operations
• Link bandwidth & routing algorithm
  • Benchmarks with high link utilization are very sensitive to bandwidth change
  • Deterministic routing incurs 3% performance loss compared to adaptive routing

Conclusions • Coherence messages have diverse needs • Intelligent mapping of messages to heterogeneous wires can improve performance and power • Low bandwidth, high speed links improve performance by 11% for SPLASH benchmarks • Non-critical traffic on power optimized network decreases wire power by 22.5%
