Interconnect-Aware Coherence Protocols for Chip Multiprocessors - PowerPoint PPT Presentation

Presentation Transcript

  1. Interconnect-Aware Coherence Protocols for Chip Multiprocessors
     Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John Carter
     University of Utah

  2. Motivation: Coherence Traffic
     • CMPs are ubiquitous and require coherence among multiple cores
     • Coherence operations entail frequent communication
     • Messages have different latency and bandwidth needs
     • Heterogeneous wires: 11% better performance, 22.5% lower wire power
     [Figure: cores C1, C2 and C3, each with an L1 cache, above a shared L2. Message labels: Read Req, Fwd to owner, Data, Ex Req, Inval, Inv Ack. Legend: messages related to read miss, messages related to write miss.]

  3. Exclusive request for a shared copy
     • Step 1: Rd-Ex request from processor 1
     • Step 2: Directory sends clean copy to processor 1
     • Step 3: Directory sends invalidate message to processor 2
     • Step 4: Cache 2 sends acknowledgement back to processor 1
     [Figure: Processor 1 with Cache 1 and Processor 2 with Cache 2 above the L2 & Directory; arrows numbered 1 to 4, labeled Critical and Non-Critical.]
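The hop imbalance in this transaction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' model: the per-hop cycle counts in `HOP_CYCLES` are hypothetical values chosen only to show the effect, and equal-length hops are assumed. The data reply (step 2) reaches the requester after two hops, while the acknowledgement (steps 3 and 4) needs three, so the data reply has slack.

```python
# Hypothetical per-hop latencies in cycles for each wire class.
# These numbers are illustrative assumptions, not taken from the slides.
HOP_CYCLES = {"L": 1, "B": 2, "PW": 3}


def arrival_times(hop_cycles, data_wire="B", inv_wire="B", ack_wire="B", req_wire="B"):
    """Return (data_arrival, ack_arrival) at the requesting processor."""
    request = hop_cycles[req_wire]                      # step 1: Rd-Ex request
    data = request + hop_cycles[data_wire]              # step 2: clean copy
    ack = request + hop_cycles[inv_wire] + hop_cycles[ack_wire]  # steps 3 and 4
    return data, ack


# Everything on baseline B wires: the ack arrives last, so it is critical.
base_data, base_ack = arrival_times(HOP_CYCLES)
slack = base_ack - base_data

# Moving only the data reply to slower PW wires stays within the slack here.
pw_data, pw_ack = arrival_times(HOP_CYCLES, data_wire="PW")
completion_unchanged = max(pw_data, pw_ack) == max(base_data, base_ack)
```

Whether the data reply really stays off the critical path depends on the actual hop distances and wire latencies, which this sketch does not model.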

  4. Wire Characteristics • Wire Resistance and capacitance per unit length

  5. Design Space Exploration
     • Tuning wire width and spacing
     • Increasing width and spacing lowers delay but also lowers bandwidth
     • Base case: B wires
     • Fast but low bandwidth: L wires

  6. Design Space Exploration
     • Tuning repeater size and spacing
     • Traditional wires: large repeaters, optimum spacing
     • Power-optimal wires: smaller repeaters, increased spacing
     [Figure axes: Delay, Power]

  7. Design Space Exploration
     • B wires (base case, 8x plane): latency 1x, power 1x, area 1x
     • W wires (base case, 4x plane): latency 1.6x, power 0.9x, area 0.5x
     • PW wires (power optimized, 4x plane): latency 3.2x, power 0.3x, area 0.5x
     • L wires (fast, low bandwidth, 8x plane): latency 0.5x, power 0.5x, area 4x
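The relative figures on this slide can be captured as a small lookup table. The sketch below is illustrative only: the `WireClass` type and both helper functions are hypothetical names introduced here, and the only data used are the relative latency, power and area values listed on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireClass:
    name: str
    plane: str
    latency: float  # relative to B wires
    power: float    # relative to B wires
    area: float     # relative to B wires, per wire


WIRES = [
    WireClass("B", "8x", 1.0, 1.0, 1.0),
    WireClass("W", "4x", 1.6, 0.9, 0.5),
    WireClass("PW", "4x", 3.2, 0.3, 0.5),
    WireClass("L", "8x", 0.5, 0.5, 4.0),
]
BY_NAME = {w.name: w for w in WIRES}


def wires_in_budget(area_budget, wire):
    """How many wires of this class fit in the area of `area_budget` B wires."""
    return int(area_budget // wire.area)


def lowest_power_within(max_latency):
    """Lowest-power wire class whose relative latency meets the bound, or None."""
    candidates = [w for w in WIRES if w.latency <= max_latency]
    return min(candidates, key=lambda w: w.power) if candidates else None


# The area of 64 B wires holds far fewer L wires than PW wires.
l_count = wires_in_budget(64, BY_NAME["L"])
pw_count = wires_in_budget(64, BY_NAME["PW"])
```

The area column is what makes L wires "low bandwidth": each one costs four times the metal area of a B wire, so fewer of them fit in a link.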

  8. Outline • Overview • Wire Design Space Exploration • Protocol-dependent Techniques • Protocol-independent Techniques • Results • Conclusions

  9. Directory Based Protocol (Write-Invalidate)
     • Map critical/small messages on L wires and non-critical messages on PW wires
     • Exploit hop imbalance
       • Read exclusive request for block in shared state
       • Read request for block in exclusive state
     • Negative Ack (NACK) messages

  10. Read to an Exclusive Block
      [Figure: Proc 1 L1 and Proc 2 L1 above the L2 & Directory. Message labels: Read Req, Req, Spec Reply (non-critical), Fwd Dirty Copy (critical), WB Data (non-critical), ACK.]

  11. NACK Messages
      • NACK: Negative Acknowledgement generated when directory state is busy
      • Can employ MSHR id of the request instead of full address
      • Directory load is low
        • Requests can be served at next try
        • Sending NACK on L-Wires can improve performance
      • Directory load is high
        • Frequent back off and retry cycles
        • Sending NACK on PW-Wires can reduce power consumption
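The load-dependent choice described on this slide could be sketched as follows. This is an assumption-laden illustration: the slides do not say how directory load is measured, so the sliding-window tracker, its window size and the 0.5 threshold are all hypothetical.

```python
from collections import deque


class DirectoryLoadTracker:
    """Sliding-window estimate of how often the directory was found busy."""

    def __init__(self, window=64):  # window size is a hypothetical choice
        self.samples = deque(maxlen=window)

    def record(self, was_busy):
        self.samples.append(bool(was_busy))

    def load(self):
        # Fraction of recent requests that hit a busy directory entry.
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def nack_wire(load, high_load_threshold=0.5):  # threshold is hypothetical
    """Low load: retry soon, so send the NACK on fast L wires.
    High load: retries will back off anyway, so save power on PW wires."""
    return "PW" if load >= high_load_threshold else "L"


tracker = DirectoryLoadTracker(window=4)
for busy in (False, False, True, False):
    tracker.record(busy)
low_choice = nack_wire(tracker.load())   # load is 0.25

for busy in (True, True, True, True):
    tracker.record(busy)
high_choice = nack_wire(tracker.load())  # window now holds only busy samples
```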

  12. Snoop Bus Based Protocol
      • Similar to bus-based SMP system
      • Signal wires and voting wires
        • Signal wires: to find the state of the block
        • Voting wires: to vote for owner of the shared data

  13. Protocol-Independent Techniques
      • Narrow bit-width operands for synchronization variables
        • Locks and barriers use small integers
      • Writeback data to PW-wires
        • Writeback messages are rarely on the critical path
      • Narrow messages to L-wires
        • Only contain src, dst, operand and MSHR_id
        • For example: reply for upgrade message
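A narrow-message check along these lines might look like the sketch below. The 24-bit L channel width and the sixteen-core system come from later slides in this deck; the 4-bit MSHR id and the exact header layout are assumptions made only for illustration.

```python
L_CHANNEL_BITS = 24  # L channel width from the router slide
CORE_ID_BITS = 4     # enough to name one of sixteen cores
MSHR_ID_BITS = 4     # hypothetical: assumes at most 16 MSHR entries


def operand_width(value):
    """Number of bits needed to represent a non-negative operand."""
    if value < 0:
        raise ValueError("operand must be non-negative")
    return max(1, value.bit_length())


def fits_on_l_wires(operand):
    """True if src, dst, MSHR id and operand fit in one L-channel transfer."""
    header_bits = 2 * CORE_ID_BITS + MSHR_ID_BITS  # src + dst + MSHR_id
    return header_bits + operand_width(operand) <= L_CHANNEL_BITS


# A lock value of 1 is narrow; a 13-bit operand no longer fits.
lock_fits = fits_on_l_wires(1)
wide_fits = fits_on_l_wires(4096)
```

Under these assumed field widths, 12 bits remain for the operand, which is ample for the small integers used by locks and barriers.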

  14. Implementation Complexity
      • Heterogeneous interconnect incurs additional complexity
        • Cache coherence protocols: robust enough to handle message re-ordering
        • Decision process
        • Interconnect implementation

  15. Complexity in the Decision Process
      • In the directory based system
        • Optimizations that exploit hop imbalance: check directory state
        • Dynamic mapping of NACK messages: track directory load
        • Narrow messages: compute the width of an operand

  16. Overhead in Interconnect Implementation
      • Additional multiplexing/de-multiplexing at sender and receiver side
      • Additional latches required for power optimized wires
        • Power savings in PW-Wires go down by 5%
      • Wire area overhead
        • Zero: equal metal area for base and heterogeneous case

  17. Router Complexity: Base Model
      [Figure: Physical Channel 1 feeding virtual channels VC 1 and VC 2 into a crossbar with outputs Out 1 and Out 2.]

  18. Router Complexity
      • Each physical channel is split into 3 channels (L, PW & B)
        • L: 24 bits
        • PW: 64 bytes
        • B: 32 bytes
      [Figure: L, PW and B channels feeding a crossbar with outputs Out 1 and Out 2.]

  19. Outline • Overview • Wire Design Space Exploration • Protocol-dependent Techniques • Protocol-independent Techniques • Results • Conclusions

  20. Evaluation Platform & Simulation Methodology
      • Virtutech Simics Simulator
        • Sixteen-core CMP
      • Ruby timing model (GEMS)
        • NUCA cache architecture
        • MOESI directory protocol
      • Opal timing model (GEMS)
        • Out-of-order processor
        • Multiple outstanding requests
      • Benchmarks
        • SPLASH2
      [Figure labels: Processor, L2]

  21. Wire Model
      • 65nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X and 8X planes
      • Ref: Banerjee et al.
      [Figure: wire RC model showing side-wall capacitance (Cside-wall) and adjacent-wire capacitance (Cadj).]

  22. Heterogeneous Interconnects
      • B-Wires
        • Requests carrying an address
        • Responses that are on the critical path
      • L-Wires (latency optimized)
        • Narrow messages
        • Unblock & write-control messages
        • NACK
      • PW-Wires (power optimized)
        • Writeback data
        • Response to read request for an exclusive block
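The mapping on this slide amounts to a static lookup from message type to wire class. A minimal sketch follows; the `Msg` enum and its member names are hypothetical labels for the message categories listed above, not identifiers from the simulator.

```python
from enum import Enum, auto


class Msg(Enum):
    ADDRESS_REQUEST = auto()       # request carrying an address
    CRITICAL_RESPONSE = auto()     # response on the critical path
    NARROW = auto()                # narrow message
    UNBLOCK = auto()
    WRITE_CONTROL = auto()
    NACK = auto()
    WRITEBACK_DATA = auto()
    SPEC_REPLY_EXCLUSIVE = auto()  # response to read request for an exclusive block


WIRE_FOR = {
    Msg.ADDRESS_REQUEST: "B",
    Msg.CRITICAL_RESPONSE: "B",
    Msg.NARROW: "L",
    Msg.UNBLOCK: "L",
    Msg.WRITE_CONTROL: "L",
    Msg.NACK: "L",
    Msg.WRITEBACK_DATA: "PW",
    Msg.SPEC_REPLY_EXCLUSIVE: "PW",
}


def select_wire(msg):
    """Return the wire class ("B", "L" or "PW") for a message type."""
    return WIRE_FOR[msg]


# Group message types by wire class to mirror the slide's three lists.
by_wire = {}
for msg, wire in WIRE_FOR.items():
    by_wire.setdefault(wire, []).append(msg.name)
```

A table keeps the policy in one place; the dynamic NACK mapping from the earlier slide would replace the fixed `Msg.NACK` entry with a load-dependent choice.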

  23. Performance Improvements Average improvement 11%

  24. Percentage of Critical/Noncritical Messages
      • PW wire traffic: 13%
      • L wire traffic: 40%
      • Performance: 11%
      • Power saving in wires: 22.5%

  25. Power Savings in Wires

  26. L-Message Distribution
      [Figure categories: Narrow Msgs, Unblock & Ctrl, Hop Imbalance]

  27. Sensitivity Analysis
      • Impact of out-of-order core
        • Average speedup 9.3%
        • Partial simulation (only 100M instructions)
        • OOO core is more tolerant to long latency operations
      • Link bandwidth & routing algorithm
        • Benchmarks with high link utilization are very sensitive to bandwidth change
        • Deterministic routing incurs 3% performance loss compared to adaptive routing

  28. Conclusions • Coherence messages have diverse needs • Intelligent mapping of messages to heterogeneous wires can improve performance and power • Low bandwidth, high speed links improve performance by 11% for SPLASH benchmarks • Non-critical traffic on power optimized network decreases wire power by 22.5%