Improving Multiple-CMP Systems with Token Coherence

Improving Multiple-CMP Systems with Token Coherence. Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan Hu², Milo Martin³, and David Wood¹. ¹University of Wisconsin-Madison, ²University of British Columbia, ³University of Pennsylvania. Thanks to Intel, NSERC, NSF, and Sun.


Presentation Transcript


  1. Improving Multiple-CMP Systems with Token Coherence • Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan Hu², Milo Martin³, and David Wood¹ • ¹University of Wisconsin-Madison, ²University of British Columbia, ³University of Pennsylvania • Thanks to Intel, NSERC, NSF, and Sun

  2. Summary • Microprocessor → Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) → Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Protocol Complex & Slow • New Solution: Apply Token Coherence • Developed for glueless multiprocessor [ISCA 2003] • Keep: Flat for Correctness • Exploit: Hierarchical for performance • Less Complex & Faster than Hierarchical Directory

  3. Outline • Motivation and Background • Coherence in Multiple-CMP Systems • Example: DirectoryCMP • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation

  4. Coherence in Multiple-CMP Systems • Chip Multiprocessors (CMPs) emerging • Larger systems will be built with Multiple CMPs [Diagram: one CMP with four processors (P), each with L1 I and D caches, an on-chip interconnect, and four L2 banks; four such chips, CMP 1 to CMP 4, joined by an interconnect]

  5. Problem: Hierarchical Coherence • Intra-CMP protocol for coherence within CMP • Inter-CMP protocol for coherence between CMPs • Interactions between protocols increase complexity • explodes state space [Diagram: CMP 1 to CMP 4 joined by an interconnect; labels: Intra-CMP Coherence, Inter-CMP Coherence]

  6. Improving Multiple CMP Systems with Token Coherence • Token Coherence allows Multiple-CMP systems to be... • Flat for correctness, but • Hierarchical for performance [Diagram: CMP 1 to CMP 4 joined by an interconnect; labels: Low Complexity, Fast, Correctness Substrate, Performance Protocol]

  7. Example: DirectoryCMP • 2-level MOESI Directory • RACE CONDITIONS! [Diagram: CMP 0 (P0 to P3) and CMP 1 (P4 to P7), each processor with L1 I&D caches above a Shared L2 / directory, backed by Memory/Directory; two Store B operations trigger getx, fwd, inv, ack, data/ack, and WB messages; directory entries shown as B: [M I] and B: [S O]]

  8. Outline • Motivation and Background • Token Coherence: Flat for Correctness • Safety • Starvation Avoidance • Token Coherence: Hierarchical for Performance • Evaluation

  9. Example: Token Coherence [ISCA 2003] • Each memory block initialized with T tokens • Tokens stored in memory, caches, & messages • At least one token to read a block • All tokens to write a block [Diagram: P0 to P3, each with L1 I&D and L2 caches, joined by an interconnect to memories labeled mem 0 and mem 3; processors issue Load B and Store B]
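The token rules on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the class `TokenHolder`, the helper `transfer`, and the value `T = 4` are assumptions made for the example.

```python
from dataclasses import dataclass

T = 4  # total tokens per block; an assumed value for illustration


@dataclass
class TokenHolder:
    """Anything that can hold tokens: a cache, memory, or an in-flight message."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # Reading a block requires at least one token.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # Writing a block requires all T tokens.
        return self.tokens == T


def transfer(src: TokenHolder, dst: TokenHolder, count: int) -> None:
    """Move tokens between holders; tokens are never created or destroyed."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot send tokens the sender does not hold")
    src.tokens -= count
    dst.tokens += count


memory = TokenHolder("mem 0", tokens=T)  # the block starts with all T tokens
p0 = TokenHolder("P0 L1")
p1 = TokenHolder("P1 L1")

transfer(memory, p0, 1)      # P0 loads B: one token is enough to read
transfer(memory, p1, T - 1)  # P1 wants to store B but is one token short
assert p0.can_read() and not p1.can_write()

transfer(p0, p1, 1)          # P0 gives up its token, losing read permission
assert p1.can_write() and not p0.can_read()
assert memory.tokens + p0.tokens + p1.tokens == T  # token count is conserved
```

Because a writer must hold every token, no reader can hold one at the same time, which is the safety property the counting rules are meant to give.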

  10. Extending to Multiple-CMP System [Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3), each processor with L1 I&D caches; the per-processor L2s give way to an on-chip interconnect and a Shared L2 per CMP; mem 0 and mem 1 joined by an interconnect]

  11. Extending to Multiple-CMP System • Token counting remains flat • Tokens to caches • Handles shared caches and other complex hierarchies [Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with L1 I&D caches, an on-chip interconnect, and a Shared L2; mem 0 and mem 1 on the inter-CMP interconnect; two processors issue Store B]

  12. Starvation Avoidance • Tokens move freely in the system • Transient requests can miss in-flight tokens • Incorrect speculation, filters, prediction, etc [Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with Shared L2s, mem 0, and mem 1; three processors issue Store B and send GETX requests]

  13. Starvation Avoidance • Solution: issue Persistent Request • Heavyweight request guaranteed to succeed • Methods: Centralized [2003] and Distributed (New) [Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with Shared L2s, mem 0, and mem 1; three processors issue Store B]

  14. Old Scheme: Central Arbiter [2003] • Processors issue persistent requests [Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3); three processors issuing Store B time out and send persistent requests to arbiter 0, which holds entries B: P0, B: P2, B: P1]

  15. Old Scheme: Central Arbiter [2003] • Processors issue persistent requests • Arbiter orders and broadcasts activate [Diagram: arbiter 0 broadcasts the activation B: P0 to every L1, Shared L2, and memory; B: P2 and B: P1 remain at the arbiter]

  16. Old Scheme: Central Arbiter [2003] • Processor sends deactivate to arbiter • Arbiter broadcasts deactivate (and next activate) • Bottom Line: handoff is 3 message latencies [Diagram: steps numbered 1 to 3 trace the handoff as the entry at every L1, Shared L2, and memory changes from B: P0 to B: P2; B: P1 remains at arbiter 0]

  17. Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests [Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3); three processors issue Store B; every L1, Shared L2, and memory holds a table with entries P0: B, P1: B, P2: B]

  18. Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests • Fixed priority (processor number) [Diagram: every table lists P0: B, P1: B, P2: B, with P0: B shown as the active entry]

  19. Improved Scheme: Distributed Arbitration [NEW] • Processors broadcast persistent requests • Fixed priority (processor number) • Processors broadcast deactivate [Diagram: the deactivate broadcast, labeled step 1, clears P0: B; P1: B is shown as the active entry in every table, with P2: B still pending]

  20. Improved Scheme: Distributed Arbitration [NEW] • Bottom line: Handoff is a single message latency • Subtle point: P0 and P1 must wait until next “wave” [Diagram: every table now holds P1: B as the active entry and P2: B pending]
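The distributed scheme on slides 17 to 20 can be sketched as one table per node plus a fixed-priority rule. The sketch below is an illustration under simplifying assumptions, not the authors' design: `PersistentTable` and `broadcast` are hypothetical names, broadcasts are delivered to every node instantly and in the same order, and the "wave" rule from this slide is not modeled.

```python
from typing import Optional


class PersistentTable:
    """One node's local copy of the persistent request table (hypothetical)."""

    def __init__(self) -> None:
        # Maps a block address to the processor ids with a pending request.
        self.requests: dict[str, set[int]] = {}

    def on_request(self, proc: int, block: str) -> None:
        self.requests.setdefault(block, set()).add(proc)

    def on_deactivate(self, proc: int, block: str) -> None:
        self.requests.get(block, set()).discard(proc)

    def active(self, block: str) -> Optional[int]:
        """Fixed priority: the lowest processor number is the active request."""
        pending = self.requests.get(block)
        return min(pending) if pending else None


# Four nodes, each with its own table (caches and memories alike).
nodes = [PersistentTable() for _ in range(4)]


def broadcast(method: str, proc: int, block: str) -> None:
    """Deliver one message to every node; assumed instantaneous here."""
    for table in nodes:
        getattr(table, method)(proc, block)


for proc in (2, 0, 1):  # arrival order does not affect the outcome
    broadcast("on_request", proc, "B")
assert all(t.active("B") == 0 for t in nodes)

# P0 finishes and broadcasts deactivate: one message, and every node
# locally promotes P1 without consulting an arbiter.
broadcast("on_deactivate", 0, "B")
assert all(t.active("B") == 1 for t in nodes)
```

Since every node applies the same priority rule to the same table contents, the next requester becomes active as soon as the deactivate message arrives, which is where the single-message handoff comes from.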

  21. Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation

  22. Hierarchical for Performance: TokenCMP • Target System: • 2-8 CMPs • Private L1s, shared L2 per CMP • Any interconnect, but high-bandwidth • Performance Policy Goals: • Aggressively acquire tokens • Exploit on-chip locality and bandwidth • Respect cache hierarchy • Detecting and handling missed tokens

  23. Hierarchical for Performance: TokenCMP • Approach: • On L1 miss, broadcast within own CMP • Local cache responds if possible • On L2 miss, broadcast to other CMPs • Appropriate L2 bank responds or broadcasts within its CMP • Optionally filter • Responses between CMPs carry extra tokens for future locality • Handling missed tokens: • Timeout after average memory latency • Invoke persistent request (no retries) • Larger systems can use filters, multicast, soft-state directories
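The missed-token handling on this slide amounts to a two-step escalation, which can be sketched as follows. The function `acquire_tokens` and its two callbacks are hypothetical stand-ins: each callback is assumed to return the number of tokens that arrived, with the transient one returning whatever was collected before the timeout expired.

```python
from typing import Callable


def acquire_tokens(
    needed: int,
    send_transient: Callable[[], int],
    send_persistent: Callable[[], int],
) -> tuple[int, str]:
    """Return (tokens held, which request type completed the miss)."""
    # Step 1: one transient request; wait until the timeout
    # (the slide suggests roughly the average memory latency).
    held = send_transient()
    if held >= needed:
        return held, "transient"
    # Step 2: no retries. Escalate straight to a persistent request,
    # which the correctness substrate guarantees will succeed.
    held += send_persistent()
    return held, "persistent"


T = 4  # assumed token count for a store, for illustration only

# Common case: the transient request finds all tokens.
print(acquire_tokens(T, lambda: 4, lambda: 0))

# Racy case: only one token arrived before the timeout, so the
# persistent request collects the remaining three.
print(acquire_tokens(T, lambda: 1, lambda: 3))
```

Skipping retries keeps the performance policy simple: it only has to be fast in the common case, because the persistent request covers every case it misses.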

  24. Outline • Motivation and Background • Token Coherence: Flat for Correctness • Token Coherence: Hierarchical for Performance • Evaluation • Model checking • Performance w/ commercial workloads • Robustness

  25. TokenCMP Evaluation • Simple? • Model checking • Fast? • Full-system simulation w/ commercial workloads • Robust? • Micro-benchmarks to simulate high contention

  26. Complexity Evaluation with Model Checking • Methods: • TLA+ and TLC • DirectoryCMP omits all intra-CMP details • TokenCMP’s correctness substrate modeled • Result: • Complexity similar between TokenCMP and non-hierarchical DirectoryCMP • Correctness Substrate verified to be correct and deadlock-free • Small configuration, varied parameters • All possible performance protocols correct

  27. Performance Evaluation • Target System: • 4 CMPs, 4 procs/cmp • 2GHz OoO SPARC, 8MB shared L2 per chip • Directly connected interconnect • Methods: Multifacet GEMS simulator • Simics augmented with timing models • Released soon: http://www.cs.wisc.edu/gems • ISCA 2005 Tutorial! • Benchmarks: • Performance: Apache, Spec, OLTP • Robustness: Locking uBenchmark

  28. Full-system Simulation: Runtime • TokenCMP performs 9-50% faster than DirectoryCMP

  29. Full-system Simulation: Runtime • TokenCMP performs 9-50% faster than DirectoryCMP [Chart labels: DRAM Directory, Perfect L2]

  30. Full-system Simulation: Traffic • TokenCMP traffic is reasonable (or better) • DirectoryCMP control overhead greater than broadcast for small system

  31. Performance Robustness • Locking micro-benchmark (correctness substrate only) [Plot: axis runs from less contention to more contention]

  32. Performance Robustness • Locking micro-benchmark (correctness substrate only) [Plot: axis runs from less contention to more contention]

  33. Performance Robustness • Locking micro-benchmark [Plot: axis runs from less contention to more contention]

  34. Summary • Microprocessor → Chip Multiprocessor (CMP) • Symmetric Multiprocessor (SMP) → Multiple CMPs • Problem: Coherence with Multiple CMPs • Old Solution: Hierarchical Protocol Complex & Slow • New Solution: Apply Token Coherence • Developed for glueless multiprocessor [2003] • Keep: Flat for Correctness • Exploit: Hierarchical for performance • Less Complex & Faster than Hierarchical Directory

  35. Full-system Simulation: Traffic

  36. Full-system Simulation: Intra-CMP Traffic
