
The Power of Priority: NoC-based Distributed Cache Coherency


Presentation Transcript


  1. EE Department, Technion, Haifa, Israel. The Power of Priority: NoC-based Distributed Cache Coherency. Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny. QNoC Research Group, Technion.

  2. Chip Multi-Processor (CMP): from a dual-core with a monolithic shared cache to a multi-core with a large, shared, distributed cache. NoC-based: how?

  3. Future Cache - Physics Perspective • Global wire delay keeps growing relative to gate delay across technology nodes (figure: wire vs. gate delay, 250 nm down to 32 nm; source: ITRS 2003) • The distance reached in a single cycle shrinks: today ~25% of the chip, in 10 years ~1% (figure: fraction of chip reachable in 1 clock cycle; source: Keckler et al., ISSCC 2003) • Large cache → large access time • A large monolithic cache is not scalable

  4. NUCA - Non-Uniform Cache Architecture: a banked cache over a NoC • Smaller bank → smaller access time • Multiple banks → multiple ports • Closer bank → smaller access time • NUCA = non-uniform access times • Cache-line placement policy: Static NUCA (SNUCA) or Dynamic NUCA (DNUCA). Sources: Kim et al., ASPLOS 2002; Beckmann et al., MICRO 2004
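
As a rough illustration of static placement (not shown on the slide), the sketch below maps a cache-line address to a fixed home bank; the bank count, line size, and interleaving scheme are assumptions for illustration only.

    # Minimal sketch (illustrative assumptions): static NUCA (SNUCA) placement,
    # where a cache line's home bank is a fixed function of its address.
    NUM_BANKS = 16        # assumed 4x4 grid of cache banks on the NoC
    LINE_SIZE = 64        # assumed cache-line size in bytes

    def home_bank(address: int) -> int:
        """Return the bank that statically owns the given cache line."""
        line_index = address // LINE_SIZE
        return line_index % NUM_BANKS   # interleave consecutive lines across banks

    # Example: two consecutive lines land in different banks.
    print(home_bank(0x1000), home_bank(0x1040))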

  5. Issues in NUCA-based CMP • NoC performance → CMP performance • Cache coherency and transaction order (correctness) • Search (in DNUCA) • Different traffic types (e.g. fetch vs. prefetch) • Synchronization (locks). NoC services for CMP?

  6. Cache Coherency over NoC. How do we maintain coherency over the NoC? • Distributed directory • Snooping • Central directory (figure: cache bank with a distributed directory)
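
A minimal sketch of what a distributed directory entry might hold, assuming a simple MSI-style protocol; the states and fields are illustrative and are not taken from the slides.

    # Illustrative sketch (not the authors' protocol): each cache bank keeps a
    # directory entry per line it is home to, tracking state and sharers.
    from dataclasses import dataclass, field

    @dataclass
    class DirectoryEntry:
        state: str = "I"                            # assumed MSI-style states: I, S, M
        sharers: set = field(default_factory=set)   # core ids holding the line shared
        owner: int = -1                             # core holding the line modified, -1 if none

    # Each bank maps the line addresses it owns to their directory entries.
    directory = {}
    directory[0x1000] = DirectoryEntry(state="S", sharers={0, 3})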

  7. Distributed Cache Coherency. A single cache access → multiple NoC transactions. Example: a simple read transaction (figure; legend: ctrl. packet, data packet)
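
The slide shows the read only as a figure; the sketch below spells out one plausible message sequence for a simple read that finds the line clean at its home bank. The message names and fields are assumptions, not the authors' exact protocol.

    # Sketch of one plausible simple-read sequence over the NoC:
    # a short ctrl. packet carries the request, a long data packet the reply.
    def simple_read(requester: int, home_bank: int, address: int):
        # 1. Requester sends a short ctrl. packet (read request) to the home bank.
        request = {"type": "RD_REQ", "addr": address, "src": requester, "dst": home_bank}
        # 2. The home bank finds the line present and clean in its directory,
        #    so it replies directly with a long data packet.
        response = {"type": "RD_RESP", "addr": address, "src": home_bank,
                    "dst": requester, "payload": "<cache line>"}
        # 3. The requester receives the data and the transaction completes.
        return [request, response]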

  8. Read Transaction of a Modified Block (figure; legend: ctrl. packet, data packet)

  9. Read Exclusive of a Shared Block (figure; legend: ctrl. packet, data packet)

  10. Basic NoC to Support CMP. Off-the-shelf ("vanilla") NoC: • Grid of wormhole routers • Unicast only • Ordering in the network • Static routing • No virtual channels • Smart interfaces. Can we do better?

  11. Observations on L2 access: A) Delay = queueing + NoC transactions. B) All NoC transactions are equally important (each one sits on the access critical path). C) NoC transactions consist of short ctrl. packets and long data packets. Idea: differentiate between ctrl. and data. Solution: a preemptive priority NoC → give priority to short ctrl. packets.

  12. Preemptive Priority NoC: QNoC. Service levels (SL): • Dedicated wormhole buffer per SL • Preemptive priority scheduling (figures: multiple-SL router, multiple-SL link)
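
A rough sketch of the arbitration idea with two assumed service levels (ctrl. above data): a waiting ctrl. flit preempts an in-flight data worm at flit granularity, and the worm resumes once the ctrl. traffic drains. This is a simplification for illustration, not QNoC's actual router design.

    # Simplified sketch (not QNoC's implementation): per-link arbitration with
    # two service levels, ctrl. preempting data at flit granularity.
    from collections import deque

    class PriorityLink:
        def __init__(self):
            self.queues = {"ctrl": deque(), "data": deque()}   # one wormhole buffer per SL

        def enqueue(self, service_level: str, flit):
            self.queues[service_level].append(flit)

        def next_flit(self):
            """Send a ctrl. flit whenever one is waiting, else continue the data worm."""
            if self.queues["ctrl"]:            # preemptive priority to short ctrl. packets
                return self.queues["ctrl"].popleft()
            if self.queues["data"]:            # data worm resumes when ctrl. is drained
                return self.queues["data"].popleft()
            return None                        # link idle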

  13. Example: Vanilla NoC. Two transactions share a path from A to B: transaction 1 (blue) is a long data packet; transaction 2 (red) is a short request followed by a long response. Without contention, X = delay of a long packet and δ = delay of a short packet. Vanilla NoC: blue delay ~ X, red delay ~ 2X + δ, average delay ~ 1.5X.

  14. Example: Priority NoC (same two transactions and notation as slide 13). Vanilla NoC: blue delay = X, red delay = 2X + δ, average ~ 1.5X. Priority NoC: blue delay = X + δ, red delay = X + δ, average ~ X. Potential delay reduction ~ 0.5X.
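
To make the arithmetic of slides 13-14 explicit, the short calculation below plugs in arbitrary numbers for X and δ; the values are illustrative, and only δ being much smaller than X matters for the conclusion.

    # Worked check of the slide-13/14 delay arithmetic (illustrative numbers).
    X = 100.0       # delay of a long data packet
    delta = 5.0     # delay of a short ctrl. packet, assumed delta << X

    # Vanilla NoC: the short red request waits behind the blue data worm.
    vanilla_blue = X
    vanilla_red = 2 * X + delta
    vanilla_avg = (vanilla_blue + vanilla_red) / 2        # ~1.5X

    # Priority NoC: the short request preempts, delaying blue only by delta.
    priority_blue = X + delta
    priority_red = X + delta
    priority_avg = (priority_blue + priority_red) / 2     # ~X

    print(vanilla_avg, priority_avg, vanilla_avg - priority_avg)   # reduction ~0.5X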

  15. Priority NoC: Different Destinations • Priority is especially important in wormhole networks: a short ctrl. packet can be blocked by other worms even when it is headed to a different destination (figure: long data packet blocking a short request).

  16. Protocol Correctness Need state-preserving serialization of transactions in the processor interface
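
The slide does not detail the mechanism. One simple way to realize such serialization (an assumption for illustration, not necessarily the authors' design) is for the processor's network interface to hold back a new transaction to a cache line while an earlier transaction to the same line is still outstanding, as sketched below.

    # Illustrative sketch only: serialize transactions per cache line at the
    # processor interface by queueing new ones while an earlier one is outstanding.
    from collections import defaultdict, deque

    class ProcessorInterface:
        def __init__(self, send_to_noc):
            self.send_to_noc = send_to_noc
            self.pending = defaultdict(deque)        # line address -> waiting transactions

        def issue(self, line_addr, transaction):
            self.pending[line_addr].append(transaction)
            if len(self.pending[line_addr]) == 1:    # nothing outstanding for this line
                self.send_to_noc(transaction)

        def complete(self, line_addr):
            self.pending[line_addr].popleft()        # retire the finished transaction
            if self.pending[line_addr]:              # release the next one, in order
                self.send_to_noc(self.pending[line_addr][0])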

  17. Numerical Evaluation • CMP simulator (SIMICS) • Simulate parallel benchmarks • Obtain L2-cache access traces • QNoC simulator (OPNET) • Simulate distributed coherence protocol over NoC • Measure total RD/RX L2-access delay • Measure total program throughput

  18. Priority NoC: Results • Short ctrl. packets get high priority • Long data packets get low priority (figures: delay reduction vs. network load; RD delay, Apache; RD/RX delay reduction, Apache)

  19. Priority NoC: Several Benchmarks (figures: delay reduction and program speedup across benchmarks)

  20. So Far: The Power of Priority • Simplicity - Almost for Free • Significant CMP Speed-up • Good For: • Coherency • Traffic differentiation (e.g. Fetch vs. Pre-Fetch) • Search in DNUCA • Synchronization (Locks)

  21. Advanced Support Functions • Special broadcast for short messages: a broadcast service is needed (e.g., for search in DNUCA); wormhole broadcast is slow and expensive, so a store-and-forward (S&F) broadcast is embedded in the wormhole NoC • Virtual ring: no additional cost; for invalidation multicast, snooping, or synchronization

  22. Summary • NoC at CMP Service! • Shared cache over NoC • Priority is powerful • Built-in support functions
