
Lecture 21: Coherence and Interconnection Networks - Flexible Snooping and Forwarding

This lecture discusses coherence and interconnection networks in computer systems, with a focus on flexible snooping and forwarding. It explores different protocols and their pros and cons, as well as implementation and scalability issues. The lecture also covers newer approaches such as in-network cache coherence and virtual networks.




  1. Lecture 21: Coherence and Interconnection Networks • Papers: • Flexible Snooping: Adaptive Filtering and Forwarding in Embedded Ring Multiprocessors, UIUC, ISCA-06 • Coherence Ordering for Ring-Based Chip Multiprocessors, Wisconsin, MICRO-06 (very brief) • In-Network Cache Coherence, Princeton, MICRO-06

  2. Motivation • CMPs are the standard • cheaper to build medium size machines • 32 to 128 cores • shared memory, cache coherent • easier to program, easier to manage • cache coherence is a necessary evil

  3. Cache coherence solutions • snoopy broadcast bus: ordered network? yes; pros: simple; cons: difficult to scale • directory-based protocol: ordered network? no; pros: scalable; cons: indirection, extra hardware • snoopy embedded ring: ordered network? no; pros: simple; cons: long latencies • in-network directory: ordered network? no; pros: scalable; cons: complexity

  4. Ring Interconnect based snoop protocols • Investigated by Barroso et al. in early nineties • Why? • Short, fast point-to-point link • Fewer (data) ports • Less complex than packet-switched • Simple, distributed arbitration • Exploitable ordering for coherence

  5. Ring in action • [Figure: three rings contrasting Oracle, Lazy, and Eager snooping; R = requester, S = supplier; each node's supplier predictor decides whether to snoop an incoming request or just forward it, with data returned as the response] Courtesy: UIUC-ISCA06

  6. Ring in action • [Figure: Oracle, Lazy, and Eager compared on latency, number of snoops, and number of messages] • goal: adaptive schemes that approximate Oracle's behavior Courtesy: UIUC-ISCA06

  7. Primitive snooping actions • snoop and then forward: fewer messages • forward and then snoop: shorter latency • forward only: fewer snoops and shorter latency, but false-negative predictions are not allowed Courtesy: UIUC-ISCA06
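As a rough sketch (the function and flag names here are hypothetical, not from the paper), the choice among these primitive actions can be viewed as a per-node decision driven by the supplier predictor:

```python
def choose_action(predicts_supplier: bool, conservative: bool) -> str:
    """Pick a primitive snooping action for an incoming ring request.

    predicts_supplier: the node's predictor believes it can supply the line.
    conservative: prefer fewer messages over shorter latency.
    """
    if predicts_supplier:
        # Snooping first avoids forwarding requests this node will satisfy
        # (fewer messages); forwarding first hides the snoop latency.
        return "snoop then forward" if conservative else "forward then snoop"
    # Skipping the snoop saves both snoops and latency, but it is only
    # correct if the predictor never produces false negatives.
    return "forward only"
```

This is why the "forward only" action constrains the predictor design on the next slide: it must never miss a real supplier.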

  8. Predictor implementation • Subset • associative table: • subset of addresses that can be supplied by node • Superset • bloom filter: superset of addresses that can be supplied by node • associative table (exclude cache): • addresses that recently suffered false positives • Exact • associative table: all addresses that can be supplied by node • downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded
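A minimal sketch of the superset flavor using a Bloom filter (the sizes and hash construction below are illustrative assumptions, not the paper's design). A false positive only costs an extra snoop; a false negative cannot occur for inserted addresses:

```python
import hashlib

class BloomSupersetPredictor:
    """Tracks a superset of the addresses this node can supply."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.filter = [False] * bits

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address (illustrative).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, addr):
        # Called when this node becomes a potential supplier of the line.
        for p in self._positions(addr):
            self.filter[p] = True

    def may_supply(self, addr):
        # True for every inserted address (no false negatives); may also be
        # True for other addresses (false positives -> unnecessary snoops).
        return all(self.filter[p] for p in self._positions(addr))
```

Because a plain Bloom filter cannot delete entries, the slide's design pairs it with an associative "exclude cache" of addresses that recently suffered false positives.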

  9. Algorithms • Lazy, Eager, Subset, SupersetCon, SupersetAgg, Exact, Oracle • compared per miss service on: number of snoops, number of messages, snoop message latency Courtesy: UIUC-ISCA06

  10. Experiments • 8 CMPs, 4 out-of-order cores each = 32 cores • private L2 caches • on-chip bus interconnect • off-chip 2D torus interconnect with embedded unidirectional ring • per-node predictors: latency of 3 processor cycles • sesc simulator (sesc.sourceforge.net) • SPLASH-2, SPECjbb, SPECweb

  11. Most cost-effective algorithms • [Charts: normalized execution time and normalized energy consumption of Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, and Exact on SPLASH-2, SPECjbb, and SPECweb] • high performance: Superset Aggressive • faster than Eager at lower energy consumption • energy conscious: Superset Conservative • slightly slower than Eager at much lower energy consumption Courtesy: UIUC-ISCA06

  12. Issues • Implementation of ring interconnects • already available in commercial processors: IBM Power series & Cell, next-generation Intel processors • Limited scalability (medium-range systems) • Power dissipation • overhead due to router design (though trivial) • Conflicts & races • ring is not totally ordered • may lead to retries (possibly unbounded)

  13. Ordering points: avoiding retries • Statically employ one node as an ordering point • creates a total ordering of requests! • performance hit due to centralization • Ring-Order (Wisconsin, MICRO 2006) • combines ordering and eager forwarding • ordering for stability, eager forwarding for performance • inspired by token coherence • uses a priority token • breaks conflicts • provides ordering

  14. In-Network Cache Coherence • Traditional directory-based coherence • Decouples the protocol from the communication medium • Protocol optimizations for end-to-end communication • Network optimizations for reduced latency • Known problems due to technology scaling • Wire delays • Storage overhead of directories (80-core CMPs) • Expose the interconnection medium to the protocol • Directory = centralized serializing point • Distribute the directory within the network

  15. Current: Directory-Based Protocol • Home node (H) keeps a list (or partial list) of sharers • Read: round trip overhead • reads don't take advantage of locality • Write: up to two round trips of overhead • invalidations inefficient • [Figure: home node H and sharers on a mesh] Courtesy: Princeton-MICRO06

  16. New Approach: Virtual Network • Network is not just for getting from A to B • it can keep track of sharers and maintain coherence • How? With trees stored within routers • tree connecting home node with sharers • tree stored within routers • On the way to the home node, when requests "bump" into trees, routers re-route them towards sharers • [Figure: a request bumps into the tree and is added to it] Courtesy: Princeton-MICRO06

  17. Reads Locate Data Efficiently • Read request injected into the network • Tree constructed as read reply is sent back • New read injected into the network • Request is redirected by the network • Data is obtained from a sharer and returned • [Figure legend: home node H, sharer node, tree node (no data), read request/reply, write request/reply, teardown request, acknowledgement] Courtesy: Princeton-MICRO06
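The "bump into tree" redirection above can be sketched as a hop-by-hop walk (a simplification with hypothetical names; in hardware this decision happens in the router pipeline):

```python
def service_read(path_to_home, tree_routers):
    """Walk a read request toward the home node; if it passes through a
    router that belongs to the sharing tree, the network redirects it to a
    nearby sharer instead of completing the trip to the home node."""
    hops = 0
    for router in path_to_home:
        hops += 1
        if router in tree_routers:
            return ("redirected to sharer", hops)
    return ("served by home node", hops)
```

With no tree on the path, the request always reaches the home node, reproducing the baseline directory protocol's round trip.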

  18. Writes Invalidate Data Efficiently • Write request injected into the network • In-transit invalidations • Acknowledgements spawned at leaf nodes • Wait for acknowledgements • Send out write reply • [Figure: same legend as the read example] Courtesy: Princeton-MICRO06

  19. Router Micro-architecture • Router pipeline: routing, virtual channel arbitration, physical channel arbitration, crossbar traversal, link traversal • A tree cache (tag + data entry) is added to each router • [Figure: router pipeline with attached tree cache] Courtesy: Princeton-MICRO06

  20. Router Micro-architecture: Virtual Tree Cache • Each cache line in the virtual tree cache contains • 2 bits (To, From root node) for each direction (N, S, E, W) • 1 bit if the local node contains valid data (D) • 1 busy bit (B) • 1 outstanding-request bit (R) • [Figure: tree cache line layout: tag plus a data entry holding the per-direction T/F bits and the D, B, R bits] Courtesy: Princeton-MICRO06
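The per-line state adds up to 11 bits plus the tag. A small sketch of that layout (field names are mine, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class TreeCacheLine:
    """State bits of one virtual tree cache line (illustrative)."""
    # 2 bits per direction: does the tree run To / From the root this way?
    to_root: dict = field(default_factory=lambda: dict.fromkeys("NSEW", False))
    from_root: dict = field(default_factory=lambda: dict.fromkeys("NSEW", False))
    valid_data: bool = False   # D: the local node holds valid data
    busy: bool = False         # B: line is busy (e.g. being torn down)
    outstanding: bool = False  # R: a request for this line is in flight

    @staticmethod
    def num_state_bits():
        # 2 bits (To/From root) x 4 directions (N, S, E, W) + D + B + R
        return 2 * 4 + 3
```

Keeping this state fixed-size per router port, rather than proportional to the number of sharers, is what drives the storage-scalability result two slides later.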

  21. Methodology • Benchmarks: SPLASH-2 • Multithreaded simulator: Bochs running 16- or 64-way • Network and virtual tree simulator • trace-driven: full-system simulation becomes prohibitive with many cores (motivation: explore scalability) • cycle-accurate: each message is modeled; models network contention (arbitration); calculates average request completion times • [Toolflow: Bochs produces application memory traces, which feed the network simulator to obtain memory access latency/performance]

  22. Performance Scalability • Compare in-network virtual tree protocol to standard directory protocol • Good improvement • 4x4 mesh: avg. 35.5% read latency reduction, 41.2% write latency reduction • Scalable • 8x8 mesh: avg. 35% read and 48% write latency reduction • [Charts: read/write latency for 4x4 and 8x8 meshes]

  23. Storage Scalability • Directory protocol: O(# nodes) • Virtual tree protocol: O(# ports) ~ O(1) • Storage overhead • 4x4 mesh: 56% more storage bits • 8x8 mesh: 58% fewer storage bits • Per-line sizes: 16 nodes: directory cache line 18 bits, tree cache line 28 bits; 64 nodes: directory cache line 66 bits, tree cache line 28 bits Courtesy: Princeton-MICRO06
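The percentages follow directly from the per-line sizes on the slide; a quick arithmetic check:

```python
def overhead(tree_bits, dir_bits):
    """Relative storage of a tree cache line vs. a directory cache line."""
    return (tree_bits - dir_bits) / dir_bits

# Per-line sizes from the slide: the directory line grows with node count
# (its sharer list), while the tree cache line stays at 28 bits.
small = overhead(28, 18)   # 16 nodes (4x4 mesh): ~56% more bits
large = overhead(28, 66)   # 64 nodes (8x8 mesh): ~58% fewer bits
```

The crossover is what makes the scheme attractive only at larger scales: below roughly 28-bit directory entries, the tree cache actually costs more.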

  24. Issues • Power Dissipation • Major Bottleneck • complex functionality in the routers • Complexity • Complexity of routers, arbiters, etc. • Lack of verification tools • CAD tool support • More research needed for ease of adoption

  25. Summary • More and more cores on a CMP • Natural trend towards distribution of state • More interaction between interconnects and protocols • Ring interconnects => medium size systems • Simple design, low cost • Limited Scalability • In-Network coherence => large scale systems • Scalability • Severe implementation and power constraints on chip

  26. Thank you & Questions
