
Cache Coherence and Interconnection Network Design



  1. Cache Coherence and Interconnection Network Design Adapted from UC Berkeley Notes

  2. Cache-based (Pointer-based) Schemes • How they work: • home only holds a pointer to the rest of the directory info • distributed linked list of copies weaves through the caches • cache tag has a pointer that points to the next cache with a copy • on read, add yourself to the head of the list (communication needed) • on write, propagate a chain of invalidations down the list (sketched below) • What if a link fails? => Scalable Coherent Interface (SCI) IEEE standard • doubly linked list • What to do on replacement?
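A minimal sketch of the pointer chain just described, with invented class names (`Home`, `CacheEntry`) rather than anything from the SCI standard: the home holds only a head pointer, each cache tag points to the next sharer, a read prepends the reader, and a write walks the chain invalidating one copy per hop.

```python
# Sketch of a cache-based (pointer-chain) directory; names are illustrative.
class CacheEntry:
    def __init__(self, node_id):
        self.node_id = node_id
        self.next = None        # cache tag pointer to the next cache with a copy
        self.valid = True

class Home:
    """Home node: holds only the head pointer, not the full sharer list."""
    def __init__(self):
        self.head = None

    def read(self, node_id):
        """On a read, the new sharer adds itself at the head (needs comm. to home)."""
        entry = CacheEntry(node_id)
        entry.next = self.head
        self.head = entry
        return entry

    def write(self, node_id):
        """On a write, invalidations propagate down the list, one hop at a time."""
        hops = 0
        cur = self.head
        while cur is not None:  # latency grows with the number of sharers
            cur.valid = False
            cur = cur.next
            hops += 1
        self.head = None
        self.read(node_id)      # writer becomes the sole sharer
        return hops

home = Home()
for n in (1, 2, 3):
    home.read(n)
print(home.write(4))   # 3 invalidation hops: the chain is walked serially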

  3. Scaling Properties (Cache-based) • Traffic on write: proportional to the number of sharers • Latency on write: proportional to the number of sharers! • don’t know the identity of the next sharer until reaching the current one • also assist processing at each node along the way • (even reads involve more than one other assist: the home and the first sharer on the list) • Storage overhead: quite good scaling along both axes • only one head pointer per memory block • the rest is all proportional to cache size • Very complex!!!

  4. Hierarchical Directories • Directory is a hierarchical data structure • leaves are processing nodes, internal nodes just directory • logical hierarchy, not necessarily physical • (can be embedded in general network)

  5. Caching in the Internet (Server-side caching, Client-side caching, Network caching)

  6. Network Caching – Proxy Caching • Ref: Jia Wang, “A Survey of Web Caching Schemes for the Internet,” ACM Computing Surveys

  7. Comparison with P2P • Directory is similar to the tracker in BitTorrent (BT) or the root in a P2P network – Plaxton’s root has one pointer and BT has all pointers • The real object may be somewhere else (call it the home node), as pointed to by the directory • The shared caches are extra locations of the object, equivalent to peers • Duplicate directories are possible – see the hierarchical directory design (MIND) later • Lots of possible designs for the directory – Can we do a similar design in P2P?

  8. Plaxton’s Scheme • Similar to the Distributed Shared Memory (DSM) cache protocol • The root is the home node in DSM, where the directory is kept but not the real object. The root points to the nearest peer where a shared copy can be found. • Similar to the hierarchical directory scheme, where intermediate nodes point to the nearest object nodes (equivalent to shared copies). However, in the hierarchical directory scheme, intermediate pointers point to all shared copies.

  9. P2P Research • How many pointers do we need at the tracker? Cost model? Does it depend on popularity? How many peers to supply to a client? Communication model? How to update the directory? • Cache-based (linked-list) approach like SCI? • Hierarchical directory – Plaxton’s paper – introduce such a method for the BT network? • Combine directory + tree DHTs? • Is a dirty copy possible in P2P? Then develop a protocol similar to cache coherence • Sub-page coherence like DSM?

  10. Interconnection Network Design Adapted from UC Berkeley Notes

  11. Scalable, High-Performance Interconnection Network • At the core of parallel computer architecture • Requirements and trade-offs at many levels • elegant mathematical structure • deep relationships to algorithm structure • managing many traffic flows • electrical / optical link properties • Little consensus • interactions across levels • performance metrics? • cost metrics? • workload? • => need holistic understanding

  12. Goals • The job of a multiprocessor network is to transfer information from a source node to a destination node in support of network transactions that realize the programming model • latency as small as possible • as many concurrent transfers as possible • operation bandwidth • data bandwidth • cost as low as possible

  13. Formalism • A network is a graph V = {switches and nodes} connected by communication channels C ⊆ V × V • A channel has width w and signaling rate f = 1/τ (τ = cycle time) • channel bandwidth b = wf (quick example below) • phit (physical unit): data transferred per cycle • flit: basic unit of flow control • Number of input (output) channels is the switch degree • The sequence of switches and links followed by a message is a route • Think streets and intersections
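As a quick worked instance of these definitions (all numbers invented for illustration):

```python
# Channel bandwidth from the slide's definitions: b = w * f, with f = 1/tau.
w = 16            # channel width in bits (illustrative)
tau = 2e-9        # cycle time in seconds -> f = 500 MHz
f = 1 / tau       # signaling rate
b = w * f         # channel bandwidth in bits/second
print(f"b = {b/1e9:.1f} Gb/s")   # 16 bits * 500 MHz = 8.0 Gb/s
```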

  14. What characterizes a network? • Topology (what) • physical interconnection structure of the network graph • direct: node connected to every switch • indirect: nodes connected to a specific subset of switches • Routing Algorithm (which) • restricts the set of paths that messages may follow • many algorithms with different properties • deadlock avoidance? • Switching Strategy (how) • how the data in a message traverses a route • circuit switching vs. packet switching • Flow Control Mechanism (when) • when a message or portions of it traverse a route • what happens when traffic is encountered?

  15. Topological Properties • Routing distance: number of links on a route • Diameter: maximum routing distance between any two nodes in the network • Average distance: sum of distances between nodes / number of node pairs • Degree of a node: number of links connected to a node => cost is high if degree is high • A network is partitioned by a set of links if their removal disconnects the graph • Fault tolerance: number of alternate paths between two nodes in the network • (the distance properties are computed for a small example below)
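These properties follow mechanically from the network graph; here is a sketch using BFS on an illustrative 4-node ring (the topology is invented for the example):

```python
# Diameter and average distance of a network graph via BFS.
from collections import deque

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # 4-node ring (illustrative)

def distances(src):
    """Routing distance (links on a shortest route) from src to every node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

all_d = [d for s in adj for d in distances(s).values() if d > 0]
print("diameter:", max(all_d))                        # 2 for the ring
print("average distance:", sum(all_d) / len(all_d))   # 4/3 for the ring
```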

  16. Typical Packet Format • Two basic mechanisms for abstraction • encapsulation • fragmentation

  17. Review: Performance Metrics • [Timing diagram: sender overhead (processor busy), time of flight, transmission time (size ÷ bandwidth), receiver overhead (processor busy); transport latency covers flight + transmission, total latency covers all four] • Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead • Does the BW calculation include the header/trailer?
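Plugging illustrative numbers into the total-latency formula from the slide:

```python
# Total latency = sender overhead + time of flight + size/BW + receiver overhead.
# All constants are invented for illustration.
sender_ovhd = 1e-6        # s, processor busy on send
time_of_flight = 5e-6     # s, first bit from source to destination
msg_size = 1000 * 8       # bits (a 1000-byte message)
bw = 1e9                  # bits/s
recv_ovhd = 1e-6          # s, processor busy on receive

total = sender_ovhd + time_of_flight + msg_size / bw + recv_ovhd
print(f"total latency = {total*1e6:.1f} us")   # 1 + 5 + 8 + 1 = 15.0 us
```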

  18. Switching Techniques • Circuit switching: a control message is sent from source to destination and a path is reserved. Communication starts. The path is released when communication is complete. • Store-and-forward policy (packet switching): each switch waits for the full packet to arrive before sending it to the next switch (good for WANs) • Cut-through routing or wormhole routing: the switch examines the header, decides where to send the message, and then starts forwarding it immediately • In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages (needs to buffer only the piece of the packet that is sent between switches). The CM-5 uses it, with each switch buffer being 4 bits per port. • Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch (requires a buffer large enough to hold the largest packet).


  20. Store-and-Forward vs. Cut-Through • Advantage • Latency is reduced from a function of: (number of intermediate switches) × (packet size ÷ interconnect BW) to: (time for the 1st part of the packet to negotiate the switches) + (packet size ÷ interconnect BW)

  21. Store-and-Forward vs. Cut-Through Routing • h(n/b + Δ) vs. n/b + hΔ, where h = number of hops, n = message size, b = link bandwidth, Δ = routing delay per switch • What if the message is fragmented? • wormhole vs. virtual cut-through


  23. Example • Q: Compare the efficiency of store-and-forward (packet switching) vs. wormhole routing for transmission of a 20-byte packet between a source and destination that are d nodes apart. Each node takes 0.25 microseconds and the link transfer rate is 20 MB/sec. • Answer: Time to transfer 20 bytes over a link = 20 bytes / 20 MB/sec = 1 microsecond. • Packet switching: # nodes × (node delay + transfer time) = d × (0.25 + 1) = 1.25d microseconds • Wormhole: (# nodes × node delay) + transfer time = 0.25d + 1 microseconds • Book: for d = 7, packet switching takes 8.75 microseconds vs. 2.75 microseconds for wormhole routing
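The slide's arithmetic, reproduced as a script (the constants come straight from the example):

```python
# 20-byte packet, 20 MB/s links, 0.25 us per node, destination d nodes away.
node_delay = 0.25                   # microseconds per node
xfer = 20 / 20e6 * 1e6              # 20 bytes / 20 MB/s = 1 us per link

def store_and_forward(d):
    return d * (node_delay + xfer)  # every node waits for the full packet

def wormhole(d):
    return d * node_delay + xfer    # transfer time is paid only once

print(store_and_forward(7), wormhole(7))   # 8.75 vs 2.75 us, as in the book
```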

  24. Contention • Two packets trying to use the same link at the same time • limited buffering • drop? • Most parallel machine networks block in place • link-level flow control • tree saturation • Closed system: offered load depends on delivered load

  25. Delay with Queuing • Suppose there are L links per node. Each link sends packets to another link at λ packets/sec. The service rate (link + switch) is μ packets/sec. What is the delay over a distance D? • Answer: There is a queue at each output link to hold extra packets. Model each output link as an M/M/1 queue with arrival rate Lλ and service rate μ. • Delay through each link = queuing time + service time = S (for M/M/1, S = 1/(μ − Lλ), assuming Lλ < μ) • Delay over a distance D = S × D
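A sketch of the per-hop delay under the standard M/M/1 sojourn-time formula S = 1/(μ − Lλ); the traffic numbers below are invented for illustration:

```python
# Per-hop delay from the M/M/1 model above, summed over D hops.
def queuing_delay(L, lam, mu, D):
    rate_in = L * lam
    assert rate_in < mu, "queue is unstable if arrivals meet or exceed service"
    S = 1.0 / (mu - rate_in)   # queuing time + service time per link
    return S * D               # total delay over a distance of D hops

print(queuing_delay(L=4, lam=200, mu=1000, D=5))   # 5/(1000-800) = 0.025 s
```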

  26. Congestion Control • Packet-switched networks do not reserve bandwidth; this leads to contention (connection-based networks limit input) • Solution: prevent packets from entering until contention is reduced (e.g., freeway on-ramp metering lights) • Options: • Packet discarding: if a packet arrives at a switch and there is no room in the buffer, the packet is discarded (e.g., UDP) • Flow control: between pairs of receivers and senders; use feedback to tell the sender when it is allowed to send the next packet • Back-pressure: separate wires to signal stop • Window: give the original sender the right to send N packets before getting permission to send more; overlaps the latency of the interconnect with the overhead to send & receive a packet (e.g., TCP); adjustable window (a minimal window sketch follows this list) • Choke packets: aka “rate-based”; each packet received by a busy switch in a warning state is sent back to the source via a choke packet. The source reduces traffic to that destination by a fixed % (e.g., ATM)
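A minimal sketch of the window option, with invented names; this is not TCP itself, just the "at most N unacknowledged packets" rule:

```python
# Minimal sliding-window flow-control sketch (illustrative; not TCP).
class WindowSender:
    def __init__(self, window):
        self.window = window      # N: packets allowed before more permission
        self.in_flight = 0        # sent but not yet acknowledged

    def can_send(self):
        return self.in_flight < self.window

    def send(self):
        assert self.can_send(), "window full: wait for an ack"
        self.in_flight += 1

    def ack(self):                # receiver feedback frees one window slot
        self.in_flight -= 1

s = WindowSender(window=3)
while s.can_send():
    s.send()                      # 3 packets go out back-to-back
print(s.can_send())               # False: window exhausted
s.ack()
print(s.can_send())               # True: one more packet may be sent
```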

  27. Routing • Recall: the routing algorithm determines • which of the possible paths are used as routes • how the route is determined • R: N × N → C, which at each switch maps the destination node nd to the next channel on the route • Issues: • Routing mechanism • arithmetic • source-based port select • table-driven • general computation • Properties of the routes • Deadlock free?

  28. Routing Mechanism • need to select an output port for each input packet • in a few cycles • Simple arithmetic in regular topologies • ex: Δx, Δy routing in a grid • west (−x) Δx < 0 • east (+x) Δx > 0 • south (−y) Δx = 0, Δy < 0 • north (+y) Δx = 0, Δy > 0 • processor Δx = 0, Δy = 0 • Reduce the relative address of each dimension in order (see the sketch below) • Dimension-order routing in k-ary d-cubes • e-cube routing in n-cubes
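The port-select arithmetic above, written out (function names are invented):

```python
# Dimension-order (x-then-y) port selection in a 2D grid, exactly the
# arithmetic on the slide: route x to zero first, then y.
def select_port(dx, dy):
    if dx < 0: return "west"     # reduce x first
    if dx > 0: return "east"
    if dy < 0: return "south"    # then reduce y
    if dy > 0: return "north"
    return "processor"           # arrived: deliver to the local node

def route(src, dst):
    """Follow the port selects hop by hop; returns the sequence of ports."""
    x, y = src
    ports = []
    while (x, y) != dst:
        p = select_port(dst[0] - x, dst[1] - y)
        ports.append(p)
        x += {"east": 1, "west": -1}.get(p, 0)
        y += {"north": 1, "south": -1}.get(p, 0)
    return ports

print(route((0, 0), (2, 1)))   # ['east', 'east', 'north']
```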

  29. Routing Mechanism (cont.) • Source-based • message header carries a series of port selects (e.g., P3 P2 P1 P0) • used and stripped en route • CRC? Packet format? • CS-2, Myrinet, MIT Arctic • Table-driven • message header carries an index for the next port at the next switch • o = R[i] • the table also gives the index for the following hop • o, i′ = R[i] • ATM, HIPPI • (a table-lookup sketch follows)
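A sketch of the table-driven lookup o, i′ = R[i]; the topology, table contents, and names here are invented for illustration:

```python
# Table-driven routing: each switch holds a table R; the header carries an
# index i, and each hop yields an output port plus the next index.
switches = {
    "A": {0: ("port2", 1)},       # at A, index 0 -> leave on port2, next index 1
    "B": {1: ("port0", 2)},
    "C": {2: ("deliver", None)},  # index 2 at C -> deliver to the local node
}
links = {("A", "port2"): "B", ("B", "port0"): "C"}   # wiring between switches

def forward(switch, index):
    path = []
    while True:
        port, index = switches[switch][index]   # o, i' = R[i]
        if port == "deliver":
            return path
        path.append((switch, port))
        switch = links[(switch, port)]

print(forward("A", 0))   # [('A', 'port2'), ('B', 'port0')]
```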

  30. Properties of Routing Algorithms • Deterministic • route determined by (source, dest), not intermediate state (i.e., traffic) • Adaptive • route influenced by traffic along the way • Minimal • only selects shortest paths • Deadlock-free • no traffic pattern can lead to a situation where no packets move forward

  31. Deadlock Freedom • How can it arise? • necessary conditions: • shared resource • incrementally allocated • non-preemptible • think of a channel as a shared resource that is acquired incrementally • source buffer, then destination buffer • channels along a route • How do you avoid it? • constrain how channel resources are allocated • ex: dimension order • How do you prove that a routing algorithm is deadlock-free?

  32. Proof Technique • resources are logically associated with channels • messages introduce dependences between resources as they move forward • need to articulate the possible dependences that can arise between channels • show that there are no cycles in the channel dependence graph • find a numbering of channel resources such that every legal route follows a monotonic sequence • => no traffic pattern can lead to deadlock • the network need not be acyclic, only the channel dependence graph

  33. Example: k-ary 2D Array • Theorem: x,y routing is deadlock-free • Numbering • +x channel (i,y) → (i+1,y) gets number i • similarly for −x, with 0 as the most positive edge • +y channel (x,j) → (x,j+1) gets N+j • similarly for −y channels • any routing sequence (x direction, turn, y direction) is increasing (checked in the sketch below)
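A check of the numbering argument. The exact constants for the −x/−y cases are one reasonable reading of the slide, taking N = k; the point is only that every legal x-then-y route sees strictly increasing channel numbers:

```python
# Number channels per the slide and verify an x-then-y route is monotonic.
def channel_number(src, dst, k):
    (x0, y0), (x1, y1) = src, dst
    if y0 == y1:                                       # x-dimension channel
        return x0 if x1 > x0 else k - 1 - x0           # +x: i   /  -x: mirrored
    return (k + y0) if y1 > y0 else k + (k - 1 - y0)   # +y: N+j /  -y: mirrored

def xy_route(src, dst):
    """Hops of the x-then-y route as (from, to) pairs."""
    x, y = src
    hops = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        hops.append(((x, y), (nx, y))); x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        hops.append(((x, y), (x, ny))); y = ny
    return hops

k = 4
nums = [channel_number(a, b, k) for a, b in xy_route((3, 0), (0, 3))]
print(nums, nums == sorted(nums))   # [0, 1, 2, 4, 5, 6] True -> no CDG cycle
```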

  34. Deadlock-free wormhole networks? • Basic dimension-order routing techniques don’t work for k-ary d-cubes • only for k-ary d-arrays (bi-directional) • Idea: add channels! • provide multiple “virtual channels” to break the dependence cycle • good for BW too! • no need to add links or crossbars, only buffer resources • this adds nodes to the CDG; does it remove edges?

  35. Breaking deadlock with virtual channels

  36. Adaptive Routing • R: C × N × S → C • Essential for fault tolerance • at least multipath • Can improve utilization of the network • Simple deterministic algorithms easily run into bad permutations • fully/partially adaptive, minimal/non-minimal • can introduce complexity or anomalies • a little adaptation goes a long way!
