CS 505: Computer Structures Networks

Presentation Transcript


  1. CS 505: Computer Structures Networks Thu D. Nguyen, Spring 2005, Computer Science, Rutgers University

  2. Basic Message Passing [diagram: processes P0 and P1, on nodes N0 and N1, exchange a message via Send and Receive across a communication fabric]

  3. Terminology • Basic Message Passing: • Send: Analogous to mailing a letter • Receive: Analogous to picking up a letter from the mailbox • Scatter-gather: Ability to “scatter” data items in a message into multiple memory locations and “gather” data items from multiple memory locations into one message • Network performance: • Latency: The time from when a Send is initiated until the first byte is received by a Receive • Bandwidth: The rate at which a sender is able to send data to a receiver
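
Latency and bandwidth combine into a simple first-order cost model for one message: total time is roughly the latency plus the bytes divided by the bandwidth. A minimal sketch (function name ours, not from the slides):

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bps):
    """First-order cost model: time from Send start to last byte received."""
    return latency_s + msg_bytes / bandwidth_bps

# 1 MB message over a link with 20 us latency and 1 Gbit/s (= 125 MB/s) bandwidth
t = transfer_time(1_000_000, 20e-6, 1e9 / 8)  # ~8.02 ms, dominated by bandwidth
```

Note how, for large messages, bandwidth dominates; for small messages, latency does.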

  4. Scatter-Gather [diagram: Gather (Send) collects data items from multiple memory locations into one message; Scatter (Receive) distributes a message’s data items to multiple memory locations]

  5. Network Topologies

  6. Terminology • Network partition: When a network is broken into two or more components that cannot communicate with each other. • Diameter: Maximum length of the shortest path between any two processors. • Connectivity: Measure of the multiplicity of paths between any two processors; the minimum number of links that must be removed to partition the network. • Bisection width: Minimum number of links that must be removed to partition the network into two equal halves. • Bisection bandwidth: Minimum volume of communication allowed between any two halves of the network with an equal number of processors.
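
The diameter definition above can be checked mechanically: run BFS from every node and take the largest eccentricity. An illustrative sketch (names ours) on an 8-node ring, whose diameter should be p/2:

```python
from collections import deque

def diameter(adj):
    """Max over all nodes of the BFS eccentricity (longest shortest path)."""
    def ecc(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(ecc(s) for s in adj)

# 8-node ring: each node links to its two neighbors; diameter = p/2 = 4
p = 8
ring = {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}
print(diameter(ring))  # 4
```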

  7. Bisection Bandwidth Bisection Bandwidth = Bisection Width * Link Bandwidth

  8. Typical Network Diagram

  9. Typical Node [diagram: CPU and memory attached to a NIC, which connects the node to a router]

  10. Bus-Based Network • Advantages • Simple • Diameter = 1 • Disadvantages • Blocking • Bandwidth does not scale with p • Easy to partition the network

  11. Completely-Connected Network • Advantages • Diameter = 1 • Bandwidth scales with p • Non-blocking • Difficult to partition the network • Disadvantages • Number of links grows as O(p²) • Fan-in (and fan-out) at each node grows linearly with p

  12. Star Network • Essentially same as Bus-Based Network

  13. Ring Network

  14. Mesh and Torus Network

  15. Multistage Network

  16. Perfect Shuffle
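
The perfect shuffle on p = 2^n inputs is a left rotation of the n-bit input index, which is the wiring pattern between stages of the Omega network on the next slide. A minimal sketch (function name ours):

```python
def perfect_shuffle(i, nbits):
    """Left-rotate the nbits-bit index i: the perfect-shuffle permutation."""
    return ((i << 1) | (i >> (nbits - 1))) & ((1 << nbits) - 1)

# 8 inputs (3 bits): low half interleaves with high half
print([perfect_shuffle(i, 3) for i in range(8)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```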

  17. Omega Network - log(p) Stages

  18. Blocking in Omega Network

  19. Tree Network

  20. Fat Tree Network

  21. Hypercube Network

  22. Hypercube Network

  23. k-ary d-cube Networks • k: radix of the network, the number of processors in each dimension • d: dimension of the network • A k-ary d-cube can be constructed from k k-ary (d-1)-cubes by connecting the nodes occupying identical positions into rings • Examples: • Hypercube: binary (2-ary) d-cube • Ring: p-ary 1-cube
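
The ring-per-dimension construction above can be sketched directly: a node is a d-tuple of digits in 0..k-1, and its neighbors differ by ±1 (mod k) in one dimension. Both special cases fall out. A sketch (names ours):

```python
def kary_dcube_neighbors(node, k, d):
    """Neighbors of `node` (a d-tuple of digits in 0..k-1): step +/-1 in each
    dimension with wraparound, since each dimension forms a ring."""
    nbrs = set()
    for dim in range(d):
        for step in (1, -1):
            coords = list(node)
            coords[dim] = (coords[dim] + step) % k
            nbrs.add(tuple(coords))
    return nbrs

# Hypercube = 2-ary d-cube: +1 and -1 coincide mod 2, so exactly d neighbors
print(len(kary_dcube_neighbors((0, 0, 0), 2, 3)))  # 3
# Ring = p-ary 1-cube: two neighbors
print(len(kary_dcube_neighbors((0,), 5, 1)))       # 2
```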

  24. Arbitrary Topology Networks [diagram: nodes connected through several switches in an irregular topology]

  25. Network Characteristics

  26. Packet vs. Wormhole Routing [diagram: a message either split into packets or streamed as a worm]

  27. Store-and-Forward vs. Cut-Through Routing • Store-and-Forward: • Cannot route/forward a packet until the entire packet has been received • Cut-Through: • Can route/forward a packet as soon as the router has received and processed the header • Wormhole routing is always cut-through because there is not enough buffer space to hold an entire message • Packet routing is almost always cut-through as well • Difference: when blocked, a worm can span multiple routers, while a packet fits entirely into the buffer of a single router
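
The contrast above shows up directly in a first-order latency model: store-and-forward pays the full message transmission time at every hop, while cut-through pipelines the body behind the header and pays the per-hop cost only once per hop. A sketch under those assumptions (names and parameters ours):

```python
def store_and_forward(msg_bytes, hops, byte_time, per_hop_delay):
    """Entire message is received at each router before being forwarded."""
    return hops * (msg_bytes * byte_time + per_hop_delay)

def cut_through(msg_bytes, hops, byte_time, per_hop_delay):
    """Only the header pays the per-hop cost; the body pipelines behind it."""
    return hops * per_hop_delay + msg_bytes * byte_time

# 1 KB message, 4 hops, 1 ns/byte links, 1 us routing delay per hop:
# store-and-forward ~8.1 us vs cut-through ~5.0 us
sf = store_and_forward(1024, 4, 1e-9, 1e-6)
ct = cut_through(1024, 4, 1e-9, 1e-6)
```

The gap widens with hop count: cut-through latency grows only in the per-hop term, not in the message-size term.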

  28. Collective Communication Primitives • Send/Receive are necessary and sufficient • Broadcast, multicast • one-to-all, all-to-all, one-to-all personalized, all-to-all personalized • flood • Reduction • all-to-one, all-to-all • Scatter, gather • Barrier

  29. Broadcast and Multicast [diagram: broadcast sends P0’s message to all of P1, P2, P3; multicast sends it to a selected subset]

  30. All-to-All [diagram: each of P0–P3 sends a message to every other processor]

  31. Reduction • Sequential: sum ← 0; for i ← 0 to p−1 do sum ← sum + A[i] • [diagram: tree reduction: P1 sends A[1] and P3 sends A[3]; the partial sums A[0] + A[1] and A[2] + A[3] are combined at P0 into A[0] + A[1] + A[2] + A[3]]
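
The tree combination on the slide can be sketched as code: pairwise partial sums halve the number of active processors each round, so p values reduce in log2(p) rounds instead of p−1 sequential additions. A sketch (function name ours):

```python
def tree_reduce(a):
    """Pairwise tree reduction. In round r, processor i (with i a multiple of
    2^(r+1)) adds in the partial sum held by processor i + 2^r."""
    vals = list(a)
    p = len(vals)
    stride = 1
    while stride < p:
        for i in range(0, p, 2 * stride):
            if i + stride < p:
                vals[i] += vals[i + stride]   # simulate one message + add
        stride *= 2
    return vals[0]

print(tree_reduce([1, 2, 3, 4]))  # 10, in 2 rounds rather than 3 adds in a row
```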

  32. Ring Broadcast O(p)

  33. Ring Broadcast O(log p)
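
The O(log p) version works by recursive doubling: in each step every processor that already holds the message forwards it to one processor that does not, so the informed set doubles until everyone has it. A small sketch of the step count (function name ours):

```python
def doubling_broadcast_steps(p):
    """Steps for a recursive-doubling broadcast: the informed set doubles
    each step, so ceil(log2 p) steps suffice."""
    informed, steps = 1, 0
    while informed < p:
        informed = min(2 * informed, p)
        steps += 1
    return steps

print(doubling_broadcast_steps(8))  # 3, vs. 7 sequential sends around the ring
```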

  34. Mesh Broadcast

  35. Computation vs. Communication Cost • 2 GHz clock => 1/2 ns instruction cycle • Memory access: • L1: ~2-4 cycles => 1-2 ns • L2: ~5-10 cycles => 2.5-5 ns • Memory: ~120-300 cycles => 60-150 ns • Message roundtrip latency: • ~20 µs • Suppose 75% hit ratio in L1, no L2, 1 ns L1 access time, 200 ns memory access time => average memory access time ≈ 51 ns • 1 message roundtrip latency = ~400 memory accesses
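
The slide's arithmetic checks out with the standard average-memory-access-time formula (function name ours):

```python
def amat(hit_ratio, hit_time, miss_time):
    """Average memory access time with a single cache level."""
    return hit_ratio * hit_time + (1 - hit_ratio) * miss_time

avg = amat(0.75, 1e-9, 200e-9)   # 0.75*1 + 0.25*200 = 50.75 ns, ~51 ns
accesses_per_roundtrip = 20e-6 / avg   # ~394, i.e. roughly 400 accesses
```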

  36. Performance … Always Performance! • So … obviously, when we talk about message passing, we want to know how to optimize for performance • But … which aspects of message passing should we optimize? • We could try to optimize everything • Optimizing the wrong thing wastes precious resources, e.g., optimizing leaving mail for the mail-person does not increase overall “speed” of mail delivery significantly

  37. Martin et al.: LogP Model

  38. Sensitivity to LogGP Parameters • LogGP parameters: • L = delay incurred in passing a short msg from source to dest • o = processor overhead involved in sending or receiving a msg • g = min time between msg transmissions or receptions (msg bandwidth) • G = bulk gap = time per byte transferred for long transfers (byte bandwidth) • Workstations connected with Myrinet network and Generic Active Messages layer • Delay insertion technique • Applications written in Split-C but perform their own data caching
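
The four parameters compose into simple closed-form costs; a sketch of the usual LogGP accounting (function names ours):

```python
def short_msg_time(L, o):
    """One short message end to end: send overhead + latency + receive overhead."""
    return o + L + o

def n_short_msgs_time(n, L, o, g):
    """n back-to-back short messages: successive injections are g apart,
    then the last one crosses the network."""
    return (n - 1) * g + short_msg_time(L, o)

def bulk_msg_time(nbytes, L, o, G):
    """One bulk message: every byte after the first costs G (byte bandwidth)."""
    return o + (nbytes - 1) * G + L + o

# e.g. L=5, o=2, g=4 (all in us): one short message takes 9 us,
# ten back to back take 45 us because g, not L, paces the stream
t1 = short_msg_time(5, 2)
t10 = n_short_msgs_time(10, 5, 2, 4)
```

This is why the study varies each parameter independently: the model predicts which term dominates for a given traffic pattern.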

  39. Sensitivity to Overhead

  40. Sensitivity to Gap

  41. Sensitivity to Latency

  42. Sensitivity to Bulk Gap

  43. Summary • Runtime strongly dependent on overhead and gap • Strong dependence on gap because of burstiness of communication • Not so sensitive to latency => can effectively overlap computation and communication with non-blocking reads (writes usually do not stall the processor) • Not sensitive to bulk gap => got more bandwidth than we know what to do with

  44. What’s the Point? • What can we take away from Martin et al.’s study? • It’s extremely important to reduce overhead because it may affect both “o” and “g” • All the “action” is currently in the OS and the Network Interface Card (NIC) • Subject of von Eicken et al., “Active Messages: a Mechanism for Integrated Communication and Computation,” ISCA 1992.

  45. User-Level Access to NIC • Basic idea: allow protected user access to NIC for implementing comm. protocols at user-level

  46. User-level Communication • Basic idea: remove the kernel from the critical path of sending and receiving messages • user-memory to user-memory: zero copy • permission is checked once when the mapping is established • buffer management left to the application • Advantages • low communication latency • low processor overhead • approach raw latency and bandwidth provided by the network • One approach: U-Net

  47. U-Net Abstraction

  48. U-Net Endpoints

  49. U-Net Basics • Protection provided by endpoints and communication channels • Endpoints, communication segments, and message queues are accessible only by the owning process (all allocated in user memory) • Outgoing messages are tagged with the originating endpoint address; incoming messages are demultiplexed and delivered only to the correct endpoints • For best performance, firmware on the NIC should implement the actual messaging and NI multiplexing (including tag checking); the OS must provide protection by validating requests to create endpoints, and should also implement channel registration • Message queues can be placed in different memories to optimize polling • Receive queue allocated in host memory • Send and free queues allocated in NIC memory
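
The endpoint structure above can be sketched in miniature. This is an illustrative model only (class and method names are ours, not U-Net's API): a communication segment of buffers plus send, receive, and free queues, with tag-based demultiplexing standing in for what the NIC firmware does:

```python
from collections import deque

class Endpoint:
    """Toy model of a U-Net-style endpoint, all state in user memory."""
    def __init__(self, tag, nbufs, bufsize):
        self.tag = tag                       # identifies this endpoint on the wire
        self.segment = [bytearray(bufsize) for _ in range(nbufs)]
        self.free_q = deque(range(nbufs))    # buffer indices available for receives
        self.send_q = deque()                # descriptors handed to the NIC
        self.recv_q = deque()                # filled buffers awaiting the app

    def send(self, dest_tag, payload):
        # Enqueue a descriptor; a real NIC would DMA the data from the segment.
        self.send_q.append((self.tag, dest_tag, bytes(payload)))

    def deliver(self, src_tag, dest_tag, payload):
        # NIC-side demultiplexing: reject traffic addressed to other endpoints,
        # and drop if no free receive buffer is available.
        if dest_tag != self.tag or not self.free_q:
            return False
        idx = self.free_q.popleft()
        self.segment[idx][:len(payload)] = payload
        self.recv_q.append((src_tag, idx, len(payload)))
        return True
```

Because both queues live in the process's address space, send and receive never trap into the kernel, which is the whole point of the design.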

  50. U-Net Performance on ATM
