
Communication in Tightly Coupled Systems


Presentation Transcript


  1. Communication in Tightly Coupled Systems
     CS 519: Operating System Theory
     Computer Science, Rutgers University
     Instructor: Thu D. Nguyen
     TA: Xiaoyan Li
     Spring 2002

  2. Why Parallel Computing? Performance!

  3. Processor Performance

  4. But Not Just Performance
     • At some point, we're willing to trade some performance for:
       • Ease of programming
       • Portability
       • Cost
     • Ease of programming & portability
       • Parallel programming for the masses
       • Leverage new or faster hardware as soon as possible
     • Cost
       • High-end parallel machines are expensive resources

  5. Amdahl's Law
     • If a fraction s of a computation is not parallelizable, then the best achievable speedup on p processors is
       speedup(p) = 1 / (s + (1 - s)/p)
       which approaches 1/s as p grows without bound. For example, with s = 0.05, no number of processors can yield more than a 20x speedup.
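A quick numeric check of the formula above, as a minimal C sketch (the serial fraction s = 0.05 and the processor counts are made-up example values):

    #include <stdio.h>

    /* Amdahl's Law: speedup on p processors when a fraction s is serial. */
    double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        double s = 0.05;                      /* assumed serial fraction */
        int procs[] = {1, 2, 8, 64, 1024};
        for (int i = 0; i < 5; i++)
            printf("p = %4d  speedup = %6.2f\n", procs[i], speedup(s, procs[i]));
        printf("limit as p grows: %.2f\n", 1.0 / s);  /* best achievable: 1/s */
        return 0;
    }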

  6. Pictorial Depiction of Amdahl's Law
     [Figure: execution time on 1 processor vs. p processors; the parallel portion shrinks by a factor of p while the serial portion's time stays fixed]

  7. Parallel Applications
     • Scientific computing is not the only class of parallel applications
     • Examples of non-scientific parallel applications:
       • Data mining
       • Real-time rendering
       • Distributed servers

  8. Centralized Memory Multiprocessors
     [Figure: multiple CPUs, each with a cache, share a memory bus with a single memory; an I/O bus connects a disk and a network interface]

  9. Distributed Shared-Memory (NUMA) Multiprocessors
     [Figure: two nodes, each with a CPU, cache, local memory, memory bus, I/O bus, disk, and network interface, joined by a network]

  10. Multicomputers
      [Figure: the same two-node hardware picture as the NUMA case, but with no shared address space]
      Inter-processor communication in multicomputers is effected through message passing.

  11. Basic Message Passing
      [Figure: processes P0 and P1 on nodes N0 and N1 exchange messages via Send and Receive over a communication fabric]

  12. Terminology
      • Basic message passing:
        • Send: analogous to mailing a letter
        • Receive: analogous to picking up a letter from the mailbox
        • Scatter-gather: the ability to "scatter" the data items in a message into multiple memory locations and to "gather" data items from multiple memory locations into one message
      • Network performance:
        • Latency: the time from when a Send is initiated until the first byte is received by a Receive
        • Bandwidth: the rate at which a sender is able to send data to a receiver

  13. Scatter-Gather
      [Figure: gather (send) collects data items from multiple memory locations into one message; scatter (receive) distributes a message's items into multiple memory locations]
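POSIX vectored I/O is one concrete realization of gather-on-send and scatter-on-receive; a minimal C sketch (the descriptor fd and the three buffers are hypothetical):

    #include <sys/types.h>
    #include <sys/uio.h>   /* writev, struct iovec */

    /* Gather: three separate memory regions leave as one message. */
    ssize_t send_gathered(int fd, char *hdr, size_t hlen,
                          char *body, size_t blen,
                          char *checksum, size_t clen) {
        struct iovec iov[3];
        iov[0].iov_base = hdr;      iov[0].iov_len = hlen;
        iov[1].iov_base = body;     iov[1].iov_len = blen;
        iov[2].iov_base = checksum; iov[2].iov_len = clen;
        return writev(fd, iov, 3);  /* one call, one message, three sources */
    }

    /* The receive side would call readv() with its own iovec array to
       scatter the incoming bytes across multiple destination buffers. */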

  14. Basic Message Passing: Easy, Right?
      • What can be easier than this, right?
      • Well, think of the post office: to send a letter …

  15. Basic Message Passing: Not So Easy
      • Why is it so complicated to send a letter if basic message passing is so easy?
      • Well, it's really not easy! Issues include:
        • Naming: how to specify the receiver?
        • Routing: how to forward the message to the correct receiver through intermediaries?
        • Buffering: what if the out port is not available? What if the receiver is not ready to receive the message?
        • Reliability: what if the message is lost in transit? What if the message is corrupted in transit?
        • Blocking: what if the receiver is ready to receive before the sender is ready to send?

  16. Traditional Message Passing Implementation
      • Kernel-based message passing: unnecessary data copying and traps into the kernel
      [Figure: each message from sender S to receiver R is staged through kernel buffers on both sides]

  17. Reliability
      • Reliability problems:
        • Message loss
          • Most common approach: if the sender doesn't get a reply/ack message within some time interval, resend (sketched below)
        • Message corruption
          • Most common approach: send additional information (e.g., an error-correcting code) so the receiver can reconstruct the data, or simply detect the corruption, if part of the message is lost or damaged. If reconstruction is not possible, throw away the corrupted message and pretend it was lost
        • Lack of buffer space
          • Most common approach: control the flow and size of messages to avoid running out of buffer space
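A minimal C sketch of the resend-on-timeout idea over a UDP socket (the 1-second timeout, MAX_TRIES, and the bare ack format are illustrative assumptions, not from the slides):

    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/select.h>
    #include <sys/time.h>

    #define MAX_TRIES 5

    /* Send msg, wait up to 1 s for any ack; resend on timeout. */
    int send_reliable(int sock, const void *msg, size_t len,
                      const struct sockaddr *dst, socklen_t dlen) {
        char ack[16];
        for (int tries = 0; tries < MAX_TRIES; tries++) {
            sendto(sock, msg, len, 0, dst, dlen);

            fd_set rd;
            FD_ZERO(&rd);
            FD_SET(sock, &rd);
            struct timeval tv = {1, 0};        /* assumed 1 s timeout */
            if (select(sock + 1, &rd, NULL, NULL, &tv) > 0 &&
                recvfrom(sock, ack, sizeof ack, 0, NULL, NULL) > 0)
                return 0;                       /* ack arrived */
            /* timeout: assume the message was lost and resend */
        }
        return -1;                              /* give up after MAX_TRIES */
    }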

  18. Reliability
      • Reliability is indeed a very hard problem in large-scale networks such as the Internet
        • The network is unreliable
        • Message loss can greatly impact performance
        • Mechanisms to address reliability can be costly even when there is no message loss
      • Reliability is not as hard for parallel machines
        • The underlying network hardware is much more reliable
        • Less prone to buffer overflow, because such machines often have hardware flow control
      • We address reliability later, for loosely coupled systems

  19. Computation vs. Communication Cost
      • 200 MHz clock → 5 ns instruction cycle
      • Memory access:
        • L1: ~2-4 cycles → 10-20 ns
        • L2: ~5-10 cycles → 25-50 ns
        • Memory: ~50-200 cycles → 250-1000 ns
      • Message roundtrip latency: ~20 µs
      • Suppose a 75% hit ratio in L1, no L2, 10 ns L1 access time, and 500 ns memory access time → average memory access time = 0.75 × 10 ns + 0.25 × 500 ns = 132.5 ns
      • 1 message roundtrip latency ≈ 151 memory accesses
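The same arithmetic written out in C (all numbers copied from the slide):

    #include <stdio.h>

    int main(void) {
        /* values from the slide */
        double hit = 0.75, l1_ns = 10.0, mem_ns = 500.0;
        double avg = hit * l1_ns + (1.0 - hit) * mem_ns;   /* 132.5 ns */
        double roundtrip_ns = 20000.0;                     /* ~20 µs */
        printf("avg access = %.1f ns; roundtrip = %.0f accesses\n",
               avg, roundtrip_ns / avg);                   /* ~151 */
        return 0;
    }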

  20. Performance … Always Performance!
      • So … obviously, when we talk about message passing, we want to know how to optimize for performance
      • But … which aspects of message passing should we optimize?
        • We could try to optimize everything
        • Optimizing the wrong thing wastes precious resources; e.g., optimizing how we leave mail out for the mail carrier does not significantly increase the overall "speed" of mail delivery
      • Subject of Martin et al., "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," ISCA 1997

  21. Martin et al.: LogP Model

  22. Sensitivity to LogGP Parameters
      • LogGP parameters:
        • L = delay incurred in passing a short message from source to destination
        • o = processor overhead involved in sending or receiving a message
        • g = minimum time between message transmissions or receptions (message bandwidth)
        • G = bulk gap = time per byte transferred for long transfers (byte bandwidth)
      • Workstations connected by a Myrinet network with the Generic Active Messages layer
      • Delay insertion technique
      • Applications written in Split-C, but they perform their own data caching
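A back-of-the-envelope cost model built from the parameter definitions above, assuming the textbook LogGP formulation in which a k-byte message costs a send overhead, (k - 1) byte gaps, the wire latency, and a receive overhead:

    /* Predicted one-way time of a k-byte message under LogGP. */
    double msg_time(double L, double o, double G, int k) {
        return o + (k - 1) * G + L + o;   /* send o, byte gaps, wire, receive o */
    }

    /* n back-to-back short messages from one sender: once o < g, the
       gap g (not the overhead) limits the issue rate. */
    double burst_time(double L, double o, double g, int n) {
        return (n - 1) * g + o + L + o;
    }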

  23. Sensitivity to Overhead

  24. Sensitivity to Gap

  25. Sensitivity to Latency

  26. Sensitivity to Bulk Gap

  27. Summary
      • Runtime is strongly dependent on overhead and gap
      • The strong dependence on gap is due to the burstiness of communication
      • Not so sensitive to latency → computation and communication can be effectively overlapped using non-blocking reads (writes usually do not stall the processor)
      • Not sensitive to bulk gap → we have more bandwidth than we know what to do with

  28. What's the Point?
      • What can we take away from Martin et al.'s study?
        • It's extremely important to reduce overhead, because it may affect both "o" and "g"
        • All the "action" is currently in the OS and the network interface card (NIC)
      • Subject of von Eicken et al., "Active Messages: a Mechanism for Integrated Communication and Computation," ISCA 1992

  29. An Efficient Low-Level Message Passing Interface
      • von Eicken et al., "Active Messages: a Mechanism for Integrated Communication and Computation," ISCA 1992
      • von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," SOSP 1995
      • Santos, Bianchini, and Amorim, "A Survey of Messaging Software Issues and Systems for Myrinet-Based Clusters," PDCP 1999

  30. von Eicken et al.: Active Messages
      • Design challenge for large-scale multiprocessors:
        • Minimize communication overhead
        • Allow computation to overlap communication
        • Coordinate the above two without sacrificing processor cost/performance
      • Problems with traditional message passing:
        • Send/receive are usually synchronous; no overlap between communication and computation
        • If not synchronous, buffering is needed (inside the kernel) on the receive side
      • Active Messages approach:
        • Asynchronous communication model (send and continue)
        • The message specifies a handler that integrates the message into the ongoing computation on the receiving side (see the sketch after this slide)
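The core of the active-message idea in a few lines of C; the message layout and names below are illustrative, not the paper's wire format:

    #include <stddef.h>

    typedef void (*am_handler_t)(void *payload, size_t len);

    /* An active message: the handler reference travels with the data.
       A raw code address is only meaningful across nodes under SPMD,
       where every node runs the same binary. */
    struct active_msg {
        am_handler_t handler;
        size_t       len;
        char         payload[64];
    };

    /* On arrival there is no buffering and no rendezvous: the receiver
       simply runs the named handler to fold the data into the ongoing
       computation. */
    void am_deliver(struct active_msg *m) {
        m->handler(m->payload, m->len);
    }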

  31. Buffering
      • Remember the buffering problem: what to do if the receiver is not ready to receive?
        • Drop the message
          • Typically very costly because of recovery costs
        • Leave the message in the NIC
          • Reduces network utilization
          • Can result in deadlocks
        • Wait until the receiver is ready (synchronous or 3-phase protocol)
        • Copy to an OS buffer and later copy to the user buffer

  32. 3-phase Protocol

  33. Incoming Message Copying
      [Figure: an incoming message is staged in message buffers in the OS address space, then copied into the process address space]

  34. Copying - Don't Do It!
      [Figure: data from Hennessy and Patterson, 1996]

  35. Overhead of Many Native MIs Is Too High
      • Recall that overhead is critical to application performance
      • Asynchronous send and receive overheads on many platforms (back in 1991):
        [Table: per-platform values of Ts = time to start a message, Tb = time per byte, and Tfb = time per flop (for comparison)]

  36. Message Latency on Two Different LAN Technologies

  37. von Eicken et al.: Active Receive
      • The key idea is really to optimize receive: buffer management is more complex on the receiver
      [Figure: a message carries a handler reference followed by its data]

  38. Active Receive Is More Efficient
      [Figure: with an active message, data moves directly between P0 and P1; with copying, each message detours through OS buffers on both ends]

  39. Active Message Performance

                    Send                        Receive
                    Instructions   Time (µs)    Instructions   Time (µs)
      nCUBE/2       21             11.0         34             15.0
      CM-5                          1.6                         1.7

      • The main difference between these AM implementations is that the CM-5 allows direct, user-level access to the network interface. More on this in a minute!

  40. Any Drawback to Active Messages?
      • Active messages → SPMD
        • SPMD: Single Program Multiple Data
        • This is because the sender must know the address of the handler on the receiver
      • Not absolutely necessary, however
        • Can use indirection, i.e., a table mapping handler IDs to addresses on the receiver (sketched below). The mapping has a performance cost, though.
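The indirection idea, sketched in C with hypothetical names: the wire carries a small integer ID instead of a code address, so nodes need not share a memory layout:

    #include <stddef.h>

    #define MAX_HANDLERS 256

    typedef void (*am_handler_t)(void *payload, size_t len);

    /* Each node registers its handlers under IDs agreed upon by all
       nodes, so a sender never needs the receiver's code addresses. */
    static am_handler_t handler_table[MAX_HANDLERS];

    void am_register(int id, am_handler_t fn) {
        handler_table[id] = fn;
    }

    void am_deliver_by_id(int id, void *payload, size_t len) {
        if (id >= 0 && id < MAX_HANDLERS && handler_table[id])
            handler_table[id](payload, len);  /* one table lookup vs. a direct call */
    }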

  41. User-Level Access to NIC
      • Basic idea: allow protected user-level access to the NIC for implementing communication protocols at user level

  42. User-Level Communication
      • Basic idea: remove the kernel from the critical path of sending and receiving messages
        • User memory to user memory: zero copy
        • Permission is checked once, when the mapping is established
        • Buffer management is left to the application
      • Advantages:
        • Low communication latency
        • Low processor overhead
        • Approaches the raw latency and bandwidth provided by the network
      • One approach: U-Net

  43. U-Net Abstraction

  44. U-Net Endpoints

  45. U-Net Basics
      • Protection is provided by endpoints and communication channels
        • Endpoints, communication segments, and message queues are accessible only by the owning process (all are allocated in user memory)
        • Outgoing messages are tagged with the originating endpoint address, and incoming messages are demultiplexed and delivered only to the correct endpoints
      • For ideal performance, firmware on the NIC should implement the actual messaging and NI multiplexing (including tag checking). Protection must be implemented by the OS by validating requests for the creation of endpoints. Channel registration should also be implemented by the OS.
      • Message queues can be placed in different memories to optimize polling:
        • Receive queue allocated in host memory
        • Send and free queues allocated in NIC memory
      (A rough data-structure sketch of an endpoint follows.)
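A rough C sketch of an endpoint as described above; the field names, queue depths, and descriptor layout are guesses for illustration, since the real layout is NIC-specific:

    #include <stdint.h>

    /* Descriptor naming a region of the communication segment. */
    struct unet_desc {
        uint32_t offset;    /* into the communication segment */
        uint32_t len;
        uint32_t channel;   /* tag used to demultiplex to the right endpoint */
    };

    /* One endpoint, allocated in user memory; the kernel validates its
       creation and channel registration, then stays off the fast path. */
    struct unet_endpoint {
        char             *comm_segment;   /* pinned buffer area */
        struct unet_desc  recv_q[128];    /* in host memory, cheap to poll */
        struct unet_desc  send_q[64];     /* in NIC memory */
        struct unet_desc  free_q[64];     /* empty buffers handed to the NIC */
    };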

  46. U-Net Performance on ATM

  47. U-Net UDP Performance

  48. U-Net TCP Performance

  49. U-Net Latency

  50. Virtual Memory-Mapped Communication
      • The receiver exports its receive buffers
      • The sender must import a receive buffer before sending
      • The sender's permission to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)
      • The sender can communicate directly with the network interface to send data into imported buffers without kernel intervention
      • At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention
      (An interface-level sketch of this handshake appears below.)
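An interface-level C sketch of the export/import handshake; every name here is hypothetical, intended only to make the steps above concrete:

    #include <stddef.h>

    /* Receiver: make a local buffer writable by remote senders.
       Permission is checked once, here, not on every transfer. */
    int vmmc_export(void *buf, size_t len, int *out_buf_id);

    /* Sender: obtain a handle for a buffer the receiver exported. */
    int vmmc_import(int node, int buf_id, int *out_handle);

    /* Sender: the network interface moves the data straight into the
       remote exported buffer; no kernel crossing, no receiver-side copy. */
    int vmmc_send(int handle, size_t offset, const void *src, size_t len);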
