Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand


Presentation Transcript


  1. Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand
  Paper 1: Design and Implementation of MPICH-2 over InfiniBand with RDMA Support (Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen)
  Paper 2: Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand (Tipparaju, Santhanaraman, Nieplocha, Panda)
  Presented by Nikola Vouk. Advisor: Dr. Frank Mueller

  2. Background: General Buffer Manipulation in Communication Protocols

  3. InfiniBand • 7.6 microsecond latency • 857 MB/s peak bandwidth • Send/receive queue + work completion interface • Asynchronous calls • Remote Direct Memory Access (RDMA) • Sits between a shared-memory architecture and MPI • Not exactly NUMA, but close • Provides a channel interface (read/write) for communication • For protection, each side registers the memory that other hosts are then free to access (see the sketch below)
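
  To make the slide's registration and RDMA pieces concrete, here is a minimal C sketch of the verbs-level pattern (an illustration, not code from either paper): register a buffer, then post a one-sided RDMA write into a peer's registered memory. It assumes a connected queue pair and protection domain are already set up and that the peer's remote_addr/rkey arrived out of band.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                           void *buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
    {
        /* Register the buffer so the HCA is allowed to DMA it. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr;
        memset(&wr, 0, sizeof wr);
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
        wr.wr.rdma.remote_addr = remote_addr;        /* peer's registered memory */
        wr.wr.rdma.rkey        = rkey;

        /* Asynchronous: returns immediately; the completion queue reports
         * when the write has actually finished. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }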

  4. Common Problems • Link-layer/network-protocol inefficiencies (unnecessary messages sent) • User-space to system-buffer copy overhead (copy time) • Synchronous sending/receiving and computing (the application has to stop in order to handle requests)

  5. Problem 1: Message Passing Protocol
  The basic InfiniBand protocol requires three matching writes.
  RDMA Channel Interface
  Put operation (sketched in code below):
  • Copy user buffer to pre-registered buffer
  • RDMA write buffer to receiver
  • Adjust local head pointer
  • RDMA write new head pointer to receiver
  • Return bytes written
  Get operation:
  • Copy data from shared memory to user buffer
  • Adjust tail pointer
  • RDMA write new tail pointer to sender
  • Return bytes read
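
  A minimal C sketch of the put side, showing why each put costs a copy plus two matched writes. The ring layout and the rdma_write helper are assumptions for illustration (the helper stands in for a verbs-level write like the one sketched earlier), not the papers' code.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: one-sided write of len bytes to a remote address. */
    void rdma_write(uint64_t remote_addr, const void *local, size_t len);

    typedef struct {
        char    *ring;         /* pre-registered circular buffer      */
        size_t   size, head;   /* capacity and local producer index   */
        uint64_t remote_ring;  /* peer's registered ring (from setup) */
        uint64_t remote_head;  /* peer's copy of our head pointer     */
    } rdma_channel;

    size_t channel_put(rdma_channel *ch, const void *user_buf, size_t len)
    {
        /* 1. Copy the user buffer into the pre-registered ring
         *    (wrap-around handling elided for brevity). */
        memcpy(ch->ring + ch->head, user_buf, len);
        /* 2. RDMA write the data to the receiver. */
        rdma_write(ch->remote_ring + ch->head, ch->ring + ch->head, len);
        /* 3. Adjust the local head pointer. */
        ch->head = (ch->head + len) % ch->size;
        /* 4. RDMA write the new head pointer so the receiver sees the data. */
        rdma_write(ch->remote_head, &ch->head, sizeof ch->head);
        /* 5. Return bytes written. */
        return len;
    }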

  6. Solutions: Piggybacking and Pipelining
  • Piggybacking: send the pointer update along with the data packets
  • Pipelining: chop buffers into packet-sized pieces and send them out as the message comes in
  • An improvement, but still less than 870 MB/s (both ideas are sketched below)
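
  A sketch of both fixes, continuing the channel sketch above; the packet layout and PKT_SIZE are assumptions. Piggybacking folds the head-pointer update into the same RDMA write as the data, and pipelining streams a large buffer in packet-sized pieces so copying overlaps wire time.

    #define PKT_SIZE 8192      /* assumed packet size */

    /* Piggybacking: the pointer update rides in the packet header, so the
     * separate head-pointer write from the previous slide disappears. */
    typedef struct {
        uint32_t len;     /* payload bytes in this packet           */
        uint32_t head;    /* sender's new head pointer, piggybacked */
        char     data[];  /* payload follows in the same RDMA write */
    } channel_packet;

    /* Pipelining: push each piece as soon as it is copied. */
    void pipelined_put(rdma_channel *ch, const char *user_buf, size_t len)
    {
        for (size_t off = 0; off < len; off += PKT_SIZE) {
            size_t n = len - off < PKT_SIZE ? len - off : PKT_SIZE;
            channel_put(ch, user_buf + off, n);
        }
    }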

  7. Problem 2: Internal Buffer Copying Overhead. Solution: Zero-Copy Buffers • Internal overhead: the user must copy data to the system (and into a registered memory slot) • Zero-copy allows the system to read directly from the user buffer

  8. Zero-Copy Protocol at Different Levels of the MPICH Hierarchy
  If the packet is large enough:
  • Register the user buffer
  • Notify the end host of the request
  • The end host sends an RDMA read
  • The read pulls directly from user buffer space
  (A sketch of this path follows below.)
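
  A sketch of that large-message (rendezvous) path. The threshold and the helper names (conn_t, eager_send, register_buffer, send_rts) are assumptions for illustration, not MPICH internals.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    #define RNDV_THRESHOLD (32 * 1024)   /* assumed large-message cutoff */

    void zero_copy_send(conn_t *c, void *user_buf, size_t len)
    {
        if (len < RNDV_THRESHOLD) {      /* small messages keep the copy path */
            eager_send(c, user_buf, len);
            return;
        }
        /* 1. Register the user buffer in place: no intermediate copy. */
        struct ibv_mr *mr = register_buffer(c, user_buf, len);
        /* 2. Notify the end host: a request-to-send carrying address + rkey. */
        send_rts(c, (uintptr_t)user_buf, len, mr->rkey);
        /* 3. The end host posts an RDMA read against that address and pulls
         *    the data straight out of user buffer space; its completion
         *    notice then lets the sender deregister the buffer. */
    }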

  9. Comparing Interfaces: CH3 Interface vs. RDMA Interface • Implemented directly off of the CH3 interface • More flexible due to access to the complete ADI-3 interface • Always uses RDMA write

  10. CH3 Implementation Performance • A function of the raw underlying network performance

  11. • Pipelining always performed the worst • RDMA Channel within 1% of CH3

  12. Problem 3: Too Much Overhead, Not Enough Execution
  Unanswered problems:
  • Registration overhead is still there, even in the cached version
  • Data transfer still requires significant cooperation from both sides (taking time away from computation)
  • Non-contiguous data is not addressed
  Solutions:
  • Provide a custom API that allocates out of large pre-registered memory chunks (see the allocator sketch below)
  • Overlap communication with computation as much as possible
  • Apply zero-copy techniques using scatter/gather RDMA calls
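
  A minimal sketch of the first solution, with assumed names: a bump allocator that hands out pieces of one large chunk registered once up front, so the per-message registration cost disappears.

    #include <infiniband/verbs.h>
    #include <stddef.h>

    typedef struct {
        struct ibv_mr *mr;    /* covers the whole chunk, registered once */
        char          *base;  /* start of the pre-registered chunk       */
        size_t         size, used;
    } reg_pool;

    void *pool_alloc(reg_pool *p, size_t n)
    {
        n = (n + 63) & ~(size_t)63;   /* keep blocks cache-line aligned   */
        if (p->used + n > p->size)
            return NULL;              /* a real API would grow or recycle */
        void *ptr = p->base + p->used;
        p->used += n;
        return ptr;                   /* already registered: reuse p->mr->lkey */
    }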

  13. Host-Assisted Zero-Copy Protocol • The host sends the receiver a request for a gather • The receiver posts a descriptor and continues working • Can be implemented as a "helper" thread on the receiving host • Same as the previous zero-copy idea, but extended to non-contiguous data (a scatter/gather sketch follows below)
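
  A sketch of the scatter/gather piece, with assumed names and sizes: a single verbs work request may carry several ibv_sge entries, so strided, non-contiguous data (for example, matrix columns) moves in one zero-copy operation.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    #define NCOLS 16   /* assumed number of columns to gather */

    int gather_write(struct ibv_qp *qp, struct ibv_mr *mr, char *matrix,
                     size_t col_stride, uint32_t col_bytes,
                     uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sges[NCOLS];
        for (int i = 0; i < NCOLS; i++) {
            sges[i].addr   = (uintptr_t)(matrix + (size_t)i * col_stride);
            sges[i].length = col_bytes;
            sges[i].lkey   = mr->lkey;   /* region holding the matrix */
        }
        struct ibv_send_wr wr, *bad_wr;
        memset(&wr, 0, sizeof wr);
        wr.opcode  = IBV_WR_RDMA_WRITE;  /* gathers locally, lands contiguous */
        wr.sg_list = sges;
        wr.num_sge = NCOLS;              /* must not exceed the QP's max_send_sge */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad_wr);
    }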

  14. NAS MG • Again, the pipelined method performs similarly to the zero-copy method

  15. SUMMA Matrix Multiplication • Significant benefit from Host-Assisted Zero-Copy

  16. Conclusions
  • Minimizing internal memory copying removes the primary memory performance obstacle
  • InfiniBand allows DMA that offloads work from the CPU; coordinating registered memory can minimize CPU involvement further
  • With proper coding, existing MPI programs can achieve almost wire speed over InfiniBand
  • Could be implemented on other architectures (Gig-E, Myrinet)

  17. Thesis Implications • Buddy MPICH is also a latency-hiding implementation of MPICH • Separation happens at the ADI layer: the buddy thread listens for connections and accepts work from the worker thread via send/receive queues (sketched below)
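
  A minimal pthreads sketch of that split, as an assumption about its shape rather than Buddy MPICH's actual code: the worker thread enqueues requests and keeps computing while the buddy thread blocks on the queue and services communication.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct { void *buf; size_t len; int peer; } work_item;

    void service_request(work_item *w);   /* hypothetical: does the transfer */

    static work_item       queue[64];
    static unsigned        q_head, q_tail;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cv   = PTHREAD_COND_INITIALIZER;

    void submit(work_item w)              /* called by the worker thread */
    {
        pthread_mutex_lock(&q_lock);
        queue[q_tail++ % 64] = w;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }

    void *buddy_main(void *arg)           /* body of the buddy thread */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (q_head == q_tail)      /* sleep until work arrives */
                pthread_cond_wait(&q_cv, &q_lock);
            work_item w = queue[q_head++ % 64];
            pthread_mutex_unlock(&q_lock);
            service_request(&w);
        }
        return NULL;
    }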
