
Efficient Asynchronous Message Passing via SCI with Zero-Copying




  1. Efficient Asynchronous Message Passing via SCI with Zero-Copying SCI Europe 2001 – Trinity College Dublin Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*

  2. Agenda
  • What is Zero-Copying? What is it good for? Zero-Copying with SCI
  • Support through the SMI library (Shared Memory Interface)
  • Zero-copy protocols in SCI-MPICH: memory allocation, setups, performance optimizations
  • Performance evaluation: point-to-point, application kernel, asynchronous communication
  SCI Europe 2001 – Trinity College Dublin, Lehrstuhl für Betriebssysteme

  3. Zero-Copying
  • Transfer of data between two user-level accessible memory buffers with N explicit intermediate copies: N-way copying
  • No intermediate copy: zero-copying
  • Effective bandwidth and efficiency (formula on slide)

  4. Efficiency Comparison (chart: SCI DMA vs. FastEthernet vs. GigaEthernet)

  5. Zero-Copying with SCI
  • SCI does zero-copy by nature, but SCI via the I/O bus is limited:
    • No SMP-style shared memory
    • Specially allocated memory regions were required
    • No general zero-copy possible
  • New possibility: using user-allocated buffers for SCI communication
    • Allows general zero-copy!
    • Connection setup is always required.

  6. SMI Library: Shared Memory Interface
  • High-level SCI support library for parallel applications or libraries
  • Application startup
  • Synchronization & basic communication
  • Shared-memory setup: collective regions, point-to-point regions, individual regions
  • Dynamic memory management
  • Data transfer

  7. Data Moving (I)
  • Shared-memory paradigm:
    • Import remote memory into the local address space
    • Perform memcpy() or possibly DMA
  • SMI support:
    • Region type REMOTE
    • Synchronous (PIO): SMI_Memcpy()
    • Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait()
  • Problems:
    • High mapping overhead
    • Resource usage (ATT entries on the PCI-SCI adapter)

  8. Mapping Overhead (chart)
  • Not suitable for dynamic memory setups!

  9. Data Moving (II)
  • Connection paradigm:
    • Connect to a remote memory location
    • No representation in the local address space: only DMA possible
  • SMI support:
    • Region type RDMA
    • Synchronous / asynchronous DMA: SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait
  • Problems:
    • Alignment restrictions
    • Source needs to be pinned down

  10. Setup Acceleration
  • Memory buffer setup costs time!
  • Reduce the number of operations to increase performance; desirable: only one operation per buffer
  • Problem: limited resources
  • Solution: caching of SCI segment states by lazy release
    • Leave buffers registered, remote segments connected or mapped
    • Release unneeded resources if setup of a new resource fails
    • Different replacement strategies possible: LRU, LFU, best-fit, random, immediate
  • Attention: remote segment deallocation!
    • Callback on connection event to release the local connection
  • MPI persistent communication operations: pre-register the user buffer & higher "hold" priority

  11. Memory Allocation
  • Allocate "good" memory: MPI_Alloc_mem() / MPI_Free_mem()
    • Part of MPI-2 (mostly for one-sided operations)
  • SCI-MPICH defines attributes:
    • type: shared, private or default (shared memory performs best)
    • alignment: none, specified or default (non-shared memory should be page-aligned)
  • "Good" memory should only be enforced for communication buffers!

  12. Zero-Copy Protocols
  • Applicable for the hand-shake based rendez-vous protocol
  • Requirements:
    • Registered user-allocated buffers, or
    • Regular SCI segments: "good" memory via MPI_Alloc_mem()
  • The state of a memory range must be known: SMI provides query functionality
  • Registering / connection / mapping may fail
    • Several different setups possible
    • Fallback mechanism required

  13. Asynchronous Rendez-Vous (diagram)
  • Sender side: the application thread issues Isend; the device thread sends an "ask to send" control message
  • Receiver side: the application thread issues Irecv; the device thread answers "OK to send"
  • The device threads perform the data transfer and exchange "done" messages while both application threads wait, then continue

  14. Test Setup
  • Systems used for performance evaluation:
    • Pentium-III @ 800 MHz, 512 MB RAM @ 133 MHz
    • 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE)
    • Dolphin D330 (single ring topology)
    • Linux 2.4.4-bigphysarea
    • Modified SCI driver (user memory for SCI)

  15. Bandwidth Comparison (chart)

  16. Application Kernel: NPB IS
  • Parallel bucket sort; keys are integer numbers
  • Dominant communication: MPI_Alltoallv for the distributed key array (chart)

  17. MPI_Alltoallv Performance
  • MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall
  • Improved performance with asynchronous DMA operations
  • Application speedup deduced

  18. Asynchronous Communication
  • Goal: overlap computation & communication
  • How to quantify the efficiency of this?
  • Typical overlapping effect (chart: total time vs. computation time, synchronous vs. asynchronous)

  19. Saturation and Efficiency (I)
  • Two parameters are required:
    • Saturation s: duration of the computation period required to make the total time (communication & computation) increase
    • Efficiency e: relation of overhead to message latency

  20. Saturation and Efficiency (II) (diagram: ttotal, tmsg_a, ttotal - tbusy, tmsg_s, tbusy; saturation s for synchronous vs. asynchronous communication)

  21. Experimental Setup: Overlap
  Micro-benchmark to quantify overlapping:

    latency = MPI_Wtime()
    if (sender)
        MPI_Isend(msg, msgsize)
        while (elapsed_time < spinning_duration)
            spin (with multiple threads)
        MPI_Wait()
    else
        MPI_Recv()
    latency = MPI_Wtime() - latency

  22. Experimental Setup: Spinning
  Different ways of keeping the CPU busy:
  • FIXED: spin on a single variable for a given amount of CPU time (no memory stress)
  • DAXPY: perform a given number of DAXPY operations on vectors x and y sized equivalent to the message (stresses the memory system)

  23. DAXPY – 64 KiB Message (chart)

  24. DAXPY – 256 KiB Message (chart)

  25. FIXED – 64 KiB Message (chart)

  26. Asynchronous Performance
  • Saturation and efficiency derived from the experiments (table)

  27. Summary & Outlook
  • Efficient utilization of new SCI driver functionality for MPI communication:
    • Max. bandwidth of 230 MiB/s (regular segments) / 190 MiB/s (user memory)
    • Connection overhead hidden by segment caching
    • Asynchronous communication pays off much earlier than before
  • New(?) quantification scheme for the efficiency of asynchronous communication
  • Flexible MPI memory allocation supports the MPI application writer
  • Connection-oriented DMA transfers reduce resource utilization
  • Open issues: DMA alignment problems; segment callback required for improved connection caching
