
Communication latency in distributed memory systems

Presentation Transcript


  1. Communication latency in distributed memory systems Ming Kin Lai May 4, 2007

  2. Latency • Latency = time delay • Memory latency = time that elapses between making a request for a value stored in memory and receiving the associated value • Communication (or network) latency = time between sending and starting to receive data on a network link

  3. Memory latency • On a uniprocessor - memory wall – speed gap between processor and memory • On a NUMA machine (tightly coupled distributed memory multiprocessor), memory latency is • small for local reference • large for remote reference

  4. Reduction of memory latency • Tolerance (hiding) – hiding the effect of memory-access latencies by overlapping useful computation with memory references • Avoidance – minimizing remote references by co-locating computation with the data it accesses • Latency tolerance and avoidance are complementary to each other

  5. Memory latency avoidance • Use of cache

  6. Memory latency tolerance • Multithreading • when one thread waits for a memory access, another is executed • successful when the overhead associated with the implementation, e.g. thread-switching time, is less than the benefit gained from overlapping computation and memory access • Prefetching (into cache) – see the sketch after this slide • Out-of-order execution
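
As an illustration of the prefetching bullet, here is a minimal C sketch using the GCC/Clang __builtin_prefetch intrinsic; the dot-product kernel and the prefetch distance PF_DIST are assumptions chosen for illustration, not taken from the slides. The loads for iterations PF_DIST ahead are requested early, so the memory access overlaps with the current iteration's arithmetic.

    #include <stddef.h>

    /* Prefetch distance: an assumed tuning parameter, typically
     * found empirically for a given machine. */
    #define PF_DIST 16

    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n) {
                /* GCC/Clang builtin: hint to bring future operands
                 * into cache for reading (rw=0), low temporal reuse. */
                __builtin_prefetch(&a[i + PF_DIST], 0, 1);
                __builtin_prefetch(&b[i + PF_DIST], 0, 1);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }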

  7. Reduction of network (communication) latency • Tolerance (hiding) – hiding the effect of latencies by overlapping useful computation with communication • Avoidance – minimizing communication by co-locating computation with the data it accesses

  8. Communication latency tolerance • In general, successful only if other work is available • Multithreading • when one thread waits for communication, another is executed • successful when the overhead associated with the implementation, e.g. thread-switching time, is less than the benefit gained from overlapping computation and communication (see the threading sketch after this slide)
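
A minimal POSIX-threads sketch of this overlap, using a pipe as a stand-in for a network link; recv_thread, BUF_SZ, and the dummy computation are illustrative assumptions. One thread blocks in read() (the communication wait) while the main thread keeps computing; the pattern pays off only when the thread-management overhead stays below the overlap gained. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BUF_SZ 4096

    struct recv_arg { int fd; char buf[BUF_SZ]; ssize_t got; };

    /* Receiver thread: blocks until the "remote" data arrives. */
    static void *recv_thread(void *p)
    {
        struct recv_arg *a = p;
        a->got = read(a->fd, a->buf, BUF_SZ);  /* communication wait */
        return NULL;
    }

    int main(void)
    {
        int fds[2];
        if (pipe(fds) != 0) return 1;   /* stand-in for a network link */
        write(fds[1], "hello", 5);      /* pretend a peer sent data    */

        struct recv_arg a = { .fd = fds[0] };
        pthread_t t;
        pthread_create(&t, NULL, recv_thread, &a);

        /* Useful computation proceeds while the receiver waits. */
        double sum = 0.0;
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;

        pthread_join(t, NULL);
        printf("received %zd bytes, sum = %f\n", a.got, sum);
        return 0;
    }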

  9. Communication latency tolerance • Multithreading • exploits parallelism across multiple threads • Prefetching • finds parallelism within a single thread • the request for data (i.e. the prefetch request) must be issued far enough in advance of the data's use in the execution stream • requires the ability to predict what data is needed ahead of time

  10. Comm latency avoidance in software DSMs • Data replication in local memory, using local memory as a cache for remote locations • Relaxed consistency models

  11. Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo 1998 • Comm latency avoidance is not sufficient • Comm latency tolerance is needed • By combining prefetching and multithreading such that • multithreading hides synchronization latency, and • prefetching hides memory latency, 3 of the 8 applications achieve better performance than with either technique used individually.

  12. Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo 1998 • But combining prefetching and multithreading such that both techniques attempt to hide memory latency is not a good idea – redundant overhead • The best overall approach depends on • the predictability of memory access patterns • the extent to which lock stalls dominate synchronization time • etc.

  13. Comm latency tolerance in message passing • “Appropriate” placement of non-blocking communication calls • MPI’s non-blocking calls • MPI_Isend() and MPI_Irecv()

  14. MPI’s non-blocking send • A non-blocking post-send (MPI_Isend) initiates a send operation, but does not complete it • The post-send may return before the message is copied out of the send buffer • A separate complete-send (MPI_Wait) call is needed to verify that the send operation has completed, i.e. that the data has been copied out of the send buffer

  15. MPI’s non-blocking receive • A non-blocking post-receive (MPI_Irecv) initiates a receive operation, but does not complete it • The post-receive may return before the message is stored into the receive buffer • A separate complete-receive (MPI_Wait) call is needed to verify that the receive operation has completed, i.e. that the data has been received into the receive buffer (see the sketch after this slide)
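
Putting slides 13-15 together, a minimal two-rank sketch of the post/complete cycle might look as follows; N and TAG are illustrative assumptions. Each non-blocking post is paired with an MPI_Wait that completes it.

    #include <mpi.h>
    #include <stdio.h>

    #define N   1024   /* message length: an illustrative assumption */
    #define TAG 7

    int main(int argc, char **argv)
    {
        int rank;
        double buf[N];
        MPI_Request req;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < N; i++) buf[i] = i;
            MPI_Isend(buf, N, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &status);  /* complete-send: buf reusable now */
        } else if (rank == 1) {
            MPI_Irecv(buf, N, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &status);  /* complete-receive: data is in buf */
            printf("rank 1 got buf[0] = %g\n", buf[0]);
        }

        MPI_Finalize();
        return 0;
    }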

  16. MPI_Send = MPI_Isend + MPI_Wait • MPI_Recv = MPI_Irecv + MPI_Wait

  17. “Appropriate” placement of non-blocking calls • To achieve maximum overlap between computation and communication, communications should be • started as soon as possible • completed as late as possible • A send should be • posted as soon as the data to be sent is available • completed just before the send buffer is to be reused • A receive should be • posted as soon as the receive buffer is available • completed just before the data in the receive buffer is to be used • Sometimes, overlap can be increased by re-ordering computations (see the halo-exchange sketch after this slide)
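
A sketch of this placement in the style of a halo exchange; N, TAG, and the compute functions are hypothetical stand-ins for application work, not taken from the slides. The receive is posted as soon as its buffer is free, the send as soon as the data is ready, and both are completed only just before the incoming data is needed. Note that the overlapped work only reads the send buffer: it must not be modified until the send completes.

    #include <mpi.h>

    #define N   4096   /* halo width: an illustrative assumption */
    #define TAG 11

    /* Placeholder work that reads u but writes w, leaving the
     * in-flight send buffer u untouched. */
    static void compute_interior(const double *u, double *w, int n)
    { for (int i = 1; i < n - 1; i++) w[i] = 0.5 * u[i]; }

    static void compute_boundary(double *w, int n, const double *halo)
    { w[0] += halo[0]; w[n - 1] += halo[n - 1]; }

    void exchange_and_compute(double *u, double *w, double *halo,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Post the receive as soon as the receive buffer is free ... */
        MPI_Irecv(halo, N, MPI_DOUBLE, left, TAG, comm, &reqs[0]);
        /* ... and the send as soon as the outgoing data is ready. */
        MPI_Isend(u, N, MPI_DOUBLE, right, TAG, comm, &reqs[1]);

        /* Overlap: computation that does not need the incoming halo. */
        compute_interior(u, w, N);

        /* Complete as late as possible: just before the received data
         * is used and the send buffer is reused. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(w, N, halo);
    }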

  18. Communication latency tolerance in MESSENGERS-C • Multithreading • Multiple sending and receiving threads • I/O multiplexing using poll() (similar to select()) with sockets
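
The poll()-based multiplexing mentioned on this slide might look roughly like the following sketch; NFDS and service_sockets are illustrative assumptions. A zero timeout makes poll() return immediately, so the thread can fall through to useful computation whenever no socket is ready, instead of blocking in a single read().

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NFDS 4   /* number of connections: an assumption */

    void service_sockets(int fds[NFDS])
    {
        struct pollfd pfds[NFDS];
        char buf[4096];

        for (int i = 0; i < NFDS; i++) {
            pfds[i].fd     = fds[i];
            pfds[i].events = POLLIN;          /* wake when readable */
        }

        for (;;) {
            int ready = poll(pfds, NFDS, 0);  /* timeout 0: never blocks */
            if (ready < 0) { perror("poll"); return; }

            if (ready == 0) {
                /* Nothing arrived: do useful computation here instead
                 * of blocking on a single connection. */
                continue;
            }

            for (int i = 0; i < NFDS; i++)
                if (pfds[i].revents & POLLIN)
                    read(pfds[i].fd, buf, sizeof buf);  /* drain one msg */
        }
    }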

  19. References • Matthew Haines, Wim Bohm, “An Evaluation of Software Multithreading in a Conventional Distributed Memory Multiprocessor,” 1993 • Yong Yan, Xiaodong Zhang, Zhao Zhang, “A Memory-layout Oriented Run-time Technique for Locality Optimization,” 1995 • P. H. Wang, H. Wang, J. D. Collins, E. Grochowski, M. Kling, J. P. Shen, “Memory Latency-Tolerance Approaches for Itanium Processors: Out-of-order Execution vs. Speculative Precomputation,” Eighth International Symposium on High-Performance Computer Architecture (HPCA-8), 2002
