Communication latency in distributed memory systems
Presentation Transcript



Latency

  • Latency = time delay

  • Memory latency = time that elapses between making a request for a value stored in memory and receiving the associated value

  • Communication (or network) latency = time between sending and starting to receive data on a network link


Memory latency

  • On a uniprocessor, the problem is the memory wall: the growing speed gap between processor and memory

  • On a NUMA machine (tightly coupled distributed memory multiprocessor), memory latency is

    • small for local reference

    • large for remote reference


Reduction of memory latency

  • Tolerance (hiding) – hiding the effect of memory-access latencies by overlapping useful computation with memory references

  • Avoidance – minimize remote references by co-locating computation with data it accesses

  • Latency tolerance and avoidance are complementary to each other



Memory latency tolerance

  • Multithreading

    • when one thread waits for data access, the other is executed

    • successful when the overhead associated with the implementation, e.g. thread-switching time, is less than the benefit gained from overlapping computation and memory access

  • Prefetching (into cache) – see the sketch after this list

  • Out-of-order execution
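As an illustration of the prefetching idea, here is a minimal C sketch using the GCC/Clang __builtin_prefetch intrinsic; the function name, array, and prefetch distance are made up for the example, and the right distance in practice depends on the machine's memory latency and the work per iteration.

    #include <stddef.h>

    /* Sum an array while software-prefetching ahead of the current
     * element, so the memory access overlaps with useful computation. */
    #define PREFETCH_DIST 16   /* tunable distance, in elements */

    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/3);
            s += a[i];   /* useful work while the prefetched line is in flight */
        }
        return s;
    }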


Reduction of network (communication) latency

  • Tolerance (hiding) – hiding the effect of latencies by overlapping useful computation with communication

  • Avoidance – minimize communication by co-locating computation with data it accesses


Communication latency tolerance

  • In general, successful only if there is independent work available to overlap with the communication

  • Multithreading

    • when one thread waits for communication, the other is executed

    • successful when the overhead associated with the implementation, e.g. thread-switching time, is less than the benefit gained from overlapping computation and communication
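A minimal POSIX-threads sketch of this idea, assuming a made-up blocking call fetch_remote() that stands in for a real communication primitive (the sleep inside it models the network latency being hidden):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static double buf[1024];
    static double partial;

    /* Stand-in for a blocking receive; the sleep models the network
     * latency that the compute thread hides with useful work. */
    static void fetch_remote(double *b, int n)
    {
        usleep(100 * 1000);                /* pretend 100 ms of network delay */
        for (int i = 0; i < n; i++)
            b[i] = (double)i;
    }

    static void *comm_thread(void *arg)
    {
        (void)arg;
        fetch_remote(buf, 1024);           /* this thread blocks... */
        return NULL;
    }

    static void *compute_thread(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)  /* ...while this one computes */
            partial += i * 0.5;
        return NULL;
    }

    int main(void)                         /* compile with -pthread */
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, comm_thread, NULL);
        pthread_create(&t2, NULL, compute_thread, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("buf[0] = %f, partial = %f\n", buf[0], partial);
        return 0;
    }

The overlap pays off only when the thread-creation and switching costs stay below the communication time being hidden, which is exactly the condition stated above.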


Communication latency tolerance

  • Multithreading

    • exploits parallelism across multiple threads

  • Prefetching

    • finds parallelism within a single thread

    • the request for data (i.e. the prefetch request) must be issued far in advance of the use of the data in the execution stream

    • requires ability to predict what data is needed ahead of time


Comm latency avoidance in software DSMs

  • Data replication in local memory, using local memory as a cache for remote locations

  • Relaxed consistency models


Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo 1998

  • Comm latency avoidance is not sufficient

  • Comm latency tolerance is needed

  • By combining prefetching and multithreading such that

    • multithreading hides synchronization latency, and

    • prefetching hides memory latency,

      3 of the 8 applications studied achieve better performance than with either technique used individually.


Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo 1998

  • But combining prefetching and multithreading such that both techniques attempt to hide memory latency is not a good idea – the two techniques incur redundant overhead

  • Best overall approach depends on

    • predictability of memory access patterns

    • the extent to which lock stalls dominate synchronization time

    • etc.


Comm latency tolerance in message passing

  • “Appropriate” placement of non-blocking communication calls

  • MPI’s non-blocking calls

    • MPI_Isend() and MPI_Irecv()


MPI’s non-blocking send

  • A non-blocking post-send (MPI_Isend) initiates a send operation, but does not complete it

  • The post-send may return before the message is copied out of the send buffer

  • A separate complete-send (MPI_Wait) call is needed to verify that the send operation has completed, i.e. that the data has been copied out of the send buffer
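A minimal sketch of the post-send / complete-send pair; the function name, buffer, and message size are invented for the example:

    #include <mpi.h>

    void nonblocking_send(double *sendbuf, int n, int dest)
    {
        MPI_Request req;

        /* Post-send: initiates the send and returns immediately;
         * sendbuf must not be modified until the send completes. */
        MPI_Isend(sendbuf, n, MPI_DOUBLE, dest, /*tag=*/0,
                  MPI_COMM_WORLD, &req);

        /* ... useful computation that does not touch sendbuf ... */

        /* Complete-send: blocks until the data has been copied out
         * of sendbuf, after which the buffer may safely be reused. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }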


MPI’s non-blocking receive

  • A non-blocking post-receive (MPI_Irecv) initiates a receive operation, but does not complete it

  • The post-receive may return before the message is stored into the receive buffer

  • A separate complete-receive (MPI_Wait) call is needed to verify that the receive operation has completed, i.e. that the data has been stored into the receive buffer
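The matching sketch for the post-receive / complete-receive pair, again with invented names:

    #include <mpi.h>

    void nonblocking_recv(double *recvbuf, int n, int src)
    {
        MPI_Request req;

        /* Post-receive: initiates the receive and returns immediately;
         * recvbuf must not be read until the receive completes. */
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, src, /*tag=*/0,
                  MPI_COMM_WORLD, &req);

        /* ... useful computation that does not read recvbuf ... */

        /* Complete-receive: blocks until the message is stored in
         * recvbuf, after which the data may safely be used. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }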




“Appropriate” placement of non-blocking calls

  • To achieve max overlap between computation and communication, communications should be

    • started as soon as possible

    • completed as late as possible

  • Send should be

    • Posted as soon as the data to be sent is available

    • Completed just before the send buffer is to be reused

  • Receive should be

    • Posted as soon as the receive buffer is free, i.e. its previous contents are no longer needed

    • Completed just before the data in the receive buffer is to be used

  • Sometimes, overlap can be increased by re-ordering computations
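Putting these rules together, a sketch of a common overlap pattern (a halo-exchange-style step; the buffer names and the interior computation are invented for illustration):

    #include <mpi.h>

    void exchange_and_compute(double *halo_in, double *halo_out,
                              double *interior, int n, int nbr)
    {
        MPI_Request reqs[2];

        /* Post the receive as early as possible: the buffer is free. */
        MPI_Irecv(halo_in, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &reqs[0]);

        /* Post the send as soon as the outgoing data is ready. */
        MPI_Isend(halo_out, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Re-ordered computation: do the interior work first, since it
         * does not depend on the incoming halo data. */
        for (int i = 0; i < n; i++)
            interior[i] *= 0.5;

        /* Complete as late as possible: just before halo_in is read
         * and halo_out is reused. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* ... boundary computation that consumes halo_in ... */
    }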


Communication latency tolerance in MESSENGERS-C

  • Multithreading

  • Multiple sending and receiving threads

  • I/O multiplexing using poll() (similar to select()) with sockets
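A minimal sketch of the poll()-based multiplexing mentioned above, with two sockets and error handling omitted for brevity; the descriptors are assumed to be already-connected sockets:

    #include <poll.h>
    #include <unistd.h>

    /* Wait on several sockets at once and service whichever becomes
     * readable; the timeout leaves room to do other work when idle. */
    void multiplex(int sock_a, int sock_b)
    {
        struct pollfd fds[2] = {
            { .fd = sock_a, .events = POLLIN },
            { .fd = sock_b, .events = POLLIN },
        };
        char buf[4096];

        for (;;) {
            int ready = poll(fds, 2, /*timeout ms=*/10);
            if (ready <= 0)
                continue;          /* timeout or error: could do useful work */
            for (int i = 0; i < 2; i++)
                if (fds[i].revents & POLLIN)
                    read(fds[i].fd, buf, sizeof buf);   /* consume the data */
        }
    }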


References

  • Matthew Haines and Wim Bohm, “An Evaluation of Software Multithreading in a Conventional Distributed Memory Multiprocessor,” 1993

  • Yong Yan, Xiaodong Zhang, and Zhao Zhang, “A Memory-layout Oriented Run-time Technique for Locality Optimization,” 1995

  • P. H. Wang, Hong Wang, J. D. Collins, E. Grochowski, M. Kling, and J. P. Shen, “Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation,” Eighth International Symposium on High-Performance Computer Architecture (HPCA), 2002

