Communication latency in distributed memory systems
Latency
  • Latency = time delay
  • Memory latency = time that elapses between making a request for a value stored in memory and receiving the associated value
  • Communication (or network) latency = time between sending and starting to receive data on a network link
Memory latency
  • On a uniprocessor – the memory wall: the growing speed gap between processor and memory
  • On a NUMA machine (a tightly coupled distributed-memory multiprocessor), memory latency is
    • small for local references
    • large for remote references
Reduction of memory latency
  • Tolerance (hiding) – hiding the effect of memory-access latencies by overlapping useful computation with memory references
  • Avoidance – minimize remote references by co-locating computation with data it accesses
  • Latency tolerance and avoidance are complementary to each other
Memory latency tolerance
  • Multithreading
    • while one thread waits on a memory access, another executes
    • successful when the overhead associated with the implementation, e.g. thread-switching time, is less than the benefit gained from overlapping computation and memory access
  • Prefetching (into cache) – see the sketch after this list
  • Out-of-order execution
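
A minimal sketch of software prefetching in C, assuming a GCC/Clang toolchain (__builtin_prefetch); the prefetch distance PF_DIST is an illustrative tuning parameter that real codes match to the miss latency:

    #include <stddef.h>

    #define PF_DIST 16  /* illustrative: how many elements ahead to fetch */

    /* Sum an array while prefetching PF_DIST elements ahead, so the cache
       miss for a[i + PF_DIST] overlaps with the computation on a[i]. */
    double sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], /* rw = */ 0, /* locality = */ 1);
            s += a[i];
        }
        return s;
    }
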
Reduction of network (communication) latency
  • Tolerance (hiding) – hiding the effect of latencies by overlapping useful computation with communication
  • Avoidance – minimize communication by co-locating computation with data it accesses
Communication latency tolerance
  • In general, successful only if other useful work is available
  • Multithreading
    • while one thread waits for communication, another executes – see the sketch after this list
    • successful when the overhead associated with the implementation, e.g. thread-switching time, is less than the benefit gained from overlapping computation and communication
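
A minimal sketch of the idea with POSIX threads; comm_thread, blocking_receive, and do_useful_work are hypothetical names standing in for real communication and compute routines:

    #include <pthread.h>

    /* Hypothetical communication thread: blocks in a receive (e.g. a
       socket recv() or a blocking MPI call) while the main thread computes. */
    static void *comm_thread(void *arg) {
        /* blocking_receive(arg); */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, comm_thread, NULL /* receive buffer */);
        /* do_useful_work();  -- overlaps with the receive in flight */
        pthread_join(t, NULL);  /* after this, the data has arrived */
        return 0;
    }
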
Communication latency tolerance
  • Multithreading
    • exploits parallelism across multiple threads
  • Prefetching
    • finds parallelism within a single thread
    • the request for data (i.e. the prefetch request) must be issued far enough in advance of the use of the data in the execution stream
    • requires the ability to predict what data will be needed ahead of time
Comm latency avoidance in software DSMs
  • Data replication in local memory, i.e. using local memory as a cache for remote locations
  • Relaxed consistency models
Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo 1998
  • Comm latency avoidance is not sufficient
  • Comm latency tolerance is needed
  • Combining prefetching and multithreading such that
    • multithreading hides synchronization latency, and
    • prefetching hides memory latency
  lets 3 of the 8 applications achieve better performance than either technique used individually.

Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo 1998
  • But combining prefetching and multithreading such that both techniques attempt to hide memory latency is not a good idea – redundant overhead
  • Best overall approach depends on
    • predictability of memory access patterns
    • the extent to which lock stalls dominate synchronization time
    • etc.
Comm latency tolerance in message passing
  • “Appropriate” placement of non-blocking communication calls
  • MPI’s non-blocking calls
    • MPI_Isend() and MPI_Irecv()
MPI’s non-blocking send
  • A non-blocking post-send (MPI_Isend) initiates a send operation, but does not complete it
  • The post-send may return before the message is copied out of the send buffer
  • A separate complete-send (MPI_Wait) call is needed to verify that the send operation has completed, i.e. that the data has been copied out of the send buffer
MPI’s non-blocking receive
  • A non-blocking post-receive (MPI_Irecv) initiates a receive operation, but does not complete it
  • The post-receive may return before the message is stored into the receive buffer
  • A separate complete-receive (MPI_Wait) call is needed to verify that the receive operation has completed, i.e. that the data has been received into the receive buffer
MPI_Send = MPI_Isend + MPI_Wait
  • MPI_Recv = MPI_Irecv + MPI_Wait
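
A minimal C sketch of that equivalence (the buffer size N, TAG, and the helper name are illustrative): a blocking MPI_Send behaves like an MPI_Isend immediately followed by MPI_Wait.

    #include <mpi.h>

    #define N   1024
    #define TAG 0

    /* Hypothetical helper: the non-blocking pair that a blocking
       MPI_Send(buf, N, MPI_DOUBLE, dest, TAG, MPI_COMM_WORLD) amounts to. */
    void send_blocking_equivalent(const double *buf, int dest) {
        MPI_Request req;
        MPI_Isend(buf, N, MPI_DOUBLE, dest, TAG, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete-send: buf may now be reused */
    }

The point of splitting the call, of course, is to put useful computation between the MPI_Isend and the MPI_Wait rather than issuing them back to back.
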
“Appropriate” placement of non-blocking calls
  • To achieve max overlap between computation and communication, communications should be
    • started as soon as possible
    • completed as late as possible
  • Send should be
    • Posted as soon as the data to be sent is available
    • Completed just before the send buffer is to be reused
  • Receive should be
    • Posted as soon as the receive buffer is available
    • Completed just before the data in the receive buffer is to be used
  • Sometimes, overlap can be increased by re-ordering computations – see the sketch below
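
A sketch of this placement in C with MPI; the peer rank, buffer size, and the commented-out helper names are illustrative. The receive is posted as soon as its buffer is free, the send as soon as its data is ready, independent computation fills the gap, and both waits come as late as possible:

    #include <mpi.h>

    #define N   1024
    #define TAG 0

    void exchange_with_overlap(double *sendbuf, double *recvbuf, int peer) {
        MPI_Request reqs[2];

        /* Post the receive as soon as the receive buffer is available. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, TAG, MPI_COMM_WORLD, &reqs[0]);

        /* fill_send_buffer(sendbuf);  -- hypothetical: produce the data  */

        /* Post the send as soon as the data to be sent is available. */
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, TAG, MPI_COMM_WORLD, &reqs[1]);

        /* compute_on_local_data();  -- hypothetical work touching neither
                                        sendbuf nor recvbuf               */

        /* Complete both operations just before the buffers are needed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* use_received_data(recvbuf);  -- hypothetical consumer */
    }
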
Communication latency tolerance in MESSENGERS-C
  • Multithreading
  • Multiple sending and receiving threads
  • I/O multiplexing using poll() (similar to select()) with sockets – see the sketch below
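
A minimal sketch of poll()-based multiplexing in C (the socket array and MAX_SOCKS bound are illustrative): one thread watches all connections and services whichever becomes readable, instead of blocking on a single recv().

    #include <poll.h>

    #define MAX_SOCKS 64  /* illustrative bound; assumes nsocks <= MAX_SOCKS */

    void service_sockets(int *socks, int nsocks) {
        struct pollfd pfds[MAX_SOCKS];
        for (int i = 0; i < nsocks; i++) {
            pfds[i].fd = socks[i];
            pfds[i].events = POLLIN;           /* wake when readable */
        }
        /* Block until at least one socket has data (timeout = -1). */
        if (poll(pfds, nsocks, -1) > 0) {
            for (int i = 0; i < nsocks; i++)
                if (pfds[i].revents & POLLIN) {
                    /* recv() from socks[i]; it will not block */
                }
        }
    }
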
References
  • Matthew Haines, Wim Bohm, “An Evaluation of Software Multithreading in a Conventional Distributed Memory Multiprocessor”, 1993
  • Yong Yan, Xiaodong Zhang, Zhao Zhang, “A Memory-layout Oriented Run-time Technique for Locality Optimization”, 1995
  • P. H. Wang, Hong Wang, J. D. Collins, E. Grochowski, M. Kling, J. P. Shen, “Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation”, Eighth International Symposium on High-Performance Computer Architecture (HPCA-8), 2002