Communication in Tightly Coupled Systems

CS 519: Operating System Theory

Computer Science, Rutgers University

Instructor: Thu D. Nguyen

TA: Xiaoyan Li

Spring 2002

Why Parallel Computing? Performance!

Processor Performance

But not just Performance
  • At some point, we’re willing to trade some performance for:
    • Ease of programming
    • Portability
    • Cost
  • Ease of programming & Portability
    • Parallel programming for the masses
    • Leverage new or faster hardware ASAP
  • Cost
    • High-end parallel machines are expensive resources

Amdahl’s Law
  • If a fraction s of a computation is not parallelizable, then the best achievable speedup on p processors is 1 / (s + (1 − s)/p), which approaches 1/s as p grows (see the sketch below)
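
Below is a minimal sketch, not from the slides, that evaluates Amdahl’s formula for a few serial fractions; the particular values are arbitrary examples.

    #include <stdio.h>

    /* Amdahl's Law: speedup on p processors when a fraction s of the work
     * is serial; as p grows, the speedup approaches 1/s. */
    static double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        double serial[] = { 0.01, 0.05, 0.10 };   /* example serial fractions */
        int    procs[]  = { 4, 64, 1024 };
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++)
                printf("s=%.2f p=%4d speedup=%7.2f\n",
                       serial[i], procs[j], speedup(serial[i], procs[j]));
            printf("s=%.2f limit as p->inf: %.0f\n", serial[i], 1.0 / serial[i]);
        }
        return 0;
    }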

Pictorial Depiction of Amdahl’s Law

[Figure: execution-time bars. The serial fraction takes the same Time whether 1 or p processors are used; only the parallelizable fraction shrinks by a factor of p.]

Parallel Applications
  • Scientific computing is not the only class of parallel applications
  • Examples of non-scientific parallel applications:
    • Data mining
    • Real-time rendering
    • Distributed servers

Centralized Memory Multiprocessors

[Figure: multiple CPUs, each with its own cache, share one Memory over a common memory bus; an I/O bus connects the disk and the network interface.]

Distributed Shared-Memory (NUMA) Multiprocessors

[Figure: each node has a CPU with cache, its own Memory, memory bus, I/O bus, disk, and network interface; the nodes are connected by a network, so every CPU can reach every memory, but remote memory is slower to access than local memory.]

Multicomputers

[Figure: the same per-node hardware as above (CPU, cache, Memory, disk, network interface) connected by a network, but with no shared memory across nodes.]

Inter-processor communication in multicomputers is effected through message passing.


Basic Message Passing

[Figure: process P0 on node N0 performs a Send and process P1 on node N1 performs a Receive; the message travels between the nodes through the Communication Fabric.]

Terminology
  • Basic Message Passing:
    • Send: Analogous to mailing a letter
    • Receive: Analogous to picking up a letter from the mailbox
    • Scatter-gather: The ability to “scatter” the data items in a message into multiple memory locations and to “gather” data items from multiple memory locations into one message (see the sketch after this list)
  • Network performance:
    • Latency: The time from when a Send is initiated until the first byte is received by a Receive.
    • Bandwidth: The rate at which a sender is able to send data to a receiver.
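
As a concrete illustration of scatter-gather (a sketch, not part of the lecture), POSIX writev gathers several user buffers into one outgoing message in a single call; readv performs the matching scatter on receive. The descriptor fd is assumed to be an already-open socket or pipe.

    #include <sys/uio.h>
    #include <unistd.h>
    #include <string.h>

    /* Gather two separate buffers into one outgoing "message" on fd,
     * without first copying them into a single contiguous buffer. */
    int send_gathered(int fd) {
        char header[16] = "HDR:len=11 ";
        char payload[] = "hello world";
        struct iovec iov[2];
        iov[0].iov_base = header;  iov[0].iov_len = strlen(header);
        iov[1].iov_base = payload; iov[1].iov_len = strlen(payload);
        return (int) writev(fd, iov, 2);   /* one call, two memory regions */
    }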

Scatter-Gather

[Figure: Gather (Send): data items from multiple Memory locations are collected into one outgoing Message. Scatter (Receive): the items of an incoming Message are distributed to multiple Memory locations.]

Basic Message Passing: Easy, Right?
  • What could be easier than this, right?
  • Well, think of the post office: what does it actually take to send a letter?

Basic Message Passing: Not So Easy
  • If basic message passing is so easy, why is sending a letter so complicated?
  • Well, it’s really not easy! Issues include:
    • Naming: How do we specify the receiver?
    • Routing: How is the message forwarded to the correct receiver through intermediaries?
    • Buffering: What if the outgoing port is not available? What if the receiver is not ready to receive the message?
    • Reliability: What if the message is lost in transit? What if the message is corrupted in transit?
    • Blocking: What if the receiver is ready to receive before the sender is ready to send?


Traditional Message Passing Implementation

[Figure: senders (S) and receivers (R) exchange messages (M) through the kernel, which buffers and copies each message on both sides.]

  • Kernel-based message passing: unnecessary data copying and traps into the kernel

Reliability
  • Reliability problems:
    • Message loss
      • Most common approach: if a reply/ack msg is not received within some time interval, resend (see the sketch after this list)
    • Message corruption
      • Most common approach: send additional information (e.g., an error-correction code) so the receiver can reconstruct the data, or at least detect the corruption, if part of the msg is lost or damaged. If reconstruction is not possible, throw away the corrupted msg and pretend it was lost
    • Lack of buffer space
      • Most common approach: control the flow and size of messages to avoid running out of buffer space
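
A minimal stop-and-wait sketch of the timeout-and-resend idea, assuming a connected UDP socket and a receiver that answers each datagram with "ACK"; this is an illustration, not the protocol of any particular system.

    #include <string.h>
    #include <sys/time.h>
    #include <sys/socket.h>

    /* Resend the message until an acknowledgment arrives or we give up. */
    int send_reliably(int sock, const void *msg, size_t len, int max_tries) {
        struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };   /* 1 s timeout */
        setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

        char ack[4];
        for (int attempt = 0; attempt < max_tries; attempt++) {
            send(sock, msg, len, 0);                         /* (re)transmit */
            if (recv(sock, ack, sizeof ack, 0) > 0 &&
                memcmp(ack, "ACK", 3) == 0)
                return 0;                                    /* acknowledged */
            /* timeout or unexpected reply: fall through and resend */
        }
        return -1;                                           /* give up */
    }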

Reliability
  • Reliability is indeed a very hard problem in large-scale networks such as the Internet
    • Network is unreliable
    • Message loss can greatly impact performance
    • Mechanisms to address reliability can be costly even when there’s no message loss
  • Reliability is not as hard for parallel machines
    • Underlying network hardware is much more reliable
    • Less prone to buffer overflow, because the hardware often provides flow control

We will address reliability later, in the context of loosely coupled systems

Computation vs. Communication Cost
  • 200 MHz clock → 5 ns instruction cycle
  • Memory access:
    • L1: ~2-4 cycles → 10-20 ns
    • L2: ~5-10 cycles → 25-50 ns
    • Memory: ~50-200 cycles → 250-1000 ns
  • Message roundtrip latency:
    • ~20 μs
    • Suppose a 75% hit ratio in L1, no L2, 10 ns L1 access time, 500 ns memory access time → average memory access time = 0.75 × 10 + 0.25 × 500 = 132.5 ns
    • 1 message roundtrip latency ≈ 20,000 / 132.5 ≈ 151 memory accesses (see the sketch below)
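
A tiny sketch that simply reproduces the slide’s arithmetic, so the numbers can be replayed with different hit ratios or latencies.

    #include <stdio.h>

    int main(void) {
        double l1_hit = 0.75, t_l1 = 10.0, t_mem = 500.0;   /* times in ns */
        double t_avg = l1_hit * t_l1 + (1.0 - l1_hit) * t_mem;
        double roundtrip_ns = 20.0 * 1000.0;                /* 20 us */
        printf("average memory access = %.1f ns\n", t_avg);            /* 132.5 */
        printf("roundtrip = %.0f memory accesses\n", roundtrip_ns / t_avg); /* ~151 */
        return 0;
    }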

Performance … Always Performance!
  • So … obviously, when we talk about message passing, we want to know how to optimize for performance
  • But … which aspects of message passing should we optimize?
    • We could try to optimize everything
      • Optimizing the wrong thing wastes precious resources; e.g., optimizing how outgoing mail is left for the mail carrier does not significantly increase the overall “speed” of mail delivery
  • Subject of Martin et al., “Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture,” ISCA 1997.

Martin et al.: LogP Model

Sensitivity to LogGP Parameters
  • LogGP parameters (see the cost sketch after this list):
    • L = delay incurred in passing a short msg from source to dest
    • o = processor overhead involved in sending or receiving a msg
    • g = minimum time between consecutive msg transmissions or receptions (determines per-message bandwidth)
    • G = bulk gap = time per byte transferred for long transfers (determines per-byte bandwidth)
  • Workstations connected with Myrinet network and Generic Active Messages layer
  • Delay insertion technique
  • Applications written in Split-C but perform their own data caching
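
A hedged sketch (not taken from Martin et al.) of how the LogGP parameters combine into an estimated one-way time for an n-byte message; the parameter values below are made-up placeholders, not measurements.

    #include <stdio.h>

    typedef struct { double L, o, g, G; } loggp_t;   /* all in microseconds */

    /* send overhead + per-byte cost for the rest of the message
     * + wire latency + receive overhead.  g limits how frequently
     * successive messages can be injected and is not charged here. */
    static double one_way_time(loggp_t m, int n_bytes) {
        return m.o + (n_bytes - 1) * m.G + m.L + m.o;
    }

    int main(void) {
        loggp_t m = { .L = 5.0, .o = 3.0, .g = 6.0, .G = 0.03 };  /* placeholders */
        printf("64 B : %.1f us\n", one_way_time(m, 64));
        printf("4 KB : %.1f us\n", one_way_time(m, 4096));
        return 0;
    }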

Sensitivity to Overhead

Sensitivity to Gap

Sensitivity to Latency

Sensitivity to Bulk Gap

Summary
  • Runtime is strongly dependent on overhead and gap
  • Strong dependence on gap because of the burstiness of communication
  • Not so sensitive to latency → computation and communication can be effectively overlapped using non-blocking reads (writes usually do not stall the processor)
  • Not sensitive to bulk gap → we have more bandwidth than we know what to do with

What’s the Point?
  • What can we take away from Martin et al.’s study?

It’s extremely important to reduce overhead because it may affect both “o” and “g”

All the “action” is currently in the OS and the Network Interface Card (NIC)

  • Subject of von Eicken et al., “Active Messages: a Mechanism for Integrated Communication and Computation,” ISCA 1992.


An Efficient Low-Level Message Passing Interface

von Eicken et al., “Active Messages: a Mechanism for Integrated Communication and Computation,” ISCA 1992

von Eicken et al., “U-Net: A User-Level Network Interface for Parallel and Distributed Computing,” SOSP 1995

Santos, Bianchini, and Amorim, “A Survey of Messaging Software Issues and Systems for Myrinet-Based Clusters”, PDCP 1999

von Eicken et al.: Active Messages
  • Design challenge for large-scale multiprocessor:
    • Minimize communication overhead
    • Allow computation to overlap communication
    • Coordinate the above two without sacrificing processor cost/performance
  • Problems with traditional message passing:
    • Send/receive are usually synchronous; no overlap between communication and computation
    • If not synchronous, buffering is needed (inside the kernel) on the receive side
  • Active Messages approach:
    • Asynchronous communication model (send and continue)
    • Each message specifies a handler that integrates the msg into the ongoing computation on the receiving side (see the sketch below)
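
A hedged sketch of the active-message idea, not the paper’s actual interface: the message carries a handler pointer, and "delivery" simply runs that handler on the payload, so nothing needs to be buffered until a matching receive is posted. The names (am_handler_t, deliver, add_to_sum) are invented for illustration.

    #include <stdio.h>
    #include <string.h>

    typedef void (*am_handler_t)(void *arg, size_t len);

    typedef struct {
        am_handler_t handler;     /* runs on the destination node */
        size_t       len;
        char         payload[64];
    } active_msg_t;

    /* Example handler: fold the incoming value into a running sum. */
    static double running_sum = 0.0;
    static void add_to_sum(void *arg, size_t len) {
        (void)len;
        running_sum += *(double *)arg;
    }

    /* Stand-in for the network: "delivering" a message invokes its handler. */
    static void deliver(active_msg_t *m) { m->handler(m->payload, m->len); }

    int main(void) {
        active_msg_t m = { .handler = add_to_sum, .len = sizeof(double) };
        double x = 42.0;
        memcpy(m.payload, &x, sizeof x);
        deliver(&m);                      /* receive side: no buffering step */
        printf("sum = %g\n", running_sum);
        return 0;
    }

Note that shipping a raw handler address assumes both sides run the same binary, which is exactly the SPMD restriction discussed a few slides later.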

Buffering
  • Remember the buffering problem: what do we do if the receiver is not ready to receive?
    • Drop the message
      • Typically very costly because of the recovery costs
    • Leave the message in the NIC
      • Reduces network utilization
      • Can result in deadlocks
    • Wait until the receiver is ready – synchronous or 3-phase protocol
    • Copy to an OS buffer and later copy to the user buffer

3-phase Protocol

[Figure: the three phases of a rendezvous transfer – the sender issues a request to send, the receiver replies once a receive buffer is available, and the sender then transfers the data directly into that buffer.]


Copying

[Figure: an incoming message is first placed in message buffers in the OS address space and then copied into the process address space.]

Copying - Don’t Do It!

[Figure credited to Hennessy and Patterson, 1996.]

Overhead of Many Native MIs Too High
  • Recall that overhead is critical to application performance
  • Asynchronous send and receive overheads on many platforms (back in 1991): Ts = time to start a message; Tb = time per byte; Tfb = time per flop (for comparison)

von Eicken et al.: Active Receive
  • The key idea is really to optimize the receive side - buffer management is more complex on the receiver

[Figure: the message carries a Handler reference together with the Message Data; the handler consumes the data on arrival.]

Active Receive More Efficient

[Figure: with an Active Message, data moves directly between processes P0 and P1; with copying, the same transfer passes through OS buffers on both sides.]

Active Message Performance

              Send                        Receive
              Instructions   Time (μs)    Instructions   Time (μs)
  NCUBE2      21             11.0         34             15.0
  CM-5        —              1.6          —              1.7

The main difference between these AM implementations is that the CM-5 allows direct, user-level access to the network interface. More on this in a minute!

Any Drawback To Active Message?
  • Active Messages → SPMD
    • SPMD: Single Program Multiple Data
  • This is because the sender must know the address of the handler on the receiver
  • Not absolutely necessary, however
    • Can use indirection, i.e., have a table mapping handler IDs to addresses on the receiver (see the sketch below). The mapping has a performance cost, though.
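
A hedged sketch of that indirection: the sender names a small handler ID, and each node maps IDs to its own local function addresses, at the price of one table lookup per message. All names here are hypothetical.

    #include <stdio.h>

    typedef void (*am_handler_t)(void *arg);

    static void h_put(void *arg)  { printf("put %d\n",  *(int *)arg); }
    static void h_incr(void *arg) { printf("incr %d\n", *(int *)arg); }

    /* Each node registers its own handlers under agreed-upon IDs. */
    static am_handler_t handler_table[] = { h_put, h_incr };

    static void deliver(int handler_id, void *arg) {
        /* one extra table lookup per message: the cost of the indirection */
        handler_table[handler_id](arg);
    }

    int main(void) {
        int x = 7;
        deliver(1, &x);     /* invokes h_incr on the "receiver" */
        return 0;
    }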

User-Level Access to NIC
  • Basic idea: allow protected user-level access to the NIC for implementing communication protocols at user level

User-level Communication
  • Basic idea: remove the kernel from the critical path of sending and receiving messages
    • user-memory to user-memory: zero copy
    • permission is checked once when the mapping is established
    • buffer management left to the application
  • Advantages
    • low communication latency
    • low processor overhead
    • approach raw latency and bandwidth provided by the network
  • One approach: U-Net

U-Net Abstraction

U-Net Endpoints

U-Net Basics
  • Protection provided by endpoints and communication channels
    • Endpoints, communication segments, and message queues are only accessible by the owning process (all allocated in user memory)
    • Outgoing messages are tagged with the originating endpoint address and incoming messages are demultiplexed and only delivered to the correct endpoints
  • For ideal performance, firmware at NIC should implement the actual messaging and NI multiplexing (including tag checking). Protection must be implemented by the OS by validating requests for the creation of endpoints. Channel registration should also be implemented by the OS.
  • Message queues can be placed in different memories to optimize polling (see the sketch after this list)
    • Receive queue allocated in host memory
    • Send and free queues allocated in NIC memory
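
A hedged sketch of what an endpoint’s data structures might look like; the field names and layout are invented for illustration and are not U-Net’s actual definitions.

    #include <stdint.h>
    #include <stddef.h>

    /* Descriptors reference offsets within the communication segment, so
     * the NIC and the owning process can exchange buffers without any
     * kernel involvement on the data path. */
    typedef struct {
        uint32_t offset;        /* where the buffer lives in the segment */
        uint32_t length;        /* message length in bytes */
        uint32_t tag;           /* identifies the destination endpoint */
    } unet_descriptor_t;

    typedef struct {
        void              *comm_segment;   /* pinned, user-mapped buffer memory */
        size_t             segment_size;
        unet_descriptor_t *recv_queue;     /* polled by the host: host memory */
        unet_descriptor_t *send_queue;     /* polled by the NIC: NIC memory */
        unet_descriptor_t *free_queue;     /* empty buffers handed to the NIC */
        size_t             queue_len;
    } unet_endpoint_t;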

U-Net Performance on ATM

U-Net UDP Performance

U-Net TCP Performance

U-Net Latency

Virtual Memory-Mapped Communication
  • Receiver exports the receive buffers
  • Sender must import a receive buffer before sending
  • The sender’s permission to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)
  • Sender can directly communicate with the network interface to send data into imported buffers without kernel intervention
  • At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention

Virtual-to-Physical Address

Receiver:
    int rec_buffer[1024];
    exp_id = export(rec_buffer, sender);
    recv(exp_id);

Sender:
    int send_buffer[1024];
    recv_id = import(receiver, exp_id);
    send(recv_id, send_buffer);

  • In order to store data directly into the application address space (exported buffers), the NI must know the virtual to physical translations
  • What to do?

Software TLB in Network Interface
  • The network interface must incorporate a TLB (NI-TLB) which is kept consistent with the virtual memory system
  • When a message arrives, NI attempts a virtual to physical translation using the NI-TLB
  • If a translation is found, NI transfers the data to the physical address in the NI-TLB entry
  • If a translation is missing in the NI-TLB, the processor is interrupted to provide the translation. If the page is not currently in memory, the processor will bring the page in. In any case, the kernel increments the reference count for that page to avoid swapping
  • When a page entry is evicted from the NI-TLB, the kernel is informed to decrement the reference count
  • Swapping is thus prevented while a DMA transfer is in progress (see the lookup sketch below)
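
A hedged sketch of the lookup path described above; interrupt_host_for_translation() is a hypothetical stand-in for the host interrupt plus kernel handler, and the direct-mapped table organization is an assumption.

    #include <stdint.h>
    #include <stdbool.h>

    #define NI_TLB_ENTRIES 64

    typedef struct {
        uint64_t vpn;        /* virtual page number (tag)  */
        uint64_t pfn;        /* physical frame number      */
        bool     valid;
    } ni_tlb_entry_t;

    static ni_tlb_entry_t ni_tlb[NI_TLB_ENTRIES];

    /* Hypothetical helper: the kernel pins the page (reference count++),
     * brings it in if needed, and returns the translation. */
    extern uint64_t interrupt_host_for_translation(uint64_t vpn);

    uint64_t ni_translate(uint64_t vpn) {
        ni_tlb_entry_t *e = &ni_tlb[vpn % NI_TLB_ENTRIES];
        if (e->valid && e->vpn == vpn)
            return e->pfn;                 /* hit: DMA straight into memory */
        uint64_t pfn = interrupt_host_for_translation(vpn);   /* miss */
        e->vpn = vpn; e->pfn = pfn; e->valid = true;           /* cache it */
        return pfn;
    }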

Introduction to Collective Communication

Collective Communication
  • More than two processes are involved in the communication; common patterns include (see the sketch after this list):
    • Barrier
    • Broadcast (one-to-all), multicast (one-to-many)
    • All-to-all
    • Reduction (all-to-one)
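
The slides do not name a particular library, but MPI provides these patterns directly; the sketch below shows barrier, broadcast, and reduction calls (compile with mpicc, run under mpirun).

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Barrier(MPI_COMM_WORLD);                        /* barrier */

        int value = (rank == 0) ? 42 : 0;                   /* one-to-all */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        int sum = 0;                                        /* all-to-one */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d procs: broadcast value = %d, sum of ranks = %d\n",
                   nprocs, value, sum);

        MPI_Finalize();
        return 0;
    }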

Barrier

[Figure: four processes each execute a Compute phase, synchronize at the Barrier, and only then proceed to their next Compute phase.]

Broadcast and Multicast

[Figure: Broadcast – P0 sends the Message to all other processes (P1, P2, P3). Multicast – P0 sends the Message to only a subset of them.]

All-to-All

[Figure: each of P0, P1, P2, and P3 sends a Message to every other process.]

Reduction

A sequential reduction:

    sum ← 0
    for i ← 1 to p do
        sum ← sum + A[i]

[Figure: tree reduction over P0-P3, each starting with one element A[i]. In the first step the partial sums A[0] + A[1] and A[2] + A[3] are formed; in the second step they are combined into A[0] + A[1] + A[2] + A[3] (see the sketch below).]
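
A shared-memory stand-in (a sketch, not a message-passing program) for the combining tree in the figure: the same pairwise additions are performed in log2(p) steps.

    #include <stdio.h>

    static void tree_reduce(double A[], int p) {
        for (int stride = 1; stride < p; stride *= 2)      /* log2(p) steps */
            for (int i = 0; i + stride < p; i += 2 * stride)
                A[i] += A[i + stride];                     /* pairwise combine */
        /* A[0] now holds the total */
    }

    int main(void) {
        double A[4] = { 1.0, 2.0, 3.0, 4.0 };
        tree_reduce(A, 4);
        printf("sum = %g\n", A[0]);   /* prints 10 */
        return 0;
    }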
