Basic Communication Operations

1 / 46

# Basic Communication Operations - PowerPoint PPT Presentation

Basic Communication Operations. Carl Tropper Department of Computer Science. Building Blocks in Parallel Programs. Common interactions between processes in parallel programs-broadcast, reduction, prefix sum…. Study their behavior on standard networks-mesh, hypercube, linear arrays

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Basic Communication Operations' - Ava

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Basic Communication Operations

Carl Tropper

Department of Computer Science

Building Blocks in Parallel Programs
• Common interactions between processes in parallel programs-broadcast, reduction, prefix sum….
• Study their behavior on standard networks-mesh, hypercube, linear arrays
• Derive expressions for time complexity
• Message transfer time on a link: ts + twm
• All to one reduction
• p processes have m words apiece
• associative operation on each word-sum,product,max,min
• Both used in
• Gaussian elimination
• Shortest paths
• Inner product of vectors

One to all Broadcast on Ring or Linear Array

• Naïve-send p-1 messages from source to other processes. Inefficient.
• Recursive doubling (below). Furthest node,half the distance…. Log p steps
Reduction on Linear Array
• Odd nodes send to preceding even nodes
• 0,2 to 0| 6,4 to 4 concurrently
• 4 to 0
Matrix vector multiplication on an nxn mesh
• Broadcast input vector to each row
• Each row does reduction
• Build on array algorithm
• One to all from source (0) to row (4,8,12)
• One to all on the columns
• Source sends to nodes with highest dimension(msb), then next lower dimension,destination and source to next lower dimension,…..,everyone who can sends to lowest dimension
• Same idea as hypercube
• Grey circles are switches
• Start from 0 and go to 4, rest the same as hypercube algorithm
All to all broadcast on linear array/ring
• Idea-each node is kept busy in a circular transfer
• Mesh-each row does all to all. Nodes do all to all broadcast in their columns.
• Hypercube-pairs of nodes exchange messages in each dimension. Requires log p steps.
All to all broadcast on a hypercube
• Hypercube of dimension p is composed of two hypercubes of dimension p-1
• Adjacent nodes in dimension two hypercubes exchange messages
• In general, adjacent nodes in next higher dimensional hypercubes exchange messages
• Requires log p steps.
• Ring or Linear Array T=(ts+twm)(p-1)
• Mesh
• 1st phase: √p simultaneous all to all broadcasts among √p nodes takes time (ts+ twm)(√p-1)
• 2nd phase: size of each message is m√p, phase takes time (ts+twm√p) )(√p-1)
• T=2ts(√p-1)+twm(p-1) is total time
• Hypercube
• Size of message in ith step is 2i-1m
• Takes ts+2i-1twm for a pair of nodes to send and receive messages
• T=∑I=1logp (ts+2i-1twm)=tslog p + twm(p-1)
Observation on all to all broadcast time
• Send time neglecting start up is the same for all 3 architectures:twm(p-1)
• Lower bound on communication time for all 3 architectures
All Reduce Operation
• All reduce operation-all nodes start with buffer of size m. Associative operation performed on all buffers-all nodes get same result.
• Semantically equivalent to all to one reduction followed by one to all broadcast.
• All reduce with one word implements barrier synch for message passing machines. No node can finish reduction before all nodes have contributed to the reduction.
• In hypercube, T=(ts+twm)log p for log p steps because message size does not double in each dimension.
Prefix Sum Operation
• Prefix sums (scans) are all partial sums, sk, of p numbers, n1,….,np-1, one number living on each node.
• Node k starts out with nk and winds up with sk
• Modify all to all broadcast. Each node only uses partial sums from nodes with smaller labels.
Scatter and gather
• Scatter (one to all personalized communication) source sends unique message to each destination (vs broadcast - same message for each destination)
• Hypercube:use one to all broadcast. Node transfers half of its messages to one neighbor and half to other neighbor at each step.Data goes from one subcube to another. Log p steps.
• Gather: Node collects unique messages from other nodes
• Hypercube:Odd numbered nodes send buffers to even nodes in other (lower dimensional) cube. Continue….
All to all personalized communication
• Each node sends distinct messages to every other node.
• Used in FFT, Matrix transpose, sample sort
• Transpose of A[i,j] is AT=A[j,i] ,0≤ i,j ≤ n
• Put row i {(i,o),(i,1),….,(i,n-1) } processor Pi
• Transpose-(i,0) goes to P0, (i,1) goes to processor 1,(i,n) goes to Pn
• Every processor sends a distinct element to every other processor!
• All to all on ring,mesh,hypercube
All to all on ring,mesh,hypercube
• Ring: All nodes send messages in same direction.
• Each node sends p-1 message of size m to neighbor.
• Nodes extract message(s) for them, forward rest
• Mesh: p x p mesh. Each node groups destination nodes into columns
• All to all personalized communication on each row
• Upon arrival, messages are sorted by row
• All to all communication on rows
• Hypercube p node hypercube
• p/2 links in same dimension connect 2 subcubes with p/2 nodes
• Each node exchanges p/2 messages in a given dimension
• Nodes chose partners for exchange in order to not suffer congestion
• In step j node i exchanges with node

i XOR j

• First step-all nodes differing in lsb exchange messages…………..
• Last step-all nodes differing msb exchange messages
• E cube routing in hypercube implements this
• (Hamming) distance between nodes is # non-zero bits in i xor j
• Sort links corresponding to non-zero bits in ascending order
• Route messages along these links
Optimal algorithm
• T=(ts+twm)(p-1)
Circular q Shift
• Node i sends packet to (i+q) mod p
• Useful in string, image pattern matching, matrix comp…
• 5 shift on 4x4 mesh
• Everyone goes 1 to the right (q mod √p)
• Everyone goes 1 upwards q/√p
• Left column goes up before general upshift to make up for wraparound effect in first step
Circular shift on hypercube
• Map linear array onto hypercube: map I to j, where j is the d bit reflected Gray code of i
What happens if we split messages into packets?
• One to all broadcast:scatter operation followed by all to all broadcast
• Scatter: tslogp p + tw(m/p)(p-1)
• All to all of messages of size m/p: ts log p +tw(m/p)(p-1) on a hypercube
• T~2 x (ts log p + twm)on a hypercube
• Bottom line-double the start-up cost, but cost of tw reduced by (log p)/2
• All to one reduction-dual of one to all broadcast, so all to all reduction followed by gather
• All reduce combine the above two