basic communication operations l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Basic Communication Operations PowerPoint Presentation
Download Presentation
Basic Communication Operations

Loading in 2 Seconds...

play fullscreen
1 / 46

Basic Communication Operations - PowerPoint PPT Presentation


  • 391 Views
  • Uploaded on

Basic Communication Operations. Carl Tropper Department of Computer Science. Building Blocks in Parallel Programs. Common interactions between processes in parallel programs-broadcast, reduction, prefix sum…. Study their behavior on standard networks-mesh, hypercube, linear arrays

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Basic Communication Operations' - Ava


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
basic communication operations

Basic Communication Operations

Carl Tropper

Department of Computer Science

building blocks in parallel programs
Building Blocks in Parallel Programs
  • Common interactions between processes in parallel programs-broadcast, reduction, prefix sum….
  • Study their behavior on standard networks-mesh, hypercube, linear arrays
  • Derive expressions for time complexity
    • Assumptions are: send/receive one message on a link at a time, can send on one link while receiving on another link, cut through routing, bidirectional links
    • Message transfer time on a link: ts + twm
broadcast and reduction
Broadcast and Reduction
  • One to all broadcast
  • All to one reduction
      • p processes have m words apiece
      • associative operation on each word-sum,product,max,min
  • Both used in
      • Gaussian elimination
      • Shortest paths
      • Inner product of vectors
slide4

One to all Broadcast on Ring or Linear Array

  • One to all broadcast
  • Naïve-send p-1 messages from source to other processes. Inefficient.
  • Recursive doubling (below). Furthest node,half the distance…. Log p steps
reduction on linear array
Reduction on Linear Array
  • Odd nodes send to preceding even nodes
  • 0,2 to 0| 6,4 to 4 concurrently
  • 4 to 0
matrix vector multiplication on an nxn mesh
Matrix vector multiplication on an nxn mesh
  • Broadcast input vector to each row
  • Each row does reduction
broadcast on a mesh
Broadcast on a Mesh
  • Build on array algorithm
  • One to all from source (0) to row (4,8,12)
  • One to all on the columns
broadcast on a hypercube
Broadcast on a Hypercube
  • Source sends to nodes with highest dimension(msb), then next lower dimension,destination and source to next lower dimension,…..,everyone who can sends to lowest dimension
broadcast on a tree
Broadcast on a Tree
  • Same idea as hypercube
  • Grey circles are switches
  • Start from 0 and go to 4, rest the same as hypercube algorithm
all to all broadcast on linear array ring
All to all broadcast on linear array/ring
  • Idea-each node is kept busy in a circular transfer
all to all broadcast
All to all broadcast
  • Mesh-each row does all to all. Nodes do all to all broadcast in their columns.
  • Hypercube-pairs of nodes exchange messages in each dimension. Requires log p steps.
all to all broadcast on a hypercube
All to all broadcast on a hypercube
  • Hypercube of dimension p is composed of two hypercubes of dimension p-1
  • Start with dimension one hypercubes. Adjacent nodes exchange messages
  • Adjacent nodes in dimension two hypercubes exchange messages
  • In general, adjacent nodes in next higher dimensional hypercubes exchange messages
  • Requires log p steps.
all to all broadcast time
All to all broadcast time
  • Ring or Linear Array T=(ts+twm)(p-1)
  • Mesh
      • 1st phase: √p simultaneous all to all broadcasts among √p nodes takes time (ts+ twm)(√p-1)
      • 2nd phase: size of each message is m√p, phase takes time (ts+twm√p) )(√p-1)
      • T=2ts(√p-1)+twm(p-1) is total time
  • Hypercube
      • Size of message in ith step is 2i-1m
      • Takes ts+2i-1twm for a pair of nodes to send and receive messages
      • T=∑I=1logp (ts+2i-1twm)=tslog p + twm(p-1)
observation on all to all broadcast time
Observation on all to all broadcast time
  • Send time neglecting start up is the same for all 3 architectures:twm(p-1)
  • Lower bound on communication time for all 3 architectures
all reduce operation
All Reduce Operation
  • All reduce operation-all nodes start with buffer of size m. Associative operation performed on all buffers-all nodes get same result.
  • Semantically equivalent to all to one reduction followed by one to all broadcast.
  • All reduce with one word implements barrier synch for message passing machines. No node can finish reduction before all nodes have contributed to the reduction.
  • Implement all reduce using all to all broadcast. Add message contents instead of concatenating messages.
  • In hypercube, T=(ts+twm)log p for log p steps because message size does not double in each dimension.
prefix sum operation
Prefix Sum Operation
  • Prefix sums (scans) are all partial sums, sk, of p numbers, n1,….,np-1, one number living on each node.
  • Node k starts out with nk and winds up with sk
  • Modify all to all broadcast. Each node only uses partial sums from nodes with smaller labels.
scatter and gather
Scatter and gather
  • Scatter (one to all personalized communication) source sends unique message to each destination (vs broadcast - same message for each destination)
    • Hypercube:use one to all broadcast. Node transfers half of its messages to one neighbor and half to other neighbor at each step.Data goes from one subcube to another. Log p steps.
  • Gather: Node collects unique messages from other nodes
    • Hypercube:Odd numbered nodes send buffers to even nodes in other (lower dimensional) cube. Continue….
all to all personalized communication
All to all personalized communication
  • Each node sends distinct messages to every other node.
  • Used in FFT, Matrix transpose, sample sort
  • Transpose of A[i,j] is AT=A[j,i] ,0≤ i,j ≤ n
  • Put row i {(i,o),(i,1),….,(i,n-1) } processor Pi
  • Transpose-(i,0) goes to P0, (i,1) goes to processor 1,(i,n) goes to Pn
  • Every processor sends a distinct element to every other processor!
  • All to all on ring,mesh,hypercube
all to all on ring mesh hypercube
All to all on ring,mesh,hypercube
  • Ring: All nodes send messages in same direction.
      • Each node sends p-1 message of size m to neighbor.
      • Nodes extract message(s) for them, forward rest
  • Mesh: p x p mesh. Each node groups destination nodes into columns
      • All to all personalized communication on each row
      • Upon arrival, messages are sorted by row
      • All to all communication on rows
  • Hypercube p node hypercube
    • p/2 links in same dimension connect 2 subcubes with p/2 nodes
      • Each node exchanges p/2 messages in a given dimension
all to all personalized communication optimal algorithm on a hypercube
All to all personalized communication -optimal algorithm on a hypercube
  • Nodes chose partners for exchange in order to not suffer congestion
  • In step j node i exchanges with node

i XOR j

      • First step-all nodes differing in lsb exchange messages…………..
      • Last step-all nodes differing msb exchange messages
  • E cube routing in hypercube implements this
      • (Hamming) distance between nodes is # non-zero bits in i xor j
      • Sort links corresponding to non-zero bits in ascending order
      • Route messages along these links
optimal algorithm
Optimal algorithm
  • T=(ts+twm)(p-1)
circular q shift
Circular q Shift
  • Node i sends packet to (i+q) mod p
  • Useful in string, image pattern matching, matrix comp…
  • 5 shift on 4x4 mesh
      • Everyone goes 1 to the right (q mod √p)
      • Everyone goes 1 upwards q/√p
      • Left column goes up before general upshift to make up for wraparound effect in first step
circular shift on hypercube
Circular shift on hypercube
  • Map linear array onto hypercube: map I to j, where j is the d bit reflected Gray code of i
what happens if we split messages into packets
What happens if we split messages into packets?
  • One to all broadcast:scatter operation followed by all to all broadcast
      • Scatter: tslogp p + tw(m/p)(p-1)
      • All to all of messages of size m/p: ts log p +tw(m/p)(p-1) on a hypercube
      • T~2 x (ts log p + twm)on a hypercube
      • Bottom line-double the start-up cost, but cost of tw reduced by (log p)/2
  • All to one reduction-dual of one to all broadcast, so all to all reduction followed by gather
  • All reduce combine the above two