Definitions
Uploaded by byron-osborne
Definitions

  • A synchronous application is one where all processes must reach certain points before execution continues.

  • Local synchronization is a requirement that a subset of processes (usually neighbors) reach a synchronous point before execution continues.

  • A barrier is the basic message passing mechanism for synchronizing processes.

  • Deadlock occurs when a group of processes waits permanently for messages that can never arrive, because the would-be senders are themselves blocked waiting for messages.


Barrier Illustration

[Figure: each process Pi executes until it reaches the barrier, then waits; once every process has arrived, all are released.]

C: MPI_Barrier(MPI_COMM_WORLD);

mpiJava: MPI.COMM_WORLD.Barrier();


Counter (linear) Barrier

Master processor:

    For (i=0; i<P; i++)  // Arrival Phase
        Receive null message from any processor
    For (i=0; i<P; i++)  // Departure Phase
        Send null message to release slaves

Slave processors:

    Send null message to enter barrier
    Receive null message for barrier release

Note: The two-phase logic prevents a fast processor from arriving at the next barrier before every processor has been released from the current one.


Tree (non-linear) Barrier

[Figure: entry phase — P1→P0, P3→P2, P5→P4, P7→P6; then P2→P0, P6→P4; then P4→P0. The release phase sends the messages back down the tree in reverse.]

Note: Implementation logic is similar to divide and conquer


Butterfly Barrier

[Figure: processes P0–P7 synchronize pairwise in lg(8) = 3 stages, with the pairing distance doubling at each stage.]

  • Stage 1: p0↔p1; p2↔p3; p4↔p5; p6↔p7

  • Stage 2: p0↔p2; p1↔p3; p4↔p6; p5↔p7

  • Stage 3: p0↔p4; p1↔p5; p2↔p6; p3↔p7


Local Synchronization

Synchronize with neighbors before proceeding

  • Even-Numbered Processors

    Send null message to processor i-1

    Receive null message from processor i-1

    Send null message to processor i+1

    Receive null message from processor i+1

  • Odd-Numbered Processors

    Receive null message from processor i+1

    Send null message to processor i+1

    Receive null message from processor i-1

    Send null message to processor i-1

  • Notes:

    • Local Synchronization is an incomplete barrier

      • processors exit after receiving messages from their neighbors

    • Deadlock can occur if the message passing order is incorrect.

      • MPI_Sendrecv() and MPI_Sendrecv_replace() are deadlock-free


Local Synchronization Example

  • Heat Distribution Problem

    • Goal

      • Determine the final temperature at each point of an n × n grid

    • Initial boundary condition

      • The initial temperatures at the boundary points are known

    • Cannot proceed to next iteration until local synchronization completes

      DO

      Average each grid point with its neighbors

      UNTIL temperature changes are small enough

New value = (∑ of the four neighbors) / 4


Sequential Heat Distribution Code

Initialize rows 0..n and columns 0..n of g and h
iteration = 0
DO
    FOR (i=1; i<n; i++)
        FOR (j=1; j<n; j++)
            IF (iteration % 2) h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
            ELSE g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4
    iteration++
UNTIL max|g[i][j] − h[i][j]| < tolerance OR iteration > MAX


Block or Strip Partitioning

[Figure: a 4 × 4 arrangement of square blocks assigned to processors p0–p15, versus column strips assigned to p0–p7.]

Assign portions of the grid to processors in the topology

  • Block Partitioning

    • Eight messages exchanged at each iteration

    • Data exchanged per message is n/√P

  • Strip Partitioning

    • Four messages exchanged at each iteration

    • Data exchanged per message is n

  • Question: Which is better?



Parallel Implementation

[Figure: processor Pi exchanges boundary cells with its neighbors holding the cells to the north, south, east, and west.]

  • Algorithm Modifications

    • Declare “ghost” rows and columns to hold adjacent data (10 × 10 storage for an 8 × 8 block)

    • Exchange data with neighbor processors

    • Perform the calculation for the local grid cells


Heat Distribution Partitioning

Main logic

    For each iteration
        For each local point
            Compute new temperature
        SendRcv(row-1, col, point)
        SendRcv(row+1, col, point)
        SendRcv(row, col-1, point)
        SendRcv(row, col+1, point)

SendRcv(row, col, point)

    if (row, col) is not local
        if myrank is even
            Send(point, p(row,col))
            Recv(point, p(row,col))
        else
            Recv(point, p(row,col))
            Send(point, p(row,col))


Fully Synchronized Example

  • Data Parallel Computations

    • Simultaneously apply the same operation to different data

  • Sequential Code

    for (i=1; i<n; i++) a[i] = someFunction(a[i])

  • Shared Memory Code

    Forall (i=0; i<n; i++) {bodyOfInstructions}

    • In these cases, the end of the forall loop acts as a natural barrier

  • Distributed processing

    For local a[i]; {someFunction(a[i])}

    barrier();


Data Parallel Example

[Figure: A[] += k — processors p0 … pn-1 each apply A[i] += k to their own element.]

  • All processors execute instructions in “lock step”

  • Forall (i=0; i<n; i++) a[i] += k

  • Note: Multicomputer configurations partition the array into blocks, one block per processor


Prefix Sum Problem

  • Definition: Given numbers a[i], i = 0 … n-1, the prefix sum replaces a[i] with a[0] + a[1] + … + a[i]

  • Application: Radix sort

  • Note: The prefix-sum algorithm works for any associative operation

  • Sequential code

    for (j=0; j<lg(n); j++)
        for (i=2^j; i<n; i++) a[i] += a[i-2^j];

  • Parallel shared memory code

    for (j=0; j<lg(n); j++)
        forall (i=2^j; i<n; i++) a[i] += a[i-2^j];

  • Parallel distributed memory code

    for (j=1; j<=lg(n); j++)
        if (myrank >= 2^(j-1))
            receive(sum, myrank − 2^(j-1))
            a[myrank] += sum
        else
            send(a[myrank], myrank + 2^(j-1))



Synchronous Iteration

  • Processes synchronize at each iteration step

  • Example: Simulation of Natural Processes

  • Shared memory code

    for (j=0; j<n; j++) forall (i=0; i<N; i++) body(i);

  • Distributed memory code

    for (j=0; j<n; j++)

    body(myRank);

    barrier();


Example: n Equations with n Unknowns

a[n-1][0]x[0] + a[n-1][1]x[1] + … + a[n-1][n-1]x[n-1] = b[n-1]
    ⋮
a[k][0]x[0] + a[k][1]x[1] + … + a[k][n-1]x[n-1] = b[k]
    ⋮
a[1][0]x[0] + a[1][1]x[1] + … + a[1][n-1]x[n-1] = b[1]
a[0][0]x[0] + a[0][1]x[1] + … + a[0][n-1]x[n-1] = b[0]

  • Or rewrite each equation to isolate x[k]:

    x[k] = (b[k] − a[k][0]x[0] − … − a[k][k-1]x[k-1] − a[k][k+1]x[k+1] − … − a[k][n-1]x[n-1]) / a[k][k]
         = (b[k] − ∑j≠k a[k][j]x[j]) / a[k][k]


Jacobi Iteration

  • Jacobi Iteration

    xnew[i] = initial guess
    DO
        x[i] = xnew[i]
        xnew[i] = calculated next guess
    UNTIL ∑i |xnew[i] − x[i]| < tolerance

  • Jacobi iteration always converges if the matrix is diagonally dominant: |a[k][k]| > ∑j≠k |a[k][j]|

[Figure: the error decreases from iteration i to i+1 as the iteration proceeds.]


Parallel Jacobi Code

[Figure: each processor computes its xnew[i]; Allgather() collects every xnew[i] into each processor's x array.]

x[i] = b[i]
DO for each i
    sum = −a[i][i] * x[i]
    FOR (j=0; j<n; j++) sum += a[i][j] * x[j]
    xnew[i] = (b[i] − sum) / a[i][i]
    allgather(xnew[i])
    barrier()
UNTIL iterations > MAX or ∑i |xnew[i] − x[i]| < tolerance


Additional Jacobi Notes

  • What if P (processor count) < n?

    • Answer: Allocate blocks of variables to processors

  • Block Allocation

    • Allocate consecutive xi to processors

  • Cyclic Allocation

    • Allocate x0, xP, … to p0

    • Allocate x1, xp+1, … to p1 … etc.

  • Question: Which allocation scheme is better?

[Figure: Jacobi performance — execution time versus number of processors (4–24), split into computation and communication components.]


Cellular Automata

Definition

  • The system consists of a finite grid of cells

  • Each cell can assume a finite number of states

  • Neighbor cells affect each cell according to a rule set

  • All cell changes of state occur simultaneously

  • The system iterates through a number of generations

  • Serious Applications:

    • Fluid and gas dynamics

    • Biological growth

    • Airplane wing airflow

    • Erosion modeling

    • Groundwater pollution


Conway’s Game of Life

  • The grid is a two-dimensional array of cells

    • The grid ends can optionally wrap around (like a torus)

  • Each cell

    • Can hold one “organism”

    • There are eight neighbor cells

      • North, Northeast, East, Southeast, South, Southwest, West, Northwest

  • Rules (run the simulation over many generations)

    • An organism dies of loneliness if 0 or 1 organisms live in the neighbor cells

    • An organism survives if 2 or 3 organisms live in the neighbor cells

    • An empty cell with exactly 3 living neighbors gives birth to a new organism

    • An organism dies of overpopulation if 4 or more organisms live in the neighbor cells


Sharks and Fishes

  • The grid (ocean) is modeled by a three-dimensional array

    • The grid ends can optionally wrap around (like a torus)

  • Each cell

    • Can hold either a fish or a shark, but not both

    • There are twenty-six neighbor cells (a 3 × 3 × 3 neighborhood minus the cell itself)

  • Rules for fish

    • Fish move randomly to empty adjacent cells

    • If there are no empty adjacent cells, fish stay put

    • Fish of breeding age leave a baby fish in the vacating cell

    • Fish die after x generations

  • Rules for sharks

    • Sharks randomly move to adjacent cells with fish, eating the fish

    • If no adjacent cells have fish, the shark moves randomly to empty cells. It stays put if there are no empty cells

    • Sharks of breeding age leave a baby shark in the vacating cell

    • Sharks die if they don’t eat a fish for y generations

