## PowerPoint Slideshow about ' Definitions' - byron-osborne


Definitions

- A synchronous application is one in which all processes must reach certain points before execution continues.
- Local synchronization requires only a subset of processes (usually neighbors) to reach a synchronization point before execution continues.
- A barrier is the basic message-passing mechanism for synchronizing processes.
- Deadlock occurs when a group of processors waits permanently for messages that can never arrive, because the sending processes are themselves permanently waiting for messages.

Barrier Illustration

[Figure: processes shown either executing or waiting at the barrier]

- C: MPI_Barrier(MPI_COMM_WORLD);
- mpiJava: MPI.COMM_WORLD.Barrier();

Counter (linear) Barrier

- Master Processor
for (i=0; i<P; i++) // Arrival phase
    receive null message from any processor
for (i=0; i<P; i++) // Departure phase
    send null message to release a slave

- Slave Processors
send null message to enter the barrier
receive null message for barrier release

Note: Separating the arrival and departure phases keeps a processor from re-entering the barrier before the previous release has completed.
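The arrival and departure phases can be tried out without an MPI installation. The sketch below is only a simulation: Python threads stand in for processors and `queue.Queue` objects stand in for message channels. The names `run_counter_barrier`, `to_master`, and `to_worker` are invented for this illustration and do not come from the slides.

```python
import queue
import threading


def run_counter_barrier(num_workers):
    """Simulate one counter (linear) barrier and return the event log."""
    to_master = queue.Queue()                                  # arrival messages
    to_worker = [queue.Queue() for _ in range(num_workers)]    # release messages
    log = []                                                   # list.append is atomic in CPython

    def master():
        for _ in range(num_workers):        # arrival phase: count null messages
            to_master.get()
        log.append("all arrived")
        for mailbox in to_worker:           # departure phase: release every worker
            mailbox.put(None)

    def worker(rank):
        log.append(f"arrive {rank}")
        to_master.put(None)                 # null message: enter the barrier
        to_worker[rank].get()               # block until the release message arrives
        log.append(f"leave {rank}")

    threads = [threading.Thread(target=master)]
    threads += [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log, every "arrive" entry precedes "all arrived" and every "leave" entry follows it, which is the property the barrier is meant to guarantee.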

Tree (non-linear) Barrier

[Figure: tree of processors P0 to P7, showing the entry phase and the release phase]

Note: Implementation logic is similar to divide and conquer

Butterfly Barrier

[Figure: processors P0 to P7 drawn at each stage of the pairwise exchange]

- Stage 1: p0↔p1; p2↔p3; p4↔p5; p6↔p7
- Stage 2: p0↔p2; p1↔p3; p4↔p6; p5↔p7
- Stage 3: p0↔p4; p1↔p5; p2↔p6; p3↔p7
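The pairing in the three stages follows a simple rule: at each stage a processor's partner is its own rank with one bit flipped. The sketch below generates the schedule that way, assuming the processor count is a power of two; `butterfly_schedule` is a name made up for this example.

```python
def butterfly_schedule(num_procs):
    """Return one list of (rank, partner) pairs per stage, with rank < partner."""
    # Assumption of this sketch: the processor count is a power of two.
    assert num_procs > 0 and (num_procs & (num_procs - 1)) == 0
    stages = []
    distance = 1
    while distance < num_procs:
        # XOR with the distance flips a single bit of the rank.
        pairs = [(r, r ^ distance) for r in range(num_procs) if r < (r ^ distance)]
        stages.append(pairs)
        distance *= 2
    return stages
```

For eight processors this produces the three stages listed above, so a barrier completes in lg(P) exchange steps rather than the P steps of the counter barrier.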

Local Synchronization

Synchronize with neighbors before proceeding

- Even Numbered Processors
send null message to processor i-1
receive null message from processor i-1
send null message to processor i+1
receive null message from processor i+1

- Odd Numbered Processors
receive null message from processor i+1
send null message to processor i+1
receive null message from processor i-1
send null message to processor i-1

- Notes:
  - Local synchronization is an incomplete barrier: processors exit after receiving messages from their neighbors
  - Deadlock can occur if the message passing order is incorrect
  - MPI_Sendrecv() and MPI_Sendrecv_replace() are deadlock free
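To see why the order of sends and receives matters, the sketch below models the even/odd ordering above as a list of operations per processor and runs it under the strictest assumption, namely that a send completes only when the matching receive is posted (a rendezvous). Real MPI sends may be buffered, so this is a worst-case model rather than a description of any particular MPI implementation. The names `neighbor_program` and `completes` are hypothetical.

```python
def neighbor_program(rank, num_procs):
    """Operation list for one processor in a line of num_procs processors."""
    left, right = rank - 1, rank + 1
    if rank % 2 == 0:   # even ranks send first
        ops = [("send", left), ("recv", left), ("send", right), ("recv", right)]
    else:               # odd ranks receive first
        ops = [("recv", right), ("send", right), ("recv", left), ("send", left)]
    # Processors at either end of the line skip the missing neighbor.
    return [(kind, peer) for kind, peer in ops if 0 <= peer < num_procs]


def completes(programs):
    """Simulate rendezvous message passing; True if every program finishes."""
    pc = [0] * len(programs)            # next operation index for each processor
    progressed = True
    while progressed:
        progressed = False
        for src, prog in enumerate(programs):
            if pc[src] == len(prog):
                continue
            kind, dst = prog[pc[src]]
            if kind != "send" or pc[dst] == len(programs[dst]):
                continue
            # A send proceeds only if the destination is waiting for this sender.
            if programs[dst][pc[dst]] == ("recv", src):
                pc[src] += 1
                pc[dst] += 1
                progressed = True
    return all(pc[r] == len(programs[r]) for r in range(len(programs)))
```

With the even/odd ordering every operation finds its match. If two neighbors both try to send first, neither send can complete and the simulation reports a deadlock.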

Local Synchronization Example

- Heat Distribution Problem
  - Goal: determine the final temperature at each point of an n x n grid
  - Initial boundary condition: the initial temperatures at the end points are known
  - Cannot proceed to the next iteration until local synchronization completes

DO
    Average each grid point with its neighbors
UNTIL temperature changes are small enough

New value = (∑ neighbors)/4

Sequential Heat Distribution Code

Initialize rows 0,n and columns 0,n of g and h
iteration = 0
DO
    FOR (i=1; i<n; i++)
        FOR (j=1; j<n; j++)
            IF (iteration%2) h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1])/4
            ELSE g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1])/4
    iteration++
UNTIL max(|g[i][j] - h[i][j]|) < tolerance or iteration > MAX
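A runnable version of this loop can look like the following sketch. It swaps two grids after each sweep instead of testing `iteration % 2`, which has the same effect. The function name, the default tolerance, and the iteration cap are choices made for this illustration.

```python
def heat_distribution(grid, tolerance=1e-4, max_iterations=10_000):
    """Jacobi relaxation on a square grid; boundary rows and columns stay fixed.

    Assumes max_iterations >= 1.
    """
    n = len(grid) - 1                      # interior indices run from 1 to n-1
    g = [row[:] for row in grid]           # values read during a sweep
    h = [row[:] for row in grid]           # values written during a sweep
    for iteration in range(1, max_iterations + 1):
        change = 0.0
        for i in range(1, n):
            for j in range(1, n):
                h[i][j] = (g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1]) / 4
                change = max(change, abs(h[i][j] - g[i][j]))
        g, h = h, g                        # the newest values are now in g
        if change < tolerance:
            break
    return g, iteration
```

For example, a 5 x 5 grid whose top edge is held at 100 and whose other edges are held at 0 settles to 25 at the center point, which follows from the symmetry of the problem.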

Block or Strip Partitioning

Assign portions of the grid to processors in the topology

- Block Partitioning
  - Eight messages exchanged at each iteration
  - Data exchanged per message is n/√P
- Strip Partitioning
  - Four messages exchanged at each iteration
  - Data exchanged per message is n
- Question: Which is better?
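One way to approach the question is to put numbers on it. The sketch below assumes the common linear cost model in which one message of m values costs `t_startup + m * t_data`; the slides do not state a cost model, so both the model and the parameter names are assumptions.

```python
import math


def comm_time_block(n, p, t_startup, t_data):
    """Block partitioning: eight messages of n / sqrt(p) values each."""
    return 8 * (t_startup + (n / math.sqrt(p)) * t_data)


def comm_time_strip(n, p, t_startup, t_data):
    """Strip partitioning: four messages of n values each (p does not matter)."""
    return 4 * (t_startup + n * t_data)
```

Under this model strips win when the startup time dominates, specifically when `t_startup > n * (1 - 2 / sqrt(p)) * t_data`, and blocks win otherwise. With n = 1024, p = 16, and t_data = 50, the crossover is at t_startup = 25600.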

[Figure: the grid partitioned into blocks (p0 to p15) and into column strips (p0 to p7)]

Parallel Implementation

[Figure: processor Pi and the cells it exchanges with its neighbors to the east, west, and south]

- Algorithm Modifications
  - Declare “ghost” rows to hold adjacent data (10 x 10 for 8 x 8 block)
  - Exchange data with neighbor processors
  - Perform the calculation for the local grid cells

Heat Distribution Partitioning

Main logic
for each iteration
    for each point
        compute new temperature
        SendRcv(row-1, col, point)
        SendRcv(row+1, col, point)
        SendRcv(row, col-1, point)
        SendRcv(row, col+1, point)

SendRcv(row, col, point)
    if (row, col) is not local
        if myrank is even
            Send(point, p[row,col])
            Recv(point, p[row,col])
        else
            Recv(point, p[row,col])
            Send(point, p[row,col])

Here p[row,col] denotes the processor that holds cell (row, col).
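The ghost-row idea can be checked without any message passing by simulating the strips in a single process. In the sketch below each strip is a list of rows with one extra row above and below; copying rows between lists stands in for the Send/Recv pairs. It uses row strips rather than blocks to keep the code short, and all function names are invented for the example.

```python
def split_into_strips(grid, num_strips):
    """Give each strip its own rows plus one ghost row above and one below."""
    interior = len(grid) - 2
    assert interior % num_strips == 0, "this sketch assumes an even split"
    rows = interior // num_strips
    return [[row[:] for row in grid[s * rows:(s + 1) * rows + 2]]
            for s in range(num_strips)]


def exchange_ghost_rows(strips):
    """Copy each strip's edge rows into the neighboring strip's ghost rows."""
    for s in range(len(strips) - 1):
        upper, lower = strips[s], strips[s + 1]
        upper[-1] = lower[1][:]     # first owned row of the lower strip
        lower[0] = upper[-2][:]     # last owned row of the upper strip


def relax_strip(strip):
    """One averaging sweep over the owned rows; ghost rows are only read."""
    new = [row[:] for row in strip]
    for i in range(1, len(strip) - 1):
        for j in range(1, len(strip[i]) - 1):
            new[i][j] = (strip[i - 1][j] + strip[i + 1][j]
                         + strip[i][j - 1] + strip[i][j + 1]) / 4
    return new


def parallel_heat(grid, num_strips, iterations):
    """Run a fixed number of sweeps on the strips and reassemble the grid."""
    strips = split_into_strips(grid, num_strips)
    for _ in range(iterations):
        exchange_ghost_rows(strips)                 # local synchronization point
        strips = [relax_strip(s) for s in strips]
    result = [grid[0][:]]
    for s in strips:
        result.extend(s[1:-1])                      # drop the ghost rows
    result.append(grid[-1][:])
    return result
```

Because every strip sees exactly the neighbor values it would see in the undivided grid, the result matches a sweep over the whole grid, which `relax_strip(grid)` computes when the full grid is treated as a single strip.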

Fully Synchronized Example

- Data Parallel Computations
  - Simultaneously apply the same operation to different data
- Sequential Code
for (i=0; i<n; i++) a[i] = someFunction(a[i]);
- Shared Memory Code
forall (i=0; i<n; i++) {bodyOfInstructions}
- In these cases, the end of the forall loop acts as a natural barrier
- Distributed processing
for each local a[i] { a[i] = someFunction(a[i]); }
barrier();

Data Parallel Example

[Figure: A[] += k carried out as A[0] += k, A[1] += k, …, A[n-1] += k, one element per processor (p0, p1, …, pn)]

- All processors execute instructions in “lock step”
- forall (i=0; i<n; i++) a[i] += k
- Note: On a multicomputer, the numbers are partitioned into blocks, one block per processor
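The block-partitioned form of `a[i] += k` followed by a barrier can be imitated with threads. This is a minimal sketch in which Python's `threading.Barrier` plays the role of `barrier()`; the function name and the idea of recording a per-worker total after the barrier are additions made for the illustration.

```python
import threading


def add_constant_in_blocks(values, k, num_workers):
    """Each worker adds k to its own block, then waits at a barrier."""
    barrier = threading.Barrier(num_workers)
    block = -(-len(values) // num_workers)          # ceiling division
    totals = [None] * num_workers

    def worker(rank):
        start = rank * block
        stop = min(start + block, len(values))
        for i in range(start, stop):
            values[i] += k
        barrier.wait()                  # no worker continues until all blocks are done
        totals[rank] = sum(values)      # safe: every update has been made

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return values, totals
```

Every worker computes the same total after the barrier. Without the `barrier.wait()` call a fast worker could sum the array while other blocks were still being updated.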

Prefix Sum Problem

- Definition: Given numbers a[i], 0 <= i < n, the prefix sum of a[i] is: a[i] += a[0] + a[1] + … + a[i-1]
- Note: The prefix sum algorithm works for any associative operation
- Application: Radix Sort
- Sequential code (i runs downward so that a[i-2^j] still holds its value from the previous stage)
for (j=0; j<lg(n); j++) for (i=n-1; i>=2^j; i--) a[i] += a[i-2^j];
- Parallel shared memory code
for (j=0; j<lg(n); j++) forall (i=2^j; i<n; i++) a[i] += a[i-2^j];
- Parallel distributed memory code
for (j=1; j<=lg(n); j++)
    if (myrank + 2^(j-1) < n) send(a[myrank], myrank + 2^(j-1))
    if (myrank >= 2^(j-1))
        receive(sum, myrank - 2^(j-1))
        a[myrank] += sum
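The staged algorithm can be simulated sequentially by taking a snapshot of the array at the start of each stage, so that every update reads the values from before the stage, as the lock-step `forall` would. The function name `parallel_prefix_sum` is chosen for this sketch.

```python
def parallel_prefix_sum(values):
    """Inclusive prefix sum computed in about lg(n) lock-step stages."""
    a = list(values)
    distance = 1                        # distance is 2^j in stage j
    while distance < len(a):
        old = a[:]                      # every "processor" reads pre-stage values
        for i in range(distance, len(a)):
            a[i] = old[i] + old[i - distance]
        distance *= 2
    return a
```

For the input [3, 1, 4, 1, 5, 9, 2, 6] the three stages produce [3, 4, 8, 9, 14, 23, 25, 31]. Replacing `+` with any other associative operation, such as `max`, gives the corresponding prefix result.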

Synchronous Iteration

- Processes synchronize at each iteration step
- Example: Simulation of Natural Processes
- Shared memory code
for (j=0; j<n; j++) forall (i=0; i<N; i++) body(i);

- Distributed memory code
for (j=0; j<n; j++)

body(myRank);

barrier();

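Example: n equations of n unknowns

$$
\begin{aligned}
a_{0,0}x_0 + a_{0,1}x_1 + \dots + a_{0,n-1}x_{n-1} &= b_0 \\
a_{1,0}x_0 + a_{1,1}x_1 + \dots + a_{1,n-1}x_{n-1} &= b_1 \\
&\;\;\vdots \\
a_{k,0}x_0 + a_{k,1}x_1 + \dots + a_{k,n-1}x_{n-1} &= b_k \\
&\;\;\vdots \\
a_{n-1,0}x_0 + a_{n-1,1}x_1 + \dots + a_{n-1,n-1}x_{n-1} &= b_{n-1}
\end{aligned}
$$

- Or rewrite equations as follows:

$$
x_k = \frac{b_k - a_{k,0}x_0 - \dots - a_{k,k-1}x_{k-1} - a_{k,k+1}x_{k+1} - \dots - a_{k,n-1}x_{n-1}}{a_{k,k}} = \frac{b_k - \sum_{j \ne k} a_{k,j}x_j}{a_{k,k}}
$$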

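Jacobi Iteration

- Jacobi Iteration
xnew[i] = initial guess
DO
    x[i] = xnew[i]
    xnew[i] = calculated next guess
UNTIL ∑i |xnew[i] - x[i]| < tolerance

- Jacobi iteration always converges if the matrix is strictly diagonally dominant, that is, if $|a_{k,k}| > \sum_{j \ne k} |a_{k,j}|$ for every row k

[Figure: error plotted against iteration]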

Parallel Jacobi Code

[Figure: allgather() collects xnew[0], xnew[1], …, xnew[n-1] into x on every processor]

x[i] = b[i]
DO for each i
    sum = -a[i][i] * x[i]
    FOR (j=0; j<n; j++) sum += a[i][j] * x[j]
    xnew[i] = (b[i] - sum)/a[i][i]
    allgather(xnew[i])
    barrier()
UNTIL iterations > MAX or ∑i |xnew[i] - x[i]| < tolerance
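A sequential sketch of the same iteration is given below. Building the complete `x_new` list before replacing `x` plays the role that `allgather` plays in the parallel version, since every component of the next iterate is computed from the previous iterate only. The function name and default parameters are assumptions made for the example.

```python
def jacobi_solve(a, b, tolerance=1e-10, max_iterations=1000):
    """Solve a x = b by Jacobi iteration, starting from x[i] = b[i]."""
    n = len(b)
    x = list(b)
    for _ in range(max_iterations):
        x_new = []
        for i in range(n):
            # Sum of the off-diagonal terms of row i, using the previous iterate.
            total = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x_new.append((b[i] - total) / a[i][i])
        error = sum(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new                       # all components are replaced together
        if error < tolerance:
            break
    return x
```

For the strictly diagonally dominant system with rows (4, 1, 1), (1, 5, 2), (1, 2, 6) and right-hand side (9, 17, 23), the iteration approaches the solution x = (1, 2, 3).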

Additional Jacobi Notes

- What if P (processor count) < n?
- Answer: Allocate blocks of variables to processors

- Block Allocation
  - Allocate consecutive x[i] to each processor
- Cyclic Allocation
  - Allocate x[0], x[P], … to p0
  - Allocate x[1], x[P+1], … to p1, and so on

- Question: Which allocation scheme is better?
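The two schemes are easy to compare side by side once they are written as index lists. The helper names below are hypothetical, and the block version assumes that a ceiling division is an acceptable way to handle an n that is not a multiple of P.

```python
def block_allocation(n, p):
    """Processor r receives a run of consecutive indices."""
    size = -(-n // p)                   # ceiling division
    return [list(range(r * size, min((r + 1) * size, n))) for r in range(p)]


def cyclic_allocation(n, p):
    """Processor r receives indices r, r + p, r + 2p, and so on."""
    return [list(range(r, n, p)) for r in range(p)]
```

With n = 8 and P = 4, block allocation gives p0 the unknowns 0 and 1, while cyclic allocation gives it 0 and 4.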

Jacobi Performance

[Figure: computation time and communication time plotted against the number of processors (4 to 24)]

Cellular Automata

Definition

- The system has a finite grid of cells
- Each cell can assume a finite number of states
- Neighbor cells affect a cell according to a rule set
- All cell changes of state occur simultaneously
- The system iterates through a number of generations

- Serious Applications:
- Fluid and gas dynamics
- Biological growth
- Airplane wing airflow
- Erosion modeling
- Groundwater pollution

Conway’s Game of Life

- The grid is a two-dimensional array of cells
  - The grid ends can optionally wrap around (like a torus)
- Each cell
  - Can hold one “organism”
  - Has eight neighbor cells: North, Northeast, East, Southeast, South, Southwest, West, Northwest
- Rules (run the simulation over many generations)
  - An organism dies from loneliness if 0 or 1 organisms live in neighbor cells
  - An organism survives if 2 or 3 organisms live in neighbor cells
  - An empty cell with exactly 3 living neighbors gives birth to an organism
  - An organism dies from overpopulation if 4 or more organisms live in neighbor cells
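The standard rules of the Game of Life fit in a few lines. The sketch below computes one generation on a grid that wraps around like a torus, writing into a new grid so that all cells change state simultaneously. The function name `life_step` and the 0/1 cell encoding are choices made for this example.

```python
def life_step(grid):
    """Advance a 0/1 grid by one generation; the edges wrap around."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        return sum(grid[(r + dr) % rows][(c + dc) % cols]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0))

    new = [[0] * cols for _ in range(rows)]     # separate grid: simultaneous update
    for r in range(rows):
        for c in range(cols):
            n = live_neighbors(r, c)
            if grid[r][c] == 1:
                new[r][c] = 1 if n in (2, 3) else 0     # survival
            else:
                new[r][c] = 1 if n == 3 else 0          # birth
    return new
```

A horizontal row of three organisms (a "blinker") turns into a vertical column of three after one generation and returns to its original shape after two.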

Sharks and Fishes

- The grid (ocean) is modeled by a three-dimensional array
  - The grid ends can optionally wrap around (like a torus)
- Each cell
  - Can hold either a fish or a shark, but not both
  - Has twenty-six neighbor cells
- Rules for fish
  - Fish move randomly to empty adjacent cells
  - If there are no empty adjacent cells, fish stay put
  - Fish of breeding age leave a baby fish in the vacated cell
  - Fish die after x generations
- Rules for sharks
  - Sharks move randomly to adjacent cells that contain fish, eating the fish
  - If no adjacent cell has a fish, the shark moves randomly to an empty adjacent cell; it stays put if there are no empty cells
  - Sharks of breeding age leave a baby shark in the vacated cell
  - Sharks die if they do not eat a fish for y generations
