
# Compiling High Performance Fortran

Allen and Kennedy, Chapter 14

### Overview

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

### Motivation for HPF

• Target: scalable distributed-memory multiprocessors, which require message passing to communicate data between processors

• Approach 1: use MPI calls in Fortran/C code

Consider the following sum reduction:

```fortran
PROGRAM SUM
  REAL A(10000)
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END
```

The hand-written message-passing version (100 processors, PID 0 through 99):

```fortran
PROGRAM SUM
  REAL A(100), BUFF(100)
  ! PID 0 distributes 100-element blocks (BUFF holds each block; input elided)
  IF (PID == 0) THEN
    DO IP = 0, 99
      IF (IP == 0) THEN
        A(1:100) = BUFF(1:100)
      ELSE
        SEND(IP, BUFF, 100)
      ENDIF
    ENDDO
  ELSE
    RECV(0, A, 100)
  ENDIF
  ! Actual (local) sum reduction code here
  ! Chain the partial sums from PID 0 through PID 99 back to PID 0
  IF (PID == 0) SEND(1, SUM, 1)
  IF (PID > 0) THEN
    RECV(PID-1, T, 1)
    SUM = SUM + T
    IF (PID < 99) THEN
      SEND(PID+1, SUM, 1)
    ELSE
      SEND(0, SUM, 1)
    ENDIF
  ENDIF
  IF (PID == 0) THEN
    RECV(99, SUM, 1)    ! collect the final total from PID 99
    PRINT SUM
  ENDIF
END
```
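The chained reduction above can be sketched serially in Python; this is a simulation of the message chain, not MPI code, and `chained_sum` is a made-up helper name:

```python
# Serial simulation of the SPMD sum reduction: nprocs "processors", each
# owning one block of A, chain their partial sums from PID 0 through the
# last PID and back to PID 0.
def chained_sum(a, nprocs=100):
    blk = len(a) // nprocs              # block size (100 for len(a)=10000)
    carried = 0.0                       # models the SUM value passed along
    for pid in range(nprocs):           # simulate processors in chain order
        local = a[pid * blk:(pid + 1) * blk]   # block this PID received
        partial = sum(local)            # local sum reduction
        carried = carried + partial     # SUM = SUM + T after RECV(PID-1,T,1)
    return carried                      # value finally printed by PID 0
```

Running it on a uniform array reproduces the sequential sum.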

### Motivation for HPF

With the message-passing approach:

• The user has to rewrite the program in SPMD (Single Program, Multiple Data) form

• The user has to manage data movement (send and receive), data placement, and synchronization

• This is messy and hard to master

### Motivation for HPF

• Approach 2: Use HPF

• HPF is an extended version of Fortran 90

• HPF has Fortran 90 features plus a few directives

• Directives

• Tell how data is laid out in processor memories in a parallel machine configuration. For example,

• !HPF\$ DISTRIBUTE A(BLOCK)

• Assist in identifying parallelism. For example,

• !HPF\$ INDEPENDENT

The same sum reduction code:

```fortran
PROGRAM SUM
  REAL A(10000)
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END
```

When written in HPF:

```fortran
PROGRAM SUM
  REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END
```

• Minimal modification, easy to write

• But now the compiler has to do more work

### Motivation for HPF

• The user only writes a few simple directives; there is no need to write the whole program in SPMD form

• The user does not manage data movement (send and receive) or synchronization

• Simple and easy to master


### HPF Compilation Overview

Running example:

```fortran
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
  DO I = 2, 10000
S1: A(I) = B(I-1) + C
  ENDDO
  DO I = 1, 10000
S2: B(I) = A(I)
  ENDDO
ENDDO
```

• Dependence analysis: used for communication analysis; the key fact is that no dependence is carried by the I loops

• Distribution analysis: both A and B are BLOCK-distributed, 100 elements per processor

• Computation partitioning: partition so as to distribute the work of the I loops across the processors

### HPF Compilation Overview

• Communication analysis and placement: communication is required for B(0) on every iteration of the J loop

```fortran
REAL A(100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
  DO I = 2, 100
S1: A(I) = B(I-1) + C
  ENDDO
  DO I = 1, 100
S2: B(I) = A(I)
  ENDDO
ENDDO
```

### HPF Compilation Overview

• Optimization: aggregation of messages, overlapping communication with computation, and recognition of reductions. Here the RECV (with the peeled first iteration) is moved after the independent I loop, so communication overlaps computation:

```fortran
REAL A(100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
  DO I = 2, 100
S1: A(I) = B(I-1) + C
  ENDDO
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
  DO I = 1, 100
S2: B(I) = A(I)
  ENDDO
ENDDO
```


### Basic Loop Compilation

• Distribution Propagation and analysis

• Analyze what distribution holds for a given array at a given point in the program

• Difficult due to

• REALIGN and REDISTRIBUTE directives

• Distribution of formal parameters inherited from calling procedure

• Use “Reaching Decompositions” data flow analysis and its interprocedural version

### Basic Loop Compilation

• For simplicity, assume a single distribution for an array at all points in a subprogram

• For example, suppose an array A of size N is block-distributed over p processors, numbered 0 through p-1

• Block size: B = CEIL(N/p); processor k owns elements A(kB+1) through A(min((k+1)B, N))

### Iteration Partitioning

• Computation partitioning divides the work among the processors: determine which iterations of a loop will be executed on which processor

• Owner-computes rule: iteration I is executed on the owner of A(I)

```fortran
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
DO I = 1, 10000
  A(I) = A(I) + C
ENDDO
```

• With 100 processors, the first 100 iterations run on processor 0, the next 100 on processor 1, and so on
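The BLOCK-distribution arithmetic can be checked with a small Python sketch (function names are illustrative, not from the text):

```python
from math import ceil

# Owner-computes for a BLOCK distribution: element A(I) (1-based, as in
# Fortran) lives on processor floor((I-1)/B), where B = ceil(N/p).
def block_size(n, p):
    return ceil(n / p)

def owner(i, n, p):
    return (i - 1) // block_size(n, p)
```

With N = 10000 and p = 100 this gives B = 100: A(1..100) on processor 0, A(101..200) on processor 1, and so on up to A(9901..10000) on processor 99.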

### Iteration Partitioning

• If multiple statements in a loop form a recurrence, choose one partitioning reference

• The processor responsible for performing iteration I is the owner of the partitioning reference, FLOOR((I-1)/B)

• The set of indices executed on processor k is [kB+1 : (k+1)B], intersected with the loop bounds

### Iteration Partitioning

• Have to map the global loop index I to a local loop index i; the smallest global index assigned to each processor maps to local index 1

```fortran
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO
```

• Iteration I runs on the owner of A(I+1); with block size 100 the mapping from global iteration I to local iteration i is i = I - 100*PID + 1, so each interior processor runs local iterations i = 1, ..., 100

### Iteration Partitioning

• Adjust the array subscripts for local iterations: A(I+1) becomes A(i) and B(I) becomes B(i-1)
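A quick Python check of this mapping, assuming block size 100 as in the example (`local_iter` is an illustrative helper):

```python
# Global-to-local index mapping for A(I+1) = B(I) + C under a BLOCK
# distribution with block size 100 (1-based Fortran-style indices).
# Iteration I runs on the owner of A(I+1); its local index is
#   i = I - 100*PID + 1,  so A(I+1) -> A(i) and B(I) -> B(i-1) locally.
def local_iter(I, pid, blk=100):
    return I - blk * pid + 1
```

For example, global iteration I = 100 is the first iteration on processor 1 (it writes A(101), that processor's first local element), and I = 1 on processor 0 maps to local i = 2.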

### Iteration Partitioning

• For interior processors the code becomes:

```fortran
DO i = 1, 100
  A(i) = B(i-1) + C
ENDDO
```

• Adjust the lower bound of the first processor and the upper bound of the last processor to handle the boundary conditions:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
  A(i) = B(i-1) + C
ENDDO
```
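The partitioned loop can be simulated serially in Python and compared against the sequential semantics (a sketch with made-up helper names; arrays are kept 1-based with index 0 unused):

```python
from math import ceil

def partitioned(b, n, c, blk=100):
    """Run A(I+1) = B(I) + C the way the generated SPMD code does:
    each processor executes only its local iterations."""
    a = [0.0] * (n + 2)
    last_p = ceil((n + 1) / blk) - 1
    for pid in range(last_p + 1):
        lo = 2 if pid == 0 else 1                  # IF (PID == 0) lo = 2
        hi = (n % blk) + 1 if pid == last_p else blk
        for i in range(lo, hi + 1):                # DO i = lo, hi
            I = i + blk * pid - 1                  # invert i = I - blk*PID + 1
            a[I + 1] = b[I] + c                    # A(i) = B(i-1) + C locally
    return a

def sequential(b, n, c):
    a = [0.0] * (n + 2)
    for I in range(1, n + 1):
        a[I + 1] = b[I] + c
    return a

b = [float(i) for i in range(10002)]   # b[I] = I
```

Every global iteration is executed exactly once, including the boundary cases on the first and last processors.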


### Communication Generation

```fortran
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO
```

• A receive is required only for the first local iteration, i.e. global iteration [100p : 100p]

• A send is required only for global iteration [100p+100 : 100p+100], which belongs to the next processor

• No communication is required for iterations in [100p+1 : 100p+99]

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  A(i) = B(i-1) + C
ENDDO
```

But the send must happen in the 101st local iteration:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  IF (i <= hi) THEN
    A(i) = B(i-1) + C
  ENDIF
  IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
ENDDO
```

### Communication Generation

Move the SEND outside the loop:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO i = lo, hi
    IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
```

### Communication Generation

Move the RECV outside the loop by peeling the first iteration:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  ! lo = MAX(lo,1+1) == 2
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
```

### Communication Generation

Finally, move the SEND before the RECV, so every processor posts its send without first waiting on its left neighbor:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
ENDIF
```
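Because B is only read in this loop, the all-sends-first schedule is legal; it can be simulated serially and checked against the sequential result (Python sketch, illustrative names, 1-based arrays with index 0 unused):

```python
from math import ceil

def spmd_round(b, n, c, blk=100):
    """Simulate the final placement for A(I+1) = B(I) + C: every
    processor's SEND completes first, then each processor does the
    peeled first iteration and its independent local iterations."""
    a = [0.0] * (n + 2)
    last_p = ceil((n + 1) / blk) - 1
    # Phase 1: all SENDs -- processor p ships its last B element right.
    mailbox = {p + 1: b[blk * (p + 1)] for p in range(last_p)}
    # Phase 2: peeled iteration consumes the received B(0), then DO i = 2, hi.
    for pid in range(last_p + 1):
        if pid != 0:
            a[blk * pid + 1] = mailbox[pid] + c    # A(1) = B(0) + C
        hi = (n % blk) + 1 if pid == last_p else blk
        for i in range(2, hi + 1):
            I = i + blk * pid - 1
            a[I + 1] = b[I] + c
    return a

def sequential(b, n, c):
    a = [0.0] * (n + 2)
    for I in range(1, n + 1):
        a[I + 1] = b[I] + c
    return a

b = [float(i) for i in range(10002)]
```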

### Communication Generation

• When is such rearrangement legal? Model each receive as a copy from a global to a local location, and each send as a copy from a local to a global location:

```fortran
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      B(0) = Bg(0)                        ! RECV
      A(1) = B(0) + C
    ENDIF
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100)    ! SEND
ENDIF
```

The rearrangement is legal because there is no chain of dependences from S1 to S2.

But this loop, which carries a recurrence on A:

```fortran
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK)
...
DO I = 1, N
  A(I+1) = A(I) + C
ENDDO
```

would be rewritten as:

```fortran
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      A(0) = Ag(0)                        ! RECV
      A(1) = A(0) + C
    ENDIF
  DO i = 2, hi
    A(i) = A(i-1) + C
  ENDDO
S2: IF (PID /= lastP) Ag(100) = A(100)    ! SEND
ENDIF
```

Here the rearrangement would not be correct: the value A(100) sent at S2 depends, through the recurrence, on the value received at S1, so S2 cannot be moved before S1.


### Communication Vectorization

```fortran
REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = B(I,J) + C
  ENDDO
ENDDO
```

Basic loop compilation gives:

```fortran
DO J = 1, M
  lo = 1
  IF (PID == 0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID == lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDIF
ENDDO
```

### Communication Vectorization

Hoist the bounds computation and distribute the J loop around the communication statements:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
  ENDDO
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
  ENDDO
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
ENDIF
```

### Communication Vectorization

With the J loop distributed, each communication loop can be replaced by a single vectorized message:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1) THEN
    RECV(PID-1, B(0,1:M), M)
    DO J = 1, M
      A(1,J) = B(0,J) + C
    ENDDO
  ENDIF
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
ENDIF
```
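The payoff of vectorizing the communication can be seen with a standard latency/bandwidth (alpha-beta) cost model; the model and the numbers below are illustrative assumptions, not from the text:

```python
def comm_cost(n_messages, elems_per_message, alpha, beta):
    """Time for n_messages messages of elems_per_message elements each:
    alpha = per-message startup cost, beta = per-element transfer cost."""
    return n_messages * (alpha + beta * elems_per_message)

# Assumed parameters: startup cost dominates small messages.
M, alpha, beta = 10000, 100.0, 1.0
scalar_cost = comm_cost(M, 1, alpha, beta)   # M single-element SEND/RECVs
vector_cost = comm_cost(1, M, alpha, beta)   # one vectorized M-element message
```

Aggregation pays the startup cost alpha once instead of M times, so the vectorized version wins whenever alpha is nontrivial.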

### Communication Vectorization

In copy form, the per-iteration communication looks like this:

```fortran
DO J = 1, M
  lo = 1
  IF (PID == 0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID == lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
S1: IF (PID /= lastP) Bg(100,J) = B(100,J)
    IF (lo == 1) THEN
S2:   B(0,J) = Bg(0,J)
S3:   A(1,J) = B(0,J) + C
    ENDIF
    DO i = 2, hi
S4:   A(i,J) = B(i-1,J) + C
    ENDDO
  ENDIF
ENDDO
```

Communication statements generated in an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by that outer loop.

### Communication Vectorization

```fortran
REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + B(I,J)
  ENDDO
ENDDO
```

Can the sends be done before the receives? Can the communication be vectorized?

```fortran
REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J+1) = A(I,J) + C
  ENDDO
ENDDO
```

Can the sends be done before the receives? Can the communication be fully vectorized?

### Communication Vectorization

Starting from the rearranged 1D code:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
ENDIF
```

Moving the independent loop L1 between the SEND and the RECV lets the computation overlap the communication latency:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
ENDIF
```

### Overlapping Communication and Computation

```fortran
REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + C
  ENDDO
ENDDO
```

Initial code generation for the I loop gives:

```fortran
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J), 1)
      A(1,J) = A(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = A(i-1,J) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
  ENDDO
ENDIF
```

### Pipelining

• The J-loop communication could be vectorized, but full vectorization serializes the processors and gives up the parallelism

• Instead, exploit pipelined parallelism with communication: each processor computes its block of a column, sends the boundary value, and the next processor starts as soon as it arrives

• Fine-grain pipelining pays one message per column, so the communication overhead is high; blocking the J loop by a factor K amortizes it

Strip-mining the J loop by a blocking factor K (bounds setup as before) gives:

```fortran
IF (PID <= lastP) THEN
  DO J = 1, M, K
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J:J+K-1), K)
      DO jj = J, J+K-1
        A(1,jj) = A(0,jj) + C
      ENDDO
    ENDIF
    DO jj = J, J+K-1
      DO i = 2, hi
        A(i,jj) = A(i-1,jj) + C
      ENDDO
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
  ENDDO
ENDIF
```
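The blocking factor K trades pipeline fill time against per-message overhead. A rough cost model (the model form and all parameters are assumptions for illustration, not from the text):

```python
def pipeline_time(p, M, K, alpha, beta, w):
    """Rough pipelined-execution time: M/K blocks flow through a p-stage
    pipeline; each stage costs one message (alpha + beta*K) plus the
    compute for K columns (w per column)."""
    stage = alpha + beta * K + w * K
    return (p - 1 + M / K) * stage

# Assumed parameters: p processors, M columns, message startup alpha.
p, M, alpha, beta, w = 100, 10000, 50.0, 1.0, 1.0
t_fine    = pipeline_time(p, M, 1, alpha, beta, w)    # K = 1: a message per column
t_blocked = pipeline_time(p, M, 100, alpha, beta, w)  # K = 100: amortized startup
t_vector  = pipeline_time(p, M, M, alpha, beta, w)    # K = M: fully serialized
```

Under this model a moderate K beats both extremes: K = 1 pays the startup cost M times per processor, while K = M removes the pipelining entirely.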

### Other Optimizations

• Alignment and replication

• Identification of common recurrences

• Storage management

• Minimize the temporary storage used for communication

• Temporary storage should take at most as much space as the arrays themselves

• Interprocedural optimizations

### Summary

• HPF is easy to code

• But hard to compile

• Steps required to compile HPF programs

• Basic loop compilation

• Communication generation

• Optimizations

• Communication vectorization

• Overlapping communication with computation

• Pipelining