
# Compiling High Performance Fortran


### Compiling High Performance Fortran

Allen and Kennedy, Chapter 14

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

• Require “Message Passing” to communicate data between processors

• Approach 1: Use MPI calls in Fortran/C code

Scalable Distributed Memory Multiprocessor

```
PROGRAM SUM
  REAL A(10000)
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END
```

```
PROGRAM SUM
  REAL A(100), BUFF(100)
  IF (PID == 0) THEN
    DO IP = 0, 99
      ! read the next 100 elements into BUFF
      IF (IP == 0) THEN
        A(1:100) = BUFF(1:100)
      ELSE
        SEND(IP, BUFF, 100)
      ENDIF
    ENDDO
  ELSE
    RECV(0, A, 100)
  ENDIF
  ! actual sum reduction code here: each processor sums its A(1:100)
  IF (PID == 0) SEND(1, SUM, 1)
  IF (PID > 0) THEN
    RECV(PID-1, T, 1)
    SUM = SUM + T
    IF (PID < 99) THEN
      SEND(PID+1, SUM, 1)
    ELSE
      SEND(0, SUM, 1)
    ENDIF
  ENDIF
  IF (PID == 0) THEN
    RECV(99, SUM, 1)   ! collect the final sum
    PRINT SUM
  ENDIF
END
```

Motivation for HPF

MPI implementation

• User has to rewrite the program in SPMD form [Single Program Multiple Data]

• User has to manage data movement [send & receive], data placement and synchronization

• Messy and hard to master

• Approach 2: Use HPF

• HPF is an extended version of Fortran 90

• HPF has Fortran 90 features and a few directives

• Directives

• Tell how data is laid out in processor memories in the parallel machine configuration. For example,

• !HPF\$ DISTRIBUTE A(BLOCK)

• Assist in identifying parallelism. For example,

• !HPF\$ INDEPENDENT

```
PROGRAM SUM
  REAL A(10000)
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END
```

When written in HPF...

```
PROGRAM SUM
  REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END
```

Minimal modification

Easy to write

Now the compiler has to do more work

Motivation for HPF

• User needs only to write some easy directives; need not write the whole program in SPMD form

• User does not need to manage data movement [send & receive] and synchronization

• Simple and easy to master


Running example (dependence analysis, used later for communication analysis, will establish that no dependence is carried by the I loops):

```
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
  DO I = 2, 10000
S1: A(I) = B(I-1) + C
  ENDDO
  DO I = 1, 10000
S2: B(I) = A(I)
  ENDDO
ENDDO
```

HPF Compilation Overview

Dependence Analysis

Distribution Analysis

Computation Partitioning

Partition so as to distribute work of the I loops

```
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 100) SEND(PID+1, B(100), 1)
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
  DO I = 2, 100
S1: A(I) = B(I-1) + C
  ENDDO
  DO I = 1, 100
S2: B(I) = A(I)
  ENDDO
ENDDO
```

Communication Analysis and placement

Communication required for B(0) on each iteration of the J loop

```
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 100) SEND(PID+1, B(100), 1)
  DO I = 2, 100
S1: A(I) = B(I-1) + C
  ENDDO
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
  DO I = 1, 100
S2: B(I) = A(I)
  ENDDO
ENDDO
```

Optimization

Aggregation

Overlap communication and computation

Recognition of reduction


• Distribution Propagation and analysis

• Analyze what distribution holds for a given array at a given point in the program

• Difficult due to

• REALIGN and REDISTRIBUTE directives

• Distribution of formal parameters inherited from calling procedure

• Use “Reaching Decompositions” data flow analysis and its interprocedural version

• For simplicity assume single distribution for an array at all points in a subprogram

• Define the block-distribution mapping

• For example, suppose an array A of size N is block-distributed over p processors

• Block size b = ⌈N/p⌉; processor q (0-based) owns elements A(q·b+1 : min((q+1)·b, N))
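For concreteness, this ownership arithmetic can be sketched in Python (the helper names are mine, not from the book):

```python
import math

def block_size(n, p):
    """Block size b = ceil(n/p) for an n-element array on p processors."""
    return math.ceil(n / p)

def block_owner(i, n, p):
    """0-based processor owning 1-based element i under BLOCK distribution."""
    return (i - 1) // block_size(n, p)

def block_range(pid, n, p):
    """Inclusive 1-based range of elements owned by processor pid."""
    b = block_size(n, p)
    return (pid * b + 1, min((pid + 1) * b, n))
```

With N = 10000 and p = 100 this gives b = 100 and processor 0 owning A(1:100), matching the running example.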

Dividing work among processors

Computation partitioning

Determine which iterations of a loop will be executed on which processor

Owner-computes rule

REAL A(10000)

!HPF\$ DISTRIBUTE A(BLOCK)

DO I = 1, 10000

A(I) = A(I) + C

ENDDO

Iteration I is executed on owner of A(I)

With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on

Basic Loop Compilation

• Multiple statements in a loop in a recurrence: choose a partitioning reference (here the store A(I+1))

• The processor responsible for performing iteration I is the owner of A(I+1): with block size 100, p = ⌊I/100⌋

• The set of global indices executed on processor p is [100p : 100p+99] ∩ [1 : N]

• Have to map the global loop index to a local loop index

• The smallest value in that set maps to local index 1
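Under owner-computes with block size 100, each processor's iteration set for A(I+1) = B(I) + C can be computed as follows (a sketch; `iterations_on` is a hypothetical helper, not from the book):

```python
def iterations_on(pid, n_iters, b=100):
    """Inclusive range of global iterations I (of DO I = 1, N) executed on
    processor pid: owner-computes on the store A(I+1) gives pid = I // b."""
    lo = max(1, pid * b)
    hi = min(n_iters, pid * b + b - 1)
    return (lo, hi) if lo <= hi else None
```

Processor 0 gets I in [1:99] (iteration 0 does not exist); interior processors get 100 iterations each.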

```
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO
```

• Map the global iteration space I to the local iteration space i: i = I − 100·PID + 1

• Adjust array subscripts for local iterations: A(I+1) becomes A(i), B(I) becomes B(i-1)
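The mapping is one line each way (a sketch; the function names are mine):

```python
B = 100  # block size in the running example

def to_local(I, pid):
    """Local loop index i for global iteration I on processor pid."""
    return I - pid * B + 1

def to_global(i, pid):
    """Inverse mapping: global iteration for local index i on pid."""
    return i + pid * B - 1
```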

• For interior processors the code becomes..

```
DO i = 1, 100
  A(i) = B(i-1) + C
ENDDO
```

• Adjust lower bound for 1st processor and upper bound of last processor to take care of boundary conditions..

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
DO i = lo, hi
  A(i) = B(i-1) + C
ENDDO
```
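The lo/hi computation in the generated code can be mirrored directly (a sketch; here N = 250 so the last block is partial):

```python
import math

def local_bounds(pid, n, b=100):
    """lo/hi bounds of the local i loop, as in the generated code above."""
    last_p = math.ceil((n + 1) / b) - 1
    lo = 2 if pid == 0 else 1
    hi = (n % b) + 1 if pid == last_p else b
    return lo, hi
```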

• For our example no communication is required for iterations in [100p+1 : 100p+99]

• Iterations which require receiving data are in [100p : 100p]

• Iterations which require sending data are in [100p+100 : 100p+100]

```
REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO
```

• Receive required for iterations in [100p:100p]

• Send required for iterations in [100p+100:100p+100]

• No communication required for iterations in [100p+1:100p+99]
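The three iteration classes can be checked mechanically (a sketch with block size 100; the predicate names are mine):

```python
B = 100  # block size

def needs_recv(I, pid):
    """Iteration I on pid reads B(I); a receive is needed exactly when
    B(I) lives on the left neighbour, i.e. I == 100*pid."""
    return pid > 0 and I == pid * B

def needs_send(I, pid):
    """pid must send its last element B(100*pid + 100) for iteration
    I == 100*pid + 100, which executes on pid + 1."""
    return I == pid * B + B
```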

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
DO i = lo, hi
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  A(i) = B(i-1) + C
ENDDO
```

Send must happen in the 101st iteration

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  IF (i <= hi) THEN
    A(i) = B(i-1) + C
  ENDIF
  IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
ENDDO
```

Communication Generation


Move SEND outside the loop

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO i = lo, hi
    IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
```


Move receive outside loop and loop peel

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  ! lo = MAX(lo, 1+1) == 2
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
```

Move the SEND ahead of the RECV

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
ENDIF
```


• When is such rearrangement legal?

• Receive: copy from a global to a local location

• Send: copy from a local to a global location

```
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      B(0) = Bg(0)                     ! RECV
      A(1) = B(0) + C
    ENDIF
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND
ENDIF
```

There is no chain of dependences from S1 to S2, so the rearrangement is legal.
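Because no dependence chain runs from S1 to S2, all sends can be issued before any receive. A small Python model of this SPMD schedule (my own sketch, with N = 250 so the last block is partial) reproduces the sequential result:

```python
import math

N, BS, C = 250, 100, 5.0
nproc = math.ceil((N + 1) / BS)                # processors holding data
lastP = nproc - 1

Bglob = [float(i) for i in range(N + 2 * BS)]  # B(i) = i, padded
# local arrays: index e on pid holds global element e + BS*pid;
# B has a halo cell B(0) for the value received from the left.
Bloc = [[Bglob[e + BS * p] for e in range(BS + 1)] for p in range(nproc)]
Aloc = [[0.0] * (BS + 1) for _ in range(nproc)]
mail = {}                                      # models SEND/RECV pairs

# all SENDs first (legal: the sent values are not computed in the loop)
for pid in range(nproc):
    if pid != lastP:
        mail[pid + 1] = Bloc[pid][BS]          # SEND(pid+1, B(100), 1)

for pid in range(nproc):
    lo = 2 if pid == 0 else 1
    hi = N % BS + 1 if pid == lastP else BS
    if lo == 1:                                # peeled first iteration
        Bloc[pid][0] = mail[pid]               # RECV(pid-1, B(0), 1)
        Aloc[pid][1] = Bloc[pid][0] + C
    for i in range(2, hi + 1):
        Aloc[pid][i] = Bloc[pid][i - 1] + C

# compare with the sequential loop A(I+1) = B(I) + C
for I in range(1, N + 1):
    pid, i = I // BS, I % BS + 1               # owner of A(I+1), local index
    assert Aloc[pid][i] == Bglob[I] + C
```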

```
!HPF$ DISTRIBUTE A(BLOCK)
...
DO I = 1, N
  A(I+1) = A(I) + C
ENDDO
```

Would be rewritten as ..

```
IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      A(0) = Ag(0)                     ! RECV
      A(1) = A(0) + C
    ENDIF
  DO i = 2, hi
    A(i) = A(i-1) + C
  ENDDO
S2: IF (PID /= lastP) Ag(100) = A(100) ! SEND
ENDIF
```

Here the rearrangement would not be correct: a chain of dependences runs from S1 through the loop to S2, since the value sent is computed from the value received.



```
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = B(I,J) + C
  ENDDO
ENDDO
```

Using basic loop compilation gives...

```
DO J = 1, M
  lo = 1
  IF (PID == 0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID == lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDIF
ENDDO
```

Communication Vectorization


```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
  ENDDO
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
  ENDDO
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
ENDIF
```


Distribute J Loop


```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1) THEN
    RECV(PID-1, B(0,1:M), M)
    DO J = 1, M
      A(1,J) = B(0,J) + C
    ENDDO
  ENDIF
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
ENDIF
```


```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
DO J = 1, M
  IF (PID <= lastP) THEN
S1: IF (PID /= lastP) Bg(100,J) = B(100,J)
    IF (lo == 1) THEN
S2:   B(0,J) = Bg(0,J)
S3:   A(1,J) = B(0,J) + C
    ENDIF
    DO i = 2, hi
S4:   A(i,J) = B(i-1,J) + C
    ENDDO
  ENDIF
ENDDO
```

Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by the outer loop
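The payoff of vectorization is fewer, larger messages. With the usual linear cost model t(m) = alpha + beta·m for an m-word message (the constants below are purely illustrative):

```python
alpha, beta = 10.0, 0.1   # illustrative startup (latency) and per-word costs
M = 1000                  # number of J iterations

cost_per_element = M * (alpha + beta * 1)   # one 1-word message per iteration
cost_vectorized = alpha + beta * M          # one M-word message

assert cost_vectorized < cost_per_element
```

The startup cost alpha is paid M times in the unvectorized version but only once after vectorization.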


```
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + B(I,J)
  ENDDO
ENDDO
```

Can sends be done before the receives?

Can communication be vectorized?

```
REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J+1) = A(I,J) + C
  ENDDO
ENDDO
```

Can sends be done before the receives?

Can communication be fully vectorized?


```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
ENDIF
```

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
ENDIF
```

Overlapping Communication and Computation

```
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + C
  ENDDO
ENDDO
```

Initial code generation for the I loop gives..

```
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J), 1)
      A(1,J) = A(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = A(i-1,J) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
  ENDDO
ENDIF
```

Pipelining

The communication could be vectorized, but that would give up the pipelined parallelism

• Pipelined parallelism, at the cost of communication overhead on every column


...

```
IF (PID <= lastP) THEN
  DO J = 1, M, K
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J:J+K-1), K)
      DO j = J, J+K-1
        A(1,j) = A(0,j) + C
      ENDDO
    ENDIF
    DO j = J, J+K-1
      DO i = 2, hi
        A(i,j) = A(i-1,j) + C
      ENDDO
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
  ENDDO
ENDIF
```

Pipelining: Blocking
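The block factor K trades pipeline fill time against per-message overhead. A rough model (my own, with illustrative constants): each step computes K columns and forwards K boundary values, and the last processor starts after P-1 steps and then performs M/K steps.

```python
def pipeline_time(P, M, K, t_col=1.0, alpha=10.0, beta=0.1):
    """Rough completion time of the blocked pipeline: (P-1) fill steps
    plus M/K steady-state steps, each costing compute + one message."""
    step = K * t_col + alpha + beta * K
    return (P - 1 + M / K) * step
```

K = 1 maximizes overlap but pays the startup cost alpha M times; K = M is fully vectorized but serializes the processors; an intermediate K is best.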

• Alignment and Replication

• Identification of Common recurrences

• Storage Management

• Minimize temporary storage used for communication

• Space taken for temporary storage should be at most equal to the space taken by the arrays

• Interprocedural Optimizations

• HPF is easy to code

• But hard to compile

• Steps required to compile HPF programs

• Basic loop compilation

• Communication generation

• Optimizations

• Communication vectorization

• Overlapping communication with computation

• Pipelining