Compiling High Performance Fortran

Allen and Kennedy, Chapter 14

Overview

- Motivation for HPF
- Overview of compiling HPF programs
- Basic Loop Compilation for HPF
- Optimizations for compiling HPF
- Results and Summary

Motivation for HPF

- Scalable distributed-memory multiprocessors require “message passing” to communicate data between processors
- Approach 1: Use MPI calls in Fortran/C code

Consider the following sum reduction

    PROGRAM SUM
    REAL A(10000)
    READ (9) A
    SUM = 0.0
    DO I = 1, 10000
      SUM = SUM + A(I)
    ENDDO
    PRINT SUM
    END

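MPI implementation (100 processors, each holding 100 elements):

    PROGRAM SUM
    REAL A(100), BUFF(100)
    ! Processor 0 reads the data and hands one block to each processor
    IF (PID == 0) THEN
      DO IP = 0, 99
        READ (9) BUFF(1:100)
        IF (IP == 0) THEN
          A(1:100) = BUFF(1:100)
        ELSE
          SEND(IP,BUFF,100)
        ENDIF
      ENDDO
    ELSE
      RECV(0,A,100)
    ENDIF
    ! Local sum reduction over the 100 owned elements
    SUM = 0.0
    DO I = 1, 100
      SUM = SUM + A(I)
    ENDDO
    ! Pass the running total along the chain of processors, then back to 0
    IF (PID == 0) SEND(1,SUM,1)
    IF (PID > 0) THEN
      RECV(PID-1,T,1)
      SUM = SUM + T
      IF (PID < 99) THEN
        SEND(PID+1,SUM,1)
      ELSE
        SEND(0,SUM,1)
      ENDIF
    ENDIF
    IF (PID == 0) THEN
      RECV(99,SUM,1)
      PRINT SUM
    ENDIF
    END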

- Disadvantages of the MPI approach
  - User has to rewrite the program in SPMD (Single Program, Multiple Data) form
  - User has to manage data movement (send and receive), data placement, and synchronization
  - Messy and hard to master

- Approach 2: Use HPF
  - HPF is an extended version of Fortran 90
  - HPF has Fortran 90 features and a few directives
- Directives
  - Tell how data is laid out in processor memories in the parallel machine configuration. For example: !HPF$ DISTRIBUTE A(BLOCK)
  - Assist in identifying parallelism. For example: !HPF$ INDEPENDENT

When written in HPF, the sequential program needs only one added directive:

    PROGRAM SUM
    REAL A(10000)
    !HPF$ DISTRIBUTE A(BLOCK)
    READ (9) A
    SUM = 0.0
    DO I = 1, 10000
      SUM = SUM + A(I)
    ENDDO
    PRINT SUM
    END

- Minimum modification
- Easy to write
- Now the compiler has to do more work

- Advantages of HPF
  - User needs only to write some easy directives; need not write the whole program in SPMD form
  - User does not need to manage data movement (send and receive) and synchronization
  - Simple and easy to master

HPF Compilation Overview

Running example:

        REAL A(10000), B(10000)
        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        DO J = 1, 10000
          DO I = 2, 10000
    S1:     A(I) = B(I-1) + C
          ENDDO
          DO I = 1, 10000
    S2:     B(I) = A(I)
          ENDDO
        ENDDO

The compiler processes this example in phases:

- Dependence Analysis: used for communication analysis. Fact used: no dependence is carried by the I loops
- Distribution Analysis
- Computation Partitioning: partition so as to distribute the work of the I loops

Next phase, Communication Analysis and Placement: communication is required for B(0) on each iteration of the J loop. B(0) is a shadow region that holds the B(100) of processor PID-1.

        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        DO J = 1, 10000
    I1:   IF (PID /= 99) SEND(PID+1, B(100), 1)
    I2:   IF (PID /= 0) THEN
            RECV(PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
          DO I = 2, 100
    S1:     A(I) = B(I-1) + C
          ENDDO
          DO I = 1, 100
    S2:     B(I) = A(I)
          ENDDO
        ENDDO

Final phase, Optimization: aggregation, overlapping communication with computation, and recognition of reductions. In the running example the RECV is delayed until after the local S1 loop, so that communication overlaps computation:

        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        DO J = 1, 10000
    I1:   IF (PID /= 99) SEND(PID+1, B(100), 1)
          DO I = 2, 100
    S1:     A(I) = B(I-1) + C
          ENDDO
    I2:   IF (PID /= 0) THEN
            RECV(PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
          DO I = 1, 100
    S2:     B(I) = A(I)
          ENDDO
        ENDDO

Basic Loop Compilation

- Distribution propagation and analysis
  - Analyze what distribution holds for a given array at a given point in the program
  - Difficult due to
    - REALIGN and REDISTRIBUTE directives
    - Distribution of formal parameters inherited from the calling procedure
  - Use “Reaching Decompositions” data-flow analysis and its interprocedural version

Basic Loop Compilation

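- For simplicity, assume a single distribution for an array at all points in a subprogram
- Define $\mu_A(i) = (\rho_A(i), \delta_A(i))$, mapping global index $i$ of array $A$ to the owning processor $\rho_A(i)$ and the local index $\delta_A(i)$ on that processor
- For example, suppose array $A$ of size $N$ is block distributed over $p$ processors
  - Block size: $B_A = \lceil N/p \rceil$
  - Owner: $\rho_A(i) = \lceil i/B_A \rceil - 1$
  - Local index: $\delta_A(i) = i - \rho_A(i) B_A$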
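The block mapping can be sketched in a few lines of Python. This is a minimal illustration, assuming 1-based global and local indices and 0-based processor numbers as in the deck's examples; the helper names `block_size`, `owner`, and `local_index` are hypothetical and not taken from the slides.

```python
def block_size(n, p):
    """Block size for n elements distributed BLOCK-wise over p processors."""
    return (n + p - 1) // p          # integer ceiling of n / p


def owner(i, n, p):
    """0-based processor that owns 1-based global index i."""
    return (i - 1) // block_size(n, p)


def local_index(i, n, p):
    """1-based local index of global index i on its owning processor."""
    return (i - 1) % block_size(n, p) + 1


if __name__ == "__main__":
    n, p = 10000, 100
    for i in (1, 100, 101, 10000):
        print(i, owner(i, n, p), local_index(i, n, p))
```

With 10000 elements on 100 processors, global index 101 lands on processor 1 at local index 1, which matches the "next 100 on processor 1" layout used throughout the deck.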

- Computation partitioning: dividing work among processors
  - Determine which iterations of a loop will be executed on which processor
  - Owner-computes rule: iteration I is executed on the owner of A(I)

    REAL A(10000)
    !HPF$ DISTRIBUTE A(BLOCK)
    DO I = 1, 10000
      A(I) = A(I) + C
    ENDDO

- With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on

Iteration Partitioning

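- Multiple statements in a loop in a recurrence: choose a partitioning reference $A(\alpha(I))$
- Processor responsible for performing computation for iteration $I$ is $\theta(I) = \rho_A(\alpha(I))$, where $\rho_A$ gives the processor that owns an element of $A$
- Set of indices executed on $p$: $\{ I \mid 1 \le I \le N,\ \theta(I) = p \}$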

Iteration Partitioning

- Have to map the global loop index to a local loop index
- The smallest global index executed on a processor maps to local index 1

    REAL A(10000), B(10000)
    !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
    DO I = 1, N
      A(I+1) = B(I) + C
    ENDDO

- Map global iteration space, I, to local iteration space, i, as follows (block size 100, processor p): $i = I - 100p + 1$
- Adjust array subscripts for local iterations: A(I+1) becomes A(i) and B(I) becomes B(i-1)

- For interior processors the code becomes:

    DO i = 1, 100
      A(i) = B(i-1) + C
    ENDDO

- Adjust the lower bound for the first processor and the upper bound for the last processor to take care of boundary conditions:

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
    DO i = lo, hi
      A(i) = B(i-1) + C
    ENDDO
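As a sanity check on the bounds above, the following Python sketch computes `lo` and `hi` for each processor and maps the local iterations back to global ones. It assumes a block size of 100, 0-based processor numbers, and the loop `A(I+1) = B(I) + C` for `I = 1..N`; the function names are hypothetical.

```python
def local_bounds(pid, n, block=100):
    """Local loop bounds (lo, hi) on processor pid for A(I+1) = B(I) + C, I = 1..n.

    Returns None when the processor owns none of A(2..n+1).
    """
    last_p = (n + 1 + block - 1) // block - 1   # CEIL((N+1)/block) - 1
    if pid > last_p:
        return None
    lo = 2 if pid == 0 else 1                   # A(1) is never assigned
    hi = n % block + 1 if pid == last_p else block
    return lo, hi


def global_iterations(pid, n, block=100):
    """Global iterations I run on processor pid, using I = i + block*pid - 1."""
    bounds = local_bounds(pid, n, block)
    if bounds is None:
        return []
    lo, hi = bounds
    return [i + block * pid - 1 for i in range(lo, hi + 1)]


if __name__ == "__main__":
    n = 250
    for pid in range(4):
        print(pid, local_bounds(pid, n))
```

Concatenating `global_iterations` over all processors should give exactly `1..N` with no gaps or overlaps, which is the property the boundary adjustments are meant to guarantee.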

Communication Generation

    REAL A(10000), B(10000)
    !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
    ...
    DO I = 1, N
      A(I+1) = B(I) + C
    ENDDO

For this example, on processor p (global iteration numbers):

- Receive required for iterations in [100p:100p]
- Send required for iterations in [100p+100:100p+100]
- No communication required for iterations in [100p+1:100p+99]
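The iteration sets listed above can be reproduced by brute force. The sketch below is an illustration only, assuming block size 100, the owner-computes rule with `A(I+1)` as the partitioning reference, and hypothetical helper names; it compares the owner of `A(I+1)` with the owner of `B(I)` for every iteration.

```python
def owner(index, block=100):
    """0-based processor owning 1-based global element `index` (BLOCK distribution)."""
    return (index - 1) // block


def classify_iterations(n, block=100):
    """Split global iterations I = 1..n of A(I+1) = B(I) + C by communication need.

    Returns (recv, local): recv[p] lists (I, source_processor) pairs for which
    processor p must receive B(I); local[p] lists iterations needing no message.
    """
    recv, local = {}, {}
    for i in range(1, n + 1):
        p = owner(i + 1, block)   # owner of A(I+1) executes iteration I
        q = owner(i, block)       # owner of the operand B(I)
        if p == q:
            local.setdefault(p, []).append(i)
        else:
            recv.setdefault(p, []).append((i, q))
    return recv, local


if __name__ == "__main__":
    recv, local = classify_iterations(250)
    print(recv)
```

Each `(I, q)` entry in `recv[p]` is also the one value that processor `q` has to send, so the send set of processor `p` is the receive set of processor `p+1`.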

The receive is placed in the first local iteration:

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
    DO i = lo, hi
      IF (i==1 .AND. PID /= 0) RECV (PID-1, B(0), 1)
      A(i) = B(i-1) + C
    ENDDO

Send must happen in the 101st iteration

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    DO i = lo, hi+1
      IF (i==1 .AND. PID /= 0) RECV (PID-1, B(0), 1)
      IF (i <= hi) THEN
        A(i) = B(i-1) + C
      ENDIF
      IF (i == hi+1 .AND. PID /= lastP) SEND(PID+1, B(100), 1)
    ENDDO

Move SEND outside the loop

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      DO i = lo, hi
        IF (i==1 .AND. PID /= 0) RECV (PID-1, B(0), 1)
        A(i) = B(i-1) + C
      ENDDO
      IF (PID /= lastP) SEND(PID+1, B(100), 1)
    ENDIF

Move RECV outside the loop by peeling the first iteration

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      IF (lo == 1 .AND. PID /= 0) THEN
        RECV (PID-1, B(0), 1)
        A(1) = B(0) + C
      ENDIF
      ! lo = MAX(lo,1+1) == 2
      DO i = 2, hi
        A(i) = B(i-1) + C
      ENDDO
      IF (PID /= lastP) SEND(PID+1, B(100), 1)
    ENDIF

Since B is not modified in the loop, the SEND can be moved ahead of the RECV:

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      IF (PID /= lastP) SEND(PID+1, B(100), 1)
      IF (lo == 1 .AND. PID /= 0) THEN
        RECV (PID-1, B(0), 1)
        A(1) = B(0) + C
      ENDIF
      DO i = 2, hi
        A(i) = B(i-1) + C
      ENDDO
    ENDIF
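The generated SPMD code can be checked against the sequential loop with a small simulation. This Python sketch is not an MPI program: it assumes block size 100, runs the "processors" one after another in PID order, and models SEND and RECV with a dictionary used as a mailbox. The names `sequential` and `run_spmd` are hypothetical.

```python
def sequential(b_global, n, c):
    """Reference: A(I+1) = B(I) + C for I = 1..n (Python lists are 0-based)."""
    a = [0.0] * len(b_global)
    for i in range(1, n + 1):
        a[i] = b_global[i - 1] + c
    return a


def run_spmd(b_global, n, c, block=100):
    """Simulate the SEND-first SPMD code, one processor after another."""
    nproc = len(b_global) // block
    last_p = (n + block) // block - 1           # CEIL((N+1)/block) - 1
    mailbox = {}                                # destination PID -> value in flight
    a_global = [0.0] * len(b_global)
    for pid in range(nproc):
        if pid > last_p:
            continue
        # Index 0 is the shadow cell B(0); indices 1..block hold the owned block.
        b = [None] + b_global[pid * block:(pid + 1) * block]
        a = [0.0] * (block + 1)
        lo = 2 if pid == 0 else 1
        hi = n % block + 1 if pid == last_p else block
        if pid != last_p:
            mailbox[pid + 1] = b[block]         # SEND(PID+1, B(100), 1)
        if lo == 1:
            b[0] = mailbox.pop(pid)             # RECV(PID-1, B(0), 1)
            a[1] = b[0] + c
        for i in range(2, hi + 1):
            a[i] = b[i - 1] + c
        a_global[pid * block:(pid + 1) * block] = a[1:]
    return a_global


if __name__ == "__main__":
    b = [float(k) for k in range(1, 1001)]
    print(run_spmd(b, 250, 0.5) == sequential(b, 250, 0.5))
```

Running the processors in PID order is enough here because every SEND precedes the matching RECV; a real message-passing run would execute them concurrently.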

Communication Generation

- When is such rearrangement legal?
- Model each communication as a copy, then apply dependence analysis:
  - Receive: copy from global to local location
  - Send: copy from local to global location

        IF (PID <= lastP) THEN
    S1:   IF (lo == 1 .AND. PID /= 0) THEN
            B(0) = Bg(0)     ! RECV
            A(1) = B(0) + C
          ENDIF
          DO i = 2, hi
            A(i) = B(i-1) + C
          ENDDO
    S2:   IF (PID /= lastP) Bg(100) = B(100)     ! SEND
        ENDIF

- No chain of dependences from S1 to S2, so S2 (the SEND) may be moved ahead of S1 (the RECV)

By contrast, consider:

    !HPF$ DISTRIBUTE A(BLOCK)
    ...
    DO I = 1, N
      A(I+1) = A(I) + C
    ENDDO

This would be rewritten as:

        IF (PID <= lastP) THEN
    S1:   IF (lo == 1 .AND. PID /= 0) THEN
            A(0) = Ag(0)     ! RECV
            A(1) = A(0) + C
          ENDIF
          DO i = 2, hi
            A(i) = A(i-1) + C
          ENDDO
    S2:   IF (PID /= lastP) Ag(100) = A(100)     ! SEND
        ENDIF

Here the rearrangement would not be correct: the value A(100) sent in S2 depends, through the loop, on the value A(0) received in S1.
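The effect of an illegal rearrangement can be seen by simulating both orders for the recurrence `A(I+1) = A(I) + C`. The sketch below is an illustration under simplifying assumptions: a small block size of 4 to keep the data readable, processors simulated one after another in PID order, and a dictionary standing in for messages. The function names are hypothetical.

```python
def sequential_recurrence(a_global, n, c):
    """Reference: A(I+1) = A(I) + C for I = 1..n (Python lists are 0-based)."""
    a = list(a_global)
    for i in range(1, n + 1):
        a[i] = a[i - 1] + c
    return a


def run_recurrence(a_global, n, c, block, send_first):
    """Simulate the SPMD code with the SEND placed before or after the loop."""
    last_p = (n + block) // block - 1
    mailbox = {}
    out = list(a_global)
    for pid in range(last_p + 1):
        # Index 0 is the shadow cell A(0); indices 1..block hold the owned block.
        a = [None] + a_global[pid * block:(pid + 1) * block]
        lo = 2 if pid == 0 else 1
        hi = n % block + 1 if pid == last_p else block
        if send_first and pid != last_p:
            mailbox[pid + 1] = a[block]     # sends the OLD value of A(block)
        if lo == 1:
            a[0] = mailbox.pop(pid)
            a[1] = a[0] + c
        for i in range(2, hi + 1):
            a[i] = a[i - 1] + c
        if not send_first and pid != last_p:
            mailbox[pid + 1] = a[block]     # sends the NEWLY computed A(block)
        out[pid * block:(pid + 1) * block] = a[1:]
    return out


if __name__ == "__main__":
    start = [0.0] * 12
    print(run_recurrence(start, 10, 1.0, 4, send_first=False))
    print(run_recurrence(start, 10, 1.0, 4, send_first=True))
```

With the SEND after the loop the result matches the sequential recurrence; with the SEND hoisted above the RECV each processor forwards a stale boundary value and the results diverge.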

Communication Vectorization

    !HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
    DO J = 1, M
      DO I = 1, N
        A(I+1,J) = B(I,J) + C
      ENDDO
    ENDDO

Using basic loop compilation gives:

    DO J = 1, M
      lo = 1
      IF (PID==0) lo = 2
      hi = 100
      lastP = CEIL((N+1)/100) - 1
      IF (PID==lastP) hi = MOD(N,100) + 1
      IF (PID <= lastP) THEN
        IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
        IF (lo == 1) THEN
          RECV (PID-1, B(0,J), 1)
          A(1,J) = B(0,J) + C
        ENDIF
        ! i = 1 is peeled above; processor 0 starts at i = 2 anyway
        DO i = 2, hi
          A(i,J) = B(i-1,J) + C
        ENDDO
      ENDIF
    ENDDO

Distribute the J loop around the SEND, the RECV, and the computation:

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      DO J = 1, M
        IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
      ENDDO
      DO J = 1, M
        IF (lo == 1) THEN
          RECV (PID-1, B(0,J), 1)
          A(1,J) = B(0,J) + C
        ENDIF
      ENDDO
      DO J = 1, M
        DO i = 2, hi
          A(i,J) = B(i-1,J) + C
        ENDDO
      ENDDO
    ENDIF

Each communication loop then becomes a single vector message:

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
      IF (lo == 1) THEN
        RECV (PID-1, B(0,1:M), M)
        DO J = 1, M
          A(1,J) = B(0,J) + C
        ENDDO
      ENDIF
      DO J = 1, M
        DO i = 2, hi
          A(i,J) = B(i-1,J) + C
        ENDDO
      ENDDO
    ENDIF
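Why vectorization pays off can be illustrated with a simple linear cost model. The model is an assumption made here for illustration, not something stated in the slides: each message costs a fixed startup time plus a per-element time. The function name and the default parameter values are hypothetical.

```python
def total_message_cost(num_messages, elements_each, startup=100.0, per_element=1.0):
    """Cost of sending num_messages messages of elements_each elements each.

    Assumed model: every message pays `startup` once plus `per_element`
    for each element it carries.
    """
    return num_messages * (startup + per_element * elements_each)


if __name__ == "__main__":
    m = 1000                                   # number of columns (J iterations)
    print(total_message_cost(m, 1))            # one 1-element message per column
    print(total_message_cost(1, m))            # one vectorized m-element message
```

Under this model the same M elements cross the network in both versions; vectorization only removes M-1 startup costs, which is where the saving comes from when startup dominates.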

To test legality, write the communication as copies to and from global locations:

        DO J = 1, M
          lo = 1
          IF (PID==0) lo = 2
          hi = 100
          lastP = CEIL((N+1)/100) - 1
          IF (PID==lastP) hi = MOD(N,100) + 1
          IF (PID <= lastP) THEN
    S1:     IF (PID /= lastP) Bg(100,J) = B(100,J)
            IF (lo == 1) THEN
    S2:       B(0,J) = Bg(0,J)
    S3:       A(1,J) = B(0,J) + C
            ENDIF
            DO i = 2, hi
    S4:       A(i,J) = B(i-1,J) + C
            ENDDO
          ENDIF
        ENDDO

Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by the outer loop.

Consider:

    !HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
    DO J = 1, M
      DO I = 1, N
        A(I+1,J) = A(I,J) + B(I,J)
      ENDDO
    ENDDO

- Can sends be done before the receives?
- Can communication be vectorized?

    REAL A(10000,100)
    !HPF$ DISTRIBUTE A(BLOCK,*)
    DO J = 1, M
      DO I = 1, N
        A(I+1,J+1) = A(I,J) + C
      ENDDO
    ENDDO

- Can sends be done before the receives?
- Can communication be fully vectorized?

Overlapping Communication and Computation

        lo = 1
        IF (PID==0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID==lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
    S0:   IF (PID /= lastP) SEND(PID+1, B(100), 1)
    S1:   IF (lo == 1 .AND. PID /= 0) THEN
            RECV (PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
    L1:   DO i = 2, hi
            A(i) = B(i-1) + C
          ENDDO
        ENDIF

Loop L1 does not use the received value B(0), so S1 can be delayed until after L1, letting the message arrive while L1 executes:

        lo = 1
        IF (PID==0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID==lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
    S0:   IF (PID /= lastP) SEND(PID+1, B(100), 1)
    L1:   DO i = 2, hi
            A(i) = B(i-1) + C
          ENDDO
    S1:   IF (lo == 1 .AND. PID /= 0) THEN
            RECV (PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
        ENDIF

Pipelining

    !HPF$ DISTRIBUTE A(BLOCK,*)
    DO J = 1, M
      DO I = 1, N
        A(I+1,J) = A(I,J) + C
      ENDDO
    ENDDO

Initial code generation for the I loop gives:

    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      DO J = 1, M
        IF (lo == 1) THEN
          RECV (PID-1, A(0,J), 1)
          A(1,J) = A(0,J) + C
        ENDIF
        DO i = 2, hi
          A(i,J) = A(i-1,J) + C
        ENDDO
        IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
      ENDDO
    ENDIF

- The communication can be vectorized over J, but that gives up parallelism

- Pipelined parallelism with communication
- Pipelined parallelism with communication overhead

Pipelining: Blocking

...

    ! Blocked pipeline: K columns per message (assumes K divides M)
    IF (PID <= lastP) THEN
      DO J = 1, M, K
        IF (lo == 1) THEN
          RECV (PID-1, A(0,J:J+K-1), K)
          DO JJ = J, J+K-1
            A(1,JJ) = A(0,JJ) + C
          ENDDO
        ENDIF
        DO JJ = J, J+K-1
          DO i = 2, hi
            A(i,JJ) = A(i-1,JJ) + C
          ENDDO
        ENDDO
        IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
      ENDDO
    ENDIF
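The trade-off behind the block size K can be explored with a toy timing model. The sketch below is an assumption-laden illustration, not a measurement: each column costs `work` time units per processor, every SEND or RECV occupies the processor for `startup + per_element * K` time units, messages arrive as soon as the SEND completes, and K divides M. The function name and parameters are hypothetical.

```python
def pipeline_time(nproc, m, k, work=1.0, startup=10.0, per_element=0.0):
    """Finish time of the last processor for a blocked pipeline.

    nproc processors each process m columns in blocks of k columns;
    a processor may start a block only after receiving it from its predecessor.
    """
    if m % k != 0:
        raise ValueError("this sketch assumes k divides m")
    overhead = startup + per_element * k       # processor time per SEND or RECV
    arrivals = [0.0] * (m // k)                # when each block's input is available
    for pid in range(nproc):
        free = 0.0                             # when this processor becomes idle
        finished = []
        for t_in in arrivals:
            t = max(free, t_in)                # wait for the input and for ourselves
            if pid > 0:
                t += overhead                  # RECV from PID-1
            t += k * work                      # compute k columns
            if pid < nproc - 1:
                t += overhead                  # SEND to PID+1
            finished.append(t)
            free = t
        arrivals = finished
    return arrivals[-1]


if __name__ == "__main__":
    for k in (1, 8, 16, 32, 64):
        print(k, pipeline_time(4, 64, k))
```

In this model K = 1 pays the message overhead on every column, K = M removes all overlap between processors, and an intermediate K gives the shortest finish time.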

Other Optimizations

- Alignment and replication
- Identification of common recurrences
- Storage management
  - Minimize temporary storage used for communication
  - Space taken for temporary storage should be at most equal to the space taken by the arrays
- Interprocedural optimizations

Summary

- HPF is easy to code, but hard to compile
- Steps required to compile HPF programs
  - Basic loop compilation
  - Communication generation
- Optimizations
  - Communication vectorization
  - Overlapping communication with computation
  - Pipelining