Compiling High Performance Fortran

Allen and Kennedy, Chapter 14


Overview

  • Motivation for HPF

  • Overview of compiling HPF programs

  • Basic Loop Compilation for HPF

  • Optimizations for compiling HPF

  • Results and Summary


Motivation for HPF

  • A scalable distributed-memory multiprocessor requires message passing to communicate data between processors

  • Approach 1: Use MPI calls in Fortran/C code

[Figure: Scalable Distributed Memory Multiprocessor]


Motivation for HPF

Consider the following sum reduction

PROGRAM SUM

REAL A(10000)

READ (9) A

SUM = 0.0

DO I = 1, 10000

SUM = SUM + A(I)

ENDDO

PRINT *, SUM

END

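MPI implementation:

PROGRAM SUM
REAL A(100), BUFF(100)
! Processor 0 reads the data and hands one block of 100 elements to each processor
IF (PID == 0) THEN
  DO IP = 0, 99
    READ (9) BUFF(1:100)
    IF (IP == 0) THEN
      A(1:100) = BUFF(1:100)
    ELSE
      SEND(IP,BUFF,100)
    ENDIF
  ENDDO
ELSE
  RECV(0,A,100)
ENDIF
! Actual sum reduction code here
! Pass the running total along the chain of processors
IF (PID == 0) THEN
  SEND(1,SUM,1)
ELSE
  RECV(PID-1,T,1)
  SUM = SUM + T
  IF (PID < 99) THEN
    SEND(PID+1,SUM,1)
  ELSE
    SEND(0,SUM,1)
  ENDIF
ENDIF
! Processor 0 collects the total from the last processor before printing
IF (PID == 0) THEN
  RECV(99,SUM,1)
  PRINT *, SUM
ENDIF
END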
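To make the message pattern of the hand-written version concrete, the following Python sketch emulates the 100 processes one after another inside a single program. It is an illustration under stated assumptions: `NPROCS`, `BLOCK` and `chain_sum` are names introduced here, list slices stand in for SEND/RECV, and no real MPI library is involved.

```python
# Sketch only: a sequential emulation of the SPMD chain reduction.
# NPROCS, BLOCK and chain_sum are illustrative names, not taken from the slides.
NPROCS = 100
BLOCK = 100


def chain_sum(data):
    """Return the total that process 0 would print after the chain of messages."""
    # Scatter: process pid holds elements [pid*BLOCK, (pid+1)*BLOCK).
    local = [data[pid * BLOCK:(pid + 1) * BLOCK] for pid in range(NPROCS)]
    # Each process reduces its own block.
    partial = [sum(block) for block in local]
    # Process 0 sends its partial sum to process 1; every later process adds
    # the value it receives to its own partial sum and forwards the result.
    running = partial[0]
    for pid in range(1, NPROCS):
        running = running + partial[pid]
    # The last process sends the final value back to process 0.
    return running


data = [float(i) for i in range(1, NPROCS * BLOCK + 1)]
print(chain_sum(data))
```

The chain is strictly sequential: process `pid` cannot add its partial sum until process `pid - 1` has forwarded the running total.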


Motivation for HPF

  • Disadvantages of MPI approach

    • User has to rewrite the program in SPMD form [Single Program Multiple Data]

    • User has to manage data movement [send & receive], data placement and synchronization

    • Too messy and not easy to master


Motivation for HPF

  • Approach 2: Use HPF

    • HPF is an extended version of Fortran 90

    • HPF has Fortran 90 features and a few directives

  • Directives

    • Specify how data is laid out across the processor memories of the parallel machine. For example,

      • !HPF$ DISTRIBUTE A(BLOCK)

    • Assist in identifying parallelism. For example,

      • !HPF$ INDEPENDENT


Motivation for HPF

The same sum reduction code

PROGRAM SUM

REAL A(10000)

READ (9) A

SUM = 0.0

DO I = 1, 10000

SUM = SUM + A(I)

ENDDO

PRINT *, SUM

END

When written in HPF...

PROGRAM SUM

REAL A(10000)

!HPF$ DISTRIBUTE A(BLOCK)

READ (9) A

SUM = 0.0

DO I = 1, 10000

SUM = SUM + A(I)

ENDDO

PRINT *, SUM

END

Minimal modification

Easy to write

The compiler now has to do more work


Motivation for HPF

  • Advantages of HPF

    • User needs only to write some easy directives; need not write the whole program in SPMD form

    • User does not need to manage data movement [send & receive] and synchronization

    • Simple and easy to master


Overview

  • Motivation for HPF

  • Overview of compiling HPF programs

  • Basic Loop Compilation for HPF

  • Optimizations for compiling HPF

  • Results and Summary


HPF Compilation Overview

Dependence Analysis

Used for communication analysis

Fact used: No dependence carried by I loop

Running example:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

DO I = 2, 10000

S1: A(I) = B(I-1) + C

ENDDO

DO I = 1, 10000

S2: B(I) = A(I)

ENDDO

ENDDO


HPF Compilation Overview

Running example:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

DO I = 2, 10000

S1: A(I) = B(I-1) + C

ENDDO

DO I = 1, 10000

S2: B(I) = A(I)

ENDDO

ENDDO

Dependence Analysis

Distribution Analysis


HPF Compilation Overview

Running example:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

DO I = 2, 10000

S1: A(I) = B(I-1) + C

ENDDO

DO I = 1, 10000

S2: B(I) = A(I)

ENDDO

ENDDO

Dependence Analysis

Distribution Analysis

Computation Partitioning

Partition so as to distribute work of the I loops


HPF Compilation Overview

REAL A(1:100), B(0:100)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

I1: IF (PID /= 99) SEND(PID+1,B(100),1)

I2: IF (PID /= 0) THEN

RECV(PID-1,B(0),1)

A(1) = B(0) + C

ENDIF

DO I = 2, 100

S1:A(I) = B(I-1)+C

ENDDO

DO I = 1, 100

S2:B(I) = A(I)

ENDDO

ENDDO

Dependence Analysis

Distribution Analysis

Computation Partitioning

Communication Analysis and placement

Communication is required for B(0) on each iteration of the J loop

B(0) is the shadow region


HPF Compilation Overview

REAL A(1:100), B(0:100)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

DO J = 1, 10000

I1: IF (PID /= 99) SEND(PID+1,B(100),1)

DO I = 2, 100

S1:A(I) = B(I-1)+C

ENDDO

I2: IF (PID /= 0) THEN

RECV(PID-1,B(0),1)

A(1) = B(0) + C

ENDIF

DO I = 1, 100

S2:B(I) = A(I)

ENDDO

ENDDO

Dependence Analysis

Distribution Analysis

Computation Partitioning

Communication Analysis and placement

Optimization

Aggregation

Overlap communication and computation

Recognition of reduction


Overview

  • Motivation for HPF

  • Overview of compiling HPF programs

  • Basic Loop Compilation for HPF

  • Optimizations for compiling HPF

  • Results and Summary


Basic Loop Compilation

  • Distribution Propagation and analysis

    • Analyze what distribution holds for a given array at a given point in the program

    • Difficult due to

      • REALIGN and REDISTRIBUTE directives

      • Distribution of formal parameters inherited from calling procedure

    • Use “Reaching Decompositions” data flow analysis and its interprocedural version


Basic Loop Compilation

  • For simplicity assume single distribution for an array at all points in a subprogram

  • Define

  • For example suppose array A of size N is block distributed over p processors

    • Block size,
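A small sketch of the bookkeeping that a BLOCK distribution implies. It assumes the usual convention of ceil(N/p) elements per processor, with processors numbered from 0 and array elements numbered from 1. The function names are illustrative and do not come from the slides.

```python
# Sketch only: block-distribution bookkeeping under the assumptions stated above.

def block_size(n, p):
    """Elements per processor when n elements are block distributed over p processors."""
    return (n + p - 1) // p  # integer form of ceil(n / p)


def owner(i, n, p):
    """Processor (numbered from 0) that owns global element i (numbered from 1)."""
    return (i - 1) // block_size(n, p)


def local_index(i, n, p):
    """Position of global element i inside its owner's block (numbered from 1)."""
    return (i - 1) % block_size(n, p) + 1


print(block_size(10000, 100), owner(101, 10000, 100), local_index(101, 10000, 100))
```

For the 10000-element arrays used in these slides on 100 processors, each processor holds 100 consecutive elements.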


Basic Loop Compilation

Iteration Partitioning

Dividing work among processors

Computation partitioning

Determine which iterations of a loop will be executed on which processor

Owner-computes rule

REAL A(10000)

!HPF$ DISTRIBUTE A(BLOCK)

DO I = 1, 10000

A(I) = A(I) + C

ENDDO

Iteration I is executed on owner of A(I)

100 processors: 1st 100 iterations on processor 0, the next 100 on processor 1 and so on
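The owner-computes rule for this loop can be sketched as follows. The emulation runs each processor's share of the iterations in turn and produces the same array as the sequential loop. `iterations_on` and `run_partitioned` are hypothetical helper names, and the block size is assumed to be ceil(N/p).

```python
# Sketch only: owner-computes partitioning of A(I) = A(I) + C.

def iterations_on(pid, n, nprocs):
    """Global iterations (1-based) that processor pid executes."""
    block = (n + nprocs - 1) // nprocs
    lo = pid * block + 1
    hi = min((pid + 1) * block, n)
    return range(lo, hi + 1)


def run_partitioned(a, c, nprocs):
    """Apply A(I) = A(I) + C with each iteration run by the owner of A(I)."""
    result = list(a)
    for pid in range(nprocs):
        for i in iterations_on(pid, len(a), nprocs):
            result[i - 1] = a[i - 1] + c  # Python lists are 0-based
    return result


a = [float(i) for i in range(1, 10001)]
print(list(iterations_on(1, 10000, 100))[:3], run_partitioned(a, 1.0, 100)[:3])
```

Because no iteration reads an element owned by another processor, this loop needs no communication at all.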


Iteration Partitioning

  • Multiple statements in a loop in a recurrence: choose a partitioning reference

  • Processor responsible for performing computation for iteration I is

  • Set of indices executed on p


Iteration Partitioning

  • Have to map global loop index to local loop index

  • The smallest global index executed on a processor maps to local index 1

    REAL A(10000),B(10000)

    !HPF$ DISTRIBUTE A(BLOCK),B(BLOCK)

    DO I = 1, N

    A(I+1) = B(I) + C

    ENDDO


Iteration Partitioning

REAL A(10000),B(10000)

!HPF$ DISTRIBUTE A(BLOCK),B(BLOCK)

DO I = 1, N

A(I+1) = B(I) + C

ENDDO

  • Map the global iteration space, I, to the local iteration space, i, as follows:


Iteration Partitioning

  • Adjust array subscripts for local iterations:
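For the running example A(I+1) = B(I) + C with a block size of 100, one mapping consistent with the local loop `A(i) = B(i-1) + C` is i = I - 100*PID + 1. The sketch below checks that under this assumed mapping the global subscripts I+1 and I become the local subscripts i and i-1. The function names are illustrative only.

```python
# Sketch only: global-to-local index mapping for A(I+1) = B(I) + C, block size 100.
BLOCK = 100


def to_local(I, pid):
    """Local iteration index i for global iteration I executed on processor pid."""
    return I - BLOCK * pid + 1


def local_subscripts(I, pid):
    """Local subscripts of A(I+1) and B(I) on processor pid."""
    a_sub = (I + 1) - BLOCK * pid  # global element I+1 relative to the block start
    b_sub = I - BLOCK * pid        # global element I relative to the block start
    return a_sub, b_sub


# Processor pid executes the iterations whose target A(I+1) it owns:
# I = 100*pid, ..., 100*pid + 99.
for I in range(200, 300):
    i = to_local(I, 2)
    assert local_subscripts(I, 2) == (i, i - 1)
print(to_local(200, 2), local_subscripts(200, 2))
```

The local subscript 0 for B on the first local iteration is the element owned by the left neighbour, which is why a shadow cell B(0) appears in the generated code.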


Iteration Partitioning

  • For interior processors the code becomes:

    DO i = 1, 100

    A(i) = B(i-1) + C

    ENDDO

  • Adjust the lower bound for the first processor and the upper bound for the last processor to handle the boundary conditions:

    lo = 1

    IF (PID==0) lo = 2

    hi = 100

    IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1

    DO i = lo, hi

    A(i) = B(i-1) + C

    ENDDO


Communication Generation

  • For our example no communication is required for iterations in

  • Iterations which require receiving data are

  • Iterations which require sending data are


Communication Generation

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)

...

DO I = 1, N

A(I+1) = B(I) + C

ENDDO

  • Receive required for iterations in [100p:100p]

  • Send required for iterations in [100p+100:100p+100]

  • No communication required for iterations in [100p+1:100p+99]
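The three iteration sets above can be derived mechanically by comparing the processor that executes an iteration with the processor that owns the value it reads. This is a minimal sketch for the block size of 100 used in the example, with hypothetical helper names, and it takes N = 9999 so that A(I+1) stays within the declared bounds.

```python
# Sketch only: which iterations of A(I+1) = B(I) + C need a message.
BLOCK = 100  # block size of the running example


def executor(I):
    """Processor that runs global iteration I: the owner of A(I+1)."""
    return I // BLOCK


def owner_of_b(I):
    """Processor that owns B(I), the value read in iteration I."""
    return (I - 1) // BLOCK


def message_for(I):
    """Return (sender, receiver) if iteration I needs a message, else None."""
    src, dst = owner_of_b(I), executor(I)
    return (src, dst) if src != dst else None


N = 9999
messages = [message_for(I) for I in range(1, N + 1) if message_for(I)]
print(len(messages), messages[0], messages[-1])
```

Only the iterations with I a multiple of 100 cross a block boundary, so each processor other than the first receives exactly one element per execution of the loop.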


Communication Generation

After inserting receive

lo = 1

IF (PID==0) lo = 2

hi = 100

IF (PID==CEIL((N+1)/100)-1)

hi = MOD(N,100) + 1

DO i = lo, hi

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

A(i) = B(i-1) + C

ENDDO

Send must happen in the 101st iteration

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

DO i = lo, hi+1

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

IF (i <= hi) THEN

A(i) = B(i-1) + C

ENDIF

IF (i == hi+1 && PID /= lastP)

SEND(PID+1, B(100), 1)

ENDDO


Communication Generation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

DO i = lo, hi+1

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

IF (i <= hi) THEN

A(i) = B(i-1) + C

ENDIF

IF (i == hi+1 && PID /= lastP)

SEND(PID+1, B(100), 1)

ENDDO

Move SEND outside the loop

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO i = lo, hi

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF


Communication Generation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO i = lo, hi

IF (i==1 && PID /= 0)

RECV (PID-1, B(0), 1)

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF

Move the receive outside the loop by peeling the first iteration

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

! lo = MAX(lo,1+1) == 2

DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF


Communication Generation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

! lo = MAX(lo,1+1) == 2

DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

ENDIF

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (PID /= lastP)

SEND(PID+1, B(100), 1)

IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

ENDIF
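As a sanity check on the final form, with the SEND hoisted above the RECV, the following Python sketch emulates every processor in turn and compares the gathered result with the sequential loop. It is an emulation under assumptions: a dictionary stands in for the message layer, the block size and processor count are parameters rather than the fixed 100 of the slides, and all names are introduced here.

```python
# Sketch only: emulate the SPMD code for A(I+1) = B(I) + C and check it.
import math


def sequential(b, c, n):
    """Reference loop; arrays are 1-based, so index 0 is unused padding."""
    a = [0.0] * len(b)
    for I in range(1, n + 1):
        a[I + 1] = b[I] + c
    return a


def spmd(b, c, n, block, nprocs):
    """Run the per-processor code one processor at a time and gather A."""
    last_p = math.ceil((n + 1) / block) - 1
    # Local arrays: slot 0 of each local B is the shadow cell B(0).
    loc_b = [[0.0] + b[p * block + 1:(p + 1) * block + 1] for p in range(nprocs)]
    loc_a = [[0.0] * (block + 1) for _ in range(nprocs)]
    mailbox = {}
    # Sends come first: B is never written, so its last element is ready at once.
    for pid in range(nprocs):
        if pid < last_p:
            mailbox[pid + 1] = loc_b[pid][block]
    for pid in range(nprocs):
        lo = 2 if pid == 0 else 1
        hi = n % block + 1 if pid == last_p else block
        if pid <= last_p:
            if lo == 1 and pid != 0:
                loc_b[pid][0] = mailbox[pid]          # RECV into the shadow cell
                loc_a[pid][1] = loc_b[pid][0] + c     # peeled first iteration
            for i in range(2, hi + 1):
                loc_a[pid][i] = loc_b[pid][i - 1] + c
    gathered = [0.0] * (block * nprocs + 1)
    for pid in range(nprocs):
        for i in range(1, block + 1):
            gathered[pid * block + i] = loc_a[pid][i]
    return gathered


b = [0.0] + [float(k) for k in range(1, 13)]  # 12 elements: 3 processors, block 4
print(spmd(b, 1.0, 11, 4, 3) == sequential(b, 1.0, 11))
```

Small sizes are used so that every boundary case (first processor, last processor with a partial block) can be checked exhaustively.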


Communication Generation

  • When is such rearrangement legal?

  • Receive: copy from global to local location

  • Send: copy local to global location

    IF (PID <= lastP) THEN

    S1:IF (lo == 1 && PID /= 0) THEN

    B(0) = Bg(0) ! RECV

    A(1) = B(0) + C

    ENDIF

    DO i = 2, hi

    A(i) = B(i-1) + C

    ENDDO

    S2:IF (PID /= lastP) Bg(100) = B(100) ! SEND

    ENDIF

No chain of dependences from S1 to S2, so the SEND in S2 can be moved ahead of S1


Communication Generation

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK)

...

DO I = 1, N

A(I+1) = A(I) + C

ENDDO

Would be rewritten as:

IF (PID <= lastP) THEN

S1:IF (lo == 1 && PID /= 0) THEN

A(0) = Ag(0) ! RECV

A(1) = A(0) + C

ENDIF

DO i = 2, hi

A(i) = A(i-1) + C

ENDDO

S2:IF (PID /= lastP)

Ag(100) = A(100) ! SEND

ENDIF

The rearrangement would not be correct here: the value sent in S2, A(100), is computed from the value received in S1
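The effect of hoisting the send in the presence of this recurrence can be reproduced with a small emulation. The sketch below uses two processors with a block of four elements, initial values of zero and C = 1, all of which are assumptions chosen for illustration. With `send_first=True` every processor sends its last element before computing anything, so its neighbour receives a stale value.

```python
# Sketch only: A(I+1) = A(I) + C with the send placed before or after the compute.

def recurrence(send_first, block=4, nprocs=2, c=1.0):
    """Return the gathered array A after running every processor in turn."""
    # local[p][0] is the shadow cell A(0); local[p][1..block] is the owned block.
    local = [[0.0] * (block + 1) for _ in range(nprocs)]
    mailbox = {}
    if send_first:
        # Hoisted send: A(block) has not been computed yet, so this is stale.
        for pid in range(nprocs - 1):
            mailbox[pid + 1] = local[pid][block]
    for pid in range(nprocs):
        if pid != 0:
            local[pid][0] = mailbox[pid]            # RECV
            local[pid][1] = local[pid][0] + c       # peeled first iteration
        for i in range(2, block + 1):
            local[pid][i] = local[pid][i - 1] + c
        if not send_first and pid != nprocs - 1:
            mailbox[pid + 1] = local[pid][block]    # SEND after the loop
    return [x for blk in local for x in blk[1:]]


print(recurrence(send_first=False))
print(recurrence(send_first=True))
```

With the send in its original position the second processor continues the sequence started by the first; with the hoisted send it restarts from the stale value.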


Overview

  • Motivation for HPF

  • Overview of compiling HPF programs

  • Basic Loop Compilation for HPF

  • Optimizations for compiling HPF

  • Results and Summary


Communication Vectorization

REAL A(10000,100), B(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J) = B(I,J) + C

ENDDO

ENDDO

Using basic loop compilation gives:

DO J = 1, M

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDIF

ENDDO


Communication Vectorization

DO J = 1, M

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDIF

ENDDO

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

ENDDO

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

ENDDO

DO J = 1, M

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDDO

ENDIF

Distribute the J loop


Communication Vectorization

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (PID /= lastP)

SEND(PID+1, B(100,J), 1)

ENDDO

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, B(0,J), 1)

A(1,J) = B(0,J) + C

ENDIF

ENDDO

DO J = 1, M

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDDO

ENDIF

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

IF (lo == 1) THEN

RECV (PID-1, B(0,1:M), M)

DO J = 1, M

A(1,J) = B(0,J) + C

ENDDO

ENDIF

DO J = 1, M

DO i = 2, hi

A(i,J) = B(i-1,J) + C

ENDDO

ENDDO

IF (PID /= lastP)

SEND(PID+1, B(100,1:M), M)

ENDIF
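The benefit of replacing M one-element messages with a single M-element message can be estimated with a simple cost model. The sketch assumes that sending n elements costs alpha + beta*n, where alpha is a fixed per-message startup cost and beta a per-element cost. The model and the numbers below are illustrative assumptions, not measurements from the slides.

```python
# Sketch only: linear message-cost model, cost(n) = alpha + beta * n.

def comm_cost(num_messages, elements_per_message, alpha, beta):
    """Total cost of sending num_messages messages of the given size."""
    return num_messages * (alpha + beta * elements_per_message)


M = 100                    # number of J iterations (columns)
alpha, beta = 50.0, 1.0    # hypothetical startup and per-element costs

unvectorized = comm_cost(M, 1, alpha, beta)   # one element per J iteration
vectorized = comm_cost(1, M, alpha, beta)     # one message carrying B(100,1:M)
print(unvectorized, vectorized)
```

Under this model the startup cost is paid once instead of M times, which is where the gain from vectorization comes from whenever alpha dominates beta.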


Communication Vectorization

DO J = 1, M

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP)

hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

S1:IF (PID /= lastP)

Bg(100,J)=B(100,J)

IF (lo == 1) THEN

S2:B(0,J)=Bg(0,J)

S3:A(1,J) = B(0,J) + C

ENDIF

DO i = 2, hi

S4:A(i,J) = B(i-1,J) + C

ENDDO

ENDIF

ENDDO

Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by the outer loop


Communication Vectorization

REAL A(10000,100), B(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J) = A(I,J) + B(I,J)

ENDDO

ENDDO

Can sends be done before the receives?

Can communication be vectorized?

REAL A(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J+1) = A(I,J) + C

ENDDO

ENDDO

Can sends be done before the receives?

Can communication be fully vectorized?


Overlapping Communication and Computation

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

S0:IF (PID /= lastP)

SEND(PID+1, B(100), 1)

S1:IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

L1:DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

ENDIF

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

S0:IF (PID /= lastP)

SEND(PID+1, B(100), 1)

L1:DO i = 2, hi

A(i) = B(i-1) + C

ENDDO

S1:IF (lo == 1 && PID /= 0) THEN

RECV (PID-1, B(0), 1)

A(1) = B(0) + C

ENDIF

ENDIF


Pipelining

REAL A(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*)

DO J = 1, M

DO I = 1, N

A(I+1,J) = A(I,J) + C

ENDDO

ENDDO

Initial code generation for the I loop gives:

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, A(0,J), 1)

A(1,J) = A(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = A(i-1,J) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, A(100,J), 1)

ENDDO

ENDIF

Communication can be vectorized

But vectorizing it gives up parallelism


Pipelining

  • Pipelined parallelism with communication


Pipelining

  • Pipelined parallelism with communication overhead


Pipelining: Blocking

lo = 1

IF (PID==0) lo = 2

hi = 100

lastP = CEIL((N+1)/100) - 1

IF (PID==lastP) hi = MOD(N,100) + 1

IF (PID <= lastP) THEN

DO J = 1, M

IF (lo == 1) THEN

RECV (PID-1, A(0,J), 1)

A(1,J) = A(0,J) + C

ENDIF

DO i = 2, hi

A(i,J) = A(i-1,J) + C

ENDDO

IF (PID /= lastP)

SEND(PID+1, A(100,J), 1)

ENDDO

ENDIF

...

IF (PID <= lastP) THEN

DO J = 1, M, K

IF (lo == 1) THEN

RECV (PID-1, A(0,J:J+K-1), K)

DO j = J, J+K-1

A(1,j) = A(0,j) + C

ENDDO

ENDIF

DO j = J, J+K-1

DO i = 2, hi

A(i,j) = A(i-1,j) + C

ENDDO

ENDDO

IF (PID /= lastP)

SEND(PID+1, A(100,J:J+K-1),K)

ENDDO

ENDIF
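The choice of the blocking factor K trades message startup cost against pipeline fill time. The sketch below is a toy model under explicit assumptions: K divides M, each processor needs `t_col` time units per column, a message of K elements costs alpha + beta*K, and a processor starts a block only after finishing its previous block and after its left neighbour has finished the same block. All names and numbers are hypothetical.

```python
# Sketch only: toy timing model for the blocked pipeline.

def pipeline_time(M, K, nprocs, t_col, alpha, beta):
    """Completion time of the last block on the last processor."""
    nblocks = M // K
    stage = K * t_col + alpha + beta * K  # compute K columns, then send K elements
    done = [[0.0] * nblocks for _ in range(nprocs)]
    for p in range(nprocs):
        for b in range(nblocks):
            own_previous = done[p][b - 1] if b > 0 else 0.0
            upstream = done[p - 1][b] if p > 0 else 0.0
            done[p][b] = max(own_previous, upstream) + stage
    return done[-1][-1]


M, nprocs = 100, 4
for K in (1, 10, 100):
    print(K, pipeline_time(M, K, nprocs, t_col=100.0, alpha=500.0, beta=1.0))
```

In this model K = 1 pays the startup cost on every column, K = M serializes the processors completely, and an intermediate K does better than either extreme.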


Other Optimizations

  • Alignment and Replication

  • Identification of common recurrences

  • Storage management

    • Minimize temporary storage used for communication

    • Space taken for temporary storage should be at most equal to the space taken by the arrays

  • Interprocedural Optimizations


Results


Summary

  • HPF is easy to code

    • But hard to compile

  • Steps required to compile HPF programs

    • Basic loop compilation

      • Communication generation

    • Optimizations

      • Communication vectorization

      • Overlapping communication with computation

      • Pipelining

