
Compiling High Performance Fortran


Presentation Transcript


  1. Compiling High Performance Fortran Allen and Kennedy, Chapter 14

  2. Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary

  3. Motivation for HPF • Target: a scalable distributed-memory multiprocessor • Such machines require "message passing" to communicate data between processors • Approach 1: use MPI calls in Fortran/C code

  4. Motivation for HPF
     Consider the following sum reduction:

       PROGRAM SUM
       REAL A(10000)
       READ (9) A
       SUM = 0.0
       DO I = 1, 10000
         SUM = SUM + A(I)
       ENDDO
       PRINT SUM
       END

     MPI implementation:

       PROGRAM SUM
       REAL A(100), BUFF(100)
       IF (PID == 0) THEN
         DO IP = 0, 99
           READ (9) BUFF(1:100)
           IF (IP == 0) THEN
             A(1:100) = BUFF(1:100)
           ELSE
             SEND(IP, BUFF, 100)
           ENDIF
         ENDDO
       ELSE
         RECV(0, A, 100)
       ENDIF
       ! ... actual local sum reduction code here ...
       IF (PID == 0) SEND(1, SUM, 1)
       IF (PID > 0) THEN
         RECV(PID-1, T, 1)
         SUM = SUM + T
         IF (PID < 99) THEN
           SEND(PID+1, SUM, 1)
         ELSE
           SEND(0, SUM, 1)
         ENDIF
       ENDIF
       IF (PID == 0) THEN
         RECV(99, SUM, 1)      ! collect the final total sent by processor 99
         PRINT SUM
       ENDIF
       END

  5. Motivation for HPF • Disadvantages of MPI approach • User has to rewrite the program in SPMD form [Single Program Multiple Data] • User has to manage data movement [send & receive], data placement and synchronization • Too messy and not easy to master

  6. Motivation for HPF • Approach 2: Use HPF • HPF is an extended version of Fortran 90: it keeps the Fortran 90 features and adds a few directives • Directives tell how data is laid out in the processor memories of the parallel machine, for example !HPF$ DISTRIBUTE A(BLOCK) • Directives also assist in identifying parallelism, for example !HPF$ INDEPENDENT
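      As a small illustration of how the two kinds of directives work together (a
      minimal sketch, not taken from the book; the array and loop are made up):

        REAL A(10000)
        !HPF$ DISTRIBUTE A(BLOCK)   ! data layout: contiguous blocks of A, one per processor
        ...
        !HPF$ INDEPENDENT           ! parallelism: asserts the iterations do not conflict
        DO I = 1, 10000
          A(I) = 2.0 * A(I)
        ENDDO

      Because the directives are structured comments, an ordinary Fortran 90 compiler
      simply ignores them and the same source still runs sequentially.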

  7. Motivation for HPF
     The same sum reduction code:

       PROGRAM SUM
       REAL A(10000)
       READ (9) A
       SUM = 0.0
       DO I = 1, 10000
         SUM = SUM + A(I)
       ENDDO
       PRINT SUM
       END

     When written in HPF:

       PROGRAM SUM
       REAL A(10000)
       !HPF$ DISTRIBUTE A(BLOCK)
       READ (9) A
       SUM = 0.0
       DO I = 1, 10000
         SUM = SUM + A(I)
       ENDDO
       PRINT SUM
       END

     • Minimum modification • Easy to write • Now the compiler has to do more work
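      "More work" here means that the compiler must in effect produce the kind of SPMD
      node program the user of Approach 1 wrote by hand on slide 4. A rough sketch of
      such a node program, using the same abstract SEND/RECV primitives and PID
      convention as the slides (illustrative only, not the book's generated code):

        REAL A(100)              ! the local 100-element block of the distributed A
        SUM = 0.0
        DO i = 1, 100            ! each processor sums the elements it owns
          SUM = SUM + A(i)
        ENDDO
        IF (PID > 0) THEN        ! chain the partial sums through the processors
          RECV(PID-1, T, 1)
          SUM = SUM + T
        ENDIF
        IF (PID < 99) THEN
          SEND(PID+1, SUM, 1)    ! pass the running total to the next processor
        ELSE
          SEND(0, SUM, 1)        ! processor 99 returns the final total to processor 0
        ENDIF
        IF (PID == 0) THEN
          RECV(99, SUM, 1)
          PRINT SUM
        ENDIF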

  8. Motivation for HPF • Advantages of HPF • User needs only to write some easy directives; need not write the whole program in SPMD form • User does not need to manage data movement [send & receive] and synchronization • Simple and easy to master

  9. Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary

  10. HPF Compilation Overview
      • Dependence Analysis: used for communication analysis
      • Fact used here: no dependence is carried by the I loops

      Running example:

        REAL A(10000), B(10000)
        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        DO J = 1, 10000
          DO I = 2, 10000
  S1:       A(I) = B(I-1) + C
          ENDDO
          DO I = 1, 10000
  S2:       B(I) = A(I)
          ENDDO
        ENDDO

  11. HPF Compilation Overview • Dependence Analysis • Distribution Analysis • Running example: as on slide 10

  12. HPF Compilation Overview • Dependence Analysis • Distribution Analysis • Computation Partitioning: partition so as to distribute the work of the I loops • Running example: as on slide 10

  13. HPF Compilation Overview
      • Dependence Analysis
      • Distribution Analysis
      • Computation Partitioning
      • Communication Analysis and Placement: communication is required for B(0) on
        each iteration of the J loop; B(0) is a shadow region

        REAL A(1:100), B(0:100)
        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        DO J = 1, 10000
  I1:     IF (PID /= 99) SEND(PID+1, B(100), 1)
  I2:     IF (PID /= 0) THEN
            RECV(PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
          DO I = 2, 100
  S1:       A(I) = B(I-1) + C
          ENDDO
          DO I = 1, 100
  S2:       B(I) = A(I)
          ENDDO
        ENDDO

  14. HPF Compilation Overview
      • Dependence Analysis
      • Distribution Analysis
      • Computation Partitioning
      • Communication Analysis and Placement
      • Optimization: aggregation, overlapping communication with computation,
        recognition of reductions

        REAL A(1:100), B(0:100)
        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        DO J = 1, 10000
  I1:     IF (PID /= 99) SEND(PID+1, B(100), 1)
          DO I = 2, 100
  S1:       A(I) = B(I-1) + C
          ENDDO
  I2:     IF (PID /= 0) THEN
            RECV(PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
          DO I = 1, 100
  S2:       B(I) = A(I)
          ENDDO
        ENDDO

  15. Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary

  16. Basic Loop Compilation • Distribution Propagation and Analysis: determine what distribution holds for a given array at a given point in the program • Difficult because of the REALIGN and REDISTRIBUTE directives, and because the distribution of a formal parameter is inherited from the calling procedure • Use the "Reaching Decompositions" data-flow analysis and its interprocedural version (see the sketch below)
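      A sketch of why this analysis is needed (hypothetical code; it assumes the HPF
      DYNAMIC and REDISTRIBUTE directives, and the subroutine names are made up):

        REAL A(10000)
        !HPF$ DYNAMIC A
        !HPF$ DISTRIBUTE A(BLOCK)
        CALL PHASE1(A)                 ! the dummy argument inherits the distribution
                                       ! that reaches this call site (BLOCK)
        !HPF$ REDISTRIBUTE A(CYCLIC)
        CALL PHASE2(A)                 ! a different distribution (CYCLIC) reaches here

      "Reaching decompositions" propagates this information through the program and,
      in its interprocedural form, across the call boundaries into PHASE1 and PHASE2.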

  17. Basic Loop Compilation • For simplicity, assume a single distribution for an array at all points in a subprogram • Define the block size: if an array A of size N is block-distributed over P processors, each processor holds one contiguous block of B = CEILING(N/P) elements, and A(I) is owned by processor (I-1)/B (integer division, processors numbered 0..P-1) • For A(10000) on 100 processors, B = 100
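      The same quantities written out as code (a sketch; BLKSZ and OWNER are
      hypothetical helper functions, not part of HPF):

        INTEGER FUNCTION BLKSZ(N, P)
        INTEGER N, P
        BLKSZ = (N + P - 1) / P        ! block size B = ceiling(N/P)
        END

        INTEGER FUNCTION OWNER(I, B)
        INTEGER I, B
        OWNER = (I - 1) / B            ! 0-based processor number that owns A(I)
        END

      For A(10000) on 100 processors this gives B = 100, so A(1:100) lives on
      processor 0, A(101:200) on processor 1, and so on, which is the owner-computes
      picture used on the next slide.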

  18. Basic Loop Compilation: Iteration Partitioning
      • Dividing the work among processors (computation partitioning): determine which
        iterations of a loop will be executed on which processor
      • Owner-computes rule: iteration I is executed on the owner of A(I)

        REAL A(10000)
        !HPF$ DISTRIBUTE A(BLOCK)
        DO I = 1, 10000
          A(I) = A(I) + C
        ENDDO

      • With 100 processors, the first 100 iterations run on processor 0, the next 100
        on processor 1, and so on

  19. Iteration Partitioning • When a loop contains multiple statements in a recurrence, choose one reference as the partitioning reference • The processor responsible for performing the computation for iteration I is the owner of the partitioning reference for that iteration; for the reference A(I+1) used below, with block size B, this is I/B (integer division) • The set of indices executed on processor p is the set of iterations whose partitioning reference p owns: [pB : pB+B-1], intersected with the loop bounds

  20. Iteration Partitioning
      • Have to map the global loop index I to a local loop index i on each processor
      • The smallest value in each processor's index range maps to local iteration 1

        REAL A(10000)
        !HPF$ DISTRIBUTE A(BLOCK)
        DO I = 1, N
          A(I+1) = B(I) + C
        ENDDO

  21. Iteration Partitioning

        REAL A(10000), B(10000)
        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        DO I = 1, N
          A(I+1) = B(I) + C
        ENDDO

      • Map the global iteration space I to the local iteration space i as follows:
        with block size 100, processor PID executes the global iterations
        I in [100*PID : 100*PID+99] (clipped to [1:N]) as the local iterations
        i = I - 100*PID + 1
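      A small sketch of the index conversion the compiler inserts, with the block size
      hard-wired to 100 as in the example (the function names are hypothetical):

        INTEGER FUNCTION TOLOCAL(I, PID)
        INTEGER I, PID
        TOLOCAL = I - 100*PID + 1      ! global iteration I -> local iteration on PID
        END

        INTEGER FUNCTION TOGLOBAL(i, PID)
        INTEGER i, PID
        TOGLOBAL = i + 100*PID - 1     ! local iteration on PID -> global iteration I
        END

      For example, on processor 3 the global iterations 300..399 become local
      iterations 1..100, while on processor 0 the global iteration 1 becomes local
      iteration 2, which is why the generated code below sets lo = 2 for PID 0.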

  22. Iteration Partitioning • Adjust the array subscripts for local iterations: in terms of the local index i, A(I+1) becomes A(i) and B(I) becomes B(i-1), where B(0) is a shadow location that holds the value received from the neighbouring processor

  23. Iteration Partitioning
      • For the interior processors the code becomes:

        DO i = 1, 100
          A(i) = B(i-1) + C
        ENDDO

      • Adjust the lower bound for the first processor and the upper bound of the last
        processor to take care of the boundary conditions:

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
        DO i = lo, hi
          A(i) = B(i-1) + C
        ENDDO

  24. Communication Generation • For our example, no communication is required for the iterations I in [100p+1 : 100p+99] executed on processor p • The iteration that requires receiving data is I = 100p, whose operand B(100p) is owned by processor p-1 • The iteration that requires sending data is I = 100p+100: processor p must send B(100p+100) to processor p+1, which executes that iteration

  25. Communication Generation

        REAL A(10000), B(10000)
        !HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
        ...
        DO I = 1, N
          A(I+1) = B(I) + C
        ENDDO

      • Receive required for the iterations in [100p : 100p]
      • Send required for the iterations in [100p+100 : 100p+100]
      • No communication required for the iterations in [100p+1 : 100p+99]

  26. Communication Generation
      After inserting the receive:

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
        DO i = lo, hi
          IF (i == 1 .AND. PID /= 0) RECV(PID-1, B(0), 1)
          A(i) = B(i-1) + C
        ENDDO

      The send must happen in the 101st iteration:

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        DO i = lo, hi+1
          IF (i == 1 .AND. PID /= 0) RECV(PID-1, B(0), 1)
          IF (i <= hi) THEN
            A(i) = B(i-1) + C
          ENDIF
          IF (i == hi+1 .AND. PID /= lastP) SEND(PID+1, B(100), 1)
        ENDDO

  27. Communication Generation
      Move the SEND outside the loop (starting from the previous version):

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
          DO i = lo, hi
            IF (i == 1 .AND. PID /= 0) RECV(PID-1, B(0), 1)
            A(i) = B(i-1) + C
          ENDDO
          IF (PID /= lastP) SEND(PID+1, B(100), 1)
        ENDIF

  28. Communication Generation
      Move the RECV outside the loop and peel off the first iteration:

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
          IF (lo == 1 .AND. PID /= 0) THEN
            RECV(PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
          ! lo = MAX(lo, 1+1) == 2
          DO i = 2, hi
            A(i) = B(i-1) + C
          ENDDO
          IF (PID /= lastP) SEND(PID+1, B(100), 1)
        ENDIF

  29. Communication Generation
      Finally, issue the SEND before the RECV and the computation:

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
          IF (PID /= lastP) SEND(PID+1, B(100), 1)
          IF (lo == 1 .AND. PID /= 0) THEN
            RECV(PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
          DO i = 2, hi
            A(i) = B(i-1) + C
          ENDDO
        ENDIF

  30. Communication Generation
      • When is such a rearrangement legal? Model a receive as a copy from a global
        location to a local location, and a send as a copy from a local location to a
        global location:

        IF (PID <= lastP) THEN
  S1:     IF (lo == 1 .AND. PID /= 0) THEN
            B(0) = Bg(0)                      ! RECV
            A(1) = B(0) + C
          ENDIF
          DO i = 2, hi
            A(i) = B(i-1) + C
          ENDDO
  S2:     IF (PID /= lastP) Bg(100) = B(100)  ! SEND
        ENDIF

      • The rearrangement is legal because there is no chain of dependences from S1 to S2

  31. Communication Generation
      In contrast, consider:

        REAL A(10000), B(10000)
        !HPF$ DISTRIBUTE A(BLOCK)
        ...
        DO I = 1, N
          A(I+1) = A(I) + C
        ENDDO

      which would be rewritten as:

        IF (PID <= lastP) THEN
  S1:     IF (lo == 1 .AND. PID /= 0) THEN
            A(0) = Ag(0)                      ! RECV
            A(1) = A(0) + C
          ENDIF
          DO i = 2, hi
            A(i) = A(i-1) + C
          ENDDO
  S2:     IF (PID /= lastP) Ag(100) = A(100)  ! SEND
        ENDIF

      Here the rearrangement would not be correct: the value A(100) sent at S2 is
      computed by the loop, which in turn depends on the value received at S1, so
      there is a chain of dependences from S1 to S2.

  32. Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary

  33. Communication Vectorization

        REAL A(10000,100), B(10000,100)
        !HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
        DO J = 1, M
          DO I = 1, N
            A(I+1,J) = B(I,J) + C
          ENDDO
        ENDDO

      Using basic loop compilation on the I loop gives:

        DO J = 1, M
          lo = 1
          IF (PID == 0) lo = 2
          hi = 100
          lastP = CEIL((N+1)/100) - 1
          IF (PID == lastP) hi = MOD(N,100) + 1
          IF (PID <= lastP) THEN
            IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
            IF (lo == 1) THEN
              RECV(PID-1, B(0,J), 1)
              A(1,J) = B(0,J) + C
            ENDIF
            DO i = 2, hi
              A(i,J) = B(i-1,J) + C
            ENDDO
          ENDIF
        ENDDO

  34. Communication Vectorization
      Distribute the J loop (starting from the previous version):

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
          DO J = 1, M
            IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
          ENDDO
          DO J = 1, M
            IF (lo == 1) THEN
              RECV(PID-1, B(0,J), 1)
              A(1,J) = B(0,J) + C
            ENDIF
          ENDDO
          DO J = 1, M
            DO i = 2, hi
              A(i,J) = B(i-1,J) + C
            ENDDO
          ENDDO
        ENDIF

  35. Communication Vectorization
      Vectorize the communication (starting from the previous version):

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
          IF (lo == 1) THEN
            RECV(PID-1, B(0,1:M), M)
            DO J = 1, M
              A(1,J) = B(0,J) + C
            ENDDO
          ENDIF
          DO J = 1, M
            DO i = 2, hi
              A(i,J) = B(i-1,J) + C
            ENDDO
          ENDDO
          IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
        ENDIF

  36. Communication Vectorization
      The generated code of slide 33, with the communication written as global/local
      copies:

        DO J = 1, M
          lo = 1
          IF (PID == 0) lo = 2
          hi = 100
          lastP = CEIL((N+1)/100) - 1
          IF (PID == lastP) hi = MOD(N,100) + 1
          IF (PID <= lastP) THEN
  S1:       IF (PID /= lastP) Bg(100,J) = B(100,J)
            IF (lo == 1) THEN
  S2:         B(0,J) = Bg(0,J)
  S3:         A(1,J) = B(0,J) + C
            ENDIF
            DO i = 2, hi
  S4:         A(i,J) = B(i-1,J) + C
            ENDDO
          ENDIF
        ENDDO

      • Communication statements resulting from an inner loop can be vectorized with
        respect to an outer loop if they are not involved in a recurrence carried by
        that outer loop

  37. Communication Vectorization

        REAL A(10000,100), B(10000,100)
        !HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
        DO J = 1, M
          DO I = 1, N
            A(I+1,J) = A(I,J) + B(I,J)
          ENDDO
        ENDDO

      • Can the sends be done before the receives? Can the communication be vectorized?

        REAL A(10000,100)
        !HPF$ DISTRIBUTE A(BLOCK,*)
        DO J = 1, M
          DO I = 1, N
            A(I+1,J+1) = A(I,J) + C
          ENDDO
        ENDDO

      • Can the sends be done before the receives? Can the communication be fully vectorized?

  38. Overlapping Communication and Computation
      Starting from the code of slide 29, move the peeled receive iteration after the
      main loop, so that the computation overlaps the communication:

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
  S0:     IF (PID /= lastP) SEND(PID+1, B(100), 1)
  L1:     DO i = 2, hi
            A(i) = B(i-1) + C
          ENDDO
  S1:     IF (lo == 1 .AND. PID /= 0) THEN
            RECV(PID-1, B(0), 1)
            A(1) = B(0) + C
          ENDIF
        ENDIF

  39. Pipelining

        REAL A(10000,100)
        !HPF$ DISTRIBUTE A(BLOCK,*)
        DO J = 1, M
          DO I = 1, N
            A(I+1,J) = A(I,J) + C
          ENDDO
        ENDDO

      Initial code generation for the I loop gives:

        lo = 1
        IF (PID == 0) lo = 2
        hi = 100
        lastP = CEIL((N+1)/100) - 1
        IF (PID == lastP) hi = MOD(N,100) + 1
        IF (PID <= lastP) THEN
          DO J = 1, M
            IF (lo == 1) THEN
              RECV(PID-1, A(0,J), 1)
              A(1,J) = A(0,J) + C
            ENDIF
            DO i = 2, hi
              A(i,J) = A(i-1,J) + C
            ENDDO
            IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
          ENDDO
        ENDIF

      • The communication could be vectorized, but that gives up the pipelined parallelism

  40. Pipelining • Pipelined parallelism with communication

  41. Pipelining • Pipelined parallelism with communication overhead

  42. Pipelining: Blocking
      Block the J loop with a block size K, so that each message carries K values
      (setup of lo, hi and lastP as before):

        ...
        IF (PID <= lastP) THEN
          DO J = 1, M, K
            IF (lo == 1) THEN
              RECV(PID-1, A(0,J:J+K-1), K)
              DO jj = J, J+K-1
                A(1,jj) = A(0,jj) + C
              ENDDO
            ENDIF
            DO jj = J, J+K-1
              DO i = 2, hi
                A(i,jj) = A(i-1,jj) + C
              ENDDO
            ENDDO
            IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
          ENDDO
        ENDIF
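      One simple way to see the trade-off that governs the choice of K (a
      back-of-envelope model, not the book's derivation; $t_{col}$ is the time to
      compute one column of the local block and $t_{msg}$ the fixed cost of a message):

        $$ T(K) \approx \Big(P - 1 + \frac{M}{K}\Big)\big(K\,t_{col} + t_{msg}\big),
           \qquad \frac{dT}{dK} = 0 \;\Rightarrow\;
           K_{opt} \approx \sqrt{\frac{M\,t_{msg}}{(P-1)\,t_{col}}} $$

      A small K fills the pipeline quickly but pays the message overhead M/K times; a
      large K amortizes messages but delays the downstream processors, so the best
      block size lies in between.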

  43. Other Optimizations • Alignment and Replication • Identification of common recurrences • Storage Management • Minimize the temporary storage used for communication • The space taken for temporary storage should be at most equal to the space taken by the arrays themselves • Interprocedural Optimizations
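      For the first bullet, a minimal sketch of what alignment buys (illustrative
      arrays, not from the book):

        REAL X(10000), Y(10000)
        !HPF$ DISTRIBUTE X(BLOCK)
        !HPF$ ALIGN Y(I) WITH X(I)   ! place Y(I) on the processor that owns X(I)
        DO I = 1, 10000
          X(I) = X(I) + Y(I)         ! perfectly aligned operands: no communication
        ENDDO

      Because every Y(I) lives with the X(I) it is combined with, owner-computes
      partitioning of this loop needs no sends or receives; replication (keeping a
      copy of a small array on every processor) is another way to avoid such
      communication, at the cost of extra storage.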

  44. Results

  45. Summary • HPF is easy to code • But hard to compile • Steps required to compile HPF programs • Basic loop compilation • Communication generation • Optimizations • Communication vectorization • Overlapping communication with computation • Pipelining
