High Performance on the J90 Systems

High Performance on the J90 Systems. David Turner & Tom DeBoni NERSC User Services Group April 1999. Philosophical Ramblings. Design for optimization? Where to start? When to stop?. J90 Potential. STREAM benchmark results Sustainable memory bandwidth (http://www.cs.virginia.edu/stream)

Presentation Transcript


  1. High Performance on the J90 Systems David Turner & Tom DeBoni NERSC User Services Group April 1999

  2. Philosophical Ramblings Design for optimization? Where to start? When to stop? 13 April, 1999 High Performance on the J90 Systems 2

  3. J90 Potential
     STREAM benchmark results: sustainable memory bandwidth (http://www.cs.virginia.edu/stream), John McCalpin, SGI

         Kernel  Operation              bytes/iter  FLOPS/iter
         COPY    a(i) = b(i)            16          0
         TRIAD   a(i) = b(i) + q*c(i)   24          2

  4. STREAM Results
     COPY and TRIAD are bandwidths in MB/s.

         Machine         ncpus       COPY      TRIAD   MFLOPS
         Cray_C90           16   105497.0   103812.0   8651.0
         Cray_C90            8    55071.9    63229.6   5269.1
         Cray_C90            1     6965.4     9500.7    791.7
         Cray_J932          16    16298.2    14995.9   1249.7
         Cray_J932           8     9995.2     8941.3    745.1
         Cray_J932           1     1433.6     1270.0    105.8
         Cray_T3E-900       16     7497.0     8828.0    735.7
         Cray_T3E-900        8     3747.0     4471.0    372.6
         Cray_T3E-900        1      484.0      568.0     47.3
         SGI_Origin_2K      16     5560.0     5240.0    436.7
         SGI_Origin_2K       8     2570.0     2740.0    228.3
         SGI_Origin_2K       1      332.0      358.0     29.8
         Sun_UE_10000       16     2371.0     2905.0    242.1
         Sun_UE_10000        8     1271.0     1546.0    128.8
         Sun_UE_10000        1      164.0      202.0     16.8
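
The MFLOPS column in this table is consistent with dividing the TRIAD bandwidth by 12, because TRIAD moves 24 bytes for every 2 floating-point operations (slide 3). The sketch below checks that relationship for the Cray_J932 rows. It assumes the bandwidth figures are in MB/s, the unit STREAM normally reports; the function and variable names are illustrative and not part of the original slides.

```python
# TRIAD, a(i) = b(i) + q*c(i), moves 24 bytes and performs 2 flops per iteration.
TRIAD_BYTES_PER_ITER = 24
TRIAD_FLOPS_PER_ITER = 2


def triad_mflops(triad_mb_per_s):
    """Convert a TRIAD bandwidth (assumed to be MB/s) into MFLOPS."""
    bytes_per_flop = TRIAD_BYTES_PER_ITER / TRIAD_FLOPS_PER_ITER  # 12 bytes per flop
    return triad_mb_per_s / bytes_per_flop


# Cray_J932 rows from the table: ncpus -> (TRIAD bandwidth, listed MFLOPS)
j932 = {16: (14995.9, 1249.7), 8: (8941.3, 745.1), 1: (1270.0, 105.8)}

for ncpus, (triad, listed) in sorted(j932.items()):
    computed = triad_mflops(triad)
    print(f"{ncpus:2d} CPUs: computed {computed:7.1f} MFLOPS, listed {listed:7.1f}")
```

Read this way, the MFLOPS figure is a restatement of memory bandwidth rather than an independent measurement of arithmetic speed.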

  5. STREAM Results (cont.)

         Machine                      COPY    TRIAD  MFLOPS
         Cray_C90                   6965.4   9500.7   791.7
         Cray_J932                  1433.6   1270.0   105.8
         Compaq_AlphaServer_DS20    1077.0   1323.0   110.2
         IBM_RS6000-397              778.8    882.4    73.5
         Cray_T3E-900                484.0    568.0    47.3
         SGI_Origin_2K               332.0    358.0    29.8
         Generic_440BX_400           304.0    315.4    26.3
         Sun_Ultra2-2200             228.5    189.9    25.9
         Sun_UE_10000                164.0    202.0    16.8
         Apple_Mac_G3_266            137.1    137.1    11.4

  6. Tools
     • F90 (with lots of options)
     • ja
         ./name
         ja -cst -n name
     • hpm
     • prof
     • flowview
     • atexpert

  7. Program “SLOW”

         PROGRAM SLOW
         IMPLICIT NONE

         INTEGER, PARAMETER :: DIMSIZE=8000000
         REAL, DIMENSION(DIMSIZE) :: X, Y, Z
         INTEGER :: I, J

         X = RANF()
         Y = RANF()
         DO J = 1, 10
           DO I = 1, DIMSIZE
             Z(I) = LOG(SIN(X(I))**2 + COS(Y(I))**4)
           END DO
           PRINT *, Z(DIMSIZE-1)
         ENDDO
         STOP
         END PROGRAM SLOW

  8. No Optimization

         f90 -O0 -r6 -O,msgs,negmsgs -o slow slow.f90

         x = RANF()
         cf90-6204 f90: VECTOR SLOW, File = slow.f90, Line = 8
           A loop starting at line 8 was vectorized.

         y = RANF()
         cf90-6204 f90: VECTOR SLOW, File = slow.f90, Line = 9
           A loop starting at line 9 was vectorized.

  9. Moderate Optimization

         f90 -O1 -r6 -O,msgs,negmsgs -o slow slow.f90

         do j = 1, 10
         cf90-6286 f90: VECTOR SLOW, File = slow.f90, Line = 10
           A loop starting at line 10 was not vectorized because it contains
           input/output operations at line 14.

         DO i = 1, DIMSIZE
         cf90-6204 f90: VECTOR SLOW, File = slow.f90, Line = 11
           A loop starting at line 11 was vectorized.

         z(i) = LOG(SIN(x(i))**2 + COS(y(i))**4)
         cf90-6001 f90: SCALAR SLOW, File = slow.f90, Line = 12
           An exponentiation was replaced by optimization. This may cause
           numerical differences.

  10. High Optimization

          f90 -O3 -r6 -O,msgs,negmsgs -o slow slow.f90

          cf90-6502 f90: TASKING SLOW, File = slow.f90, Line = 10
            A loop starting at line 10 was not tasked because it contains
            input/output operations at line 14.

          cf90-6403 f90: TASKING SLOW, File = slow.f90, Line = 11
            A loop starting at line 11 was tasked.

  11. Optimization Results
      Times are in seconds.

          Opt  NCPUS   Elapsed      User     Sys
            0         768.7530  583.6793  7.1886
            1          89.0162   82.1009  1.1936
            2         104.7003   81.5687  1.0003
            3      1  107.0177   81.6185  1.2994
            3      2   44.6562   81.7050  1.4069
            3      3   41.3401   81.5320  1.3099
            3      4   24.8146   81.8099  1.2968
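
The results table is easier to read as ratios. The sketch below computes the gain from -O0 to -O1 in user time, and the elapsed-time gain at -O3 as NCPUS grows. The dictionary layout and function name are illustrative. Elapsed time on a shared machine presumably includes time spent waiting for CPUs, so the tasking ratios should be treated as indicative only.

```python
# (optimization level, NCPUS) -> (elapsed, user) in seconds, from the results table.
# NCPUS is None where the slide leaves the column blank.
runs = {
    (0, None): (768.7530, 583.6793),
    (1, None): (89.0162, 82.1009),
    (3, 1): (107.0177, 81.6185),
    (3, 2): (44.6562, 81.7050),
    (3, 3): (41.3401, 81.5320),
    (3, 4): (24.8146, 81.8099),
}


def speedup(baseline, improved):
    """Ratio of a baseline time to an improved time."""
    return baseline / improved


# Gain from -O0 to -O1, measured in user (CPU) time.
opt_gain = speedup(runs[(0, None)][1], runs[(1, None)][1])
print(f"-O0 -> -O1 user-time speedup: {opt_gain:.2f}x")

# Elapsed-time gain from tasking at -O3, relative to the NCPUS=1 run.
base_elapsed = runs[(3, 1)][0]
tasking_gain = {
    ncpus: speedup(base_elapsed, runs[(3, ncpus)][0]) for ncpus in (2, 3, 4)
}
for ncpus, gain in tasking_gain.items():
    print(f"-O3, NCPUS={ncpus}: elapsed speedup {gain:.2f}x")
```

User time stays near 81.6 seconds in every -O3 run: tasking shortens the wall-clock time but does not reduce the total CPU work.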

  12. 2 CPU Speedup

          (Concurrent CPUs  *  Connect seconds  =  CPU seconds)
           ---------------     ---------------     -----------
                  1         *       5.4300      =      5.4300
                  2         *      38.1300      =     76.2600

          (Concurrent CPUs  *  Connect seconds  =  CPU seconds)
               (Avg.)              (total)            (total)
           ---------------     ---------------     -----------
                1.88        *      43.5600      =     81.6900

  13. 3 CPU Speedup

          (Concurrent CPUs  *  Connect seconds  =  CPU seconds)
           ---------------     ---------------     -----------
                  1         *       9.2200      =      9.2200
                  2         *      13.5500      =     27.1000
                  3         *      15.0700      =     45.2100

          (Concurrent CPUs  *  Connect seconds  =  CPU seconds)
               (Avg.)              (total)            (total)
           ---------------     ---------------     -----------
                2.15        *      37.8400      =     81.5300

  14. 4 CPU Speedup

          (Concurrent CPUs  *  Connect seconds  =  CPU seconds)
           ---------------     ---------------     -----------
                  1         *       2.0400      =      2.0400
                  2         *       1.7700      =      3.5400
                  3         *       5.3200      =     15.9600
                  4         *      15.0700      =     60.2800

          (Concurrent CPUs  *  Connect seconds  =  CPU seconds)
               (Avg.)              (total)            (total)
           ---------------     ---------------     -----------
                3.38        *      24.2000      =     81.8200
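
The summary line in each of these breakdowns follows from the rows above it: total connect seconds is the sum of the connect column, total CPU seconds is the sum of the products, and the average concurrency is their ratio. A minimal sketch of that arithmetic, with an illustrative function name, using the figures from slides 12 to 14:

```python
def summarize(breakdown):
    """breakdown: list of (concurrent_cpus, connect_seconds) rows.

    Returns (total connect seconds, total CPU seconds, average concurrent CPUs).
    """
    connect_total = sum(seconds for _, seconds in breakdown)
    cpu_total = sum(cpus * seconds for cpus, seconds in breakdown)
    return connect_total, cpu_total, cpu_total / connect_total


two_cpu = [(1, 5.43), (2, 38.13)]
three_cpu = [(1, 9.22), (2, 13.55), (3, 15.07)]
four_cpu = [(1, 2.04), (2, 1.77), (3, 5.32), (4, 15.07)]

for label, rows in (("2 CPU", two_cpu), ("3 CPU", three_cpu), ("4 CPU", four_cpu)):
    connect, cpu, avg = summarize(rows)
    print(f"{label}: avg {avg:.2f} CPUs * {connect:.4f} s = {cpu:.4f} CPU s")
```

The average concurrency falls short of the requested CPU count whenever part of the run executes on fewer CPUs, as the first rows of each breakdown show.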

  15. Useful F90 Options

          -e (0 or i)           initializes storage or flags use of uninitialized variables
          -e n                  flags nonstandard Fortran usage
          -e v                  makes all variables static
          -g                    same as -G0
          -G (0 or 1)           sets debugging level to statement or block
          -m (0 - 4)            message verbosity (0 gives most output)
          -N (72, 80, or 132)   source line length
          -O                    optimization levels 0, 1, 2, 3, aggress, fastint, msgs, negmsgs,
                                inline(0-3), scalar(0-3), task(0-3), vector(0-3)
          -r (0-6, …)           listing levels (6 is EVERYthing)
          -R (a, b, c)          runtime checking: args, array bounds, indexing

  16. Using flowtrace/flowview

          f90 -O1 -ef -o slow slow.f90
          ./slow
          flowview -Luch > slow.flow

          Routine       Tot Time  Percentage   Accum%
          ------------  --------  ----------  -------
          SUB2          5.66E+01       69.02    69.02
          SUB1          2.43E+01       29.63    98.65
          SLOW          1.11E+00        1.35   100.00

  17. Using prof

          f90 -O1 -l prof -o slow slow.f90
          ./slow
          prof -x ./slow > slow.prof
          profview slow.prof

  18. profview Output

  19. Optimization Strategies
      • First, let the compiler do it
      • Vectorize and scalar optimize, then parallelize
      • Vectorization can give you a factor of 10 speedup
      • Scalar optimization can improve performance by 10-50%
      • Parallelism will give you a linear speedup, max
      • Memory contention inhibits gains from parallelism
      • Let the compiler advise you
      • Add directives where appropriate
      • Be sure you tell the truth
      • Check your answers

  20. Scalar Optimization
      • Subroutine or function inlining
      • Fast (32-bit) integers
          -Oallfastint
          -Ofastint
      • Use INTERFACE specifications if passing array sections

  21. Vectorization

  22. Inhibitors to Vectorization
      • Function or subroutine references
          • Inline
          • Push loop
          • Split loop
      • Backwards data dependencies
          • Reorder loop, use temporary vector
      • I/O statements
      • Character or bit manipulations
      • Branches into loop or backward out of loop

  23. Nonvectorizable Code

          DO I = 1, N
            CALL CALC(X(I), Y(I), Z(I))
          ENDDO
          ...
          SUBROUTINE CALC(X, Y, Z)
          Z = ALOG(SQRT((SIN(X) * COS(Y)) ** X))
          RETURN
          END

  24. Inlining

          DO I = 1, N
            Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
          ENDDO

  25. Pushing

          CALL CALC(X, Y, Z, N)
          ...
          SUBROUTINE CALC(X, Y, Z, N)
          DIMENSION X(N), Y(N), Z(N)
          DO I = 1, N
            Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
          ENDDO
          RETURN
          END

  26. Splitting

          DO I = 1, N
            A(I) = ABS(CALC(C(I)))
            B(I) = A(I) ** T * SQRT(C(I))
            A(I) = SIN(ALOG(C(I)))
          ENDDO

  27. Splitting (cont.)

          EXTERNAL CALC
          DO I = 1, N
            A(I) = ABS(CALC(C(I)))
          ENDDO
          DO I = 1, N
            B(I) = A(I) ** T * SQRT(C(I))
            A(I) = SIN(ALOG(C(I)))
          ENDDO

  28. Scalar Recurrence

          DIMENSION A(1000), C(1000)
          DO J = 1, M
            S = BB
            DO I = 1, N
              S = S * C(I)
              A(I) = A(I) + S
            ENDDO
          ENDDO

          <cf90-8135,Scalar,Line=7> Loop starting at line 7 was unrolled 16 times.

  29. Scalar Recurrence (cont.)

          DIMENSION A(1000), C(1000), S(1000)
          DO I = 1, M
            S(I) = BB
          ENDDO
          DO I = 1, N
            DO J = 1, M
              S(J) = S(J) * C(I)
              A(I) = A(I) + S(J)
            ENDDO
          ENDDO

          Loop starting at line 5 was unrolled 2 times.
          A loop starting at line 5 was vectorized.
          A loop starting at line 9 was vectorized.
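
The rewrite on slides 28 and 29 promotes the scalar S to an array and interchanges the two loops, so that the recurrence no longer runs along the inner loop. The sketch below mirrors both forms in plain Python to check that they compute the same A. It illustrates the transformation only, not vector execution, and the function names and test data are invented for the example.

```python
def recurrence_scalar(a, c, bb, m):
    """Slide 28 form: the scalar S carries a dependence through the inner I loop."""
    a = list(a)
    n = len(c)
    for _ in range(m):          # DO J = 1, M
        s = bb
        for i in range(n):      # DO I = 1, N (each S depends on the previous S)
            s = s * c[i]
            a[i] = a[i] + s
    return a


def recurrence_expanded(a, c, bb, m):
    """Slide 29 form: S promoted to an array S(J), with the loops interchanged."""
    a = list(a)
    n = len(c)
    s = [bb] * m                # S(J) = BB for every J
    for i in range(n):          # DO I = 1, N (the recurrence now runs along this outer loop)
        for j in range(m):      # DO J = 1, M (each J updates only its own S(J))
            s[j] = s[j] * c[i]
            a[i] = a[i] + s[j]  # A(I) accumulates a sum over J
    return a


a0 = [0.0, 1.0, 2.0, 3.0]
c0 = [1.5, 0.5, 2.0, 0.25]
print(recurrence_scalar(a0, c0, 2.0, 3))
print(recurrence_expanded(a0, c0, 2.0, 3))
```

Executed sequentially, both versions apply the same multiplications and additions in the same order, so the results match exactly. The cost of the rewrite is the extra M-element array for S.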

  30. Compiler Vector Directives

          CDIR$ directive
          !DIR$ directive

      • VECTOR, NOVECTOR: Turn vectorization on or off until end of program unit.
      • IVDEP: Ignore vector dependencies in next loop.

  31. Parallel Computing
      • Multitasking, microtasking, autotasking, parallel processing, multiprocessing, etc.
      • This is “fine-grained” parallelism
      • Parallelism mostly comes from loop slicing
      • One possible goal: parallelize outer loop(s), vectorize inner loop(s)
      • F90 is capable of autotasking, but it can always benefit from help

  32. Parallelism

  33. Parallelism, cont.

  34. Data “Scoping”

          DIMENSION A(N)
          SUM = 0.0
          DO I = 1, N
            TEMP = DEEP_THOUGHT(A,I)
            SUM = SUM + TEMP * A(I)
          ENDDO

          A, N      Shared, read-only everywhere
          I, TEMP   Private, read-write everywhere
          SUM       Shared, read-write everywhere

  35. Compiler Tasking Directives

          DIMENSION A(N)
          SUM = 0.0
          !MIC$ DOALL SHARED(A,N),PRIVATE(I,TEMP)
          DO I = 1, N
            TEMP = DEEP_THOUGHT(A,I) * A(I)
          !MIC$ GUARD
            SUM = SUM + TEMP
          !MIC$ ENDGUARD
          ENDDO
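
The scoping in this loop (A and N shared and read-only, I and TEMP private, SUM shared and updated only inside the guard) can be mimicked with threads and a lock. The sketch below is an analogy in Python, not Cray autotasking: `deep_thought` is a hypothetical stand-in for DEEP_THOUGHT, the thread count is arbitrary, and CPython threads will not give a real parallel speedup for this kind of loop. It only shows which updates need protection.

```python
import threading


def deep_thought(a, i):
    """Hypothetical stand-in for DEEP_THOUGHT; any function of (a, i) that only reads a."""
    return a[i] + 1


def guarded_sum(a, num_threads=4):
    total = 0                     # shared, read-write (SUM)
    guard = threading.Lock()      # plays the role of GUARD / ENDGUARD

    def worker(start):
        nonlocal total
        # Each thread takes every num_threads-th iteration; i and temp are private.
        for i in range(start, len(a), num_threads):
            temp = deep_thought(a, i) * a[i]   # computed outside the guard
            with guard:                        # only the shared update is serialized
                total += temp

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


data = list(range(1, 101))
print(guarded_sum(data))
```

Keeping the expensive call outside the guarded region matters: everything inside the guard runs one thread at a time, so a guard that covers the whole loop body removes the benefit of running in parallel.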

  36. Threshold Test

          DIMENSION A(N)
          SUM = 0.0
          !MIC$ DOALL VECTOR
          !MIC$ IF(N.GT.1000)
          !MIC$ SHARED(A,N),PRIVATE(I,TEMP)
          DO I = 1, N
            TEMP = DEEP_THOUGHT(A,I)
          !MIC$ GUARD
            SUM = SUM + TEMP * A(I)
          !MIC$ ENDGUARD
          ENDDO

  37. Helping F90 with Parallelism

          DIMENSION A(N), SUM(NumTasks)
          !MIC$ DOALL SHARED(A,N),PRIVATE(J,I,TEMP)
          DO J = 1, NumTasks
            SUM(J) = 0.0
          !MIC$ CNCALL
            DO I = 1, N
              SUM(J) = SUM(J) + DEEP_THOUGHT(A,I,J) * A(I)
            ENDDO
          ENDDO
          TSUM = 0.0
          DO J = 1, NumTasks
            TSUM = TSUM + SUM(J)
          ENDDO

  38. Helping F90 with Directives
      • Useful compiler directives for tasking
          • CASE, ENDCASE
          • CNCALL
          • DOALL
          • DOPARALLEL, ENDDO
          • GUARD, ENDGUARD
          • MAXCPUS
          • NUMCPUS
          • PERMUTATION
          • PARALLEL, ENDPARALLEL
      • These all begin with !MIC$
      • NOTE: There are also OpenMP directives...

  39. Helping F90 with Directives, cont.
      • Directive Work Distribution
          • CHUNKSIZE
          • GUIDED
          • NCPUS_CHUNKS
          • NUMCHUNKS
          • SINGLE
          • VECTOR
      • Directive Parameters
          • AUTOSCOPE
          • IF
          • MAXCPUS
          • PRIVATE
          • SAVELAST
          • SHARED
      • These all augment !MIC$ directives
      • NOTE: There are also OpenMP directive parameters...

  40. atexpert

          f90 -eX -O3 -r6 -o slow slow.f90
          setenv NCPUS 1
          ./slow
          atexpert

  41. atexpert Output

  42. atexpert Output, cont.

  43. atexpert Output, cont.
