
COMP60621 Concurrent Programming for Numerical Applications



  1. COMP60621 Concurrent Programming for Numerical Applications. Lecture 10: Restructuring for Performance (getting the compiler on the scene). Originally prepared by Rizos Sakellariou, Centre for Novel Computing, School of Computer Science, University of Manchester.

  2. Overview
• Introduction to Parallelising Compilers
• Loops and Iteration Spaces
• Loop Transformations
• Conclusions and Further Reading

  3. Parallelising Compilers: An Introduction
A parallelising compiler accepts a sequential program as input and transforms it into a semantically equivalent parallel one, tailored to the requirements of a specific parallel architecture (automatic parallelisation).
• Why parallelising compilers?
  • To reduce the effort of parallelising sequential programs.
  • To achieve, potentially, higher performance.
• Current state of the art:
  • A few commercial auto-parallelisers (KAP, PFA).
  • Many experimental tools publicly available.
• Targets scientific codes:
  • FORTRAN (because of its simplicity and its widespread use; C and pointers are a nightmare!) and DO...ENDDO types of loops.
• Effectiveness:
  • Not as good as an expert programmer; however, they may provide assistance at various stages of the parallelisation process (interactive parallelisation).

  4. Parallelising a Sequential Program
Depending on the parallel computer we use, there are several forms of parallelism that can be exploited, e.g.:
• Statement Parallelism.
  a = b + c
  d = e + f
  The two statements can be executed concurrently.
• Loop Parallelism.
  DO i=1,10
    a(i)=b(i)+c(i)
  ENDDO
  Different loop iterations can be executed in parallel (a speed-up of up to 10 in loop execution time may be achieved).
• Task Parallelism.
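As an added illustration (OpenMP is not part of the original slide): one common way such loop parallelism is made explicit is with an OpenMP directive; the sketch below assumes a Fortran compiler with OpenMP support.
  !$OMP PARALLEL DO
  DO i=1,10
    a(i)=b(i)+c(i)    ! iterations are independent, so they may execute concurrently
  ENDDO
  !$OMP END PARALLEL DO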

  5. Parallelising a Sequential Program
• We concentrate on loop parallelism: most of the execution time of a scientific program is spent in loops, and loops are usually a rich source of parallelism (cf. the potential speed-up of the two codes above).
• We also focus on DO...ENDDO types of loops (not DO WHILE).

  6. The Structure of a Parallelising Compiler

  7. Parallelising Compiler’s Optimisations
• Standard compiler (machine-independent) optimisations:
  • Dead-code elimination.
  • Common subexpression elimination.
  • Constant folding.
  • Detection of loop-invariant computations.
  • etc. (see Chapter 10 in ‘Compilers’ by Aho, Sethi, Ullman).
• Transformations for uncovering parallelism.
• Low-level (machine-dependent) optimisations:
  • see Chapter 9 in ‘Compilers’ by Aho, Sethi, Ullman.
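A small sketch of one of these optimisations (added example, not from the slide): detection of loop-invariant computations lets an expression that is constant across iterations be hoisted out of the loop:
  DO I=1,N
    A(I)=A(I)*(X*Y)    ! X*Y is recomputed on every iteration
  ENDDO
becomes
  T=X*Y                ! loop-invariant product computed once
  DO I=1,N
    A(I)=A(I)*T
  ENDDO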

  8. Parallelising Compilers: Design Issues
• Since the objective is to increase performance (that is, decrease the execution time), a parallelising compiler needs a performance prediction model to evaluate the effects of the transformations it applies. Such a model can be based on the estimation of overheads (as described in previous lectures): unparallelised code, load imbalance, communication, synchronisation, parallelism start-up.
• Then, for a given program, the time to execute the program on P processors can be computed as
    TP = TS / P + Toverheads
  where Toverheads is the time spent on overheads and TS is the time required for sequential execution of the program.
• Thus, in order to increase performance (that is, minimise TP), the compiler should minimise the overheads.
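A worked instance of the model (illustrative numbers, not from the slides): with TS = 100 s, P = 4 and Toverheads = 5 s,
    TP = 100/4 + 5 = 30 s,
i.e. a speed-up of 100/30 ≈ 3.3 instead of the ideal 4. Halving the overheads to 2.5 s raises the speed-up to 100/27.5 ≈ 3.6, which is why minimising the overhead term is the compiler's main lever on performance.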

  9. Parallelising Compilers: Design Issues
Three phases are at the core of the automatic parallelisation process:
• Detection phase: attempts to discover loops which can be executed in parallel without altering the program semantics (i.e. the program must return the same results). Detecting parallelism is the subject of dependence analysis.
• Optimisation phase: attempts to uncover parallelism and minimise overheads using loop transformations.
• Mapping phase (partitioning and scheduling): maps the parallelism onto a parallel architecture (i.e. which processor executes what part of the parallel code).
• We concentrate on the optimisation phase and discuss the mapping phase in other lectures.

  10. Parallelising Compilers: Design Issues
• Detection phase: attempts to discover loops which can be executed in parallel without altering the program semantics (i.e. the program must return the same results). Consider the two loops below.
  This is a parallel loop:
  DO I=1,N
    A(I)=I
    B(I)=2*I-1
  ENDDO
  This should NOT be executed in parallel:
  DO I=1,N
    A(I)=I
    B(I)=A(2*I-1)
  ENDDO
  (In the second loop, iteration I reads A(2*I-1), which is written by iteration 2*I-1: a cross-iteration data dependence.)

  11. Loop Terminology
A perfect loop nest (depth 2):
  DO I = 1, N
    DO J = 1, N, S
      A(I,J) = 0    <- loop body
    ENDDO
  ENDDO
  loop indices: I, J
  lower bound: 1 (both loops)
  upper bound: N (both loops)
  loop stride: 1 (outer), S (inner)
For any particular execution of a statement in the loop nest, the values of the indices of the surrounding loops define an iteration point. The set of all the iteration points defines the iteration space of the loop nest. The above loop nest contains N^2 iteration points, if S=1.
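To make the definitions concrete (a small added sketch): the following prints every iteration point of the nest for N=3 and S=1, one point per output line; there are 3^2 = 9 of them:
  DO I = 1, 3
    DO J = 1, 3
      PRINT *, '(', I, ',', J, ')'    ! one iteration point (I,J) per line
    ENDDO
  ENDDO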

  12. Representing the iteration space
  DO I = 1, 8
    DO J = 1, 8
      A(I,J) = 0
    ENDDO
  ENDDO
(The iteration space is the full 8×8 square of points (I,J).)

  13. Representing the iteration space
  DO I = 1, 8
    DO J = 1, I
      A(I,J) = 0
    ENDDO
  ENDDO
Loops of this type are called triangular (the iteration space is the triangle of points with J <= I).

  14. Representing the iteration space
It is also useful to consider the iteration space as a polytope (i.e., the general term in the sequence: point, segment, polygon, polyhedron, ...):
  DO I=1,N
    DO J=1,I
      DO K=1,J
        ...
      ENDDO
    ENDDO
  ENDDO
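A worked count to go with this view (added here, not on the slide): the 2-deep triangular nest of slide 13 contains N(N+1)/2 iteration points, and the 3-deep nest above contains N(N+1)(N+2)/6, i.e. the integer points of a tetrahedron. A brute-force check of the latter:
  NPOINTS = 0
  DO I=1,N
    DO J=1,I
      DO K=1,J
        NPOINTS = NPOINTS + 1
      ENDDO
    ENDDO
  ENDDO
  ! NPOINTS now equals N*(N+1)*(N+2)/6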

  15. Homework Problems Describe the iteration space of the following loop nests.

  16. Loop Transformations
• A loop nest is transformed into another form. It is important that program semantics are preserved.
• The primary goal of loop transformations is to reduce the overheads; the same transformation can be applied to reduce different sources of overhead.
• Rough classification:
  • Loop reordering transformations: they change the relative order of execution of the iterations of a loop nest.
  • Loop restructuring transformations: they change the structure of the loop, but the relative order of executing the iterations remains unchanged.

  17. Loop Transformations
Example of a loop restructuring transformation:
  DO I=1,1000
    A(I)=(A(I-1)+A(I+1))/2
  ENDDO
becomes
  DO I=1,1000,2
    A(I)=(A(I-1)+A(I+1))/2
    A(I+1)=(A(I)+A(I+2))/2
  ENDDO
The above transformation is called loop unrolling (it increases instruction-level parallelism and may improve the use of the cache memory).
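One caveat worth noting (added here, not on the slide): the unrolled version above relies on the trip count (1000) being a multiple of the unroll factor (2). For a general N, a clean-up loop is needed; a minimal sketch with the same loop body:
  DO I=1,N-MOD(N,2),2       ! main unrolled loop over an even trip count
    A(I)=(A(I-1)+A(I+1))/2
    A(I+1)=(A(I)+A(I+2))/2
  ENDDO
  DO I=N-MOD(N,2)+1,N       ! remainder loop (at most one iteration here)
    A(I)=(A(I-1)+A(I+1))/2
  ENDDO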

  18. Loop Interchange
Exchanges the position of two loops in a perfect loop nest:
  DO I = 1, 8
    DO J = 1, 8
      A(I,J) = 0
    ENDDO
  ENDDO
becomes
  DO J = 1, 8
    DO I = 1, 8
      A(I,J) = 0
    ENDDO
  ENDDO

  19. Loop Interchange
Useful for increasing unit-stride access (FORTRAN stores arrays in column-major order); the latter code runs faster on a machine with a memory hierarchy. Also useful for enabling parallelism, but it is not always legal. (Why? Think of an example.)
  DO J = 1, 8
    DO I = 1, 8
      A(I,J) = (A(I+1,J-1)+A(I-1,J+1))/2
    ENDDO
  ENDDO

  20. Loop Skewing
Changes the bounds of a loop by adding a value to both the upper and the lower bounds, at the same time subtracting the same quantity from every appearance of the loop index in the loop body; always legal.
  DO I = 1, 8
    DO J = 1, 8
      A(I,J) = 0
    ENDDO
  ENDDO
becomes
  DO I = 1, 8
    DO J = 1+I, 8+I
      A(I,J-I) = 0
    ENDDO
  ENDDO
(As J runs from 1+I to 8+I, the subscript J-I still runs from 1 to 8, so exactly the same elements are assigned.) Skewing may enable parallelism when used along with loop interchange.

  21. Loop Skewing
  DO I = 1, N
    DO J = 1, M
      A(I,J) = 0.25*(A(I,J-1)+A(I-1,J)+A(I+1,J)+A(I,J+1))
    ENDDO
  ENDDO
After skewing (adding I-1 to the bounds):
  DO I = 1, N
    DO J = 1+I-1, M+I-1
      A(I,J-I+1) = …………
    ENDDO
  ENDDO

  22. Loop Skewing
  DO I = 1, N
    DO J = 1, M
      A(I,J) = 0.25*(A(I,J-1)+A(I-1,J)+A(I+1,J)+A(I,J+1))
    ENDDO
  ENDDO
After skewing and interchange (K enumerates the anti-diagonal wavefronts; for a fixed K, the J iterations are independent):
  DO K = 1, N
    DO J = 1, K
      I = K-J+1
      A(I,J) = 0.25*(A(I,J-1)+A(I-1,J)+A(I+1,J)+A(I,J+1))
    ENDDO
  ENDDO
followed by a second nest covering the remaining wavefronts (body elided on the original slide):
  DO K = 1, N
    DO J = K, N
      ...
    ENDDO
  ENDDO

  23. Index set splitting
Transforms a single loop nest into multiple adjacent loop nests where each loop performs a subset of the original iterations. It may be useful in removing conditionals (and data dependences). Example:
  DO I=1,100
    C(I)=I
    IF (I.GT.75) THEN
      A(I)=I-75
    ELSE
      A(I)=I
    ENDIF
  ENDDO
becomes
  DO I=1,75
    C(I)=I
    A(I)=I
  ENDDO
  DO I=76,100
    C(I)=I
    A(I)=I-75
  ENDDO

  24. Index set splitting
Homework Problem: Remove the conditional:
  DO I=1,100
    DO J=1,100
      A(J,I)=100*J+I
      IF (I.GT.J) THEN
        B(J,I)=0
      ELSE
        B(J,I)=I+J
      ENDIF
    ENDDO
  ENDDO

  25. Loop coalescing
Combines multiple loops of a perfect loop nest into a single loop; e.g.:
  DO I=1,8
    DO J=1,8
      A(I,J)=0
    ENDDO
  ENDDO
becomes
  DO K=1,64
    I=INT((K-1)/8)+1
    J=MOD(K-1,8)+1
    A(I,J)=0
  ENDDO
It may help loop partitioning and scheduling, but it may incur an additional computational overhead.
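A quick way to convince oneself that the index-recovery formulas are correct (a small added sketch): the coalesced loop below prints (K, I, J) and reproduces exactly the pairs (I,J) of the original nest, in the original order. Note that FORTRAN integer division already truncates, so the INT is redundant but harmless here.
  DO K=1,64
    I=INT((K-1)/8)+1    ! K=1..8 -> I=1, K=9..16 -> I=2, ..., K=57..64 -> I=8
    J=MOD(K-1,8)+1      ! cycles through 1..8 within each block of 8
    PRINT *, K, I, J
  ENDDO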

  26. Loop coalescing
Homework Problem: Apply loop coalescing to the following loop nest:
  DO I=1,N
    DO J=1,I
      ...
    ENDDO
  ENDDO
(Assume that N>>1.)
(Hint: you may also need the function CEILING(x), which returns the least integer greater than or equal to x.)

  27. Loop Tiling
• Transforms an n-deep loop nest into a 2n-deep loop nest.
• Combined with loop interchange, the iterations are performed over blocks of the original iteration space which can be fully accommodated in the cache.
• Considerably improves the performance of certain matrix operations, such as matrix multiplication.
• The resulting code can perform significantly faster assuming a ‘good’ choice of tile size, i.e., SJ, SI on the next slide.

  28. Loop Tiling
Example (Matrix Multiply):
  DO I=1,N
    DO J=1,N
      DO K=1,N
        A(I,J)=A(I,J)+B(I,K)*C(K,J)
      ENDDO
    ENDDO
  ENDDO
becomes (tiling the I and J loops with tile sizes SI, SJ)
  DO JJ=1,N,SJ
    DO II=1,N,SI
      DO J=JJ,JJ+SJ-1
        DO K=1,N
          DO I=II,II+SI-1
            A(I,J)=A(I,J)+B(I,K)*C(K,J)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
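One practical detail (added here, not on the slide): if N is not a multiple of SI or SJ, the tile bounds must be clamped with MIN so the last, partial tiles stay inside the arrays; a sketch with the same loop order as above:
  DO JJ=1,N,SJ
    DO II=1,N,SI
      DO J=JJ,MIN(JJ+SJ-1,N)
        DO K=1,N
          DO I=II,MIN(II+SI-1,N)
            A(I,J)=A(I,J)+B(I,K)*C(K,J)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO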

  29. Conclusions
• Parallelising compilers attempt to parallelise sequential programs automatically.
• Loops and their inherent parallelism are the main focus of interest.
• A number of data dependence tests, transformations, and mapping schemes are used for this purpose.
• The main aim is to increase performance, that is, to reduce the overheads associated with parallel execution.
• Program semantics must be preserved.
• Although there is some pessimism about the ultimate success of parallelising compilers, there is much research in the area internationally.

  30. Further Reading
Books:
• U. Banerjee, Loop Transformations for Restructuring Compilers: The Foundations, Kluwer Academic Publishers, 1993.
• M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, 1996.
• M. Wolfe, Optimizing Supercompilers for Supercomputers, Research Monographs in Parallel and Distributed Computing Series, MIT Press, 1989.
Articles:
• D. F. Bacon, S. L. Graham, O. J. Sharp, Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 26(4), Dec. 1994, pp. 345-420.
• U. Banerjee, R. Eigenmann, A. Nicolau, D. Padua, Automatic Program Parallelization, Proceedings of the IEEE, 81(2), Feb. 1993, pp. 211-243.
Theses:
• R. Sakellariou, On the Quest for Perfect Load Balance in Loop-Based Parallel Computations, PhD Thesis, Department of Computer Science, University of Manchester, 1996.
