
DEPENDENCE-DRIVEN LOOP MANIPULATION


Presentation Transcript


  1. DEPENDENCE-DRIVEN LOOP MANIPULATION Based on notes by David Padua, University of Illinois at Urbana-Champaign

  2. DEPENDENCES

  3. DEPENDENCES (continued)

  4. DEPENDENCE AND PARALLELIZATION (SPREADING)

  5. OpenMP Implementation

  6. RENAMING: To Remove Memory-Related Dependences

  7. DEPENDENCES IN LOOPS

  8. DEPENDENCES IN LOOPS (Cont.)

  9. DEPENDENCE ANALYSIS

  10. DEPENDENCE ANALYSIS (continued)

  11. LOOP PARALLELIZATION AND VECTORIZATION • A loop whose dependence graph is cycle-free can be parallelized or vectorized. The reason is that if there are no cycles in the dependence graph, then there will be no races in the parallel loop.
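
  A minimal C/OpenMP sketch of this rule (not from the slides; the loop bodies are illustrative): the first loop's dependence graph has a cycle, the second's does not.

      /* the recurrence's dependence graph has a cycle (a[i] depends on
         a[i-1]), so that loop must stay sequential; the second loop's
         graph is cycle-free, so parallel execution has no races */
      void cycle_vs_cycle_free(int n, double *a, double *b, const double *c)
      {
          for (int i = 1; i < n; i++)
              a[i] = a[i - 1] + c[i];   /* cyclic: sequential */

          #pragma omp parallel for      /* compile with -fopenmp */
          for (int i = 0; i < n; i++)
              b[i] = c[i] * c[i];       /* cycle-free: parallel is safe */
      }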

  12. ALGORITHM REPLACEMENT • Some patterns occur frequently in programs. They can be recognized and replaced with an equivalent parallel algorithm.
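
  A hedged example of such a replacement (my illustration, not the slide's): a sequential sum is a recognizable pattern that can be swapped for a parallel reduction.

      /* sequential summation pattern replaced by an OpenMP reduction */
      double sum_reduce(int n, const double *x)
      {
          double s = 0.0;
          #pragma omp parallel for reduction(+:s)
          for (int i = 0; i < n; i++)
              s += x[i];
          return s;
      }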

  13. LOOP DISTRIBUTION • To isolate these patterns, we can decompose loops into several loops, one for each strongly-connected component (π-block) in the dependence graph.
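
  A minimal sketch (illustrative code, not the slide's): the two statements fall into different π-blocks, so the loop is split and each piece is handled on its own.

      /* S1 forms a pi-block containing a cycle (a recurrence); S2 forms
         a cycle-free pi-block, so after distribution only the second
         loop is parallelized */
      void distribute(int n, double *a, double *b, const double *c)
      {
          /* before: for(i){ a[i] = a[i-1] + c[i];  b[i] = c[i] * 2.0; } */
          for (int i = 1; i < n; i++)
              a[i] = a[i - 1] + c[i];   /* pi-block 1: sequential */

          #pragma omp parallel for
          for (int i = 1; i < n; i++)
              b[i] = c[i] * 2.0;        /* pi-block 2: parallel */
      }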

  14. LOOP DISTRIBUTION (continued)

  15. LOOP INTERCHANGING • The dependence information determines whether or not the loop headers can be interchanged. • The headers of the following loop can be interchanged
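
  The slide's loop does not survive in the transcript; a hedged stand-in for a legally interchangeable nest (my example): every iteration touches a distinct element, so no dependence constrains the header order.

      /* the only dependences have distance (0,0), so the i and j
         headers may be swapped freely */
      void interchangeable(int n, double a[n][n], double b[n][n])
      {
          for (int j = 0; j < n; j++)       /* headers written j-i here; */
              for (int i = 0; i < n; i++)   /* i-j is equally correct */
                  b[i][j] = a[i][j] + 1.0;
      }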

  16. LOOP INTERCHANGING (continued) • The headers of the following loop cannot be interchanged
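
  Again the slide's code is missing; a hedged stand-in (my example) with dependence distance (1,-1), which interchange would reverse into the illegal distance (-1,1):

      /* iteration (i,j) reads the value written by iteration (i-1,j+1);
         with j outermost, the reader would run before the writer */
      void not_interchangeable(int n, double a[n][n])
      {
          for (int i = 1; i < n; i++)
              for (int j = 0; j < n - 1; j++)
                  a[i][j] = a[i - 1][j + 1] + 1.0;
      }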

  17. DEPENDENCE REMOVAL: Scalar Expansion • Some cycles in the dependence graph can be eliminated by using elementary transformations.
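
  A minimal sketch of scalar expansion on assumed code (mine, not the slide's): the scalar t creates anti- and output dependences between iterations; expanding it into an array breaks that cycle.

      /* before: double t; for(i){ t = a[i] + b[i];  c[i] = t * t; } */
      void scalar_expansion(int n, const double *a, const double *b,
                            double *c, double *tx /* scratch, length n */)
      {
          #pragma omp parallel for
          for (int i = 0; i < n; i++) {
              tx[i] = a[i] + b[i];   /* each iteration owns tx[i] */
              c[i]  = tx[i] * tx[i];
          }
      }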

  18. Induction variable recognition • Induction variable: a variable that is increased or decreased by a fixed amount on every iteration of a loop.
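
  A hedged illustration (my code): once j is recognized as an induction variable, it can be replaced by a closed-form function of the loop index, removing the cross-iteration dependence it carried.

      /* before: int j = 0; for(i){ j = j + 2;  a[j] = b[i]; } */
      void induction(int n, double *a /* length >= 2*n + 1 */,
                     const double *b)
      {
          #pragma omp parallel for
          for (int i = 0; i < n; i++)
              a[2 * (i + 1)] = b[i];   /* j == 2*(i+1) on iteration i */
      }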

  19. More about the DO to PARALLEL DO transformation • When the dependence graph inside a loop has no cross-iteration dependences, it can be transformed into a PARALLEL loop.

      do i=1,n
        S1: a(i) = b(i) + c(i)
        S2: d(i) = x(i) + 1
      end do

      do i=1,n
        S1: a(i) = b(i) + c(i)
        S2: d(i) = a(i) + 1
      end do
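
  Both loops qualify: in the second, S2 reads a(i) produced by S1 in the same iteration, a loop-independent dependence rather than a cross-iteration one. A hedged rendering of the transformed second loop in C/OpenMP:

      /* no cross-iteration dependences, so the DO becomes a parallel
         loop (PARALLEL DO in Fortran, parallel for here) */
      void do_to_parallel_do(int n, double *a, const double *b,
                             const double *c, double *d)
      {
          #pragma omp parallel for
          for (int i = 0; i < n; i++) {
              a[i] = b[i] + c[i];   /* S1 */
              d[i] = a[i] + 1.0;    /* S2: same-iteration use of S1 */
          }
      }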

  20. Loop Alignment • When there are cross-iteration dependences but no cycles, do loops can be aligned and then transformed into parallel loops (DOALLs)
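
  A minimal sketch on assumed code (mine): shifting the consumer statement by one iteration lines it up with its producer, leaving no cross-iteration dependence inside the loop body.

      /* before: for(i=1..n-1){ a[i] = b[i];  c[i] = a[i-1]; }
         after alignment (assumes n >= 2), boundary iterations peeled */
      void aligned(int n, double *a, const double *b, double *c)
      {
          c[1] = a[0];                      /* peeled first consumer */
          #pragma omp parallel for
          for (int i = 1; i < n - 1; i++) {
              a[i] = b[i];                  /* S1 */
              c[i + 1] = a[i];              /* S2 shifted by one iteration */
          }
          a[n - 1] = b[n - 1];              /* peeled last producer */
      }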

  21. Loop Distribution • Another method for eliminating cross-iteration dependences
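
  The same assumed loop handled by distribution instead (my sketch): the cross-iteration dependence becomes a dependence between two loops, each of which is then a DOALL.

      /* before: for(i=1..n-1){ a[i] = b[i];  c[i] = a[i-1]; } */
      void distributed(int n, double *a, const double *b, double *c)
      {
          #pragma omp parallel for
          for (int i = 1; i < n; i++)
              a[i] = b[i];          /* all producers run first */

          #pragma omp parallel for
          for (int i = 1; i < n; i++)
              c[i] = a[i - 1];      /* consumers see a fully written a[] */
      }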

  22. Loop Coalescing for DOALL loops • Consider a perfectly nested DOALL loop. It can be trivially transformed into a singly nested loop with a tuple of variables as its index. This coalescing transformation is convenient for scheduling and can reduce the overhead involved in starting DOALL loops.
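
  A hedged sketch (my code): the two loop indices are recovered from a single flat index, so the n*m independent iterations can be scheduled as one pool. OpenMP's collapse clause performs essentially this transformation.

      /* before: for(i) for(j) a[i][j] = b[i][j] + 1.0; */
      void coalesced(int n, int m, double a[n][m], const double b[n][m])
      {
          #pragma omp parallel for
          for (int k = 0; k < n * m; k++) {
              int i = k / m;        /* recover the (i, j) index tuple */
              int j = k % m;
              a[i][j] = b[i][j] + 1.0;
          }
      }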

  23. Extra Slides

  24. Why Loop Interchange: Matrix Multiplication Example • A classic example for locality-aware programming is matrix multiplication:

      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
          for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];

  • There are 6 possible orders for the three loops: i-j-k, i-k-j, j-i-k, j-k-i, k-i-j, k-j-i • Each order corresponds to a different access pattern for the matrices • Let's focus on the inner loop, as it is the one that's executed most often

  25. Inner Loop Memory Accesses • Each matrix element can be accessed in three modes in the inner loop: • Constant: doesn't depend on the inner loop's index • Sequential: contiguous addresses • Strided: non-contiguous addresses (N elements apart)

      c[i][j] += a[i][k] * b[k][j];

                c[i][j]      a[i][k]      b[k][j]
      i-j-k:    Constant     Sequential   Strided
      i-k-j:    Sequential   Constant     Sequential
      j-i-k:    Constant     Sequential   Strided
      j-k-i:    Strided      Strided      Constant
      k-i-j:    Sequential   Constant     Sequential
      k-j-i:    Strided      Strided      Constant

  26. Loop Order and Performance • Constant access is better than sequential access: a value that is constant in the loop can be kept in a register. • Sequential access is better than strided access: contiguous addresses utilize the cache better. • With this ranking in mind, let's go back to the previous slide.

  27. Best Loop Ordering?

      c[i][j] += a[i][k] * b[k][j];

                c[i][j]      a[i][k]      b[k][j]
      i-j-k:    Constant     Sequential   Strided
      i-k-j:    Sequential   Constant     Sequential
      j-i-k:    Constant     Sequential   Strided
      j-k-i:    Strided      Strided      Constant
      k-i-j:    Sequential   Constant     Sequential
      k-j-i:    Strided      Strided      Constant

  • k-i-j and i-k-j should have the best performance (no strided access) • i-j-k and j-i-k should be worse (1 strided access) • j-k-i and k-j-i should be the worst (2 strided accesses)
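
  A minimal sketch of one of the two predicted-fastest orders, i-k-j (my code, illustrating the slide's conclusion): the innermost loop sees only constant and sequential accesses.

      /* i-k-j order: a[i][k] is constant in the inner loop and can sit
         in a register; c[i][j] and b[k][j] are walked sequentially */
      void matmul_ikj(int N, double c[N][N],
                      const double a[N][N], const double b[N][N])
      {
          for (int i = 0; i < N; i++)
              for (int k = 0; k < N; k++) {
                  double aik = a[i][k];
                  for (int j = 0; j < N; j++)
                      c[i][j] += aik * b[k][j];
              }
      }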
