
Recurrence Chain Partitioning of Non-Uniform Dependences


Presentation Transcript


  1. Recurrence Chain Partitioning of Non-Uniform Dependences. Yijun Yu, Erik H. D’Hollander. Aug 15-18, 2004, Montreal, Canada

  2. Overview
     • Dependence and Parallelism
     • Non-Uniform Loop Dependences
     • Recurrence Chains Partitioning
     • Related work
     • Implementations
     • Experiment Results
     • Summary

  3. 1. Background: Dependence vs. Parallelism
     Sequential program:
         DO I = 1,3
           A(I) = A(I-1)
         ENDDO
     Shared memory execution trace: A(1) = A(0); A(2) = A(1); A(3) = A(2)
     Parallel program:
         DOALL I = 1,3
           A(I) = A(I-1)
         ENDDOALL
     Shared memory execution trace (one possible order): A(2) = A(1); A(1) = A(0); A(3) = A(2)
     [Figure: contents of A after each step of the two traces]
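
The two traces on this slide can be replayed in a few lines of Python. This is only an illustrative sketch: the initial contents of `A` (0, 1, 2, 3) and the helper name `run` are assumptions, not taken from the slide.

```python
def run(order, memory):
    """Execute A(I) = A(I-1) for each I in the given order, on a copy of memory."""
    a = list(memory)
    for i in order:
        a[i] = a[i - 1]
    return a

initial = [0, 1, 2, 3]                 # assumed initial contents of A(0..3)
sequential = run([1, 2, 3], initial)   # DO loop order
interleaved = run([2, 1, 3], initial)  # the DOALL trace from the slide

print(sequential)   # [0, 0, 0, 0]
print(interleaved)  # [0, 0, 1, 1]
```

The differing results show why the loop-carried dependence forbids running this loop as a plain DOALL.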

  4. The CFD application @ WTCM
     • Computational Fluid Dynamics (CFD): Navier-Stokes equations
     • Successive Over-Relaxation (SOR)
     • 3D geometry + 1D time; temperature

  5. The visualized uniform dependences and transformations for the 4D loop
     [Figure panels: Before transformation; After transformation]
     A 3-D unimodular transformation was found after visualizing the 4D loop nest, which has 177 array references at run time in each iteration. A regular shape is used here. The transformation makes it possible to speed up the program by a factor of about $N^2/6$, where N is the diameter of the geometry (Yu, Parco99).

  6. 2. Non-uniform dependences
     • Uniform loop dependences
       • Dependent iterations lie at a uniform distance from each other in the iteration space: a set of distance vectors predicts the dependences and indicates the affine index loop transformation that reveals the maximal loop parallelism.
     • Non-uniform dependences
       • Irregular; can be caused by complex subscripts, compile-time unknowns, etc.
       • But not rare: in the SPECfp95 benchmarks, 46% of nested loops and 12.8% of the coupled subscripts
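
To make the distinction concrete, the sketch below enumerates dependence distances by brute force for two one-dimensional loops. Both loops (`A(I) = A(I-1)` and `A(2*I) = A(I)`) and the helper `dependence_distances` are hypothetical illustrations, not examples from the talk.

```python
def dependence_distances(write_index, read_index, n):
    """Distances j - i over iterations 1..n where iteration i writes
    the element that a different iteration j reads."""
    writers = {}
    for i in range(1, n + 1):
        writers.setdefault(write_index(i), []).append(i)
    distances = set()
    for j in range(1, n + 1):
        for i in writers.get(read_index(j), []):
            if i != j:
                distances.add(j - i)
    return distances

# A(I) = A(I-1): every dependence has distance 1 (uniform)
uniform = dependence_distances(lambda i: i, lambda i: i - 1, 8)

# A(2*I) = A(I): the distance equals the source iteration (non-uniform)
non_uniform = dependence_distances(lambda i: 2 * i, lambda i: i, 8)

print(uniform)      # {1}
print(non_uniform)  # {1, 2, 3, 4}
```

A single distance vector describes the first loop completely; no finite, size-independent set of distance vectors describes the second.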

  7. Non-uniform dependences: Tip of the iceberg

  8. Irregular dependence
     • Dependences have non-uniform distances
     • Parallelism analysis: 200 iterations over 15 data flow steps
     • Speedup: 200/15 ≈ 13.3
     Problem: how to exploit it?
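
The speedup figure is the iteration count divided by the number of data flow steps, that is, the length of the longest dependence chain. A minimal sketch of that analysis follows; the function name `dataflow_steps` and the six-iteration dependence list are made up for illustration, and the sketch assumes every dependence points from an earlier to a later iteration.

```python
def dataflow_steps(num_iterations, deps):
    """Number of data flow steps for iterations 0..num_iterations-1.

    deps is a list of (src, dst) pairs with src < dst, meaning that
    iteration src must finish before iteration dst starts.
    """
    preds = {}
    for src, dst in deps:
        if not src < dst:
            raise ValueError("dependences must point forward in iteration order")
        preds.setdefault(dst, []).append(src)
    level = []  # level[i] = earliest step in which iteration i can run
    for it in range(num_iterations):
        level.append(1 + max((level[p] for p in preds.get(it, [])), default=0))
    return max(level, default=0)

# Hypothetical example: 6 iterations, chains 0 -> 2 -> 5 and 1 -> 3
steps = dataflow_steps(6, [(0, 2), (2, 5), (1, 3)])
speedup = 6 / steps
print(steps, speedup)          # 3 2.0

# The figure quoted on the slide: 200 iterations over 15 steps
print(round(200 / 15, 1))      # 13.3
```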

  9. 3. Recurrence Chain Partitioning: Research objectives
     If DO loops fail to reveal the optimal parallelism for irregular dependences, can one use WHILE loops?
     • WHEN can one apply WHILE loops?
     • HOW to construct WHILE loops?
     • WHAT to do when one cannot apply WHILE loops?
     • HOW MUCH can be achieved, for evaluation purposes?

  10. 3.1 How to generate code?
      DOALL I = INIT(I)
        WHILE !TERMINATE(I) DO
          S(I)
          I = NEXT(I)
        ENDDO
      ENDDOALL
      • INIT(I) = ?
      • TERMINATE(I) = ?
      • NEXT(I) = ?
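
The template can be tried out on a toy loop. The sketch below is an assumption-laden illustration, not the authors' generated code: the loop `A(2*I) = A(I) + 1`, the bound `N = 8`, and the helper `run_chains` are all hypothetical. For this loop the chain starts are the odd iterations, the update is `I = 2*I`, and a chain terminates once `I > N`.

```python
def run_chains(init_set, terminate, next_iter, body):
    """One WHILE loop per chain start. The chains are independent,
    so the outer loop is the one that could run as a DOALL."""
    for i in init_set:               # DOALL I = INIT(I)
        while not terminate(i):      # WHILE !TERMINATE(I)
            body(i)                  # S(I)
            i = next_iter(i)         # I = NEXT(I)

N = 8  # assumed loop bound

# Reference: sequential DO I = 1,N: A(2*I) = A(I) + 1
seq = list(range(2 * N + 1))
for i in range(1, N + 1):
    seq[2 * i] = seq[i] + 1

# Chain version; chain starts are visited in reverse to show that
# the order of the outer loop does not matter.
par = list(range(2 * N + 1))
visited = []

def body(i):
    visited.append(i)
    par[2 * i] = par[i] + 1

run_chains(
    init_set=[i for i in range(N, 0, -1) if i % 2 == 1],
    terminate=lambda i: i > N,
    next_iter=lambda i: 2 * i,
    body=body,
)

print(par == seq)        # True
print(sorted(visited))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every iteration is executed exactly once, and each chain (1, 2, 4, 8; 3, 6; 5; 7) preserves the order of its dependent iterations.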

  11. 3.2 Solving recurrence equations in the unified iteration space
      • Dependence equations: $iA + a = jB + b$
      • Recurrence equations: $j = iT + t$, or $i = (j - t)T^{-1} = jT^{-1} - tT^{-1}$
        • $T = AB^{-1}$
        • $t = (a - b)B^{-1}$
      • A recurrence chain is a sequence of dependent iterations such that
        • $i_{k+1} = i_k T + t$, or $i_{k+1} = (i_k - t)T^{-1}$
        • $i_0 = \{\, i \mid \nexists\, j : iA + a = jB + b \ \text{or}\ iB + b = jA + a \,\}$
      • The dependence distance $d_k = i_{k+1} - i_k$ is variable:
        • $d_{k+1} = d_k T$, or $d_k = d_{k+1} T^{-1}$
      • $d$ is not constant but varies exponentially with base $\alpha = \max(1/|T|, |T|)$, so the dependence chain length is $O(\log_\alpha L)$, where $L$ is the diameter of the iteration space
      • When $|T|$ is negative, one can cut the recurrence chain to 2 iterations by lexicographical ordering
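
As a worked instance of $T = AB^{-1}$ and $t = (a - b)B^{-1}$, the sketch below uses the loop `a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1)` from the results section. It assumes the row-vector convention $i = (I_1, I_2)$ with the written reference as $iA + a$ and the read reference as $jB + b$; the helper names are hypothetical and only 2x2 matrices are handled.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix given as nested lists."""
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]

def vec_mat(v, m):
    """Row vector times 2x2 matrix."""
    return [v[0] * m[0][0] + v[1] * m[1][0],
            v[0] * m[0][1] + v[1] * m[1][1]]

def recurrence(A, a, B, b):
    """Return (T, t) with T = A B^-1 and t = (a - b) B^-1."""
    B_inv = inverse_2x2(B)
    T = [vec_mat(row, B_inv) for row in A]
    t = vec_mat([a[0] - b[0], a[1] - b[1]], B_inv)
    return T, t

# Write reference a(3*I1+1, 2*I1+I2-1): i A + a
A = [[3, 2], [0, 1]]
a = [1, -1]
# Read reference a(I1+3, I2+1): j B + b
B = [[1, 0], [0, 1]]
b = [3, 1]

T, t = recurrence(A, a, B, b)

def next_iteration(i):
    """j = i T + t: the iteration that reads what iteration i wrote."""
    step = vec_mat(i, T)
    return [step[0] + t[0], step[1] + t[1]]

print(T == [[3, 2], [0, 1]], t == [-2, -2])  # True True
print(next_iteration([2, 1]) == [4, 3])      # True
```

Iteration (2, 1) writes `a(7, 4)`, and iteration (4, 3) reads `a(4+3, 3+1) = a(7, 4)`, which agrees with the computed successor.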

  12. 3.3 Generate code?
      DOALL I = i0
        WHILE (I is in Iteration Space) DO
          S(I)
          I = IT+t or I = (I-t)T^-1
        ENDDO
      ENDDOALL
      • Problem: how to tell which index update respects the dependence order?

  13. [Figure: iteration space with axes I1 and I2, showing the initial set, the intermediate set, the final set and the independent iterations, with example recurrence chains R1 to R4 (points i0, i1, ...) whose next point is integer or non-integer, and one cyclic chain]

  14. 3.3 Generate code!
      DOALL I in P1
        IF (IT+t < I)
          T = T^-1; t = -tT   (t uses the updated T, so that I = IT+t now computes (I-t)T^-1)
        ENDIF
        WHILE (I is in Iteration Space) DO
          S(I)
          I = IT+t
        ENDDO
      ENDDOALL
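
The following sketch walks recurrence chains for the loop `a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1)` and checks the result against sequential execution. It is a sketch under assumptions rather than the generated code: the bounds `N1 = N2 = 10`, the default array contents, and all helper names are hypothetical; `T` and `t` are taken as `((3, 2), (0, 1))` and `(-2, -2)` (row-vector convention); and chains are assumed to be lexicographically monotone, which holds for this loop.

```python
from fractions import Fraction

N1, N2 = 10, 10            # assumed loop bounds
T = ((3, 2), (0, 1))       # T = A B^-1 for this loop (B is the identity)
t = (-2, -2)               # t = (a - b) B^-1

def in_space(i):
    return 1 <= i[0] <= N1 and 1 <= i[1] <= N2

def forward(i):
    """I*T + t"""
    return (i[0] * T[0][0] + i[1] * T[1][0] + t[0],
            i[0] * T[0][1] + i[1] * T[1][1] + t[1])

def backward(j):
    """(J - t)*T^-1, or None when the result is not an integer point."""
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    d0, d1 = j[0] - t[0], j[1] - t[1]
    x = Fraction(d0 * T[1][1] - d1 * T[1][0], det)
    y = Fraction(d1 * T[0][0] - d0 * T[0][1], det)
    if x.denominator != 1 or y.denominator != 1:
        return None
    return (int(x), int(y))

def is_start(i):
    """Member of P1: no lexicographically earlier neighbour in the space."""
    return not any(n is not None and n < i and in_space(n)
                   for n in (forward(i), backward(i)))

def chain_from(start):
    # The IF test: pick the update that moves lexicographically forward.
    step = forward if forward(start) > start else backward
    i = start
    while i is not None and in_space(i):
        yield i
        nxt = step(i)
        if nxt == i:       # fixed point: the iteration depends only on itself
            break
        i = nxt

def body(mem, i):
    # S(I): a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1); unwritten cells get a default
    w = (3 * i[0] + 1, 2 * i[0] + i[1] - 1)
    r = (i[0] + 3, i[1] + 1)
    mem[w] = mem.get(r, 100 * r[0] + r[1])

space = [(i1, i2) for i1 in range(1, N1 + 1) for i2 in range(1, N2 + 1)]

sequential = {}
for i in space:
    body(sequential, i)

chained, visited = {}, []
starts = [i for i in space if is_start(i)]
for start in reversed(starts):          # any order of chains is fine
    for i in chain_from(start):
        visited.append(i)
        body(chained, i)

longest = max(len(list(chain_from(s))) for s in starts)
print(chained == sequential, sorted(visited) == space, longest)
```

With these bounds the longest chain has three iterations, for example (2, 1), (4, 3), (10, 9), and the chains can be processed in any order.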

  15. 4. Related work: Strength of REC (1): Scalability
      • LEN = length of the chain
      • In comparison, unique-set oriented methods have to deal with LEN = 2, 3, … differently
      • In REC, the WHILE loops adjust their steps automatically

  16. 4. Related work: Strength of REC (2): Outermost loop parallelism
      • Set-oriented:
          DOALL I in P1
            S(I)
          DOALL I in P2
            S(I)
          …
          DOALL I in Pn
            S(I)
      • Recurrence chain:
          DOALL I in P1
            IF (I > IT+t)
              T = T^-1; t = -tT
            ENDIF
            WHILE (I in IS) DO
              S(I)
              I = IT+t
            ENDDO
          ENDDOALL
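
The contrast can be illustrated by regrouping a set of chains into level sets P1, P2, …, where Pk holds the k-th iteration of every chain. The chains used here are hypothetical (they belong to a toy loop whose dependences run from I to 2*I), and `sets_from_chains` is an assumed helper name.

```python
from itertools import zip_longest

def sets_from_chains(chains):
    """P_k = the k-th iteration of every chain that is long enough."""
    return [[i for i in level if i is not None]
            for level in zip_longest(*chains)]

chains = [[1, 2, 4, 8], [3, 6], [5], [7]]   # hypothetical recurrence chains
sets = sets_from_chains(chains)

print(sets)            # [[1, 3, 5, 7], [2, 6], [4], [8]]
print(len(sets) - 1)   # 3 barriers between the DOALL loops
```

The set-oriented form needs one DOALL per set with a synchronization point between consecutive sets, and the number of sets grows with the longest chain. The chain form needs a single outer DOALL over the four chain starts.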

  17. 4. Related work: Shortcoming and alternatives
      • Restriction on the number of dependence equations
      • Fall back to the following algorithms:
        • A recursive 3-sets partitioning (3P), similar to unique-sets partitioning but more accurate: it can reuse the calculations for P1, P2, P3.
        • PDM and other uniformization techniques. PDM is lightweight and can be applied first, followed by 3P.

  18. Loop Partitioning: goal model [figure]

  19. REC [goal model figure; labels: sat, den, fully, partly]

  20. 3Region [goal model figure; labels: sat, den, fully, partly]

  21. PDM [goal model figure; labels: sat, den, fully, partly]

  22. 4. Implementations
      Front end: source-to-source transformations
      • PDM/PL in FPT
      • Set-oriented algorithms in FPT <-> XML/XSLT <-> OC
      Back end
      • Intel Fortran compiler + OpenMP directives
      Experiments on an EPICMP 4-CPU server

  23. 5. Results: 5.1 Yu, ICPP00
      DO I1=1,N1
        DO I2=1,N2
          a(3*I1+1,2*I1+I2-1) = a(I1+3,I2+1)
        ENDDO
      ENDDO

  24. 5.1 Non-full-rank PDM [figure; axis labels: j1, j2, i2]

  26. 5.2 Ju, 1997’s example
      DO I=1,N
        DO J=1,N
          a(2*I+3,J+1) = …
          … = a(I+2*J+1,I+J+3)
        ENDDO
      ENDDO
      det(PDM) = 2
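
As a worked instance, the recurrence for this loop can be derived from the definitions $T = AB^{-1}$ and $t = (a - b)B^{-1}$. The derivation assumes the row-vector convention $i = (I, J)$, with the written reference as $iA + a$ and the read reference as $jB + b$; it is an illustration and does not appear on the slides.

```latex
\[
A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix},\quad a = (3,\ 1),\qquad
B = \begin{pmatrix} 1 & 1 \\ 2 & 1 \end{pmatrix},\quad b = (1,\ 3)
\]
\[
B^{-1} = \begin{pmatrix} -1 & 1 \\ 2 & -1 \end{pmatrix},\qquad
T = AB^{-1} = \begin{pmatrix} -2 & 2 \\ 2 & -1 \end{pmatrix},\qquad
t = (a - b)B^{-1} = (2,\ -2)\,B^{-1} = (-6,\ 4)
\]
\[
j = iT + t = (-2I + 2J - 6,\ \ 2I - J + 4),\qquad
\det T = (-2)(-1) - (2)(2) = -2
\]
```

Substituting $j$ into the read reference gives $(j_1 + 2j_2 + 1,\ j_1 + j_2 + 3) = (2I + 3,\ J + 1)$, which is the written element, so the recurrence is consistent. If $|T|$ in section 3.2 denotes the determinant (an assumption), this example falls into the negative case.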

  27. UNIQUE vs REC partitioning [figure; labels: 13, 2]

  28. Ju’s Example: Comparison
      • We corrected the loop bounds flaw in Ju’s 97 paper; 5 unique sets were derived for this case when N = 12.
      • But theoretically O(2^(log2 N)) = O(N) UNIQUE sets are needed
      • In REC partitioning, just one set P1 needs to be calculated for the initial i0

  30. 5.3 Chen, 96’s Example
      DO I=1,N
        DO J=1,I
          DO K=J,I
            ... = a(I+2*K+5,4*K-J)
          ENDDO
          a(I-J,I+J) = ...
        ENDDO
      ENDDO

  31. Chen’s Example: A special case
      • It is a non-perfectly nested loop
      • First convert it into the unified iteration space
      • Then symbolically calculate P1, P2, P3 and find that P2 is empty
      • Therefore the recurrence chains are at most 1 iteration long, regardless of the loop bounds
      • Both REC and three-region partitioning lead to the same optimal solution

  33. 5.4 Cholesky kernel: Loop Fusion (I,K,J,L)
            DO 1 I = 0,NRHS
            DO 1 K = 0,2*N+1
              IF (K.LE.N) THEN
                I0 = MIN(M,N-K)
              ELSE
                I0 = MIN(M,2*N-K+1)
              ENDIF
              DO 1 J = 0,I0
      C$DOISV
              DO 1 L = 0,NMAT
                IF (K.LE.N) THEN
                  IF (J.EQ.0) THEN
          8         B(I,L,K)=B(I,L,K)*A(L,0,K)
                  ELSE
          7         B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
                  ENDIF
                ELSE
                  IF (J.EQ.0) THEN
          9         B(I,L,K)=B(I,L,K)*A(L,0,K)
                  ELSE
          6         B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
                  ENDIF
                ENDIF
          1 CONTINUE
      C THE ORIGINAL KERNEL
            DO 6 I = 0, NRHS
            DO 7 K = 0, N
            DO 8 L = 0, NMAT
          8 B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 7 J = 1, MIN (M, N-K)
            DO 7 L = 0, NMAT
          7 B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
            DO 6 K = N, 0, -1
            DO 9 L = 0, NMAT
          9 B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 6 J = 1, MIN (M, K)
            DO 6 L = 0, NMAT
          6 B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

  35. After loop fusion: recursive three-region partitioning

  36. 6. Summary
      [Figure labels: PDM, 3R, REC]
      • Recurrence chain partitioning is scalable to any size of the iteration space
      • REC partitioning reveals outermost parallelism, with no synchronization between partitioned regions
      • The limitation of REC partitioning and its compensation: we provide fall-back alternatives if REC cannot be applied
        (1) PDM + minimal distance (always applicable)
        (2) Recursive three-region partitioning (applicable for constant loop bounds; in some cases, e.g. Chen’s example, for any loop bounds)
