
Parallel Programming using the Iteration Space Visualizer

Parallel Programming using the Iteration Space Visualizer. Yijun Yu and Erik H. D'Hollander, University of Ghent, Belgium. http://www.elis.rug.ac.be/paris/ppt. Introduction: overview of the approach (interactive vs automatic), loop dependence, Iteration Space Dependence Graph (ISDG).


Presentation Transcript


  1. Parallel Programming using the Iteration Space Visualizer • Yijun Yu and Erik H. D'Hollander, University of Ghent, Belgium • http://www.elis.rug.ac.be/paris/ppt

  2. Introduction • Overview of the approach: interactive vs automatic • Loop dependence • Iteration Space Dependence Graph (ISDG) • Instrumentation and ISDG construction • Visualization of dependence and of transformations • Applications and results • Conclusion and future work

  3. Overview of the approach • [Diagram: a Program is handled by two tool chains. Iteration Space Visualizer (interactive): instrument the program, construct the ISDG, visualize dependence, visualize transformation. Parallel Compiler (automatic): dataflow analysis, dependence analysis, program transformation, code generation. Callouts on the diagram: "exact?" and "why?"]

  4. Introduction (2) • Overview of the approach: interactive vs automatic • Loop dependence • Iteration Space Dependence Graph (ISDG) • Instrumentation and ISDG construction • Visualization of dependence and of transformations • Applications and results • Conclusion and future work

  5. Loop Dependence • Nested loops are the focus of parallel programming • Data dependences occur when the same memory location is accessed more than once and at least one of the accesses is a WRITE • A data dependence is classified as flow (first WRITE, then READ), anti (first READ, then WRITE) or output (WRITE after WRITE) • A loop dependence is the ordering between data-dependent loop iterations
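
The classification on this slide fits in a small lookup. The sketch below is only an illustration; the function name `classify` and the "R"/"W" encoding are assumptions, not part of the tool.

```python
def classify(first, second):
    """Classify the dependence between two accesses to the same location.

    `first` and `second` are "R" or "W", given in execution order.
    Returns "flow", "anti" or "output"; READ after READ is no dependence (None).
    """
    kinds = {
        ("W", "R"): "flow",    # first WRITE, then READ
        ("R", "W"): "anti",    # first READ, then WRITE
        ("W", "W"): "output",  # WRITE after WRITE
    }
    return kinds.get((first, second))


for pair in [("W", "R"), ("R", "W"), ("W", "W"), ("R", "R")]:
    print(pair, "->", classify(*pair))
```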

  6. The Iteration Space Dependence Graph (ISDG) • The object to be visualized is the ISDG = Iteration Space + Loop Dependence • An iteration I=(i1..im) is a point in the m-D iteration space, which is mapped to the 3D space • Dependent iterations I and J are linked by an arrow I → J

  7. An example of ISDG

     do i=1,n
       do j=1,n
         do k=1,2
           if (k.eq.1) then
             a(i,j,k)=(a(i-1,j,k)+a(i+1,j,k))/2
           else
             a(i,j,k)=(a(i,j-1,k)+a(i,j+1,k))/2
           endif
         enddo
       enddo
     enddo

     [Figure: the ISDG of this loop for n = 4, drawn in 3D with axes i, j and k; the 32 iteration points are labeled (1,1,1) through (4,4,2).]

  8. Instrumentation and the ISDG construction • Program instrumentation • Loop iteration: id + indices • Array reference: id + name + Read | Write + subscripts • ISDG construction • Create the iteration points from the indices • Set up a reference list for every accessed location • Mark flow-, anti- and output-dependence arrows
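
The construction steps above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the record layout `(iteration, mode, location)` and the names `build_isdg` and `example_trace` are assumptions. The trace is generated from the loop nest of the "An example of ISDG" slide rather than read from a file.

```python
from collections import defaultdict


def build_isdg(trace):
    """Build dependence arrows from an access trace.

    `trace` is a list of (iteration, mode, location) records in sequential
    execution order, with mode "R" or "W".
    Returns a set of (source_iteration, target_iteration, kind) arrows.
    """
    last_write = {}            # location -> iteration of the most recent WRITE
    reads = defaultdict(list)  # location -> READs since that WRITE
    arrows = set()
    for it, mode, loc in trace:
        writer = last_write.get(loc)
        if mode == "R":
            if writer is not None and writer != it:
                arrows.add((writer, it, "flow"))
            reads[loc].append(it)
        else:
            for reader in reads[loc]:
                if reader != it:
                    arrows.add((reader, it, "anti"))
            if writer is not None and writer != it:
                arrows.add((writer, it, "output"))
            last_write[loc] = it
            reads[loc] = []
    return arrows


def example_trace(n):
    """Accesses of the example loop nest (i, j in 1..n, k in 1..2)."""
    trace = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in (1, 2):
                it = (i, j, k)
                di, dj = (1, 0) if k == 1 else (0, 1)
                trace.append((it, "R", ("a", i - di, j - dj, k)))
                trace.append((it, "R", ("a", i + di, j + dj, k)))
                trace.append((it, "W", ("a", i, j, k)))
    return trace


arrows = build_isdg(example_trace(4))
# Summarize the arrows by kind and distance vector (target minus source).
distances = {(kind, tuple(d - s for s, d in zip(src, dst)))
             for src, dst, kind in arrows}
print(len(arrows), sorted(distances))
```

For this loop the arrows run along i in the k = 1 layer and along j in the k = 2 layer, and no arrow connects the two layers.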

  9. Introduction (3) • Overview of the approach: interactive vs automatic • Loop dependence • Iteration Space Dependence Graph (ISDG) • Instrumentation and ISDG construction • Visualization of dependence and of transformations • Applications and results • Conclusion and future work

  10. Dependence Visualization • Loop visualization • 3D view-port of Iteration space • Graphical operations • Detecting and enhancing parallelism • Automatic parallelization • Maximal parallelism detection • Parallelization by plane execution

  11. Loop Visualization • Visualization of the ISDG • Points + Arrows + Colors + Labels + Axes • 3D view-port of the iteration space • Loop nests of depth = 3, > 3 and < 3: projection for > 3D (condensed points and arrows), expansion for < 3D (dummy index dimension) • ISDG operations • Graphical operations: rotate, move and animate • Query dialogs: selection, variable zooming, dependence type filtering, etc.

  12. Automatic Parallelization • Sequential execution • Traverse the iteration space in lexicographical order and count the iterations: Tseq • Parallel execution • Traverse the iterations of a marked loop in parallel and count the steps: Tpar • Report the speedup Spara = Tseq / Tpar • Automatic parallelization • Test whether the dependence ordering is kept for every combination of parallelized loops: DOALL i1,i2,i3? DOALL i1,i2? DOALL i1,i3? DOALL i2,i3? DOALL i1? DOALL i2? DOALL i3?
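
A sketch of this test under stated assumptions: a set of loops is accepted as DOALL when none of them carries a dependence (the first non-zero component of every distance vector belongs to a sequential loop), and the step count adds up over sequential loops and takes the maximum over DOALL loops. The iteration space and the two distance vectors are those of the expanded two-statement loop that the slides later call Lim's example, with n = 4; the distance vectors were derived by hand from its loop body, and all function names are illustrative.

```python
from itertools import combinations


def carried_level(d):
    """Index of the outermost loop with a non-zero distance component.

    Assumes d is not the zero vector (arrows link distinct iterations).
    """
    return next(level for level, x in enumerate(d) if x != 0)


def doall_valid(parallel, distances):
    """A set of loops may run as DOALL if none of them carries a dependence."""
    return all(carried_level(d) not in parallel for d in distances)


def loop_time(points, parallel, level=0):
    """Steps needed: sequential loops add up, DOALL loops take the maximum."""
    if level == len(points[0]):
        return 1
    groups = {}
    for p in points:
        groups.setdefault(p[level], []).append(p)
    times = [loop_time(g, parallel, level + 1) for g in groups.values()]
    return max(times) if level in parallel else sum(times)


n = 4
points = [(l1, l2, l3) for l1 in range(1, n + 1)
          for l2 in range(1, n + 1) for l3 in (0, 1)]
distances = [(0, 1, 1), (1, 0, -1)]  # hand-derived, see lead-in

t_seq = loop_time(points, parallel=set())
results = {}
for size in (3, 2, 1):  # the seven combinations listed on the slide
    for loops in combinations(range(3), size):
        if doall_valid(set(loops), distances):
            results[loops] = loop_time(points, set(loops))

for loops, t_par in results.items():
    print("DOALL on loop index", loops, "Tpar =", t_par, "Spara =", t_seq / t_par)
```

Only the innermost loop passes the test here, which agrees with the "DOALL L3 valid, Loop time: 16, Speedup: 2.00" figures reported for this example later in the deck.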

  13. Maximal Parallelism Detection • Data-flow order • An iteration is executed as soon as its data are ready, i.e. after all the iterations it depends on have been carried out • Iterations with the same delay are executed at the same time, i.e. in parallel • Dependent iterations are executed sequentially. Count the steps: Tdf • Minimal execution time = maximal parallelism • Maximal speedup Smax = Tseq/Tdf
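
The data-flow order amounts to a longest-path computation over the ISDG. The sketch below assumes the arrows are already known; they are written down by hand for the expanded two-statement loop that the slides later call Lim's example, with n = 4. The name `dataflow_steps` is illustrative.

```python
def dataflow_steps(points, arrows):
    """Earliest step of each iteration: 1 + latest step among its predecessors.

    `points` must be in sequential (lexicographic) order, so that every
    predecessor has been scheduled before its successors are visited.
    """
    preds = {p: [] for p in points}
    for src, dst in arrows:
        preds[dst].append(src)
    step = {}
    for p in points:
        step[p] = 1 + max((step[q] for q in preds[p]), default=0)
    return step


n = 4
points = [(l1, l2, l3) for l1 in range(1, n + 1)
          for l2 in range(1, n + 1) for l3 in (0, 1)]

# Hand-derived flow dependences of the loop body:
#   l3 = 0: a(l1,l2) = a(l1,l2) + b(l1-1,l2)
#   l3 = 1: b(l1,l2) = a(l1,l2-1) * b(l1,l2)
arrows = []
for l1, l2, l3 in points:
    if l3 == 0 and l2 < n:
        arrows.append(((l1, l2, 0), (l1, l2 + 1, 1)))  # a(l1,l2) written, then read
    if l3 == 1 and l1 < n:
        arrows.append(((l1, l2, 1), (l1 + 1, l2, 0)))  # b(l1,l2) written, then read

step = dataflow_steps(points, arrows)
t_seq, t_df = len(points), max(step.values())
print("Tseq =", t_seq, "Tdf =", t_df, "Smax =", round(t_seq / t_df, 2))
```

With these inputs the sketch reproduces the "Seq. time: 32, Dataflow: 7, Speedup: 4.57" figures shown for this example later in the deck.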

  14. Plane Parallelization • Define a cutting plane Ax+By+Cz=D • By clicking on three points • By giving the parameters A, B, C, D • Plane execution • Traverse the planes d0 ≤ Ax+By+Cz < d0+Td along the normal vector (A,B,C) • Plane parallelization • Matching the dataflow execution may enhance the speedup Splane = Tseq/Td • Verified by cross-plane dependence checking or 3D → 2D projection checking
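
One possible reading of the cross-plane check, as a sketch: iterations on the same plane run in one step and the planes are visited in increasing order along the normal, so every dependence distance d must satisfy (A,B,C)·d ≥ 1. The points and distance vectors are those of the loop on the "An example of ISDG" slide with n = 4; the function names and the choice of candidate normals are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def plane_steps(points, distances, normal):
    """Number of plane steps Td, or None if a dependence does not cross
    the planes in the forward direction."""
    if any(dot(normal, d) < 1 for d in distances):
        return None
    return len({dot(normal, p) for p in points})  # distinct non-empty planes


n = 4
points = [(i, j, k) for i in range(1, n + 1)
          for j in range(1, n + 1) for k in (1, 2)]
distances = [(1, 0, 0), (0, 1, 0)]  # arrows along i (k = 1) and along j (k = 2)

t_seq = len(points)
for normal in [(1, 1, 0), (1, 0, 0), (0, 0, 1)]:
    t_d = plane_steps(points, distances, normal)
    if t_d is None:
        print(normal, "rejected: some dependence stays inside a plane")
    else:
        print(normal, "Td =", t_d, "Splane =", round(t_seq / t_d, 2))
```

The plane x + y = D is accepted with 7 steps, while planes normal to a single axis are rejected because one of the two dependence directions lies inside them.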

  15. Dependence Visualization: procedural summary • [Flowchart from Start to End. Steps: prune false dependences; maximal parallelism detection (Sdf); automatic parallelization (Spara); plane parallelization (Splane); program transformation. Decisions, each with Yes/No branches: "Spara = Sdf?" and "Splane > Spara?"]

  16. Program Transformations • When Sdf > Spara, loop transformations may enhance the parallelism of the target loop • Unimodular loop transformations • Why? 3D → 3D, 1-to-1, etc. • Loop projections and expansions • Loop projection: >3D → 3D • Loop expansion: <3D → 3D

  17. Unimodular Transformations • Look for a suitable transformation • [Figure: transformation matrices containing the normal vector (A,B,C), with the remaining entries marked "?" or "!"; annotations: Unimodular, Legality] • Interactive way • Automatic way • Possible when the array index expressions are linear and all the distance vectors lie in a plane • Extract the largest base vectors of the dependence distances and construct the transformation (pseudo distance matrix approach)
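
The two conditions named on this slide can be checked mechanically. The sketch below assumes the usual definitions: a matrix is unimodular when it is an integer matrix with determinant ±1, and the transformation is legal when every transformed distance vector stays lexicographically positive. The matrix is the one the deck uses later for Lim's example (i1 = l1 - l2 + l3, i2 = l1, i3 = l2); the distance vectors were derived by hand from that loop body, and the function names are illustrative.

```python
def det3(m):
    """Determinant of a 3x3 integer matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def apply(m, v):
    """Matrix-vector product m * v."""
    return tuple(sum(r * x for r, x in zip(row, v)) for row in m)


def lex_positive(v):
    """True if the first non-zero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False


def check(m, distances):
    unimodular = det3(m) in (1, -1)
    transformed = [apply(m, d) for d in distances]
    legal = all(lex_positive(d) for d in transformed)
    return unimodular, legal, transformed


T = [(1, -1, 1), (1, 0, 0), (0, 1, 0)]
distances = [(0, 1, 1), (1, 0, -1)]  # hand-derived, see lead-in
unimodular, legal, transformed = check(T, distances)
print("unimodular:", unimodular, "legal:", legal, "new distances:", transformed)
```

Both transformed distance vectors have a zero first component, so the new outermost loop carries no dependence and can run as a DOALL.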

  18. Loop Expansion • Non-perfectly vs perfectly nested loops • Statement- vs iteration-level parallelism • Statement reordering, affine remapping • Loop expansion: use an additional dimension to index the statements in the loop body • Unimodular loop transformations are still applicable at the statement level

  19. Introduction • Overview of the approach: interactive vs automatic • Loop dependence • Iteration Space Dependence Graph (ISDG) • Instrumentation and ISDG construction • Visualization of dependence and of transformations • Applications and results • Conclusion and future work

  20. Application and Results • Gauss-Jordan: linear system solver • Lim’s example: statement-level parallelism • Cholesky kernel: loop projection • CFD application: unimodular transformation

  21. Gauss-Jordan elimination

      Instrumented version:

      id=0
      do i = 1,n
        do j = 1,n
          if (i.ne.j) then
            write(11,*) id+1," r ","a",2,j,i
            write(11,*) id+1," r ","a",2,i,i
            write(11,*) id+1," w ","f"," 1 0 "
            f=a(j,i)/a(i,i)
            do k = i+1,n
              id=id+1
              write(11,*) id,i,j,k
              write(11,*) id," r ","a",2,j,k
              write(11,*) id," r ","f"," 1 0 "
              write(11,*) id," r ","a",2,i,k
              write(11,*) id," w ","a",2,j,k
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo

      Source loop with the C$doisv directive:

      do i=1,n
        do j=1,n
          if(i.ne.j) then
            f=a(j,i)/a(i,i)
      C$doisv
            do k=i+1,n+1
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo
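
The trace written by the instrumented loop can be read back with a few lines of code. The sketch below assumes a record layout inferred from the `write` statements on this slide: an iteration record is `id i j k`, and a reference record is `id`, `r` or `w`, the variable name, the number of subscripts, then the subscripts (the scalar `f` appears as a one-dimensional reference with subscript 0). The sample records are written by hand for the first iteration (i = 1, j = 2, k = 2); they are not output of the tool, and `parse_trace` is an illustrative name.

```python
def parse_trace(lines):
    """Split trace records into iteration points and references.

    Returns (iterations, references) where
      iterations: id -> tuple of loop indices
      references: list of (id, mode, name, subscripts)
    """
    iterations, references = {}, []
    for line in lines:
        tok = line.split()
        if not tok:
            continue
        ident = int(tok[0])
        if tok[1] in ("r", "w"):
            ndims = int(tok[3])
            subs = tuple(int(t) for t in tok[4:4 + ndims])
            references.append((ident, tok[1], tok[2], subs))
        else:
            iterations[ident] = tuple(int(t) for t in tok[1:])
    return iterations, references


sample = """\
1 r a 2 2 1
1 r a 2 1 1
1 w f 1 0
1 1 2 2
1 r a 2 2 2
1 r f 1 0
1 r a 2 1 2
1 w a 2 2 2
""".splitlines()

iterations, references = parse_trace(sample)
print(iterations)
for ref in references:
    print(ref)
```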

  22. Gauss-Jordan elimination • [Figure: ISDG with axes I, J and K showing 30 iteration points (I,J,K), with J ≠ I and K = I+1..5, from (1,2,2) to (4,3,5)] • Plane: I = 1 • Seq. time: 30 • Dataflow: 4, Speedup: 7.5 • DOALL J, K valid • Loop time: 4, Speedup: 7.5

  23. Loop Expansion: Lim’s Example

      Expanded loop:

      do l1=1,n
        do l2=1,n
      c$doisv
          do l3=0,1
            if(l3.eq.0)a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if(l3.eq.1)b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddo

      The original program:

      do l1=1,n
        do l2=1,n
          a(l1,l2)=a(l1,l2)+b(l1-1,l2)
          b(l1,l2)=a(l1,l2-1)*b(l1,l2)
        enddo
      enddo

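  24. Lim’s example: unimodular transformation

      Transformation: (i1, i2, i3) = T (l1, l2, l3), where T has rows (1 -1 1), (1 0 0), (0 1 0), i.e. i1 = l1 - l2 + l3, i2 = l1, i3 = l2.

      Before (axes l1, l2, l3) • Plane: L1-L2+L3=0 • Seq. time: 32 • Dataflow: 7, Speedup: 4.57 • DOALL L3 valid • Loop time: 16, Speedup: 2.00

      After (axes i1, i2, i3) • Plane: i1 = 0 • Seq. time: 32 • Dataflow: 7, Speedup: 4.57 • DOALL i1 valid • Loop time: 7, Speedup: 4.57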

  25. Lim’s example: code generation

      Inversion: T with rows (1 -1 1), (1 0 0), (0 1 0) has the inverse with rows (0 1 0), (0 0 1), (1 -1 1), i.e. l1 = i2, l2 = i3, l3 = i1 - i2 + i3. Fourier-Motzkin elimination gives the loop bounds.

      C The unimodular transformed code
      doall i1 = 1-n, n
        do i2 = max(i1,1), min(n,i1+n)
          do i3 = max(-i1+i2,1), min(-i1+i2+1,n)
            l1 = i2
            l2 = i3
            l3 = i1 - i2 + i3
            if (l3.eq.0)a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1)b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddoall
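
A quick way to gain confidence in generated bounds is to walk the transformed nest and compare the visited iterations with the original iteration space. The sketch below does this for n = 4, using the bounds and the back-substitution l1 = i2, l2 = i3, l3 = i1 - i2 + i3 from this slide; the function names are illustrative.

```python
def original_points(n):
    """Iteration space of the expanded loop: l1, l2 in 1..n and l3 in 0..1."""
    return {(l1, l2, l3) for l1 in range(1, n + 1)
            for l2 in range(1, n + 1) for l3 in (0, 1)}


def transformed_points(n):
    """Walk the transformed nest and map each (i1,i2,i3) back to (l1,l2,l3)."""
    visited = []
    for i1 in range(1 - n, n + 1):
        for i2 in range(max(i1, 1), min(n, i1 + n) + 1):
            for i3 in range(max(-i1 + i2, 1), min(-i1 + i2 + 1, n) + 1):
                visited.append((i2, i3, i1 - i2 + i3))
    return visited


n = 4
visited = transformed_points(n)
no_duplicates = len(visited) == len(set(visited))
same_space = set(visited) == original_points(n)
print(len(visited), "iterations; no duplicates:", no_duplicates,
      "; same iteration space:", same_space)
print("l3 values reached:", sorted({p[2] for p in visited}))
```

Under these bounds l3 only takes the values 0 and 1, matching the expanded loop `do l3=0,1`.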

  26. Lim’s example: code generation

      T has rows (1 -1 1), (1 0 0), (0 1 0): I’ = I – J + K, J’ = I, K’ = J

      Input to the Omega calculator:

      symbolic n;
      IS1:={[i,j,k]:1<=i,j<=n && k=0};
      IS2:={[i,j,k]:1<=i,j<=n && k=1};
      T1:={[i,j,k]->[i-j+k,i,j]};
      T2:={[i,j,k]->[i-j+k,i,j]};
      codegen 0 T1:IS1,T2:IS2;

  27. Lim’s example: code generation (T has rows (1 -1 1), (1 0 0), (0 1 0))

      C the optimized code by Omega calculator
      doall p = 1-n, n
        if (p.ge.1) b(p,1) = a(p,0) * b(p,1)
        do l1 = max(p+1,1), min(p+n-1,n)
          a(l1,l1-p)   = a(l1,l1-p)+b(l1-1,l1-p)
          b(l1,l1-p+1) = a(l1,l1-p)*b(l1,l1-p+1)
        enddo
        if (p.le.0) a(p+n,n) = a(p+n,n)+b(p+n-1,n)
      enddoall
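
The claim behind the `doall` can be tested by running the planes in an arbitrary order and comparing against the original two-statement loop. The sketch below is a Python transliteration under two assumptions: the second statement of the inner loop assigns to `b`, as in the original program, and the arrays carry an extra row and column with index 0 for the boundary reads. Small random integers are used so that the comparison is exact; all names are illustrative.

```python
import random


def init(n, seed=0):
    """Two (n+1) x (n+1) integer arrays, indexed from 0, with fixed contents."""
    rng = random.Random(seed)
    a = {(i, j): rng.randint(1, 3) for i in range(n + 1) for j in range(n + 1)}
    b = {(i, j): rng.randint(1, 3) for i in range(n + 1) for j in range(n + 1)}
    return a, b


def original(n, a, b):
    """The original program of Lim's example."""
    for l1 in range(1, n + 1):
        for l2 in range(1, n + 1):
            a[l1, l2] = a[l1, l2] + b[l1 - 1, l2]
            b[l1, l2] = a[l1, l2 - 1] * b[l1, l2]


def plane(n, a, b, p):
    """Body of the doall loop for one value of p."""
    if p >= 1:
        b[p, 1] = a[p, 0] * b[p, 1]
    for l1 in range(max(p + 1, 1), min(p + n - 1, n) + 1):
        a[l1, l1 - p] = a[l1, l1 - p] + b[l1 - 1, l1 - p]
        b[l1, l1 - p + 1] = a[l1, l1 - p] * b[l1, l1 - p + 1]
    if p <= 0:
        a[p + n, n] = a[p + n, n] + b[p + n - 1, n]


n = 4
a_ref, b_ref = init(n)
original(n, a_ref, b_ref)

planes = list(range(1 - n, n + 1))
random.Random(1).shuffle(planes)  # any order stands in for parallel execution
a_par, b_par = init(n)
for p in planes:
    plane(n, a_par, b_par, p)

same = (a_par == a_ref) and (b_par == b_ref)
print("planes executed in order", planes, "-> same result:", same)
```

A shuffled order is only a stand-in for true parallel execution; it checks that no plane reads or overwrites data belonging to another plane.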

  28. Loop Fusion: Cholesky kernel

      Fused loop nest (I,K,J,L):

        DO 1 I = 0,NRHS
        DO 1 K = 0,2*N+1
          IF (K.LE.N) THEN
            I0 = MIN(M,N-K)
          ELSE
            I0 = MIN(M,2*N-K+1)
          ENDIF
          DO 1 J = 0,I0
      C$DOISV
          DO 1 L = 0,NMAT
            IF (K.LE.N) THEN
              IF (J.EQ.0) THEN
      8         B(I,L,K)=B(I,L,K)*A(L,0,K)
              ELSE
      7         B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
              ENDIF
            ELSE
              IF (J.EQ.0) THEN
      9         B(I,L,K)=B(I,L,K)*A(L,0,K)
              ELSE
      6         B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
              ENDIF
            ENDIF
      1 CONTINUE

      C THE ORIGINAL KERNEL
        DO 6 I = 0, NRHS
          DO 7 K = 0, N
            DO 8 L = 0, NMAT
      8       B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 7 J = 1, MIN (M, N-K)
              DO 7 L = 0, NMAT
      7         B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
          DO 6 K = N, 0, -1
            DO 9 L = 0, NMAT
      9       B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 6 J = 1, MIN (M, K)
              DO 6 L = 0, NMAT
      6         B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

  29. Cholesky kernel

      (I,K,J,L)

        DO 1 I = 0,NRHS
        DO 1 K = 0,2*N+1
          IF (K.LE.N) THEN
            I0 = MIN(M,N-K)
          ELSE
            I0 = MIN(M,2*N-K+1)
          ENDIF
          DO 1 J = 0,I0
      C$DOISV
          DOALL 1 L = 0,NMAT
            IF (K.LE.N) THEN
              IF (J.EQ.0) THEN
      8         B(I,L,K)=B(I,L,K)*A(L,0,K)
              ELSE
      7         B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
              ENDIF
            ELSE
              IF (J.EQ.0) THEN
      9         B(I,L,K)=B(I,L,K)*A(L,0,K)
              ELSE
      6         B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
              ENDIF
            ENDIF
      1 CONTINUE

      (L,I,K,J)

  30. CFD application • Computational Fluid Dynamics (CFD): Navier-Stokes equations • Successive Over-Relaxation (SOR) • Kernel 3D loop, difficult to analyze: 172 array references/iteration, 33 if-branches/iteration • Unimodular transformation found!

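  31. CFD Application

      Transformation: i1’ = 3 i1 + 2 i2 + i3, i2’ = i1, i3’ = i2 (T has rows (3 2 1), (1 0 0), (0 1 0)).

      Before (axes i1, i2, i3) • Range: i1 = 1..4, i2 = 1..4, i3 = 1..4 • Plane: 3 i1 + 2 i2 + i3 = 9, through the labeled points (1,2,2), (1,1,4) and (2,1,1) • Seq. time: 64 • Dataflow: 19, Speedup: 3.37 • Loop time: 64, Speedup: 1.00

      After (axes I1’, I2’, I3’) • Range: I1’ = 6..24, I2’ = 1..4, I3’ = 1..4 • Plane: i1’ = 9, through the labeled points (9,1,2), (9,1,1) and (9,2,1) • DOALL i2’, i3’ • Dataflow: 19, Speedup: 3.37 • Loop time: 19, Speedup: 3.37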

  32. Conclusion and Future work • Allowing the exact visualization of real program loops • Assistance with detecting parallel loops • Estimation of the maximal speedup using dataflow execution • Assistance with finding suitable loop transformations • Future work: • Seamless integration into PPT (parallel programming environment)

  33. THANKS • For your attention! • Any questions?
