Parallel Programming using the Iteration Space Visualizer

Yijun Yu and Erik H. D'Hollander
University of Ghent, Belgium
http://www.elis.rug.ac.be/paris/ppt

Introduction
• Overview of the approach: interactive vs automatic
• Loop dependence
• Iteration Space Dependence Graph (ISDG)
• Instrumentation and ISDG construction
• Visualization of …
  • Dependence
  • Transformations
• Applications and Results
• Conclusion and Future work

Overview of the approach
[Figure: flowchart of two paths from Program to Code Generation. Iteration Space Visualizer (interactive; exact?): Instrument the program, Construct the ISDG, Visualize dependence, Visualize transformation. Parallel Compiler (automatic; why?): Dataflow Analysis, Dependence Analysis, Program Transformation.]

Loop Dependence
• Nested loops are the focus of parallel programming
• Data dependences occur when the same memory location is accessed more than once and at least one of the accesses is a WRITE
• Data dependences are classified as flow (first WRITE, then READ), anti (first READ, then WRITE) or output (WRITE after WRITE)
• Loop dependence is the ordering between data-dependent loop iterations
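The three dependence classes can be written down as a small lookup. The Python sketch below is illustrative only: the function name `classify_dependence` and the 'R'/'W' encoding are assumptions made for this example, not part of the tool.

```python
def classify_dependence(first, second):
    """Classify the dependence between two accesses to the same location.

    `first` is the earlier access and `second` the later one; each is
    'R' (read) or 'W' (write). Two reads do not create a dependence,
    so that case returns None.
    """
    kinds = {
        ("W", "R"): "flow",    # first WRITE, then READ
        ("R", "W"): "anti",    # first READ, then WRITE
        ("W", "W"): "output",  # WRITE after WRITE
    }
    return kinds.get((first, second))


for pair in [("W", "R"), ("R", "W"), ("W", "W"), ("R", "R")]:
    print(pair, classify_dependence(*pair))
```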

The Iteration Space Dependence Graph (ISDG)
The object to be visualized is the ISDG = Iteration Space + Loop Dependence
• An iteration I = (i1..im) is a point in the m-D iteration space, which is mapped to the 3D space
• Dependent iterations I and J are linked by an arrow I → J

An example of ISDG
      do i=1,n
        do j=1,n
          do k=1,2
            if (k.eq.1) then
              a(i,j,k)=(a(i-1,j,k)+a(i+1,j,k))/2
            else
              a(i,j,k)=(a(i,j-1,k)+a(i,j+1,k))/2
            endif
          enddo
        enddo
      enddo
[Figure: ISDG of the loop nest on axes i, j, k, showing the iteration points (1,1,1) to (4,4,2)]

Instrumentation and the ISDG construction
• Program instrumentation
  • Loop iteration: id + indices
  • Array reference: id + name + Read | Write + subscripts
• ISDG construction
  • Create the iteration points from the indices
  • Set up a reference list for every accessed location
  • Mark flow-, anti- and output-dependence arrows
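The construction steps can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the record layout (a dict of iteration ids to indices, plus a list of `(id, mode, name, subscripts)` tuples in execution order) and the function name `build_isdg` are assumptions chosen to mirror the records listed above.

```python
from collections import defaultdict


def build_isdg(iterations, references):
    """Build dependence arrows from an execution trace.

    iterations: dict mapping iteration id -> index tuple (the iteration point)
    references: list of (id, mode, name, subscripts), mode is 'r' or 'w',
                given in execution order
    Returns a set of (source point, sink point, kind) arrows.
    """
    last_write = {}                        # location -> id of the last writer
    reads_since_write = defaultdict(list)  # location -> readers since that write
    arrows = set()
    for it, mode, name, subs in references:
        loc = (name, tuple(subs))
        writer = last_write.get(loc)
        if mode == "r":
            if writer is not None and writer != it:
                arrows.add((iterations[writer], iterations[it], "flow"))
            reads_since_write[loc].append(it)
        else:
            for reader in reads_since_write[loc]:
                if reader != it:
                    arrows.add((iterations[reader], iterations[it], "anti"))
            if writer is not None and writer != it:
                arrows.add((iterations[writer], iterations[it], "output"))
            last_write[loc] = it
            reads_since_write[loc] = []
    return arrows


# Hypothetical trace of: do i=1,3 / a(i) = a(i-1) + 1 / enddo
iterations = {1: (1,), 2: (2,), 3: (3,)}
references = [
    (1, "r", "a", (0,)), (1, "w", "a", (1,)),
    (2, "r", "a", (1,)), (2, "w", "a", (2,)),
    (3, "r", "a", (2,)), (3, "w", "a", (3,)),
]
for arrow in sorted(build_isdg(iterations, references)):
    print(arrow)
```

Arrows between accesses of the same iteration are skipped, since the graph links distinct iteration points.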

Dependence Visualization
• Loop visualization
  • 3D view-port of the iteration space
  • Graphical operations
• Detecting and enhancing parallelism
  • Automatic parallelization
  • Maximal parallelism detection
  • Parallelization by plane execution

Loop Visualization
• Visualization of the ISDG
  • Points + Arrows + Colors + Labels + Axes
• 3D view-port of the iteration space
  • =3D, >3D and <3D
  • >3D: projection (condensed points and arrows)
  • <3D: expansion (dummy index dimension)
• ISDG operations
  • Graphical operations: rotate, move and animate
  • Query dialogs: selection, variable zooming, dependence type filtering, etc.

Automatic Parallelization
• Sequential execution
  • Traverse the iteration space in lexicographical order and count the iterations: Tseq
• Parallel execution
  • Traverse the iterations of a marked loop in parallel and count the steps: Tpar
  • Report the speedup Spara = Tseq / Tpar
• Automatic parallelization
  • Test whether the dependence ordering is kept for every combination of parallel loops: DOALL i1,i2,i3? DOALL i1,i2? DOALL i1,i3? DOALL i2,i3? DOALL i1? DOALL i2? DOALL i3?
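A rough Python sketch of this search follows. It rests on a simplifying assumption that is not stated in the slides: all iterations sharing the same sequential indices run as one parallel step, so a combination is accepted only if every arrow's source precedes its sink in the remaining sequential indices. The function names and the 3x3 example are hypothetical.

```python
from itertools import combinations


def doall_steps(points, arrows, parallel):
    """Return Tpar when the loops at positions `parallel` run as DOALL.

    Returns None when some dependence arrow would no longer be ordered,
    i.e. when source and sink fall into the same or an earlier step.
    """
    def seq_part(p):
        return tuple(v for pos, v in enumerate(p) if pos not in parallel)

    if any(not seq_part(src) < seq_part(dst) for src, dst in arrows):
        return None
    return len({seq_part(p) for p in points})


def best_doall(points, arrows):
    """Try every non-empty combination of loops; return (speedup, loops)."""
    depth = len(points[0])
    t_seq = len(points)   # Tseq: one step per iteration
    best = (1.0, ())      # purely sequential execution
    for size in range(depth, 0, -1):
        for parallel in combinations(range(depth), size):
            t_par = doall_steps(points, arrows, set(parallel))
            if t_par is not None and t_seq / t_par > best[0]:
                best = (t_seq / t_par, parallel)
    return best


# Hypothetical 3x3 loop nest in which iteration (i, j) depends on (i-1, j).
points = [(i, j) for i in range(1, 4) for j in range(1, 4)]
arrows = [((i, j), (i + 1, j)) for i in range(1, 3) for j in range(1, 4)]
print(best_doall(points, arrows))  # (3.0, (1,)): only the j loop may be a DOALL
```

Loop positions are 0-based, so `(1,)` denotes the second loop.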

Maximal Parallelism Detection
• Data-flow order
  • An iteration is executed as soon as its data are ready, i.e. after all the iterations it depends on have been carried out
  • Iterations with the same delay are executed at the same time, i.e. in parallel
  • Dependent iterations are executed sequentially. Count the steps: Tdf
• Minimal execution time = maximal parallelism
  • Maximal speedup Smax = Tseq / Tdf
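The data-flow schedule amounts to a longest-path computation over the ISDG. The sketch below is one possible way to compute it, assuming every arrow points from a lexicographically earlier iteration to a later one (as in a sequential execution trace); the function name and the example loop are hypothetical.

```python
def dataflow_steps(points, arrows):
    """Assign each iteration the earliest step at which its data are ready.

    An iteration without predecessors runs in step 1; otherwise it runs one
    step after its latest predecessor. Assumes arrows go forward in
    lexicographic order, so sorted(points) visits sources before sinks.
    """
    preds = {p: [] for p in points}
    for src, dst in arrows:
        preds[dst].append(src)
    step = {}
    for p in sorted(points):
        step[p] = 1 + max((step[q] for q in preds[p]), default=0)
    return step


# Hypothetical 3x3 loop nest in which iteration (i, j) depends on (i-1, j).
points = [(i, j) for i in range(1, 4) for j in range(1, 4)]
arrows = [((i, j), (i + 1, j)) for i in range(1, 3) for j in range(1, 4)]
step = dataflow_steps(points, arrows)
t_seq = len(points)          # Tseq
t_df = max(step.values())    # Tdf
print(t_df, t_seq / t_df)    # 3 3.0
```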

Plane Parallelization
• Define a cutting plane Ax+By+Cz=D
  • By clicking on three points
  • By giving the parameters A, B, C, D
• Plane execution
  • Traverse the planes d0 ≤ Ax+By+Cz < d0+Td along the normal vector (A,B,C)
• Plane parallelization
  • Matching the dataflow execution may enhance the speedup Splane = Tseq / Td
  • Verified by cross-plane dependence checking or by 3D → 2D projection checking
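Cross-plane dependence checking can be illustrated with a short Python sketch. It assumes that Td is the number of integer plane offsets between the first and the last plane, and that a plane is valid when every arrow crosses to a strictly later plane; the function name `plane_steps` and the 3x3x3 example are made up for illustration.

```python
def plane_steps(points, arrows, normal):
    """Return Td for execution by planes with the given normal, or None.

    Each point lies on the plane with offset A*x + B*y + C*z. The plane
    order is valid only if every arrow leads to a strictly later plane.
    """
    def offset(p):
        return sum(a * x for a, x in zip(normal, p))

    if any(offset(dst) <= offset(src) for src, dst in arrows):
        return None
    offsets = [offset(p) for p in points]
    return max(offsets) - min(offsets) + 1


# Hypothetical 3x3x3 nest where each point depends on its predecessor
# along every axis.
rng = range(1, 4)
points = [(i, j, k) for i in rng for j in rng for k in rng]
arrows = []
for (i, j, k) in points:
    for nxt in [(i + 1, j, k), (i, j + 1, k), (i, j, k + 1)]:
        if max(nxt) <= 3:
            arrows.append(((i, j, k), nxt))

t_d = plane_steps(points, arrows, (1, 1, 1))
print(t_d, len(points) / t_d)               # 7 planes for 27 iterations
print(plane_steps(points, arrows, (1, 0, 0)))  # None: arrows stay inside a plane
```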

Dependence Visualization: procedural summary
[Figure: flowchart with the boxes Start; Prune false dependences; Maximal parallelism detection (Sdf); Automatic parallelization (Spara); decision Spara = Sdf? (Yes/No); Plane parallelization (Splane); decision Splane > Spara? (Yes/No); Program transformation; End]

Program Transformations
When Sdf > Spara, loop transformations may enhance the parallelism of the target loop…
• Unimodular Loop Transformations
  • Why? 3D → 3D, 1-to-1, etc.
• Loop Projections and Expansions
  • Loop Projection: >3D → 3D
  • Loop Expansion: <3D → 3D

Unimodular Transformations
[Figure: transformation matrix built from the normal vector (A,B,C), with the remaining entries still to be determined]
• Normal vector (A,B,C)
• Unimodular
• Legality
Look for a suitable transformation
• Interactive way
• Automatic way
  • Possible when the array index expressions are linear and all the distance vectors lie in a plane
  • Extract the largest base vectors of the dependence distances and construct the transformation (pseudo distance matrix approach)
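The two conditions on a candidate matrix can be checked mechanically. The Python sketch below uses the standard definitions: a matrix is unimodular when it is an integer matrix with determinant ±1, and a transformation is legal when every transformed distance vector stays lexicographically positive. The matrix is the one used later for Lim's example (I' = I - J + K, J' = I, K' = J); the two distance vectors were worked out by hand from the expanded loop and are an assumption of this example, as are the function names.

```python
def det3(m):
    """Determinant of a 3x3 integer matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def apply(m, v):
    """Matrix-vector product m * v as a tuple."""
    return tuple(sum(r * x for r, x in zip(row, v)) for row in m)


def is_unimodular(m):
    return det3(m) in (1, -1)


def is_legal(m, distances):
    """True if every transformed distance vector is lexicographically positive."""
    zero = (0, 0, 0)
    return all(apply(m, d) > zero for d in distances)  # tuple order is lexicographic


T = [(1, -1, 1), (1, 0, 0), (0, 1, 0)]
# Hand-derived distance vectors (l1, l2, l3) for the expanded Lim loop:
#   b(l1-1,l2) written at l3=1, read at l3=0 of the next l1 -> (1, 0, -1)
#   a(l1,l2-1) written at l3=0, read at l3=1 of the next l2 -> (0, 1, 1)
distances = [(1, 0, -1), (0, 1, 1)]

print(is_unimodular(T), is_legal(T, distances))
print([apply(T, d) for d in distances])  # first component 0: outer loop carries nothing
```

Reversing the first loop, `[(-1, 0, 0), (0, 1, 0), (0, 0, 1)]`, is also unimodular but fails the legality test for the distance (1, 0, -1).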

Loop Expansion
• Non-perfectly vs perfectly nested loops
• Statement- vs iteration-level parallelism
• Statement reordering, affine remapping
• Loop expansion: use an additional dimension to index the statements in the loop body
• Unimodular loop transformations are still applicable at the statement level

Applications and Results
• Gauss-Jordan: linear system solver
• Lim's example: statement-level parallelism
• Cholesky kernel: loop projection
• CFD application: unimodular transformation

Gauss-Jordan elimination
The loop nest, with the C$doisv directive on the innermost loop:
      do i=1,n
        do j=1,n
          if (i.ne.j) then
            f=a(j,i)/a(i,i)
C$doisv
            do k=i+1,n+1
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo
The instrumented program, which writes the iteration and reference records to unit 11:
      id=0
      do i = 1,n
        do j = 1,n
          if (i.ne.j) then
            write(11,*) id+1," r ","a",2,j,i
            write(11,*) id+1," r ","a",2,i,i
            write(11,*) id+1," w ","f"," 1 0 "
            f=a(j,i)/a(i,i)
            do k = i+1,n+1
              id=id+1
              write(11,*) id,i,j,k
              write(11,*) id," r ","a",2,j,k
              write(11,*) id," r ","f"," 1 0 "
              write(11,*) id," r ","a",2,i,k
              write(11,*) id," w ","a",2,j,k
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo

Gauss-Jordan elimination
[Figure: ISDG of the Gauss-Jordan loop nest on axes I, J, K, with the cutting plane I = 1]
• Plane: I = 1
• Seq. time: 30
• Dataflow: 4, Speedup: 7.5
• DOALL J, K valid
• Loop time: 4, Speedup: 7.5

Loop Expansion: Lim's Example
The expanded program, with the additional loop l3 indexing the statements:
      do l1=1,n
        do l2=1,n
c$doisv
          do l3=0,1
            if (l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddo
The original program:
      do l1=1,n
        do l2=1,n
          a(l1,l2)=a(l1,l2)+b(l1-1,l2)
          b(l1,l2)=a(l1,l2-1)*b(l1,l2)
        enddo
      enddo

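Lim's example: unimodular transformation
[Figure: ISDG before the transformation on axes l1, l2, l3, and after it on axes i1, i2, i3]
Transformation: (i1, i2, i3) = (l1 - l2 + l3, l1, l2), i.e. the matrix with rows (1 -1 1), (1 0 0), (0 1 0)
Before the transformation, Plane: L1-L2+L3=0
• Seq. time: 32
• Dataflow: 7, Speedup: 4.57
• DOALL L3 valid
• Loop time: 16, Speedup: 2.00
After the transformation, Plane: i1 = 0
• Seq. time: 32
• Dataflow: 7, Speedup: 4.57
• DOALL i1 valid
• Loop time: 7, Speedup: 4.57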

Lim's example: code generation
The loop bounds are obtained by Fourier-Motzkin elimination, the original indices by inversion of the transformation matrix:
• Transformation matrix: rows (1 -1 1), (1 0 0), (0 1 0)
• Inverse matrix: rows (0 1 0), (0 0 1), (1 -1 1)
C The unimodular transformed code
      doall i1 = 1-n, n
        do i2 = max(i1,1), min(n,i1+n)
          do i3 = max(-i1+i2,1), min(-i1+i2+1,n)
            l1 = i2
            l2 = i3
            l3 = i1 - i2 + i3
            if (l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddoall
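One way to gain confidence in the generated bounds is to enumerate both loop nests for small n and compare the visited iterations. The Python sketch below does this for the bounds shown in the transformed code; the function names are hypothetical and the check is a brute-force illustration, not the code generator itself.

```python
def expanded_iterations(n):
    """Iterations (l1, l2, l3) of the expanded loop: l1, l2 = 1..n, l3 = 0..1."""
    return {(l1, l2, l3)
            for l1 in range(1, n + 1)
            for l2 in range(1, n + 1)
            for l3 in (0, 1)}


def transformed_iterations(n):
    """Walk the transformed nest and map each (i1, i2, i3) back to (l1, l2, l3)."""
    visited = []
    for i1 in range(1 - n, n + 1):
        for i2 in range(max(i1, 1), min(n, i1 + n) + 1):
            for i3 in range(max(-i1 + i2, 1), min(-i1 + i2 + 1, n) + 1):
                # Inverse mapping: l1 = i2, l2 = i3, l3 = i1 - i2 + i3
                visited.append((i2, i3, i1 - i2 + i3))
    return visited


for n in range(1, 6):
    visited = transformed_iterations(n)
    same = set(visited) == expanded_iterations(n) and len(visited) == 2 * n * n
    print(n, len(visited), same)
```

Each original iteration is visited exactly once, and the statement index l3 only takes the values 0 and 1.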

Lim's example: code generation
Transformation: I' = I - J + K, J' = I, K' = J, i.e. the matrix with rows (1 -1 1), (1 0 0), (0 1 0)
Input to the Omega calculator:
      symbolic n;
      IS1:={[i,j,k]:1<=i,j<=n && k=0};
      IS2:={[i,j,k]:1<=i,j<=n && k=1};
      T1:={[i,j,k]->[i-j+k,i,j]};
      T2:={[i,j,k]->[i-j+k,i,j]};
      codegen 0 T1:IS1,T2:IS2;

Lim's example: code generation
Transformation matrix: rows (1 -1 1), (1 0 0), (0 1 0)
C the optimized code by Omega calculator
      doall p = 1-n, n
        if (p.ge.1) b(p,1) = a(p,0) * b(p,1)
        do l1 = max(p+1,1), min(p+n-1,n)
          a(l1,l1-p) = a(l1,l1-p)+b(l1-1,l1-p)
          b(l1,l1-p+1) = a(l1,l1-p)*b(l1,l1-p+1)
        enddo
        if (p.le.0) a(p+n,n) = a(p+n,n)+b(p+n-1,n)
      enddoall

Loop Fusion: Cholesky kernel
The fused kernel, loop order (I,K,J,L):
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
        DO 1 J = 0,I0
C$DOISV
        DO 1 L = 0,NMAT
          IF (K.LE.N) THEN
            IF (J.EQ.0) THEN
 8            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 7            B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
            ENDIF
          ELSE
            IF (J.EQ.0) THEN
 9            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 6            B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
            ENDIF
          ENDIF
 1    CONTINUE
C THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
 8          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 J = 1, MIN (M, N-K)
            DO 7 L = 0, NMAT
 7            B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
 9          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 J = 1, MIN (M, K)
            DO 6 L = 0, NMAT
 6            B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

Cholesky kernel: loop orders (I,K,J,L) and (L,I,K,J)
The kernel in loop order (I,K,J,L), with the L loop marked as a DOALL:
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
        DO 1 J = 0,I0
C$DOISV
        DOALL 1 L = 0,NMAT
          IF (K.LE.N) THEN
            IF (J.EQ.0) THEN
 8            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 7            B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
            ENDIF
          ELSE
            IF (J.EQ.0) THEN
 9            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 6            B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
            ENDIF
          ENDIF
 1    CONTINUE

CFD application
• Computational Fluid Dynamics (CFD)
  • Navier-Stokes equations
  • Successive Over-Relaxation (SOR)
• Kernel 3D loop: difficult to analyze
  • 172 array references/iteration
  • 33 if-branches/iteration
• Unimodular transformation found!

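CFD Application
[Figure: ISDG before the transformation on axes i1, i2, i3, and after it on axes I1', I2', I3']
Transformation: (I1', I2', I3') = (3 i1 + 2 i2 + i3, i1, i2), i.e. the matrix with rows (3 2 1), (1 0 0), (0 1 0)
Before the transformation, Plane: 3 i1 + 2 i2 + i3 = 9
• Range: i1 = 1, 4; i2 = 1, 4; i3 = 1, 4
• Labelled points: (1,2,2), (1,1,4), (2,1,1)
• Seq. time: 64
• Dataflow: 19, Speedup: 3.37
• Loop time: 64, Speedup: 1.00
After the transformation, Plane: I1' = 9
• Range: I1' = 6, 24; I2' = 1, 4; I3' = 1, 4
• Labelled points: (9,1,2), (9,1,1), (9,2,1)
• DOALL I2', I3'
• Dataflow: 19, Speedup: 3.37
• Loop time: 19, Speedup: 3.37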

Conclusion and Future work
• Allows the exact visualization of real program loops
• Assistance with detecting parallel loops
• Estimation of the maximal speedup using dataflow execution
• Assistance with finding suitable loop transformations
• Future work
  • Seamless integration into PPT (parallel programming environment)

THANKS • For your attention! • Any questions?
