Optimizing HPF Extensions for Irregular Data Structures in Parallel Programs

HPF Language Extensions for Implementing Parallel Programs that Manipulate Irregular Data Structures
Thesis presented at the Université de Bordeaux I, 21 December 1999
Frédéric Brégier - LaBRI

Context of the Work
• Parallel programs obtained by compilation
• HPF: standard for data-parallel programs (regular programs)
• Irregular programs need further investment: poor efficiency
• Optimizations at compile-time
• Optimizations at run-time (generated at compile-time)

Plan
• Optimizations at compile-time
  • Irregular Data Structure (IDS)
  • A Tree to represent an IDS
• Optimizations at run-time
  • Inspection-Execution principles
  • Irregular communications: irregular active processor sets
  • Irregular iteration spaces
  • Scheduling of loops with partial loop-carried dependencies
  • New data-parallel irregular operation: progressive irregular prefix operation
• Conclusion and Perspectives

HPF (High Performance Fortran): a data-parallel language
• May 1993: HPF 1.0; January 1997: HPF 2.0
• Fortran 95 source code + structured comments (!HPF$) giving distributions and parallel properties
• Target code: SPMD parallel code
• "Owner computes" rule
• Run-time guards and generated communications

Source statement:
    A(I) = B(J) + X

Generated SPMD code:
    IF (B(J) is local) THEN
      Send(B(J) to Owner(A(I)))
    END IF
    IF (A(I) is local) THEN
      Receive(in TMP from Owner(B(J)))
      A(I) = TMP + X
    END IF
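The guarded send/receive pattern above can be illustrated with a small sequential simulation. This is only a sketch: the function name `run_statement`, the dictionary-based mailbox, and the cyclic owner mapping are assumptions introduced for illustration and are not taken from the slides or from any HPF run-time library. Unlike the slide's pseudocode, the sketch skips the message when both elements live on the same process.

```python
# Hypothetical simulation of the "owner computes" rule for A(I) = B(J) + X.
# All names here are illustrative assumptions, not HPF run-time calls.

def run_statement(i, j, x, a, b, owner_a, owner_b, nprocs):
    """Execute A(i) = B(j) + x as each SPMD process would, using guards.

    Returns the number of messages exchanged between distinct processes.
    """
    mailbox = {}      # (source, destination) -> value, stands in for Send/Receive
    messages = 0
    # Every process evaluates the send guard.
    for pid in range(nprocs):
        if owner_b(j) == pid:                 # "B(J) is local"
            dst = owner_a(i)
            if dst != pid:                    # no message needed when co-located
                mailbox[(pid, dst)] = b[j]
                messages += 1
    # Every process evaluates the receive/compute guard.
    for pid in range(nprocs):
        if owner_a(i) == pid:                 # "A(I) is local"
            src = owner_b(j)
            tmp = b[j] if src == pid else mailbox.pop((src, pid))
            a[i] = tmp + x
    return messages


def cyclic(k, nprocs=2):
    """Assumed cyclic distribution: 1-based index k -> process rank."""
    return (k - 1) % nprocs


a = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
b = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
remote = run_statement(1, 2, 10.0, a, b, cyclic, cyclic, 2)  # B(2) lives elsewhere
local = run_statement(3, 1, 10.0, a, b, cyclic, cyclic, 2)   # B(1) is co-located
```

Only the owner of A(I) performs the assignment; the other processes merely test their guards, which is the overhead the later slides try to remove.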

Optimizations at compile-time
• Loop iteration space
  • Affine expression
  • Local loop bounds
  • Not optimizable

Source loop:
    !HPF$ INDEPENDENT
    DO I = 1, N
      A(I) = A(I) + 1
    END DO

Generated code:
    ! Cyclic distribution case
    DO I = PID+1, N, NOP
      A(I) = A(I) + 1
    END DO

    ! Block distribution case (N divisible by NOP)
    LB = BLOC * PID + 1
    UB = min(N, LB+BLOC-1)
    DO I = LB, UB
      A(I) = A(I) + 1
    END DO

    ! Indirect distribution
    DO I = 1, N
      IF (A(I) is local) THEN
        A(I) = A(I) + 1
      END IF
    END DO

• Irregular = "what is not regular", not optimizable
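As a quick check of the local loop bounds for the regular cases, the following sketch enumerates the iterations each process would execute. The helper names are assumptions made for this example; the upper bound of a block is taken as LB + BLOC - 1 so that neighbouring blocks do not overlap.

```python
def cyclic_indices(pid, n, nop):
    """Iterations of DO I = PID+1, N, NOP (1-based, inclusive bounds)."""
    return list(range(pid + 1, n + 1, nop))


def block_bounds(pid, n, nop):
    """Local bounds for a BLOCK distribution, assuming N is divisible by NOP."""
    bloc = n // nop
    lb = bloc * pid + 1
    ub = min(n, lb + bloc - 1)
    return lb, ub


N, NOP = 12, 3
cyclic_parts = [cyclic_indices(p, N, NOP) for p in range(NOP)]
block_parts = [block_bounds(p, N, NOP) for p in range(NOP)]

# Each distribution must cover 1..N exactly once.
cyclic_all = sorted(i for part in cyclic_parts for i in part)
block_all = [i for lb, ub in block_parts for i in range(lb, ub + 1)]
```

In both cases the union of the local iteration sets is exactly 1..N, so no run-time guard is needed inside the loop body.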

Irregular Data Structure (IDS)
• Standard irregular format: indirect access arrays, for example CSC (compressed sparse column) for an 8x8 sparse matrix A with columns I to VIII
    JA(1:9)  = 1 3 5 6 9 12 16 18 21
    IA(1:20) = 1 5 2 5 3 4 6 8 1 2 5 4 6 7 8 6 7 4 6 8
    DA(1:20) = non-zero values of A
• Access examples:
    A(1,1)  ->  DA(JA(1))            (IA(JA(1)) = 1)
    A(6,4)  ->  DA(JA(4)+1)          (IA(JA(4)+1) = 6)
    A(:,4)  ->  DA(JA(4):JA(5)-1)
• Irregular distribution formats:
    !HPF$ DISTRIBUTE JA(BLOCK)
    !HPF$ DISTRIBUTE IA(GEN_BLOCK(/5, 10, 5/))
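To make the CSC indirection concrete, here is a minimal sketch using the JA and IA arrays read from the slide's figure. The contents of DA are hypothetical placeholders (the slide gives no numeric values), and the helper names `column` and `get` are assumptions for this example.

```python
# 1-based values as on the slide; Python lists themselves are 0-based.
JA = [1, 3, 5, 6, 9, 12, 16, 18, 21]                                   # column pointers
IA = [1, 5, 2, 5, 3, 4, 6, 8, 1, 2, 5, 4, 6, 7, 8, 6, 7, 4, 6, 8]      # row numbers
DA = [float(k) for k in range(1, 21)]                                  # hypothetical values


def column(j):
    """Return [(row, value)] for column j, i.e. DA(JA(j):JA(j+1)-1)."""
    start, stop = JA[j - 1] - 1, JA[j] - 1     # 0-based half-open range
    return list(zip(IA[start:stop], DA[start:stop]))


def get(i, j):
    """Return A(i, j), or 0.0 when the entry is not stored."""
    for row, val in column(j):
        if row == i:
            return val
    return 0.0


col4 = column(4)
a64 = get(6, 4)      # stored at DA(JA(4)+1)
a21 = get(2, 1)      # not stored
```

Every access goes through JA and then IA, which is exactly the indirection the compiler cannot resolve statically.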

Problems at compile-time
• Distribution: unknown alignment between the arrays of the IDS
• Data accesses: unknown indexes (indirection); for DA(JA(4)+1), JA(4) = ?
• Implies additional run-time guards and communications
• Inefficient SPMD code

Related Work
• Regular to irregular compilation
  • Bik and Wijshoff: "Sparse Compiler"
    • Sparse matrix with known topology
    • Regular analysis + known topology
    • IDS chosen by the compiler
  • Pingali et al.
    • Relational description (between components and access functions)
    • Non-standard and difficult notations
• Compilation of irregular programs
  • Vienna Fortran Compilation System: SPARSE directive
    • Storage format specification
    • Limited to storage formats known by the compiler

The Tree: a generic data structure with hierarchical access
• From a data structure to a tree:
    Tree               Matrix    CSC
    A(i)%COL(j)%VAL    A(j,i)    DA(JA(i)+j-1)
    A(i)%COL(:)%VAL    A(:,i)    DA(JA(i):JA(i+1)-1)
• Representation in HPF2: derived data types of Fortran 95
    type level2
      integer ROW                          ! row number
      real VAL                             ! non-zero value
    end type level2
    type level1
      type (level2), pointer :: COL(:)     ! column
    end type level1
    type (level1), allocatable :: A(:)     ! matrix with a hierarchical access by column
    !HPF$ TREE
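The move from indirect access arrays to a tree can be sketched as follows. The Python classes `Level2` and `Level1` mirror the Fortran derived types above, while the function `csc_to_tree` and the placeholder values in DA are assumptions made for this example, not part of the TREE extension itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Level2:
    row: int      # row number
    val: float    # non-zero value


@dataclass
class Level1:
    col: List[Level2]   # the non-zero entries of one column


def csc_to_tree(ja, ia, da):
    """Build the hierarchical (per-column) view from 1-based CSC arrays."""
    tree = []
    for j in range(len(ja) - 1):
        start, stop = ja[j] - 1, ja[j + 1] - 1
        tree.append(Level1([Level2(ia[k], da[k]) for k in range(start, stop)]))
    return tree


JA = [1, 3, 5, 6, 9, 12, 16, 18, 21]
IA = [1, 5, 2, 5, 3, 4, 6, 8, 1, 2, 5, 4, 6, 7, 8, 6, 7, 4, 6, 8]
DA = [float(k) for k in range(1, 21)]   # hypothetical non-zero values

A = csc_to_tree(JA, IA, DA)
rows_col4 = [e.row for e in A[3].col]   # A(4)%COL(:)%ROW
```

After the conversion, each column carries its own entries, so distributing the top-level array A distributes rows, values, and column extents together.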

Distribution of a TREE
    !HPF$ DISTRIBUTE A(BLOCK)
    !HPF$ DISTRIBUTE A(INDIRECT(/1,2,3,2,1,2,3,1/))
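The effect of the two distributions on the example matrix can be compared with a short sketch. It assumes three processors (the largest value in the INDIRECT map), a BLOCK size of ceil(n / nprocs), and the per-column non-zero counts of the 8-column example; the function names are hypothetical.

```python
import math

SIZES = [2, 2, 1, 3, 3, 4, 2, 3]          # non-zeros per column of the example
MAP = [1, 2, 3, 2, 1, 2, 3, 1]            # INDIRECT(/1,2,3,2,1,2,3,1/)


def block_owner(i, n, nprocs):
    """Owner (1-based) of element i under BLOCK, block size ceil(n / nprocs)."""
    size = math.ceil(n / nprocs)
    return (i - 1) // size + 1


def loads(owner_of, sizes, nprocs):
    """Number of non-zero values stored on each processor."""
    load = [0] * nprocs
    for col, size in enumerate(sizes, start=1):
        load[owner_of(col) - 1] += size
    return load


block_load = loads(lambda c: block_owner(c, 8, 3), SIZES, 3)
indirect_load = loads(lambda c: MAP[c - 1], SIZES, 3)
```

Because a whole column moves with its top-level element, a single DISTRIBUTE directive on A fixes the placement of every value in the tree.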

Example of improvement

With a TREE:
    !HPF$ DISTRIBUTE A(BLOCK)
    !HPF$ INDEPENDENT
    FORALL (I = 3:N-2)
      A(I)%COL(:)%VAL = A(I-2)%COL(:)%VAL + A(I+2)%COL(:)%VAL
    END FORALL

Generated code: communications on frontiers only (as SHADOW in HPF2)
    local_bound(A(:), lb, ub)
    TMP(lb:ub) = Local Copy of Local Part(A(lb:ub))
    Shadow_Update(TMP(:), -2, +2)
    local_bound(A(3:N-2), lb, ub)
    DO I = lb, ub
      A(I)%COL(:)%VAL = TMP(I-2)%COL(:)%VAL + TMP(I+2)%COL(:)%VAL
    END DO

With indirect access arrays:
    !HPF$ DISTRIBUTE DA(GEN_BLOCK(array))
    !HPF$ INDEPENDENT
    FORALL (I = 3:N-2)
      DA(IA(I):IA(I+1)-1) = DA(IA(I-2):IA(I-1)-1) + DA(IA(I+2):IA(I+3)-1)
    END FORALL

Generated code: global copy + broadcast of DA, because IA(I-2) and IA(I-1)-1 are unknown at compile-time
    TMP(:) = Global Copy with BCAST(DA(:))
    DO I = 3, N-2
      local_bound(DA(IA(I):IA(I+1)-1), lb, ub)
      DO J = lb, ub
        DA(J) = TMP(J1) + TMP(J2)
      END DO
    END DO

[Figure: implementation layers within ADAPTOR. Arrays: DALIB over MPI. Trees/Derived Types: DALIB and TriDenT over MPI.]

Matrix-Vector Product, 4096x4096, IBM SP2-LaBRI [results figure]

Advantages:
• Fewer indirections
• Fewer unknown alignments
• Better compile-time analysis (locality and dependence)
• Generic (defined by the user)
• Low overhead
Disadvantages:
• Not necessarily implemented in HPF compilers: portability
• Need to rewrite irregular code (with derived types)

Inspection-Execution
• Inspection: scan the program to be analyzed in order to collect useful information
• Execution: execute the actual computations according to the optimized scheme derived from the inspected information
• Related work (often iterative schemes):
  • PARTI: iterative scheme
  • CHAOS: iterative and adaptive scheme (by steps); integrated in Fortran D and the Vienna Fortran Compilation System
  • PILAR: iterative and multi-phase scheme, basic element = section; PARADIGM compiler
  • ADAPTOR: TRACE directive, dynamic adaptive scheme

Source loop:
    DO STEP = 1, S
      DO I = 1, N
        A(I) = B(INDEX(I))
      END DO
      Modify B
    END DO

INSPECTION
    DO I = 1, N
      if (A(I) is local) then
        Add INDEX(I) to local_index
      end if
    END DO
    Exchange info on local_index (what indexes to send, to receive)

EXECUTION
    DO STEP = 1, S
      Gather (B(local_index(:)) into Copy_B)
      I_local = 1
      DO I = 1, N
        if (A(I) is local) then
          A(I) = Copy_B(I_local)
          I_local = I_local + 1
        end if
      END DO
      Modify B
    END DO
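The inspection and execution phases can be sketched as a sequential simulation. The cyclic ownership of A, the "Modify B" step (adding 1.0 to every element), and all function names are assumptions made for this example; a plain list copy stands in for the Gather communication.

```python
def inspect(index, owner_a, pid):
    """Inspection: B entries needed by process pid, in local iteration order."""
    return [index[i] for i in range(len(index)) if owner_a(i) == pid]


def execute(a, b, n, local_index, owner_a, pid):
    """Execution: gather the needed B entries once, then run the local iterations."""
    copy_b = [b[k] for k in local_index]     # stands in for Gather
    i_local = 0
    for i in range(n):
        if owner_a(i) == pid:                # "A(I) is local"
            a[i] = copy_b[i_local]
            i_local += 1


def sequential(b, index, steps):
    b = list(b)
    a = [0.0] * len(index)
    for _ in range(steps):
        for i in range(len(index)):
            a[i] = b[index[i]]
        b = [x + 1.0 for x in b]             # assumed "Modify B"
    return a


def inspector_executor(b, index, steps, nprocs):
    b = list(b)
    n = len(index)
    owner = lambda i: i % nprocs             # assumed cyclic distribution of A
    a = [0.0] * n
    # The inspection runs once; its result is reused at every step.
    schedules = [inspect(index, owner, pid) for pid in range(nprocs)]
    for _ in range(steps):
        for pid in range(nprocs):
            execute(a, b, n, schedules[pid], owner, pid)
        b = [x + 1.0 for x in b]
    return a


INDEX = [3, 0, 2, 2, 1, 0]                   # 0-based indirection array
B0 = [10.0, 20.0, 30.0, 40.0]
ref = sequential(B0, INDEX, 3)
par = inspector_executor(B0, INDEX, 3, 2)
```

The inspection cost is paid once and amortized over the S steps, which is why these schemes suit iterative programs in which INDEX does not change.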

HPF2: communication optimizations with active processor sets
• ON HOME directive: to control the computation mapping

Source loop:
    !HPF$ ALIGN (I) WITH A(I) :: B, C
    !HPF$ INDEPENDENT
    DO I = 1, N
    !HPF$ ON HOME (A(I))
      C(INDEX(I)) = A(I) * B(I)
    END DO

Without ON HOME (owner computes): two messages per iteration
    DO I = 1, N
      if (A(I) is local) then
        call Send(A(I) to Owner( C(INDEX(I)) ))
        call Send(B(I) to Owner( C(INDEX(I)) ))
      end if
      if (C(INDEX(I)) is local) then
        call Receive(TMP1 from Owner( A(I) ))
        call Receive(TMP2 from Owner( A(I) ))
        C(INDEX(I)) = TMP1 * TMP2
      end if
    END DO

With ON HOME (A(I)): one message per iteration
    DO I = 1, N
      if (A(I) is local) then
        TMP = A(I) * B(I)
        call Send(TMP to Owner( C(INDEX(I)) ))
      end if
      if (C(INDEX(I)) is local) then
        call Receive(TMP from Owner( A(I) ))
        C(INDEX(I)) = TMP
      end if
    END DO

Irregular Active Processor Sets
    !HPF$ ALIGN A(*,K) with B(K)
    B(K) = Sum(A(K,:))
• Active processors for the example matrix (columns I to VIII):
    K = 1: ON HOME A(1,I) + ON HOME A(1,V)
    K = 2: ON HOME A(2,II) + ON HOME A(2,V)
    K = 3: ON HOME A(3,III)
• Extensions to the ON HOME directive:
    !HPF$ ON HOME (A(K,:))
    !HPF$ ON HOME (A(K,INDEX(K)))
    !HPF$ ON HOME (A(K,J), J=I:VIII, J .eq. K .or. A(K,J) .ne. 0.0)
    FORALL(J=I:VIII, J .eq. K .or. A(K,J) .ne. 0.0)
• Fewer active processors in collective communications
• Fewer communications (reduction or broadcast)
• Fewer synchronizations
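The set selected by the extended ON HOME directive can be computed as in the sketch below. It uses the sparsity pattern of the 8x8 example matrix (row numbers of the non-zeros in each column) and, by default, assumes one column per processor; the function `active_set` and the alternative owner map are hypothetical.

```python
# Row numbers of the non-zero entries in each column of the example matrix.
ROWS_BY_COLUMN = {
    1: [1, 5], 2: [2, 5], 3: [3], 4: [4, 6, 8],
    5: [1, 2, 5], 6: [4, 6, 7, 8], 7: [6, 7], 8: [4, 6, 8],
}


def active_set(k, owner):
    """Processors owning some A(K,J) with J == K or A(K,J) != 0."""
    cols = {j for j, rows in ROWS_BY_COLUMN.items() if j == k or k in rows}
    return sorted({owner(j) for j in cols})


one_column_each = lambda j: j                      # assumed: processor J owns column J
INDIRECT = [1, 2, 3, 2, 1, 2, 3, 1]                # assumed alternative mapping
by_map = lambda j: INDIRECT[j - 1]

set_k1 = active_set(1, one_column_each)
set_k2 = active_set(2, one_column_each)
set_k3 = active_set(3, one_column_each)
set_k4_mapped = active_set(4, by_map)
```

Only the processors in the returned set take part in the reduction for row K, instead of all eight.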

Cholesky Example: TREE and Set (Matrix with 65024 columns)
    DO K = 1, N
    !HPF$ ON HOME (A(K,J), J = 1:K, J.eq.K .or. A(K,J) .ne. 0.0), NEW(TMP), BEGIN
      allocate (TMP(N))
      TMP(:) = 0.0
    !HPF$ INDEPENDENT, REDUCTION (TMP(:))
      DO J = 1, K-1
        IF (A(K,J) .ne. 0.0) THEN
          CMOD (TMP, A(:,J))
        END IF
      END DO
      A(:,K) = A(:,K) + TMP(:)
      CDIV (A(:,K))
    !HPF$ END ON
    END DO
[Results figure: 2D-Grid 255x255, IBM SP2-LaBRI]

Irregular Iteration Space
    !HPF$ DISTRIBUTE A(:,BLOCK)
    !HPF$ INDEPENDENT, REDUCTION(B)
    DO J = 1, K-1
      IF (A(K,J) .ne. 0.0) THEN
        …
      END IF
    END DO
[Results figure: 2D-Grid 255x255, IBM SP2-LaBRI]

Loop with Partial Loop-Carried Dependencies
• Loop-carried dependencies:
    DO I = 1, N
      DO J = 1, I-1
        A(I) = A(I) + A(J)
      END DO
    END DO
• Partial loop-carried dependencies:
    DO I = 1, N
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          A(I) = A(I) + A(J)
        END IF
      END DO
    END DO
• Precomputable partial loop-carried dependencies: PPLD loop
  • TEST is never modified

PPLD Loop
    DO I = 1, N
    !HPF$ ON HOME (A(J), J=I .or. TEST(I,J))
      B = 0.0
    !HPF$ INDEPENDENT, REDUCTION(B)
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + B
    !HPF$ END ON
    END DO

PPLD Loop Scheduling
• Associates one iteration with one task
• Precomputable partial loop-carried dependencies = task graph
• Scheduling problem in the HPF context:
  • Known mapping (HPF data distribution => task mapping)
  • Data distribution => possible multi-processor tasks
  • "Scheduling multi-processor tasks on dedicated processors"
• Related work:
  • Complexity: Drozdowski 97, Krämer 95: NP-hard problem
  • Wennink 95: scheduling algorithm
  • PYRROS / RAPID libraries: precomputable task graph with mono-processor tasks (inspection-execution)

Scheduling Tasks Associated with a PPLD Loop
1) DAG generation: new SCHEDULE directive
2) Scheduling: simple and Wennink's scheduling
3) Execution: static execution / dynamic execution, single-thread / multi-thread execution
4) Experimental results

SCHEDULE directive
Dependencies between iterations (inspection-execution):
    !HPF$ SCHEDULE (J = 1:I-1, TEST(I,J) )
    DO I = 1, N
    !HPF$ ON HOME (A(J), J=I .or. TEST(I,J))
      B = 0.0
    !HPF$ INDEPENDENT, REDUCTION(B)
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + B
    !HPF$ END ON
    END DO
[Figure: task graph of the loop iterations]

Distributed Scheduling Algorithms
• Simple scheduling: local tasks only
• List for task execution on each processor
• Order in task scheduling: priority criteria based on the critical path
• Problem of scheduling coherence between processors: prevent deadlock
• Step-by-step scheduling algorithm
[Figure: task graph and resulting task lists on processors a, b, c, d]

Scheduling
• Wennink's scheduling: multi-processor tasks + insertion principle
[Figure: schedules obtained with the simple and Wennink's algorithms]
Complexity:
                    Simple        Wennink
    Computations    O(N log N)    O(N²)
    Memory          O(|E|)        O(N² + |E|)

Static Execution / Dynamic Execution
• HPF context: task costs not known at compile-time => unit costs
• Static critical path = longest path (in edges) to the virtual "End" vertex
• Static scheduling: static order of execution
• Iterative program: the first iteration records times, then re-scheduling
• Dynamic scheduling
[Figure: task graph with measured times t1 to t11 and the resulting execution orders on processors a, b, c, d]
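With unit costs, the static critical-path priority of a task is the number of edges on its longest path to the virtual End vertex. A minimal sketch follows; the example dependence graph and the function name `static_priorities` are assumptions, and each task without successors is assumed to have a single edge to End.

```python
def static_priorities(preds):
    """Longest path, counted in edges, from each task to the virtual End vertex."""
    succs = {i: [] for i in preds}
    for i, ps in preds.items():
        for j in ps:
            succs[j].append(i)
    prio = {}
    # Dependences go from lower to higher iteration numbers, so visiting tasks
    # in decreasing order guarantees that successors are handled first.
    for i in sorted(preds, reverse=True):
        prio[i] = 1 + max((prio[s] for s in succs[i]), default=0)
    return prio


# Assumed example graph: preds[i] lists the iterations that i depends on.
PREDS = {1: [], 2: [], 3: [], 4: [], 5: [1, 2], 6: [4], 7: [6], 8: [4, 6]}

prio = static_priorities(PREDS)
# Static order: highest priority first, ties broken by iteration number.
static_order = sorted(prio, key=lambda i: (-prio[i], i))
```

A dynamic scheme would replace the unit costs with the times recorded during the first iteration and recompute the same longest-path quantity.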

Single Thread / MultiThread Execution
• 2 independent tasks (Task K and Task K') on the same processor
• Same priority: which task first?
  • Single thread: the lower rank first
  • MultiThread: both
• User-mode thread system: Marcel from PM² HighPerf
• Overlapping communications by computations
[Figure: timelines on processors 0, 1, 2; legend: computations, waiting for communication, communications]

Experimental Results: Matrix with 261121 columns
• Cholesky on a sparse matrix with column-block access
• Irregular data structure: TREE
• Distribution: INDIRECT (minimizing communications)
• Versions compared:
  • VSet: V0 + Set
  • Stat: VSet + SCHEDULE (static simple scheduling)
  • Dyn: VSet + SCHEDULE (dynamic simple scheduling)
  • Stat_th: Stat + Threads
  • W: VSet + SCHEDULE (dynamic Wennink's scheduling)
[Results figure: 2D-Grid 511x511, IBM SP2-LaBRI]

Irregular Progressive PREFIX Operation
• Irregular progressive PREFIX operation: found in PPLD loops
• Exploit independencies with specific communication schemes
• Irregular Coefficient:

Irregular Progressive PREFIX Operation
[Figure: synchronous REDUCTION compared with asynchronous communication]

Irregular Progressive PREFIX Operation
PREFIX directive/clause: differs from the REDUCTION clause

Source loop:
    !HPF$ PREFIX(B)
    DO I = 1, N
      B = 0.0
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + B
    END DO
Inner-loop clause: !HPF$ INDEPENDENT, REDUCTION(B) versus !HPF$ INDEPENDENT, PREFIX(B)

Equivalent loop:
    DO I = 1, N
      DO J = I+1, N
        IF (TEST(J,I)) THEN
          A(J) = A(J) + A(I)
        END IF
      END DO
    END DO

Generated code with REDUCTION(B):
    DO I = 1, N (Set(I))
      B = 0.0
      DO J = lb, ub (ON HOME A(J))
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + REDUCTION(B)
    END DO

Generated code with PREFIX(B):
    Inspection(A,TEST)
    DO I = lb, ub (ON HOME A(I))
      Finalize(A(I)) (receive the contributions previously sent)
      DO J = I+1, N
        IF (TEST(J,I)) THEN
          A'(J) = A'(J) + A(I) (send when ready)
        END IF
      END DO
    END DO
[Results figure: IBM SP2-LaBRI]

Irregular Progressive PREFIX Operation: Cholesky Example
Irregular coef. = 0.1%
[Results figure: 2D-Grid 511x511, IBM SP2-LaBRI]

Conclusion
• TREE: irregular data structure, more information at compile-time
  • Locality and dependence analysis => TriDenT
• Inspection/Execution: some information is still not known at compile-time => CoLUMBO
  • Irregular active processor sets: fundamental inspection/execution, up to a factor of 10
  • Irregular iteration space: minor improvement
  • Loop with partial loop-carried dependencies:
    • DAG associated with loop iterations
    • Semi-automatic task scheduling at run-time
    • PREFIX operation
  • Inspection costs repaid with only one iteration
• Experimental results: efficiency close to hand-written codes (time ratio between 1.25 and 2.5)

Perspectives
• Integration in an HPF compiler: preliminary experiments
  • TREE: ADAPTOR
  • Set inspection/execution, PREFIX inspection/execution: NESTOR (Silber 98)
• Transposition to other parallel languages:
  • Irregular data structures: always a problem => TREE
  • Irregular iteration space
  • OpenMP: virtual shared memory => data distribution
  • Irregular active processor sets
