This lecture discusses translating elementary HPF examples into MPI code, emphasizing Distributed Array Descriptor (DAD) models, array descriptors, process grids, and more.
Issues in Translation of High Performance Fortran
Bryan Carpenter
NPAC at Syracuse University, Syracuse, NY 13244
dbc@npac.syr.edu
Goals of this lecture • Discuss translation of some elementary HPF examples to MPI code. Illustrate the need for a Distributed Array Descriptor (DAD). • Develop an abstract model of a DAD, and show how it can be used to translate simple codes.
Contents of Lecture • Introduction. • Translation of simple HPF fragment to SPMD. • The problem of procedures. • Requirements for an array descriptor. • Groups. • Process grids. • Restricted groups. • Range objects. • A DAD
A simple HPF program
• Here is a simple HPF program:
  !HPF$ PROCESSORS P(4)
        REAL A(50)
  !HPF$ DISTRIBUTE A(BLOCK) ONTO P
        FORALL (I = 1:50) A(I) = 1.0 * I
• We want to translate this to an MPI program.
Translation of simple program
  INTEGER W_RANK, W_SIZE, ERRCODE
  INTEGER BLK_SIZE
  PARAMETER (BLK_SIZE = (50 + 3)/4)
  REAL A(BLK_SIZE)
  INTEGER BLK_START, BLK_COUNT
  INTEGER L, I

  CALL MPI_INIT(ERRCODE)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)

  IF (W_RANK < 4) THEN
    BLK_START = W_RANK * BLK_SIZE
    IF (50 - BLK_START >= BLK_SIZE) THEN
      BLK_COUNT = BLK_SIZE
    ELSEIF (50 - BLK_START > 0) THEN
      BLK_COUNT = 50 - BLK_START
    ELSE
      BLK_COUNT = 0
    ENDIF
    DO L = 1, BLK_COUNT
      I = BLK_START + L
      A(L) = 1.0 * I
    ENDDO
  ENDIF

  CALL MPI_FINALIZE(ERRCODE)
Setting up the environment
• Associated code:
  INTEGER W_RANK, W_SIZE, ERRCODE
  . . .
  CALL MPI_INIT(ERRCODE)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
  . . .
  CALL MPI_FINALIZE(ERRCODE)
Allocating segment of the distributed array
• Associated statements are:
  INTEGER BLK_SIZE
  PARAMETER (BLK_SIZE = (50 + 3)/4)
  REAL A(BLK_SIZE)
• Segment size is the ceiling of 50/4, i.e. BLK_SIZE = 13.
Testing whether this processor holds a segment
• Associated code is:
  IF (W_RANK < 4) THEN
    . . .
  ENDIF
• Assumes the number of MPI processes is at least the size of the largest processor arrangement in the HPF program.
Computing parameters of locally held segment
• Associated code:
  INTEGER BLK_START, BLK_COUNT
  . . .
  BLK_START = W_RANK * BLK_SIZE
  IF (50 - BLK_START >= BLK_SIZE) THEN
    BLK_COUNT = BLK_SIZE
  ELSEIF (50 - BLK_START > 0) THEN
    BLK_COUNT = 50 - BLK_START
  ELSE
    BLK_COUNT = 0
  ENDIF
• BLK_START: position of the segment in the global index space. BLK_COUNT: number of elements in the segment.
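As a check on these formulae, here is a small standalone sketch (plain C++, outside the generated Fortran; blockSegment is our own helper name) that computes the per-rank parameters:

  #include <cstdio>

  // Block-distribution parameters for rank r, extent n, np processes,
  // following the formulae in the generated code above.
  void blockSegment(int n, int np, int r, int* start, int* count) {
    int blkSize = (n + np - 1) / np;        // ceiling(n / np)
    *start = r * blkSize;                   // first global index (0-based) held by rank r
    if (n - *start >= blkSize)
      *count = blkSize;                     // full segment
    else if (n - *start > 0)
      *count = n - *start;                  // short final segment
    else
      *count = 0;                           // this rank holds no elements
  }

  int main() {
    // For N = 50 on 4 processes this prints:
    //   rank 0: start 0,  count 13
    //   rank 1: start 13, count 13
    //   rank 2: start 26, count 13
    //   rank 3: start 39, count 11
    for (int r = 0; r < 4; r++) {
      int start, count;
      blockSegment(50, 4, r, &start, &count);
      std::printf("rank %d: start %d, count %d\n", r, start, count);
    }
    return 0;
  }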
Loop over local elements
• Associated code:
  INTEGER L, I
  . . .
  DO L = 1, BLK_COUNT
    I = BLK_START + L
    A(L) = 1.0 * I
  ENDDO
An HPF procedure
• A superficially similar program:
        SUBROUTINE INIT(D)
        REAL D(50)
  !HPF$ INHERIT D
        FORALL (I = 1:50) D(I) = 1.0 * I
        END
• The INHERIT directive means the mapping of the dummy argument should be the same as that of the actual argument, whatever that mapping is.
Procedure call with block-distributed actual
  !HPF$ PROCESSORS P(4)
        REAL A(50)
  !HPF$ DISTRIBUTE A(BLOCK) ONTO P
        CALL INIT(A)
• Mapping of D: D inherits the block distribution of A, so each process holds one contiguous segment of 13 elements (11 on the last process).
Procedure call with cyclically distributed actual
  !HPF$ PROCESSORS P(4)
        REAL A(50)
  !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
        CALL INIT(A)
• Mapping of D: D inherits the cyclic distribution of A, so consecutive elements are dealt round-robin to the four processes.
Procedure call with strided alignment of actual
  !HPF$ PROCESSORS P(4)
        REAL A(100)
  !HPF$ DISTRIBUTE A(BLOCK) ONTO P
        CALL INIT(A(1:100:2))
• Mapping of D: D comprises every second element of the block-distributed A, so its 50 elements fall in irregular blocks (13 or 12 per process), separated by stride 2 in local memory.
Procedure call with row-aligned actual
  !HPF$ PROCESSORS Q(2, 2)
        REAL A(6, 50)
  !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
        CALL INIT(A(2, :))
• Mapping of D: D is row 2 of A, so it is mapped only to the slice of Q that holds that row; its 50 elements are block-distributed over the second grid dimension.
The problem • Somehow INIT must be translated to deal with data having any of these decompositions, or any legal HPF mapping. Actual mapping not known until run-time. • Not an artificial example. Libraries that operate on distributed arrays (eg the communication libraries discussed later) must deal with exactly this situation.
Requirements for an array descriptor • Seems that to translate procedure calls, need some non-trivial data structure to describe layout of actual argument. • The Distributed Array Descriptor (DAD). • Want to understand requirements and best organization of a DAD. • Adopt object-oriented principles to build an abstract design.
Distributed array dimensions • Obvious structural feature of HPF array: multidimensional. • Each dimension mapped independently as: • Collapsed (serial), • Simple block distribution, • Simple cyclic distribution, • Block cyclic distribution, • General block distribution (HPF 2.0), • Linear alignment to any of above.
Converting block distribution to cyclic distribution
• Block distribution:
  BLK_SIZE = (N + NP - 1) / NP
  . . .
  BLK_START = R * BLK_SIZE
  . . .
  IF (N - BLK_START >= BLK_SIZE) THEN
    BLK_COUNT = BLK_SIZE
  ELSEIF (N - BLK_START > 0) THEN
    BLK_COUNT = N - BLK_START
  ELSE
    BLK_COUNT = 0
  ENDIF
  . . .
  I = BLK_START + L
• Cyclic distribution:
  BLK_SIZE = (N + NP - 1) / NP
  . . .
  BLK_START = R
  . . .
  BLK_COUNT = (N - R + NP - 1) / NP
  . . .
  I = BLK_START + NP * (L - 1) + 1
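To compare the two formula sets concretely, here is a small standalone sketch (plain C++, our own helper names) listing the global indices each rank visits for N = 10 elements on NP = 3 processes:

  #include <cstdio>

  // Global indices visited by rank r under the block formulae (n elems, np procs).
  void blockIndices(int n, int np, int r) {
    int blkSize = (n + np - 1) / np;
    int start = r * blkSize;
    int count = (n - start >= blkSize) ? blkSize : (n - start > 0 ? n - start : 0);
    std::printf("block  rank %d:", r);
    for (int l = 1; l <= count; l++)
      std::printf(" %d", start + l);             // I = BLK_START + L
    std::printf("\n");
  }

  // Global indices visited by rank r under the cyclic formulae.
  void cyclicIndices(int n, int np, int r) {
    int count = (n - r + np - 1) / np;
    std::printf("cyclic rank %d:", r);
    for (int l = 1; l <= count; l++)
      std::printf(" %d", r + np * (l - 1) + 1);  // I = BLK_START + NP*(L-1) + 1
    std::printf("\n");
  }

  int main() {
    // For N = 10, NP = 3 this prints
    //   block  rank 0: 1 2 3 4    rank 1: 5 6 7 8   rank 2: 9 10
    //   cyclic rank 0: 1 4 7 10   rank 1: 2 5 8     rank 2: 3 6 9
    for (int r = 0; r < 3; r++) blockIndices(10, 3, r);
    for (int r = 0; r < 3; r++) cyclicIndices(10, 3, r);
    return 0;
  }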
Distributed ranges • Have different kinds of array dimension (distribution format). • Each kind of dimension has a different set of formulae for segment layout, index computation, etc. • OO interpretation: virtual functions on a class hierarchy. • Implement as the Range hierarchy. • DAD for rank-r array will contain r Range objects, one per dimension.
Dealing with “hidden” dimensions of sections • Array may be mapped to slice of grid: • Rank-1 section only has one range object. Need some other structure to represent embedding in subgrid.
DAD groups • Need a group concept similar to MPI_Group. • Want lightweight structure for representing arbitrary slices of process grids. • Object representing grid itself needs multidimensional structure (cf Cartesian Communicator in MPI).
Representing processor arrangements
• In an OO runtime descriptor, expect an entity like a processor arrangement to become an object.
• Use C++ for definiteness:
  !HPF$ PROCESSORS P(4)
becomes
  Procs1 p(4);
and
  !HPF$ PROCESSORS Q(2, 2)
becomes
  Procs2 q(2, 2);
Interface of Procs and Dimension
  class Procs {
  public:
    int member() const;
    Dimension dim(const int d) const;
    . . .
  };

  class Dimension {
  public:
    int size() const;
    int crd() const;
    . . .
  };
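For illustration only (not the actual Adlib implementation), a minimal Procs1 and Dimension could sit directly on MPI_COMM_WORLD; everything below other than the method names in the interface above is our own assumption:

  #include <mpi.h>
  #include <cstdio>

  // Sketch of a one-dimensional process grid over MPI_COMM_WORLD.
  class Dimension {
  public:
    Dimension(int size, int crd) : size_(size), crd_(crd) {}
    int size() const { return size_; }   // extent of this grid dimension
    int crd() const  { return crd_; }    // this process's coordinate, or -1 if outside
  private:
    int size_, crd_;
  };

  class Procs1 {
  public:
    explicit Procs1(int n) : n_(n) {
      MPI_Comm_rank(MPI_COMM_WORLD, &rank_);
    }
    // True if this process belongs to the arrangement.
    int member() const { return rank_ < n_; }
    // The single dimension of a Procs1; the coordinate is just the world rank.
    Dimension dim(const int /*d*/) const {
      return Dimension(n_, member() ? rank_ : -1);
    }
  private:
    int n_, rank_;
  };

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    Procs1 p(4);
    if (p.member())
      std::printf("coordinate %d of %d\n", p.dim(0).crd(), p.dim(0).size());
    MPI_Finalize();
    return 0;
  }

With this sketch, the test p.member() and the expression p.dim(0).crd() in the next slide behave like the explicit W_RANK arithmetic in the earlier MPI translation.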
Using Procs in translation
  INTEGER W_RANK, . . .
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
  . . .
  IF (W_RANK < 4) THEN
    BLK_START = W_RANK * BLK_SIZE
    . . .
  ENDIF
becomes:
  Procs1 p(4);
  . . .
  if (p.member()) {
    blk_start = p.dim(0).crd() * blk_size;
    . . .
  }
Restricted process groups • Slice of process grid to which array section may be mapped. • Portion of grid selected by specifying subset of dimension coordinates. • Lightweight representation. Use bitmask to represent dimension set.
Representation of subgrids
  example   dimension set        lead process   tuple
  a)        {dim(0), dim(1)}     0              (p, 11₂, 0)
  b)        {dim(0)}             8              (p, 10₂, 8)
  c)        {dim(1)}             1              (p, 01₂, 1)
  d)        {}                   6              (p, 00₂, 6)
• The tuple is (process grid, dimension-set bitmask in binary, lead process).
The Group class
  class Group {
  public:
    Group(const Procs& p);
    void restrict(Dimension d, const int coord);
    int member() const;
    . . .
  };
• Lightweight: the implementation is about 3 words, so Group objects can be freely copied and discarded. The DAD contains a Group object.
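A plausible three-word realization, sketched standalone over a hard-wired 2 x 2 grid (our own struct and member names, not the Adlib source):

  #include <cstdio>

  // Illustration only: rank layout is row-major, so dim 0 has stride 2, dim 1 has stride 1.
  struct GroupSketch {
    int worldRank;  // rank of this process in the grid (0..3)
    int dims;       // word 2: bitmask, bit d set  =>  dimension d still free
    int lead;       // word 3: rank of the lead process of the slice
                    // (word 1 of the real descriptor would reference the Procs grid)

    explicit GroupSketch(int myRank) : worldRank(myRank), dims(0x3), lead(0) {}

    // Restrict dimension d (0 or 1) to a fixed coordinate.
    void restrictDim(int d, int coord) {
      int stride = (d == 0) ? 2 : 1;
      dims &= ~(1 << d);        // d is no longer free
      lead += coord * stride;   // fold the fixed coordinate into the lead
    }

    // Member iff this process matches the lead on every restricted dimension.
    bool member() const {
      int diff = worldRank - lead;
      int c0 = diff / 2, c1 = diff % 2;            // coordinates relative to the lead
      if (!(dims & 0x1) && c0 != 0) return false;  // dim 0 restricted, coordinate differs
      if (!(dims & 0x2) && c1 != 0) return false;  // dim 1 restricted, coordinate differs
      return diff >= 0 && c0 < 2 && c1 < 2;
    }
  };

  int main() {
    // Restrict to the processes with coordinate 1 in dimension 0
    // (ranks 2 and 3 of the 2 x 2 grid): only they report membership.
    for (int r = 0; r < 4; r++) {
      GroupSketch g(r);
      g.restrictDim(0, 1);
      std::printf("rank %d member: %d\n", r, (int) g.member());
    }
    return 0;
  }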
Ranges • In DAD, range object describes extent and distribution format of one array dimension. • Expect a class hierarchy of ranges. • Each subclass corresponds to a different kind of distribution format for an array dimension.
Interface of the Range class
  class Range {
  public:
    int size() const;
    Dimension dim() const;
    int volume() const;

    Range subrng(const int extent, const int base, const int stride = 1) const;

    void block(Block* blk, const int crd) const;
    void location(Location* loc, const int glb) const;
    . . .
  };
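To connect this interface with the earlier distribution formulae, here is a standalone sketch of what block() plausibly computes for simple block and cyclic ranges (the function names are ours; the Block fields anticipate the structure shown on a later slide):

  #include <cstdio>

  struct Block {             // same fields as the Block structure shown later
    int count;               // number of locally held elements
    int glb_bas, glb_stp;    // base and step of the global index
    int sub_bas, sub_stp;    // base and step of the local subscript
  };

  // Plausible block() for a simple BLOCK range of extent n over np processes.
  void blockRangeBlock(int n, int np, int crd, Block* b) {
    int blkSize = (n + np - 1) / np;
    int start = crd * blkSize;
    b->count = (n - start >= blkSize) ? blkSize : (n - start > 0 ? n - start : 0);
    b->glb_bas = start;  b->glb_stp = 1;   // global index = glb_bas + glb_stp*l (0-based l)
    b->sub_bas = 0;      b->sub_stp = 1;   // local subscript = sub_bas + sub_stp*l
  }

  // Plausible block() for a simple CYCLIC range of extent n over np processes.
  void cyclicRangeBlock(int n, int np, int crd, Block* b) {
    b->count = (n - crd + np - 1) / np;
    b->glb_bas = crd;  b->glb_stp = np;    // global indices crd, crd+np, crd+2*np, ...
    b->sub_bas = 0;    b->sub_stp = 1;
  }

  int main() {
    Block b;
    blockRangeBlock(50, 4, 3, &b);
    std::printf("block:  count %d, glb_bas %d\n", b.count, b.glb_bas);           // 11, 39
    cyclicRangeBlock(50, 4, 3, &b);
    std::printf("cyclic: count %d, glb_bas %d, glb_stp %d\n",
                b.count, b.glb_bas, b.glb_stp);                                  // 12, 3, 4
    return 0;
  }

Either version makes the generic loop on the following slide (i = b.glb_bas + b.glb_stp*l + 1; a[b.sub_bas + b.sub_stp*l] = ...) visit exactly the elements that the hand-written code visited.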
Translating simple HPF program to C++
Source:
  !HPF$ PROCESSORS P(4)
        REAL A(50)
  !HPF$ DISTRIBUTE A(BLOCK) ONTO P
        FORALL (I = 1:50) A(I) = 1.0 * I
Translation:
  Procs1 p(4);
  BlockRange x(50, p.dim(0));

  float* a = new float [x.volume()];

  if (p.member()) {
    Block b;
    x.block(&b, p.dim(0).crd());

    for (int l = 0; l < b.count; l++) {
      const int i = b.glb_bas + b.glb_stp * l + 1;
      a [b.sub_bas + b.sub_stp * l] = 1.0 * i;
    }
  }
Features of C++ translation
• The arguments of the BlockRange constructor are the extent of the range and the process dimension over which it is distributed.
• The fields of Block define the count of the local loop and the base and step for the local subscript and the global index.
• If the distribution directive is changed to:
  !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
  the only change is that the declaration of x becomes:
  CyclicRange x(50, p.dim(0));
• Apparently we are making progress toward writing code that works for any distribution.
The Block and Location structures
  struct Block {
    int count;      // number of elements in the local block
    int glb_bas;    // base of the global index
    int glb_stp;    // step of the global index
    int sub_bas;    // base of the local subscript
    int sub_stp;    // step of the local subscript
  };

  struct Location {
    int sub;        // local subscript of the element
    int crd;        // coordinate of the process that holds it
    . . .
  };
Memory strides
• Fortran 90 program:
        REAL B(100, 100)
        . . .
        CALL FOO(B(1, :))

        SUBROUTINE FOO(C)
        REAL C(:)
        . . .
        END
• The first dimension of B is the most rapidly varying in memory.
• The second dimension therefore has memory stride 100, which is inherited by C.
• Fortran compilers normally pass a dope vector containing r extents and r strides for a rank-r argument.
• The stride is not really a property of the distributed range, so it is stored separately in the DAD.
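A quick standalone illustration of the column-major addressing behind this (plain C++, not part of the translation scheme):

  #include <cstdio>

  int main() {
    const int n = 100;
    // Fortran stores B(100,100) in column-major order: element B(i,j),
    // with 1-based i and j, lives at linear offset (i-1) + n*(j-1).
    auto offset = [n](int i, int j) { return (i - 1) + n * (j - 1); };

    // Elements of the section B(1,:) are therefore n = 100 apart in memory:
    std::printf("B(1,1) at %d, B(1,2) at %d, B(1,3) at %d\n",
                offset(1, 1), offset(1, 2), offset(1, 3));   // 0, 100, 200
    return 0;
  }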
A DAD • Abstract DAD for a rank-r array is an object containing: • A distribution group, and • r range objects, and • r integer strides.
Interface of the DAD class
  struct DAD {
    DAD(const int _rank, const Group& _group, Map _maps []);

    const Group& grp() const;
    Range rng(const int d) const;
    int str(const int d) const;
    . . .
  };
Map structure
  struct Map {
    Map(Range _range, const int _stride);

    Range range;    // range of one array dimension
    int stride;     // memory stride of that dimension
  };
Translating HPF program with inherited mapping
Source:
        SUBROUTINE INIT(D)
        REAL D(50)
  !HPF$ INHERIT D
        FORALL (I = 1:50) D(I) = 1.0 * I
        END
Translation:
  void init(float* d, DAD* d_dad) {
    Group p = d_dad->grp();
    if (p.member()) {
      Range x = d_dad->rng(0);
      int s = d_dad->str(0);

      Block b;
      x.block(&b, p.dim(0).crd());

      for (int l = 0; l < b.count; l++) {
        const int i = b.glb_bas + b.glb_stp * l + 1;
        d [s * (b.sub_bas + b.sub_stp * l)] = 1.0 * i;
      }
    }
  }
Translation of call with block-distributed actual
Source:
  !HPF$ PROCESSORS P(4)
        REAL A(50)
  !HPF$ DISTRIBUTE A(BLOCK) ONTO P
        CALL INIT(A)
Translation:
  Procs1 p(4);
  BlockRange x(50, p.dim(0));

  float* a = new float [x.volume()];

  Map maps [1];
  maps [0] = Map(x, 1);

  DAD dad(1, p, maps);

  init(a, &dad);
Translation of call with cyclically distributed actual
Source:
  !HPF$ PROCESSORS P(4)
        REAL A(50)
  !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
        CALL INIT(A)
Translation:
  Procs1 p(4);
  CyclicRange x(50, p.dim(0));

  float* a = new float [x.volume()];

  Map maps [1];
  maps [0] = Map(x, 1);

  DAD dad(1, p, maps);

  init(a, &dad);
Translation of call with strided alignment of actual
Source:
  !HPF$ PROCESSORS P(4)
        REAL A(100)
  !HPF$ DISTRIBUTE A(BLOCK) ONTO P
        CALL INIT(A(1:100:2))
Translation:
  Procs1 p(4);
  BlockRange x(100, p.dim(0));

  float* a = new float [x.volume()];

  // Create DAD for section a(::2)
  Range x2 = x.subrng(50, 0, 2);

  Map maps [1];
  maps [0] = Map(x2, 1);

  DAD dad(1, p, maps);

  init(a, &dad);
Translation of call with row-aligned actual
Source:
  !HPF$ PROCESSORS Q(2, 2)
        REAL A(6, 50)
  !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
        CALL INIT(A(2, :))
Translation:
  Procs2 q(2, 2);
  BlockRange x(6, q.dim(0)), y(50, q.dim(1));

  float* a = new float [x.volume() * y.volume()];

  // Create DAD for section a(1, :)
  Location i;
  x.location(&i, 1);

  Group p = q;
  p.restrict(q.dim(0), i.crd);

  Map maps [1];
  maps [0] = Map(y, x.volume());

  DAD dad(1, p, maps);

  init(a + i.sub, &dad);
Other features of the Adlib DAD
• Support for block-cyclic distributions. Local loops traversing distributed data need an outer loop over the set of local blocks: the LocBlocksIndex iterator class; an offset() method computes the overall memory offset.
• Support for ghost extensions and other memory layouts. The shift in memory for the ghost region is not included in the local subscript, which remains universal (memory-layout-independent); disp(), offset() and step() methods are applied to the local subscript instead.
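For flavour, here is a standalone sketch of the doubly nested loop a block-cyclic distribution forces; it uses our own variable names and does not use the actual LocBlocksIndex interface:

  #include <cstdio>

  int main() {
    const int n = 20, np = 3, blk = 2, crd = 1;  // CYCLIC(2) over 3 processes, coordinate 1

    int localSub = 0;  // local subscript; local blocks are packed contiguously
    // Outer loop: the blocks held by process crd are blocks crd, crd+np, crd+2*np, ...
    for (int k = crd; k * blk < n; k += np) {
      int lo = k * blk;                              // first global index (0-based) in this block
      int count = (n - lo < blk) ? (n - lo) : blk;   // short final block at the array end
      // Inner loop: elements within one block.
      for (int l = 0; l < count; l++, localSub++)
        std::printf("local %2d  <-  global %2d\n", localSub, lo + l);
    }
    // For these parameters, process 1 holds global indices 2,3, 8,9, 14,15
    // in local subscripts 0..5.
    return 0;
  }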
Other features of the Adlib DAD, II
• Support for loops over subranges. Additional block() methods take triplet arguments and directly traverse subranges; crds() methods define the ranges of coordinates where local blocks actually exist.
• Another feature supports the communication library: AllBlocksIndex.
• Miscellaneous inquiries and predicates, useful in general libraries and for runtime checking of programs for correctness.
Next Lecture: • Communication in Data Parallel Languages • Patterns of communication needed to implement language constructs. • Libraries that support these communication patterns.