
Issues in Translation of High Performance Fortran




  1. Issues in Translation of High Performance Fortran Bryan Carpenter NPAC at Syracuse University Syracuse, NY 13244 dbc@npac.syr.edu

  2. Goals of this lecture
  • Discuss translation of some elementary HPF examples to MPI code. Illustrate the need for a Distributed Array Descriptor (DAD).
  • Develop an abstract model of a DAD, and show how it can be used to translate simple codes.

  3. Contents of Lecture
  • Introduction.
  • Translation of simple HPF fragment to SPMD.
  • The problem of procedures.
  • Requirements for an array descriptor.
  • Groups.
  • Process grids.
  • Restricted groups.
  • Range objects.
  • A DAD.

  4. A simple HPF program
  • Here is a simple HPF program:

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
            FORALL (I = 1:50) A(I) = 1.0 * I

  • We want to translate this to an MPI program.

  5. Translation of simple program

            INTEGER W_RANK, W_SIZE, ERRCODE
            INTEGER BLK_SIZE
            PARAMETER (BLK_SIZE = (50 + 3)/4)
            REAL A(BLK_SIZE)
            INTEGER BLK_START, BLK_COUNT
            INTEGER L, I
            CALL MPI_INIT(ERRCODE)
            CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
            CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
            IF (W_RANK < 4) THEN
              BLK_START = W_RANK * BLK_SIZE
              IF (50 - BLK_START >= BLK_SIZE) THEN
                BLK_COUNT = BLK_SIZE
              ELSEIF (50 - BLK_START > 0) THEN
                BLK_COUNT = 50 - BLK_START
              ELSE
                BLK_COUNT = 0
              ENDIF
              DO L = 1, BLK_COUNT
                I = BLK_START + L
                A(L) = 1.0 * I
              ENDDO
            ENDIF
            CALL MPI_FINALIZE(ERRCODE)

  6. Setting up the environment
  • Associated code:

            INTEGER W_RANK, W_SIZE, ERRCODE
            . . .
            CALL MPI_INIT(ERRCODE)
            CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
            CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
            . . .
            CALL MPI_FINALIZE(ERRCODE)

  7. Allocating segment of the distributed array
  • Associated statements are:

            INTEGER BLK_SIZE
            PARAMETER (BLK_SIZE = (50 + 3)/4)
            REAL A(BLK_SIZE)

  • Segment size is the ceiling of 50/4, namely 13; the expression (50 + 3)/4 computes this ceiling in integer arithmetic.

  8. Testing whether this processor holds a segment
  • Associated code is:

            IF (W_RANK < 4) THEN
              . . .
            ENDIF

  • Assumes the number of MPI processes is at least the size of the largest processor arrangement in the HPF program.

  9. Computing parameters of locally held segment
  • Associated code:

            INTEGER BLK_START, BLK_COUNT
            . . .
            BLK_START = W_RANK * BLK_SIZE
            IF (50 - BLK_START >= BLK_SIZE) THEN
              BLK_COUNT = BLK_SIZE
            ELSEIF (50 - BLK_START > 0) THEN
              BLK_COUNT = 50 - BLK_START
            ELSE
              BLK_COUNT = 0
            ENDIF

  • BLK_START is the position of the segment in the global index space; BLK_COUNT is the number of elements in the segment.
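  As a sanity check on this arithmetic, the following minimal stand-alone C++ sketch (variable names invented for illustration, not part of the lecture's code) reproduces the block parameters for all four processes:

      #include <algorithm>
      #include <cstdio>

      int main() {
        const int n = 50, np = 4;
        const int blk_size = (n + np - 1) / np;   // ceiling(50 / 4) = 13
        for (int rank = 0; rank < np; rank++) {
          // Position of this rank's segment in the global index space.
          const int blk_start = rank * blk_size;
          // Number of elements actually held; the last rank may hold fewer.
          const int blk_count = std::max(0, std::min(blk_size, n - blk_start));
          std::printf("rank %d: start %d, count %d\n", rank, blk_start, blk_count);
        }
        return 0;   // prints counts 13, 13, 13, 11
      }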

  10. Loop over local elements
  • Associated code:

            INTEGER L, I
            . . .
            DO L = 1, BLK_COUNT
              I = BLK_START + L
              A(L) = 1.0 * I
            ENDDO

  11. An HPF procedure
  • Superficially similar program:

            SUBROUTINE INIT(D)
              REAL D(50)
      !HPF$   INHERIT D
              FORALL (I = 1:50) D(I) = 1.0 * I
            END

  • The INHERIT directive means the mapping of the dummy argument should be the same as that of the actual argument, whatever that is.

  12. Procedure call with block-distributed actual

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
            CALL INIT(A)

  • Mapping of D: [figure]

  13. Procedure call with cyclically distributed actual

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
            CALL INIT(A)

  • Mapping of D: [figure]

  14. Procedure call with strided alignment of actual

      !HPF$ PROCESSORS P(4)
            REAL A(100)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
            CALL INIT(A(1:100:2))

  • Mapping of D: [figure]

  15. Procedure call with row-aligned actual

      !HPF$ PROCESSORS Q(2, 2)
            REAL A(6, 50)
      !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
            CALL INIT(A(2, :))

  • Mapping of D: [figure]

  16. The problem
  • Somehow INIT must be translated to deal with data having any of these decompositions, or any legal HPF mapping. The actual mapping is not known until run time.
  • This is not an artificial example. Libraries that operate on distributed arrays (e.g. the communication libraries discussed later) must deal with exactly this situation.

  17. Requirements for an array descriptor
  • It seems that to translate procedure calls, we need some non-trivial data structure to describe the layout of the actual argument.
  • This is the Distributed Array Descriptor (DAD).
  • We want to understand the requirements on a DAD and its best organization.
  • Adopt object-oriented principles to build an abstract design.

  18. Distributed array dimensions
  • Obvious structural feature of an HPF array: it is multidimensional.
  • Each dimension is mapped independently as one of:
    • Collapsed (serial),
    • Simple block distribution,
    • Simple cyclic distribution,
    • Block-cyclic distribution,
    • General block distribution (HPF 2.0),
    • Linear alignment to any of the above.

  19. Converting block distribution to cyclic distribution
  • Block version:

            BLK_SIZE = (N + NP - 1) / NP
            . . .
            BLK_START = R * BLK_SIZE
            . . .
            IF (N - BLK_START >= BLK_SIZE) THEN
              BLK_COUNT = BLK_SIZE
            ELSEIF (N - BLK_START > 0) THEN
              BLK_COUNT = N - BLK_START
            ELSE
              BLK_COUNT = 0
            ENDIF
            . . .
            I = BLK_START + L

  • Cyclic version:

            BLK_SIZE = (N + NP - 1) / NP
            . . .
            BLK_START = R
            . . .
            BLK_COUNT = (N - R + NP - 1) / NP
            . . .
            I = BLK_START + NP * (L - 1) + 1
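  The two global-index formulae can be compared directly. Here is a small self-contained C++ sketch (the helper names are my own, not from the lecture) computing the global index of local element L on process R under each distribution:

      #include <cstdio>

      // Global index I of local element l (1-based) on process r (0-based),
      // for an extent-n array over np processes; valid for l up to the
      // local element count.
      int block_global(int r, int l, int n, int np) {
        const int blk_size = (n + np - 1) / np;
        return r * blk_size + l;                 // I = BLK_START + L
      }

      int cyclic_global(int r, int l, int np) {
        return r + np * (l - 1) + 1;             // I = BLK_START + NP * (L - 1) + 1
      }

      int main() {
        // Process 1 of 4, first local element of an extent-50 array:
        std::printf("block:  %d\n", block_global(1, 1, 50, 4));  // 14
        std::printf("cyclic: %d\n", cyclic_global(1, 1, 4));     //  2
        return 0;
      }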

  20. Distributed ranges
  • We have different kinds of array dimension (distribution formats).
  • Each kind of dimension has a different set of formulae for segment layout, index computation, etc.
  • OO interpretation: virtual functions on a class hierarchy. Implement as the Range hierarchy (a sketch of the idea follows).
  • The DAD for a rank-r array will contain r Range objects, one per dimension.
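  To make the OO interpretation concrete, here is a minimal C++ sketch of the idea (illustrative names, not the actual Adlib classes): the translated loop is written once against an abstract interface, and virtual dispatch supplies the right layout formulae for each format.

      // Per-dimension layout parameters, as on the Block slide later.
      struct Block { int count, glb_bas, glb_stp, sub_bas, sub_stp; };

      // Abstract base: one subclass per distribution format.
      class AnyRange {
      public:
        virtual ~AnyRange() {}
        virtual int volume() const = 0;                   // local segment size
        virtual void block(Block* b, int crd) const = 0;  // local layout params
      };

      // This loop works unchanged for block, cyclic, or block-cyclic ranges.
      void init_local(float* a, const AnyRange& x, int crd) {
        Block b;
        x.block(&b, crd);
        for (int l = 0; l < b.count; l++) {
          const int i = b.glb_bas + b.glb_stp * l + 1;  // global index
          a[b.sub_bas + b.sub_stp * l] = 1.0f * i;      // local subscript
        }
      }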

  21. Dealing with “hidden” dimensions of sections
  • Array may be mapped to a slice of a grid: [figure]
  • A rank-1 section only has one range object. Need some other structure to represent the embedding in a subgrid.

  22. DAD groups
  • Need a group concept similar to MPI_Group.
  • Want a lightweight structure for representing arbitrary slices of process grids.
  • The object representing the grid itself needs multidimensional structure (cf. the Cartesian communicator in MPI).

  23. Representing processor arrangements
  • In an OO runtime descriptor, we expect an entity like a processor arrangement to become an object. Using C++ for definiteness:

      !HPF$ PROCESSORS P(4)

    becomes

      Procs1 p(4);

    and

      !HPF$ PROCESSORS Q(2, 2)

    becomes

      Procs2 q(2, 2);

  24. Hierarchy of process grids
  [figure]

  25. Interface of Procs and Dimension

      class Procs {
      public:
        int member() const;
        Dimension dim(const int d) const;
        . . .
      };

      class Dimension {
      public:
        int size() const;
        int crd() const;
        . . .
      };

  26. Using Procs in translation

            INTEGER W_RANK, . . .
            CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
            . . .
            IF (W_RANK < 4) THEN
              BLK_START = W_RANK * BLK_SIZE
              . . .
            ENDIF

    becomes:

      Procs1 p(4);
      . . .
      if (p.member()) {
        blk_start = p.dim(0).crd() * blk_size;
        . . .
      }

  27. Restricted process groups
  • A slice of a process grid, to which an array section may be mapped.
  • The portion of the grid is selected by specifying a subset of the dimension coordinates.
  • Lightweight representation: use a bitmask to represent the dimension set.

  28. Example restricted groups in 2-dimensional grid
  [figure]

  29. Representation of subgrids

      example   dimension set       lead process   tuple
      a)        {dim(0), dim(1)}    0              (p, 11₂, 0)
      b)        {dim(0)}            8              (p, 10₂, 8)
      c)        {dim(1)}            1              (p, 01₂, 1)
      d)        {}                  6              (p, 00₂, 6)

  30. The Group class

      class Group {
      public:
        Group(const Procs& p);
        void restrict(Dimension d, const int coord);
        int member() const;
        . . .
      };

  • Lightweight: the implementation occupies about 3 words, so Group objects can be freely copied and discarded (a toy illustration follows). The DAD contains a Group object.
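  As a toy illustration of the "about 3 words" claim (my own sketch, not the actual Adlib implementation), a restricted group over a fixed 4 x 4 grid with row-major ranks can be stored as just a dimension-set bitmask plus a lead rank:

      #include <cstdio>

      const int EXT0 = 4, EXT1 = 4;            // grid extents
      const int STRIDE0 = EXT1, STRIDE1 = 1;   // rank strides per dimension

      struct ToyGroup {
        unsigned mask = 3;  // bit d set <=> dimension d still varies in the group
        int lead = 0;       // rank of the lowest-coordinate member
      };

      // Fix dimension d of g at the given coordinate (cf. Group::restrict).
      void restrict_dim(ToyGroup& g, int d, int coord) {
        g.mask &= ~(1u << d);
        g.lead += coord * (d == 0 ? STRIDE0 : STRIDE1);
      }

      // Membership test for a process rank (cf. Group::member).
      bool member(const ToyGroup& g, int rank) {
        const int c0 = rank / STRIDE0, c1 = rank % STRIDE0;
        const int l0 = g.lead / STRIDE0, l1 = g.lead % STRIDE0;
        if (!(g.mask & 1u) && c0 != l0) return false;  // dimension 0 fixed
        if (!(g.mask & 2u) && c1 != l1) return false;  // dimension 1 fixed
        return true;
      }

      int main() {
        ToyGroup g;
        restrict_dim(g, 0, 2);  // the group is now row 2 of the grid
        for (int r = 0; r < EXT0 * EXT1; r++)
          if (member(g, r)) std::printf("rank %d\n", r);  // prints 8, 9, 10, 11
        return 0;
      }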

  31. Ranges
  • In the DAD, a range object describes the extent and distribution format of one array dimension.
  • Expect a class hierarchy of ranges.
  • Each subclass corresponds to a different kind of distribution format for an array dimension.

  32. A hierarchy of ranges
  [figure]

  33. Interface of the Range class

      class Range {
      public:
        int size() const;
        Dimension dim() const;
        int volume() const;
        Range subrng(const int extent, const int base,
                     const int stride = 1) const;
        void block(Block* blk, const int crd) const;
        void location(Location* loc, const int glb) const;
        . . .
      };

  34. Translating simple HPF program to C++
  • Source:

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
            FORALL (I = 1:50) A(I) = 1.0 * I

  • Translation:

      Procs1 p(4);
      BlockRange x(50, p.dim(0));
      float* a = new float [x.volume()];
      if (p.member()) {
        Block b;
        x.block(&b, p.dim(0).crd());
        for (int l = 0; l < b.count; l++) {
          const int i = b.glb_bas + b.glb_stp * l + 1;
          a [b.sub_bas + b.sub_stp * l] = 1.0 * i;
        }
      }

  35. Features of C++ translation
  • Arguments of the BlockRange constructor are the extent of the range and the process dimension.
  • Fields of Block define the count of the local loop, and the base and step for the local subscript and the global index.
  • If the distribution directive is changed to:

      !HPF$ DISTRIBUTE A(CYCLIC) ONTO P

    the only change is that the declaration of x becomes:

      CyclicRange x(50, p.dim(0));

    Apparently we are making progress toward writing code that works for any distribution.

  36. The Block and Location structures

      struct Block {
        int count;
        int glb_bas;
        int glb_stp;
        int sub_bas;
        int sub_stp;
      };

      struct Location {
        int sub;
        int crd;
        . . .
      };
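  To pin down what the Block fields mean, here is a hedged C++ sketch of what a block() method might compute for the two simple formats. This is my own reconstruction from the formulae on earlier slides; the actual Adlib methods may differ in detail:

      #include <algorithm>

      struct Block { int count, glb_bas, glb_stp, sub_bas, sub_stp; };

      // Simple block format: crd is the process coordinate, n the array
      // extent, np the number of processes.
      // Global index (1-based) = glb_bas + glb_stp * l + 1.
      void block_params(Block* b, int crd, int n, int np) {
        const int blk_size = (n + np - 1) / np;
        b->count = std::max(0, std::min(blk_size, n - crd * blk_size));
        b->glb_bas = crd * blk_size;  b->glb_stp = 1;
        b->sub_bas = 0;               b->sub_stp = 1;
      }

      // Simple cyclic format: local element l holds global index
      // crd + np * l + 1.
      void cyclic_params(Block* b, int crd, int n, int np) {
        b->count = (crd < n) ? (n - crd + np - 1) / np : 0;
        b->glb_bas = crd;  b->glb_stp = np;
        b->sub_bas = 0;    b->sub_stp = 1;
      }

      int main() {
        Block b;
        cyclic_params(&b, 1, 50, 4);  // process 1 of 4, extent 50
        // b.count == 13; global indices 2, 6, 10, ..., 50
        return b.count == 13 ? 0 : 1;
      }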

  37. Memory strides
  • Fortran 90 program:

            REAL B(100, 100)
            . . .
            CALL FOO(B(1, :))

            SUBROUTINE FOO(C)
              REAL C(:)
              . . .
            END

  • The first dimension of B is most rapidly varying in memory.
  • The second dimension has memory stride 100, which is inherited by C.
  • Fortran compilers normally pass a dope vector containing r extents and r strides for a rank-r argument.
  • The stride is not really a property of the distributed range. Store it separately in the DAD.
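  The stride arithmetic can be seen in a plain C++ analogue of this example, with the column-major indexing written out by hand (illustrative only):

      #include <cstdio>

      int main() {
        // Column-major layout of REAL B(100, 100): element (i, j), 1-based,
        // lives at offset (i - 1) + (j - 1) * 100.
        static float b[100 * 100];

        // The section B(1, :) starts at offset 0 and has memory stride 100,
        // exactly the stride the dummy C inherits through the dope vector.
        float* c = &b[0];
        const int c_stride = 100;
        for (int j = 0; j < 100; j++)
          c[j * c_stride] = 1.0f * (j + 1);       // assigns to B(1, j+1)

        std::printf("B(1,100) = %g\n", b[99 * 100]);  // prints 100
        return 0;
      }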

  38. A DAD
  • The abstract DAD for a rank-r array is an object containing:
    • a distribution group,
    • r range objects, and
    • r integer strides.

  39. Interface of the DAD class

      struct DAD {
        DAD(const int _rank, const Group& _group, Map _maps []);
        const Group& grp() const;
        Range rng(const int d) const;
        int str(const int d) const;
        . . .
      };

  40. Map structure

      struct Map {
        Map(Range _range, const int _stride);
        Range range;
        int stride;
      };

  41. Translating HPF program with inherited mapping
  • Source:

            SUBROUTINE INIT(D)
              REAL D(50)
      !HPF$   INHERIT D
              FORALL (I = 1:50) D(I) = 1.0 * I
            END

  • Translation:

      void init(float* d, DAD* d_dad) {
        Group p = d_dad->grp();
        if (p.member()) {
          Range x = d_dad->rng(0);
          int s = d_dad->str(0);
          Block b;
          x.block(&b, p.dim(0).crd());
          for (int l = 0; l < b.count; l++) {
            const int i = b.glb_bas + b.glb_stp * l + 1;
            d [s * (b.sub_bas + b.sub_stp * l)] = 1.0 * i;
          }
        }
      }

  42. Translation of call with block-distributed actual
  • Source:

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
            CALL INIT(A)

  • Translation:

      Procs1 p(4);
      BlockRange x(50, p.dim(0));
      float* a = new float [x.volume()];

      Map maps [1];
      maps [0] = Map(x, 1);
      DAD dad(1, p, maps);

      init(a, &dad);

  43. Translation of call with cyclically distributed actual
  • Source:

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
            CALL INIT(A)

  • Translation:

      Procs1 p(4);
      CyclicRange x(50, p.dim(0));
      float* a = new float [x.volume()];

      Map maps [1];
      maps [0] = Map(x, 1);
      DAD dad(1, p, maps);

      init(a, &dad);

  44. Translation of call with strided alignment of actual
  • Source:

      !HPF$ PROCESSORS P(4)
            REAL A(100)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
            CALL INIT(A(1:100:2))

  • Translation:

      Procs1 p(4);
      BlockRange x(100, p.dim(0));
      float* a = new float [x.volume()];

      // Create DAD for section a(::2)
      Range x2 = x.subrng(50, 0, 2);
      Map maps [1];
      maps [0] = Map(x2, 1);
      DAD dad(1, p, maps);

      init(a, &dad);

  45. Translation of call with row-aligned actual
  • Source:

      !HPF$ PROCESSORS Q(2, 2)
            REAL A(6, 50)
      !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
            CALL INIT(A(2, :))

  • Translation:

      Procs2 q(2, 2);
      BlockRange x(6, q.dim(0)), y(50, q.dim(1));
      float* a = new float [x.volume() * y.volume()];

      // Create DAD for section a(1, :)
      Location i;
      x.location(&i, 1);
      Group p = q;
      p.restrict(q.dim(0), i.crd);
      Map maps [1];
      maps [0] = Map(y, x.volume());
      DAD dad(1, p, maps);

      init(a + i.sub, &dad);

  46. Other features of the Adlib DAD
  • Support for block-cyclic distributions. Local loops traversing distributed data need an outer loop over the set of local blocks: the LocBlocksIndex iterator class. An offset() method computes the overall memory offset. (A generic sketch of the loop pattern follows.)
  • Support for ghost extensions and other memory layouts. The shift in memory for the ghost region is not included in the local subscript, which stays universal and memory-layout-independent; disp(), offset(), and step() methods are applied to the local subscript.
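  The outer-loop-over-blocks pattern for a block-cyclic dimension looks generically like this C++ sketch (the pattern only, with made-up parameters, not the LocBlocksIndex API itself):

      #include <algorithm>
      #include <cstdio>

      int main() {
        // CYCLIC(b) distribution: n elements, np processes, block size b;
        // crd is this process's coordinate (illustrative values).
        const int n = 50, np = 4, b = 4, crd = 1;
        // Outer loop over the locally held blocks...
        for (int blk_start = crd * b; blk_start < n; blk_start += np * b) {
          const int blk_len = std::min(b, n - blk_start);
          // ...inner loop over the elements of one block.
          for (int k = 0; k < blk_len; k++) {
            const int i = blk_start + k;  // 0-based global index
            std::printf("global %d\n", i);
          }
        }
        return 0;
      }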

  47. Other features of the Adlib DAD, II
  • Support for loops over subranges. Additional block() methods take triplet arguments and directly traverse subranges. crds() methods define the ranges of coordinates where local blocks actually exist.
  • Other features to support the communication library: AllBlocksIndex.
  • Miscellaneous inquiries and predicates. Useful in general libraries, and for runtime checking of programs for correctness.

  48. Next Lecture: Communication in Data Parallel Languages
  • Patterns of communication needed to implement language constructs.
  • Libraries that support these communication patterns.
