Introduction to MPI Programming (Part III) Michael Griffiths, Deniz Savas & Alan Real

Presentation Transcript


  1. Introduction to MPI Programming (Part III) Michael Griffiths, Deniz Savas & Alan Real, January 2006

  2. Overview • Review blocking and non-blocking communications • Collective Communication • Broadcast, Scatter & Gather of data • Reduction Operations • Barrier Synchronisation • Processor topologies • Patterns for Parallel Programming • Exercises

  3. Blocking operations • Relate to when the operation has completed • The call only returns from the subroutine when the operation has completed

  4. Non-blocking communication • Separate communication into three phases: • Initiate non-blocking communication • Do some work: • Perhaps involving other communications • Wait for non-blocking communication to complete.
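
A minimal C sketch of the three phases, assuming a ring-style exchange in which each rank sends sendbuf to its right neighbour and receives from its left neighbour (the function and variable names are illustrative, not from the slides):

    #include <mpi.h>

    /* Sketch: non-blocking exchange with left/right neighbours.          */
    /* Phase 1 initiates, phase 2 is free for other work, phase 3 waits.  */
    void exchange(double *sendbuf, double *recvbuf, int n,
                  int left, int right)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... do some work that does not touch sendbuf or recvbuf ... */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }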

  5. Collective Communications (one for all, all for one!!!) • Collective communication is defined as communication that involves all the processes in a group. Collective communication routines can be divided into the following broad categories: • Barrier synchronisation • Broadcast from one to all • Scatter from one to all • Gather from all to one • Scatter/Gather from all to all • Global reduction (distributed elementary operations) • IMPORTANT NOTE: Collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other. • This is important to avoid deadlocks caused by interference.
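
As a hedged illustration of two of these categories, the sketch below broadcasts a problem size from rank 0 and then reduces partial sums back onto rank 0 (the variable names and the per-rank "work" are placeholders):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, n = 0;
        double local, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) n = 100;                       /* root holds the data */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* broadcast: one to all */

        local = (double)rank * n;                     /* placeholder local work */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);                /* global reduction to root */

        if (rank == 0) printf("total = %f\n", total);
        MPI_Finalize();
        return 0;
    }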

  6. Timers • Double precision MPI functions • Fortran: DOUBLE PRECISION t1, then t1 = MPI_WTIME() • C: double t1; t1 = MPI_Wtime(); • C++: double t1; t1 = MPI::Wtime(); • Time is measured in seconds (wall-clock time). • The time to perform a task is measured by consulting the timer before and after.
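
A small C sketch of the before/after pattern (the task being timed is a placeholder):

    #include <mpi.h>

    /* Returns the wall-clock time, in seconds, taken by task(). */
    double time_task(void (*task)(void))
    {
        double t1 = MPI_Wtime();   /* consult the timer before ... */
        task();
        double t2 = MPI_Wtime();   /* ... and after                */
        return t2 - t1;
    }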

  7. Practice Session 4: diffusion example • Arrange processes to communicate round a ring. • Each process stores a copy of its rank in an integer variable. • Each process communicates this value to its right neighbour and receives a value from its left neighbour. • Each process computes the sum of all the values received. • Repeat for the number of processes involved and print out the sum stored at each process.
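
One possible outline for this exercise (a sketch, not the official model answer; MPI_Sendrecv is used here to keep the ring exchange deadlock-free):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, left, right, pass, recvd, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        right = (rank + 1) % nprocs;              /* right neighbour */
        left  = (rank - 1 + nprocs) % nprocs;     /* left neighbour  */

        pass = rank;                              /* value to circulate */
        for (int i = 0; i < nprocs; i++) {
            /* send to the right, receive from the left */
            MPI_Sendrecv(&pass,  1, MPI_INT, right, 0,
                         &recvd, 1, MPI_INT, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recvd;                         /* accumulate what arrived  */
            pass = recvd;                         /* and forward it next time */
        }

        printf("rank %d: sum = %d\n", rank, sum); /* same total on every rank */
        MPI_Finalize();
        return 0;
    }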

  8. Generating Cartesian Topologies • MPI_Cart_create • Makes a new communicator to which topology information has been attached • MPI_Cart_coords • Determines process coords in cartesian topology given rank in group • MPI_Cart_shift • Returns the shifted source and destination ranks, given a shift direction and amount

  9. MPI_Cart_create syntax • Fortran: INTEGER comm_old, ndims, dims(*), comm_cart, ierror LOGICAL periods(*), reorder CALL MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart, ierror) • C: int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart); • C++: MPI::Cartcomm MPI::Intracomm::Create_cart(int ndims, const int dims[], const bool periods[], bool reorder) const;
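
A short C sketch of a typical call, assuming a periodic 4 x 4 process grid (the dimensions and names are illustrative, and the job is assumed to have been started with 16 processes):

    #include <mpi.h>

    /* Build a 2-D periodic cartesian communicator from MPI_COMM_WORLD. */
    MPI_Comm make_grid(void)
    {
        MPI_Comm cart;
        int dims[2]    = {4, 4};   /* 4 x 4 process grid               */
        int periods[2] = {1, 1};   /* wrap around in both dimensions   */
        int reorder    = 1;        /* let MPI re-rank for the topology */

        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart);
        return cart;
    }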

  10. MPI_Comm_rank Syntax MPI_Comm_rank - Determines the rank of the calling process in the communicator. • int MPI_Comm_rank(MPI_Comm comm, int *rank) • MPI_COMM_RANK(COMM,RANK,IERROR) • int Comm::Get_rank() const

  11. Transform Rank to Coordinates: MPI_Cart_coords syntax • Fortran: CALL MPI_CART_COORDS(INTEGER COMM, INTEGER RANK, INTEGER MAXDIMS, INTEGER COORDS(*), INTEGER IERROR) • C: int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords); • C++: void MPI::Cartcomm::Get_coords(int rank, int maxdims, int coords[]) const;

  12. Transform Coordinates to Rank: MPI_Cart_rank syntax • Fortran: CALL MPI_CART_RANK(INTEGER COMM, INTEGER COORDS(*), INTEGER RANK, INTEGER IERROR) • C: int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank); • C++: int MPI::Cartcomm::Get_cart_rank(const int coords[]) const;

  13. MPI_Cart_shift syntax • Fortran: CALL MPI_CART_SHIFT(INTEGER COMM, INTEGER DIRECTION, INTEGER DISP, INTEGER RANK_SOURCE, INTEGER RANK_DEST, INTEGER IERROR) • C: int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest); • C++: void MPI::Cartcomm::Shift(int direction, int disp, int &rank_source, int &rank_dest) const;
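
The sketch below strings the three routines together on a cartesian communicator cart (assumed to have been created as on slide 9); the direction and displacement values are illustrative:

    #include <mpi.h>

    void neighbours(MPI_Comm cart)
    {
        int rank, same, coords[2];
        int up, down, left, right;

        MPI_Comm_rank(cart, &rank);

        MPI_Cart_coords(cart, rank, 2, coords);    /* rank -> coordinates        */
        MPI_Cart_rank(cart, coords, &same);        /* coordinates -> rank        */
                                                   /* (round trip: same == rank) */

        MPI_Cart_shift(cart, 0, 1, &up,   &down);  /* neighbours in direction 0  */
        MPI_Cart_shift(cart, 1, 1, &left, &right); /* neighbours in direction 1  */
    }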

  14. Mapping 4x4 Cartesian Topology Onto Processor Ranks

  15. Topologies: Examples • See the diffusion example • See the Cartesian example

  16. Examples for Parallel Programming • Master/slave • E.g. the share-work example • E.g. the Ising model example • Communicating Sequential Elements pattern • E.g. the Poisson equation • Highly coupled processes • Systolic loop algorithm • E.g. the molecular dynamics (MD) example

  17. Poisson Solver Using Jacobi Iteration • Communicating Sequential Elements pattern • Operations in each component depend on partial results in neighbouring components. [Diagram: two rows of slave threads, with data exchange between neighbouring slaves.]

  18. Layered Decomposition of a 2D Array • Distribute the 2D array across processors • Every processor stores all N columns • Rows are allocated amongst the processors: N/nproc rows per processor • Each processor has a left processor and a right processor • Each processor has a maximum and minimum vertex (row) that it stores • Jacobi update: U_new(i,j) = ( U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) ) / 4 • Each processor has a "ghost" layer • Used in the calculation of the update (see above) • Obtained from the neighbouring left and right processors • Each processor passes its top and bottom layers to the neighbouring processors • These become the neighbours' ghost layers

  19. [Diagram: rows 1 to N+1 of the grid distributed across Processors 1-4; each processor owns the rows between its pmin and pmax (p1min, p2max, etc.), sends its top layer to the processor above and its bottom layer to the processor below, and receives the matching ghost layers in return.]
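
A hedged C sketch of one ghost-layer exchange for this decomposition: rows 1..nrows of u are owned locally, rows 0 and nrows+1 are the ghost layers, N is the number of columns, and up/down are the neighbouring ranks (MPI_PROC_NULL at the domain boundaries). All of these names are illustrative.

    #include <mpi.h>

    void exchange_ghosts(double *u, int N, int nrows,
                         int up, int down, MPI_Comm comm)
    {
        /* Send the top owned row up; receive the bottom ghost row from below. */
        MPI_Sendrecv(&u[nrows * N], N, MPI_DOUBLE, up,   0,
                     &u[0],         N, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);

        /* Send the bottom owned row down; receive the top ghost row from above. */
        MPI_Sendrecv(&u[1 * N],           N, MPI_DOUBLE, down, 1,
                     &u[(nrows + 1) * N], N, MPI_DOUBLE, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }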

  20. Master Slave Thread • A computation is required in which independent computations are performed, perhaps repeatedly, on all elements of some ordered data. • Example: image processing, where the computation is performed on different sets of pixels within an image. [Diagram: a master thread exchanging data with several slave threads.]
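
A minimal master/slave sketch in C, assuming one work item per slave (the work itself is a placeholder):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                        /* master: hand out work, gather results */
            for (int dest = 1; dest < nprocs; dest++) {
                int task = dest;                /* placeholder work item */
                MPI_Send(&task, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            }
            for (int src = 1; src < nprocs; src++) {
                int result;
                MPI_Recv(&result, 1, MPI_INT, src, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("result from slave %d: %d\n", src, result);
            }
        } else {                                /* slave: independent computation */
            int task, result;
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            result = task * task;               /* placeholder computation */
            MPI_Send(&result, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }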

  21. Highly Coupled Efficient Element Exchange • Highly Coupled Efficient Element Exchange using Systolic loop techniques • Extreme example of Communicating Sequential Elements Pattern

  22. Systolic Loop • Distribute Elements Over Processors • Three buffers • Local elements • Travelling Elements (local elements at start)‏ • Send buffer • Loop over number of processors • Transfer travelling elements • Interleave send/receive to prevent deadlock • Send contents of send buffer to next proc • Receive buffer from previous proc to travelling elements • Point travelling elements to send buffer • Allow local elements to interact with travelling elements • Accumulate reduced computations over processors

  23. Systolic Loop Element Pump • First cycle of 3 for a 4-processor systolic loop. [Diagram: Procs 1-4 arranged in a ring, each holding its Local Elements plus the Moving Elements just received from the previous processor.]
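
A sketch of the element pump described on slide 22; local, moving and sendbuf are the three buffers, n is the number of elements per buffer, and interact() stands for the application-specific work (all names are illustrative):

    #include <mpi.h>
    #include <string.h>

    void interact(double *local, double *moving, int n);  /* application-specific */

    void systolic_loop(double *local, double *moving, double *sendbuf,
                       int n, int rank, int nprocs, MPI_Comm comm)
    {
        int next = (rank + 1) % nprocs;          /* ring neighbours */
        int prev = (rank - 1 + nprocs) % nprocs;

        for (int step = 0; step < nprocs; step++) {
            /* Copy the travelling elements into the send buffer, then pump
               them on; the combined send/receive avoids deadlock.          */
            memcpy(sendbuf, moving, n * sizeof(double));
            MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, next, 0,
                         moving,  n, MPI_DOUBLE, prev, 0,
                         comm, MPI_STATUS_IGNORE);

            /* Let the local elements interact with the newly arrived ones. */
            interact(local, moving, n);
        }
    }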

  24. Practice Sessions 5 and 6 • Defining and Using Processor Topologies • Patterns for parallel computing

  25. Further Information • All MPI routines have a UNIX man page: • Use C-style definition for Fortran/C/C++: • E.g. “man MPI_Finalize” will give correct syntax and information for Fortran, C and C++ calls. • Designing and building parallel programs (Ian Foster)‏ • http://www-unix.mcs.anl.gov/dbpp/ • Standard documents: • http://www.mpi-forum.org/ • Many books and information on web. • EPCC documents.
