Parallel Computing Using MPI



  1. Parallel Computing Using MPI AnCAD, Inc. 逸奇科技

  2. Main sources of the Material • MacDonald, N., Minty, E., Harding, T. and Brown, S. “Writing Message-Passing Parallel Programs with MPI,” Edinburgh Parallel Computing Centre, United Kingdom. (Available at http://www.lrz-muenchen.de/services/software/parallel/mpi/epcc-course/) • “MPICH: A Portable MPI Implementation,” web page available at http://www-unix.mcs.anl.gov/mpi/mpich/, Argonne National Laboratory, USA. • “Message Passing Interface (MPI) Forum,” web page available at http://www.mpi-forum.org/, Message Passing Interface Forum. • Gropp, W., Lusk, E. and Skjellum, A. (1999). Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, USA. • Grama, A., Gupta, A., Karypis, G. and Kumar, V. (2003). Introduction to Parallel Computing, 2nd Edition, Addison-Wesley, USA.

  3. Computational Science • Computer • A stimulus to the growth of science • Computational Science • A branch of science between Theoretical Science and Experimental Science A nuclear surface test A supercomputer system (ASCI Red)

  4. Trends of Computing (1) • Computing requirements never stop growing • Scientific computing • Fluid dynamics • Structural analysis • Electromagnetic analysis • Optimization of design • Bio-informatics • Astrophysics • Commercial applications • Network security • Web and database data mining • Optimizing business and market decisions F-117 fighter (2D analysis in the 1970s) B-2 bomber (3D analysis in the 1980s)

  5. Trends of Computing (2) • CPU performance • Moore’s Law: the number of transistors doubles roughly every 18 months • A single CPU’s performance grows exponentially • The cost of an advanced single-CPU computer increases more rapidly than its performance Progress of Intel CPUs Typical cost-performance curves (a sketch)

  6. Trends of Computing (3) • Solutions • A bigger supercomputer, or • A cluster of small computers • Today’s supercomputers • Also use multiple processors • Need parallel-computing skills to exploit their performance • A cluster of small computers • Smaller budget • More flexibility • Requires more knowledge of parallel computing Avalon (http://www.lanl.gov)

  7. Parallel Computer Architectures • Shared-memory architecture • Typical SMP (Symmetric Multi-Processing) • Programmers use threads or OpenMP • Distributed-memory architecture • Needs message passing among processors • Programmers use MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) Shared-memory architecture Distributed-memory architecture

  8. Distributed Shared-Memory Computers • Assemblages of shared-memory parallel computers • Constellations: • Np >= Nn*Nn (more processors per node than nodes) • Clusters or MPP: • Np < Nn*Nn A Constellation Np: # of processors (CPUs) Nn: # of nodes (shared-memory computers) MPP: Massively Parallel Processors ASCI Q (http://www.lanl.gov) A Cluster or MPP
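As a hypothetical illustration of the distinction: a machine with 16 nodes of 64 processors each has Np = 1024 and Nn*Nn = 256, so Np >= Nn*Nn and it counts as a constellation; a machine with 256 nodes of 2 processors each has Np = 512 and Nn*Nn = 65,536, so Np < Nn*Nn and it counts as a cluster.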

  9. Trend of Parallel Architectures • Clusters are gaining popularity rapidly. • Clusters have appeared in the TOP500 list since 1997. • More than 40% of the TOP500 systems were clusters in Nov. 2003. • 170 of them are Intel-based clusters. • More than 70% of the TOP500 systems are message-passing based. • All TOP500 supercomputers require message passing to exploit their performance. Trend of TOP500 supercomputers’ architectures (http://www.top500.org)

  10. Top 3 of the TOP500 (Nov. 2003) • Earth Simulator • 35.86 TFLOPS • Earth Simulator Center (Japan) • ASCI Q • 13.88 TFLOPS • Los Alamos Nat’l Lab. (USA) • Virginia Tech’s X • 10.28 TFLOPS • Virginia Tech (USA) (http://www.top500.org)

  11. Earth Simulator • For climate simulation • Massively Parallel Processors • 640 computing nodes • Each node: • 8 NEC vector CPUs • 16 GB shared memory • Network • 16 GB/s inter-node bandwidth • Parallel tools: • Automatic vectorization within each CPU • OpenMP within each node • MPI or HPF across nodes Earth Simulator site Earth Simulator interconnection (http://www.es.jamstec.go.jp/)

  12. ASCI Q • For numerical nuclear testing • Cluster • Each node runs an independent operating system • 2048 computing nodes • Optionally expandable to 3072 • Each node: • 4 HP Alpha processors • 10 GB memory • Network • 250 MB/s bandwidth • ~5 µs latency ASCI Q ASCI Q site (http://www.lanl.gov/asci)

  13. Virginia Tech’s X • Scientific computing • Cluster • 1100 computing nodes • Each node: • Two 2 GHz PowerPC CPUs • 4 GB memory • Mac OS X • Network • 1.25 GB/s bandwidth • 4.5 µs latency • Parallel tools • MPI (http://www.eng.vt.edu/tcf/)

  14. MPI is a … • Library • Not a language • Functions or subroutines for Fortran and C/C++ programmers • Specification • Not a particular implementation • Computer vendors may provide specific implementations. • Free and open source code may be available. MPI_Init() MPI_Comm_rank() MPI_Comm_size() MPI_Send() MPI_Recv() MPI_Finalize() (http://www.mpi-forum.org/)

  15. Six-Function Version of MPI • MPI_Init • Initialize MPI • A must-do function at the beginning • MPI_Comm_size • Number of processors (Np) • MPI_Comm_rank • “This” processor’s ID • MPI_Send • Sending data to a (specific) processor • MPI_Recv • Receiving data from a (specific) processor • MPI_Finalize • Terminate MPI • A must-do function at the end
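A minimal sketch in C of a program that exercises all six functions, assuming it is launched with at least two processes; the payload value and the tag are arbitrary illustrative choices:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, rank;

    MPI_Init(&argc, &argv);                    /* start MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* Np: number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's ID (0..Np-1) */

    if (rank == 0) {
        int msg = 42;                          /* arbitrary payload */
        /* send one int to process 1 with tag 0 */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Status status;
        /* receive one int from process 0 with tag 0 */
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 of %d received %d\n", size, msg);
    }

    MPI_Finalize();                            /* terminate MPI */
    return 0;
}

Launched with a typical MPI launcher such as mpirun -np 2, every process runs the same executable; only the rank test makes them behave differently.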

  16. Message-Passing Programming Paradigm MPI on message-passing architecture (Edinburgh Parallel Computing Centre)

  17. Message-Passing Programming Paradigm MPI on shared-memory architecture (Edinburgh Parallel Computing Centre)

  18. Single Program Multiple Data • Single program • The same program runs on every processor. • A single program eases compilation and maintenance. • Local variables • All variables are local; there is no shared data. • Different results from MPI_Comm_rank • Each processor gets a different response from MPI_Comm_rank. • Using a sequential language • Programmers write in Fortran / C / C++. • Message passing • Message passing is the only way for processes to communicate.

  19. Single Program Multiple Data The same program runs on Processor 0, Processor 1, Processor 2 and Processor 3.

  20. Hello World!

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("Hello world.\n");
    MPI_Finalize();
    return 0;
}

An MPI program and its output

  21. Emulating General Message Passing with SPMD

#include <mpi.h>

int main(int argc, char **argv)
{
    int Size, Rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &Size);
    MPI_Comm_rank(MPI_COMM_WORLD, &Rank);
    if (Rank == 0) {
        Controller(argc, argv);   /* rank 0 acts as the controller */
    } else {
        Workers(argc, argv);      /* all other ranks act as workers */
    }
    MPI_Finalize();
    return 0;
}

(Edinburgh Parallel Computing Centre)

  22. Emulating General Message Passing with SPMD

      PROGRAM spmd
      INCLUDE 'mpif.h'
      CALL MPI_INIT(iErr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, iSize, iErr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, iErr)
      IF (iRank.EQ.0) THEN
         CALL Controller()
      ELSE
         CALL Workers()
      END IF
      CALL MPI_FINALIZE(iErr)
      END

(Edinburgh Parallel Computing Centre)

  23. MPI Messages • Messages are packets of data moving between processors • The message passing system has to be told the following information: • Sending processor • Location of data in sending processor • Data type • Data length • Receiving processor(s) • Destination location in receiving processor(s) • Destination size

  24. Message Sending Modes • How do we communicate? • Phone • Information is exchanged synchronously. • While you are talking, you must give the call your full attention. • Fax • Information is still transferred synchronously. • While the fax is being sent, you can do other work. • Once you have confirmed that the fax has completed, you can leave. • Mail • After sending the mail, you have no idea when, or whether, it is delivered.

  25. Synchronous/Asynchronous Sends • Synchronous sends • Completion implies a receipt. • Asynchronous sends • Completion only tells you that the message has left the sender. An asynchronous send (Edinburgh Parallel Computing Centre)

  26. Synchronous Blocking Sends • Only return from the subroutine call when the operation has completed. • Program waits. • Easy and safe. • Probably not efficient. A blocking send (Edinburgh Parallel Computing Centre)

  27. Synchronous Non-Blocking Sends • Return straight away. • Allow program to continue to perform other work. • Program can test or wait. • More flexible. • Probably more efficient. • Programmers should be careful. A non-blocking send (Edinburgh Parallel Computing Centre)

  28. Non-Blocking Sends w/o wait • Every non-blocking send should have a matching wait operation. • Do not modify or release the send buffer until the operation has completed. Non-blocking send without test and wait (Edinburgh Parallel Computing Centre)

  29. Collective Communications • Collective communication routines are higher level routines involving several processes at a time. • Can be built out of point-to-point communications. Broadcasting: a popular collective communication (Edinburgh Parallel Computing Centre)

  30. Collective Communications Broadcasting Reduction (e.g., summation) Barrier (Edinburgh Parallel Computing Centre)
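A minimal sketch in C of the three collectives listed above; the root rank and the payload values are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = (rank == 0) ? 100 : 0;             /* only the root starts with the data */
    /* Broadcasting: the root (rank 0) sends `value` to every process */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction: sum every process's rank into `sum` on the root */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Barrier: no process passes this point until all have reached it */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, sum of ranks = %d\n", value, sum);

    MPI_Finalize();
    return 0;
}

Every process in the communicator must make the same collective call; unlike point-to-point sends, collectives have no tag and no wildcarding.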

  31. Writing MPI Programs Most of the following original MPI training materials were developed under the Joint Information Systems Committee (JISC) New Technologies Initiative by the Training and Education Centre at Edinburgh Parallel Computing Centre (EPCC-TEC), University of Edinburgh, United Kingdom.

  32. MPI in C and FORTRAN (Edinburgh Parallel Computing Centre)

  33. Initialization of MPI • In C int MPI_Init(int *argc, char ***argv); • In FORTRAN MPI_INIT(IERROR) • Must be the first MPI routine called (Edinburgh Parallel Computing Centre)

  34. MPI Communicator • A required argument for all MPI communication calls • Processes can communicate only if they share a communicator. • The predefined communicator MPI_COMM_WORLD contains all processes. • Programmer-defined communicators are also allowed. (Edinburgh Parallel Computing Centre)

  35. Size and Rank • Size Q: How many processes are contained in a communicator? A: int MPI_Comm_size(MPI_Comm comm, int *size); • Rank Q: How do I identify different processes in my MPI program? A: int MPI_Comm_rank(MPI_Comm comm, int *rank); (Edinburgh Parallel Computing Centre)

  36. Exiting MPI • In C int MPI_Finalize(void); • In FORTRAN MPI_FINALIZE(IERROR) • Must be the last MPI routine called (Edinburgh Parallel Computing Centre)

  37. Contents of Typical MPI Messages • Data to send • Data length • Data type • Destination process • Tag • Communicator (Edinburgh Parallel Computing Centre)

  38. Data Types (Edinburgh Parallel Computing Centre)
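For C programs, the basic MPI datatypes correspond directly to C types; a partial list from the MPI standard:

    MPI_CHAR        char
    MPI_SHORT       short int
    MPI_INT         int
    MPI_LONG        long int
    MPI_UNSIGNED    unsigned int
    MPI_FLOAT       float
    MPI_DOUBLE      double
    MPI_BYTE        raw bytes (no conversion)

The datatype passed to a send or receive must match the C type of the buffer, so that MPI can convert the representation when the two ends run on different hardware.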

  39. Why is a Data Type Needed? • For heterogeneous parallel architectures: the same bit string (…011001001001100101…) can decode as 3.1416 on one machine and as garbage (@#$%?) on another unless the data type is specified. Binary data message passing (Edinburgh Parallel Computing Centre)

  40. Point-to-Point Communication • Communication between two processes. • Source process sends message to destination process. • Communication takes place within a communicator. • Destination process is identified by its rank in communicator. (Edinburgh Parallel Computing Centre)

  41. P2P Communication Modes (Edinburgh Parallel Computing Centre)

  42. MPI Communication Modes (Edinburgh Parallel Computing Centre)

  43. Sending/Receiving a Message

Sending a message:
int MPI_Ssend(void *buffer, int count, MPI_Datatype datatype,
              int destination, int tag, MPI_Comm comm);

Receiving a message:
int MPI_Recv(void *buffer, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);

(Edinburgh Parallel Computing Centre)
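A minimal sketch of the two calls used together: rank 0 synchronously sends a small buffer to rank 1, which receives it. The buffer contents, the tag, and the assumption of at least two processes are illustrative choices:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buffer[4] = {1.0, 2.0, 3.0, 4.0};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* buffer, count, datatype, destination, tag, communicator */
        MPI_Ssend(buffer, 4, MPI_DOUBLE, 1, 10, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* buffer, count, datatype, source, tag, communicator, status */
        MPI_Recv(buffer, 4, MPI_DOUBLE, 0, 10, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %g ... %g\n", buffer[0], buffer[3]);
    }

    MPI_Finalize();
    return 0;
}

Because MPI_Ssend is synchronous, it does not complete until the matching receive has started, so a missing MPI_Recv would leave rank 0 blocked.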

  44. Wildcarding • The receiver can wildcard • Receive from any source (MPI_ANY_SOURCE) • Receive with any tag (MPI_ANY_TAG) • The actual source and tag are returned in the receiver’s status parameter. • The sender cannot use wildcards • The destination (receiver) must be specified. • The tag must be specified. (Edinburgh Parallel Computing Centre)
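A hedged sketch of a wildcarded receive: rank 0 collects one message from every other process without knowing in advance who sends first, then reads the actual source and tag back from the status. The payload values are arbitrary illustrative choices:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, msg;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        msg = rank * rank;                      /* arbitrary payload */
        MPI_Send(&msg, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    } else {
        int i;
        for (i = 1; i < size; i++) {
            /* wildcard both the source and the tag */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            /* the actual source and tag come back in the status */
            printf("got %d from rank %d (tag %d)\n",
                   msg, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}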

  45. Message Order Preservation • Messages do not overtake each other. • This is true even for non-synchronous sends. (Edinburgh Parallel Computing Centre)

  46. Blocking/Non-Blocking Sends • Blocking send • Returns after completion. • Non-blocking send • Returns immediately. • Allows test and wait. A non-blocking send (Edinburgh Parallel Computing Centre)

  47. Synchronous? Blocking? • Synchronous/Asynchronous sends • Synchronous sends: • Completion implies the receive is completed. • Asynchronous sends: • Completion means the buffer can be re-used, irrespective of whether the receive is completed. • Blocking/Non-blocking sends • Blocking sends • Return only after completion. • Non-blocking sends • Return immediately. (Edinburgh Parallel Computing Centre)

  48. MPI Sending Commands (Edinburgh Parallel Computing Centre)

  49. Non-Blocking Sends • Avoids deadlock • May improve efficiency Deadlock (Edinburgh Parallel Computing Centre)

  50. Non-Blocking Communications • Step 1: • Initiate non-blocking communication. • E.g., MPI_Isend or MPI_Irecv • Step 2: • Do other work. • Step 3: • Wait for non-blocking communication to complete. • E.g., MPI_Wait (Edinburgh Parallel Computing Centre)
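A minimal sketch of the three steps above, using MPI_Isend on the sender and a plain MPI_Recv on the receiver; the payload and the "other work" placeholder are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        msg = 99;                                   /* arbitrary payload */
        /* Step 1: initiate the non-blocking send; returns immediately */
        MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* Step 2: do other work here; do NOT touch msg yet */
        /* Step 3: wait for the send to complete before reusing msg */
        MPI_Wait(&request, &status);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}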
