
An Introduction to Parallel Programming with MPI



  1. An Introduction to Parallel Programming with MPI March 22, 24, 29, 31 2005 David Adams

  2. Outline • Disclaimers • Overview of basic parallel programming on a cluster with the goals of MPI • Batch system interaction • Startup procedures • Blocking message passing • Non-blocking message passing • Collective Communications

  3. Disclaimers • I do not have all the answers. • Completion of this short course will give you enough tools to begin making use of MPI. It will not “automagically” allow your code to run on a parallel machine simply by logging in. • Some codes are easier to parallelize than others.

  4. The Goals of MPI • Design an application programming interface. • Allow efficient communication. • Allow for implementations that can be used in a heterogeneous environment. • Allow convenient C and Fortran 77 bindings. • Provide a reliable communication interface. • Portable. • Thread safe.

  5-6. Message Passing Paradigm (figure-only slides)

  7. Message Passing Paradigm • Conceptually, all processors communicate through messages (even though some may share memory space). • Low-level details of message transport are handled by MPI and are invisible to the user. • Every processor runs the same program but takes different logical paths, determined by each processor identifying itself ("Who am I?"). • Programs are written, in general, for an arbitrary number of processors, though they may be more efficient on specific numbers (powers of 2?).
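
The "same program, different paths" idea can be made concrete with a short sketch. The C program below is not from the slides; it is a minimal example, and it uses MPI_Init, MPI_Comm_rank and MPI_Finalize, which are introduced later in the course, to show one program diverging based on the rank each process discovers for itself.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);                /* every process runs this same code */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* "Who am I?"                       */

        if (rank == 0)
            printf("Rank 0: I will hand out the work.\n");             /* one path   */
        else
            printf("Rank %d: I will do a share of the work.\n", rank); /* other path */

        MPI_Finalize();
        return 0;
    }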

  8. Distributed Memory and I/O Systems • The cluster machines available at Virginia Tech are distributed-memory, distributed-I/O systems. • Each node (processor pair) has its own memory and local hard disk. • This allows asynchronous execution of multiple instruction streams. • Heavy disk I/O should be directed to the local disk rather than across the network, and minimized as much as possible. • While getting your program running, another goal to keep in mind is making good use of the hardware available to you. • What does “good use” mean?

  9. Speedup • The speedup achieved by a parallel algorithm running on p processors is the ratio between the time taken by that parallel computer executing the fastest serial algorithm and the time taken by the same parallel computer executing the parallel algorithm using p processors. • -Designing Efficient Algorithms for Parallel Computers, Michael J. Quinn

  10. Speedup • Sometimes a “fastest serial version” of the code is unavailable. • Speedup can instead be measured against the parallel algorithm run on a single processor, but this gives an unfair advantage to the parallel code: the inefficiencies introduced by parallelizing the code are then present in the serial baseline as well.

  11. Speedup Example • Our really_big_code01 executes on a single processor in 100 hours. • The same code on 10 processors takes 10 hours. • 100 hrs./10 hrs. = 10 = speedup. • When speedup = p it is called ideal (or perfect) speedup. • Speedup by itself is not very meaningful. A speedup of 10 may sound good (We are solving the problem 10 times as fast!) but what if we were using 1000 processors to get that number?

  12. Efficiency • The efficiency of a parallel algorithm running on p processors is the speedup divided by p. • -Designing Efficient Algorithms for Parallel Computers, Michael J. Quinn • From our last example, • when p = 10 the efficiency is 10/10 = 1 (great!), • when p = 1000 the efficiency is 10/1000 = 0.01 (bad!). • Speedup and efficiency give us an idea of how well our parallel code is making use of the available resources.
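
The snippet below is only the arithmetic from these slides written out in C; the variable names and sample times are illustrative, not taken from a real run.

    #include <stdio.h>

    int main(void)
    {
        double t_serial   = 100.0;  /* hours for the fastest serial run (slide 11) */
        double t_parallel = 10.0;   /* hours on p processors                       */
        int    p          = 10;

        double speedup    = t_serial / t_parallel;  /* 10.0                        */
        double efficiency = speedup / p;            /* 1.0 for p = 10;             */
                                                    /* 0.01 if p = 1000            */

        printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
        return 0;
    }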

  13. Concurrency • The first step in parallelizing any code is to identify the types of concurrency found in the problem itself (not necessarily in the serial algorithm). • Many parallel algorithms bear little resemblance to the (fastest known) serial version they are compared to, and sometimes require an unusual perspective on the problem.

  14. Concurrency • Consider the problem of finding the sum of n integer values. • A sequential algorithm may look something like this: • BEGIN • sum = A0 • FOR i = 1 TO n - 1 DO • sum = sum + Ai • ENDFOR • END
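
For reference, a direct C transcription of this pseudocode might look as follows; the array name A and the long accumulator are assumptions for illustration.

    /* Sum n integers sequentially, exactly as in the pseudocode above. */
    long sum_array(const int A[], int n)
    {
        long sum = A[0];                  /* sum = A0           */
        for (int i = 1; i <= n - 1; i++)  /* FOR i = 1 TO n - 1 */
            sum = sum + A[i];             /* sum = sum + Ai     */
        return sum;
    }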

  15. Concurrency • Suppose n = 4. Then the additions would be done in a precise order as follows: • [(A0 + A1) + A2] + A3 • Without any insight into the problem itself we might assume that the process is completely sequential and cannot be parallelized. • Of course, we know that addition is associative (exactly for integers, approximately for floating point). The same expression could be written as: • (A0 + A1) + (A2 + A3) • By using our insight into the problem of addition we can exploit the concurrency inherent in the problem rather than in the algorithm.
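
One way to see that concurrency is to regroup the additions pairwise: at every level the two halves are independent sub-problems that could be evaluated at the same time. The recursive sketch below is illustrative only and still runs serially.

    /* Pairwise sum of A[lo..hi]: groups the work as (A0 + A1) + (A2 + A3) + ...
     * The left and right halves are independent and could run concurrently. */
    long pairwise_sum(const int A[], int lo, int hi)
    {
        if (lo == hi)
            return A[lo];
        int mid = lo + (hi - lo) / 2;
        return pairwise_sum(A, lo, mid) + pairwise_sum(A, mid + 1, hi);
    }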

  16. Communication is Slow • Continuing our example of adding n integers, we may want to parallelize the process to exploit as much concurrency as possible. We call on the services of Clovus the Parallel Guru. • Let n = 128. • Clovus divides the integers into pairs and distributes them to 64 processors, maximizing the concurrency inherent in the problem. • The solutions to the 64 sub-problems are then combined on 32 processors, those 32 on 16, and so on…

  17. Communication Overhead • Suppose it takes t units of time to perform an addition. • Suppose it takes 100t units of time to pass a number from one processor to another. • The entire calculation on a single processor would take 127t time units. • Using the maximum number of processors possible (64), Clovus finds the sums of the first set of pairs in 101t time units. Further combining steps, leaving 32, 16, 8, 4, 2, and finally 1 partial sum, follow to obtain the final solution. • (64) (32) (16) (8) (4) (2) (1) • 7 × 101t = 707t total time units, far more than the 127t needed on a single processor.
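
In practice, this hand-built reduction tree is what the collective operation MPI_Reduce (covered in the collective communications sessions) performs for you. A minimal sketch, assuming each process already holds one partial sum in a variable named local:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int local  = rank + 1;  /* stand-in for this process's partial sum */
        int global = 0;

        /* Combine all partial sums onto rank 0, typically via a tree. */
        MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %d\n", global);

        MPI_Finalize();
        return 0;
    }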

  18. Parallelism and Pipelining to Achieve Concurrency • There are two primary ways to achieve concurrency in an algorithm. • Parallelism • The use of multiple resources to increase concurrency. • Partitioning. • Example: Our summation problem. • Pipelining • Dividing the computation into a number of steps that are repeated throughout the algorithm. • An ordered set of segments in which the output of each segment is the input of its successor. • Example: Automobile assembly line.
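
A minimal MPI sketch of the pipelining idea (not from the slides): each rank receives an item from its left neighbour, performs its own segment of the computation, and forwards the result to its right neighbour. The blocking MPI_Send/MPI_Recv calls used here are introduced later in the course; the "segment" is a placeholder increment.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            value = 0;  /* pipeline input enters at the first segment */
        else
            MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);

        value += 1;     /* this rank's segment of the computation */

        if (rank < size - 1)
            MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("value after %d pipeline segments: %d\n", size, value);

        MPI_Finalize();
        return 0;
    }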

  19. Examples (Jacobi-style update) • Imagine we have a cellular automaton that we want to parallelize. (figure: grid with cells numbered 1-8, …)

  20. Examples • We try to distribute the rows evenly between two processors. (figure: the numbered grid split row-wise between two processors)

  21. Examples • Columns seem to work better for this problem. (figure: the numbered grid split column-wise)

  22. Examples • Minimizing communication. (figure: a decomposition of the numbered grid chosen to minimize communication)

  23-38. Examples (Gauss-Seidel style update) • Emulating a serial Gauss-Seidel update style with a pipe. (figures: slides 23-38 step through animation frames of the pipelined update sweeping across the numbered grid)

  39. Batch System Interaction • Both Anantham (400 processors) and System “X” (2200 processors) will normally operate in batch mode. • Jobs are not interactive. • Multi-user etiquette is enforced by a job scheduler and queuing system. • Users will submit jobs using a script file built by the administrator and modified by the user.

  40. PBS (Portable Batch System) Submission Script • #!/bin/bash • #! • #! Example of job file to submit parallel MPI applications. • #! Lines starting with #PBS are options for the qsub command. • #! Lines starting with #! are comments • #! Set queue (production queue --- the only one right now) and • #! the number of nodes. • #! In this case we require 10 nodes from the entire set ("all"). • #PBS -q prod_q • #PBS -l nodes=10:all

  41. PBS Submission Script • #! Set time limit. • #! The default is 30 minutes of cpu time. • #! Here we ask for up to 1 hour. • #! (Note that this is *total* cpu time, e.g., 10 minutes on • #! each of 4 processors is 40 minutes) • #! Hours:minutes:seconds • #PBS -l cput=01:00:00 • #! Name of output files for std output and error; • #! Defaults are <job-name>.o<job number> and <job-name>.e<job-number> • #!PBS -e ZCA.err • #!PBS -o ZCA.log

  42. PBS Submission Script • #! Mail to user when job terminates or aborts • #! #PBS -m ae • #! change the working directory (default is home directory) • cd $PBS_O_WORKDIR • #! Run the parallel MPI executable (change the default a.out) • #! (Note: omit "-kill" if you are running a 1 node job) • /usr/local/bin/mpiexec -kill a.out

  43. Common Scheduler Commands • qsub <script file name> • Submits your script file for scheduling. It is immediately checked for validity, and if it passes the check you will get a message that your job has been added to the queue. • qstat • Displays information on jobs waiting in the queue and jobs that are running, including how much time they have left and how many processors they are using. • Each job acquires a unique job_id that can be used to communicate with a job that is already running (perhaps to kill it). • qdel <job_id> • If for some reason you have a job that you need to remove from the queue, this command will do it. It will also kill a job in progress. • You, of course, only have access to delete your own jobs.

  44. MPI Data Types • MPI describes every message as a starting address in memory and a count, together with a datatype giving the interpretation of the data. • The direct measure of length (number of bytes) is hidden from the user through the use of MPI data types. • Each language binding (C and Fortran 77) has its own list of MPI types; these are intended to increase portability, since the length of the corresponding language types can change from machine to machine. • The interpretation of data can also change from machine to machine in heterogeneous clusters (Macs and PCs in the same cluster, for example).
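
As an illustration, the triple (starting address, count, datatype) is exactly what appears in a send or receive call: below, 10 doubles beginning at buf, interpreted via MPI_DOUBLE. This is a hedged sketch using the blocking MPI_Send/MPI_Recv calls described later in the course; the buffer contents are made up, and it needs at least two processes to run.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 10; i++)
                buf[i] = 1.5 * i;
            /* Message = (starting address buf, count 10, type MPI_DOUBLE). */
            MPI_Send(buf, 10, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 10, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("received %g ... %g\n", buf[0], buf[9]);
        }

        MPI_Finalize();
        return 0;
    }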

  45. MPI types in C • MPI_CHAR – signed char • MPI_SHORT – signed short int • MPI_INT – signed int • MPI_LONG – signed long int • MPI_UNSIGNED_CHAR – unsigned char • MPI_UNSIGNED_SHORT – unsigned short int • MPI_UNSIGNED – unsigned int • MPI_UNSIGNED_LONG – unsigned long int • MPI_FLOAT – float • MPI_DOUBLE – double • MPI_LONG_DOUBLE – long double • MPI_BYTE • MPI_PACKED

  46. MPI Types in Fortran 77 • MPI_INTEGER – INTEGER • MPI_REAL – REAL • MPI_DOUBLE_PRECISION – DOUBLE PRECISION • MPI_COMPLEX – COMPLEX • MPI_LOGICAL – LOGICAL • MPI_CHARACTER – CHARACTER(1) • MPI_BYTE • MPI_PACKED • Caution: Fortran90 does not always store arrays contiguously.

  47. Functions Appearing in all MPI Programs (Fortran 77) • MPI_INIT(IERROR) • INTEGER IERROR • Must be called before any other MPI routine. • Can be visualized as the point in the code where every processor obtains its own copy of the program and continues to execute, though this may actually happen earlier.

  48. Functions Appearing in all MPI Programs (Fortran 77) • MPI_FINALIZE(IERROR) • INTEGER IERROR • This routine cleans up all MPI state. • Once this routine is called, no MPI routine may be called. • It is the user's responsibility to ensure that ALL pending communications involving a process complete before the process calls MPI_FINALIZE.

  49. Typical Startup Functions • MPI_COMM_SIZE(COMM, SIZE, IERROR) • IN INTEGER COMM • OUT INTEGER SIZE, IERROR • Returns the size of the group associated with the communicator COMM. • …What’s a communicator?

  50. Communicators • A communicator is a handle (an INTEGER in Fortran 77, an MPI_Comm in C) that tells MPI which communication domain a call operates in. • There is a special communicator that exists in every MPI program called MPI_COMM_WORLD. • MPI_COMM_WORLD can be thought of as the superset of all communication domains: every processor requested by your initial script is a member of MPI_COMM_WORLD.
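
Putting slides 47-50 together, a C skeleton of the typical startup might look like the sketch below. The trailing IERROR argument of the Fortran 77 bindings becomes the C return code, and the communicator passed is the predefined MPI_COMM_WORLD; MPI_Comm_rank, the companion call that returns a process's own identifier, is assumed here for completeness.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int size, rank;

        MPI_Init(&argc, &argv);               /* MPI_INIT(IERROR)                  */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* MPI_COMM_SIZE(COMM, SIZE, IERROR) */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* companion call: "Who am I?"       */

        if (rank == 0)
            printf("MPI_COMM_WORLD contains %d processes\n", size);

        MPI_Finalize();                       /* MPI_FINALIZE(IERROR)              */
        return 0;
    }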
