Learn the basics of parallel programming using MPI and OpenMP in this comprehensive guide. Understand how to leverage multi-core processors and efficiently optimize your code for parallel execution.
Introduction to Parallel Programming with MPI Morris Law, SCID Apr 25, 2018
Multi-core programming • Currently, most CPUs have multiple cores that can be utilized easily by compiling with OpenMP support. • Programmers no longer need to rewrite a sequential code; they only add directives that instruct the compiler to parallelize the code with OpenMP. • Reference site: http://bisqwit.iki.fi/story/howto/openmp/
OpenMP example

/*
 * Sample program to test runtime of simple matrix multiply
 * with and without OpenMP on gcc-4.3.3-tdm1 (mingw)
 * compile with gcc -fopenmp
 * (c) 2009, Rajorshi Biswas
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int i, j, k;
    int n;
    double temp;
    double start, end, run;

    printf("Enter dimension ('N' for 'NxN' matrix) (100-2000): ");
    scanf("%d", &n);
    assert( n >= 100 && n <= 2000 );

    int **arr1 = malloc( sizeof(int*) * n);
    int **arr2 = malloc( sizeof(int*) * n);
    int **arr3 = malloc( sizeof(int*) * n);
    for(i=0; i<n; ++i) {
        arr1[i] = malloc( sizeof(int) * n );
        arr2[i] = malloc( sizeof(int) * n );
        arr3[i] = malloc( sizeof(int) * n );
    }

    printf("Populating array with random values...\n");
    srand( time(NULL) );
    for(i=0; i<n; ++i) {
        for(j=0; j<n; ++j) {
            arr1[i][j] = (rand() % n);
            arr2[i][j] = (rand() % n);
        }
    }

    printf("Completed array init.\n");
    printf("Crunching without OMP...");
    fflush(stdout);
    start = omp_get_wtime();
    for(i=0; i<n; ++i) {
        for(j=0; j<n; ++j) {
            temp = 0;
            for(k=0; k<n; ++k) {
                temp += arr1[i][k] * arr2[k][j];
            }
            arr3[i][j] = temp;
        }
    }
    end = omp_get_wtime();
    printf(" took %f seconds.\n", end-start);

    printf("Crunching with OMP...");
    fflush(stdout);
    start = omp_get_wtime();
    #pragma omp parallel for private(i, j, k, temp)
    for(i=0; i<n; ++i) {
        for(j=0; j<n; ++j) {
            temp = 0;
            for(k=0; k<n; ++k) {
                temp += arr1[i][k] * arr2[k][j];
            }
            arr3[i][j] = temp;
        }
    }
    end = omp_get_wtime();
    printf(" took %f seconds.\n", end-start);

    return 0;
}
Compiling for OpenMP support
• GCC
  gcc -fopenmp -o foo foo.c
  gfortran -fopenmp -o foo foo.f
• Intel Compiler
  icc -openmp -o foo foo.c
  ifort -openmp -o foo foo.f
• PGI Compiler
  pgcc -mp -o foo foo.c
  pgf90 -mp -o foo foo.f
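The number of threads used by the parallel regions is normally controlled with the OMP_NUM_THREADS environment variable. As a small sketch (the binary name matmul is only an illustration, not from the slides), compiling and running the matrix-multiply example above might look like:

gcc -fopenmp -o matmul matmul.c
export OMP_NUM_THREADS=4    # use 4 threads in OpenMP parallel regions
./matmul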
What is Message Passing Interface (MPI)? • A portable standard for communication • Processes communicate with one another through messages • Each process is a separate program • All data is private to its process
What is Message Passing Interface (MPI)? • It is a library, not a language! • Different compilers may be used, but all processes must link against the same MPI library, e.g. MPICH, LAM, Open MPI, etc. • Programs are written in a standard sequential language: Fortran, C, C++, etc.
Basic Idea of Message Passing Interface (MPI) • MPI environment • Initialize, manage, and terminate communication among processes • Communication between processes • Point-to-point communication, e.g. send, receive, etc. • Collective communication, e.g. broadcast, gather, etc. • Complicated data structures • Communicate the data effectively, e.g. matrices and blocks of memory
Message Passing Model (figure): on the left, processes P1, P2 and P3 run serially over time; on the right, the same work is split across Process 0, Process 1 and Process 2, which exchange data via the interconnection as time progresses.
General MPI program structure • MPI include file • Variable declarations • Initialize MPI environment • Do work and make message passing calls • Terminate MPI environment

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int np, rank, ierr;

    ierr = MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    /* Do Some Works */
    printf("Helloworld, I'm P%d of %d\n", rank, np);
    ierr = MPI_Finalize();
}

Sample output with 3 processes:
Helloworld, I'm P0 of 3
Helloworld, I'm P1 of 3
Helloworld, I'm P2 of 3
When to Use MPI? • You need a portable parallel program • You are writing a parallel library • You care about performance • You have a problem that can be solved in parallel
F77/F90 and C/C++ MPI library calls • Fortran 77/90 uses subroutines • CALL is used to invoke the library call • Nothing is returned; the error code variable is the last argument • All variables are passed by reference • C/C++ uses functions • Just the name is used to invoke the library call • The function returns an integer value (an error code) • Variables are passed by value, unless otherwise specified (buffers and output arguments are passed by address) • See the comparison below
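As an illustration (my example, not from the slides), the same rank query in both languages:

      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)    ! Fortran: rank and error code returned via arguments

      ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* C: error code is the return value; rank returned via a pointer */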
Types of Communication • Point to Point Communication • communication involving only two processes. • Collective Communication • communication that involves a group of processes.
Browse the sample files • Inside your home directory, the sample zip file mpi-1.zip has been stored for the laboratory. • Please unzip the file: unzip mpi-1.zip • The following subdirectories are created inside mpi-1:

ls -l mpi-1
total 20
drwxr-xr-x. 2 morris dean 4096 Nov 27 09:55 00openmp
drwxrwxr-x. 2 morris dean 4096 Nov 27 09:41 0-hello
drwxrwxr-x. 2 morris dean 4096 Nov 27 09:51 array-pi
drwxrwxr-x. 2 morris dean 4096 Nov 27 09:49 mc-pi
drwxrwxr-x. 2 morris dean 4096 Nov 27 09:51 series-pi

• These subdirectories contain the sample MPI-1 programs, each with a README file, for this laboratory.
First MPI C program: hello1.c • Change directory to hello and use an editor, e.g. nano, to open hello1.c

cd hello
nano hello1.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int version, subversion;

    MPI_Init(&argc, &argv);
    MPI_Get_version(&version, &subversion);
    printf("Hello world!\n");
    printf("Your MPI Version is: %d.%d\n", version, subversion);
    MPI_Finalize();
    return(0);
}
First MPI Fortran program: hello1.f • Use an editor to open hello1.f

cd hello
nano hello1.f

      program main
      include 'mpif.h'
      integer ierr, version, subversion

      call MPI_INIT(ierr)
      call MPI_GET_VERSION(version, subversion, ierr)
      print *, 'Hello world!'
      print *, 'Your MPI Version is: ', version, '.', subversion
      call MPI_FINALIZE(ierr)
      end
Second MPI C program: hello2.c • Use an editor to open hello2.c

cd hello
nano hello2.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world! I am P%d of %d\n", rank, size);
    MPI_Finalize();
    return(0);
}
Second MPI Fortran program: hello2.f • Use an editor to open hello2.f

cd hello
nano hello2.f

      program main
      include 'mpif.h'
      integer rank, size, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      print *, 'Hello world! I am P', rank, ' of ', size
      call MPI_FINALIZE(ierr)
      end
Make all files in hello • A 'Makefile' is provided in each example directory. • Running 'make' compiles all the hello examples:

make
/usr/lib64/mpich/bin/mpif77 -o helloF1 hello1.f
/usr/lib64/mpich/bin/mpif77 -o helloF2 hello2.f
/usr/lib64/mpich/bin/mpicc -o helloC1 hello1.c
/usr/lib64/mpich/bin/mpicc -o helloC2 hello2.c
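A Makefile that would produce the output above could look roughly like the sketch below; this is an assumption for illustration, the actual file shipped in the lab directory may differ:

CC  = mpicc
F77 = mpif77

all: helloF1 helloF2 helloC1 helloC2

helloC1: hello1.c
	$(CC) -o helloC1 hello1.c

helloC2: hello2.c
	$(CC) -o helloC2 hello2.c

helloF1: hello1.f
	$(F77) -o helloF1 hello1.f

helloF2: hello2.f
	$(F77) -o helloF2 hello2.f

clean:
	rm -f helloC1 helloC2 helloF1 helloF2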
mpirun hello examples in foreground • You may run the hello examples in the foreground by specifying the number of processes and the machinefile with mpirun, e.g.

mpirun -np 4 -machinefile machine ./helloC2
Hello world! I am P0 of 4
Hello world! I am P2 of 4
Hello world! I am P3 of 4
Hello world! I am P1 of 4

• machine is the file listing the hostnames on which you want the programs to run (see the sketch below). • The output lines may appear in any order, since the processes run concurrently.
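For illustration only, a machinefile is a plain text file with one hostname per line (the hostnames below are made up; the exact format, e.g. optional slot counts, depends on the MPI implementation):

node01
node02
node03
node04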
Exercise • Following the hello example above, mpirun helloC1, helloF1 and helloF2 in the foreground with 4 processes • Change directory to mc-pi: • compile all programs inside using 'make' • run mpi-mc-pi using 2, 4 and 8 processes • Change directory to series-pi: • compile all programs inside using 'make' • run series-pi using 2, 4, 6 and 8 processes • Note the time difference
Parallelization example: serial-pi.c

#include <stdio.h>

static long num_steps = 10000000;
double step;

int main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    for (i=0; i<num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("Est Pi= %f\n", pi);
}
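For background (not stated on the slide), the program is a midpoint-rule approximation of a standard integral identity for pi:

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta x,
\qquad x_i = (i+0.5)\,\Delta x,\quad \Delta x = \frac{1}{N},
\]

which is exactly the loop above with N = num_steps and Delta x = step.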
Parallelizing serial-pi.c into mpi-pi.c - Step 1: Adding MPI environment

#include "mpi.h"
#include <stdio.h>

static long num_steps = 10000000;
double step;

int main (int argc, char *argv[])
{
    int i;
    double x, pi, sum = 0.0;

    MPI_Init(&argc, &argv);
    step = 1.0/(double) num_steps;
    for (i=0; i<num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("Est Pi= %f\n", pi);
    MPI_Finalize();
}
Parallelizing serial-pi.c into mpi-pi.c - Step 2: Adding variables to print ranks

#include "mpi.h"
#include <stdio.h>

static long num_steps = 10000000;
double step;

int main (int argc, char *argv[])
{
    int i;
    double x, pi, sum = 0.0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    step = 1.0/(double) num_steps;
    for (i=0; i<num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("Est Pi= %f, Processor %d of %d \n", pi, rank, size);
    MPI_Finalize();
}
Parallelizing serial-pi.c into mpi-pi.c - Step 3: divide the workload

#include "mpi.h"
#include <stdio.h>

static long num_steps = 10000000;
double step;

int main (int argc, char *argv[])
{
    int i;
    double x, mypi, pi, sum = 0.0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    step = 1.0/(double) num_steps;
    for (i=rank; i<num_steps; i+=size){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    mypi = step * sum;
    printf("Est Pi= %f, Processor %d of %d \n", mypi, rank, size);
    MPI_Finalize();
}
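To spell out the decomposition (my wording, not the slide's): the loop for (i=rank; i<num_steps; i+=size) deals the iterations out round-robin, so with 4 processes rank 0 computes i = 0, 4, 8, ..., rank 1 computes i = 1, 5, 9, ..., and so on; each rank therefore accumulates only its own partial sum in sum, and the next step combines these partial results.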
Parallelizing serial-pi.c into mpi-pi.c - Step 4: collect partial results

#include "mpi.h"
#include <stdio.h>

static long num_steps = 10000000;
double step;

int main (int argc, char *argv[])
{
    int i;
    double x, mypi, pi, sum = 0.0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    step = 1.0/(double) num_steps;
    for (i=rank; i<num_steps; i+=size){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    mypi = step * sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank==0) printf("Est Pi= %f \n", pi);
    MPI_Finalize();
}
Compile and run mpi program
$ mpicc -o mpi-pi mpi-pi.c
$ mpirun -np 4 -machinefile machines mpi-pi
Parallelization example 2: serial-mc-pi.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>   /* for sqrt() */
#include <time.h>

int main(int argc, char *argv[])
{
    long in, i, n;
    double x, y, q;
    time_t now;

    in = 0;
    srand(time(&now));
    printf("Input no of samples : ");
    scanf("%ld", &n);
    for (i=0; i<n; i++) {
        x = rand()/(RAND_MAX+1.0);
        y = rand()/(RAND_MAX+1.0);
        if ((x*x + y*y) < 1) {
            in++;
        }
    }
    q = ((double)4.0)*in/n;
    printf("pi = %.20lf\n", q);
    printf("rmse = %.20lf\n", sqrt(((double) q*(4-q))/n));
}
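The estimator behind this code (background, not on the slide): points (x, y) are drawn uniformly in the unit square, and the fraction that lands inside the quarter circle of radius 1 approaches the quarter circle's area,

\[
P\!\left(x^2 + y^2 < 1\right) = \frac{\pi}{4}
\quad\Rightarrow\quad
\pi \approx 4\,\frac{\mathrm{in}}{n},
\]

which is the quantity q printed by the program.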
Parallelization example 2: mpi-mc-pi.c

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>   /* for sqrt() */
#include <time.h>

int main(int argc, char *argv[])
{
    long in, i, n;
    double x, y, q, Q;
    time_t now;
    int rank, size;

    MPI_Init(&argc, &argv);
    in = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(time(&now)+rank);
    if (rank==0) {
        printf("Input no of samples : ");
        scanf("%ld", &n);
    }
    MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    for (i=0; i<n; i++) {
        x = rand()/(RAND_MAX+1.0);
        y = rand()/(RAND_MAX+1.0);
        if ((x*x + y*y) < 1) {
            in++;
        }
    }
    q = ((double)4.0)*in/n;
    MPI_Reduce(&q, &Q, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    Q = Q / size;
    if (rank==0) {
        printf("pi = %.20lf\n", Q);
        printf("rmse = %.20lf\n", sqrt(((double) Q*(4-Q))/n/size));
    }
    MPI_Finalize();
}
Compile and run mpi-mc-pi
$ mpicc -o mpi-mc-pi mpi-mc-pi.c
$ mpirun -np 4 -machinefile machines mpi-mc-pi
Collective communication (figure) • scatter: distribute your data among the processes • reduction: information from all processes is combined into a condensed result by/for one process, e.g. PROD: 1 x 3 x 5 x 7 = 105
MPI_Scatter • Distributes distinct portions of data from the root task to each task in the group (including the root itself)

int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                int root, MPI_Comm comm)
MPI_Scatter • Example: two vectors are distributed in order to prepare a parallel computation of their scalar product. In the figure, the root (processor 2) holds a[0..3] = {10, 11, 12, 13}; after the call, processors 0 to 3 each receive one element in m (m = 10, 11, 12, 13 respectively).

MPI_Scatter(&a, 1, MPI_INT, &m, 1, MPI_INT, 2, MPI_COMM_WORLD);
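A minimal, self-contained sketch of the same call (my example, not from the slides; the array values mirror the figure, and it assumes exactly 4 processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, m;
    int a[4] = {10, 11, 12, 13};   /* only meaningful on the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* root (rank 2) sends one int to every rank, including itself */
    MPI_Scatter(a, 1, MPI_INT, &m, 1, MPI_INT, 2, MPI_COMM_WORLD);
    printf("P%d of %d received m = %d\n", rank, size, m);

    MPI_Finalize();
    return 0;
}

Run with mpirun -np 4; each process should print one of the four values, in some order.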
MPI_Reduce • Reduces values on all processes to a single value on root

int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
MPI_Reduce • Example: calculation of the global minimum of the variables kept by all processes, calculation of a global sum, etc. In the figure (op = MPI_SUM), processors 0 to 3 hold b = 2, 3, 5, 9; after the call the root (processor 2) holds d = 19.

MPI_Reduce(&b, &d, 1, MPI_INT, MPI_SUM, 2, MPI_COMM_WORLD);
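Putting the two collectives together, here is a sketch (my own illustration, not from the slides) of the scalar product hinted at in the MPI_Scatter example: the root scatters chunks of both vectors, every rank computes a local partial product, and MPI_Reduce sums the partial results on the root. It assumes the vector length N is divisible by the number of processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 8   /* total vector length; assumed divisible by the number of processes */

int main(int argc, char *argv[])
{
    int rank, size, i;
    double a[N], b[N];                    /* full vectors, filled on the root only */
    double local_dot = 0.0, dot = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));

    if (rank == 0)
        for (i = 0; i < N; i++) { a[i] = i + 1; b[i] = 2.0; }

    /* each rank receives its chunk of both vectors from rank 0 */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_DOUBLE, lb, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < chunk; i++)
        local_dot += la[i] * lb[i];

    /* sum the partial dot products onto rank 0 */
    MPI_Reduce(&local_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("a . b = %f\n", dot);   /* 2*(1+2+...+8) = 72 */

    free(la); free(lb);
    MPI_Finalize();
    return 0;
}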
Thanks Please give me some comments https://goo.gl/forms/880NY3kZ9h7ay7r32 Morris Law, SCID (morris@hkbu.edu.hk)