
An Introduction to Parallel Computing with the Message Passing Interface

Justin T. Newcomer

Math 627 – Introduction to Parallel Computing

University of Maryland, Baltimore County (UMBC)

December 19, 2006

Acknowledgments: Dr. Matthias K. Gobbert, UMBC



The Need for Parallel Computing

  • To increase computational speed

    • Programs need to run in a “reasonable” amount of time

      • Predicting tomorrow's weather can't take two days

  • To increase available memory

    • Problem sizes continue to increase

  • To improve the accuracy of results

    • If we can solve larger problems faster, then we can improve the accuracy of our results

  • Bottom line: We want to solve larger problems faster and more accurately



Classification of Parallel Systems

  • In 1966 Michael Flynn classified systems according to the number of instruction streams and data streams

    • This is known as Flynn’s Taxonomy

    • The two extremes are SISD and MIMD systems

  • Single-Instruction Single-Data (SISD)

    • The classical von Neumann Machine – CPU and main memory

  • Multiple-Instruction Multiple-Data (MIMD)

    • A collection of processors operate on their own data streams

  • Intermediate systems are SIMD and MISD systems

  • Parallel systems can be shared memory or distributed memory



Single-Program Multiple-Data (SPMD)

  • The most general form of a MIMD system is where each process runs a completely different program

  • In practice this is usually not needed

  • The “appearance” of each process running a different program is accomplished through branching statements on the process IDs

  • This form of MIMD programming is known as Single-Program Multiple-Data (SPMD)

  • NOT the same as SIMD (Single-Instruction Multiple-Data)

  • Message passing is the most common method of programming MIMD systems

  • This talk will focus on SPMD programs



Parallel Resources at UMBC

  • math-cluster4 – 8-processor Beowulf cluster

    • Four dual-processor Linux PCs

    • 1000 MHz Intel Pentium III processors and 1 GB of memory

    • Nodes are connected by 100 Mbps ethernet cables and a dedicated switch

    • The only machine with a connection to the outside network is pc51

  • KALI – 64-processor Beowulf cluster

    • Purchased using funds from a SCREMS grant from the National Science Foundation with equal cost-sharing from UMBC

    • Used for research projects including the areas of microelectronics manufacturing, quantum chemistry, computational neurobiology, and constrained mechanical systems

    • The machine is managed jointly by system administrators from the department and UMBC's Office of Information Technology

  • Hercules – 8-processor P4 IBM X440 system (MPI not available)



Hardware Specifications of the Cluster KALI

  • Executive Summary:

    • 64-processor Beowulf cluster

    • with a high-performance Myrinet interconnect

  • Summary:

    • Each node: two Intel Xeon 2.0 GHz processors (512 kB L2 cache) with at least 1 GB of memory

    • 32 computational nodes (31 compute and 1 storage)

    • High-performance Myrinet network for computations

    • Ethernet for file serving from a 0.5 TB RAID (= redundant array of independent disks) array

    • 1 management and user node



The Network Schematic

  • Management network not shown



Physical Layout in Two Racks (42 U high)



Front of the Whole Cluster KALI



Front of Rack H6 (mgtnode, RAID, storage1)



Back of Rack H6 (RAID, storage1, Myrinet)



Front of Computational Nodes (Rack H5)



Back of the Computational Nodes (Rack H5)

Bottom Half

Top Half



How to Program a Beowulf Cluster

  • Memory is distributed across nodes and only accessible by the local CPUs

  • Total memory: 321 GB

  • But: 2 CPUs share the memory of a single node

    • Should one use both CPU’s per node or only one?

  • Algorithm design: Divide problem into pieces with as little dependence on each other as possible, then program communications explicitly using MPI (Message-Passing Interface)

    • Fully portable code

  • Typical problems

    • Domain split → communication of solution values on interfaces (lower-dimensional region)

    • Communication at every time-step / in every iteration



What is MPI?

  • The Message Passing Interface (MPI) forum has developed a standard for programming parallel systems

  • Rather than specifying a new language (and a new compiler), MPI has taken the form of a library of functions that can be called from a C, C++, or Fortran program

  • The foundation of this library is the set of functions that can be used to achieve parallelism by message passing

  • A message passing function is simply a function that explicitly transmits data from one process to another



The Message Passing Philosophy

  • Parallel programs consist of several processes, each with its own memory, working together to solve a single problem

  • Processes communicate with each other by passing messages (data) back and forth

  • Message passing is a powerful and very general method of expressing parallelism

  • MPI provides a way to create efficient, portable, and scalable parallel code



Message Passing Example – “Hello World”

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank;          /* Rank of process           */
    int np;               /* Number of processes       */
    int source;           /* Rank of sender            */
    int dest;             /* Rank of receiver          */
    int tag = 0;          /* Tag for messages          */
    char message[100];    /* Storage for message       */
    MPI_Status status;    /* Return status for receive */

    /* Initialize MPI */
    MPI_Init(&argc, &argv);

    /* Acquire current process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Acquire number of processes being used */
    MPI_Comm_size(MPI_COMM_WORLD, &np);



Message Passing Example

    if(my_rank != 0)
    {
        /* Every process other than 0 sends a greeting to process 0 */
        sprintf(message, "Greetings from process %d! I am one of %d processes.",
                my_rank, np);
        dest = 0;
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
    else
    {
        /* Process 0 prints its own greeting (so the output below includes process 0),
           then receives and prints the greetings of all other processes in rank order */
        printf("Greetings from process %d! I am one of %d processes.\n", my_rank, np);
        for(source = 1; source < np; source++)
        {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }

    MPI_Finalize();
    return 0;
}



Example Output

Greetings from process 0! I am one of 16 processes

Greetings from process 1! I am one of 16 processes

Greetings from process 2! I am one of 16 processes

Greetings from process 3! I am one of 16 processes

Greetings from process 4! I am one of 16 processes

Greetings from process 5! I am one of 16 processes

Greetings from process 6! I am one of 16 processes

Greetings from process 7! I am one of 16 processes

Greetings from process 8! I am one of 16 processes

Greetings from process 9! I am one of 16 processes

Greetings from process 10! I am one of 16 processes

Greetings from process 11! I am one of 16 processes

Greetings from process 12! I am one of 16 processes

Greetings from process 13! I am one of 16 processes

Greetings from process 14! I am one of 16 processes

Greetings from process 15! I am one of 16 processes



Available Compilers on KALI

  • Two suites of compilers are available on kali, one from Intel and one from GNU

    • The Intel compilers are icc for C/C++ and ifort for Fortran 90/95

    • The GNU compilers are gcc for C/C++ and g77 for Fortran 77

  • You can list all available MPI implementations by

    > switcher mpi --list

    This should list lam-7.0.6, mpich-ch_gm-icc-1.2.5.9, and mpich-ch_gm-gcc-1.2.5.9

  • You can show the current MPI implementation by

    > switcher mpi --show

  • If you want to switch to another MPI implementation, for instance, to use the Intel compiler suite, say

    > switcher mpi = mpich-ch_gm-icc-1.2.5.9



Compiling and Linking MPI Code on KALI

  • Let's assume that you have a C code sample.c that contains some MPI commands. The compilation and linking of your code should work just like on other Linux clusters using the mpich implementation of MPI. Hence, compile and link the code both in one step by

    > mpicc -o sample sample.c

  • If your code includes mathematical functions (like exp, cos, etc.), you need to link to the mathematics library libm.so. This is done, just like for serial compiling, by adding -lm to the end of your combined compile and link command, that is,

    > mpicc -o sample sample.c -lm

    In a similar fashion, other libraries can be linked

  • See the man page of mpicc for more information by saying

    > man mpicc

  • Finally, to be doubly sure which compiler is accessed by your MPI compile script, you can use the -show option as in

    > mpicc -o sample sample.c -show



Submitting a Job on KALI

  • KALI uses the TORQUE resource manager and the Maui scheduler, both of which are open-source programs

  • A job, an executable with its command-line arguments, is submitted to the scheduler with the qsub command

  • In the directory in which you want to run your code, you need to create a script file that tells the scheduler more details about how to start the code, what resources you need, where to send output, and some other items

  • This script is used as the command-line argument to the qsub command by saying

    > qsub qsub-script



Example qsub Script

  • Let's call this file qsub-script in this example. It should look like this:

#!/bin/bash
:
: The following is a template for job submission for the
: scheduler on kali.math.umbc.edu
:
: This defines the name of your job
#PBS -N MPI_Sample
: This is the path for the standard output and error files
#PBS -o .
#PBS -e .
: Queue and resources: 8 nodes with the Myrinet property, 2 processors per node (ppn)
#PBS -q workq
#PBS -l nodes=8:myrinet:ppn=2
cd $PBS_O_WORKDIR
: The optional -pernode flag starts only one process per node instead of two
mpiexec -nostdout <-pernode> Sample



The qstat command

  • Once you have submitted your job to the scheduler, you will want to confirm that it has been entered into the queue. Use qstat at the command-line to get output similar to this:

                                                  Req'd  Req'd   Elap
Job ID       Username Queue Jobname     SessID NDS TSK Memory Time  S Time
------------ -------- ----- ----------- ------ --- --- ------ ----- - -----
635.mgtnode  gobbert  workq MPI_DG        2320   8   1     -- 10000 R 716:0
636.mgtnode  gobbert  workq MPI_DG        2219   8   1     -- 10000 R 716:1
665.mgtnode  gobbert  workq MPI_Nodesu      --  16   1     -- 10000 Q    --
704.mgtnode  gobbert  workq MPI_Nodesu   12090  15   1     -- 10000 E 00:00
705.mgtnode  kallen1  workq MPI_Aout        --   1   1     -- 10000 Q    --
706.mgtnode  gobbert  workq MPI_Nodesu      --  15   1     -- 10000 Q    --
707.mgtnode  gobbert  workq MPI_Nodesu      --  15   1     -- 10000 Q    --

  • The most interesting column is the one titled S for “status”

    • The letter Q indicates that your job has been queued, that is, it is waiting for resources to become available and will then be executed

    • The letter R indicates that your job is currently running

    • The letter E says that your job is exiting; this will appear during the shut-down phase, after the job has actually finished execution



An Application: The Poisson Problem

  • The Poisson problem is a partial differential equation that is discretized by the finite difference method using a five-point stencil

  • The Poisson problem can be expressed by the equations sketched below

  • We approximate the Poisson problem with this finite difference discretization and use the iterative Jacobi method to obtain a numerical solution

  • Here we have partitioned the domain into a mesh grid of dimension N × N

  • This produces a sparse matrix of dimension N² × N²

  • This does not present a problem, however, because applying the Jacobi method to this problem gives us a matrix-free solution (the system matrix never has to be stored explicitly)
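In standard form (assuming the unit square with homogeneous Dirichlet boundary conditions and mesh spacing h = 1/(N+1)), the problem and the resulting Jacobi update read:

    -\Delta u = f \quad \text{in } \Omega = (0,1)^2, \qquad u = 0 \quad \text{on } \partial\Omega

    u_{i,j}^{(k+1)} = \frac{1}{4}\left( u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)} + h^2 f_{i,j} \right), \qquad i,j = 1,\dots,N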



Parallel Implementation

  • The mesh grid is distributed among processes by blocks of rows of the mesh

  • The boundary points then must be communicated to neighboring processes at each iteration to obtain the updates
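As a concrete sketch of this block-row distribution (not taken from the original slides; the helper name and the handling of a remainder are assumptions, while N, np, id, and l_N mirror the names used in the sample code later in the deck):

/* Hypothetical helper: compute the block of mesh rows owned by process 'id'.
   Each of the np processes gets N/np rows; the first N%np processes get one extra. */
void local_rows(int N, int np, int id, int *l_N, int *offset)
{
    int base = N / np;                            /* rows every process receives     */
    int rem  = N % np;                            /* leftover rows to spread around  */
    *l_N    = base + (id < rem ? 1 : 0);          /* number of rows owned by 'id'    */
    *offset = id * base + (id < rem ? id : rem);  /* global index of first local row */
}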



The Parallel Algorithm

  • The generic algorithm involves four steps per iteration:

    1) Post communication requests to exchange neighboring points

    2) Calculate Jacobi iteration on all local interior points

    3) Wait for the exchange of neighboring points to complete

    4) Calculate Jacobi iteration on all local boundary points

  • The exchange of boundary points is accomplished as follows:

    process = 0:
        send top row of local mesh to process 1
        receive above points from process 1

    process = i, 0 < i < np-1:
        send top row of local mesh to process i+1
        send bottom row of local mesh to process i-1
        receive above points from process i+1
        receive below points from process i-1

    process = np-1:
        send bottom row of local mesh to process np-2
        receive below points from process np-2



Sample Code

MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();                              /* start the wall clock timer */

err = ((double) 1) + tol;
it = 0;
while( (err > tol) && (it < itmax) )
{
    it = it + 1;
    if(it > 1){ update(l_u, l_new_u, l_N, N); }
    local_jacobi_sweep(l_u, l_new_u, N, l_N, id);                /* Jacobi update on local interior points */
    exchange1(l_u, below_points, above_points, N, l_N, np, id);  /* exchange boundary rows with neighbors  */
    bound_jacobi_sweep(l_u, below_points, above_points, l_new_u, N, l_N, id);  /* Jacobi update on local boundary points */
    l_err = vector_norm(l_u, l_new_u, N, l_N);                   /* local piece of the squared norm        */
    MPI_Allreduce(&l_err, &err, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    err = sqrt(err);
}
MPI_Barrier(MPI_COMM_WORLD);
finish = MPI_Wtime();                             /* stop the wall clock timer */



The Exchange Function

void exchange1(double *l_u, double *below_points, double *above_points, int n,
               int l_n, int np, int id)
{
    int idbelow, idabove;
    MPI_Status status;

    get_neighbors(&idbelow, &idabove, id, np);

    /* Even-ranked processes send first and then receive; odd-ranked processes do the
       reverse, so the matching blocking calls pair up and cannot deadlock */
    if(id%2 == 0)
    {
        MPI_Send(&(l_u[(l_n-1)*n]), n, MPI_DOUBLE, idabove, 0, MPI_COMM_WORLD);
        MPI_Recv(below_points, n, MPI_DOUBLE, idbelow, 0, MPI_COMM_WORLD, &status);
        MPI_Send(l_u, n, MPI_DOUBLE, idbelow, 1, MPI_COMM_WORLD);
        MPI_Recv(above_points, n, MPI_DOUBLE, idabove, 1, MPI_COMM_WORLD, &status);
    }
    else
    {
        MPI_Recv(below_points, n, MPI_DOUBLE, idbelow, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&(l_u[(l_n-1)*n]), n, MPI_DOUBLE, idabove, 0, MPI_COMM_WORLD);
        MPI_Recv(above_points, n, MPI_DOUBLE, idabove, 1, MPI_COMM_WORLD, &status);
        MPI_Send(l_u, n, MPI_DOUBLE, idbelow, 1, MPI_COMM_WORLD);
    }
}



Collective Communications

  • MPI also includes specialized functions that allow collective communication. Collective communication is communication that involves all processes

  • In the Poisson problem example we need to calculate the Euclidean vector norm of the difference between two successive iterates at each iteration, ||u^(k+1) - u^(k)||_2 = sqrt( Σ_i ( u_i^(k+1) - u_i^(k) )² ), where the sum of squares under the square root is computed locally on each process over its own entries

  • Once computed, the norm must be available on each process



MPI_Allreduce

  • The MPI_Allreduce function is a collective communication function provided by MPI that accomplishes exactly this task

  • It reduces the local pieces of the norm using the MPI_SUM operation and then broadcasts the result to all processes

  • The MPI_Allreduce function can use other operations such as maximum, minimum, product, etc.

  • Other collective communication functions:

    • MPI_Bcast, MPI_Reduce

    • MPI_Gather, MPI_Scatter, MPI_Allgather

l_err = vector_norm(l_u, l_new_u, N, l_N);

MPI_Allreduce(&l_err, &err, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

err = sqrt(err);
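As a minimal, self-contained illustration of this pattern (not taken from the original slides; the local data values are made up), each process sums the squares of its own entries and MPI_Allreduce combines the partial sums and distributes the total:

#include <math.h>    /* for sqrt(); link with -lm as described earlier */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int id, np, i;
    double l_err = 0.0, err;
    double l_u[4] = {0.1, 0.2, 0.3, 0.4};   /* hypothetical local data on each process */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Each process computes its local partial sum of squares */
    for(i = 0; i < 4; i++)
        l_err += l_u[i] * l_u[i];

    /* Sum the partial results and make the total available on every process */
    MPI_Allreduce(&l_err, &err, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    err = sqrt(err);

    printf("Process %d of %d: global norm = %f\n", id, np, err);

    MPI_Finalize();
    return 0;
}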



Non-Blocking Communication

  • Non-blocking communications allow us to post the communications first and return immediately to perform local computations:

  • The exchange function is similar to the previous one, except it uses the non-blocking functions MPI_Isend and MPI_Irecv

    Note: The non-blocking functions also eliminate the need to order the send and receive functions

exchange2(l_u, below_points, above_points, N, l_N, np, id, requests);   /* post the four transfers       */
local_jacobi_sweep(l_u, l_new_u, N, l_N, id);                           /* overlap: update interior points */
MPI_Waitall(4, requests, status);                                       /* 'status' must be an array of four MPI_Status */
bound_jacobi_sweep(l_u, below_points, above_points, l_new_u, N, l_N, id);

void exchange2(double *l_u, double *below_points, double *above_points, int n,
               int l_n, int np, int id, MPI_Request *requests)
{
    int idbelow, idabove;

    get_neighbors(&idbelow, &idabove, id, np);

    /* Post all four non-blocking transfers; completion is checked later with MPI_Waitall */
    MPI_Irecv(below_points, n, MPI_DOUBLE, idbelow, 0, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(above_points, n, MPI_DOUBLE, idabove, 1, MPI_COMM_WORLD, &requests[1]);
    MPI_Isend(&(l_u[(l_n-1)*n]), n, MPI_DOUBLE, idabove, 0, MPI_COMM_WORLD, &requests[2]);
    MPI_Isend(l_u, n, MPI_DOUBLE, idbelow, 1, MPI_COMM_WORLD, &requests[3]);
}



Performance Measures for Parallel Computing

  • Speedup Sp(N) = T1(N) / Tp(N): How much faster are p processors than 1 processor (for a problem of a fixed size)? Optimal value: Sp(N) = p

  • Efficiency Ep(N) = Sp(N) / p: How close to optimal is the speedup? Optimal value: Ep(N) = 1 = 100%

    Tp(N) = wall clock time for problem size N on p processors

  • Speedup and efficiency for a fixed problem size are tough measures of parallel performance, because communication will inevitably come to dominate (for truly parallel code)
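    As a hypothetical worked example (illustrative timings, not measurements from KALI): if a fixed-size problem takes T1(N) = 400 seconds on one processor and T8(N) = 60 seconds on eight, then S8(N) = 400/60 ≈ 6.7 and E8(N) = 6.7/8 ≈ 0.83, i.e. roughly 83% efficiency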



Issue: What is Tp(N)?

  • Parallel program spends time in calculations and in communications. Communication time is affected by latency (initialization of communications) and bandwidth (throughput capability)

  • Fundamental problem of parallel computing: Communications hurt but are unavoidable (for a truly parallel algorithm), hence we must include them in our timings → wall clock time is used, not CPU time

  • What wall clock time? Additional issues: OS delays, MPI/network startup, file access for input (1 file read by all processors) and output (all processors write to a file, to where? central or local), etc.

  • What is T1(N)? Parallel code on a single processor or serial code with the same algorithm or a different “best known” algorithm

    • Example: Jacobi vs. Gauss-Seidel for linear solve

  • In summary, two ways to get good speedup: fast parallel code or slow serial timing (due to any reason)



Speedup and Efficiency for the Poisson Problem

Blocking Send and Receive



Speedup and Efficiency for the Poisson Problem

Non-Blocking Send and Receive



Conclusions

  • MPI provides a way to write efficient and portable parallel programs

  • MPI provides many built-in functions that help make programming and collective communications easier

  • Advanced point-to-point communication functions are also available

    • The performance increase may be system dependent

  • For more information on KALI at UMBC go to:

    http://www.math.umbc.edu/~gobbert/kali/



References

  • P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, 1997.

  • M.K. Gobbert, Configuration and Performance of a Beowulf Cluster for Large-Scale Scientific Simulations, Computing in Science and Engineering, vol. 7, no. 2, pp. 14–26, March/April 2005.

