
Introduction to MPI, OpenMP, Threads



Presentation Transcript


  1. Introduction to MPI, OpenMP, Threads Gyan Bhanot gyan@ias.edu, gyan@us.ibm.com IAS Course 10/12/04 and 10/13/04

  2. Download tar file from clustermgr.csb.ias.edu: ~gyan/course/all.tar.gz. Has many MPI codes + .doc files with information on optimization and parallelization for the IAS cluster.

  3. P655 Cluster: type qcpu to get machine specs.

  4. IAS Cluster Characteristics (qcpu, pmcycles)
     IBM P655 cluster. Each node has its own copy of AIX, which is IBM's Unix OS.
     clustermgr: 2-CPU PWR4, 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B L1 data cache line size, 1536 KB L2 cache, data TLB: size = 1024, associativity = 4, instruction TLB: size = 1024, associativity = 4, freq = 1200 MHz.
     node1 to node6: 8 CPUs/node, PWR4 P655, 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B data cache line size, 1536 KB L2 cache, data TLB: size = 1024, associativity = 4, instruction TLB: size = 1024, associativity = 4, freq = 1500 MHz.
     Distributed-memory architecture, shared memory within each node.
     Shared file system: GPFS, lots of disk space.
     Run pingpong tests to determine latency and bandwidth.


  6. /*----------------------*/
     /* Parallel hello world */
     /*----------------------*/
     #include <stdio.h>
     #include <math.h>
     #include <mpi.h>

     int main(int argc, char * argv[])
     {
        int taskid, ntasks;
        double pi;

        /*------------------------------------*/
        /* establish the parallel environment */
        /*------------------------------------*/
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        /*------------------------------------*/
        /* say hello from each MPI task       */
        /*------------------------------------*/
        printf("Hello from task %d.\n", taskid);

        if (taskid == 0) pi = 4.0*atan(1.0);
        else             pi = 0.0;

        /*------------------------------------*/
        /* do a broadcast from node 0 to all  */
        /*------------------------------------*/
        MPI_Bcast(&pi, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        printf("node %d: pi = %.10lf\n", taskid, pi);

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return(0);
     }
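     A note on running hello.c (an assumed recipe, not from the original slides): on an IBM Parallel Environment system such as this cluster the code would typically be built with the MPI compiler wrapper and launched through POE, along the lines of

        mpcc hello.c -o hello
        poe hello -procs 4 -hfile hf

     where hf is a host file; this mirrors the poe invocation shown on slide 21, but the exact compile command here is an assumption.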

  7. OUTPUT FROM hello.c on 4 processors
     Hello from task 0.
     node 0: pi = 3.1415926536
     Hello from task 1.
     Hello from task 2.
     Hello from task 3.
     node 1: pi = 3.1415926536
     node 2: pi = 3.1415926536
     node 3: pi = 3.1415926536
     1. Why is the order messed up?
     2. What would you do to fix it?

  8. Answer:
     1. The control flow on different processors is not ordered: each task runs its own copy of the executable independently, so each writes its output independently of the others, which leaves the output unordered.
     2. To fix it:
        export MP_STDOUTMODE=ordered
        Then the output will look like the following:
        Hello from task 0.
        node 0: pi = 3.1415926536
        Hello from task 1.
        node 1: pi = 3.1415926536
        Hello from task 2.
        node 2: pi = 3.1415926536
        Hello from task 3.
        node 3: pi = 3.1415926536
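     MP_STDOUTMODE=ordered is specific to IBM's POE runtime. A portable alternative, not shown in the original slides, is to route all printing through task 0. A minimal sketch, assuming the same hello-world setup:

        /* Portable ordered output: only task 0 prints.                      */
        /* Sketch (not from the original slides): each task sends its value  */
        /* to task 0, which prints the lines in rank order.                  */
        #include <stdio.h>
        #include <math.h>
        #include <mpi.h>

        int main(int argc, char *argv[])
        {
            int taskid, ntasks, i;
            double pi, val;
            MPI_Status status;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
            MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

            pi = 4.0*atan(1.0);

            if (taskid == 0) {
                printf("Hello from task 0.\nnode 0: pi = %.10lf\n", pi);
                for (i = 1; i < ntasks; i++) {   /* receive and print in rank order */
                    MPI_Recv(&val, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
                    printf("Hello from task %d.\nnode %d: pi = %.10lf\n", i, i, val);
                }
            } else {
                MPI_Send(&pi, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }

            MPI_Finalize();
            return 0;
        }

     Because a single process does all the printing, the ordering is guaranteed regardless of runtime settings, at the cost of extra messages.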

  9. Pingpong Code on 4 procs of P655 cluster
     /* This program times blocking send/receives, and reports the */
     /* latency and bandwidth of the communication system. It is   */
     /* designed to run with an even number of mpi tasks.          */
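     The body of the pingpong code is not reproduced in this transcript. A minimal sketch of the blocking send/receive timing loop it describes (variable names and repeat counts are illustrative, not the original course code):

        /* Sketch of a blocking ping-pong timing loop between tasks 0 and 1. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <mpi.h>

        int main(int argc, char *argv[])
        {
            int taskid, ntasks, i;
            int msglen = 100000;            /* message size in bytes   */
            int nrep = 1000;                /* repetitions for timing  */
            char *buf;
            double t0, t1;
            MPI_Status status;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
            MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

            if (ntasks < 2) {
                if (taskid == 0) printf("need at least 2 tasks\n");
                MPI_Finalize();
                return 1;
            }

            buf = (char *) malloc(msglen);

            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < nrep; i++) {
                if (taskid == 0) {          /* task 0: send, then receive */
                    MPI_Send(buf, msglen, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, msglen, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
                } else if (taskid == 1) {   /* task 1: receive, then send */
                    MPI_Recv(buf, msglen, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, msglen, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            t1 = MPI_Wtime();

            if (taskid == 0)                /* one-way time = round trip / 2 */
                printf("msglen = %d bytes, elapsed time = %.4lf msec\n",
                       msglen, 1000.0*(t1 - t0)/(2.0*nrep));

            free(buf);
            MPI_Finalize();
            return 0;
        }

     Repeating the loop over a range of message sizes produces a table like the one on the next slide.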

  10. msglen =   32000 bytes, elapsed time = 0.3494 msec
      msglen =   40000 bytes, elapsed time = 0.4000 msec
      msglen =   48000 bytes, elapsed time = 0.4346 msec
      msglen =   56000 bytes, elapsed time = 0.4490 msec
      msglen =   64000 bytes, elapsed time = 0.5072 msec
      msglen =   72000 bytes, elapsed time = 0.5504 msec
      msglen =   80000 bytes, elapsed time = 0.5503 msec
      msglen =  100000 bytes, elapsed time = 0.6499 msec
      msglen =  120000 bytes, elapsed time = 0.7484 msec
      msglen =  140000 bytes, elapsed time = 0.8392 msec
      msglen =  160000 bytes, elapsed time = 0.9485 msec
      msglen =  240000 bytes, elapsed time = 1.2639 msec
      msglen =  320000 bytes, elapsed time = 1.5975 msec
      msglen =  400000 bytes, elapsed time = 1.9967 msec
      msglen =  480000 bytes, elapsed time = 2.3739 msec
      msglen =  560000 bytes, elapsed time = 2.7295 msec
      msglen =  640000 bytes, elapsed time = 3.0754 msec
      msglen =  720000 bytes, elapsed time = 3.4746 msec
      msglen =  800000 bytes, elapsed time = 3.7441 msec
      msglen = 1000000 bytes, elapsed time = 4.6994 msec
      latency = 50.0 microseconds
      bandwidth = 212.79 MBytes/sec
      (approximate values for MPI_Isend/MPI_Irecv/MPI_Waitall)
      3. How do you find the Bandwidth and Latency from this data?

  11. Y = X/B + L : B = Bandwidth (Bytes/sec), L = Latency
      Fit a straight line to the (msglen, elapsed time) data: Y is the elapsed time and X the message size in bytes, so the slope of the fitted line is 1/B and the intercept is L.
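      A minimal least-squares sketch of this fit in C, using a few of the data points from slide 10 (illustration only, not course code; the blocking-send numbers it yields need not match the nonblocking values quoted on slide 10):

         /* Fit elapsed time Y (sec) vs. message size X (bytes): Y = X/B + L. */
         /* Slope = 1/B gives the bandwidth, intercept gives the latency.     */
         #include <stdio.h>

         int main(void)
         {
             double x[] = { 32000, 160000, 320000, 640000, 1000000 };               /* bytes */
             double y[] = { 0.3494e-3, 0.9485e-3, 1.5975e-3, 3.0754e-3, 4.6994e-3 }; /* sec  */
             int n = 5, i;
             double sx = 0, sy = 0, sxx = 0, sxy = 0, slope, intercept;

             for (i = 0; i < n; i++) {
                 sx += x[i]; sy += y[i]; sxx += x[i]*x[i]; sxy += x[i]*y[i];
             }
             slope     = (n*sxy - sx*sy) / (n*sxx - sx*sx);   /* sec per byte */
             intercept = (sy - slope*sx) / n;                 /* sec          */

             printf("bandwidth = %.2f MBytes/sec\n", 1.0e-6/slope);
             printf("latency   = %.1f microseconds\n", 1.0e6*intercept);
             return 0;
         }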

  12. 5. Monte Carlo to Compute π
      Main Idea
      • Consider unit Square with Embedded Circle
      • Generate Random Points inside Square
      • Out of N trials, m points are inside circle
      • Then π ~ 4m/N
      • Error ~ 1/√N
      • Simple to Parallelize (a minimal serial sketch follows this slide)
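      As referenced above, a serial sketch of the estimator (illustration only; the MPI version on slides 17-20 uses MersenneTwister.h, while this sketch uses plain drand48):

         /* Serial Monte Carlo estimate of pi: the fraction of random points  */
         /* in the unit square that land inside the circle is about pi/4.     */
         #include <stdio.h>
         #include <stdlib.h>
         #include <math.h>

         int main(void)
         {
             long trials = 6400000, hits = 0, t;
             double x, y, pi_estimate;

             srand48(1357);                 /* fixed seed for reproducibility */
             for (t = 0; t < trials; t++) {
                 x = drand48();
                 y = drand48();
                 if (x*x + y*y < 1.0) hits++;
             }
             pi_estimate = 4.0*(double)hits/(double)trials;
             printf("pi(est) = %.5f, abs error = %.5f\n",
                    pi_estimate, fabs(M_PI - pi_estimate));
             return 0;
         }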

  13. Modeling Method:
      THROW MANY DARTS
      FRACTION INSIDE CIRCLE = π/4
      [figure: unit square with axes running from 0 to 1 and the embedded circle]

  14. MPI PROGRAM DEFINES WORKING NODES

  15. EACH NODE COMPUTES ESTIMATE OF PI INDEPENDENTLY

  16. NODE 0 COMPUTES AVERAGES AND WRITES OUTPUT

  17. #include <stdio.h>
      #include <math.h>
      #include <mpi.h>
      #include "MersenneTwister.h"

      void mcpi(int, int, int);
      int monte_carlo(int, int);

      //=========================================
      // Main Routine
      //=========================================
      int main(int argc, char * argv[])
      {
         int ntasks, taskid, nworkers;

         MPI_Init(&argc, &argv);
         MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
         MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

         if (taskid == 0) {
            printf(" #cpus #trials pi(est) err(est) err(abs) time(s) Mtrials/s\n");
         }

         /*--------------------------------------------------*/
         /* do monte-carlo with a variable number of workers */
         /*--------------------------------------------------*/
         for (nworkers = ntasks; nworkers >= 1; nworkers = nworkers/2) {
            mcpi(nworkers, taskid, ntasks);
         }

         MPI_Finalize();
         return 0;
      }

  18. //============================================================
      // Routine to split tasks into groups and distribute the work
      //============================================================
      void mcpi(int nworkers, int taskid, int ntasks)
      {
         MPI_Comm comm;
         int worker, my_hits, total_hits, my_trials;
         int total_trials = 6400000;
         double tbeg, tend, elapsed, rate;
         double pi_estimate, est_error, abs_error;

         /*---------------------------------------------*/
         /* make a group consisting of just the workers */
         /*---------------------------------------------*/
         if (taskid < nworkers) worker = 1;
         else                   worker = 0;

         MPI_Comm_split(MPI_COMM_WORLD, worker, taskid, &comm);

  19.    if (worker) {
            /*------------------------------------------*/
            /* divide the work among all of the workers */
            /*------------------------------------------*/
            my_trials = total_trials / nworkers;

            MPI_Barrier(comm);
            tbeg = MPI_Wtime();

            /* each worker gets a unique seed, and works independently */
            my_hits = monte_carlo(taskid, my_trials);

            /* add the hits from each worker to get total_hits */
            MPI_Reduce(&my_hits, &total_hits, 1, MPI_INT, MPI_SUM, 0, comm);

            tend = MPI_Wtime();
            elapsed = tend - tbeg;
            rate = 1.0e-6*double(total_trials)/elapsed;

            /* report the results including elapsed times and rates */
            if (taskid == 0) {
               pi_estimate = 4.0*double(total_hits)/double(total_trials);
               est_error   = pi_estimate/sqrt(double(total_hits));
               abs_error   = fabs(M_PI - pi_estimate);
               printf("%6d %9d %9.5lf %9.5lf %9.5lf %8.3lf %9.2lf\n",
                      nworkers, total_trials, pi_estimate, est_error,
                      abs_error, elapsed, rate);
            }
         }
         MPI_Barrier(MPI_COMM_WORLD);
      }

  20. //=========================================
      // Monte Carlo worker routine: return hits
      //=========================================
      int monte_carlo(int taskid, int trials)
      {
         int hits = 0;
         int xseed, yseed;
         double xr, yr;

         xseed = 1    * (taskid + 1);
         yseed = 1357 * (taskid + 1);

         MTRand xrandom( xseed );
         MTRand yrandom( yseed );

         for (int t = 0; t < trials; t++) {
            xr = xrandom();
            yr = yrandom();
            if ( (xr*xr + yr*yr) < 1.0 ) hits++;
         }

         return hits;
      }

  21. Run code in ~gyan/course/src/mpi/pi
      poe pi -procs 4 -hfile hf   (using one node, many processors)

       #cpus   #trials  pi(est)   err(est)  err(abs)  time(s)  Mtrials/s  Speedup
           4   6400000  3.14130   0.00140   0.00029    0.134      47.77     3.98
           2   6400000  3.14144   0.00140   0.00016    0.267      23.96     1.997
           1   6400000  3.14187   0.00140   0.00027    0.533      12.00     1.0
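      A note on the Speedup column (computed from the timings above): Speedup(n) = T(1)/T(n), the single-CPU time divided by the time on n CPUs. For example, Speedup(4) = 0.533/0.134 ≈ 3.98 and Speedup(2) = 0.533/0.267 ≈ 2.0. Scaling is nearly perfect within a node because the workers run independently and communicate only in the final MPI_Reduce.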
