
Introduction to Parallel Programming (Message Passing)

Presentation Transcript


  1. Introduction to Parallel Programming (Message Passing) Francisco Almeida falmeida@ull.es Parallel Computing Group

  2. Beowulf Computers • Distributed Memory • COTS: Commercial-Off-The-Shelf computers

  3. The Parallel Model
• Computational Models: PRAM, BSP, LogP
• Programming Models: PVM, MPI, HPF, Threads, OpenMP
• Architectural Models: Parallel Architectures

  4. The Message Passing Model
• Processors connected by an Interconnection Network
• Communication through Send(parameters) and Recv(parameters)

  5. Hardware: Network of Workstations
• Distributed Memory • No Shared Memory Space • Star Topology
• Sun SPARC Ultra 1 (143 MHz) • Etherswitch

  6. Hardware: SGI Origin 2000
• Shared Distributed Memory • Hypercubic Topology • C4-CEPBA
• 64 R10000 processors • 8 GB memory • 32 Gflop/s

  7. Hardware: Digital AlphaServer 8400
• Shared Memory • Bus Topology • C4-CEPBA
• 10 Alpha 21164 processors • 2 GB memory • 8.8 Gflop/s

  8. Drawbacks that arise when solving Problems using Parallelism
• Parallel programming is more complex than sequential programming.
• Results may vary as a consequence of intrinsic non-determinism.
• New problems appear: deadlocks, starvation...
• Parallel programs are more difficult to debug.
• Parallel programs are less portable.

  9. Parallel Applications, Parallel Languages, Parallel Libraries
(figure: earlier message passing libraries — CMMD, PVM, Express, Zipcode, p4, PARMACS, EUI — converging into MPI)

  10. MPI • What Is MPI? • Message Passing Interface standard • The first standard and portable message passing library with good performance • "Standard" by consensus of MPI Forum participants from over 40 organizations • Finished and published in May 1994, updated in June 1995 • What does MPI offer? • Standardization - on many levels • Portability - to existing and new systems • Performance - comparable to vendors' proprietary libraries • Richness - extensive functionality, many quality implementations

  11. A Simple MPI Program: hello.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
  int name, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  if (name != 0) {
    printf("Processor %d of %d\n", name, p);
    sprintf(message, "greetings from process %d!", name);
    dest = 0;
    MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    printf("processor 0, p = %d\n", p);
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
  return 0;
}

Compile and run:
mpicc -o hello hello.c
mpirun -np 4 hello

Sample output:
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4
greetings from process 1!
greetings from process 2!
greetings from process 3!

  12. Basic Communication Operations

  13. One-to-all Broadcast and Single-node Accumulation
• One-to-all broadcast: a message M held by one process is delivered to processes 0, 1, ..., p
• Single-node accumulation: the dual operation, combining a contribution from every process onto a single one, step by step
(figure: both operations over p+1 processes)

  14. Broadcast on Hypercubes
(figure: first and second steps of the broadcast on a 3-dimensional hypercube with nodes 0-7)

  15. Broadcast on Hypercubes
(figure: the third step completes the broadcast to all eight nodes)
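
As a complement to the two figures, here is a minimal sketch of the recursive-doubling scheme they illustrate, written with plain MPI_Send/MPI_Recv. It assumes the number of processes is a power of two and that the source of the data is rank 0; the function name hypercube_bcast is ours. In practice the collective MPI_Bcast on the next slide provides this operation.

/* One-to-all broadcast by recursive doubling (hypercube scheme).
 * A sketch, assuming p is a power of two and the root is rank 0. */
#include "mpi.h"

void hypercube_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, p, mask, partner;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Walk the hypercube dimensions from the highest to the lowest. */
    for (mask = p >> 1; mask > 0; mask >>= 1) {
        if ((rank & (mask - 1)) == 0) {      /* active in this step?           */
            partner = rank ^ mask;           /* neighbour along this dimension */
            if ((rank & mask) == 0)
                MPI_Send(buf, count, type, partner, 0, comm);
            else
                MPI_Recv(buf, count, type, partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}

A call such as hypercube_bcast(&n, 1, MPI_INT, MPI_COMM_WORLD) would spread an integer held by rank 0 in log2(p) steps, exactly as in the figures.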

  16. MPI Broadcast
• int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
• Broadcasts a message from the process with rank "root" to all other processes of the group
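
A minimal usage sketch (not from the slides): rank 0 initializes an integer n (a made-up parameter) and MPI_Bcast copies it to every process.

/* A minimal MPI_Bcast sketch: rank 0 sets a value, everybody gets a copy. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        n = 100;                      /* only the root knows n so far */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d: n = %d\n", rank, n);   /* every rank prints 100 */
    MPI_Finalize();
    return 0;
}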

  17. Reduction on Hypercubes
• @ is a commutative and associative operator
• Ai resides in processor i
• Every processor has to obtain A0 @ A1 @ ... @ Ap-1
(figure: on a 3-dimensional hypercube, neighbours along one dimension exchange and combine their partial results in each step: first pairwise results such as A0@A1, then four-term results such as A0@A1@A2@A3, until every node holds the full reduction)

  18. Reductions with MPI
• int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
  Reduces values on all processes to a single value on the root process.
• int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
  Combines values from all processes and distributes the result back to all processes.
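
A short hedged example (not from the slides) contrasting the two calls: each process contributes its rank; the sum ends up only on the root with MPI_Reduce, and on every process with MPI_Allreduce.

/* Reduce vs. Allreduce: summing the ranks of all processes. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, p, local, sum_on_root = 0, sum_everywhere = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    local = rank;

    MPI_Reduce(&local, &sum_on_root, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Allreduce(&local, &sum_everywhere, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_Reduce on root: %d\n", sum_on_root);        /* p(p-1)/2 */
    printf("rank %d, MPI_Allreduce: %d\n", rank, sum_everywhere);
    MPI_Finalize();
    return 0;
}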

  19. All-to-all Broadcast and Multinode Accumulation
• All-to-all broadcast: every process i starts with its own message Mi and ends up with the whole set M0, M1, ..., Mp
• Multinode accumulation: the dual operation, underlying reductions and prefix sums
(figure: both operations over p+1 processes)

  20. MPI Collective Operations

MPI Operator    Operation
---------------------------------------------
MPI_MAX         maximum
MPI_MIN         minimum
MPI_SUM         sum
MPI_PROD        product
MPI_LAND        logical and
MPI_BAND        bitwise and
MPI_LOR         logical or
MPI_BOR         bitwise or
MPI_LXOR        logical exclusive or
MPI_BXOR        bitwise exclusive or
MPI_MAXLOC      max value and location
MPI_MINLOC      min value and location
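
For the two location operators the send buffer must pair a value with an index. A small sketch (the per-process values are made up for illustration) using MPI_MAXLOC with the predefined MPI_DOUBLE_INT pair type:

/* MPI_MAXLOC: find the largest value and the rank that owns it. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    struct { double val; int rank; } in, out;
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.val  = 1.0 * ((rank * 7) % 5);   /* some per-process value */
    in.rank = rank;
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %g is on rank %d\n", out.val, out.rank);
    MPI_Finalize();
    return 0;
}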

  21. The Master-Slave Paradigm
(figure: a master process coordinating several slave processes)

  22. Computing π

π = ∫₀¹ 4 / (1 + x²) dx    (figure: plot of 4/(1+x²) on [0, 1])

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
  x = h * ((double) i - 0.5);
  sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

mpirun -np 3 cpi
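
For reference, a self-contained version of the same computation (our own completion of the fragment above: the value of n, the function f and the surrounding MPI setup are assumptions):

/* Midpoint rule on 4/(1+x^2), work distributed cyclically over the ranks. */
#include <stdio.h>
#include "mpi.h"

static double f(double x) { return 4.0 / (1.0 + x * x); }

int main(int argc, char *argv[])
{
    int myid, numprocs, i, n = 1000000;      /* number of intervals (assumed) */
    double h, x, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    h = 1.0 / (double) n;
    for (i = myid + 1; i <= n; i += numprocs) {   /* cyclic distribution */
        x = h * ((double) i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("pi is approximately %.16f\n", pi);
    MPI_Finalize();
    return 0;
}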

  23. The Portability of the Efficiency

  24. The Sequential Algorithm

f[k][c] = max { f[k-1][c], f[k-1][c - w[k]] + p[k] }   for c ≥ w[k]

void mochila01_sec (void) {
  unsigned v1;
  int c, k;
  for (c = 0; c <= C; c++)
    f[0][c] = 0;
  for (k = 1; k <= N; k++)
    for (c = 0; c <= C; c++) {
      f[k][c] = f[k-1][c];
      if (c >= w[k]) {
        v1 = f[k-1][c - w[k]] + p[k];
        if (v1 > f[k][c])
          f[k][c] = v1;
      }
    }
}

Running time: O(n·C)
(figure: the table f with n rows and C columns; row f[k] depends only on row f[k-1])
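
A small self-contained illustration of the recurrence (the weights, profits and capacity are made up; the slide's global arrays become locals here):

/* 0/1 knapsack dynamic program, O(N*C), on a tiny example instance. */
#include <stdio.h>

#define N 4                       /* number of objects (assumed) */
#define C 10                      /* knapsack capacity (assumed) */

int main(void)
{
    /* index 0 unused so that objects are numbered 1..N as on the slide */
    int w[N + 1] = {0, 2, 3, 4, 5};
    int p[N + 1] = {0, 3, 4, 5, 8};
    int f[N + 1][C + 1];
    int k, c;

    for (c = 0; c <= C; c++)
        f[0][c] = 0;
    for (k = 1; k <= N; k++)
        for (c = 0; c <= C; c++) {
            f[k][c] = f[k - 1][c];                     /* do not take object k */
            if (c >= w[k] && f[k - 1][c - w[k]] + p[k] > f[k][c])
                f[k][c] = f[k - 1][c - w[k]] + p[k];   /* take object k        */
        }

    printf("optimal profit: %d\n", f[N][C]);   /* 15 for this data */
    return 0;
}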

  25. The Parallel Algorithm

f[k][c] = max { f[k-1][c], f[k-1][c - w[k]] + p[k] }

void transition (int stage) {
  unsigned x;
  int c, k;
  k = stage;
  for (c = 0; c <= C; c++)
    f[c] = 0;
  for (c = 0; c <= C; c++) {
    IN(&x);
    f[c] = max(f[c], x);
    OUT(&f[c], 1, sizeof(unsigned));
    if (C >= c + w[k])
      f[c + w[k]] = x + p[k];
  }
}

(figure: processor k receives f[k-1][c] from processor k-1 and produces f[k][c])

  26. The Evolution of the Pipeline
(figure: the n pipeline stages advancing over the C columns of the table)

  27. The Running Time: (n - 1) + C
(figure: n pipeline stages, C columns)

  28. Processor Virtualization: Block Mapping
(figure: the n stages grouped into blocks of n/p consecutive stages on processors 0, 1, 2; each stage sweeps C columns)

  29. Processor Virtualization: Block Mapping
(figure: the same block mapping at a later step of the animation)

  30. Processor Virtualization
(figure: blocks of n/p stages on processors 0, 1, 2, C columns)

  31. The Running Time
• Each processor delays the next one by (n/p - 1)·C steps, i.e. (p - 1)(n/p - 1)·C in total
• Total time: (p - 1)(n/p - 1)·C + nC/p ≈ nC

  32. Processor Virtualization
(figure: blocks of n/p stages on processors 0, 1, 2, C columns, at a later step of the animation)

  33. The Running Time: (p - 1)(n/p) + nC/p ≈ nC/p
(figure: blocks of n/p stages on processors 0, 1, 2, C columns)

  34. Block Mapping

width = N / num_proc;
if (f_name < N % num_proc)   /* Load Balancing */
  width++;

int calcInitStage( void ) {
  return (f_name < N % num_proc) ?
    f_name * width : (f_name * width) + (N % num_proc);
}

void transition (void) {
  unsigned c, k, i, inData;
  for (c = 0; c <= C; c++) {
    IN(&inData);
    k = calcInitStage();
    for (i = 0; i < width; k++, i++) {
      f[i][c] = max(f[i][c], inData);
      if (c + w[k] <= C)
        f[i][c + w[k]] = inData + p[k];
      inData = f[i][c];
    }
    OUT(&f[i-1][c], 1, sizeof(unsigned));
  }
}

  35. Cyclic Mapping
(figure: stages dealt out cyclically to processors 0, 1, 2, with a queue of pending stages)

  36. The Running Time: (p - 1) + (n/p)·C
(figure: cyclic mapping with a queue of stages on processors 0, 1, 2)

  37. Cyclic Mapping

int bands = num_bands(n);
for (i = 0; i < bands; i++) {
  stage = f_name + i * num_proc;
  if (stage <= n - 1)
    transition(stage);
}

unsigned num_bands (unsigned n) {
  float aux_f;
  unsigned aux;
  aux_f = (float) n / (float) num_proc;
  aux = (unsigned) aux_f;
  if (aux_f > aux)
    return (aux + 1);
  return (aux);
}

void transition (int stage) {
  unsigned x;
  int c, k;
  k = stage;
  for (c = 0; c <= C; c++)
    f[c] = 0;
  for (c = 0; c <= C; c++) {
    IN(&x);
    f[c] = max(f[c], x);
    OUT(&f[c], 1, sizeof(unsigned));
    if (C >= c + w[k])
      f[c + w[k]] = x + p[k];
  }
}

  38. Advantages and Disadvantages
• Block Distribution: minimizes the number of communications, but penalizes the startup time of the pipeline
• Cyclic Distribution: minimizes the startup time of the pipeline, but may produce communication overhead

  39. Transputer Network vs. Local Area Network
• Transputer Network: fine grain, parallel communications
• Local Area Network: coarse grain, serial communications

  40. Computational Results
(figures: running time versus number of processors on the Transputer network and on the Local Area Network)

  41. The Resource Allocation Problem
• M units of an indivisible resource and a set of N tasks
• fj(x): benefit obtained when x units of resource are allocated to task j

maximize    Σ j=1..N  fj(xj)
subject to  Σ j=1..N  xj = M
            0 ≤ xj ≤ Bj
            xj integer, j = 1, ..., N;  M, Bj ∈ ℕ

  42. RAP - The Sequential Algorithm

G[k][m] = max { G[k-1][m-i] + fk(i) : 0 ≤ i ≤ m }

int rap_seq(void) {
  int i, k, m;
  for (m = 0; m <= M; m++)
    G[0][m] = 0;
  for (k = 1; k <= N; k++)
    for (m = 0; m <= M; m++) {
      G[k][m] = 0;
      for (i = 0; i <= m; i++)
        G[k][m] = max(G[k][m], G[k-1][i] + f(k, m - i));
    }
  return G[N][M];
}

Running time: O(N·M²)

  43. RAP - The Parallel Algorithm

G[k][m] = max { G[k-1][m-i] + fk(i) : 0 ≤ i ≤ m }

void transition (int stage) {
  int m, j, x, k;
  for (m = 0; m <= M; m++)
    G[m] = 0;
  k = stage;
  for (m = 0; m <= M; m++) {
    IN(&x);
    G[m] = max(G[m], x + f(k-1, 0));
    OUT(&G[m], 1, sizeof(int));
    for (j = m + 1; j <= M; j++)
      G[j] = max(G[j], x + f(k - 1, j - m));
  }
}

(figure: processor k receives G[k-1][m] from processor k-1 and produces G[k][m])

  44. The Cray T3E
• Shared Address Space
• Three-Dimensional Toroidal Network

  45. Block-Cyclic Mapping
(figure: a queue of bands of g stages dealt cyclically to processors 0, 1, 2; n/(gp) bands per processor; running time g(p-1) + gM²·n/(gp))

  46. Computational Results
(figures: running time versus number of processors (2, 4, 8, 16) for problem sizes 10x100, 100x1000, 400x1000 and 1000x1000, and running time versus grain (1, 2, 5, 10, 20, 40))

  47. Linear Model to Predict Communication Performance
• Time to send n bytes = τ·n + β  (a per-byte transfer cost τ times the message length, plus a constant startup cost β)
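
A hedged sketch of how the two parameters can be measured with a ping-pong test between ranks 0 and 1 (the message sizes and repetition count are arbitrary choices; run with at least two processes):

/* Ping-pong timing: fit the measurements to t(n) = tau*n + beta. */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define REPS 1000

int main(int argc, char *argv[])
{
    int rank, sizes[] = {1, 1024, 65536}, s, i;
    char *buf = malloc(65536);
    double t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (s = 0; s < 3; s++) {
        int n = sizes[s];
        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = (MPI_Wtime() - t) / (2.0 * REPS);    /* one-way time for n bytes */
        if (rank == 0)
            printf("%6d bytes: %g s\n", n, t);   /* slope ~ tau, intercept ~ beta */
    }
    free(buf);
    MPI_Finalize();
    return 0;
}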

  48. PAPI • http://icl.cs.utk.edu/projects/papi/ • PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors.
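
A minimal sketch of the PAPI low-level counting interface (error handling omitted; whether these two preset events are available depends on the processor):

/* Count total cycles and instructions around a loop with PAPI. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];
    volatile double x = 0.0;
    int i;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles       */
    PAPI_add_event(eventset, PAPI_TOT_INS);   /* total instructions */

    PAPI_start(eventset);
    for (i = 0; i < 1000000; i++)             /* the code to measure */
        x += i * 0.5;
    PAPI_stop(eventset, counts);

    printf("cycles: %lld  instructions: %lld\n", counts[0], counts[1]);
    return 0;
}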

  49. Buffering Data
• Virtual process name runs on the real processor fname if (name / grain) mod p == fname
• SET_BUFIO(1, size) sets the IN/OUT buffer size to B
(figure: P = 2, Grain = 3; virtual processes 0-8 are mapped in groups of three, alternating between processors 0 and 1, with buffers of size B between them)
