
Introducción a la computación de altas prestaciones


Presentation Transcript


  1. Introducción a la computación de altas prestaciones (Introduction to High-Performance Computing). Francisco Almeida and Francisco de Sande. Departamento de Estadística, I.O. y Computación, Universidad de La Laguna. La Laguna, February 12, 2004

  2. Questions • Why Parallel Computers? • How Can the Quality of the Algorithms be Analyzed? • How Should Parallel Computers Be Programmed? • Why the Message Passing Programming Paradigm? • Why the Shared Memory Programming Paradigm?

  3. OUTLINE • Introduction to Parallel Computing • Performance Metrics • Models of Parallel Computers • The MPI Message Passing Library • Examples • The OpenMP Shared Memory Library • Examples • Improvements in black hole detection using parallelism

  4. Why Parallel Computers? • Applications Demanding more Computational Power: • Artificial Intelligence • Weather Prediction • Biosphere Modeling • Processing of Large Amounts of Data (from sources such as satellites) • Combinatorial Optimization • Image Processing • Neural Networks • Speech Recognition • Natural Language Understanding • etc. [Figure: performance vs. cost of supercomputers from the 1960s to the 1990s]

  5. Top500 • www.top500.org

  6. Performance Metrics

  7. Speed-up • Ts = Sequential Run Time: Time elapsed between the beginning and the end of its execution on a sequential computer. • Tp = Parallel Run Time: Time that elapses from the moment a parallel computation starts to the moment the last processor finishes the execution. • Speed-up: S = T*s / Tp ≤ p • T*s = Time of the best sequential algorithm to solve the problem.

  8. Speed-up

  9. Speed-up

  10. Speed-up

  11. Speed-up Optimal Number of Processors

  12. Efficiency • In practice, the ideal behavior of a speed-up equal to p is not achieved, because while executing a parallel algorithm the processing elements cannot devote 100% of their time to the computations of the algorithm. • Efficiency: Measure of the fraction of time for which a processing element is usefully employed. • E = (Speed-up / p) x 100 %
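
  A minimal C sketch, not part of the original slides, that simply plugs measured run times into the two definitions above (the times and processor count are hypothetical):

      /* Illustrative only: speed-up and efficiency from measured run times. */
      #include <stdio.h>

      int main(void) {
          double ts = 120.0;   /* hypothetical sequential run time (s) */
          double tp = 20.0;    /* hypothetical parallel run time (s)   */
          int    p  = 8;       /* number of processors                 */

          double speedup    = ts / tp;               /* S = Ts / Tp        */
          double efficiency = speedup / p * 100.0;   /* E = (S / p) x 100% */

          printf("Speed-up:   %.2f (upper bound: %d)\n", speedup, p);
          printf("Efficiency: %.1f %%\n", efficiency);
          return 0;
      }

  Here S = 6 and E = 75%, i.e. each of the 8 processors is usefully employed three quarters of the time.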

  13. Efficiency

  14. Amdahl's Law • Amdahl's law attempts to give an upper bound on the speed-up, based on the nature of the algorithm chosen for the parallel implementation. • Seq = Proportion of time the algorithm spends in purely sequential parts. • Par = Proportion of time that might be done in parallel. • Seq + Par = 1 (where 1 is for algebraic simplicity) • Maximum Speed-up = (Seq + Par) / (Seq + Par / p) = 1 / (Seq + Par / p) [Figure: maximum speed-up as a function of Seq, for p = 1000]
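
  The slide states only the formula, so here is a small C sketch, illustrative only, that evaluates the bound for a few arbitrary Seq values with the p = 1000 used in the figure:

      /* Illustrative only: evaluate Amdahl's bound 1 / (Seq + Par/p). */
      #include <stdio.h>

      int main(void) {
          const int p = 1000;
          const double seq_values[] = { 0.0, 0.01, 0.05, 0.1, 0.5 };
          for (int i = 0; i < 5; i++) {
              double seq = seq_values[i];
              double par = 1.0 - seq;                 /* Seq + Par = 1 */
              double max_speedup = 1.0 / (seq + par / p);
              printf("Seq = %.2f -> maximum speed-up = %.1f\n", seq, max_speedup);
          }
          return 0;
      }

  Even a 1% sequential fraction caps the speed-up at about 91 on 1000 processors, which is why the bound matters long before p gets large.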

  15. Example • A problem to be solved many times over several different inputs. • Evaluate F(x,y,z) • x in {1, ..., 20}; y in {1, ..., 10}; z in {1, ..., 3} • The total number of evaluations is 20*10*3 = 600. • The cost to evaluate F at one point (x, y, z) is t. • The total running time is t * 600. • If t is equal to 3 hours, the total running time for the 600 evaluations is 1800 hours = 75 days.
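
  The slides do not show code for this example; the sketch below assumes an externally provided F and distributes the 600 independent evaluations cyclically over the MPI processes:

      /* Hypothetical sketch: cyclic distribution of the 600 evaluations. */
      #include "mpi.h"

      double F(int x, int y, int z);   /* assumed to be provided elsewhere */

      int main(int argc, char *argv[]) {
          int rank, p;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &p);

          /* Flatten the 20 x 10 x 3 = 600 points; process `rank` takes
             points rank, rank + p, rank + 2p, ...                        */
          for (int i = rank; i < 600; i += p) {
              int x = i / 30 + 1;         /* 1..20 */
              int y = (i / 3) % 10 + 1;   /* 1..10 */
              int z = i % 3 + 1;          /* 1..3  */
              F(x, y, z);                 /* each evaluation costs time t */
          }
          MPI_Finalize();
          return 0;
      }

  With 100 processes each one performs 6 evaluations, so the 75 days shrink to roughly 18 hours (plus the cost of collecting the results, which the sketch omits).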

  16. Speed-up

  17. Models

  18. The Sequential Model • The RAM model expresses computations on von Neumann architectures. • The von Neumann architecture is universally accepted for sequential computations. [Figure: the RAM model and the von Neumann architecture]

  19. The Parallel Model • Computational Models: PRAM, BSP, LogP • Programming Models: PVM, MPI, HPF, Threads, OpenMP • Architectural Models: Parallel Architectures

  20. Address-Space Organization

  21. Digital AlphaServer 8400 Hardware • Shared Memory • Bus Topology • C4-CEPBA • 10 Alpha 21164 processors • 2 GB memory • 8.8 Gflop/s

  22. SGI Origin 2000 Hardware • Distributed Shared Memory • Hypercubic Topology • C4-CEPBA • 64 R10000 processors • 8 GB memory • 32 Gflop/s

  23. The SGI Origin 3000 Architecture (1/2) jen50.ciemat.es • 160 processors MIPS R14000 / 600MHz • On 40 nodes with 4 processors each • Data and instruction cache on-chip • Irix Operating System • Hypercubic Network

  24. The SGI Origin 3000 Architecture (2/2) • cc-NUMA memory architecture • 1 Gflops peak speed • 8 MB external cache • 1 GB main memory per processor • 1 TB hard disk

  25. Beowulf Computers • COTS: Commercial-Off-The-Shelf computers • Distributed Memory

  26. Towards Grid Computing…. Source: www.globus.org & updated

  27. The Parallel Model • Computational Models: PRAM, BSP, LogP • Programming Models: PVM, MPI, HPF, Threads, OpenMP • Architectural Models: Parallel Architectures

  28. Drawbacks that arise when solving Problems using Parallelism • Parallel programming is more complex than sequential programming. • Results may vary as a consequence of intrinsic non-determinism. • New problems appear: deadlocks, starvation, ... • It is more difficult to debug parallel programs. • Parallel programs are less portable.

  29. The Message Passing Model

  30. The Message Passing Model [Figure: several processors connected through an interconnection network, communicating by means of Send(parameters) and Recv(parameters) operations]

  31. MPI [Figure: parallel applications built on top of parallel libraries (MPI, CMMD, PVM, Express, Zipcode, p4, PARMACS, EUI) and parallel languages]

  32. MPI • What Is MPI? • Message Passing Interface standard • The first standard and portable message passing library with good performance • "Standard" by consensus of MPI Forum participants from over 40 organizations • Finished and published in May 1994, updated in June 1995 • What does MPI offer? • Standardization - on many levels • Portability - to existing and new systems • Performance - comparable to vendors' proprietary libraries • Richness - extensive functionality, many quality implementations

  33. A Simple MPI Program: hello.c

      #include <stdio.h>
      #include "mpi.h"

      int main(int argc, char *argv[]) {
          int name, p;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &name);   /* rank of this process */
          MPI_Comm_size(MPI_COMM_WORLD, &p);      /* number of processes  */
          printf("Hello from processor %d of %d\n", name, p);
          MPI_Finalize();
          return 0;
      }

      Compiling and running on 4 processors:

      $> mpicc -o hello hello.c
      $> mpirun -np 4 hello
      Hello from processor 2 of 4
      Hello from processor 3 of 4
      Hello from processor 1 of 4
      Hello from processor 0 of 4

  34. A Simple MPI Program: helloms.c

      #include <stdio.h>
      #include <string.h>
      #include "mpi.h"

      int main(int argc, char *argv[]) {
          int name, p, source, dest, tag = 0;
          char message[100];
          MPI_Status status;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &name);
          MPI_Comm_size(MPI_COMM_WORLD, &p);
          if (name != 0) {   /* every process except 0 sends a greeting to process 0 */
              printf("Processor %d of %d\n", name, p);
              sprintf(message, "greetings from process %d!", name);
              dest = 0;
              MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
          } else {           /* process 0 receives one message from each other process */
              printf("processor 0, p = %d\n", p);
              for (source = 1; source < p; source++) {
                  MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
                  printf("%s\n", message);
              }
          }
          MPI_Finalize();
          return 0;
      }

      Running on 4 processors:

      $> mpirun -np 4 helloms
      Processor 2 of 4
      Processor 3 of 4
      Processor 1 of 4
      processor 0, p = 4
      greetings from process 1!
      greetings from process 2!
      greetings from process 3!

  35. Linear Model to Predict Communication Performance • Time to send n bytes = τ·n + β, where β is the start-up latency and τ the per-byte transfer time.
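
  The slides only state the model; the following is a minimal MPI ping-pong sketch (NTRIALS and the message sizes are arbitrary choices) that measures the one-way time for several message sizes, from which β and τ can be fitted offline:

      /* Illustrative ping-pong between ranks 0 and 1 to fit time(n) = tau*n + beta. */
      #include <stdio.h>
      #include "mpi.h"

      #define NTRIALS 100                 /* repetitions per message size */

      static char buf[1 << 20];           /* messages of up to 1 MB */

      int main(int argc, char *argv[]) {
          int rank;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          for (int n = 1; n <= (1 << 20); n *= 2) {
              double t0 = MPI_Wtime();
              for (int i = 0; i < NTRIALS; i++) {
                  if (rank == 0) {
                      MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                      MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  } else if (rank == 1) {
                      MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                      MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                  }
              }
              if (rank == 0)              /* half the round-trip time, averaged */
                  printf("%8d bytes: %g s\n", n, (MPI_Wtime() - t0) / (2.0 * NTRIALS));
          }
          MPI_Finalize();
          return 0;
      }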

  36. Performance Prediction: Fast Ethernet, Gigabit Ethernet, Myrinet

  37. Basic Communication Operations

  38. One-to-all Broadcast / Single-node Accumulation [Figure: a message M held by processor 0 is broadcast to processors 1, ..., p; the dual operation, single-node accumulation, combines the messages of all processors back into processor 0 in steps 1, 2, ..., p]

  39. Broadcast on Hypercubes [Figure: first and second steps of a broadcast on a three-dimensional hypercube with nodes 0-7]

  40. Broadcast on Hypercubes [Figure: third step of the broadcast on the three-dimensional hypercube]

  41. MPI Broadcast • int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm); • Broadcasts a message from the process with rank "root" to all other processes of the group
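
  A minimal usage sketch, not part of the original slides (the value of n is made up): the root holds a problem size and MPI_Bcast makes it available to every process.

      /* Illustrative use of MPI_Bcast: rank 0 distributes n to all ranks. */
      #include <stdio.h>
      #include "mpi.h"

      int main(int argc, char *argv[]) {
          int rank, n = 0;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          if (rank == 0)
              n = 1000;                /* hypothetical value known only to the root */
          MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
          printf("Process %d: n = %d\n", rank, n);   /* every rank now sees n = 1000 */
          MPI_Finalize();
          return 0;
      }

  The same call appears later in the parallel computation of π to distribute the number of intervals n.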

  42. Reduction on Hypercubes • @ is a commutative and associative operator • Ai is stored in processor i • Every processor has to obtain A0@A1@...@AP-1 [Figure: step-by-step combination of the partial results, e.g. A0@A1, then A0@A1@A2@A3, across the dimensions of a three-dimensional hypercube until every node holds A0@...@A7]
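
  Not shown in the slides: a sketch of this scheme with MPI point-to-point calls, assuming the number of processes is a power of two and using addition as the operator @.

      /* Illustrative hypercube all-reduce: in step k each process exchanges
         its partial result with the neighbour whose rank differs in bit k. */
      #include <stdio.h>
      #include "mpi.h"

      int main(int argc, char *argv[]) {
          int rank, p;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &p);      /* assumed to be a power of two */

          double a = (double) rank;               /* A_i held by processor i */
          for (int mask = 1; mask < p; mask <<= 1) {
              int partner = rank ^ mask;          /* neighbour across one dimension */
              double recv;
              MPI_Sendrecv(&a, 1, MPI_DOUBLE, partner, 0,
                           &recv, 1, MPI_DOUBLE, partner, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              a = a + recv;                       /* @ = addition in this sketch */
          }
          /* Every process now holds A_0 @ A_1 @ ... @ A_{p-1}. */
          printf("Process %d: result = %g\n", rank, a);
          MPI_Finalize();
          return 0;
      }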

  43. Reductions with MPI

      int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
      Reduces values on all processes to a single value on the root process.

      int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
      Combines values from all processes and distributes the result back to all processes.
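
  A minimal usage sketch, not in the original slides: each process contributes one local value and MPI_Allreduce leaves the global sum on every process (MPI_Reduce would leave it only on the root).

      /* Illustrative use of MPI_Allreduce with the MPI_SUM operator. */
      #include <stdio.h>
      #include "mpi.h"

      int main(int argc, char *argv[]) {
          int rank;
          double local, global;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          local = (double) rank;                  /* this process's contribution */
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          printf("Process %d: global sum = %g\n", rank, global);
          MPI_Finalize();
          return 0;
      }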

  44. All-to-all Broadcast / Multinode Accumulation [Figure: every processor 0, ..., p starts with its own message Mi; after the all-to-all broadcast each processor holds all of M0, M1, ..., Mp. The dual operation, multinode accumulation, underlies reductions and prefix sums]

  45. MPI Collective Operations

      MPI Operator   Operation
      ------------   ----------------------
      MPI_MAX        maximum
      MPI_MIN        minimum
      MPI_SUM        sum
      MPI_PROD       product
      MPI_LAND       logical and
      MPI_BAND       bitwise and
      MPI_LOR        logical or
      MPI_BOR        bitwise or
      MPI_LXOR       logical exclusive or
      MPI_BXOR       bitwise exclusive or
      MPI_MAXLOC     max value and location
      MPI_MINLOC     min value and location

  46. Computing π: Sequential

      π = ∫₀¹ 4 / (1 + x²) dx

      double h, x, pi = 0.0;
      long i, n = ...;
      h = 1.0 / (double) n;
      for (i = 0; i < n; i++) {
          x = (i + 0.5) * h;   /* midpoint of the i-th subinterval */
          pi += f(x);          /* f(x) = 4 / (1 + x*x)             */
      }
      pi *= h;

      [Figure: f(x) = 4 / (1 + x²) on the interval [0, 1]]

  47. Computing π: Parallel

      π = ∫₀¹ 4 / (1 + x²) dx

      MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
      h = 1.0 / (double) n;
      mypi = 0.0;
      for (i = name; i < n; i += numprocs) {   /* each process takes every numprocs-th interval */
          x = (i + 0.5) * h;
          mypi += f(x);                         /* f(x) = 4 / (1 + x*x) */
      }
      mypi *= h;
      MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      [Figure: f(x) = 4 / (1 + x²) on the interval [0, 1]]

  48. The Master Slave Paradigm [Figure: a master process distributing tasks to a set of slave processes and collecting their results]
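
  The slides only show the figure; the following is a minimal sketch of the paradigm in MPI, assuming the tasks are independent. The task count reuses the 600 evaluations of the earlier example, and do_task() is a hypothetical placeholder for the real computation.

      /* Illustrative master-slave scheme: the master hands out task numbers
         on demand and gathers one result per task.                          */
      #include <stdio.h>
      #include "mpi.h"

      #define NTASKS   600        /* e.g. the 600 evaluations of F(x,y,z) */
      #define TAG_WORK 1
      #define TAG_STOP 2

      static double do_task(int task) { return (double) task; }   /* placeholder */

      int main(int argc, char *argv[]) {
          int rank, p;
          MPI_Status status;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &p);

          if (rank == 0) {                         /* master */
              int next = 0, active = p - 1;
              double result;
              for (int s = 1; s < p; s++) {        /* prime every slave with one task */
                  MPI_Send(&next, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);
                  next++;
              }
              while (active > 0) {                 /* collect results, feed new tasks */
                  MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                           MPI_COMM_WORLD, &status);
                  if (next < NTASKS) {
                      MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
                               MPI_COMM_WORLD);
                      next++;
                  } else {                         /* no work left: tell slave to stop */
                      MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP,
                               MPI_COMM_WORLD);
                      active--;
                  }
              }
          } else {                                 /* slaves */
              int task;
              for (;;) {
                  MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
                  if (status.MPI_TAG == TAG_STOP) break;
                  double result = do_task(task);
                  MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
              }
          }
          MPI_Finalize();
          return 0;
      }

  Handing out tasks on demand, rather than in fixed blocks, keeps all slaves busy even when individual tasks take different amounts of time.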

  49. Condor • University of Wisconsin-Madison. www.cs.wisc.edu/condor • A problem to be solved many times over several different inputs. • The problem to be solved is computationally expensive.
