
Big Iron and Parallel Processing

Presentation Transcript


  1. Big Iron and Parallel Processing USArray Data Processing Workshop Original by: Scott Teige, PhD, IU Information Technology Support. Modified for the 2010 course by G. Pavlis. September 14, 2014

  2. Overview • How big is “Big Iron”? • Where is it, what is it? • One system, the details • Parallelism, the way forward • Scaling and what it means to you • Programming techniques • Examples • Exercises USArray Data Processing Workshop

  3. What is the TeraGrid? • “… a nationally distributed cyberinfrastructure that provides leading edge computational and data services for scientific discovery through research and education…” • One of several consortia for high performance computing supported by the NSF USArray Data Processing Workshop

  4. Some TeraGrid Systems USArray Data Processing Workshop

  5. System Layout USArray Data Processing Workshop

  6. Availability USArray Data Processing Workshop

  7. IU Research Cyberinfrastructure The Big Picture: • Compute • Big Red (IBM e1350 Blade Center JS21) • Quarry (IBM e1350 Blade Center HS21) • Storage • HPSS • GPFS • OpenAFS • Lustre • Lustre/WAN USArray Data Processing Workshop

  8. High Performance Systems • Big Red: 30 TFLOPS IBM JS21 SuSE cluster; 768 blades / 3072 cores (2.5 GHz PPC 970MP); 8 GB memory, 4 cores per blade; Myrinet 2000; LoadLeveler & Moab • Quarry: 7 TFLOPS IBM HS21 RHEL cluster; 140 blades / 1120 cores (2.0 GHz Intel Xeon 5335); 8 GB memory, 8 cores per blade; 1 Gb Ethernet (upgrading to 10 Gb); PBS (Torque) & Moab USArray Data Processing Workshop

  9. Data Capacitor (AKA Lustre) • High performance parallel file system • ca. 1.2 PB spinning disk • Local and WAN capabilities • SC07 Bandwidth Challenge winner: moved 18.2 Gbps across a single 10 Gbps link • Dark side: likes large files; performs badly on large numbers of files and for simple commands like “ls” on a directory USArray Data Processing Workshop

  10. HPSS • High Performance Storage System • ca. 3 PB tape storage • 75 TB front-side disk cache • Ability to mirror data between IUPUI and IUB campuses USArray Data Processing Workshop

  11. Practical points • If you are doing serious data processing, NSF cyberinfrastructure systems have major advantages • State of the art compute servers • Large capacity data storage • Archival storage for data backup • Dark side: • Shared resource • Have to work through remote sysadmins • Commercial software (e.g. MATLAB) can be an issue USArray Data Processing Workshop

  12. Parallel processing • Why it matters • Single CPU systems are reaching their limit • Multiple CPU desktops are the norm already • All current HPC = parallel processing • Dark side • Still requires manual coding changes (i.e. not yet common for code to automatically be parallel) • Lots of added complexity USArray Data Processing Workshop

  13. Serial vs. Parallel [Diagram: a serial program consists of Calculation, Flow Control, and I/O; a parallel program adds Synchronization and Communication between its pieces.] USArray Data Processing Workshop

  14. Amdahl’s Law [Diagram: a serial program with parallelizable fraction F; on N processors the parallel part takes time F/N while the serial fraction 1-F is unchanged.] Speedup: S = 1/((1-F) + F/N). Special case F=1: S=N, ideal scaling. USArray Data Processing Workshop
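A quick worked example (added for clarity, not from the original slides): a code that is 90% parallelizable never speeds up by more than a factor of 10, however many cores you add:

    S = \frac{1}{(1-F) + F/N}, \qquad F = 0.9:\quad
    S(8) = \frac{1}{0.1 + 0.9/8} \approx 4.7, \qquad
    \lim_{N\to\infty} S = \frac{1}{0.1} = 10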

  15. Speed for various scaling rules [Plot: speedup S versus number of workers N for each rule.] • “Paralyzable process”: S = N e^(-(N-1)/q) • “Superlinear scaling”: S > N USArray Data Processing Workshop
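A one-line check (added here, not in the original slides) of where the paralyzable-process curve turns over: differentiating S = N e^{-(N-1)/q} and setting the result to zero gives

    \frac{dS}{dN} = e^{-(N-1)/q}\left(1 - \frac{N}{q}\right) = 0
    \quad\Longrightarrow\quad N = q

so beyond roughly q workers, adding more actually slows the job down.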

  16. Architectures • Shared memory • These iMacs are shared memory machines with 2 processors • Each CPU can address the same RAM • Distributed memory • Blades (nodes) = a motherboard in a rack • Each blade has its own RAM • Clusters have a fast network to link nodes • All modern HPC systems are both (each blade uses a multicore processor) USArray Data Processing Workshop

  17. Current technologies • Threads • Low level functionality • Good for raw speed on a desktop • Mainly for hard-core nerds like me • So will say no more today • OpenMP • MPI USArray Data Processing Workshop

  18. MPI vs. OpenMP • MPI: code may execute across many nodes; the entire program is replicated for each core (sections may or may not execute); variables are not shared; typically requires structural modification to code • OpenMP: code executes only on the set of cores sharing memory; simplified interface to pthreads; sections of code may be parallel or serial; variables may be shared; incremental parallelization is easy USArray Data Processing Workshop

  19. Let’s look first at OpenMP • Who has heard of the old-fashioned “fork” procedure (part of unix since the 1970s)? • What is a “thread” then, and how is it different from a fork? • OpenMP is a simple, clean way to spawn and manage a collection of threads USArray Data Processing Workshop
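To make the fork/thread contrast concrete, here is a minimal sketch (not part of the workshop materials; the file name is illustrative): fork() duplicates the whole process, so the child gets its own copy of memory, while an OpenMP parallel region spawns threads that all share the parent's memory.

    /* fork_vs_thread.c -- compile with something like: icc fork_vs_thread.c -openmp */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <omp.h>

    int counter = 0;

    int main(void)
    {
        /* Old-fashioned fork: a whole new process with its own copy of counter. */
        pid_t pid = fork();
        if (pid == 0) {                 /* child process */
            counter = 100;              /* invisible to the parent */
            printf("child sees counter = %d\n", counter);
            exit(0);
        }
        wait(NULL);
        printf("parent still sees counter = %d\n", counter);   /* prints 0 */

        /* OpenMP threads: the whole team shares the one counter. */
        #pragma omp parallel
        {
            #pragma omp atomic
            counter++;
        }
        printf("after the parallel region, counter = %d\n", counter);
        return 0;
    }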

  20. OpenMP Getting Started Exercise • Preliminaries: in a terminal window, cd to the test directory • export OMP_NUM_THREADS=8 • icc omp_hello.c -openmp -o hello • Run it: ./hello • Look at the source code together and discuss • Run a variant: export OMP_NUM_THREADS=20 then ./hello [Diagram: fork … join] USArray Data Processing Workshop
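The transcript does not reproduce omp_hello.c itself; a minimal sketch of what such a hello program typically looks like is below (the actual file distributed in the workshop may differ):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Fork: the master thread spawns a team of OMP_NUM_THREADS threads. */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();        /* my id within the team */
            int nthreads = omp_get_num_threads();  /* size of the team      */
            printf("Hello from thread %d of %d\n", tid, nthreads);
        }
        /* Join: only the master thread continues past the parallel region. */
        printf("Back to a single thread\n");
        return 0;
    }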

  21. You can even use this in, yes, FORTRAN [Diagram: fork … join]

      PROGRAM DOT_PRODUCT
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = I * 2.0
      ENDDO
      RESULT = 0.0
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)
      DO I = 1, N
        RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO NOWAIT
      PRINT *, 'Final Result= ', RESULT
      END

USArray Data Processing Workshop
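For anyone working in C rather than FORTRAN, an equivalent of the dot-product loop above might look like this (a sketch added for comparison, not part of the original slides):

    #include <stdio.h>
    #include <omp.h>

    #define N 100
    #define CHUNK 10

    int main(void)
    {
        float a[N], b[N], result = 0.0f;
        int i;

        /* Some initializations (same values as the FORTRAN version). */
        for (i = 0; i < N; i++) {
            a[i] = (i + 1) * 1.0f;
            b[i] = (i + 1) * 2.0f;
        }

        /* Threads take CHUNK iterations at a time; the reduction clause
           combines each thread's partial sum into result at the join. */
        #pragma omp parallel for default(shared) private(i) \
                schedule(static, CHUNK) reduction(+:result)
        for (i = 0; i < N; i++)
            result += a[i] * b[i];

        printf("Final Result= %f\n", result);
        return 0;
    }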

  22. Some basic issues in parallel codes • Synchronization • Are the tasks of each thread balanced? • A CPU is tied up whenever a thread waits for other threads to finish • Shared memory means two threads can try to alter the same data • Traditional threads use a mutex • OpenMP uses a simpler method (hang on – next slides) USArray Data Processing Workshop

  23. OpenMP Synchronization Constructs • MASTER: block executed only by master thread • CRITICAL: block executed by one thread at a time • BARRIER: each thread waits until all threads reach the barrier • ORDERED: block executed sequentially by threads USArray Data Processing Workshop
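A small C sketch (added for illustration, not from the slides) using three of these constructs in one parallel region:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int total = 0;

        #pragma omp parallel
        {
            /* MASTER: only the master thread (thread 0) runs this block. */
            #pragma omp master
            printf("master: the team has %d threads\n", omp_get_num_threads());

            /* BARRIER: nobody proceeds until every thread has arrived here. */
            #pragma omp barrier

            /* CRITICAL: one thread at a time updates the shared total. */
            #pragma omp critical
            total += omp_get_thread_num();
        }
        printf("sum of thread ids = %d\n", total);
        return 0;
    }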

  24. Data Scope Attribute Clauses • SHARED: variable is shared across all threads (access is not synchronized automatically; protect updates with CRITICAL, ATOMIC, or a REDUCTION) • PRIVATE: variable is replicated in each thread (no sharing, so no synchronization cost) • DEFAULT: change the default scoping of all variables in a region USArray Data Processing Workshop
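In practice the clauses look like this (a sketch, not from the slides); default(none) forces every variable to be scoped explicitly, which is exactly the advice given two slides ahead:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int n = 16;         /* read by every thread */
        int sum = 0;        /* shared result */
        int i, tmp;         /* each thread needs its own copy */

        #pragma omp parallel for default(none) \
                shared(n, sum) private(i, tmp)
        for (i = 0; i < n; i++) {
            tmp = i * i;          /* private scratch value */
            #pragma omp atomic    /* shared data still needs explicit protection */
            sum += tmp;
        }
        printf("sum of squares 0..%d = %d\n", n - 1, sum);
        return 0;
    }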

  25. Some Useful Library routines • omp_set_num_threads(integer) • omp_get_num_threads() • omp_get_max_threads() • omp_get_thread_num() • Others are implementation dependent USArray Data Processing Workshop
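A short sketch (not in the original slides) exercising these routines:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);   /* request a team of 4 for later regions */
        printf("max threads available: %d\n", omp_get_max_threads());
        printf("threads outside a parallel region: %d\n",
               omp_get_num_threads());   /* always 1 here */

        #pragma omp parallel
        printf("thread %d of %d checking in\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }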

  26. OpenMP Advice • Always explicitly scope variables • Never branch into/out of a parallel region • Never put a barrier in an if block • Avoid I/O in a parallel loop (it nearly guarantees a load imbalance) USArray Data Processing Workshop

  27. Exercise 2: OpenMP • The example programs are in ~/OMP_F_examples or ~/OMP_C_examples • Go to https://computing.llnl.gov/tutorials/openMP/ • Skip to step 4, compiler is “icc” or “ifort” • Work on this until I call an end to the exercise USArray Data Processing Workshop

  28. Next topic: MPI • MPI = Message Passing Interface • Can be used on a multicore CPU, but the main application is multiple nodes • The next slide is the source code for the MPI hello-world program we’ll run in a minute USArray Data Processing Workshop

  29.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int myrank;
    int ntasks;

    int main(int argc, char **argv)
    {
        /* Initialize MPI */
        MPI_Init(&argc, &argv);
        /* get number of workers */
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
        /* Find out my identity in the default communicator;
           each task gets a unique rank between 0 and ntasks-1 */
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Barrier(MPI_COMM_WORLD);
        fprintf(stdout, "Hello from MPI_BABY=%d\n", myrank);
        MPI_Finalize();
        exit(0);
    }

[Diagram: the same program is replicated on Node 1, Node 2, …]

USArray Data Processing Workshop

  30. Running mpi_baby

    cp -r /N/dc/scratch/usarray/MPI .
    mpicc mpi_baby.c -o mpi_baby
    mpirun -np 8 mpi_baby
    mpirun -np 32 -machinefile my_list mpi_baby

USArray Data Processing Workshop
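The my_list machinefile itself is not shown in the slides; with the OpenMPI stack visible on slide 32, a hostfile is usually just one node name per line, optionally with a slot count. The node names below are purely illustrative:

    node001 slots=8
    node002 slots=8
    node003 slots=8
    node004 slots=8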

  31.

C     AUTHOR: Blaise Barney
      program scatter
      include 'mpif.h'

      integer SIZE
      parameter(SIZE=4)
      integer numtasks, rank, sendcount, recvcount, source, ierr
      real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

C     Fortran stores this array in column major order, so the
C     scatter will actually scatter columns, not rows.
      data sendbuf /1.0,  2.0,  3.0,  4.0,
     &              5.0,  6.0,  7.0,  8.0,
     &              9.0, 10.0, 11.0, 12.0,
     &             13.0, 14.0, 15.0, 16.0 /

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

      if (numtasks .eq. SIZE) then
        source = 1
        sendcount = SIZE
        recvcount = SIZE
        call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf,
     &       recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
        print *, 'rank= ',rank,' Results: ',recvbuf
      else
        print *, 'Must specify',SIZE,' processors. Terminating.'
      endif

      call MPI_FINALIZE(ierr)
      end

From the man page: MPI_Scatter - Sends data from one task to all tasks in a group … the message is split into n equal segments, and the ith segment is sent to the ith process in the group.

USArray Data Processing Workshop

  32. Some Linux tricks to get more information:

    man -w MPI
    ls /N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/share/man/man3
        MPI_Abort MPI_Allgather MPI_Allreduce MPI_Alltoall ...
        MPI_Wait MPI_Waitall MPI_Waitany MPI_Waitsome
    mpicc --showme
        /N/soft/linux-rhel4-x86_64/intel/cce/10.1.022/bin/icc \
        -I/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/include \
        -pthread -L/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/lib \
        -lmpi -lopen-rte -lopen-pal -ltorque -lnuma -ldl \
        -Wl,--export-dynamic -lnsl -lutil -ldl -Wl,-rpath -Wl,/usr/lib64

USArray Data Processing Workshop

  33. MPI Advice • Never put a barrier in an if block • Use care with non-blocking communication, things can pile up fast USArray Data Processing Workshop
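To illustrate the second point, here is a minimal non-blocking ring exchange (a sketch, not from the slides): every MPI_Isend/MPI_Irecv must eventually be matched by a wait, or outstanding requests and buffers pile up:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, ntasks, sendval, recvval, left, right;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        /* Pass my rank around a ring: send right, receive from the left. */
        right = (rank + 1) % ntasks;
        left  = (rank - 1 + ntasks) % ntasks;
        sendval = rank;

        MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Overlap unrelated work here ... then complete BOTH requests
           before touching recvval or reusing sendval. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, recvval, left);
        MPI_Finalize();
        return 0;
    }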

  34. So, can I use MPI with OpenMP? • Yes you can; extreme care is advised • Some implementations of MPI forbid it • You can get killed by “oversubscription” real fast; I (Scott) have seen run time increase like N^2 • But sometimes you must… some FFTW libraries are OpenMP multithreaded, for example • As things are going, this caution is likely to disappear USArray Data Processing Workshop
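A minimal hybrid sketch (an illustration, not workshop code): MPI_Init_thread reports the thread-support level the MPI library actually provides, which is how you find out whether your implementation allows the mix at all. Keeping ranks-per-node times OMP_NUM_THREADS at or below the physical cores per node avoids the oversubscription penalty described above.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Ask for FUNNELED: only the master thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided < MPI_THREAD_FUNNELED && rank == 0)
            printf("warning: this MPI build only provides level %d\n", provided);

        /* Each MPI rank fans out into an OpenMP team for the compute part. */
        #pragma omp parallel
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }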

  35. Exercise: MPI • Examples are in ~/MPI_F_examples or ~/MPI_C_examples • Go to https://computing.llnl.gov/tutorials/mpi/ • Skip to step 6. MPI compilers are “mpif90” and “mpicc”; normal (serial) compilers are “ifort” and “icc”. • Compile your code: “make all” (overrides section 9) • To run an MPI code: “mpirun -np 8 <exe>” …or… • “mpirun -np 16 -machinefile <ask me> <exe>” • Skip section 12 • There is no evaluation form. USArray Data Processing Workshop

  36. Where were those again? • https://computing.llnl.gov/tutorials/openMP/excercise.html • https://computing.llnl.gov/tutorials/mpi/exercise.html USArray Data Processing Workshop

  37. Acknowledgements • This material is based upon work supported by the National Science Foundation under Grant Numbers 0116050 and 0521433. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF). • This work was supported in part by the Indiana Metabolomics and Cytomics Initiative (METACyt). METACyt is supported in part by Lilly Endowment, Inc. • This work was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment, Inc. • This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. USArray Data Processing Workshop
