Evolution of the NERSC SP System
NERSC User Services


  1. Evolution of the NERSC SP System, NERSC User Services
     • Original Plans
     • Phase 1
     • Phase 2
     • Programming Models and Code Porting
     • Using the System

  2. Original Plans: The NERSC-3 Procurement
     • A complete, reliable, high-end scientific system
     • High availability and a long MTBF
     • Fully configured: processing, storage, software, networking, support
     • Commercially available components
     • The greatest amount of computational power for the money
     • Can be integrated with the existing computing environment
     • Can evolve with the product line
     • Extensive benchmarking and acceptance testing were carried out

  3. Original Plans: The NERSC-3 Procurement
     • What we wanted:
       • >1 teraflop of peak performance
       • 10 terabytes of storage
       • 1 terabyte of memory
     • What we got in Phase 1:
       • 410 gigaflops of peak performance
       • 10 terabytes of storage
       • 512 gigabytes of memory
     • What we will get in Phase 2:
       • 3 teraflops of peak performance
       • 15 terabytes of storage
       • 1 terabyte of memory

  4. Hardware, Phase 1
     • 304 Power 3+ nodes (2-way Winterhawk 1 SMP nodes)
     • Node usage:
       • 256 compute/batch nodes = 512 CPUs
       • 8 login nodes = 16 CPUs
       • 16 GPFS nodes = 32 CPUs
       • 8 network nodes = 16 CPUs
       • 16 service nodes = 32 CPUs
     • 2 processors/node
     • 200 MHz clock
     • 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
       (256 compute nodes x 1.6 Gflops/node = 409.6 Gflops, the ~410 Gflops peak quoted earlier)
     • 64 KB L1 d-cache per CPU @ 5 nsec & 3.2 GB/sec
     • 4 MB L2 cache per CPU @ 45 nsec & 6.4 GB/sec
     • 1 GB RAM per node @ 175 nsec & 1.6 GB/sec
     • 150 MB/sec switch bandwidth
     • 9 GB local disk (two-way RAID)

  5. Hardware, Phase 2
     • 152 Power 3+ nodes (16-way Nighthawk 2 SMP nodes)
     • Node usage:
       • 128 compute/batch nodes = 2048 CPUs
       • 2 login nodes = 32 CPUs
       • 16 GPFS nodes = 256 CPUs
       • 2 network nodes = 32 CPUs
       • 4 service nodes = 64 CPUs
     • 16 processors/node
     • 375 MHz clock
     • 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 24 Gflops/node
       (128 compute nodes x 24 Gflops/node = 3072 Gflops, the ~3 Tflops peak quoted earlier)
     • 64 KB L1 d-cache per CPU @ 5 nsec & 3.2 GB/sec
     • 8 MB L2 cache per CPU @ 45 nsec & 6.4 GB/sec
     • 8 GB RAM per node @ 175 nsec & 14.0 GB/sec
     • ~2000 MB/sec switch bandwidth (estimated)
     • 9 GB local disk (two-way RAID)

  6. Programming Models, Phase 1
     • Phase 1 will rely on MPI, with threading also available:
       • OpenMP directives
       • Pthreads
       • IBM SMP directives
     • MPI now does intra-node communication efficiently
     • Mixed-model programming is not currently very advantageous
     • PVM and LAPI messaging systems are also available
     • SHMEM is "planned"
     • The SP has cache and virtual memory, which means:
       • There are more ways to reduce code performance
       • There are more ways to lose portability

  7. Programming Models, Phase 2
     • Phase 2 will offer more payback for mixed-model programming
     • Single-node parallelism is a good target for PVP users
     • Vector and shared-memory codes can be "expanded" into MPI
     • MPI codes can be ported from the T3E
     • Threading can be added within MPI
     • In both cases, re-engineering will be required to exploit new and different levels of granularity
     • This can be done along with increasing problem sizes

  8. Porting Considerations, part 1
     • Things to watch out for in porting codes to the SP
     • Cache
       • Not enough on the T3E to make worrying about it worth the trouble
       • Enough on the SP to boost performance, if it is used well
       • Tuning for cache is different from tuning for vectorization
       • False sharing caused by cache can reduce performance
     • Virtual memory
       • Gives you access to 1.75 GB of (virtual) RAM address space
       • To use all of virtual (or even real) memory, you must explicitly request "segments"
       • Causes performance degradation due to paging
     • Data types
       • Default sizes are different on PVP, T3E, and SP systems
       • "integer", "int", "real", and "float" must be used carefully
       • Best to say what you mean: "real*8", "integer*4"
       • Do the same in MPI calls: "MPI_REAL8", "MPI_INTEGER4" (see the sketch below)
       • Be careful with intrinsic function use as well
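
A minimal sketch of the data-type advice above (not from the original slides; the program name, array size, and use of MPI_BCAST are illustrative):

      program explicit_types
      implicit none
      include 'mpif.h'
      integer*4 :: ierr, my_id, nprocs
      real*8    :: buf(100)          ! say what you mean: 64-bit reals

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      buf = 0.0d0
      ! The MPI datatype must match the declared size exactly:
      call MPI_BCAST(buf, 100, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(ierr)
      end program explicit_types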

  9. Porting Considerations, part 2
     • More things to watch out for in porting codes to the SP
     • Arithmetic
       • Architecture tuning can help exploit special processor instructions
       • Both the T3E and SP can optimize beyond IEEE arithmetic
       • The T3E and PVP can also do fast reduced-precision arithmetic
       • Compiler options on the T3E and SP can force IEEE compliance
       • Compiler options can also throttle other optimizations for safety
       • Special libraries offer faster intrinsics
     • MPI
       • SP compilers and runtime will catch loose usage that was accepted on the T3E
       • Communication bandwidth on SP Phase 1 is lower than on the T3E
       • Message latency on SP Phase 1 is higher than on the T3E
       • We expect approximate parity with the T3E in these areas on the Phase 2 system
       • Limited number of communication ports per node: approximately one per CPU
       • "Default" versus "eager" buffer management in MPI_SEND (see the sketch below)
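
As an illustration of the buffer-management point (a minimal sketch, assuming exactly two MPI tasks; the message size and program name are made up). Code that has both tasks call MPI_SEND first only works while the message fits in the eager buffer; MPI_SENDRECV is correct regardless of how the library buffers messages:

      program eager_demo
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      integer :: ierr, my_id, other, status(MPI_STATUS_SIZE)
      real*8  :: sendbuf(n), recvbuf(n)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      other = 1 - my_id                 ! partner rank, assuming 2 tasks
      sendbuf = dble(my_id)

      ! Unsafe pattern (may stall for large n, when eager buffering stops):
      !   call MPI_SEND(sendbuf, n, MPI_REAL8, other, 0, MPI_COMM_WORLD, ierr)
      !   call MPI_RECV(recvbuf, n, MPI_REAL8, other, 0, MPI_COMM_WORLD, status, ierr)

      ! Safe pattern, independent of buffer management:
      call MPI_SENDRECV(sendbuf, n, MPI_REAL8, other, 0, &
                        recvbuf, n, MPI_REAL8, other, 0, &
                        MPI_COMM_WORLD, status, ierr)

      call MPI_FINALIZE(ierr)
      end program eager_demo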

  10. Porting Considerations, part 3
     • Compiling & linking
     • The compiler "version" depends on the language and the parallelization scheme
     • Language versions:
       • Fortran 77: f77, xlf
       • Fortran 90: xlf90
       • Fortran 95: xlf95
       • C: cc, xlc, c89
       • C++: xlC
       • MPI-included: mpxlf, mpxlf90, mpcc, mpCC
       • Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
     • Preprocessing can be ordered by compiler flag or source file suffix
     • Use the same driver consistently for all related compilations (see the sketch below); for example, the following may NOT produce a parallel executable:
       mpxlf90 -c *.F
       xlf90 -o foo *.o
     • Use the -bmaxdata:bytes option to get more than a single 256 MB segment (up to 7 segments, or ~1.75 GB, can be specified; only 3, or 0.75 GB, are real memory)
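
A consistent compile-and-link sequence might look like the sketch below (the program name foo and the seven-segment value 0x70000000 for -bmaxdata are illustrative; thread-safe codes would use the _r drivers instead):

      mpxlf90 -c *.F
      mpxlf90 -o foo *.o -bmaxdata:0x70000000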

  11. Porting: MPI
     • MPI codes should port relatively well
     • Use one MPI task per node or per processor:
       • One per node during porting
       • One per processor during production
     • Let MPI worry about where it is communicating to
     • Environment variables, execution parameters, and/or batch options can specify (see the sketch below):
       • # tasks per node
       • Total # tasks
       • Total # processors
       • Total # nodes
       • The communication subsystem in use:
         • User Space is best for batch jobs
         • IP may be best for interactive development runs
     • There is a debug queue/class in batch
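
One common way to set these is through POE environment variables and command-line flags; the sketch below is illustrative only (the task counts are made up, and the site's recommended settings may differ):

      # Interactive development run: 4 tasks, 2 per node, over IP
      export MP_PROCS=4
      export MP_TASKS_PER_NODE=2
      export MP_EUILIB=ip
      poe ./a.out

      # Production-style run: User Space protocol over the switch
      export MP_EUILIB=us
      poe ./a.out -procs 16 -tasks_per_node 2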

  12. Porting: Shared Memory
     • Don't throw away old shared-memory directives
       • OpenMP will work as is
       • Cray tasking directives will be useful as documentation
     • We recommend porting Cray directives to OpenMP (see the sketch below)
     • Even small-scale parallelism can be useful
       • Larger-scale parallelism will be available next year
     • If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing
       • We recommend MPI
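
As an illustration of the kind of translation involved (a sketch only: the Cray autotasking directive spelling is from memory and may differ from what appears in a given code, and the array and loop are made up):

      program cray_to_openmp
         implicit none
         integer, parameter :: n = 1000
         integer :: i
         real*8  :: a(n), b(n)

         a = 1.0d0
         b = 2.0d0

      ! Cray autotasking form (kept only as documentation):
      !     CMIC$ DO ALL PRIVATE(i) SHARED(a, b)
      ! OpenMP form, which the SP compilers understand:
      !$OMP PARALLEL DO PRIVATE(i) SHARED(a, b)
         do i = 1, n
            a(i) = a(i) + b(i)
         enddo
      !$OMP END PARALLEL DO

         print *, 'a(1) =', a(1)
      end program cray_to_openmp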

  13. From Loop-slicing to MPI, before...

      allocate(A(1:imax, 1:jmax))
!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do I = 1, imax
         do J = 1, jmax
            A(I,J) = deep_thought(A, I, J,…)
         enddo
      enddo

     • Sanity checking
       • Run the program on one CPU to get baseline answers
       • Run on several CPUs to see parallel speedups and check answers
     • Optimization
       • Consider changing memory access patterns to improve cache usage
       • How big can your problem get before you run out of real memory?

  14. From Loop-slicing to MPI, after...

      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
      allocate(A(my_imin : my_imax, my_jmin : my_jmax))
!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
      do I = my_imin, my_imax
         do J = my_jmin, my_jmax
            A(I,J) = deep_thought(A, I, J,…)
         enddo
      enddo

      ! Communicate the shared values with neighbors...
      ! (left_id, right_id, top_id, bottom_id are the neighbor task ranks,
      !  the *size variables are the halo sizes, and status is an MPI status
      !  array; all are assumed to be set up elsewhere.  Pairing sends and
      !  receives by odd/even rank keeps every send matched by a receive.)
      if (odd(my_id)) then
         call MPI_SEND(my_left(...),   leftsize,   MPI_REAL, left_id,   tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_right(...),  rightsize,  MPI_REAL, right_id,  tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...),    topsize,    MPI_REAL, top_id,    tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
      else
         call MPI_RECV(my_right(...),  rightsize,  MPI_REAL, right_id,  tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_left(...),   leftsize,   MPI_REAL, left_id,   tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...),    topsize,    MPI_REAL, top_id,    tag, MPI_COMM_WORLD, ierr)
      endif

  15. From Loop-slicing to MPI, after...
     • You now have one MPI task and many OpenMP threads per node (see the sketch below)
       • The MPI task does all the communicating between nodes
       • The OpenMP threads do the parallelizable work
       • Do NOT use MPI within an OpenMP parallel region
     • Sanity checking
       • Run on one node and one CPU to check baseline answers
       • Run on one node and several CPUs to see parallel speedup and check answers
       • Run on several nodes, one CPU per node, and check answers
       • Run on several nodes, several CPUs per node, and check answers
     • Scaling checking
       • Run a larger version of a similar problem on the same set of ensemble sizes
       • Run the same-sized problem on a larger ensemble
     • (Re-)consider your I/O strategy...
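
A sketch of run-time settings for this one-task-per-node model (the node and thread counts are made up; on a 2-CPU Phase 1 node, one thread per CPU is the natural choice):

      export MP_TASKS_PER_NODE=1
      export MP_PROCS=4            # 4 nodes -> 4 MPI tasks
      export OMP_NUM_THREADS=2     # threads per task, one per CPU on the node
      poe ./a.out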

  16. From MPI to Loop-slicing
     • Add OpenMP directives to the existing MPI code
     • Perform sanity and scaling checks, as before
     • This results in the same overall code structure as on the previous slides:
       • One MPI task and several OpenMP threads per node
     • For irregular codes, Pthreads may serve better, at the cost of increased complexity
     • Nobody really expects it to be this easy...

  17. Using the Machine, part 1
     • Somewhat similar to the Crays
     • Interactive and batch jobs are possible

  18. Using the Machine, part 2
     • Interactive runs
       • Sequential executions run immediately on your login node
       • Every login will likely put you on a different node, so be careful when looking for your executions: "ps" returns information only about the node you are logged into
       • Small-scale parallel jobs may be rejected if LoadLeveler cannot find the resources
     • There are two pools of nodes that can be used for interactive jobs:
       • Login nodes
       • A small subset of the compute nodes
     • Parallel execution can often be achieved by:
       • Trying again after an initial rejection
       • Changing the communication mechanism from User Space to IP
       • Using the other pool

  19. Using the Machine, part 3
     • Batch jobs
       • Currently, very similar in capability to the T3E:
         • Similar run times and processor counts
         • More memory available on the SP
       • Limits and capabilities may change as we learn the machine
     • LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E
       • Jobs are submitted, monitored, and cancelled by special commands
       • Each batch job requires a script that is essentially a shell script (see the sketch below)
         • The first few lines contain batch options that look like comments to the shell
         • The rest of the script can contain any shell constructs
       • Scripts can be debugged by executing them interactively
     • Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs at any given time
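
For illustration only, a minimal LoadLeveler script might look like the sketch below; the class name, node and task counts, network keyword, and time limit are assumptions rather than NERSC's actual settings, and should be checked against the site documentation:

      #!/usr/bin/ksh
      # LoadLeveler options look like comments to the shell:
      #@ job_type         = parallel
      #@ class            = debug
      #@ node             = 2
      #@ tasks_per_node   = 2
      #@ network.MPI      = css0,not_shared,us
      #@ wall_clock_limit = 00:30:00
      #@ output           = job.$(jobid).out
      #@ error            = job.$(jobid).err
      #@ queue

      # The rest is an ordinary shell script:
      poe ./a.out

Scripts written this way are submitted with llsubmit, monitored with llq, and cancelled with llcancel.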

  20. Using the Machine, part 4
     • File systems
       • Use the environment variables to let the system manage your file usage (see the sketch below)
       • Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient)
         • Medium performance, node-local
       • Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable)
         • High performance, located in GPFS
       • HPSS is available from batch jobs via HSI, and interactively via FTP, PFTP, and HSI
       • There are quotas on space and inode usage
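
A sketch of typical file usage inside a batch job (the file name is made up, and the exact hsi invocation should be checked against the HSI documentation):

      cd $SCRATCH            # high-performance GPFS space, transient
      poe ./a.out            # write large parallel output here
      hsi put results.dat    # archive to HPSS before the scratch space is purged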

  21. Using the Machine, part 5
     • The future?
     • The allowed scale of parallelism (CPU counts) may change
       • Maximum now = 512 CPUs, the same as on the T3E
     • The allowed duration of runs may change
       • Maximum now = 4 hours; maximum on the T3E = 12 hours
     • The size of possible problems will definitely change
       • More CPUs in Phase 1 than on the T3E
       • More memory per CPU, in both phases, than on the T3E
     • The amount of work possible per unit time will definitely change
       • CPUs in both phases are faster than those on the T3E
       • The Phase 2 interconnect will be faster than Phase 1's
     • Better machine management
       • Checkpointing will be available
       • We will learn what can be adjusted in the batch system
     • There will be more and better tools for monitoring and tuning
       • HPM, KAP, Tau, PAPI...
     • Some current problems will go away (e.g., memory-mapped files)
