LAMMPS Strong Scaling Performance Optimization on Blue Gene/Q

Presentation Transcript


  1. LAMMPS Strong Scaling Performance Optimization on Blue Gene/Q Paul Coffman pkcoff@us.ibm.com

  2. Quick Bio and Disclaimer • Member of the Blue Gene/Q software system test team at IBM Rochester, MN • Relevant experience includes porting, testing, and benchmarking applications for Argonne's acceptance of the 10-petaflop machine in December 2012 • Essentially a personal side project at the high-level direction of Nick Romero, in collaboration with Wei Jiang and Venkat Vishwanath (ALCF staff) at Argonne, off and on since February 2013 • Although I am an IBM employee and work on the software system test team for Blue Gene/Q, all work on this optimization project has been done on my own time as an educational endeavor into performance engineering and is not officially recognized or supported by IBM -- none of the information presented herein is representative of IBM • Work in progress

  3. Motivations and focus • Extensive user base for LAMMPS at Argonne on the 10-petaflop Blue Gene/Q system (Mira) • Strong-scaling improvement needed to support longer simulation time scales and faster MD timestepping • Computational improvements targeted at the short-range pairwise force fields and long-range electrostatic PPPM • IO improvement needed due to increasing simulation sizes and the need to study the evolution of the system over time – IO can exceed computation time • IO improvements targeted at the restart and dump commands

  4. Benchmark used for computational performance tuning and testing • 8.4 million atom poly(N-isopropylacrylamide) (PNIPAM) simulation • Temperature-sensitive polymer that undergoes a coil-to-globule transition as the temperature is raised • Uses lj/charmm/coul/long/omp for the short-range pairwise potential and PPPM for long-range Coulomb • Targets for performance optimization • Running with fewer MPI ranks and more threads alleviates the PPPM messaging bottleneck with larger blocks, so OpenMP is the first target for optimization • 64 threads across 16 cores on each node – 1 rank per node gives the smallest number of ranks

  5. OpenMP optimum settings • Threaded code regions are entered and exited several times per time step, and time steps are executed repeatedly, so there is constant overhead from thread synchronization (waking and sleeping) • The combination of BG_SMP_FAST_WAKEUP=YES and OMP_WAIT_POLICY=ACTIVE effectively keeps the threads running and holds them in a barrier while non-threaded regions of the code are being executed (see the sketch below) • The effect of these settings is magnified by the number of threads • Greatest effect at 1 rank per node and 64 threads
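The following minimal C++ sketch (not LAMMPS source) illustrates the pattern that makes the wake-up cost visible: a parallel region is entered and exited every time step, so the idle-thread policy controlled by those environment variables determines how quickly the workers rejoin. The loop body and sizes are placeholders.

```cpp
// Minimal sketch of the per-timestep threading pattern; the loop body and
// sizes are placeholders.  The relevant settings are runtime environment
// variables (e.g. in the job script):
//   BG_SMP_FAST_WAKEUP=YES  OMP_WAIT_POLICY=ACTIVE
#include <cstdio>
#include <vector>

int main() {
  const int nsteps = 1000, natoms = 100000;
  std::vector<double> f(natoms, 0.0);

  for (int step = 0; step < nsteps; ++step) {
    // Threaded force loop: without fast wake-up the worker threads must be
    // woken by the OS at every entry to this region.
    #pragma omp parallel for
    for (int i = 0; i < natoms; ++i)
      f[i] += 1.0e-3 * i;                 // placeholder for the force kernel

    // Serial work (communication, neighbor rebuild, ...) follows here; with
    // OMP_WAIT_POLICY=ACTIVE the workers spin in a barrier instead of
    // sleeping, so the next parallel region starts almost immediately.
  }
  std::printf("f[0] = %g\n", f[0]);
  return 0;
}
```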

  6. OpenMP thread data reduction prefetching • Files changed: thr_data.cpp • The basecode OpenMP implementation of the force fields employs a shared array of doubles strided by the thread ID • At force-field algorithm completion, the data needs to be reduced (summed) from a per-thread basis to a single data set • Significant performance impact for larger numbers of threads • Exploit the L1 D-cache line size of 64 bytes (8 doubles) • Manually unroll the inner loop to process 8 doubles at a time – the data is contiguous and the compiler is ineffective at doing this on its own (see the sketch below) • Process 8 data elements instead of 1 every iteration • Only pay to load the first element into L1; the next 7 are free, and since the data is contiguous the L1 prefetcher can be effectively utilized • Thread data reduction function: 1.8x speedup for 64 threads • Easily modified to match the cache-line size of other architectures
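A minimal sketch of the unrolled reduction, assuming each thread's copy of the data is a contiguous block of ndata doubles laid out one after another; names and layout are illustrative, not the actual thr_data.cpp code.

```cpp
// Sum the per-thread copies (threads 1..nthreads-1) into thread 0's block,
// with the inner loop manually unrolled by 8 doubles (one 64-byte L1 line).
void data_reduce_unrolled(double *data, int ndata, int nthreads)
{
  // assumes ndata is a multiple of 8; the real code would handle the tail
  for (int t = 1; t < nthreads; ++t) {
    const double *src = data + (long)t * ndata;
    for (int i = 0; i < ndata; i += 8) {
      // touching src[i] brings in the whole cache line; the next 7 loads hit L1
      data[i    ] += src[i    ];
      data[i + 1] += src[i + 1];
      data[i + 2] += src[i + 2];
      data[i + 3] += src[i + 3];
      data[i + 4] += src[i + 4];
      data[i + 5] += src[i + 5];
      data[i + 6] += src[i + 6];
      data[i + 7] += src[i + 7];
    }
  }
}
```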

  7. PairLJCharmmCoulLongOMP::compute speedup issues • Speedup comes from moving data up the memory latency hierarchy, mainly via the thread data reduction optimization • L2 latency of 84 cycles is a relatively harsh penalty for an L1 miss • Effective QPX SIMDization is not possible due to the format of the data • Only possible within each iteration, where the data comes in triplets for the 3 dimensions but quadruplets are needed • Not possible across iterations due to the non-contiguous data access pattern • Neighbor lists use indexes into arrays of all atomic data and ghosts owned by the proc, plus precomputed forces via a lookup table • Added some intrinsic compiler built-in prefetching calls (__prefetch_by_load, __dcbt) at key points for a minor improvement (see the sketch below) • With these optimizations only 35% of theoretical peak is achieved – instructions per cycle (IPC) is still low (0.58 with a 60/40 integer-load-store/floating-point mix) • Due mainly to high L1 misses (13%) from the discontiguous data access pattern described above • 1.25x speedup achieved – room for improvement with a significant application data restructuring effort
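A hedged illustration of where such a prefetch hint could go in an indirect neighbor-list gather (the hardware stream prefetcher cannot follow indirect indices). __dcbt is an IBM XL built-in; the function and variable names here are hypothetical and the prefetch distance of 4 is arbitrary.

```cpp
// Gather coordinates of the neighbors of one atom, hinting a future
// neighbor's cache line ahead of time.
void accumulate_neighbors(const int *jlist, int jnum,
                          const double (*x)[3], double *sum)
{
  for (int jj = 0; jj < jnum; ++jj) {
    const int j = jlist[jj];
#ifdef __IBMCPP__
    if (jj + 4 < jnum)
      __dcbt((void *)&x[jlist[jj + 4]][0]);  // hint: touch a future neighbor's coordinates
#endif
    sum[0] += x[j][0];
    sum[1] += x[j][1];
    sum[2] += x[j][2];
  }
}
```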

  8. PPPM optimization overview • Functions changed: remap_3d_create_plan, remap_3d (and the C++ wrappers around them) • FFT grid data transposition from cubic sub-bricks to pencils between 1D FFT calls during the 3D parallel FFT algorithm is a significant messaging bottleneck at scale • Replace point-to-point MPI communication with an optimized collective • OpenMP computation time also helped by the thread data reduction optimizations • Added some intrinsic compiler built-in prefetching calls (__prefetch_by_load, __dcbt) at key points for a minor improvement

  9. remap_3d_create_plan and remap_3d optimization details • remap_3d_create_plan creates the communication plan for the pencil data transposition • Each rank now has to pre-determine ALL ranks that contribute to the pencil in order to create the MPI communicator for the collective, not just the ones it trades data with in the basecode point-to-point version • MPI_Comm_group, then MPI_Group_incl with the list of ranks for the pencil, passed in to MPI_Comm_create • remap_3d uses MPI_Alltoallv with the sub-communicator created in remap_3d_create_plan instead of the MPI_Irecv / MPI_Send / MPI_Waitany loop (p2p) • MPI_Alltoallv is a highly optimized collective taking advantage of several hardware and software optimizations in the Blue Gene/Q messaging layer, where all ranks in the communicator send/recv data with all other ranks – 0's are specified for the send/recv counts of ranks that don't happen to communicate • Need to transform the send/recv lists from the MPI_Irecv / MPI_Send / MPI_Waitany loop into contiguous buffers of data, counts, and displacements for MPI_Alltoallv (see the sketch below)
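A minimal sketch, with hypothetical function names, of the two pieces described above: building the pencil sub-communicator from the full rank list, then performing the transpose exchange with a single MPI_Alltoallv in which non-communicating pairs simply carry counts of 0.

```cpp
#include <mpi.h>

// Build a communicator containing every rank that touches this pencil.
// Must be called by all ranks of 'world' because MPI_Comm_create is collective.
MPI_Comm make_pencil_comm(MPI_Comm world, int npencil, int *pencil_ranks)
{
  MPI_Group world_grp, pencil_grp;
  MPI_Comm pencil_comm;
  MPI_Comm_group(world, &world_grp);
  MPI_Group_incl(world_grp, npencil, pencil_ranks, &pencil_grp);
  MPI_Comm_create(world, pencil_grp, &pencil_comm);  // MPI_COMM_NULL on ranks not in the list
  MPI_Group_free(&pencil_grp);
  MPI_Group_free(&world_grp);
  return pencil_comm;
}

// Exchange the transpose data in one collective call.  The counts and
// displacements are assembled from the old point-to-point send/recv lists;
// pairs of ranks that never exchanged data keep counts of 0.
void remap_exchange(double *sendbuf, int *sendcounts, int *sdispls,
                    double *recvbuf, int *recvcounts, int *rdispls,
                    MPI_Comm pencil_comm)
{
  MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                recvbuf, recvcounts, rdispls, MPI_DOUBLE, pencil_comm);
}
```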

  10. PMPI metrics for PPPM computation on 4 racks – ~2x speedup

  11. All computational optimization speedup factors

  12. Blue Gene/Q IO test environment notes • Goal of the IO optimization is to effectively utilize the hardware bandwidth • Test environment in the IBM Rochester lab: 128-node block, 1 IO node, 2 IO links, InfiniBand adapter, GPFS on a DDN SFA12ke file server • Theoretical peak ~2.9 GB/s: 2 IO links from the compute nodes to the IO node at 4 GB/s, IO node to the file server at 3 GB/s via InfiniBand, then some GPFS overhead • Controlled environment – no other users on the file server, no other compute nodes using the IO node • Represents a common configuration

  13. Benchmark used for IO performance tuning and testing • Based on the sample LJ 3D melt input file from the LAMMPS website – simple and fast Lennard-Jones potential on an fcc lattice – 8 million atoms • Read restart, binary all-atom dump, xyz all-atom dump, and write restart commands • Further tuning and weak scaling planned on Mira up to 16 racks with 1 billion atoms • 8 million atoms is proportional for 128 nodes • For the dump-command baseline, the optimum ranks-per-file ratio was determined to be 128, so run at 1 rank per node on 128 nodes and just use the serial (non-multifile) codepath for a fair comparison with the optimizations

  14. MPI-IO / ROMIO notes • ROMIO is a high-performance, portable MPI-IO implementation built by Argonne • Included with the standard MPICH2 library • Provides routines that allow multiple MPI ranks to read/write data to a single file in parallel • Underlying filesystem interaction is POSIX IO • Collectives provide further optimization by funneling data through an optimum number of POSIX file streams • A subset of ranks (aggregators) actually writes to the file system • MPI communication transposes the data to the aggregators • Ideally write large contiguous chunks • A lot of tuning between the aggregators and the file server is possible – on the compute-node side the aggregators should be balanced across the IO links, particularly at scale (future work); see the hint-setting sketch below
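As a hedged illustration of aggregator tuning (not code from this work), ROMIO exposes hints such as cb_nodes and romio_cb_write through an MPI_Info object passed to MPI_File_open; the values below are examples only, and the casts accommodate older MPI-2 signatures.

```cpp
#include <mpi.h>

// Open an output file collectively with ROMIO collective-buffering hints.
MPI_File open_dump_file(MPI_Comm comm, const char *fname)
{
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, (char *)"romio_cb_write", (char *)"enable");  // force collective buffering
  MPI_Info_set(info, (char *)"cb_nodes", (char *)"8");             // number of aggregators (example value)

  MPI_File fh;
  MPI_File_open(comm, (char *)fname, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
  MPI_Info_free(&info);
  return fh;
}
```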

  15. MPI-IO implementation details • Classes modified: WriteRestart, ReadRestart, the Dump base class, and initially the DumpAtom and DumpXYZ styles • Dump file contents unchanged from the base code • Restart file contents are different • Need to record rank offsets for the data to be read properly • A mismatch between the number of write ranks and read ranks is still supported • Header output unchanged, via the base code POSIX calls • POSIX file pointer is closed, followed by MPI_File_open • Pre-allocate the file by doing MPI_File_open from rank 0, MPI_File_set_size, and MPI_File_close from rank 0 • Informs the file system of the amount of data expected and therefore cuts down on the amount of control traffic • MPI_Scan is used with each rank's send_size to calculate the offset, which is saved for later use doing the actual write; the last rank (nproc-1) sends its recv_buf to rank 0 to calculate the file size (see the sketch below)
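A minimal sketch of the offset and pre-allocation scheme described above; it is illustrative rather than the modified WriteRestart code, assumes more than one rank, and simplifies the header handling to a fixed byte count.

```cpp
#include <mpi.h>

// Return this rank's byte offset into the output file; rank 0 also
// pre-allocates the file once the total size is known.
MPI_Offset compute_write_offset(MPI_Comm comm, MPI_Offset header_bytes,
                                MPI_Offset my_bytes, const char *fname)
{
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  // inclusive prefix sum of byte counts; subtract my own count to get my start
  MPI_Offset scan_end = 0;
  MPI_Scan(&my_bytes, &scan_end, 1, MPI_OFFSET, MPI_SUM, comm);
  MPI_Offset my_offset = header_bytes + scan_end - my_bytes;

  // the last rank knows the total payload size; forward it to rank 0
  if (rank == nprocs - 1) MPI_Send(&scan_end, 1, MPI_OFFSET, 0, 0, comm);
  if (rank == 0) {
    MPI_Recv(&scan_end, 1, MPI_OFFSET, nprocs - 1, 0, comm, MPI_STATUS_IGNORE);
    MPI_Offset total = header_bytes + scan_end;

    // pre-allocate from rank 0 only: tells the file system how much data to expect
    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, (char *)fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_size(fh, total);
    MPI_File_close(&fh);
  }
  return my_offset;
}
```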

  16. MPI-IO implementation details continued • The MPI_File_write_at_all collective is utilized and tuned for the per-rank data • Takes a byte offset into the file (need to explicitly compute the offset for each rank) • MPI primitive data types written (ints, longs, and doubles) • Each rank writes 1 contiguous chunk of data • No need for derived types and file views, which have performance overhead • BGLOCKLESSMPIO_F_TYPE=0x47504653 environment variable setting • Disables file locking in the MPI-IO layer, relying solely on the GPFS layer • For text dumps the basecode uses fprintf (computationally expensive) 1 line at a time • The dump code is single-threaded, so replace it with OpenMP-threaded sprintf into an array of buffers that are concatenated for 1 large contiguous write (see the sketch below) • Significant computational performance improvement, especially when using a lot of threads
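A hedged sketch of the threaded text formatting plus a single collective write; names are hypothetical, the line format is simplified, and the per-rank byte offset is assumed to have been computed beforehand (e.g. with the MPI_Scan scheme on the previous slide).

```cpp
#include <mpi.h>
#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Format this rank's atoms in parallel, then write one contiguous chunk.
void write_text_chunk(MPI_File fh, MPI_Offset my_offset,
                      const double (*x)[3], int nlocal)
{
  const int nthreads = omp_get_max_threads();
  std::vector<std::string> chunk(nthreads);

  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    // contiguous block of atoms per thread so concatenation preserves order
    const int per = (nlocal + nthreads - 1) / nthreads;
    const int lo = tid * per;
    const int hi = std::min(nlocal, lo + per);
    char line[128];
    for (int i = lo; i < hi; ++i) {
      int n = std::snprintf(line, sizeof(line), "%g %g %g\n",
                            x[i][0], x[i][1], x[i][2]);
      chunk[tid].append(line, n);   // each thread appends only to its own buffer
    }
  }

  // concatenate the per-thread buffers into one contiguous write buffer
  std::string buf;
  for (int t = 0; t < nthreads; ++t) buf += chunk[t];

  MPI_File_write_at_all(fh, my_offset, (void *)buf.data(), (int)buf.size(),
                        MPI_CHAR, MPI_STATUS_IGNORE);
}
```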

  17. IO benchmark results – better but still room for improvement

  18. Conclusion • Work in progress • Significant performance bottlenecks identified and lessened in computation, communication, and IO on Blue Gene/Q • With the exception of the OpenMP parameter changes and the intrinsic prefetching calls, all optimizations should be applicable to other platforms, with varying results

  19. Additional Material

  20. Development environment and methodologies • All work done in the Rochester development lab (various systems totaling ~4 racks of nodes) and on Argonne's Mira • XL C++ cross-compiler for Blue Gene/Q version 12.1 with the -O3, -qhot (performs high-order loop analysis and transformations), and -qunroll (tells the compiler to more aggressively search for loop-unrolling opportunities) optimization options • Profiling: LAMMPS metrics, gprof, internally developed PMPI libraries, hardware performance counter event gathering, manual cycle counting (GetTimeBase), compiler listings of assembly code

  21. Blue Gene/Q computational hardware performance notes • System-on-chip design combining CPUs, caches, network, and a messaging unit on a single chip • Sixteen 64-bit PowerPC A2 cores running at 1.6 GHz are available for application use, each with 4 hardware threads, supporting a total of up to 64 OpenMP threads per node • Each core has a SIMD quad floating-point unit (QPX), an L1 prefetching unit (L1P) supporting both linear stream and list prefetching, and a wake-up unit to reduce thread synchronization overhead; there are two execution units, integer/load/store and floating-point, each of which can execute one instruction per cycle • Each core has its own 16 KB L1 data cache and shares a 32 MB L2 cache connected to all cores • Memory hierarchy latency: L1 is 6 cycles, L1P is 24 cycles, L2 is ~82 cycles, and main memory is ~350 cycles • The QPX unit allows 4 FMAs (8 flops) per cycle, which at 1.6 GHz translates to a peak performance of 12.8 GFlops per core, or 204.8 GFlops per node • Each node contains a compute chip and 16 GB of DDR3 memory; a node board is composed of 32 nodes, 16 node boards compose a midplane, and 2 midplanes compose a rack, totaling 1024 nodes and ~210 TF • Node-to-node communication is facilitated by a 5-D torus network; each compute node has a total of 20 communication links (10 send and 10 receive) with a combined bandwidth of 40 GB/s [2]
