
Application Performance Analysis on Blue Gene/L

Jim Pool (P.I.), Maciej Brodowicz, Sharon Brunett, Tom Gottschalk, Dan Meiron, Paul Springer, Thomas Sterling, Ed Upchurch


Presentation Transcript


  1. Application Performance Analysis on Blue Gene/L
     Jim Pool (P.I.), Maciej Brodowicz, Sharon Brunett, Tom Gottschalk, Dan Meiron, Paul Springer, Thomas Sterling, Ed Upchurch

  2. Caltech’s Role in Blue Gene/L Project
     • Understand implications of BG/L network architecture & drive results from real-world ASCI applications
     • Develop statistical models of applications, processors as message generators, and the network
     • Focus on
       • Application communications distribution
       • Network contention as a function of load, size and adaptive routing
     • Represent 64K nodes explicitly in the statistical model
     • Create trace analysis tools to characterize applications
       • Extensible Trace Facility (ETF)

  3. Blue Gene/L Node

  4. Blue Gene/L Network

  5. ETF Built-in Trace Options
     • MPI events
       • All point-to-point communications (MPI-1)
       • All collective communications (MPI-1)
       • Non-blocking request tracking
       • Communicator creation and destruction
       • MPI datatype decoding (requires MPI-2)
       • Languages: C, Fortran
       • Easy instrumentation of applications
     • Memory reference and program execution tracing
       • Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
       • Tracking of scalar variables
       • Read and write accesses to individual scalars and array elements as well as contiguous vectors of elements
       • Function calls
       • Program execution phases

  6. ETF Tracing Example for Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)
     • Parallel MHD fluid code solves the equations of hydrodynamics and the resistive Maxwell’s equations
     • Part of a larger application which computes dynamic responses to strong shock waves impinging on target materials
     • Fortran 90 + MPI
     • MPI Cartesian communicators
     • Nearest-neighbor communications use non-blocking send/recv
     • MPI Allreduce for calculating stable time steps
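The last bullet of slide 6 (MPI Allreduce for stable time steps) can be illustrated with a minimal sketch in plain Python, without MPI. Each rank computes a local stability limit, and a min-reduction hands every rank the same global step. The CFL-style formula, the `cfl` factor, the sample data, and all function names are assumptions made for illustration; they are not taken from the MHD code.

```python
def local_stable_dt(dx, max_wave_speed, cfl=0.5):
    """CFL-style limit for one rank: dt <= cfl * dx / fastest local wave speed.

    This criterion is an assumed stand-in, not the MHD code's actual formula.
    """
    return cfl * dx / max_wave_speed


def allreduce_min(values):
    """Stand-in for MPI_Allreduce with MPI_MIN: every rank gets the global minimum."""
    global_min = min(values)
    return [global_min for _ in values]


# Hypothetical per-rank data: (finest local cell size, fastest local wave speed).
rank_data = [(0.10, 2.0), (0.05, 2.0), (0.10, 8.0), (0.05, 1.0)]

local_dts = [local_stable_dt(dx, speed) for dx, speed in rank_data]
global_dts = allreduce_min(local_dts)

for rank, (local, agreed) in enumerate(zip(local_dts, global_dts)):
    print(f"rank {rank}: local dt = {local:.5f}, agreed dt = {agreed:.5f}")
```

The reduction matters because a rank holding refined cells or fast waves limits the step for everyone: all ranks must advance with the smallest local limit to stay stable.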

  7. AMR MHD: Communication Profile
     20 time steps on 32 processors, 128x128 cells
     [Figure: two panels, Max. level = 1 and Max. level = 2]

  8. Lennard-Jones Molecular Dynamics
     • Short-range molecular dynamics application simulating Newtonian interactions in large groups of atoms
       • production code from Sandia National Lab
     • Simulations are large in two dimensions
       • number of atoms and number of time steps
     • Spatial decomposition case selected
       • each processing node keeps track of the positions and movement of the atoms in a 3-D box
     • Computations carried out in a single time step correspond to femtoseconds of real time
       • a meaningful simulation of the evolution of the system’s state typically requires thousands of time steps
     • Point-to-point MPI messages are exchanged across each of the 6 sides of the box every time step
     • Code is written in Fortran and MPI
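The 6-sided exchange in slide 8 implies that each box has one communication partner per face. A minimal sketch of finding those partners, assuming periodic boundaries and a row-major numbering of boxes in the process grid (both are assumptions; the slides do not state the LJS code's rank layout, and the function names are hypothetical):

```python
def rank_of(coords, dims):
    """Row-major rank of the box at (x, y, z) in a dims = (nx, ny, nz) grid."""
    x, y, z = coords
    nx, ny, nz = dims
    return (x * ny + y) * nz + z


def face_neighbors(coords, dims):
    """Ranks across the 6 faces of a box, wrapping around in each dimension."""
    neighbors = {}
    for axis, name in enumerate("xyz"):
        for step, sign in ((-1, "-"), (1, "+")):
            shifted = list(coords)
            shifted[axis] = (shifted[axis] + step) % dims[axis]
            neighbors[sign + name] = rank_of(shifted, dims)
    return neighbors


dims = (4, 4, 4)
corner = face_neighbors((0, 0, 0), dims)
print(corner)
```

In an MPI code the same information would normally come from a Cartesian communicator rather than hand-written index arithmetic; the sketch only makes the neighbor pattern explicit.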

  9. Lennard-Jones Molecular Dynamics
     [Figure panels: Communication Steps; Typical Grid Cell and Cutoff Radius; Computational Cycle Model]

  10. LJS Single Processor BG/L Performance: Original Code vs. Tuned for BG/L
      [Figure: Improvement (%), 0 to 12, vs. Number of Atoms per BG/L CPU, 15,625 to 500,000; annotation: good cache reuse]

  11. LJS Molecular Dynamics Performance: Fixed Problem Size of 1 Billion Atoms
      [Figure: Time per single iteration (ms), split into Compute Time and Communications Time, vs. Number of BG/L CPUs, 2k to 64k]

  12. LJS Speedup: BG/L vs. ASCI Red 3200 Nodes, 1 Billion Atom Problem
      [Figure: Speedup, 0 to 80, vs. Number of BlueGene/L Nodes, 2k to 64k]

  13. LJS Communications Time: 500,000 Atoms per BG/L Node
      [Figure: Communications Time Per Iteration (msecs), 0 to 60, vs. BG/L Configuration: 4x4x4 (64 BGL Nodes), 8x8x8 (512 BGL Nodes), 16x16x16 (4096 BGL Nodes); series: Physical Nearest Neighbor Mapping, Random Mapping]
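Slide 13 compares a physical nearest-neighbor mapping with a random mapping. One way to see why the mapping matters is to count torus hops between boxes that exchange messages. The sketch below does this for a 4x4x4 torus; it uses hop count as a rough proxy for communication cost, which is an assumption, and it does not model contention, adaptive routing, or the measured times in the figure. All names are hypothetical.

```python
import random


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes of a 3-D torus."""
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))


def all_coords(dims):
    nx, ny, nz = dims
    return [(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)]


def mean_neighbor_hops(placement, dims):
    """Average hops between each logical box and its 6 face neighbors.

    placement maps a logical box coordinate to a physical torus coordinate.
    """
    total = 0
    count = 0
    for box in placement:
        for axis in range(3):
            for step in (-1, 1):
                other = list(box)
                other[axis] = (other[axis] + step) % dims[axis]
                total += torus_hops(placement[box], placement[tuple(other)], dims)
                count += 1
    return total / count


dims = (4, 4, 4)
boxes = all_coords(dims)

# Nearest-neighbor mapping: logical box (x, y, z) sits on physical node (x, y, z).
nearest = {box: box for box in boxes}

# Random mapping: the same boxes scattered over the nodes (seeded for repeatability).
shuffled = boxes[:]
random.Random(0).shuffle(shuffled)
scattered = dict(zip(boxes, shuffled))

nearest_hops = mean_neighbor_hops(nearest, dims)
random_hops = mean_neighbor_hops(scattered, dims)
print(f"nearest-neighbor mapping: {nearest_hops:.2f} hops on average")
print(f"random mapping:           {random_hops:.2f} hops on average")
```

With the identity placement every exchange is a single hop, while a scattered placement forces messages across several links, where they also compete with other traffic.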

  14. What is QMC and Why is it a Good Fit for BG/L?
      • QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
      • Developed at Caltech by Bill Goddard’s ASCI Material Properties group
      • Interesting characteristics
        • Low memory requirements
        • After initialization, highly parallel and scalable
        • Minimal set of MPI calls required
          • Non-blocking p2p, reduction, probe, communicator, collective calls
        • No communications during QMC working steps
        • Communicating convergence statistics takes 7200 bytes regardless of problem size and node count
        • Code already ported to many platforms (Linux, AIX, IRIX, etc.)
        • C++ and MPI sources

  15. Iterative QMC Algorithm
      For each processor do:
          Steps = Total Steps / number of processors
          Generate walkers
          Equilibrate walkers
          for each step
              generate QMC statistics
          send QMC statistics to master node
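The algorithm on slide 15 can be sketched in plain Python, with a loop over ranks standing in for the processors and a list standing in for messages to the master node. This is a toy, not the Caltech QMC code: the sampled density exp(-x²), the "local energy" x², the walker counts, and every name are assumptions. The sketch also assumes statistics are sent once after the working steps, in line with slide 14's note that no communication happens during them.

```python
import math
import random


def run_walkers(rank, steps, n_walkers=8, equil_steps=50):
    """One processor's share of the work; returns a fixed-size statistics record."""
    rng = random.Random(rank)  # independent random stream per rank
    walkers = [rng.uniform(-1.0, 1.0) for _ in range(n_walkers)]  # generate walkers

    def move(x):
        # Metropolis step for a toy density proportional to exp(-x*x).
        trial = x + rng.uniform(-0.5, 0.5)
        accept = rng.random() < math.exp(x * x - trial * trial)
        return trial if accept else x

    for _ in range(equil_steps):  # equilibrate walkers
        walkers = [move(x) for x in walkers]

    count, total, total_sq = 0, 0.0, 0.0
    for _ in range(steps):  # working steps: no communication
        walkers = [move(x) for x in walkers]
        for x in walkers:
            value = x * x  # toy "local energy"
            count += 1
            total += value
            total_sq += value * value
    return (count, total, total_sq)  # same size for any step or walker count


def merge(records):
    """Master node: combine per-rank records into a global mean and variance."""
    count = sum(r[0] for r in records)
    total = sum(r[1] for r in records)
    total_sq = sum(r[2] for r in records)
    mean = total / count
    variance = total_sq / count - mean * mean
    return count, mean, variance


total_steps = 4000
n_procs = 4
steps_per_proc = total_steps // n_procs  # Steps = Total Steps / number of processors

records = [run_walkers(rank, steps_per_proc) for rank in range(n_procs)]  # "send"
count, mean, variance = merge(records)
print(f"{count} samples, mean = {mean:.3f}, variance = {variance:.3f}")
```

Each rank ships only running sums, so the message size is independent of how many steps or walkers it processed. That is the property slide 14 describes with its fixed 7200-byte statistics payload; the three-number record here is only an analogy for it.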

  16. QMC Communications Time for 100,000 Steps Per Node (Reduce Using the Torus)
      [Figure: Time (seconds), 0.001 to 1, vs. BG/L Configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K)]

  17. Future Application Porting and Analysis for BG/L
      • ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum
      • Address memory constraints, grain load imbalance and MPI_Waitall() efficiency as we port/tune to BG/L
        • good stress test for BG/L robustness
      • Scalable simulation of polycrystalline response with assumed grain shape. The grain shape is the space-filling polyhedron given by the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL’s IBM SP3, frost.
