- 50 Views
- Uploaded on
- Presentation posted in: General

Leading Computational Methods on Modern Parallel Systems

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Leading Computational Methods on Modern Parallel Systems

Leonid Oliker

Lawrence Berkeley National Laboratory

Joint work with:

Julian Borrill, Andrew Canning, Stephane Ethier, Art Mirin, David Parks, John Shalf, David Skinner, Michael Wehner, Patrick Worley

- Stagnating application performance is well-know problem in scientific computing
- By end of decade numerous mission critical applications expected to have 100X computational demands of current levels
- Many HEC platforms are poorly balanced for demands of leading applications
- Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology

- Traditional superscalar trends slowing down
- Mined most benefits of ILP and pipelining,Clock frequency limited by power concerns

- Achieving next order of magnitude in computing will require O(100K) - O(1M) processors: interconnect that scale nonlinearly in cost (crossbar, fattree) become impractical

- Microbenchmarks, algorithmic kernels, performance modeling and prediction, are important components of understanding and improving architectural efficiency
- However full-scale application performance is the final arbiter of system utility and necessary as baseline to support all complementary approaches
- Our evaluation work emphasizes full applications, with real input data, at the appropriate scale
- Requires coordination of computer scientists and application experts from highly diverse backgrounds
- Allows advanced optimizations for given architectural platforms

- In-depth studies reveal limitation of compilers, operating systems, and hardware, since all of these components must work together at scale
- Our recent efforts have focused on the potential of modern parallel vector systems
- Effective code vectorization is an integral part of the process
- First US team to conduct Earth Simulator performance study

- High memory bandwidth
- Allows systems to effectively feed ALUs (high byte to flop ratio)

- Flexible memory addressing modes
- Supports fine grained strided and irregular data access

- Vector Registers
- Hide memory latency via deep pipelining of memory load/stores

- Vector ISA
- Single instruction specifies large number of identical operations

- Vector architectures allow for:
- Reduced control complexity
- Efficiently utilize large number of computational resources
- Potential for automatic discovery of parallelism

- Can some of these features can be added to modern micros?
However: most effective if sufficient regularity discoverable in program structure

- Suffers even if small % of code non-vectorizable (Amdahl’s Law)

- Custom vectors : High Bandwidth, Flexible addressing, Many Register, vector ISA
- Less control complexity, use many ALUs, auto parallelism
- ES shows best balance between memory and peak performance
- Data caches of superscalar systems and X1(E) potential reduce mem costs
- X1E: 2 MSP’s per MCM - increases contention for memory and interconnect

- A key ‘balance point’ for vector systems is the scalar:vector ratio (8:1 or even 32:1)
- Opteron/IB shows best balance for superscalar, Itanium2/Quadrics lowest latency
- Cost is a critical metric - however we are unable to provide such data
- Proprietary, pricing varies based on customer and time frame
- Poorly balanced systems cannot solve certain class of problems/resolutions regardless of processor count!

- Examining candidate ultra-scale applications with abundant data parallelism
- Codes designed for superscalar architectures, required vectorization effort
- ES use requires minimum vectorization and parallelization hurdles

- Numerical solution of Einstein’s equations from theory of general relativity
- Among most complex in physics: set coupled nonlinear hyperbolic & elliptic systems w/ thousands of terms
- CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes
- Evolves PDE’s on regular grid using finite differences
- Gravitational Wave Astronomy: exciting new field about to be born, w/ new fundamental insight into universe

Visualization of grazing collision of two black holes

Developed at Max Planck Institute, vectorized by John Shalf LBNL

- SX8 attains highest per-processor performance ever attained for Cactus: 58X Power3!
- Vector performance related to x-dim (vector length)
- Excellent scaling on ES/SX8 using fixed data size per proc (weak scaling)
- Opens possibility of computations at unprecedented scale

- X1 surprisingly poor (~6x slower than SX8) - low ratio scalar:vector
- Unvectorized boundary, required 15% of runtime on SX8 and 30+% on X1
- < 5% for the scalar version: unvectorized code can quickly dominate cost
- To date unable to successfully execute code on X1E

- Poor superscalar performance despite high computational intensity
- Register spilling due to large number of loop variables
- Prefetch engines inhibited due to multi-layer ghost zones calculations

- LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
- Performs 2D/3D simulation of high temperature plasma
- Evolves from initial conditions and decaying to form current sheets
- Spatial grid coupled to octagonal streaming lattice
- Block distributed over processor grid
- Main computational components:
- Collision, Stream, Interpolation

- Vectorization: loop interchange, unrolling
- X1 compiler loop reordered automatically

Evolution of vorticity into turbulent structures

Ported by Jonathan Carter, developed by George Vahala’s group College of William & Mary

- Not unusual to see vector achieve > 40% peak while superscalar architectures achieve ~ 10%
- There exists plenty of computation, however large working set causes register spilling scalars
- Examining cache-oblivious algorithms for cache-based systems

- Unlike superscalar approach, large vector register sets hide latency
- Opteron shows impressive superscalar performance, 2X speed vs. Itanium2
- Opteron has >2x STREAM BW, and Itanium2 cannot store FP in L1 cache

- ES sustains 68% of peak up to 4800 processors: 26TFlops - SC05 Gorden Bell FinalistThe highest performance ever attained for this code by far
- SX8 shows highest raw performance, but ES shows highest efficiency
- SX8: Commodity DDR2-SDRAM vs. ES: high performance custom FPLRAM

- X1E achieved same performance as X1 using original code version
- By turning off caching resulted in about 10% improvement over X1

- Comparison of bank conflicts shows that ES custom memory has clear advantages

- Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
- Goal magnetic fusion is burning plasma power plant producing cleaner energy
- GTC solves 3D gyroaveraged kinetic system w/ particle-in-cell approach (PIC)
- PIC scales N instead of N2 – particles interact w/ electromagnetic field on grid
- Allows solving equation of particle motion with ODEs (instead of nonlinear PDEs)
- Vectorization inhibited: multiple particles may attempt to concurrently update same point in charge deposition

Whole volume and cross section of electrostatic potential field, showing elongated turbulence eddies

Developed at PPPL, vectorized/optimized by Stephane Ethier and ORNL/Cray/NEC

Vectorization:

- Particle charge deposited amongst nearest grid points
- Several particles can contribute to same grid point preventing vectorization
- Solution: VLEN copies of charge deposition array with reduction after main loop
- Greatly increases memory footprint (8x)

- GTC originally optimized for superscalar SMPs using MPI/OpenMP
- However: OpenMP severely increase memory for vector implementation
- Vectorization and thread-level parallelism compete w/ each other

- Previous vector experiments limited to only 64-way MPI parallelism
- 64 is optimal domains for 1D toroidal (independent of # particles)

- Updated GTC algorithm
- Introduces a third level of parallelism:
- Algorithm splits particles between several processors (within 1D domain)
- Allows increase concurrency and number of studied particles

- Larger particle simulations allow increase resolution studies
- Particles not subject to Courant condition (same timestep)
- Allows multiple species calculations

- New decomposition algorithm efficiently utilizes high P (as opposed to 64 on ES)
- Breakthrough of Tflop barrier on ES for important SciDAC code
- 7.2 Tflop/s on 4096 ES processors
- SX8 highest raw performance (of any architecture)

- Opens possibility of new set of high-phase space-resolution simulations
- Scalar architectures suffer from low CI, irregular data access, and register spilling
- Opteron/IB is 50% faster than Itanium2/Quadrics and only 1/2 speed of X1
- Opteron: on-chip memory controller and caching of FP L1 data

- X1 suffers from overhead of scalar code portions
- Original (unmodified) X1 version performed 12% *slower* on X1E
- Additional optimizations increased performance by 50%!
- Recent ORNL work increases performance additional 75%

- SciDAC code, HPCS benchmark

- First-principles quantum mechanical total energy calc using pseudopotentials & plane wave basis set
- Density Functional Theory to calc structure & electronic properties of new materials
- DFT calc are one of the largest consumers of supercomputer cycles in the world
- 33% 3D FFT, 33% BLAS3, 33% Hand coded F90
- Part of calculation in real space other in Fourier space
- Uses specialized 3D FFT to transform wavefunction

Conduction band minimum electron state forCadmium Selenide (CdSe) quantum dot

- Global transpose Fourier to real space Multiple
- FFTs: reduce communication latency
- Vectorize across (not within) FFTs
- Custom FFT: only nonzeros sent

Developed by Andrew Canning with Louie and Cohen’s groups (LBNL, UCB)

- All architectures generally perform well due to computational intensity of code (BLAS3, FFT)
- ES achieves highest overall performance to date: 5.5Tflop/s on 2048 procs
- Main ES/SX8 advantage for this code is fast interconnect
- Allows never before possible, high resolution simulations
- CdSE Q-dot: Largest cell-size atomistic experiment ever run using PARATEC
- Important uses: can be attached to organic molecules as electronic dye tags

- SX8 achieves highest per-processor performance (>2x X1E)
- X1/X1E shows lowest % of peak
- Non-vectorizable code much more expensive on X1/X1E (32:1)
- Lower bisection bandwidth to computational ratio (4D-hypercube)
- X1 performance is comparable to Itanium2

- Itanium2 outperforms Opteron (unlike LBMHD/GTC) because
- PARATEC less sensitive to memory access issues (high computational intensity)
- Opteron lacks FMA unit
- Quadrics shows better scaling of all-to-all at large concurrencies

- CAM3.1: Atmospheric component of CCSM3.0
- Our focus is the Finite Volume (FV) approach

- AGCM: consists of physics (PS) and dynamical core (DC)
- DC approximates dynamics of atmosphere
- PS: calculates source terms to equations of motion:
- Turbulence, radiative transfer, clouds, boundary, etc

- DC default: Eulerian spectral transform - maps onto sphere
- Allows only 1D decomposition in latitude

- FVCAM grid is rectangular (longitude, latitude, level)
- Allows 2D decomp (latitude, level) in dynamics phase
- Singularity at pole prevents longitude decomposition
- Dynamical eqns bounded by Lagrangian surfaces
- Requires remapping to Eulerian reference frame

- Hybrid (MPI/OpenMP) programming
- MPI tasks limited by # of latitude lines
- Increase potential parallelism and improves S2V ratio
- Is not always effective on some platforms

Simulated Class IV hurricane at 0.5. This storm was produced solely through the chaos of the atmospheric model. It is one of the many events produced by FVCAM at resolution of 0.5.

Experiments/vectorization Art Mirin, Dave Parks, Michael Wehner, Pat Worley

- Processor communication topology and volume for 1D and 2D FVCAM
- Generated by IPM profiling tool - used to understand interconnect requirements

- 1D approach straightforward nearest neighbor communication
- 2D communication bulk is nearest neighbor - however:
- Complex pattern due to vertical decomp and transposition during remapping
- Total volume in 2D remap is reduced due to improved surface/volume ratio

- Vectorization - 5 routines, about 1000 lines
- Move latitude calculation to inner loops to maximize parallelism
- Reduce number of branches, performing logical tests in advance (indirect indexing)
- Vectorize across (not within) FFT’s for Polar filters
- However, higher concurrency for fixed size problem limit performance of vectorized FFTs

- FVCAM 2D decomposition allows effective use of >2X as many processors
- Increasing vertical discretizations (1,4,7) allows higher concurrencies
- Results showing high resolution vector performance 361x576x26 (0.5 x 0.625)
- X1E achieves speedup of over 4500 on P=672 - highest ever achieved
- To date only results from 1D decomp available on SX-8
- Power3 limited to speedup of 600 regardless of concurrency
- Factor of at least 1000x necessary for simulation to be tractable

- Raw speed X1E: 1.14X X1, 1.4X ES, 3.7X Thunder, 13X Seaborg
- At high concurrencies (P= 672) all platforms achieve low % peak (< 7%)
- ES achieves highest sustained performance (over 10% at P=256)
- Vectors suffer from short vector length of fixed problem size, esp FFTs
- Superscalars generally achieve lower efficiencies/performance than vectors

- Finer resolutions requires increased number of more powerful processors

- Anisotropy Dataset Computational Analysis Package
- Optimal general algorithm for extracting key cosmological data from Cosmic Microwave Background Radiation (CMB)
- Anisotropies in the CMB contains early history of the Universe
- Recasts problem in dense linear algebra: ScaLAPACK
- Out of core calculation: holds ~3 of the 50 matrices in memory

Temperature anisotropies in CMB (Boomerang)

Developed by Julian Borrill, LBNL

- Overall performance can be surprising low, for dense linear algebra code
- I/O takes a heavy toll on Phoenix and Columbia: I/O optimization currently in progress
- NERSC Power3 shows best system balance with respect to I/O
- ES lacks high-performance parallel I/O - code rewritten to utilize local disks
- Preliminary SX8 experiments unsuccessful due to poor I/O performance

% of Theoretical Peak

Speedup vs. ES

- Tremendous potential of vector systems: SX8 achieves unprecedented raw performance
- Demonstrated importance of vector platforms: allows solutions at unprecedented scale

- ES achieves highest efficiency on almost all test applications
- LBMHD-3D achieves 26 TF/s using 4800 procs, GTC achieves 7.2 TF/s on 4096 ES procs, PARATEC achieves 5.5 TF/s on 2048 procs

- Porting several of our apps to SX8 in progress:Madbench (slow I/O), FVCAM 2D decomp, Hyperclaw AMR (templated C++)
- Investigating how to incorporate vector-like facilities into modern micros
- Many methods unexamined (more difficult to vectorize): Implicit, Multigrid, AMR, Unstructured (Adaptive), Iterative and Direct Sparse Solvers, N-body, Fast Multiple
- One size does not fit all: need a range of architectural designs to best fit algorithms
- Heterogeneous supercomputing designs are on the horizon

- Moving on to latest generation of HEC platforms: Power5, BG/L, XT3