Leading Computational Methods on Modern Parallel Systems

Leonid Oliker

Lawrence Berkeley National Laboratory

Joint work with:

Julian Borrill, Andrew Canning, Stephane Ethier, Art Mirin, David Parks, John Shalf, David Skinner, Michael Wehner, Patrick Worley



Motivation

  • Stagnating application performance is a well-known problem in scientific computing

  • By end of decade numerous mission critical applications expected to have 100X computational demands of current levels

  • Many HEC platforms are poorly balanced for demands of leading applications

    • Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology

  • Traditional superscalar trends slowing down

    • Mined most benefits of ILP and pipelining; clock frequency limited by power concerns

  • Achieving next order of magnitude in computing will require O(100K) - O(1M) processors: interconnects that scale nonlinearly in cost (crossbar, fat-tree) become impractical
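
The nonlinear-cost point can be made concrete with rough, back-of-the-envelope component counts. The sketch below is illustrative only; the switch radix and the simplified cost models are assumptions, not figures from the study:

```c
/* Back-of-the-envelope component counts for three interconnect
 * topologies as the processor count P grows.  Simplified textbook
 * scaling only: an idealized crossbar needs O(P^2) crosspoints, a
 * fat-tree built from radix-k switches needs roughly P ports per level
 * times O(log P) levels, and a 3D torus needs O(P) links. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double procs[] = {1e5, 1e6};   /* O(100K) and O(1M) processors */
    const double radix = 64.0;           /* hypothetical switch radix */

    for (int i = 0; i < 2; i++) {
        double p = procs[i];
        double crossbar = p * p;                                  /* crosspoints  */
        double fattree  = p * ceil(log(p) / log(radix / 2.0));    /* switch ports */
        double torus3d  = 3.0 * p;                                /* links        */
        printf("P = %.0e: crossbar ~%.1e, fat-tree ~%.1e, 3D torus ~%.1e\n",
               p, crossbar, fattree, torus3d);
    }
    return 0;   /* the crossbar term grows quadratically, the torus linearly */
}
```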

Application Evaluation

  • Microbenchmarks, algorithmic kernels, and performance modeling and prediction are important components of understanding and improving architectural efficiency

  • However full-scale application performance is the final arbiter of system utility and necessary as baseline to support all complementary approaches

  • Our evaluation work emphasizes full applications, with real input data, at the appropriate scale

  • Requires coordination of computer scientists and application experts from highly diverse backgrounds

    • Allows advanced optimizations for given architectural platforms

  • In-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale

  • Our recent efforts have focused on the potential of modern parallel vector systems

  • Effective code vectorization is an integral part of the process

    • First US team to conduct Earth Simulator performance study

Vector Paradigm

  • High memory bandwidth

    • Allows systems to effectively feed ALUs (high byte to flop ratio)

  • Flexible memory addressing modes

    • Supports fine grained strided and irregular data access

  • Vector Registers

    • Hide memory latency via deep pipelining of memory load/stores

  • Vector ISA

    • Single instruction specifies large number of identical operations

  • Vector architectures allow for:

    • Reduced control complexity

    • Efficient utilization of a large number of computational resources

    • Potential for automatic discovery of parallelism

  • Can some of these features be added to modern micros?

    However: most effective if sufficient regularity discoverable in program structure

    • Suffers even if small % of code non-vectorizable (Amdahl’s Law)
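
A quick Amdahl's-law estimate makes the last point concrete. The numbers below are illustrative assumptions (a hypothetical 30x vector/scalar speedup), not measurements from this study:

```c
/* Amdahl's-law estimate of overall speedup when a fraction f of the
 * runtime vectorizes with speedup S.  The 30x vector/scalar ratio is a
 * hypothetical value for illustration, not a measured figure. */
#include <stdio.h>

static double amdahl(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);   /* overall speedup */
}

int main(void) {
    const double S = 30.0;
    const double fracs[] = {0.70, 0.90, 0.95, 0.99};

    for (int i = 0; i < 4; i++)
        printf("vectorized fraction %.2f -> overall speedup %4.1fx\n",
               fracs[i], amdahl(fracs[i], S));

    /* Even at 95% vectorization the overall gain is only ~12x, which is
       why a small unvectorized remainder quickly dominates cost. */
    return 0;
}
```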

Architectural Comparison

  • Custom vectors: High bandwidth, flexible addressing, many registers, vector ISA

    • Less control complexity, use many ALUs, auto parallelism

    • ES shows best balance between memory and peak performance

    • Data caches of superscalar systems and X1(E) potentially reduce memory costs

    • X1E: 2 MSPs per MCM - increases contention for memory and interconnect

  • A key ‘balance point’ for vector systems is the scalar:vector ratio (8:1 or even 32:1)

  • Opteron/IB shows best balance for superscalar, Itanium2/Quadrics lowest latency

  • Cost is a critical metric - however we are unable to provide such data

    • Proprietary, pricing varies based on customer and time frame

    • Poorly balanced systems cannot solve certain classes of problems/resolutions regardless of processor count!

Application Overview

  • Examining candidate ultra-scale applications with abundant data parallelism

  • Codes designed for superscalar architectures required vectorization effort

    • ES use requires codes to clear minimum vectorization and parallelization hurdles

Astrophysics: CACTUS

  • Numerical solution of Einstein’s equations from theory of general relativity

  • Among most complex in physics: a set of coupled nonlinear hyperbolic & elliptic systems w/ thousands of terms

  • CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes

  • Evolves PDEs on regular grid using finite differences

  • Gravitational Wave Astronomy: exciting new field about to be born, w/ new fundamental insight into universe

Visualization of grazing collision of two black holes

Developed at the Max Planck Institute; vectorized by John Shalf (LBNL)

CACTUS: Performance

  • SX8 attains the highest per-processor performance ever recorded for CACTUS: 58X Power3!

    • Vector performance related to x-dim (vector length)

    • Excellent scaling on ES/SX8 using fixed data size per proc (weak scaling)

    • Opens possibility of computations at unprecedented scale

  • X1 surprisingly poor (~6x slower than SX8) - low scalar:vector ratio

    • Unvectorized boundary code required 15% of runtime on SX8 and 30+% on X1

    • < 5% for the scalar version: unvectorized code can quickly dominate cost

    • To date unable to successfully execute code on X1E

  • Poor superscalar performance despite high computational intensity

    • Register spilling due to large number of loop variables

    • Prefetch engines inhibited due to multi-layer ghost zones calculations

Plasma Physics: LBMHD

  • LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)

  • Performs 2D/3D simulation of high temperature plasma

  • Evolves from initial conditions, decaying to form current sheets

  • Spatial grid coupled to octagonal streaming lattice

  • Block distributed over processor grid

  • Main computational components:

    • Collision, Stream, Interpolation

  • Vectorization: loop interchange, unrolling

    • The X1 compiler reordered the loops automatically (see the sketch below)

Evolution of vorticity into turbulent structures

Ported by Jonathan Carter; developed by George Vahala's group, College of William & Mary
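
The loop-interchange idea can be sketched as follows. This is a hedged illustration of the general technique, not the actual LBMHD source; array names and sizes are hypothetical:

```c
/* Illustration of loop interchange for vector hardware: the short loop
 * over lattice directions is hoisted outward so the innermost loop runs
 * over the long, independent grid dimension (vector length ~ NX).
 * Array names and sizes are hypothetical; this is not the LBMHD source. */
#define NX   1024   /* grid points along one dimension (long, independent) */
#define NDIR 9      /* streaming-lattice directions (short) */

void collision_step(double f[NDIR][NX], const double feq[NDIR][NX], double tau) {
    /* Before interchange (poor): the inner loop of length NDIR = 9
       limits the vector length.

       for (int i = 0; i < NX; i++)
           for (int d = 0; d < NDIR; d++)
               f[d][i] += (feq[d][i] - f[d][i]) / tau;
    */

    /* After interchange (good): the inner loop of length NX vectorizes. */
    for (int d = 0; d < NDIR; d++)
        for (int i = 0; i < NX; i++)
            f[d][i] += (feq[d][i] - f[d][i]) / tau;
}
```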

LBMHD-3D: Performance

  • Not unusual to see vector achieve > 40% peak while superscalar architectures achieve ~ 10%

  • There is plenty of computation; however, the large working set causes register spilling on scalar architectures

    • Examining cache-oblivious algorithms for cache-based systems

  • Unlike superscalar approach, large vector register sets hide latency

  • Opteron shows impressive superscalar performance, 2X speed vs. Itanium2

    • Opteron has >2x STREAM BW, and Itanium2 cannot store FP in L1 cache

  • ES sustains 68% of peak up to 4800 processors: 26 Tflop/s - SC05 Gordon Bell finalist. The highest performance ever attained for this code by far (see the arithmetic check after this list)

  • SX8 shows highest raw performance, but ES shows highest efficiency

    • SX8: Commodity DDR2-SDRAM vs. ES: high performance custom FPLRAM

  • X1E achieved same performance as X1 using original code version

    • Turning off caching resulted in about 10% improvement over X1
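
A quick consistency check on the 68%-of-peak figure, assuming the ES peak of 8 Gflop/s per processor:

```latex
% Assuming the ES peak of 8 Gflop/s per processor:
\[
  4800 \times 8\ \mathrm{Gflop/s} = 38.4\ \mathrm{Tflop/s}\ (\text{peak}),
  \qquad
  \frac{26\ \mathrm{Tflop/s}}{38.4\ \mathrm{Tflop/s}} \approx 0.68,
\]
% consistent with the quoted 68% of peak on 4800 processors.
```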

Analysis of Bank Conflicts

  • Comparison of bank conflicts shows that ES custom memory has clear advantages
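
A minimal sketch of why strided access causes bank conflicts on interleaved memory; the bank count and strides below are hypothetical, not ES parameters:

```c
/* Why strided access causes bank conflicts on interleaved memory:
 * successive references revisit the same bank every
 * NBANKS / gcd(stride, NBANKS) accesses and must wait out the bank
 * busy time.  The bank count and strides are hypothetical. */
#include <stdio.h>

#define NBANKS 16   /* hypothetical number of interleaved memory banks */

int main(void) {
    const int strides[] = {1, 3, 8, 16};

    for (int s = 0; s < 4; s++) {
        printf("stride %2d -> banks of first 8 accesses:", strides[s]);
        for (int k = 0; k < 8; k++)
            printf(" %2d", (k * strides[s]) % NBANKS);  /* bank = word index mod NBANKS */
        printf("\n");
    }
    /* Strides 1 and 3 spread across many banks; stride 8 alternates
       between two banks; stride 16 hits bank 0 every time. */
    return 0;
}
```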

Magnetic Fusion: GTC

  • Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)

  • Goal of magnetic fusion is a burning plasma power plant producing cleaner energy

  • GTC solves 3D gyroaveraged kinetic system w/ particle-in-cell approach (PIC)

  • PIC scales as N instead of N² – particles interact w/ electromagnetic field on grid (see the scaling note below)

  • Allows solving equations of particle motion with ODEs (instead of nonlinear PDEs)

  • Vectorization inhibited: multiple particles may attempt to concurrently update same point in charge deposition

Whole volume and cross section of electrostatic potential field, showing elongated turbulence eddies

Developed at PPPL, vectorized/optimized by Stephane Ethier and ORNL/Cray/NEC
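
The N-versus-N² point follows from standard PIC reasoning; the note below is a textbook sketch rather than GTC-specific figures (M denotes the number of grid points, and an FFT-style field solve is assumed):

```latex
% Textbook PIC scaling sketch (not GTC-specific figures).
% Direct particle-particle interactions:
\[
  \tfrac{1}{2}N(N-1) \;\sim\; \mathcal{O}(N^2) \quad \text{per timestep.}
\]
% PIC instead deposits charge on M grid points, solves the field there
% (e.g. an FFT-based solve), and gathers it back to the particles:
\[
  \mathcal{O}(N)_{\text{deposit + push}} \;+\; \mathcal{O}(M \log M)_{\text{field solve}}
  \;\approx\; \mathcal{O}(N) \quad \text{for } N \gg M .
\]
```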

GTC Particle Decomposition


  • Particle charge deposited amongst nearest grid points

  • Several particles can contribute to same grid point preventing vectorization

  • Solution: VLEN copies of charge deposition array with reduction after main loop (see the sketch after this list)

    • Greatly increases memory footprint (8x)

  • GTC originally optimized for superscalar SMPs using MPI/OpenMP

  • However: OpenMP severely increases memory usage for the vector implementation

    • Vectorization and thread-level parallelism compete w/ each other

  • Previous vector experiments limited to only 64-way MPI parallelism

    • 64 is the optimal number of domains for 1D toroidal decomposition (independent of # of particles)

  • Updated GTC algorithm

  • Introduces a third level of parallelism:

    • Algorithm splits particles between several processors (within 1D domain)

    • Allows increased concurrency and number of studied particles

  • Larger particle simulations allow increased-resolution studies

    • Particles not subject to Courant condition (same timestep)

    • Allows multiple species calculations
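
A hedged sketch of the VLEN-copies workaround referenced above; the array names, sizes, and the single-cell deposit are illustrative simplifications, not the actual GTC source:

```c
/* VLEN private copies of the charge-deposition array, combined by a
 * reduction after the main loop, so concurrent vector lanes never write
 * the same address.  Names, sizes, and the single-cell deposit are
 * illustrative simplifications; this is not the GTC source. */
#define VLEN  256    /* hardware vector length (illustrative) */
#define NGRID 4096   /* flattened grid size (illustrative) */

void deposit_charge(int np, const int cell[], const double charge[],
                    double grid[NGRID]) {
    static double copies[VLEN][NGRID];   /* ~VLEN x more memory for this array */

    for (int v = 0; v < VLEN; v++)
        for (int g = 0; g < NGRID; g++)
            copies[v][g] = 0.0;

    /* Main loop: particle i updates private copy i % VLEN, so each block
       of VLEN consecutive iterations is free of write collisions and can
       be vectorized (in practice via strip-mining). */
    for (int i = 0; i < np; i++)
        copies[i % VLEN][cell[i]] += charge[i];

    /* Reduction after the main loop recombines the private copies. */
    for (int g = 0; g < NGRID; g++) {
        double sum = 0.0;
        for (int v = 0; v < VLEN; v++)
            sum += copies[v][g];
        grid[g] += sum;
    }
}
```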

GTC: Performance

  • New decomposition algorithm efficiently utilizes high P (as opposed to 64 on ES)

  • Breakthrough of Tflop barrier on ES for important SciDAC code

    • 7.2 Tflop/s on 4096 ES processors

    • SX8 highest raw performance (of any architecture)

  • Opens possibility of new set of high-phase space-resolution simulations

  • Scalar architectures suffer from low computational intensity (CI), irregular data access, and register spilling

  • Opteron/IB is 50% faster than Itanium2/Quadrics and only 1/2 speed of X1

    • Opteron: on-chip memory controller and caching of FP data in L1

  • X1 suffers from overhead of scalar code portions

  • Original (unmodified) X1 version performed 12% *slower* on X1E

    • Additional optimizations increased performance by 50%!

    • Recent ORNL work increases performance additional 75%

  • SciDAC code, HPCS benchmark

Material Science: PARATEC

  • First-principles quantum mechanical total energy calc using pseudopotentials & plane wave basis set

  • Density Functional Theory to calc structure & electronic properties of new materials

  • DFT calculations are among the largest consumers of supercomputer cycles in the world

  • 33% 3D FFT, 33% BLAS3, 33% hand-coded F90

  • Part of calculation is in real space, the other in Fourier space

    • Uses specialized 3D FFT to transform wavefunction

Conduction band minimum electron state for Cadmium Selenide (CdSe) quantum dot

  • Global transpose from Fourier to real space

  • Multiple FFTs: reduce communication latency

  • Vectorize across (not within) FFTs (see the sketch below)

  • Custom FFT: only nonzeros sent

Developed by Andrew Canning with Louie and Cohen’s groups (LBNL, UCB)
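
The "vectorize across FFTs" strategy can be sketched with a deliberately naive batched DFT; this is not PARATEC's custom 3D FFT, and the transform length and batch size are assumptions:

```c
/* "Vectorize across, not within, FFTs": many independent 1D transforms
 * are computed with the transform (batch) index in the innermost loop,
 * so the vector length equals the batch size rather than the short
 * length of a single transform.  A naive O(N^2) DFT is used purely for
 * clarity; this is not PARATEC's custom 3D FFT. */
#include <complex.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N      64    /* length of each 1D transform (illustrative) */
#define NBATCH 512   /* number of independent transforms (illustrative) */

void batched_dft(const double complex in[N][NBATCH],
                 double complex out[N][NBATCH]) {
    for (int k = 0; k < N; k++) {
        for (int b = 0; b < NBATCH; b++)     /* initialize output row */
            out[k][b] = 0.0;

        for (int n = 0; n < N; n++) {
            const double complex w = cexp(-2.0 * M_PI * I * (double)(k * n) / N);
            for (int b = 0; b < NBATCH; b++) /* long, independent inner loop */
                out[k][b] += w * in[n][b];
        }
    }
}
```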

PARATEC: Performance

  • All architectures generally perform well due to computational intensity of code (BLAS3, FFT)

  • ES achieves highest overall performance to date: 5.5 Tflop/s on 2048 procs

    • Main ES/SX8 advantage for this code is fast interconnect

    • Allows never-before-possible high-resolution simulations

    • CdSe quantum dot: largest cell-size atomistic experiment ever run using PARATEC

      • Important uses: can be attached to organic molecules as electronic dye tags

  • SX8 achieves highest per-processor performance (>2x X1E)

  • X1/X1E shows lowest % of peak

    • Non-vectorizable code much more expensive on X1/X1E (32:1 scalar:vector ratio)

    • Lower ratio of bisection bandwidth to computation (4D hypercube)

    • X1 performance is comparable to Itanium2

  • Itanium2 outperforms Opteron (unlike LBMHD/GTC) because

    • PARATEC less sensitive to memory access issues (high computational intensity)

    • Opteron lacks FMA unit

    • Quadrics shows better scaling of all-to-all at large concurrencies

Atmospheric Modeling: FVCAM

  • CAM3.1: Atmospheric component of CCSM3.0

    • Our focus is the Finite Volume (FV) approach

  • AGCM: consists of physics (PS) and dynamical core (DC)

  • DC approximates dynamics of atmosphere

  • PS: calculates source terms for the equations of motion:

    • Turbulence, radiative transfer, clouds, boundary, etc

  • DC default: Eulerian spectral transform - maps onto sphere

    • Allows only 1D decomposition in latitude

  • FVCAM grid is rectangular (longitude, latitude, level)

    • Allows 2D decomp (latitude, level) in dynamics phase

    • Singularity at pole prevents longitude decomposition

    • Dynamical eqns bounded by Lagrangian surfaces

    • Requires remapping to Eulerian reference frame

  • Hybrid (MPI/OpenMP) programming

    • MPI tasks limited by # of latitude lines

    • Increases potential parallelism and improves surface-to-volume (S2V) ratio

    • However, not effective on some platforms

Simulated Class IV hurricane at 0.5°. This storm was produced solely through the chaos of the atmospheric model. It is one of many events produced by FVCAM at a resolution of 0.5°.

Experiments/vectorization: Art Mirin, Dave Parks, Michael Wehner, Pat Worley

FVCAM Decomposition and Vectorization

  • Processor communication topology and volume for 1D and 2D FVCAM

    • Generated by IPM profiling tool - used to understand interconnect requirements

  • 1D approach: straightforward nearest-neighbor communication

  • Bulk of 2D communication is nearest neighbor - however:

    • Complex pattern due to vertical decomp and transposition during remapping

    • Total volume in 2D remap is reduced due to improved surface/volume ratio

  • Vectorization - 5 routines, about 1000 lines

    • Move latitude calculation to inner loops to maximize parallelism

    • Reduce number of branches, performing logical tests in advance (indirect indexing); see the sketch below

    • Vectorize across (not within) FFTs for polar filters

    • However, higher concurrency for fixed size problem limits performance of vectorized FFTs
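
The branch-reduction item above can be sketched as a two-pass, indirectly indexed loop. Names and the update itself are illustrative, not FVCAM code:

```c
/* Two-pass "logical tests in advance" pattern: gather the indices that
 * pass the test, then run a branch-free loop over the gathered list so
 * vector gather/scatter handles the indirect addressing.  Names and the
 * update itself are illustrative, not FVCAM code. */
#define MAXPTS 100000

void filtered_update(int n, const double mask[], double x[], double dt) {
    static int idx[MAXPTS];   /* assumes n <= MAXPTS */
    int count = 0;

    /* Pass 1: evaluate the logical test once and record passing indices. */
    for (int i = 0; i < n; i++)
        if (mask[i] > 0.0)
            idx[count++] = i;

    /* Pass 2: branch-free compute loop over the index list. */
    for (int j = 0; j < count; j++)
        x[idx[j]] += dt * x[idx[j]];
}
```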

FVCAM3.1: Performance

  • FVCAM 2D decomposition allows effective use of >2X as many processors

  • Increasing vertical discretizations (1,4,7) allows higher concurrencies

  • Results showing high-resolution vector performance at 361x576x26 (0.5° x 0.625°)

    • X1E achieves speedup of over 4500 on P=672 - highest ever achieved

    • To date only results from 1D decomp available on SX-8

    • Power3 limited to speedup of 600 regardless of concurrency

    • Factor of at least 1000x necessary for simulation to be tractable

  • Raw speed X1E: 1.14X X1, 1.4X ES, 3.7X Thunder, 13X Seaborg

  • At high concurrencies (P= 672) all platforms achieve low % peak (< 7%)

    • ES achieves highest sustained performance (over 10% at P=256)

    • Vectors suffer from short vector length of fixed problem size, esp FFTs

    • Superscalars generally achieve lower efficiencies/performance than vectors

  • Finer resolutions require an increased number of more powerful processors

Cosmology: MADCAP

  • Microwave Anisotropy Dataset Computational Analysis Package

  • Optimal general algorithm for extracting key cosmological data from Cosmic Microwave Background Radiation (CMB)

  • Anisotropies in the CMB contain the early history of the Universe

  • Recasts problem in dense linear algebra: ScaLAPACK

  • Out-of-core calculation: holds ~3 of the 50 matrices in memory (see the sketch below)

Temperature anisotropies in CMB (Boomerang)

Developed by Julian Borrill, LBNL
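
A minimal out-of-core sketch of the idea referenced above; the file name, matrix size, and plain-stdio I/O are assumptions, and this is not the MADCAP implementation:

```c
/* Out-of-core staging: matrices too large to hold simultaneously are
 * written to scratch files and read back one at a time, so only a few
 * of the ~50 matrices are ever resident.  File name, matrix size, and
 * the plain-stdio I/O are assumptions; this is not the MADCAP code. */
#include <stdio.h>
#include <stdlib.h>

#define NROWS 2048
#define NCOLS 2048

static int store_matrix(const char *path, const double *m) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(m, sizeof(double), (size_t)NROWS * NCOLS, f);
    fclose(f);
    return n == (size_t)NROWS * NCOLS ? 0 : -1;
}

static int load_matrix(const char *path, double *m) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(m, sizeof(double), (size_t)NROWS * NCOLS, f);
    fclose(f);
    return n == (size_t)NROWS * NCOLS ? 0 : -1;
}

int main(void) {
    double *a = malloc(sizeof(double) * NROWS * NCOLS);
    if (!a) return 1;
    for (long i = 0; i < (long)NROWS * NCOLS; i++)
        a[i] = (double)i;

    /* Stage to disk, then reload when the solver needs the block again;
       the speed of this round trip is what dominates MADCAP runtime on
       several of the evaluated platforms. */
    if (store_matrix("scratch_block_00.dat", a) != 0) { free(a); return 1; }
    if (load_matrix("scratch_block_00.dat", a) != 0)  { free(a); return 1; }

    free(a);
    return 0;
}
```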

MADCAP: Performance

  • Overall performance can be surprisingly low for a dense linear algebra code

  • I/O takes a heavy toll on Phoenix and Columbia: I/O optimization currently in progress

  • NERSC Power3 shows best system balance with respect to I/O

  • ES lacks high-performance parallel I/O - code rewritten to utilize local disks

  • Preliminary SX8 experiments unsuccessful due to poor I/O performance

Performance Overview

[Summary charts: % of theoretical peak and speedup vs. ES for each application across the evaluated platforms]

  • Tremendous potential of vector systems: SX8 achieves unprecedented raw performance

    • Demonstrated importance of vector platforms: allows solutions at unprecedented scale

  • ES achieves highest efficiency on almost all test applications

    • LBMHD-3D achieves 26 TF/s using 4800 procs, GTC achieves 7.2 TF/s on 4096 ES procs, PARATEC achieves 5.5 TF/s on 2048 procs

  • Porting several of our apps to SX8 in progress: MADbench (slow I/O), FVCAM 2D decomp, Hyperclaw AMR (templated C++)

  • Investigating how to incorporate vector-like facilities into modern micros

  • Many methods unexamined (more difficult to vectorize): Implicit, Multigrid, AMR, Unstructured (Adaptive), Iterative and Direct Sparse Solvers, N-body, Fast Multipole

  • One size does not fit all: need a range of architectural designs to best fit algorithms

    • Heterogeneous supercomputing designs are on the horizon

  • Moving on to latest generation of HEC platforms: Power5, BG/L, XT3
