## Overview of HPC – Eye Towards Petascale Computing

Amit Majumdar

Scientific Computing Applications Group

San Diego Supercomputer Center

University of California San Diego

Topics

- Supercomputing in General
- Supercomputers at SDSC
- Eye Towards Petascale Computing

DOE, DOD, NASA, NSF Centers in US

- DOE National Labs - LANL, LLNL, Sandia
- DOE Office of Science Labs – ORNL, NERSC
- DOD, NASA Supercomputer Centers
- National Science Foundation supercomputer centers for academic users
- San Diego Supercomputer Center (UCSD)
- National Center for Supercomputing Applications (UIUC)
- Pittsburgh Supercomputing Center (Pittsburgh)
- Others at Texas, Indiana-Purdue, ANL-Chicago

TeraGrid: Integrating NSF Cyberinfrastructure

[Figure: map of TeraGrid sites. Labels: UC/ANL, PU, NCAR, PSC, IU, NCSA, ORNL, SDSC, TACC, Wisc, Cornell, Utah, Iowa, Caltech, USC-ISI, UNC-RENCI]

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

Measure of Supercomputers

- Top 500 list (HPL code performance)
- Is one of the measures, but not the measure
- Japan’s Earth Simulator (NEC) was on top for 3 years
- In Nov 2005 LLNL IBM BlueGene reached the top spot ~65000 nodes, 280 TFLOP on HPL, 367 TFLOP peak
- First 100 TFLOP sustained on a real application last year
- Very recently 200+ TFLOP sustained on a real application
- New HPCC benchmarks
- Many others – NAS, NERSC, NSF, DOD TI06 etc.
- Ultimate measure is usefulness of a center for you – enabling better or new science through simulations on balanced machines

Top500 Benchmarks

- 27th Top500 list – June 2006
- NSF Supercomputer Centers in Top500

Historical Trends in Top500

- 1000 X increase in top machine power in 10 years
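As a rough check on what this trend implies, the sketch below converts a 1000x increase over 10 years into an annual growth factor and a doubling time. It assumes smooth exponential growth over the whole period, which is an idealization, and the function name is illustrative only.

```python
import math

def growth_summary(total_factor, years):
    """Return (annual growth factor, doubling time in years),
    assuming smooth exponential growth over the whole period."""
    annual = total_factor ** (1.0 / years)
    doubling = years * math.log(2) / math.log(total_factor)
    return annual, doubling

annual, doubling = growth_summary(1000, 10)
print(f"annual growth factor: {annual:.2f}")
print(f"doubling time: {doubling:.2f} years")
```

Under that assumption, the top machine roughly doubles in power every year.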

Other Benchmarks

- HPCC – High Performance Computing Challenge benchmarks – no rankings
- NSF benchmarks – HPCC, SPIO, and applications: WRF, OOCORE, GAMESS, MILC, PARATEC, HOMME (these are changing; new ones are being considered)
- DoD HPCMP – TI06 benchmarks

Capability Computing

The full power of a machine (CPUs, memory, interconnect, I/O performance) is used for a single scientific problem

Enables the solution of problems that cannot otherwise be solved in a reasonable period of time; the figure of merit is time to solution

E.g., moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models

Capacity Computing

Modest problems are tackled, often simultaneously, on a machine, each with less demanding requirements

Smaller or cheaper systems are used for capacity computing, where smaller problems are solved

Typical uses are parametric studies or exploring design alternatives

The main figure of merit is sustained performance per unit cost

Strong Scaling

For a fixed problem size how does the time to solution vary with the number of processors

Run a fixed size problem and plot the speedup

When scaling of parallel codes is discussed it is normally strong scaling that is being referred to
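A minimal sketch of how a strong-scaling plot is usually derived from timings: speedup is the baseline time divided by the time on P processors, and efficiency is speedup divided by the increase in processor count. The timings below are hypothetical values chosen for illustration, not measurements from any SDSC machine.

```python
def strong_scaling(times):
    """times maps processor count -> wall-clock seconds for the SAME problem.
    Returns {p: (speedup, efficiency)} relative to the smallest count."""
    p0 = min(times)
    t0 = times[p0]
    table = {}
    for p in sorted(times):
        speedup = t0 / times[p]
        efficiency = speedup / (p / p0)
        table[p] = (speedup, efficiency)
    return table

# Hypothetical timings, for illustration only.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
table = strong_scaling(timings)
for p, (s, e) in table.items():
    print(f"{p:4d} procs  speedup {s:5.2f}  efficiency {e:4.2f}")
```

Perfect strong scaling would give efficiency 1.0 at every processor count.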

Weak Scaling

How the time to solution varies with processor count with a fixed problem size per processor

Interesting for O(N) algorithms where perfect weak scaling is a constant time to solution, independent of processor count

Deviations from this indicate that either

The algorithm is not truly O(N) or

The overhead due to parallelism is increasing, or both

Weak Vs Strong Scaling Examples

- The linked cell algorithm employed in DL_POLY 3 [1] for the short ranged forces should be strictly O(N) in time.
- Study the weak scaling of three model systems (two shown next), the times being reported for HPCx, a large IBM P690+ cluster sited at Daresbury.
- http://www.cse.clrc.ac.uk/arc/dlpoly_scale.shtml
- I.J.Bush and W.Smith, CCLRC Daresbury Laboratory

Weak scaling for Argon is shown. The smallest system size is 32,000 atoms, the largest 32,768,000. It can be seen that the scaling is very good, the time step increasing from 0.6s to 0.7s on going from 1 processor to 1024. This simulation is a direct test of the linked cell algorithm as it only requires short ranged forces, and so the results show it is behaving as expected.

Weak scaling for water. The time step increases from 1.9 seconds on 1 processor, where the system size is 20,736 particles, to 3.9 seconds on 1024 (system size 21,233,664). In this case not only the Ewald terms but also constraint forces must be calculated. The constraint forces are short range and should scale as O(N), but their calculation requires a large number of short messages to be sent, so latency effects become appreciable.
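The figures quoted for the two model systems can be turned into weak-scaling efficiencies with a few lines. This is a sketch that takes efficiency as T(1)/T(P), a common convention that the slides do not state explicitly. It also checks that the problem size per processor really is constant.

```python
def weak_scaling_efficiency(t_one_proc, t_p_procs):
    """Ideal weak scaling keeps time constant, so efficiency = T(1) / T(P)."""
    return t_one_proc / t_p_procs

# (particles on 1 proc, particles on 1024 procs, sec/step on 1, sec/step on 1024)
cases = {
    "argon": (32_000, 32_768_000, 0.6, 0.7),
    "water": (20_736, 21_233_664, 1.9, 3.9),
}
procs = 1024
report = {}
for name, (n1, n_p, t1, t_p) in cases.items():
    per_proc = n_p // procs          # should equal n1 for a weak-scaling study
    eff = weak_scaling_efficiency(t1, t_p)
    report[name] = (per_proc, eff)
    print(f"{name}: {per_proc} particles/proc, efficiency {eff:.2f}")
```

With these numbers Argon keeps roughly 86% efficiency at 1024 processors, while water drops to roughly 49%.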

Next Leap in Supercomputer Power

- PetaFLOP: 10^15 floating point operations/sec
- Expected multiple PFLOP(s) machines in the US during 2008 - 2011
- NSF, DOE (ORNL, LANL, NNSA) are considering this
- Similar initiative in Japan, Europe

Topic

- Supercomputing in General
- Supercomputers at SDSC
- Eye Towards Petascale Computing

SDSC’s focus: Apps in top two quadrants

[Figure: applications placed on two axes, Data (increasing I/O and storage) and Compute (increasing FLOPS).
Regions: Data Storage/Preservation Env; Extreme I/O Environment; SDSC Data Science Env; Traditional HEC Env; Campus, Departmental and Desktop Computing.
Applications shown: Climate; SCEC post-processing; SCEC simulation; ENZO simulation; ENZO post-processing; EOL; NVO; Turbulence field; Cypres; CFD; Gaussian; CHARMM; CPMD; QCD; Turbulence reattachment length; Protein folding; Time Variation of Field Variable Simulation; Out-of-Core.]

SDSC Production Computing Environment: 25 TF compute, 1.4 PB disk, 6 PB tape

- TeraGrid Linux Cluster: IBM/Intel IA-64, 4.4 TFlops
- DataStar: IBM Power4+, 15.6 TFlops
- Blue Gene Data: IBM PowerPC, 5.7 TFlops
- Storage Area Network Disk: 1400 TB
- Archival Systems: 6 PB capacity (~3 PB used)
- Sun F15K Disk Server

DataStar is a powerful compute resource well-suited to “extreme I/O” applications

- Peak speed 15.6 TFlops
- #44 in June 2006 Top500 list
- IBM Power4+ processors (2528 total)
- Hybrid of 2 node types, all on single switch
- 272 8-way p655 nodes:
- 176 1.5 GHz proc, 16 GB/node (2 GB/proc)
- 96 1.7 GHz proc, 32 GB/node (4 GB/proc)
- 11 32-way p690 nodes: 1.3 and 1.7 GHz, 64-256 GB/node (2-8 GB/proc)
- Federation switch: ~6 µsec latency, ~1.4 GB/sec pp-bandwidth
- At 283 nodes, ours is one of the largest IBM Federation switches
- All nodes are direct-attached to high-performance SAN disk, 3.8 GB/sec write, 2.0 GB/sec read to GPFS
- GPFS now has 125TB capacity
- 226 TB of gpfs-wan across NCSA, ANL

- Due to consistent high demand, in FY05 we added 96 1.7GHz/32GB p655 nodes & increased GPFS storage from 60 to 125 TB
- Enables 2048-processor capability jobs
- ~50% more throughput capacity
- More GPFS capacity and bandwidth

BG System Overview: Novel, massively parallel system from IBM

- Full system installed at LLNL from 4Q04 to 3Q05
- 65,000+ compute nodes in 64 racks
- Each node being two low-power PowerPC processors + memory
- Compact footprint with very high processor density
- Slow processors & modest memory per processor
- Very high peak speed of 367 Tflop/s
- #1 Linpack speed of 280 Tflop/s
- 1024 compute nodes in single rack installed at SDSC in 4Q04
- Maximum I/O-configuration with 128 I/O nodes for data-intensive computing
- Systems at 14 sites outside IBM & 4 within IBM as of 2Q06
- Need to select apps carefully
- Must scale (at least weakly) to many processors (because they’re slow)
- Must fit in limited memory

SDSC was first academic institution with an IBM Blue Gene system

SDSC procured 1-rack system 12/04. Used initially for code evaluation and benchmarking; production 10/05. (LLNL system is 64 racks.)

SDSC rack has maximum ratio of I/O to compute nodes at 1:8 (LLNL’s is 1:64). Each of 128 I/O nodes in rack has 1 Gbps Ethernet connection => 16 GBps/rack potential.

SDSC Blue Gene - a new resource

- In Dec ‘04, SDSC brought in a single-rack Blue Gene system
- Initially an experimental system to evaluate NSF applications on this unique architecture
- Tailored to high I/O applications
- Entered production as allocated resource in October 2005

- First academic installation of this novel architecture
- Configured for data-intensive computing
- 1,024 compute nodes, 128 I/O nodes
- Peak compute performance of 5.7 TFLOPS
- Two 700-MHz PowerPC 440 CPUs, 512 MB per node
- IBM network : 4 us latency, 0.16 GB/sec pp-bandwidth
- I/O rates of 3.4 GB/s for writes and 2.7 GB/s for reads achieved on GPFS-WAN
- Has own GPFS of 20 TB and gpfs-wan
- System targets runs of 512 CPUs or more
- Production in October 2005
- Multiple 1 million-SU awards at LRAC and several smaller awards for physics, engineering, biochemistry

BG System Overview: Processor Chip (2) (= System-on-a-chip)

- Two 700-MHz PowerPC 440 processors
- Each with two floating-point units
- Each with 32-kB L1 data caches that are not coherent
- 4 flops/proc-clock peak (=2.8 Gflop/s-proc)
- 2 8-B loads or stores / proc-clock peak in L1 (=11.2 GB/s-proc)
- Shared 2-kB L2 cache (or prefetch buffer)
- Shared 4-MB L3 cache
- Five network controllers (though not all wired to each node)
- 3-D torus (for point-to-point MPI operations: 175 MB/s nom x 6 links x 2 ways)
- Tree (for most collective MPI operations: 350 MB/s nom x 3 links x 2 ways)
- Global interrupt (for MPI_Barrier: low latency)
- Gigabit Ethernet (for I/O)
- JTAG (for machine control)
- Memory controller for 512 MB of off-chip, shared memory
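The peak numbers quoted for Blue Gene follow directly from the chip parameters above. The sketch below reproduces that arithmetic; the only assumption beyond the slides is that the 64-rack LLNL system has 1024 nodes per rack, as the SDSC rack does.

```python
CLOCK_HZ = 700_000_000        # 700 MHz PowerPC 440
FLOPS_PER_CLOCK = 4           # per processor, from the slide
PROCS_PER_NODE = 2
NODES_PER_RACK = 1024         # SDSC rack; assumed the same for LLNL racks
BYTES_PER_CLOCK_L1 = 2 * 8    # two 8-byte loads or stores per clock

def peak_flops(racks):
    """Peak flop/s for a system with the given number of racks."""
    return racks * NODES_PER_RACK * PROCS_PER_NODE * FLOPS_PER_CLOCK * CLOCK_HZ

per_proc = FLOPS_PER_CLOCK * CLOCK_HZ      # peak flop/s per processor
l1_bw = BYTES_PER_CLOCK_L1 * CLOCK_HZ      # peak L1 bytes/s per processor

print(f"per processor: {per_proc / 1e9:.1f} Gflop/s")
print(f"L1 bandwidth:  {l1_bw / 1e9:.1f} GB/s")
print(f"1 rack:        {peak_flops(1) / 1e12:.1f} Tflop/s")
print(f"64 racks:      {peak_flops(64) / 1e12:.1f} Tflop/s")
```

The results match the 2.8 Gflop/s and 11.2 GB/s per processor, the 5.7 TFLOPS SDSC rack, and the 367 Tflop/s LLNL peak given in the slides.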

Strategic Applications Collaborations

- Cellulose to Ethanol : Biochemistry (J. Brady, Cornell)
- LES Turbulence : Mechanics (M. Krishnan, U. Minnesota)
- NEES : Earthquake Engr (Ahmed Elgamal, UCSD)
- ENZO : Astronomy (M. Norman, UCSD)
- EM Tomography : Neuroscience (M. Ellisman, UCSD)
- DNS Turbulence : Aerospace Engr (PK Yeung, Georgia Tech)
- NVO Mosaicking: Astronomy (R. Williams, Caltech, Alex Szalay, Johns Hopkins)
- Understanding Pronouns: Linguistics (A. Kehler, UCSD)
- Climate: Atmospheric Sc. (C. Wunsch, MIT)
- Protein Structure: Biochemistry (D. Baker, Univ. of Washington)
- SCEC, TeraShake : Geological Science (T. Jordan and C. Kesselman USC, K. Olsen UCSB, B. Minster, SIO)

Topic

- Supercomputing in General
- Supercomputers at SDSC
- Eye Towards Petascale Computing

3.1 Petascale Hardware

3.2 Petascale Software

NERSC director Horst Simon (few days ago)

When I talk about petaflop computing, what I have in mind is the longer-term perspective, the time when the HPC community enters the age of petascale computing.

What I mean is the time when you must achieve petaflop Rmax performance to make the TOP500 list. An intriguing question is, when will this happen?

If you do a straight-line extrapolation from today's TOP500 list, you come up with the year 2016. In any case, it's eight to 10 years from now, and we will have to master several challenges to reach the age of petascale computing.

The Memory Wall

Source: “Getting up to speed: The Future of Supercomputing”, NRC, 2004

[Figure: number of processors in the most highly parallel system in the TOP500. Labelled systems: Intel Paragon XP, ASCI Red, IBM BG/L.]

Petascale Power Problem (Horst Simon)

- Power consumption is really pushing the environment most of the computing centers have
- A peak-petaflop Cray XT3 or cluster would need 8-9 megawatts for the computer alone
- The 2011 HPCS sustained petaflop systems would require about 20 megawatts
- Efficient power solutions are needed: Blue Gene is better, but it still requires a high number of megawatts for a petaflop system
- At 10 cents per kilowatt-hour, a 20-megawatt system would cost $12 million or more a year just for electricity
- Need to exploit different processor curves, such as the low-cost processors used in embedded technology. The Cell processor, which comes from low-end, embedded game technology, has great potential, but there is a huge step from initial assessment to a production solution
- Don’t forget Space problem, MTBF
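The electricity figure can be checked with simple arithmetic. The sketch below assumes a constant power draw and 8760 hours per year. The `utilization` parameter is an assumption added here to show how a lower average draw changes the bill; it does not come from the talk.

```python
HOURS_PER_YEAR = 24 * 365

def annual_electricity_cost(megawatts, dollars_per_kwh, utilization=1.0):
    """Yearly cost in dollars for a given average power draw."""
    kilowatts = megawatts * 1000
    return kilowatts * HOURS_PER_YEAR * utilization * dollars_per_kwh

for mw in (8.5, 20):
    cost = annual_electricity_cost(mw, 0.10)
    print(f"{mw} MW -> ${cost / 1e6:.1f} million per year")
```

At continuous full load, 20 megawatts comes to about $17.5 million a year, so the "$12 million or more" quoted above reads as a lower bound (it corresponds to roughly 70% average draw).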

Applications and HPCC (next 7 slides from Rolf Rabenseifner, U. of Stuttgart)

[Figure: HPCC benchmarks and application classes placed by spatial locality (low to high) and temporal locality (low to high). Labels: PTRANS, STREAM, HPL, DGEMM, RandomAccess, FFT, CFD, Radar X section, DSP, TSP, Applications.]

Balance Analysis of Machines with HPCC

- Balance expressed as a set of ratios
- Normalized by CPU speed (HPL Tflop/s rate)
- Basis
- Linpack (HPL): Computational Speed
- Parallel STREAM Copy or Triad: Memory bandwidth
- Random Ring Bandwidth: Inter-node communication
- FFT: low spatial and high temporal locality
- PTRANS: total communication capacity of network
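A minimal sketch of the normalization described above: each HPCC result is divided by the machine's HPL rate in Tflop/s, so that machines of different sizes can be compared on balance instead of raw speed. The machine names, metric keys, units, and all numbers are hypothetical placeholders, not measured HPCC data.

```python
def balance_ratios(results):
    """Divide each HPCC metric by the HPL rate (in Tflop/s)."""
    hpl = results["hpl_tflops"]
    return {name: value / hpl
            for name, value in results.items() if name != "hpl_tflops"}

# Hypothetical numbers for two made-up machines; not measured data.
machines = {
    "machine_a": {"hpl_tflops": 2.0, "stream_gbs": 800.0,
                  "random_ring_gbs": 40.0, "fft_gflops": 60.0,
                  "ptrans_gbs": 100.0},
    "machine_b": {"hpl_tflops": 10.0, "stream_gbs": 1500.0,
                  "random_ring_gbs": 50.0, "fft_gflops": 90.0,
                  "ptrans_gbs": 200.0},
}
ratios = {name: balance_ratios(res) for name, res in machines.items()}
for name, r in ratios.items():
    print(name, {k: round(v, 1) for k, v in r.items()})
```

In this made-up comparison the larger machine has the higher HPL rate but less memory bandwidth per HPL Tflop/s, which is the kind of imbalance the ratios are meant to expose.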

Balance of Today’s Machines

- Today, balance factors are in a range of
- 20 inter-node communication / HPL-TFlop/s
- 10 memory speed / HPL-TFlop/s
- 20 FFTE / HPL-TFlop/s
- 30 PTRANS / HPL-TFlop/s

A Petascale Machine

- 10 GFLOP/s per processor x 100,000 processors
- Higher GFLOP per processor – fewer processors (< 100K)
- Lower GFLOP per processor – more processors ( > 1000K)
- Commodity Processors
- Heterogeneous processors
- Clearspeed card, graphic cards, FPGA
- Cray Adaptive Supercomputing
- combine standard microprocessors (scalar processing), vector processing, multithreading and hardware accelerators in one high-performance computing platform
- Sony – Toshiba – IBM Cell processors
- Memory, interconnect, I/O performance should scale in a balanced way with CPU speed
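The trade-off between processor speed and processor count is a one-line division. The sketch below counts processors needed for a peak petaflop at several per-processor rates; the 2.8 GFLOP/s case is the Blue Gene processor peak quoted earlier, and the function name is illustrative.

```python
import math

PETAFLOP = 1e15  # floating point operations per second

def processors_needed(gflops_per_proc, target_flops=PETAFLOP):
    """Number of processors whose combined peak reaches the target."""
    return math.ceil(target_flops / (gflops_per_proc * 1e9))

for rate in (1, 2.8, 10):
    print(f"{rate:>4} GFLOP/s per processor -> {processors_needed(rate):,} processors")
```

This counts peak rates only; sustained performance also depends on memory, interconnect, and I/O balance.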

ORNL Petascale Roadmap

- ORNL will reach peak petaflops performance in stages, now through 2008:
- 2006: upgrade the 25-teraflops Cray XT3 system (5294 nodes, each with a 2.4-GHz AMD Opteron processor and 2 GB of memory) to 50 teraflops via dual-core AMD Opteron™ processors
- Late 2006: move to 100 teraflops with system codenamed "Hood"
- Late 2007: upgrade "Hood" to 250 teraflops
- Late 2008: move to peak petaflops with new architecture codenamed "Baker"
- Cray Adaptive Supercomputing - Powerful compilers and other software will automatically match an application to the processor blade that is best suited for it.

Parallel Applications

- Higher level domain decomposition of some sort or embarrassingly parallel types (astro/physics, engr, chemistry, CFD, MD, climate, materials)
- Mid level parallel math libraries (linear system solvers, FFT, random# generators etc.)
- Lower level search/sort algorithms , other computer science algorithms

Application Scaling

- Performance characterization and prediction of apps – computer science approach
- Scaling current apps for petascale machines – computational science approach
- Developing petascale apps/algorithms – numerical methods approach
- New languages for petascale applications

1. Performance Characterization/Prediction – computer science approach

- Characterizing and understanding current applications' performance (next talk by Pfeiffer)
- How much time is spent in memory/cache access, and with what access pattern
- How much time is spent in communication, and what kind of communication pattern is involved: processor-to-processor communication, global communication operations in which all the processors participate, or both
- How much time is spent in I/O, and with what I/O pattern
- Understanding these will allow us to figure out the importance/effect of various parts of a supercomputer on the application performance

Application Signature: the operations the application needs to carry out. Collect: number of op1, op2, and op3.

Machine Profile: the rates at which a machine can perform the different operations. Collect: rate of op1, op2, and op3.

Convolution: mapping of a machine's performance (rates) onto the application's needed operations:

$$\text{Execution time} = \frac{\text{operation1}}{\text{rate op1}} \oplus \frac{\text{operation2}}{\text{rate op2}} \oplus \frac{\text{operation3}}{\text{rate op3}}$$

where the operator $\oplus$ could be + or MAX depending on operation overlap.

Performance Modeling & Characterization- PMaC lab at SDSC (www.sdsc.edu/PMaC)
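The convolution idea can be sketched in a few lines: divide each operation count in the application signature by the matching rate in the machine profile, then combine the terms with + (no overlap) or MAX (full overlap). This is an illustrative toy under those assumptions, not the PMaC framework itself; the operation names and numbers are hypothetical.

```python
def predict_time(signature, profile, overlap=False):
    """Combine per-operation times (count / rate).

    overlap=False sums the terms (operations run one after another);
    overlap=True takes the maximum (operations fully overlap).
    """
    terms = [signature[op] / profile[op] for op in signature]
    return max(terms) if overlap else sum(terms)

# Hypothetical application signature: operation counts.
signature = {"flops": 2e12, "mem_bytes": 8e12, "msg_bytes": 1e11}
# Hypothetical machine profile: operations per second.
profile = {"flops": 1e12, "mem_bytes": 2e12, "msg_bytes": 1e11}

serial = predict_time(signature, profile)
overlapped = predict_time(signature, profile, overlap=True)
print(f"no overlap:   {serial:.1f} s")
print(f"full overlap: {overlapped:.1f} s")
```

In this made-up case memory access dominates either way, which is the kind of conclusion the characterization is meant to support.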

2. Scaling Current Apps – computational science approach

- At another level, we need to understand what kinds of algorithms and numerical methods are used, how they will scale, and whether a different approach is needed to improve scaling
- P.K. Yeung's DNS code is the example on the next slide (detailed talk on this Wednesday morning: Yeung, Pekurovsky)
- It is an example of an algorithmic-level modification of the domain decomposition for scaling towards a petaflop machine
- One can do this type of analysis for all the computational science fields (molecular dynamics, climate/atmospheric models, CFD turbulence, astrophysics, QCD, fusion, etc.)
- Maybe some codes already have the optimal algorithm and will scale to a petascale machine (this is being very optimistic), and will then be able to solve bigger, higher-resolution problems

DNS Problem

- DNS study of turbulence and turbulent mixing
- 3D space is decomposed in 1 dimension among processors
- 90% of time spent in 3D FFT routines
- Limited scaling up to 2048 processors solving problems up to 2048^3 (N=2048 grid points in each direction)
- Number of processors limited by the linear problem size (N) due to 1D decomposition
- Would like to scale to many more processors to study problems of 4096^3 and larger, using IBM Blue Gene and future (sub)petascale architectures
- Solution: decompose in 2D – the maximum number of processors becomes N^2
- For 4096^3 can use max of 16,777,216 processors
- There has to be need and scaling
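The processor limits for the two decompositions can be sketched as follows, assuming an N^3 grid split into slabs (1D) or pencils (2D) with at least one grid plane or grid line per process. The function names are illustrative and are not taken from the DNS code.

```python
def max_processors(n, decomposition_dims):
    """Upper bound on processes for an N^3 grid:
    N for 1D slabs, N^2 for 2D pencils."""
    if decomposition_dims not in (1, 2):
        raise ValueError("only 1D and 2D decompositions are modeled here")
    return n ** decomposition_dims

def points_per_process(n, procs):
    """Grid points per process, assuming procs divides N^3 evenly."""
    return n ** 3 // procs

for n, dims in ((2048, 1), (4096, 1), (4096, 2)):
    p = max_processors(n, dims)
    print(f"N={n}, {dims}D: up to {p:,} processes, "
          f"{points_per_process(n, p):,} points each at the limit")
```

At the 2D limit each process holds a single line of N points, so practical runs would use far fewer processes than this bound.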

3. Develop Petascale Apps/Algorithms – numerical methods approach

- Develop new algorithms/numerics from scratch for a particular field, keeping in mind that we will now have a (say) 100,000-processor machine
- When the original algorithms/codes were implemented, researchers were thinking of a few hundred or a thousand processors
- Climate models using spectral element methods provide much higher scalability due to less communication overhead, better cache performance, etc., which follow from the fundamental characteristics of spectral element numerical methods (Friday morning talk: Amik St-Cyr from NCAR)
- So for the last few years climate researchers have been developing parallel codes using this kind of numerical method, expecting that petaflop-class machines will have a large number of processors

Spectral Elements

- SPECTRAL ELEMENTS: The best approach consists of using quadrilaterals. You can write 1D, 2D and 3D operators on each element as matrix-matrix operations: O(N^3) (2D) for tensor forms and O(N^4) (2D) for non-tensor forms (e.g. triangles with a high-order basis). It is possible to use a tuned BLAS level-3 call. There is no assembly of the matrix; instead, the action of the assembled matrix on a vector is coded. The quadrature rules are used in such a way that the mass matrix is diagonal and therefore trivially invertible
- FEM: In low-order finite elements, the nice matrix-matrix (MxM) operations are not there; assembly of the global matrix is necessary and leads to load-balancing issues: the parallel matrix might be distributed differently than the actual element data. Also, the mass matrix is not diagonal, and its inversion is necessary even in the case of explicit time-stepping
- (Pseudo) SPECTRAL: The problem with the global spectral transform is that the transposition of a (huge) array of data is necessary (an enormous all-to-all type of communication). The spectral-element approach uses only nearest-neighbor (local) communication. For the global transform, network bandwidth/contention will eventually limit the scaling. Pseudo-spectral methods are also constrained to certain types of domain: periodic domains are good candidates. For ocean modelers, it is not possible to use a global pseudo-spectral method.
- Suppose N unknowns on a sphere (2D):
- Spherical harmonics (natural global spectral basis on the sphere) cost O(N^(3/2)) per time step
- Discrete Fourier transform cost O(N log N) per time step
- Spectral elements: O(N)
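To see what these cost orders mean in practice, the sketch below compares how the per-time-step cost grows when the number of unknowns N is multiplied by 4 (for example, doubling the resolution in both directions on the sphere). Constant factors are ignored, so only the growth ratios are meaningful; the starting value of N is arbitrary.

```python
import math

COSTS = {
    "spherical harmonics": lambda n: n ** 1.5,
    "discrete Fourier transform": lambda n: n * math.log(n),
    "spectral elements": lambda n: float(n),
}

def growth(cost, n, factor):
    """Ratio of cost at factor*n unknowns to cost at n unknowns."""
    return cost(factor * n) / cost(n)

n = 1_000_000   # arbitrary starting number of unknowns
growths = {name: growth(cost, n, 4) for name, cost in COSTS.items()}
for name, g in growths.items():
    print(f"{name}: cost grows by {g:.2f}x when N grows 4x")
```

The spherical harmonic cost grows 8x, the Fourier transform cost slightly more than 4x, and the spectral element cost exactly 4x.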

4. New Languages for Petascale Applications

- Can we keep writing codes in Fortran, C, and C++ with MPI, just as we do today, and still use petascale machines effectively?
- New languages provide the means to write codes as if the machine had shared memory, at a much higher level, and let the languages and libraries do the lower-level MPI-type work
- PGAS (Partitioned Global Address Space) languages and compilers include Co-Array Fortran, UPC (Unified Parallel C), and Titanium (ask Harkness - Tuesday morning talk)

Summary: Scaling, Scaling, and Scaling

- Balanced scalability in hardware (memory performance, interconnect performance, I/O performance, CPU performance) – vendors and centers’ problem
- Scalability in software – mostly your problem

Thank you
