Overview of HPC – Eye Towards Petascale Computing

Amit Majumdar
Scientific Computing Applications Group
San Diego Supercomputer Center
University of California San Diego

Topics • Supercomputing in General • Supercomputers at SDSC • Eye Towards Petascale Computing

DOE, DOD, NASA, NSF Centers in US
• DOE National Labs - LANL, LLNL, Sandia
• DOE Office of Science Labs – ORNL, NERSC
• DOD, NASA Supercomputer Centers
• National Science Foundation supercomputer centers for academic users
  • San Diego Supercomputer Center (UCSD)
  • National Center for Supercomputing Applications (UIUC)
  • Pittsburgh Supercomputing Center (Pittsburgh)
  • Others at Texas, Indiana-Purdue, ANL-Chicago

TeraGrid: Integrating NSF Cyberinfrastructure
[Map of TeraGrid sites. Labels: Buffalo, Wisc, Cornell, Utah, Iowa, Caltech, USC-ISI, UNC-RENCI, UC/ANL, PU, NCAR, PSC, IU, NCSA, ORNL, SDSC, TACC]
TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

Measure of Supercomputers
• Top 500 list (HPL code performance)
  • Is one of the measures, but not the only measure
  • Japan’s Earth Simulator (NEC) was on top for 3 years
  • In Nov 2005 the LLNL IBM BlueGene reached the top spot: ~65000 nodes, 280 TFLOP on HPL, 367 TFLOP peak
  • First 100 TFLOP sustained on a real application last year
  • Very recently 200+ TFLOP sustained on a real application
• New HPCC benchmarks
• Many others – NAS, NERSC, NSF, DOD TI06 etc.
• Ultimate measure is usefulness of a center for you – enabling better or new science through simulations on balanced machines

Top500 Benchmarks
• 27th Top 500 – June 2006
• NSF Supercomputer Centers in Top500

Historical Trends in Top500 • 1000 X increase in top machine power in 10 years
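The 1000x-in-10-years trend can be restated as an annual growth factor and a doubling time. The sketch below does only that arithmetic; the function names are illustrative and not from the slides, and it assumes smooth exponential growth, which the real list only approximates.

```python
import math


def annual_growth_factor(total_factor, years):
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor, years):
    """Time for performance to double, assuming steady exponential growth."""
    return years * math.log(2.0) / math.log(total_factor)


if __name__ == "__main__":
    # 1000x in 10 years, as stated on the slide.
    print("annual growth factor:", round(annual_growth_factor(1000.0, 10.0), 3))
    print("doubling time (years):", round(doubling_time_years(1000.0, 10.0), 3))
```

Under that assumption, the top machine roughly doubles in power every year.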

Other Benchmarks
• HPCC – High Performance Computing Challenge benchmarks – no rankings
• NSF benchmarks – HPCC, SPIO, and applications: WRF, OOCORE, GAMESS, MILC, PARATEC, HOMME (these are changing; new ones are being considered)
• DoD HPCMP – TI06 benchmarks

Kiviat diagrams

Capability Computing
• Full power of a machine is used for a given scientific problem, utilizing CPUs, memory, interconnect, I/O performance
• Enables the solution of problems that cannot otherwise be solved in a reasonable period of time; the figure of merit is time to solution
• E.g. moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models

Capacity Computing
• Modest problems are tackled, often simultaneously, on a machine, each with less demanding requirements
• Smaller or cheaper systems are used for capacity computing, where smaller problems are solved
• Used for parametric studies or to explore design alternatives
• The main figure of merit is sustained performance per unit cost

Strong Scaling
• For a fixed problem size, how does the time to solution vary with the number of processors?
• Run a fixed size problem and plot the speedup
• When scaling of parallel codes is discussed it is normally strong scaling that is being referred to

Weak Scaling
• How the time to solution varies with processor count with a fixed problem size per processor
• Interesting for O(N) algorithms, where perfect weak scaling is a constant time to solution, independent of processor count
• Deviations from this indicate that either
  • The algorithm is not truly O(N), or
  • The overhead due to parallelism is increasing, or both
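A minimal sketch of the strong-scaling measurement described above: run one fixed-size problem at several processor counts and tabulate speedup and parallel efficiency. The timings are hypothetical placeholders, not measurements from any machine in this talk, and the function names are illustrative.

```python
def speedup(t_one_proc, t_p_procs):
    """Strong-scaling speedup: how many times faster the fixed-size run became."""
    return t_one_proc / t_p_procs


def parallel_efficiency(t_one_proc, t_p_procs, procs):
    """Fraction of the ideal speedup achieved (1.0 is perfect strong scaling)."""
    return speedup(t_one_proc, t_p_procs) / procs


def scaling_table(timings):
    """Return (procs, speedup, efficiency) rows from {procs: seconds}."""
    t1 = timings[1]
    return [
        (p, speedup(t1, t), parallel_efficiency(t1, t, p))
        for p, t in sorted(timings.items())
    ]


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds for one fixed problem size.
    hypothetical_timings = {1: 100.0, 4: 27.0, 16: 8.0, 64: 3.2}
    for procs, s, e in scaling_table(hypothetical_timings):
        print(f"{procs:4d} procs  speedup {s:6.2f}  efficiency {e:5.2f}")
```

Efficiency falling as the processor count grows is the usual sign that communication or serial work is starting to dominate the fixed-size problem.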

Weak Vs Strong Scaling Examples • The linked cell algorithm employed in DL_POLY 3 [1] for the short ranged forces should be strictly O(N) in time. • Study the weak scaling of three model systems (two shown next), the times being reported for HPCx, a large IBM P690+ cluster sited at Daresbury. • http://www.cse.clrc.ac.uk/arc/dlpoly_scale.shtml • I.J.Bush and W.Smith, CCLRC Daresbury Laboratory

Weak scaling for Argon is shown. The smallest system size is 32,000 atoms, the largest 32,768,000. The scaling is very good: the time per step increases only from 0.6 s to 0.7 s on going from 1 processor to 1024. This simulation is a direct test of the linked cell algorithm as it only requires short ranged forces, and so the results show it is behaving as expected.

Weak scaling for water. The time per step increases from 1.9 s on 1 processor, where the system size is 20,736 particles, to 3.9 s on 1024 processors (system size 21,233,664). In this case not only must Ewald terms be calculated, but constraint forces must be calculated as well. The constraint forces are short range and should scale as O(N), but their calculation requires a large number of short messages to be sent, so some latency effects become appreciable.
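The two DL_POLY cases can be checked with a few lines of arithmetic. The sketch below uses only the system sizes and times quoted on these slides; the efficiency definition (time on 1 processor divided by time on P processors at fixed work per processor) is a common convention assumed here, not something the slides state.

```python
PROCS_LARGE = 1024

# Values quoted on the slides: particle counts and seconds per time step.
CASES = {
    "argon": {"n_small": 32_000, "n_large": 32_768_000, "t_small": 0.6, "t_large": 0.7},
    "water": {"n_small": 20_736, "n_large": 21_233_664, "t_small": 1.9, "t_large": 3.9},
}


def size_per_processor(n_total, procs):
    """Particles per processor; constant across runs in a weak-scaling study."""
    return n_total / procs


def weak_scaling_efficiency(t_small, t_large):
    """1.0 means the time per step did not grow with processor count."""
    return t_small / t_large


def summarize(cases, procs):
    """Map case name -> (particles per processor at `procs`, efficiency)."""
    return {
        name: (
            size_per_processor(c["n_large"], procs),
            weak_scaling_efficiency(c["t_small"], c["t_large"]),
        )
        for name, c in cases.items()
    }


if __name__ == "__main__":
    for name, (per_proc, eff) in summarize(CASES, PROCS_LARGE).items():
        print(f"{name}: {per_proc:.0f} particles/proc, efficiency {eff:.2f}")
```

Both large runs keep exactly the single-processor particle count per processor, and the quoted times correspond to roughly 86% weak-scaling efficiency for argon and roughly 49% for water.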

Next Leap in Supercomputer Power
• PetaFLOP: 10^15 floating point operations/sec
• Multiple PFLOP machines expected in the US during 2008 - 2011
• NSF, DOE (ORNL, LANL, NNSA) are considering this
• Similar initiatives in Japan, Europe

Topic • Supercomputing in General • Supercomputers at SDSC • Eye Towards Petascale Computing

SDSC’s focus: Apps in top two quadrants
[Quadrant diagram. Horizontal axis: Compute (increasing FLOPS). Vertical axis: Data (increasing I/O and storage). Region labels: Data Storage/Preservation Env; Extreme I/O Environment; SDSC Data Science Env; Campus, Departmental and Desktop Computing; Traditional HEC Env. Applications shown: Climate, SCEC Simulation, SCEC Post-processing, ENZO simulation, ENZO Post-processing, EOL, NVO, Turbulence field, Cypres, CFD, Gaussian, CHARMM, CPMD, QCD, Turbulence Reattachment length, Protein Folding. Annotations: Time Variation of Field Variable Simulation; Out-of-Core.]

SDSC Production Computing Environment: 25 TF compute, 1.4 PB disk, 6 PB tape
• TeraGrid Linux Cluster: IBM/Intel IA-64, 4.4 TFlops
• DataStar: IBM Power4+, 15.6 TFlops
• Blue Gene Data: IBM PowerPC, 5.7 TFlops
• Storage Area Network Disk: 1400 TB
• Archival Systems: 6 PB capacity (~3 PB used)
• Sun F15K Disk Server

DataStar is a powerful compute resource well-suited to “extreme I/O” applications
• Peak speed 15.6 TFlops
• #44 in June 2006 Top500 list
• IBM Power4+ processors (2528 total)
• Hybrid of 2 node types, all on single switch
  • 272 8-way p655 nodes:
    • 176 nodes with 1.5 GHz processors, 16 GB/node (2 GB/proc)
    • 96 nodes with 1.7 GHz processors, 32 GB/node (4 GB/proc)
  • 11 32-way p690 nodes: 1.3 and 1.7 GHz, 64-256 GB/node (2-8 GB/proc)
• Federation switch: ~6 µsec latency, ~1.4 GB/sec pp-bandwidth
  • At 283 nodes, ours is one of the largest IBM Federation switches
• All nodes are direct-attached to high-performance SAN disk, 3.8 GB/sec write, 2.0 GB/sec read to GPFS
  • GPFS now has 125 TB capacity
  • 226 TB of gpfs-wan across NCSA, ANL
• Due to consistent high demand, in FY05 we added 96 1.7 GHz/32 GB p655 nodes and increased GPFS storage from 60 to 125 TB
  • Enables 2048-processor capability jobs
  • ~50% more throughput capacity
  • More GPFS capacity and bandwidth

BG System Overview: Novel, massively parallel system from IBM
• Full system installed at LLNL from 4Q04 to 3Q05
  • 65,000+ compute nodes in 64 racks
  • Each node has two low-power PowerPC processors + memory
  • Compact footprint with very high processor density
  • Slow processors & modest memory per processor
  • Very high peak speed of 367 Tflop/s
  • #1 Linpack speed of 280 Tflop/s
• 1024 compute nodes in single rack installed at SDSC in 4Q04
  • Maximum I/O configuration with 128 I/O nodes for data-intensive computing
• Systems at 14 sites outside IBM & 4 within IBM as of 2Q06
• Need to select apps carefully
  • Must scale (at least weakly) to many processors (because they’re slow)
  • Must fit in limited memory

SDSC was first academic institution with an IBM Blue Gene system
• SDSC procured a 1-rack system in 12/04. It was used initially for code evaluation and benchmarking, with production from 10/05. (The LLNL system is 64 racks.)
• The SDSC rack has the maximum ratio of I/O to compute nodes at 1:8 (LLNL’s is 1:64).
• Each of the 128 I/O nodes in the rack has a 1 Gbps Ethernet connection => 16 GBps/rack potential.
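The 1:8 ratio and the 16 GBps/rack figure follow directly from the node counts and link speed quoted above. A small sketch of that arithmetic (constant and function names are illustrative; the bandwidth is the theoretical aggregate of the Ethernet links, not a measured I/O rate):

```python
COMPUTE_NODES_PER_RACK = 1024
IO_NODES_PER_RACK = 128
LINK_GBITS_PER_SEC = 1.0  # one Gigabit Ethernet link per I/O node


def compute_nodes_per_io_node(compute_nodes, io_nodes):
    """How many compute nodes share each I/O node."""
    return compute_nodes // io_nodes


def aggregate_io_gbytes_per_sec(io_nodes, link_gbits_per_sec):
    """Upper bound on rack I/O bandwidth in GB/s (8 bits per byte)."""
    return io_nodes * link_gbits_per_sec / 8.0


if __name__ == "__main__":
    print(compute_nodes_per_io_node(COMPUTE_NODES_PER_RACK, IO_NODES_PER_RACK))
    print(aggregate_io_gbytes_per_sec(IO_NODES_PER_RACK, LINK_GBITS_PER_SEC))
```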

SDSC Blue Gene - a new resource
• In Dec '04, SDSC brought in a single-rack Blue Gene system
  • Initially an experimental system to evaluate NSF applications on this unique architecture
  • Tailored to high I/O applications
  • Entered production as allocated resource in October 2005
• First academic installation of this novel architecture
• Configured for data-intensive computing
  • 1,024 compute nodes, 128 I/O nodes
  • Peak compute performance of 5.7 TFLOPS
  • Two 700-MHz PowerPC 440 CPUs, 512 MB per node
  • IBM network: 4 µs latency, 0.16 GB/sec pp-bandwidth
  • I/O rates of 3.4 GB/s for writes and 2.7 GB/s for reads achieved on GPFS-WAN
  • Has own GPFS of 20 TB and gpfs-wan
• System targets runs of 512 CPUs or more
• Multiple 1 million-SU awards at LRAC and several smaller awards for physics, engineering, biochemistry

BG System Overview: Processor Chip (1)

BG System Overview: Processor Chip (2) (= System-on-a-chip)
• Two 700-MHz PowerPC 440 processors
  • Each with two floating-point units
  • Each with 32-kB L1 data caches that are not coherent
  • 4 flops/proc-clock peak (= 2.8 Gflop/s-proc)
  • 2 8-B loads or stores / proc-clock peak in L1 (= 11.2 GB/s-proc)
• Shared 2-kB L2 cache (or prefetch buffer)
• Shared 4-MB L3 cache
• Five network controllers (though not all wired to each node)
  • 3-D torus (for point-to-point MPI operations: 175 MB/s nom x 6 links x 2 ways)
  • Tree (for most collective MPI operations: 350 MB/s nom x 3 links x 2 ways)
  • Global interrupt (for MPI_Barrier: low latency)
  • Gigabit Ethernet (for I/O)
  • JTAG (for machine control)
• Memory controller for 512 MB of off-chip, shared memory
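The per-processor peak figures above, and the rack and full-system peaks quoted on the earlier Blue Gene slides, all follow from the clock rate and the per-clock limits. A short sketch of that arithmetic, using only numbers from the slides (the constant and function names are illustrative):

```python
CLOCK_GHZ = 0.7                 # 700-MHz PowerPC 440
FLOPS_PER_CLOCK = 4             # peak flops per processor clock
L1_BYTES_PER_CLOCK = 2 * 8      # two 8-byte loads or stores per clock
PROCS_PER_NODE = 2
NODES_PER_RACK = 1024
LLNL_RACKS = 64


def proc_peak_gflops():
    """Peak Gflop/s for one processor."""
    return CLOCK_GHZ * FLOPS_PER_CLOCK


def proc_l1_bandwidth_gbytes():
    """Peak L1 load/store bandwidth in GB/s for one processor."""
    return CLOCK_GHZ * L1_BYTES_PER_CLOCK


def rack_peak_tflops():
    """Peak Tflop/s for one 1024-node rack."""
    return proc_peak_gflops() * PROCS_PER_NODE * NODES_PER_RACK / 1000.0


def system_peak_tflops(racks):
    """Peak Tflop/s for a system of `racks` racks."""
    return rack_peak_tflops() * racks


if __name__ == "__main__":
    print("per-proc peak (Gflop/s):", round(proc_peak_gflops(), 2))
    print("per-proc L1 bandwidth (GB/s):", round(proc_l1_bandwidth_gbytes(), 2))
    print("one rack peak (Tflop/s):", round(rack_peak_tflops(), 2))
    print("64 racks peak (Tflop/s):", round(system_peak_tflops(LLNL_RACKS), 1))
```

The results reproduce the 2.8 Gflop/s and 11.2 GB/s per-processor figures, the 5.7 Tflop/s SDSC rack, and the 367 Tflop/s LLNL system.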

DataStar p655 Usage, by Node Size

SDSC Academic Use, by Directorate

Strategic Applications Collaborations
• Cellulose to Ethanol: Biochemistry (J. Brady, Cornell)
• LES Turbulence: Mechanics (M. Krishnan, U. Minnesota)
• NEES: Earthquake Engr (Ahmed Elgamal, UCSD)
• ENZO: Astronomy (M. Norman, UCSD)
• EM Tomography: Neuroscience (M. Ellisman, UCSD)
• DNS Turbulence: Aerospace Engr (PK Yeung, Georgia Tech)
• NVO Mosaicking: Astronomy (R. Williams, Caltech, Alex Szalay, Johns Hopkins)
• Understanding Pronouns: Linguistics (A. Kehler, UCSD)
• Climate: Atmospheric Sc. (C. Wunsch, MIT)
• Protein Structure: Biochemistry (D. Baker, Univ. of Washington)
• SCEC, TeraShake: Geological Science (T. Jordan and C. Kesselman USC, K. Olsen UCSB, B. Minster, SIO)

Topic
• Supercomputing in General
• Supercomputers at SDSC
• Eye Towards Petascale Computing
  3.1 Petascale Hardware
  3.2 Petascale Software

3.1 Petascale Hardware

NERSC director Horst Simon (a few days ago)
“When I talk about petaflop computing, what I have in mind is the longer-term perspective, the time when the HPC community enters the age of petascale computing. What I mean is the time when you must achieve petaflop Rmax performance to make the TOP500 list. An intriguing question is, when will this happen? If you do a straight-line extrapolation from today's TOP500 list, you come up with the year 2016. In any case, it's eight to 10 years from now, and we will have to master several challenges to reach the age of petascale computing.”

The Memory Wall
Source: “Getting up to speed: The Future of Supercomputing”, NRC, 2004

Number of processors in the most highly parallel system in the TOP500
[Chart. Labeled systems: Intel Paragon XP, ASCI Red, IBM BG/L]

Petascale Power Problem (Horst Simon)
• Power consumption is really pushing the limits of the environment most computing centers have
• A peak-petaflop Cray XT3 or cluster would need 8-9 megawatts for the computer alone
• The 2011 HPCS sustained petaflop systems would require about 20 megawatts
• Efficient power solutions needed
  • Blue Gene is better, but it still requires high megawatts for a petaflop system
• At a cost of 10 cents per kilowatt-hour, a 20-megawatt system would cost $12 million or more a year just for electricity
• Need to exploit different processor curves, such as the low-cost processors used in embedded technology
  • The Cell processor comes from low-end, embedded game technology and has great potential, but there is a huge step from initial assessment to a production solution
• Don’t forget the space problem and MTBF
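The electricity estimate is simple arithmetic on power draw, hours of operation, and price. The sketch below is an illustration with assumed inputs: the `utilization` parameter and the 8760-hour year are assumptions introduced here, not figures from the talk.

```python
HOURS_PER_YEAR = 24 * 365


def annual_electricity_cost_usd(power_mw, usd_per_kwh, utilization=1.0):
    """Yearly electricity cost for a system drawing power_mw megawatts.

    utilization is the assumed average fraction of that power actually drawn.
    """
    power_kw = power_mw * 1000.0
    return power_kw * HOURS_PER_YEAR * utilization * usd_per_kwh


if __name__ == "__main__":
    full = annual_electricity_cost_usd(20.0, 0.10)
    print(f"20 MW, continuous: ${full / 1e6:.2f} million per year")
    xt3 = annual_electricity_cost_usd(9.0, 0.10)
    print(f" 9 MW, continuous: ${xt3 / 1e6:.2f} million per year")
```

At continuous full draw, 20 megawatts at 10 cents per kilowatt-hour comes to about $17.5 million a year, which is consistent with the slide's "$12 million or more".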

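Applications and HPCC (next 7 slides from Rolf Rabenseifner, U. of Stuttgart)
[Diagram placing HPCC benchmarks and applications by locality. Vertical axis: spatial locality (low to high). Horizontal axis: temporal locality (low to high). Labels: PTRANS, STREAM, HPL, DGEMM, CFD, Radar X section, Applications, DSP, TSP, RANDOM ACCESS, FFT]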

Balance Analysis of Machines with HPCC
• Balance expressed as a set of ratios
• Normalized by CPU speed (HPL Tflop/s rate)
• Basis
  • Linpack (HPL): computational speed
  • Parallel STREAM Copy or Triad: memory bandwidth
  • Random Ring Bandwidth: inter-node communication
  • FFT: low spatial and high temporal locality
  • PTRANS: total communication capacity of network

Balance between memory and CPU speed

Balance between Random Ring BW (network BW) and CPU speed

Balance between Fast Fourier Transform (FFTE) and CPU Speed

Balance between Matrix Transpose (PTRANS) and CPU Speed

Balance of Today’s Machines • Today, balance factors are in a range of • 20 inter-node communication / HPL-TFlop/s • 10 memory speed / HPL-TFlop/s • 20 FFTE / HPL-TFlop/s • 30 PTRANS / HPL-TFlop/s
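The balance factors above are each HPCC result divided by the machine's HPL rate. A minimal sketch of that normalization follows; the machine numbers are hypothetical, and the units (GB/s for bandwidths, Gflop/s for FFT, Tflop/s for HPL) are an assumption, since the slides do not state them.

```python
def balance_ratios(hpl_tflops, measurements):
    """Divide each HPCC measurement by the HPL rate to get a balance factor."""
    return {name: value / hpl_tflops for name, value in measurements.items()}


if __name__ == "__main__":
    # Hypothetical machine: values invented for illustration only.
    hypothetical_hpl_tflops = 2.0
    hypothetical_measurements = {
        "stream_triad_gbytes_per_sec": 20.0,
        "random_ring_gbytes_per_sec": 40.0,
        "ptrans_gbytes_per_sec": 60.0,
        "ffte_gflops": 40.0,
    }
    ratios = balance_ratios(hypothetical_hpl_tflops, hypothetical_measurements)
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.1f} per HPL Tflop/s")
```

Comparing these ratios across machines, rather than the raw numbers, shows whether memory and network performance have kept pace with CPU speed.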

A Petascale Machine
• 10 GFLOP per processor x 100,000 procs
  • Higher GFLOP machine – fewer processors (< 100K)
  • Lower GFLOP machine – more processors (> 1000K)
• Commodity processors
• Heterogeneous processors
  • Clearspeed card, graphics cards, FPGA
• Cray Adaptive Supercomputing
  • Combines standard microprocessors (scalar processing), vector processing, multithreading and hardware accelerators in one high-performance computing platform
• Sony – Toshiba – IBM Cell processors
• Memory, interconnect, I/O performance should scale in a balanced way with CPU speed
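The trade-off between per-processor speed and processor count for a peak petaflop can be sketched as a one-line division. The function name and the sample per-processor speeds other than 10 GFLOP are illustrative choices, not values from the slide.

```python
import math

GFLOPS_PER_PETAFLOP = 1_000_000  # 10^15 / 10^9


def processors_for_peak_petaflop(per_proc_gflops):
    """Processors needed to reach one peak petaflop at the given per-processor speed."""
    return math.ceil(GFLOPS_PER_PETAFLOP / per_proc_gflops)


if __name__ == "__main__":
    for gflops in (1, 10, 20):
        count = processors_for_peak_petaflop(gflops)
        print(f"{gflops:3d} GFLOP/proc -> {count:,} processors")
```

This is peak arithmetic only; it says nothing about whether an application can actually use that many processors.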

ORNL Petascale Roadmap
• ORNL will reach peak petaflops performance in stages, now through 2008:
  • 2006: upgrade the 25-teraflops Cray XT3 system (5294 nodes, each with a 2.4-GHz AMD Opteron processor and 2 GB of memory) to 50 teraflops via dual-core AMD Opteron™ processors
  • Late 2006: move to 100 teraflops with system codenamed "Hood"
  • Late 2007: upgrade "Hood" to 250 teraflops
  • Late 2008: move to peak petaflops with new architecture codenamed "Baker"
• Cray Adaptive Supercomputing - Powerful compilers and other software will automatically match an application to the processor blade that is best suited for it.

3.2 Petascale Software

Parallel Applications • Higher level domain decomposition of some sort or embarrassingly parallel types (astro/physics, engr, chemistry, CFD, MD, climate, materials) • Mid level parallel math libraries (linear system solvers, FFT, random# generators etc.) • Lower level search/sort algorithms , other computer science algorithms

Application Scaling • Performance characterization and prediction of apps – computer science approach • Scaling current apps for petascale machines – computational science approach • Developing petascale apps/algorithms – numerical methods approach • New languages for petascale applications

1. Performance Characterization/Prediction – computer science approach
• Characterizing and understanding current applications' performance (next talk by Pfeiffer)
  • How much time is spent in memory/cache access, and with what access pattern
  • How much time is spent in communication and what kind of communication pattern is involved, i.e. processor-to-processor communication, global communication operations where all the processors participate, or both
  • How much time is spent in I/O, and with what I/O pattern
• Understanding these will allow us to figure out the importance/effect of various parts of a supercomputer on the application performance

Performance Modeling & Characterization
• PMaC lab at SDSC (www.sdsc.edu/PMaC)
• Application Signature: operations that need to be carried out by the application
  • Collecting: number of op1, op2, and op3
• Machine Profile: rate at which a machine can perform different operations
  • Collecting: rate op1, op2, op3
• Convolution: mapping of a machine's performance (rates) to the application's needed operations

$$\text{Execution time} = \frac{\text{operation1}}{\text{rate op1}} \oplus \frac{\text{operation2}}{\text{rate op2}} \oplus \frac{\text{operation3}}{\text{rate op3}}$$

where the operator $\oplus$ could be + or MAX depending on operation overlap
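The convolution idea can be illustrated in a few lines: divide each operation count in the application signature by the matching rate in the machine profile, then combine the per-operation times with + (no overlap) or MAX (full overlap). This is a toy sketch with invented operation names and numbers, not the PMaC lab's actual tooling or method.

```python
def predicted_time(signature, profile, overlap=False):
    """Combine per-operation times (count / rate) by sum or by max.

    signature: {operation: count needed by the application}
    profile:   {operation: rate the machine sustains, in count per second}
    overlap:   False -> operations run one after another (+)
               True  -> operations fully overlap (MAX)
    """
    times = [signature[op] / profile[op] for op in signature]
    return max(times) if overlap else sum(times)


if __name__ == "__main__":
    # Hypothetical application signature and machine profile.
    signature = {"flops": 4.0e12, "memory_bytes": 2.0e12, "network_bytes": 1.0e11}
    profile = {"flops": 1.0e12, "memory_bytes": 1.0e12, "network_bytes": 1.0e11}
    print("no overlap  :", predicted_time(signature, profile), "s")
    print("full overlap:", predicted_time(signature, profile, overlap=True), "s")
```

Real codes usually fall between the two bounds, which is why the choice of operator depends on how much the operations overlap.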

2. Scaling Current Apps – computational science approach
• At another level we need to understand what kinds of algorithms and numerical methods are used, how they will scale, and whether a different approach is needed to improve scaling
• P.K. Yeung's DNS code is the example on the next slide (detailed talk on this Wednesday morning: Yeung, Pekurovsky)
  • Example of an algorithmic-level modification of the domain decomposition for scaling towards a petaflop machine
• One can do these types of analysis for all the computational science fields (molecular dynamics, climate/atmospheric models, CFD turbulence, astrophysics, QCD, fusion, etc.)
• Some fields may already have the optimal algorithm, which will scale to a petascale machine (this is being very optimistic) and will then be able to solve bigger, higher-resolution problems

DNS Problem
• DNS study of turbulence and turbulent mixing
• 3D space is decomposed in 1 dimension among processors
• 90% of time spent in 3D FFT routines
• Scaling limited to 2048 processors, solving problems up to 2048^3 (N = 2048 grid points per dimension)
• Number of processors limited by the linear problem size (N) due to 1D decomposition
• Would like to scale to many more processors to study problems 4096^3 and larger, using IBM Blue Gene and future (sub)petascale architectures
• Solution: decompose in 2D, giving a maximum of N^2 processors
• For 4096^3 can use a maximum of 16,777,216 processors
• There has to be both need and scaling
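The processor limits quoted above come from how many independent pieces each decomposition can produce. In the sketch below, a 1D (slab) decomposition of an N^3 grid yields at most N pieces and a 2D (pencil) decomposition at most N^2; the function names and the sample processor grid are illustrative and are not taken from the DNS code.

```python
def max_processors(n, decomposition_dims):
    """Upper bound on processors for an n^3 grid split along 1 or 2 dimensions."""
    if decomposition_dims not in (1, 2):
        raise ValueError("only 1D (slab) and 2D (pencil) decompositions are modeled")
    return n ** decomposition_dims


def pencil_shape(n, proc_rows, proc_cols):
    """Local block held by each processor in a 2D decomposition.

    Assumes proc_rows and proc_cols divide n evenly.
    """
    if n % proc_rows or n % proc_cols:
        raise ValueError("processor grid must divide the problem size")
    return (n, n // proc_rows, n // proc_cols)


if __name__ == "__main__":
    print("2048^3, 1D:", max_processors(2048, 1))
    print("4096^3, 1D:", max_processors(4096, 1))
    print("4096^3, 2D:", max_processors(4096, 2))
    # Hypothetical 128 x 256 processor grid (32,768 processors).
    print("local block:", pencil_shape(4096, 128, 256))
```

The 2D bound is far beyond any planned machine, so in practice the limit becomes communication cost in the FFT transposes rather than the decomposition itself.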

3. Develop Petascale Apps/Algorithms – numerical methods approach
• Develop new algorithms/numerics from scratch for a particular field, keeping in mind that we will now have (say) a 100,000-processor machine
• When the original algorithms/codes were implemented, researchers were thinking of a few hundred or a thousand processors
• Climate models using spectral element methods provide much higher scalability due to less communication overhead, better cache performance, etc., which follow from the fundamental characteristics of spectral element numerical methods (Friday morning talk: Amik St-Cyr from NCAR)
• So for the last few years climate researchers have been developing parallel codes using this kind of numerical method, expecting that petaflop-class machines will have large numbers of processors
