Overview of HPC SDSC Machines Science Enabled at SDSC

Overview of HPC SDSC Machines Science Enabled at SDSC • Amit Majumdar, Scientific Computing Applications Group, San Diego Supercomputer Center, University of California San Diego

Topics • Supercomputing in General • Supercomputers at SDSC • Science Enabled at SDSC

DOE, DOD, NASA, NSF Centers in US • DOE National Labs - LANL, LLNL, Sandia • DOE Office of Science Labs – ORNL, NERSC • DOD, NASA Supercomputer Centers • National Science Foundation supercomputer centers for academic users • San Diego Supercomputer Center (UCSD) • National Center for Supercomputing Applications (UIUC) • Pittsburgh Supercomputing Center (Pittsburgh) • Texas Advanced Computing Center (U. Texas) • Indiana-Purdue • ANL-Chicago

TeraGrid: Integrating NSF Cyberinfrastructure • [Map of TeraGrid sites; labels: Buffalo, Wisc, Cornell, Utah, Iowa, Caltech, USC-ISI, UNC-RENCI, UC/ANL, PU, NCAR, PSC, IU, NCSA, ORNL, SDSC, TACC] • TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

Measure of Supercomputers • Top 500 list (HPL code performance) • Is one of the measures, but not the measure • Japan’s Earth Simulator (NEC) was on top for 3 years • In Nov 2005 LLNL IBM BlueGene reached the top spot ~65000 nodes, 280 TFLOP on HPL, 367 TFLOP peak • First 100 TFLOP sustained on a real application last year • Very recently 200+ TFLOP sustained on a real application • New HPCC benchmarks • Many others – NAS, NERSC, NSF, DOD TI06 etc. • Ultimate measure is usefulness of a center for you – enabling better or new science through simulations on balanced machines

Top500 Benchmarks • 27th Top500 list – June 2006 • NSF Supercomputer Centers in Top500

Historical Trends in Top500 • 1000 X increase in top machine power in 10 years
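As a rough reading of the trend above, a 1000x increase over 10 years corresponds to roughly a doubling every year. A minimal sketch of that arithmetic, assuming constant compound growth (the function names are illustrative, not from the slides):

```python
import math


def annual_growth_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor: float, years: float) -> float:
    """Years needed to double at that constant growth rate."""
    return years * math.log(2.0) / math.log(total_factor)


if __name__ == "__main__":
    print(f"annual growth factor: {annual_growth_factor(1000, 10):.2f}x")
    print(f"doubling time: {doubling_time_years(1000, 10):.2f} years")
```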

Other Benchmarks • HPCC – High Performance Computing Challenge benchmarks – no rankings • NSF benchmarks – HPCC, SPIO, and applications: WRF, OOCORE, GAMESS, MILC, PARATEC, HOMME (these are changing; new ones are being considered) • DoD HPCMP – TI06 benchmarks

Kiviat diagrams

Capability Computing • Full power of a machine is used for a given scientific problem, utilizing CPUs, memory, interconnect, and I/O performance • Enables the solution of problems that cannot otherwise be solved in a reasonable period of time; the figure of merit is time to solution • E.g., moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models
Capacity Computing • Modest problems are tackled, often simultaneously, on a machine, each with less demanding requirements • Smaller or cheaper systems are used for capacity computing, where smaller problems are solved • Used for parametric studies or to explore design alternatives • The main figure of merit is sustained performance per unit cost

Strong Scaling • For a fixed problem size, how does the time to solution vary with the number of processors? • Run a fixed-size problem and plot the speedup • When scaling of parallel codes is discussed, it is normally strong scaling that is being referred to
Weak Scaling • How the time to solution varies with processor count for a fixed problem size per processor • Interesting for O(N) algorithms, where perfect weak scaling is a constant time to solution, independent of processor count • Deviations from this indicate that either the algorithm is not truly O(N), or the overhead due to parallelism is increasing, or both
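In symbols, writing T(P) for the time to solution on P processors (notation introduced here for illustration, not taken from the slides), the two notions are commonly expressed as follows. In both cases an efficiency of 1 is the ideal.

```latex
% Strong scaling: total problem size held fixed
S(P) = \frac{T(1)}{T(P)}, \qquad
E_{\mathrm{strong}}(P) = \frac{S(P)}{P} = \frac{T(1)}{P\,T(P)}

% Weak scaling: problem size per processor held fixed
E_{\mathrm{weak}}(P) = \frac{T(1)}{T(P)}
```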

Weak Vs Strong Scaling Examples • The linked cell algorithm employed in DL_POLY 3 [1] for the short ranged forces should be strictly O(N) in time. • Study the weak scaling of three model systems (two shown next), the times being reported for HPCx, a large IBM P690+ cluster sited at Daresbury. • http://www.cse.clrc.ac.uk/arc/dlpoly_scale.shtml • I.J.Bush and W.Smith, CCLRC Daresbury Laboratory

Weak scaling for Argon is shown. The smallest system size is 32,000 atoms, the largest 32,768,000. It can be seen that the scaling is very good, the time per time step increasing from 0.6 s to 0.7 s on going from 1 processor to 1024. This simulation is a direct test of the linked cell algorithm as it only requires short ranged forces, and so the results show it is behaving as expected.

Weak scaling for water. The time per time step increases from 1.9 seconds on 1 processor, where the system size is 20,736 particles, to 3.9 seconds on 1024 (system size 21,233,664). Ewald terms must also be calculated in this case, and constraint forces must be calculated as well. The constraint forces are short range and should scale as O(N), but their calculation requires a large number of short messages to be sent, and some latency effects become appreciable.
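The figures quoted for argon and water can be turned into weak scaling efficiencies directly. The sketch below uses only the numbers given in the text; the function and variable names are illustrative. It also checks that the particle count per processor really is constant between the 1-processor and 1024-processor runs.

```python
PROCS = 1024

# (system, particles on 1 proc, particles on 1024 procs,
#  seconds per step on 1 proc, seconds per step on 1024 procs)
RUNS = [
    ("argon", 32_000, 32_768_000, 0.6, 0.7),
    ("water", 20_736, 21_233_664, 1.9, 3.9),
]


def weak_scaling_efficiency(t_one_proc: float, t_many_procs: float) -> float:
    """Ideal weak scaling keeps the time per step constant, i.e. efficiency 1.0."""
    return t_one_proc / t_many_procs


def summarize(runs=RUNS, procs=PROCS):
    """Return (name, work per processor is constant?, efficiency) for each run."""
    rows = []
    for name, n_small, n_large, t_small, t_large in runs:
        constant_work = (n_large == n_small * procs)
        rows.append((name, constant_work, weak_scaling_efficiency(t_small, t_large)))
    return rows


if __name__ == "__main__":
    for name, constant_work, eff in summarize():
        print(f"{name}: constant work per processor={constant_work}, efficiency={eff:.0%}")
```

On these numbers argon retains roughly 86% efficiency and water roughly 49%, which matches the qualitative statements on the slides.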

Next Leap in Supercomputer Power • PetaFLOP: 10^15 floating point operations/sec • Expected multiple PFLOP(s) machines in the US during 2008 - 2011 • NSF, DOE (ORNL, LANL, NNSA) are considering this • Similar initiative in Japan, Europe

Topic • Supercomputing in General • Supercomputers at SDSC • Science Enabled at SDSC

SDSC’s focus: Apps in top two quadrants • [Quadrant diagram. Axes: Data (increasing I/O and storage) and Compute (increasing FLOPS). Regions: Data Storage/Preservation Env; Extreme I/O Environment; SDSC Data Science Env; Campus, Departmental and Desktop Computing; Traditional HEC Env. Applications plotted: Climate, SCEC Post-processing, SCEC Simulation, ENZO simulation, ENZO Post-processing, EOL, NVO, Turbulence field, Cypres, CFD, Gaussian, CHARMM, CPMD, QCD, Turbulence Reattachment length, Protein Folding] • Extreme I/O Environment: Time Variation of Field Variable Simulation; Out-of-Core

SDSC Production Computing Environment: 25 TF compute, 1.4 PB disk, 6 PB tape • TeraGrid Linux Cluster: IBM/Intel IA-64, 4.4 TFlops • DataStar: IBM Power4+, 15.6 TFlops • Blue Gene Data: IBM PowerPC, 2 x 5.7 TFlops • Storage Area Network Disk: 1400 TB • Archival Systems: 18 PB capacity (~3.5 PB used) • Sun F15K Disk Server

DataStar is a powerful compute resource well-suited to “extreme I/O” applications • Peak speed 15.6 TFlops • #44 in June 2006 Top500 list • IBM Power4+ processors (2528 total) • Hybrid of 2 node types, all on single switch • 272 8-way p655 nodes: • 176 nodes with 1.5 GHz processors, 16 GB/node (2 GB/proc) • 96 nodes with 1.7 GHz processors, 32 GB/node (4 GB/proc) • 11 32-way p690 nodes: 1.7 GHz, 64-256 GB/node (2-8 GB/proc) • Federation switch: ~6 μsec latency, ~1.4 GB/sec pp-bandwidth • At 283 nodes, ours is one of the largest IBM Federation switches • All nodes are direct-attached to high-performance SAN disk, 3.8 GB/sec write, 2.0 GB/sec read to GPFS • GPFS now has 115 TB capacity • 225 TB of gpfs-wan across NCSA, ANL • Due to consistent high demand, in FY05 we added 96 1.7 GHz/32 GB p655 nodes & increased GPFS storage from 60 to 125 TB • Enables 2048-processor capability jobs • ~50% more throughput capacity • More GPFS capacity and bandwidth
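The node breakdown above can be checked against the quoted totals of 283 nodes and 2528 processors. A small sketch of that bookkeeping, using only the counts listed on the slide (the names are illustrative):

```python
# (node type, number of nodes, processors per node), as listed on the slide
NODE_GROUPS = [
    ("p655, 1.5 GHz", 176, 8),
    ("p655, 1.7 GHz", 96, 8),
    ("p690, 1.7 GHz", 11, 32),
]


def totals(groups=NODE_GROUPS):
    """Return (total nodes, total processors) across all node groups."""
    nodes = sum(count for _, count, _ in groups)
    procs = sum(count * per_node for _, count, per_node in groups)
    return nodes, procs


def p655_processors(groups=NODE_GROUPS):
    """Processors available on the 8-way p655 nodes only."""
    return sum(count * per_node for name, count, per_node in groups
               if name.startswith("p655"))


if __name__ == "__main__":
    nodes, procs = totals()
    print(f"{nodes} nodes, {procs} processors")
    print(f"p655 processors: {p655_processors()}")
```

The p655 partition alone holds 2176 processors, which is why a 2048-processor capability job fits on it.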

BG System Overview: Novel, massively parallel system from IBM • Full system installed at LLNL from 4Q04 to 3Q05 • 65,000+ compute nodes in 64 racks • Each node has two low-power PowerPC processors + memory • Compact footprint with very high processor density • Slow processors & modest memory per processor • Very high peak speed of 367 Tflop/s • #1 Linpack speed of 280 Tflop/s • 1024 compute nodes in single rack installed at SDSC in 4Q04 • Another 1024 compute nodes will be installed soon • Maximum I/O configuration with 128 I/O nodes/rack for data-intensive computing • Systems at 14 sites outside IBM & 4 within IBM as of 2Q06 • Need to select apps carefully • Must scale (at least weakly) to many processors (because they’re slow) • Must fit in limited memory

SDSC was first academic institution with an IBM Blue Gene system • SDSC procured a 1-rack system in 12/04. Used initially for code evaluation and benchmarking; production 10/05. (LLNL system is 64 racks.) • Another rack will be installed soon • SDSC rack has maximum ratio of I/O to compute nodes at 1:8 (LLNL’s is 1:64). • Each of 128 I/O nodes in rack has 1 Gbps Ethernet connection => 16 GBps/rack potential.
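The 1:8 ratio and the 16 GBps/rack figure follow from the node counts and link speed quoted for the SDSC rack. A minimal sketch of the arithmetic; it gives the nominal aggregate link rate only and ignores protocol and file system overhead, and the names are illustrative:

```python
COMPUTE_NODES_PER_RACK = 1024
IO_NODES_PER_RACK = 128
LINK_GBITS_PER_SEC = 1.0  # one Gigabit Ethernet link per I/O node


def compute_nodes_per_io_node(compute_nodes: int = COMPUTE_NODES_PER_RACK,
                              io_nodes: int = IO_NODES_PER_RACK) -> int:
    """How many compute nodes share each I/O node."""
    return compute_nodes // io_nodes


def aggregate_io_gbytes_per_sec(io_nodes: int = IO_NODES_PER_RACK,
                                link_gbits: float = LINK_GBITS_PER_SEC) -> float:
    """Nominal aggregate Ethernet bandwidth of a rack in GB/s (8 bits per byte)."""
    return io_nodes * link_gbits / 8.0


if __name__ == "__main__":
    print(f"I/O : compute ratio = 1:{compute_nodes_per_io_node()}")
    print(f"nominal I/O bandwidth = {aggregate_io_gbytes_per_sec():.0f} GB/s per rack")
```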

SDSC Blue Gene - a new resource • In Dec ‘04, SDSC brought in a single-rack Blue Gene system • Initially an experimental system to evaluate NSF applications on this unique architecture • Tailored to high I/O applications • Entered production as allocated resource in October 2005 • First academic installation of this novel architecture • Configured for data-intensive computing • 1,024 compute nodes (soon to be 2,048), 128 I/O nodes • Peak compute performance of 5.7 TFLOPS (soon will be 11.4 TFLOPS) • Two 700-MHz PowerPC 440 CPUs, 512 MB per node • IBM network: 4 μs latency, 0.16 GB/sec pp-bandwidth • I/O rates of 3.4 GB/s for writes and 2.7 GB/s for reads achieved on GPFS-WAN • Has own GPFS of 20 TB and gpfs-wan • System targets runs of 512 CPUs or more • Multiple 1 million-SU awards at LRAC and several smaller awards for physics, engineering, biochemistry

BG System Overview: Processor Chip (1)

BG System Overview: Processor Chip (2) (= System-on-a-chip) • Two 700-MHz PowerPC 440 processors • Each with two floating-point units • Each with 32-kB L1 data caches that are not coherent • 4 flops/proc-clock peak (= 2.8 Gflop/s-proc) • 2 8-B loads or stores / proc-clock peak in L1 (= 11.2 GB/s-proc) • Shared 2-kB L2 cache (or prefetch buffer) • Shared 4-MB L3 cache • Five network controllers (though not all wired to each node) • 3-D torus (for point-to-point MPI operations: 175 MB/s nom x 6 links x 2 ways) • Tree (for most collective MPI operations: 350 MB/s nom x 3 links x 2 ways) • Global interrupt (for MPI_Barrier: low latency) • Gigabit Ethernet (for I/O) • JTAG (for machine control) • Memory controller for 512 MB of off-chip, shared memory
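The per-processor peak figures on this slide, and the system peaks quoted earlier (5.7 TFLOPS for one rack, 367 Tflop/s at LLNL), all follow from the clock rate and per-clock rates listed above. A small sketch of that derivation; it assumes 1024 compute nodes per rack for the 64-rack LLNL system, and the names are illustrative:

```python
CLOCK_GHZ = 0.7              # 700-MHz PowerPC 440
FLOPS_PER_CLOCK = 4          # peak flops per processor clock
L1_BYTES_PER_CLOCK = 2 * 8   # two 8-byte loads or stores per clock
PROCS_PER_NODE = 2
NODES_PER_RACK = 1024        # assumed equal for every rack


def peak_gflops_per_proc() -> float:
    return CLOCK_GHZ * FLOPS_PER_CLOCK


def l1_gbytes_per_sec_per_proc() -> float:
    return CLOCK_GHZ * L1_BYTES_PER_CLOCK


def system_peak_tflops(racks: int) -> float:
    """Peak of a system with the given number of racks, in Tflop/s."""
    procs = racks * NODES_PER_RACK * PROCS_PER_NODE
    return procs * peak_gflops_per_proc() / 1000.0


if __name__ == "__main__":
    print(f"per processor: {peak_gflops_per_proc():.1f} Gflop/s, "
          f"{l1_gbytes_per_sec_per_proc():.1f} GB/s from L1")
    for racks in (1, 2, 64):
        print(f"{racks:2d} rack(s): {system_peak_tflops(racks):.1f} Tflop/s peak")
```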

DataStar p655 Usage, by Node Size

SDSC Academic Use, by Directorate

Strategic Applications Collaborations • Cellulose to Ethanol: Biochemistry (J. Brady, Cornell) • LES Turbulence: Mechanics (M. Krishnan, U. Minnesota) • NEES: Earthquake Engr (Ahmed Elgamal, UCSD) • ENZO: Astronomy (M. Norman, UCSD) • EM Tomography: Neuroscience (M. Ellisman, UCSD) • DNS Turbulence: Aerospace Engr (PK Yeung, Georgia Tech) • NVO Mosaicking: Astronomy (R. Williams, Caltech, Alex Szalay, Johns Hopkins) • Understanding Pronouns: Linguistics (A. Kehler, UCSD) • Climate: Atmospheric Sc. (C. Wunsch, MIT) • Protein Structure: Biochemistry (D. Baker, Univ. of Washington) • SCEC, TeraShake: Geological Science (T. Jordan and C. Kesselman USC, K. Olsen UCSB, B. Minster, SIO)

Topic • Supercomputing in General • Supercomputers at SDSC • Science Enabled at SDSC

Enabling Users – User Centric Focus of SDSC • SDSC leadership in enabling users • Recruit new users/communities • Enable their HPC • Help write allocation proposals • Turn recruited users into users with 100K – 1000K allocated SUs • (D. Baker/U. Washington, C. Wunsch/MITgcm, Mark Ellisman/BIRN, NEES, PK Yeung/Georgia Tech, M. Krishnan/U. Minn, K. Droegemeier, GEON etc.) • Balanced machine – memory/node, I/O, queue management – these attract and retain users • Work on community users, and community/3rd party codes, tools, libs • Procure machines based on needs and characteristics of users’ codes

SAC Program and Science Enabled • Achieve breakthrough computational science that users couldn’t do before • Pair up SDSC’s computational scientists (experts in many domain sciences and in parallel computing) with NSF PIs for 3-12 months • SAC projects span all the NSF directorates and universities across the US • Scaling up applications (procs/communication and I/O) is a major thrust • Develop and apply solutions for wider user community

Scaling DNS Turbulence (PI: Dr. P.K. Yeung, Georgia Tech; SAC staff: Dr. Dmitry Pekurovsky) • Original DNS code used for years to simulate a range of phenomena in turbulence and turbulent mixing • Over the years PI had millions of allocated SUs on SDSC and other NSF centers’ machines • Currently computing at 2048^3 resolution • Would like to reach the grid size done on the Earth Simulator, i.e. 4096^3 resolution, to better understand physics at micro scales • Original code is limited in scalability to N (4096) processors for an N^3 grid problem

2-D Parallel Decomposed Code • Reimplemented in 2-D parallel decomposition of the compute-intensive part (3D FFT) • Now capable of scaling up to N^2 processors (16M) • New code successfully tested and running on 32,768 BG processors at IBM Watson lab (4096^3 – first ever attempted in US) • By-product: optimized library for scalable 3D FFT, for use in other codes. Beta version available at SDSC Web site. Currently using the library in another turbulence code, as part of another SAC project. • Figure: The execution speed (# of steps per second of execution), normalized by the problem size, is plotted on the Y-axis.
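The processor limits quoted for the two decompositions can be illustrated with a short sketch. It assumes the usual picture of a 1-D (slab) decomposition, where each processor owns at least one full plane of the N^3 grid, and a 2-D (pencil) decomposition, where each processor owns at least one full line. The 128 x 256 processor grid in the example is a hypothetical choice that multiplies to 32,768; the actual layout used in the runs is not stated on the slides.

```python
def max_procs_slab(n: int) -> int:
    """1-D decomposition: at most one plane per processor, so at most N processors."""
    return n


def max_procs_pencil(n: int) -> int:
    """2-D decomposition: at most one line per processor, so at most N^2 processors."""
    return n * n


def pencil_local_shape(n: int, p_rows: int, p_cols: int) -> tuple:
    """Local sub-grid owned by one processor on a p_rows x p_cols processor grid.

    The first dimension stays whole so that a 1-D FFT along it needs no
    communication; the other two dimensions are split across the processor grid.
    """
    if n % p_rows or n % p_cols:
        raise ValueError("processor grid must divide the problem size evenly")
    return (n, n // p_rows, n // p_cols)


if __name__ == "__main__":
    n = 4096
    print(f"slab limit:   {max_procs_slab(n):,} processors")
    print(f"pencil limit: {max_procs_pencil(n):,} processors")
    # Hypothetical 128 x 256 layout of 32,768 processors
    print(f"local pencil: {pencil_local_shape(n, 128, 256)}")
```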

SDSC Enables “CASP in 3 Hours” to Speed Simulations for Drug Design (SDSC recruited, parallelized code; millions of SUs) (PI: David Baker, U. Washington; SAC staff: Dr. Ross Walker) • SDSC Blue Gene: Workhorse for the CASP 7 competition. Provided access to an order of magnitude more computing power than was available for CASP 6. Only NSF machine available that could provide a job “turnaround” (queue + runtime) of less than 1 week for all CASP targets. Test bed for “extreme scaling” modifications. Provided a development environment to successfully scale HHMI Professor Baker’s Rosetta code to over 40,000 processors using the IBM T.J. Watson Blue Gene system. • SDSC DataStar: Used for the 10% of CASP targets that required a large memory footprint. 2000+ CPU jobs possible for large structure prediction problems. • Figure: Image shows the blind prediction (blue) of a CASP7 target. Red shows the x-ray structure (released after the prediction was submitted) and green shows a low-resolution NMR structure. The prediction was performed by Ross Walker (SDSC) and Srivatsan Raman (UW) in an unprecedented 3 hours using 40,960 CPUs of the IBM T.J. Watson Blue Gene/L machine. Such a calculation was only possible because of the experience gained via the SDSC SAC collaboration.

SDSC SAC Group Improves CHARMM Scaling for Cellulase Research (PIs from TSRI, NREL, Cornell; SAC staff: Dr. Ross Walker) • Cellulase: key enzyme in the production of cellulosic ethanol. • Opportunity to reduce the USA’s dependence on foreign oil. • True molecular machine. • Simulations of 1 million+ atoms need high-performance capability computing. • DataStar is the perfect platform for this. • SDSC (Ross Walker) is working on improving the performance and scaling of the CHARMM MD code. • Improvements will ultimately benefit thousands of researchers.

NEES (PIs – Iowa St., Stanford, Princeton, U. Missouri, UC Berkeley, Davis etc.; SAC staff: Dr. Dong Ju Choi) • SDSC recruited NEES for HPC usage and wrote a successful allocation proposal • Network for Earthquake Engineering Simulation (NEES) is an NSF-funded MRE project. • Provides world-class experimental facilities, coordinated IT (NEESit), data, networking and computational support, including HPC simulation support, to the NEES community. • SDSC SAC staff is working with NEESit and NEES scientists to optimize code performance and scalability and to enable HPC for (new) NEES community users • Figure: Shake table visualization done by the SDSC Viz group – Steve Cutchin and Amit Chourasia

NEES SAC • OpenSees (core structural finite element object-oriented code for the NEES community): scalability is much improved using various parallel solver algorithms (PETSc, MUMPS, distributed SuperLU) and a different communication scheme • Recently demonstrated 2048-processor DataStar runs for 25 million elements with good scalability and single-PE performance on the Puente Hills earthquake simulation (originally the code was modeling 1 million elements on a few procs) • 13 sub-PIs (over 30 new users) are new to HPC but are using DataStar and the TG IA-64 cluster through the NEES HPC allocation as a type of community allocation • Users are using their improved code and/or existing structural/fluid codes (OpenSees, LS-Dyna, Abaqus, Ansys, Fluent etc.), which has resulted in a significant increase in HPC usage • Designed and developed a utility for parametric runs and worked with the users to successfully complete the jobs

ENZO SAC - Scaling & Optimization (PI: Mike Norman, UCSD; SAC staff: Dr. Robert Harkness) (SDSC contributes to writing alloc prop) • Non-AMR – Lyman Alpha Forest simulation – compare results of the simulation based on the concordance model of cosmology with observation to constrain cosmological parameters. • AMR1 – cluster of galaxies, x-ray emissivity – comparison with x-ray obs. • AMR2 – the AMR “light cone” simulations to support the construction of the LSST (Large Synoptic Survey Telescope) • ENZO problem sizes increased by ~8^3, cost ~8^4 in 3 years – expect a further increase of ~8^3 in 3 years • Today non-AMR grids up to 2048^3 with 8 billion dark matter particles are possible on 2048 CPUs of DataStar, compared to 256^3 grids on about 64 processors a few years ago – this is a result of the SDSC SAC effort • AMR 512^3 top-level grids with 7 levels of refinement, including 512^3 dark matter particles, generating > 350,000 subgrids (SAC effort resulted in N^2 to NlogN scaling improvement) • Shared-memory parallelism used in initial conditions generator • Massively parallel dark matter particle sort enables 100% parallel I/O • Weak scaling shows linear behavior up to 2048 CPUs • Strong scaling limited by ghost cells and boundary exchanges
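The ~8^3 and ~8^4 growth factors can be reproduced from the grid sizes quoted above (256^3 then, 2048^3 now). The sketch below assumes the extra factor of 8 in cost comes from the number of time steps growing in proportion to the linear resolution, which is a common rule of thumb for explicit grid codes but is not stated on the slide; the names are illustrative.

```python
def growth(old_n: int, new_n: int) -> dict:
    """Growth factors when a cubic grid goes from old_n^3 to new_n^3 cells."""
    per_dim = new_n // old_n
    return {
        "per_dimension": per_dim,
        "cells": per_dim ** 3,   # problem size (memory, particles)
        "cost": per_dim ** 4,    # assumes time steps also scale with resolution
    }


def cells(n: int) -> int:
    """Number of cells in an n^3 grid."""
    return n ** 3


if __name__ == "__main__":
    print(growth(256, 2048))
    print(f"2048^3 = {cells(2048):,} cells")
```

Note that 2048^3 is about 8.6 billion, consistent with the 8 billion dark matter particles quoted if there is roughly one particle per grid cell.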

ENZO – New physics and enhanced scaling • ENZO will incorporate MHD and 3D flux-limited diffusion • Advanced parallel multigrid solvers for gravity and RT • Refactoring of AMR grid hierarchy for unlimited scaling • Gadget equilibrium cooling • Unigrid scale up to 4096^3 and 8192^3 at Petascale on 16K to 64K processors • 2048^3 L6 AMR at Petascale • I/O strategies for managing multi-Petabyte results • Integrated visualization, steering and tracking • Figure: 2048^3 LAF on 2048 CPUs of DataStar (only NSF machine capable of this – 5 TB memory required)

SDSC SAC TeraShake Efforts (PIs: Tom Jordan, USC; K. Olsen, SDSU; B. Minster, SIO; SAC staff: Dr. Yifeng Cui) (SDSC helps in allocation proposal)
Before SDSC SAC involvement • Code handled a mesh of up to 56 million points • Code scaled up to 512 processors • Ran on local clusters only • No checkpoint/restart capability • Wave propagation simulation only • K. Olsen’s own code • Poor single-processor performance • Slow initialization and memory problems • MPI-I/O bugs, not scalable
After SDSC SAC efforts • Codes enhanced to handle a mesh of 8.6 billion points • Excellent speed-up to 2048 processors, achieving 1 Tflop/s • Ported to DataStar, BG/L, TG IA-64, Lemieux etc. • Added checkpoint/restart/checksum capability • Integrated dynamic rupture + wave propagation as one • Serves as SCEC Community Velocity Model • 4x speed-up of single-processor performance • 10x speed-up of initialization, and memory needs reduced • MPI-I/O improved 10x, generating 47 TB of output per run

Real World Engr Flows – PI: Mahesh Krishnan, U. Minnesota (contributed to acquiring million SUs) • Numerical methods and turbulence models that handle real-world engineering geometries without compromising the accuracy needed to reliably simulate the complicated details of turbulence • DNS of turbulent jet in cross flow: 12 million control volumes (CV), 144 DS procs • Propeller crashback: 13 million CV, 384 TG procs, Re ~480,000 • Spatially evolving turbulent round jet: today ~50 million CV (unstructured) on 1024 DataStar procs, Re ~2400; yesterday ~6.5 million CV on 160 DataStar procs, Re ~1000 • Fourier spectral code runs on Blue Gene – SDSC SAC effort ongoing for memory scaling • Figure: Simulation of flow around a propeller in sudden reversal, known as crashback. Flow is left to right; streamlines and pressure contours are shown in the cross-section. • Figure: An exact simulation, without approximations, of a turbulent jet using DNS. www.aem.umn.edu/~mahesh/forsdsc/jic_vort.avi

SDSC Enables Accurate Simulation of Sun’s Corona (PI: Chuck Goodrich, BC, Z. Mikic, SAIC) • The most true-to-life computer simulation ever made of our sun's multimillion-degree outer atmosphere, the corona, successfully predicted its actual appearance during the total solar eclipse of March 29, 2006 • The demanding calculations required four days running on more than 600 processors of the DataStar system at SDSC • Computer model based on spacecraft observations of magnetic activity • More realistic physics of how energy is transferred in the corona • PMaC (Allan Snavely, Nick Wright) group involved in scaling work • Figure: A composite of observations of the eclipse. Solar north is up. Solar Physics Group, SAIC; Williams College Eclipse Expedition with support from NSF/NASA/National Geographic, and SOHO, supported by NASA and ESA

Longest-ever Simulation of Type Ia Supernova • Alexei Khokhlov, Don Lamb – U. Chicago • The first self-consistent 3-D numerical simulation of the Type Ia supernova deflagration explosion from the moment of ignition through the active explosion phase, followed up to a period of 11 days • The current state-of-the-art multidimensional models of such astrophysical phenomena have typically followed the evolution of the system for a few tens of seconds • Post-explosion evolution of a Type Ia supernova lasts for much longer periods of time, going through various stages with different physical processes being important at different stages • On 512 DS processors - total SU usage in August was ~30,000; overall usage, including development & testing of the numerical code, was ~200,000 SUs • Figure: Flame structure in the star at 2 sec and at 77 min (color scale: 1 – totally burned; 0 – unburned or fuel)

Estimating the State of the Southern Ocean • Carl Wunsch, Matt Mazloff, MIT, ECCO Consort. (recruited this group and enabled them to get a million SUs) • “The ECCO group faces a computationally massive problem that is only feasible thanks to computing centers like the SDSC.” Matt Mazloff, MIT • Diagnosing and evaluating the state of the Southern Ocean • Global ocean circulation impacts climate change; ocean currents affect fisheries dynamics, shipping, offshore mining, sea level height change, sea surface temperatures, storm development, seasonal droughts and floods • Key adjoint method used in the MITgcm code – balanced machine, 4 GB/proc, good I/O vital • Simulations on DataStar provided an improved estimate of the Southern Ocean for the year 2000 • Received about one million SUs last March on DataStar; short-term goal is to improve the year 2000 estimate and extend it through 2003

Thank you
