
Supercomputing: Directions in Technology, Architecture and Applications


Presentation Transcript


  1. Keynote Talk to Supercomputer’98, Mannheim, Germany, June 18, 1998. Supercomputing: Directions in Technology, Architecture and Applications

  2. Supercomputing: Directions in Technology, Architecture and Applications Abstract "By using the results of the Top 500 over the last five years, one can easily trace out the complete transformation of the supercomputer industry. In 1993, none of the Top500 was made by a broadly based market driven company, while today over 3/4 of the Top500 are made by SGI, IBM, HP, or Sun. Similarly, vector architectures have been replaced in market share by microprocessor based SMPs. We now see a strong move to replace many MPPs and SMPs by the new architecture of Distributed Shared Memory (DSM) such as the SGI Origin or HP SPP series. A key trend is the move toward clusters of DSMs instead of monolithic MPPs. The next major change will be the emergence of Intel processors replacing RISC processors, particularly the Intel Merced processor which should become dominant shortly after 2000. A major battle will shape up between UNIX and Microsoft's NT operating systems, particularly at the lower end of the Top500. Finally, with each new architecture comes a new set of applications we can now attack. I will discuss how DSM will enable dynamic load balancing needed to support the multi-scale problems that teraflop machines will enable us to tackle."

  3. NCSA is the Leading Edge Site for the National Computational Science Alliance www.ncsa.uiuc.edu

  4. Scientific Applications Continue to Require Exponential Growth in Capacity (from Bob Voigt, NSF). [Chart: memory in bytes (10^8 to 10^14) against machine requirement in flops (10^8 to 10^20), distinguishing recent computations by NSF Grand Challenge research teams, next-step projections by those teams, and long-range projections from a recent applications workshop. Labeled points run from the 1995 NSF capability (Atomic/Diatomic Interaction, Molecular Dynamics for Biological Molecules) through the 2000 NSF leading edge (QCD, Computational Cosmology, a 100-year climate model in hours) to the projected NSF and ASCI capability in 2004 (Turbulent Convection in Stars).]

  5. The Promise of the Teraflop - From Thunderstorm to National-Scale Simulation Simulation by Wilhelmson, et al.; Figure from Supercomputing and the Transformation of Science, Kaufmann and Smarr, Freeman, 1993

  6. The Accelerated Strategic Computing Initiative Is Coupling DOE Defense Labs to Universities - Academic Strategic Alliances Program: Access to ASCI Leading Edge Supercomputers; Data and Visualization Corridors. http://www.llnl.gov/asci-alliances/centers.html

  7. Comparison of the DoE ASCI and the NSF PACI Origin Array Scale Through FY99. Los Alamos Origin System, FY99: 5,000-6,000 processors. NCSA Proposed System, FY99: 6x128 plus 4x64 = 1,024 processors. www.lanl.gov/projects/asci/bluemtn/Hardware/schedule.html

  8. NCSA Combines Shared Memory Programming with Massive Parallelism. Future Upgrade Under Negotiation with NSF. [Images: the CM-2 and CM-5.]

  9. The Exponential Growth of NCSA’s SGI Shared Memory Supercomputers - Doubling Every Nine Months! [Chart traces the succession of systems: Challenge, Power Challenge, Origin, SN1.]

  10. TOP500 Systems by Vendor. [Stacked chart: number of systems (0-500) from Jun-93 through Jun-98; vendors include CRI, SGI, IBM, HP, Convex, Sun, TMC, Intel, DEC, other Japanese, and other vendors.] TOP500 Reports: http://www.netlib.org/benchmark/top500.html

  11. Why NCSA Switched From Vector to RISC Processors. [Histogram of the NCSA 1992 supercomputing community: number of users versus average user MFLOPS on the Cray Y-MP4/64, March 1992 - February 1993, for users consuming more than 0.5 CPU hour; average speed was 70 MFLOPS. Vertical markers show the peak speeds of a Y-MP1 and of the MIPS R8000 microprocessor.]

  12. Replacement of Shared Memory Vector Supercomputers by Microprocessor SMPs. [Chart: Top500 installed supercomputers (0-500) by architecture class - MPP, SMP/DSM, and PVP - from Jun-93 through Jun-98.] TOP500 Reports: http://www.netlib.org/benchmark/top500.html

  13. Top500 Shared Memory Systems. [Two charts, Jun-93 through Jun-98, number of systems 0-300: SMP + DSM systems (microprocessors) and PVP systems (vector processors), each broken down by region - USA, Europe, Japan.] TOP500 Reports: http://www.netlib.org/benchmark/top500.html

  14. Simulation of the Evolution of the Universe on a Massively Parallel Supercomputer. Virgo Project - Evolving a Billion Pieces of Cold Dark Matter in a Hubble Volume on the 688-processor CRAY T3E at the Garching Computing Centre of the Max-Planck-Society. [Image scale bars: 4 billion light years, 12 billion light years.] http://www.mpg.de/universe.htm
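A run like Virgo’s advances its particles by alternately computing gravitational accelerations and updating positions and velocities. The sketch below shows only the shape of that update: it is a toy direct-summation leapfrog in C, not the Virgo code, which used tree and particle-mesh methods at vastly larger scale to avoid the O(N^2) force sum. All names, sizes, and units are illustrative.

```c
/* Toy gravitational N-body step: direct summation plus leapfrog.
 * Illustrative only; compile with -lm. */
#include <math.h>

#define N 1024                 /* particle count; Virgo evolved a billion */
static double pos[N][3], vel[N][3], acc[N][3], mass[N];

/* Pairwise gravitational accelerations with Plummer softening. */
static void compute_accelerations(void)
{
    const double G = 1.0, eps2 = 1e-6;    /* code units; softening squared */
    for (int i = 0; i < N; i++)
        acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            double d[3], r2 = eps2;
            for (int k = 0; k < 3; k++) {
                d[k] = pos[j][k] - pos[i][k];
                r2 += d[k] * d[k];
            }
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            for (int k = 0; k < 3; k++) {  /* equal and opposite pulls */
                acc[i][k] += G * mass[j] * d[k] * inv_r3;
                acc[j][k] -= G * mass[i] * d[k] * inv_r3;
            }
        }
}

/* One kick-drift-kick leapfrog step; call compute_accelerations()
 * once before the first step so acc[] is valid. */
static void leapfrog_step(double dt)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < 3; k++) {
            vel[i][k] += 0.5 * dt * acc[i][k];   /* half kick */
            pos[i][k] += dt * vel[i][k];         /* drift */
        }
    compute_accelerations();                     /* forces at new positions */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < 3; k++)
            vel[i][k] += 0.5 * dt * acc[i][k];   /* half kick */
}
```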

  15. Limitations of Uniform Grids for Complex Scientific and Engineering Problems: Gravitation Causes a Continuous Increase in Density Until There Is a Large Mass in a Single Grid Zone. 512x512x512 Run on a 512-node CM-5. Source: Greg Bryan, Mike Norman, NCSA

  16. Use of Shared Memory Adaptive Grids to Achieve Dynamic Load Balancing. 64x64x64 Run with Seven Levels of Adaptation on the SGI Power Challenge, Locally Equivalent to 8192x8192x8192 Resolution (64 x 2^7 = 8192). Source: Greg Bryan, Mike Norman, John Shalf, NCSA
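The payoff of adaptivity is that refinement follows the collapsing mass, so resolution, and therefore work, lands only where it is needed. Below is a minimal sketch of the test at the heart of such a scheme, assuming a simple overdensity threshold as the refinement criterion; the actual Bryan/Norman refinement logic is more elaborate, and every name here is illustrative.

```c
/* Sketch of a structured-AMR refinement test. Zones that collect too
 * much mass are flagged and would be covered by a child grid at twice
 * the resolution; seven such doublings take a 64-zone edge to an
 * effective 64 * 2^7 = 8192 zones locally. Illustrative only. */
#include <stdbool.h>

#define NX 64
static double density[NX][NX][NX];
static bool   needs_refine[NX][NX][NX];

/* Flag overdense zones; returns how many were flagged. */
static int flag_overdense_zones(double mean_density, double overdensity_limit)
{
    int flagged = 0;
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NX; j++)
            for (int k = 0; k < NX; k++) {
                needs_refine[i][j][k] =
                    density[i][j][k] > overdensity_limit * mean_density;
                if (needs_refine[i][j][k])
                    flagged++;   /* each flagged zone gets a 2x child grid */
            }
    return flagged;
}
```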

  17. Extreme and Large PIs Dominate Usage of the NCSA Origin, January through April 1998

  18. Disciplines Using the NCSA Origin 2000 - CPU-Hours in March 1998

  19. A Variety of Discipline Codes - Single Processor Performance, Origin vs. T3E

  20. Solving the 2D Navier-Stokes Kernel - Performance of Scalable Systems. Preconditioned Conjugate Gradient Method with a Multi-level Additive Schwarz Richardson Preconditioner (2D, 1024x1024). Source: Danesh Tafti, NCSA
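For reference, the iteration benchmarked on this slide has the skeleton below. The sketch abstracts the multi-level additive Schwarz Richardson preconditioner behind a hypothetical apply_preconditioner(), with apply_A() standing in for the discretized 2D operator; both names are assumptions for illustration, not the slide’s actual code.

```c
/* Skeleton of preconditioned conjugate gradient for A x = b.
 * apply_A() and apply_preconditioner() are hypothetical hooks:
 * the 2D Navier-Stokes operator and the Schwarz/Richardson
 * preconditioner would live behind them. */
#include <math.h>
#include <string.h>

#define N (1024 * 1024)   /* unknowns on a 1024x1024 2D grid */

extern void apply_A(const double *x, double *Ax);              /* Ax = A x */
extern void apply_preconditioner(const double *r, double *z);  /* z = M^-1 r */

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (long i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

/* x holds the initial guess on entry and the solution on exit. */
void pcg(const double *b, double *x, double tol, int max_iter)
{
    static double r[N], z[N], p[N], Ap[N];

    apply_A(x, Ap);
    for (long i = 0; i < N; i++) r[i] = b[i] - Ap[i];   /* r = b - A x */
    apply_preconditioner(r, z);
    memcpy(p, z, sizeof p);                             /* p = z */
    double rz = dot(r, z);

    for (int it = 0; it < max_iter && sqrt(dot(r, r)) > tol; it++) {
        apply_A(p, Ap);
        double alpha = rz / dot(p, Ap);                 /* step length */
        for (long i = 0; i < N; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        apply_preconditioner(r, z);
        double rz_new = dot(r, z);
        double beta = rz_new / rz;                      /* direction update */
        rz = rz_new;
        for (long i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
    }
}
```

Almost all of the per-iteration time goes into apply_A(), apply_preconditioner(), and the dot products, which is why the same kernel ports cleanly across the scalable systems being compared.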

  21. Alliance PACS Origin2000 Repository - Kadin Tseng, BU; Gary Jensen, NCSA; Chuck Swanson, SGI. John Connolly, U Kentucky, Is Developing a Repository for the HP Exemplar. http://scv.bu.edu/SCV/Origin2000/

  22. High-End Architecture 2000 - Scalable Clusters of Shared Memory Modules, Each 4 Teraflops Peak. NEC SX-5: 32 SMPs x 16 vector processors = 512 processors, 8 gigaflops peak per vector processor. IBM SP: 256 SMPs x 16 RISC processors = 4096 processors, 1 gigaflop peak per RISC processor. SGI Origin follow-on: 32 DSMs x 128 processors = 4096 processors, 1 gigaflop peak per EPIC processor. (In each case, modules times processors per module times per-processor peak gives 4 teraflops.)

  23. Emerging Portable Computing Standards • HPF • MPI • OpenMP • Hybrids of MPI and OpenMP
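The hybrid item on this list is the one that maps directly onto the clusters of shared-memory modules described above: MPI passes messages between modules while OpenMP threads share memory within one. A minimal sketch of the pattern, with illustrative names and sizes:

```c
/* Minimal hybrid MPI + OpenMP pattern: one MPI rank per SMP/DSM
 * module, OpenMP threads on the module-local data. Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define LOCAL_N 1000000   /* elements owned by each module */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double x[LOCAL_N];
    double local_sum = 0.0, global_sum;

    /* Shared memory within a module: loop-level OpenMP parallelism. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < LOCAL_N; i++) {
        x[i] = (double)(rank * LOCAL_N + i);
        local_sum += x[i] * x[i];
    }

    /* Message passing between modules: combine the partial results. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum of squares = %g\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Each rank would typically be placed on its own shared-memory module, with the OpenMP thread count matched to the processors in that module.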

  24. Basket of Applications - Average Performance as a Percentage of Linpack Performance: 22%. Application codes: CFD, Biomolecular, Chemistry, Materials, QCD, with the individual codes ranging from 14% to 33% (values shown: 25%, 14%, 19%, 33%, 26%).

  25. Harnessing Distributed UNIX Workstations - University of Wisconsin Condor Pool. [Chart: Condor cycles delivered, from CondorView.] Courtesy of Miron Livny and Todd Tannenbaum (UWisc)

  26. NT Workstation Shipments Rapidly Surpassing UNIX. Source: IDC, Wall Street Journal, 3/6/98

  27. First Scaling Testing of ZEUS-MP on CRAY T3E and Origin vs. the NT Supercluster - ZEUS-MP Hydro Code Running Under MPI. “Supercomputer performance at mail-order prices” -- Jim Gray, Microsoft. Alliance Cosmology Team; Andrew Chien, UIUC; Rob Pennington, NCSA. access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html

  28. NCSA NT Supercluster Solving the Navier-Stokes Kernel. Single Processor Performance: MIPS R10k, 117 MFLOPS; Intel Pentium II, 80 MFLOPS. Preconditioned Conjugate Gradient Method with a Multi-level Additive Schwarz Richardson Preconditioner (2D, 1024x1024). Danesh Tafti, Rob Pennington, Andrew Chien, NCSA

  29. Near Perfect Scaling of Cactus - a 3D Dynamic Solver for the Einstein GR Equations. Cactus Was Developed by Paul Walker (MPI-Potsdam, UIUC, NCSA). Ratio of GFLOPs: Origin = 2.5x NT Supercluster. Danesh Tafti, Rob Pennington, Andrew Chien, NCSA

  30. NCSA Symbio - A Distributed Object Framework Bringing Scalable Computing to NT Desktops. Parallel Computing on NT Clusters; Briand Sanderson, NCSA; Microsoft Co-Funds Development. Features: Based on Microsoft DCOM; Batch or Interactive Modes; Application Development Wizards. Current Status and Future Plans: Symbio Developer Preview 2 Released; Princeton University Testbed. http://access.ncsa.uiuc.edu/Features/Symbio/Symbio.html

  31. The Road to Merced http://developer.intel.com/solutions/archive/issue5/focus.htm#FOUR
