Capability ComputingChallenges and Payoffs Ralph Roskies Scientific Director, Pittsburgh Supercomputing Center Professor of Physics, University of Pittsburgh December 10, 2003
Simulation is becoming an increasingly indispensable tool in all areas of science. • Driven by relentless implications of Moore’s Law that has the price of equivalent computing dropping by (at least) a factor of 2 every 18 months • Simulation leads to new insights. • As computing gets stronger and the models more realistic, more and more phenomena can be effectively simulated. It sometimes becomes cheaper, faster, more accurate to simulate than to do experiments • Progress in modeling will be greatly speeded by the new ability to couple experiments to simulation • Both capacity computing and capability computing essential
Why capability computing? • Many important problems require tightly-coupled leading-edge computing capability • Real-time constraints may require the highest end capability • Weather forecasting, storm modeling • Interactive requirements
PSC Terascale Computing System • Designed a machine its its operation to facilitate the highest capability computations • At its introduction (Oct 2001) was number 3 most powerful machine in the world. (now 12).
Challenges of Capability Computing • Technical • Machine bottlenecks • Reliability • Power and Space Needs • Operational • Scheduling • Maintenance • User support • Cultural • Users • Vendor • Political • Concentrating resources justifiable only if results otherwise unobtainable
Technical: Machine bottlenecks • Processor performance (usually memory bandwidth) • Peak flops or Linpack not the measure • Commodity processors required by fiscal considerations • Memory size • Global shared memory not very important • At least one GB/processor • Interprocessor communication • Essential for scaling large problems • Want low latency, high bandwidth and redundancy
HPC also means massive data handling, data repositories, and visualization • Input/Output • Take advantage of parallel IO from each processor • Major demands come from snapshots and checkpointing • Wrote optimized routines using underlying Quadrics capabilities to speed IO and file transfer • Coupled visualization sector (Linux Intel PCs with Nvidia cards) into Terascale system over Quadrics switch • Designed cost-effective global disk system linked to HSM • High-speed networking
Quadrics Control LAN Compute Nodes Interactive File Servers /tmp /home WAN/LAN Viz Mass Store WAN/LAN WAN/LAN Archive buffer Terascale Computing System Summary • 750 Compute Nodes • 3000 Alpha processors • 6 Tf peak • 3 TB memory • 40 TB local disk • Multi-rail fat-tree network • Redundant monitor/ctrl • WAN/LAN accessible • Parallel visualization • File servers: 30TB, 32 GB/s • Mass store, ~1 TB/hr
Technical: Reliability • 750 servers. If each has one failure a year, this system fails twice a day. Most calculations take longer than that. • Solution is redundancy where feasible • Spares • Checkpoint/restart capability • Vendor has no way to test and validate software updates on a system this size. • Solution is cooperative effort to validate code right on our machine.
Operational: Scheduling • Had been doing a drain at 8pm every night • Costs about 5% in throughput • Experimenting with continuous drain • Reservations for real-time work, large scale debugging
Operational: Maintenance • Rapid vendor response to failure not the thing to focus on • Highly instrumented dark machine room • Spares • “Bring out your dead”
Operational: User support • Legacy codes can’t just be scaled up • (“we’re not computer scientists- we just want to get our work done”) • For performance, codes designed for tens of processors have to be rethought and rewritten • Ratio of computation to communication changes • Scaling may require new algorithms • Load balancing must be done dynamically • May have to change libraries Solution is to make PSC consultants de-facto members of the research group. Work very closely with users. “Large calculations have the flavor of big experiments. You need someone monitoring, scheduling, facilitating.”
Consultant contributions • Optimize code • Optimize IO (e.g. aggregating messages) • Internal advocates • With systems group to facilitate scheduling, special requests e.g. larger temporary disk assignment • To vendor (PSC is the customer) • Workshops on optimization, scaling, load balancing
Lessons from scaling workshop • Control granularity; Virtualize • Define problem in terms of a large number of small objects greater than the number of processors • Let the system map objects to processors. Time consuming objects can be broken down into shorter ones, which allows better load balancing. • Incorporate latency tolerance • Overlap communication with computation • If multiple objects on one processor are sending messages to another, aggregate them • If messages trigger computation, pipeline them to initiate computation earlier • Don’t wait-speculate, pre-fetch
Lessons from scaling workshop • Reduce dependency on synchronization • Regular communications often rely on synchronization • Heterogeneity exacerbates problem • Maintain per-process load • Requires distributed monitoring capabilities • Let the system map objects to processes • Use optimized libraries (e.g. ATLAS) • Develop performance models (machine profiles; application signatures) to anticipate bottlenecks Only new aspect is the degree to which these things matter
Case Study:NAMD Scalable Molecular Dynamics • Three-dimensional object-oriented code • Message-driven execution capability • Asynchronous communications
Some scaling successes at PSC • NAMD now scales to 3000 processors, > 1Tf sustained • Earthquake simulation code, 2048 processors, 87% parallel efficiency. • ‘Real-time’ tele-immersion code scales to 1536 processors • Increased scaling of the Car-Parrinello Ab-Initio Molecular Dynamics (CPAIMD) code from its previous limit of 128 processors (for 128 states) to 1536 processors.
Payoffs- Insight into important real-life problems • Insights • Structure to function of biomolecules • Increased realism to confront experimental data • Earthquakes and design of buildings • QCD • Novel uses of HPC • Teleimmersion • Internet simulation
How Aquaporins Work (Schulten group, University of Illinois) • Aquaporins are proteins which conduct large volumes of water through cell walls while filtering out charged particles like hydrogen ions. • Start with known crystal structure, simulate 12 nanoseconds of molecular dynamics of over 100,000 atoms, using NAMD
Aquaporin mechanism Water moves through aquaporin channels in single file. Oxygen leads the way in. At the most constricted point of channel, water molecule flips. Protonscan’t do this. Animation pointed to by 2003 Nobel chemistry prize announcement
High Resolution Forward and Inverse Earthquake Modeling on Terascale Computers Volkan Akcelik, Jacobo Bielak, Ioannis Epanomeritakis Antonio Fernandez, Omar Ghattas, Eui Joong Kim Julio Lopez, David O'Hallaron, Tiankai Tu Carnegie Mellon University George Biros University of Pennsylvania John Urbanic Pittsburgh Supercomputing Center
Complexity of earthquake ground motion simulation • Multiple spatial scales • wavelengths vary from O(10m) to O(1000m) • Basin/source dimensions are O(100km) • Multiple temporal scales • O(0.01s) to resolve highest frequencies of source • O(10s) to resolve of shaking within the basin • So need unstructured grids even though good parallel performance harder to achieve • Highly irregular basin geometry • Highly heterogeneous soils material properties • Geology and source parameters observable only indirectly
Performance of forward earthquake modeling code on PSC Terascale system Largest simulation • 28 Oct 2001 Compton aftershock in Greater LA Basin • maximum resolved frequency: 1.85Hz • 100m/s min shear wave velocity • physical size: 100x100x37.5 km3 • # of elements: 899,591,066 • # of grid points: 1,023,371,641 • # of slaves: 125,726,862 • 25 sec wallclock/time step on 1024 PEs • 65 Gb input lemieux at PSC
Role of PSC Assistance in • Optimization • Efficient IO of terabyte size datasets • Expediting scheduling • Visualization
Inverse problem: Use records of past seismic events to improve velocity model S. CA significant earthquakes since 1812 Seismometer locations and intensity map for Northridge earthquake
Major recognition This entire effort won Gordon Bell prize for special achievement, 2003, the premier prize for outstanding computations in HPC. Given to the entry that utilizes innovative techniques to demonstrate the most dramatic gain in sustained performance for an important class of real-world application.
QCD • Increased realism to confront experimental data • QCD – compelling evidence for the need to include quark virtual degrees of freedom • Improvements due to continued algorithmic development, access to major platforms and sustained effort over decades
Tele-immersion (real time) Henry Fuchs, U. of North Carolina can process 6 frames/sec (640 x 480) from 10 camera triplets using 1800 processors.
Simulating Network traffic(almost real time) George Riley et al (Georgia Tech) • Simulating networks with > 5M elements. • modeled 106M packet transmissions in one second of wall clock time, using 1500 processors • Near real time web traffic simulation • Empirical HTTP Traffic model [Mah, Infocom ‘97] • 1.1M nodes, 1.0M web browsers, 20.5M TCP Connections • 541 seconds of wall clock time on 512 processors to simulate 300 seconds of network operation • Fastest detailed computer simulations of computer networks ever constructed
Where are grids in all this? • Grids aimed primarily at: • Availability- computing on demand • Reduce influence effect of geographic distance • Make services more transparent • Motivated by remote data, on-line instruments, sensors, as well as computers • They also contribute to the highest end by aggregating resources. “ The emerging vision is to use cyberinfrastructure to build more ubiquitous, comprehensive digital environments that become interactive and functionally complete for research communities in terms of people, data, information, tools, and instruments and that operate at unprecedented levels of computational, storage, and data transfer capacity.” NSF Blue Ribbon Panel on Cyberinfrastructure
DTF (2001) IA 64 clusters at 4 sites 10 Gb/s point to point links Can deliver 30 Gb/s between 2 sites Caltech ANL LA Chicago SDSC NCSA Physical Topology (Full Mesh)
Extensible Terascale Facility (2002) • Make network scalable, so introduce hubs • Allow heterogeneous architecture, and retain interoperability • First step is integration of PSC’s TCS machine • Many more computer science interoperability issues 3 new sites approved in 2003 (Texas, Oak Ridge, Indiana)
Examples of Science Drivers • GriPhyn - Particle physics- Large Hadron Collider at CERN • Overwhelming amount of data for analysis (>1 PB/year) • Find rare events resulting from the decays of massive new particles in a dominating background • Need new services to support world-wide data access and remote collaboration for coordinated management of distributed computation and data without centralized control
Examples of Science Drivers • NVO- National Virtual Observatory • Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogues, in different wavebands, from gamma- and X-rays, optical, infrared, through radio. • Soon it will be easier to "dial-up" a part of the sky than wait many months to access a telescope. • Need multi-terabyte on-line databases interoperating seamlessly, interlinked catalogues, sophisticated query engines • research results from on-line data will be just as rich as that from "real" telescopes
UK – Teragrid HPC-Grid Experiment TeraGyroid: Lattice-Boltzmann simulations of defect dynamics in amphiphilic liquid crystals • Peter Coveney (University College London), • Richard Blake (Daresbury Lab) • Stephen Pickles (Manchester). • Bruce Boghosian (Tufts) ANL
Project Partners Reality Grid partners: • University College London (Application, Visualisation, Networking) • University of Manchester (Application, Visualisation, Networking) • Edinburgh Parallel Computing Centre (Application) • Tufts University (Application) UK High-End Computing Services - HPCx run by the University of Edinburgh and CCLRC Daresbury Laboratory (Compute, Networking, Coordination) - CSAR run by the University of Manchester and CSC (Compute and Visualisation) • Teragrid sites at: • Argonne National Laboratory (Visualization, Networking) • National Center for Supercomputing Applications (Compute) • Pittsburgh Supercomputing Center (Compute, Visualisation) • San Diego Supercomputer Center (Compute)
Project explanation • Amphiphiles are chemicals with hydrophobic (water-avoiding) tails and hydrophilic (water attracting) heads. When dispersed in solvents or oil/water mixtures, self assemble into complex shapes; some (gyroids) are of particular interest in biology. • Shapes depend on parameters like • abundance and initial distribution of each component • the strength of the surfactant-surfactant coupling, • Desired structures can sometimes only be seen in very large systems. E.g. smaller region form gyroids in different directions and how they then interact is of major significance. • Project goal is to study defect pathways and dynamics in gyroid self-assembly
Edinburgh Glasgow Newcastle Belfast Manchester DL Cambridge Oxford RAL Cardiff London Southampton Networking Netherlight Amsterdam Teragrid BT provision UK
Distribution of function • Computations run at HPCx, CSAR, SDSC, PSC and NCSA. (7 TB memory - 5K processors in integrated resource) One Gigabit of LB3D data is generated per simulation time-step. • Visualisation run at Manchester/ UCL/ Argonne • Scientists steering calculations from UCL and Boston over Access Grid. Steering requires reliable near-real time data transport across the Grid to visualization engines. • Visualisation output and collaborations multicast to SC03 Phoenix and visualised on the show floor in the University of Manchester booth
Exploring parameter spacethrough computational steering Cubic micellar phase, low surfactant density gradient. Cubic micellar phase, high surfactant density gradient. Initial condition: Random water/ surfactant mixture. Self-assembly starts. Lamellar phase: surfactant bilayers between water layers. Lamellar phase: surfactant bilayers between water layers. Rewind and restart from checkpoint.
Results • Linking these resources allowed computation of the largest set of lattice-Boltzmann (LB) simulations ever performed, involving lattices of over one billion sites.
How do upcoming developments deal with the major technical issues • Memory bandwidth • Old Crays- 2loads and a store/clock= 12B/flop • TCS, better than most commodity processors 1 B/flop • Earth Simulator 4 B/flop • Power • TCS, ~700 kW • Earth Simulator, ~4 MW • Space • TCS, ~2500 sq feet • ASCI Q New machine room of ~40,000 sq feet • Earth Simulator, 3250 sq meters • Reliability
Short term responses • Livermore, BlueGene/L • Sandia, Red Storm
BlueGene/L (Livermore) • System on a chip • IBM powerPC with reduced clock (700 Mhz) for lower power consumption • 2 processor/node each 2.8 GF peak • 256 MB/node (small, but allows up to 2GB/node) • Memory on chip, to increase memory bandwidth to 2Bytes/flop • Communications processor on chip, speeds interprocessor communication (175MB/s/link) • Total 360 Tf, 65536 nodes in 3D torus • Total power 1MW • floor space 2500 sq ft • very fault tolerant (expect 1 failure/week)
BlueGene/L Science • Protein folding (molecular dynamics needs small memory and large floating point capability) • Materials science, (again molecular dyanmics)