Capability Computing Challenges and Payoffs Ralph Roskies Scientific Director, Pittsburgh Supercomputing Center Professor of Physics, University of Pittsburgh December 10, 2003 Simulation is becoming an increasingly indispensable tool in all areas of science.

Capability Computing Challenges and Payoffs

Ralph Roskies

Scientific Director, Pittsburgh Supercomputing Center

Professor of Physics, University of Pittsburgh

December 10, 2003

Simulation is becoming an increasingly indispensable tool in all areas of science.
  • Driven by the relentless implications of Moore’s Law, which has the price of equivalent computing dropping by (at least) a factor of 2 every 18 months
  • Simulation leads to new insights.
  • As computing gets stronger and the models more realistic, more and more phenomena can be effectively simulated. It sometimes becomes cheaper, faster, and more accurate to simulate than to do experiments.
  • Progress in modeling will be greatly accelerated by the new ability to couple experiments to simulation
  • Both capacity computing and capability computing are essential
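
As a rough illustration of the Moore's Law bullet above, the sketch below compounds the slide's figure (a factor of 2 every 18 months) over several years. The function name and the sample horizons are illustrative choices, not taken from the talk.

```python
def price_drop_factor(years, halving_period_years=1.5):
    """Factor by which the price of equivalent computing falls after `years`,
    assuming it halves every `halving_period_years` (18 months on the slide)."""
    return 2.0 ** (years / halving_period_years)


# Sample horizons chosen only for illustration.
for years in (3, 6, 15):
    print(f"{years:>2} years: about {price_drop_factor(years):,.0f}x cheaper")
```

Because the slide says "at least" a factor of 2, these values are best read as lower bounds.
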
Why capability computing?
  • Many important problems require tightly-coupled leading-edge computing capability
  • Real-time constraints may require the highest end capability
    • Weather forecasting, storm modeling
    • Interactive requirements
PSC Terascale Computing System
  • Designed the machine and its operation to facilitate the highest capability computations
  • At its introduction (Oct 2001), it was the third most powerful machine in the world (now 12th).
Challenges of Capability Computing
  • Technical
    • Machine bottlenecks
    • Reliability
    • Power and Space Needs
  • Operational
    • Scheduling
    • Maintenance
    • User support
  • Cultural
    • Users
    • Vendor
  • Political
    • Concentrating resources justifiable only if results otherwise unobtainable
Technical: Machine bottlenecks
  • Processor performance (usually memory bandwidth)
    • Peak flops or Linpack not the measure
    • Commodity processors required by fiscal considerations
  • Memory size
    • Global shared memory not very important
    • At least one GB/processor
  • Interprocessor communication
    • Essential for scaling large problems
    • Want low latency, high bandwidth and redundancy
HPC also means massive data handling, data repositories, and visualization
  • Input/Output
    • Take advantage of parallel IO from each processor
    • Major demands come from snapshots and checkpointing
    • Wrote optimized routines using underlying Quadrics capabilities to speed IO and file transfer
  • Coupled visualization sector (Linux Intel PCs with Nvidia cards) into Terascale system over Quadrics switch
  • Designed cost-effective global disk system linked to HSM
  • High-speed networking
Terascale Computing System

[Diagram: Compute Nodes, File Servers, Mass Store]
  • 750 Compute Nodes
  • 3000 Alpha processors
  • 6 Tf peak
  • 3 TB memory
  • 40 TB local disk
  • Multi-rail fat-tree network
  • Redundant monitor/ctrl
  • WAN/LAN accessible
  • Parallel visualization
  • File servers: 30TB, 32 GB/s
  • Mass store, ~1 TB/hr
Technical: Reliability
  • 750 servers. If each has one failure a year, this system fails twice a day. Most calculations take longer than that.
    • Solution is redundancy where feasible
    • Spares
    • Checkpoint/restart capability
  • Vendor has no way to test and validate software updates on a system this size.
    • Solution is cooperative effort to validate code right on our machine.
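
The "fails twice a day" figure follows directly from the slide's premise. The sketch below reproduces the arithmetic, assuming node failures are independent and evenly spread over the year (an assumption made here, not stated in the talk). Function names are illustrative.

```python
NODES = 750                      # servers in the system (from the slide)
FAILURES_PER_NODE_PER_YEAR = 1   # the slide's "one failure a year" premise


def system_failures_per_day(nodes, failures_per_node_per_year):
    """Expected failures per day across the whole system,
    assuming independent, evenly spread node failures."""
    return nodes * failures_per_node_per_year / 365.0


def mean_hours_between_failures(nodes, failures_per_node_per_year):
    """Average time between failures somewhere in the system, in hours."""
    return 24.0 / system_failures_per_day(nodes, failures_per_node_per_year)


print(f"failures per day: {system_failures_per_day(NODES, FAILURES_PER_NODE_PER_YEAR):.2f}")
print(f"hours between failures: {mean_hours_between_failures(NODES, FAILURES_PER_NODE_PER_YEAR):.1f}")
```

Any job that runs longer than roughly half a day should therefore expect at least one failure, which is why the slide pairs spares with checkpoint/restart.
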
Operational: Scheduling
  • Had been doing a drain at 8pm every night
  • Costs about 5% in throughput
  • Experimenting with continuous drain
  • Reservations for real-time work, large scale debugging
Operational: Maintenance
  • Rapid vendor response to failure not the thing to focus on
  • Highly instrumented dark machine room
  • Spares
  • “Bring out your dead”
Operational: User support
  • Legacy codes can’t just be scaled up
    • (“we’re not computer scientists- we just want to get our work done”)
  • For performance, codes designed for tens of processors have to be rethought and rewritten
    • Ratio of computation to communication changes
    • Scaling may require new algorithms
    • Load balancing must be done dynamically
    • May have to change libraries

Solution is to make PSC consultants de facto members of the research group. Work very closely with users.

“Large calculations have the flavor of big experiments. You need someone monitoring, scheduling, facilitating.”

Consultant contributions
  • Optimize code
  • Optimize IO (e.g. aggregating messages)
  • Internal advocates
    • With systems group to facilitate scheduling, special requests e.g. larger temporary disk assignment
    • To vendor (PSC is the customer)
  • Workshops on optimization, scaling, load balancing
Lessons from scaling workshop
  • Control granularity; Virtualize
    • Define the problem in terms of a large number of small objects, many more than the number of processors
    • Let the system map objects to processors. Time consuming objects can be broken down into shorter ones, which allows better load balancing.
  • Incorporate latency tolerance
    • Overlap communication with computation
    • If multiple objects on one processor are sending messages to another, aggregate them
    • If messages trigger computation, pipeline them to initiate computation earlier
    • Don’t wait: speculate, pre-fetch
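
The "control granularity; virtualize" advice can be sketched with a toy example. The code below is not the runtime used at PSC or in NAMD. It is a minimal greedy mapper (largest object first, onto the least-loaded processor) with hypothetical object costs, showing how splitting work into more objects than processors lets the system even out a hot spot.

```python
import heapq


def greedy_map(object_costs, num_procs):
    """Place objects, largest first, on whichever processor is least loaded.
    Returns the resulting per-processor loads, ordered by processor id."""
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    for cost in sorted(object_costs, reverse=True):
        load, p = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, p))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


def imbalance(loads):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


NUM_PROCS = 4
# Hypothetical costs: one object per processor, one of them a hot spot.
coarse = [1.0, 1.0, 1.0, 5.0]
# Same work, but every object is split into 8 equal pieces.
fine = [c / 8 for c in coarse for _ in range(8)]

print("coarse imbalance:", imbalance(greedy_map(coarse, NUM_PROCS)))
print("fine imbalance:  ", imbalance(greedy_map(fine, NUM_PROCS)))
```

With one object per processor, the hot spot sets the runtime. With finer objects, the mapper can spread the hot spot's pieces across all processors.
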
Lessons from scaling workshop (continued)
  • Reduce dependency on synchronization
    • Regular communications often rely on synchronization
    • Heterogeneity exacerbates problem
  • Maintain per-process load
    • Requires distributed monitoring capabilities
    • Let the system map objects to processes
  • Use optimized libraries (e.g. ATLAS)
  • Develop performance models (machine profiles; application signatures) to anticipate bottlenecks

Only new aspect is the degree to which these things matter
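
A performance model of the kind recommended above can be very small. The sketch below uses the common latency-plus-bandwidth cost for a message (time = latency + size / bandwidth) to show why aggregating small messages helps. The latency and bandwidth values are hypothetical placeholders, not measurements of any machine in this talk.

```python
def send_time(num_messages, bytes_each, latency_s, bandwidth_bytes_per_s):
    """Time to send messages one after another, each costing
    latency + size / bandwidth."""
    return num_messages * (latency_s + bytes_each / bandwidth_bytes_per_s)


LATENCY_S = 5e-6      # hypothetical per-message latency (5 microseconds)
BANDWIDTH = 250e6     # hypothetical link bandwidth (250 MB/s)

# 100 separate 1000-byte messages versus one aggregated 100,000-byte message.
separate = send_time(100, 1000, LATENCY_S, BANDWIDTH)
aggregated = send_time(1, 100 * 1000, LATENCY_S, BANDWIDTH)

print(f"separate:   {separate * 1e6:.0f} microseconds")
print(f"aggregated: {aggregated * 1e6:.0f} microseconds")
```

The bytes moved are identical in both cases. Only the number of times the latency is paid changes, which is why the gain from aggregation grows as messages get smaller.
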

Case Study: NAMD Scalable Molecular Dynamics
  • Three-dimensional object-oriented code
  • Message-driven execution capability
  • Asynchronous communications
Some scaling successes at PSC
  • NAMD now scales to 3000 processors, > 1Tf sustained
  • Earthquake simulation code, 2048 processors, 87% parallel efficiency.
  • ‘Real-time’ tele-immersion code scales to 1536 processors
  • Increased scaling of the Car-Parrinello Ab-Initio Molecular Dynamics (CPAIMD) code from its previous limit of 128 processors (for 128 states) to 1536 processors.
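
For readers unfamiliar with the term, parallel efficiency is the achieved speedup divided by the processor count. The slide does not say what baseline the 87% was measured against, so the sketch below only shows the definition and what 87% on 2048 processors would imply relative to that unstated baseline. Function names are illustrative.

```python
def parallel_efficiency(t_ref, p_ref, t_p, p):
    """Efficiency of a run taking t_p on p processors, relative to a
    reference run taking t_ref on p_ref processors."""
    return (t_ref * p_ref) / (t_p * p)


def effective_speedup(efficiency, p):
    """Speedup over the reference implied by a given efficiency."""
    return efficiency * p


print(f"87% on 2048 processors is worth about "
      f"{effective_speedup(0.87, 2048):.0f} ideal processors")
```
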
Payoffs: Insight into important real-life problems
  • Insights
    • Structure to function of biomolecules
  • Increased realism to confront experimental data
    • Earthquakes and design of buildings
    • QCD
  • Novel uses of HPC
    • Teleimmersion
    • Internet simulation
How Aquaporins Work (Schulten group, University of Illinois)
  • Aquaporins are proteins which conduct large volumes of water through cell membranes while filtering out charged particles like hydrogen ions.
  • Start with known crystal structure, simulate 12 nanoseconds of molecular dynamics of over 100,000 atoms, using NAMD

Aquaporin mechanism

Water moves through aquaporin channels in single file. Oxygen leads the way in. At the most constricted point of the channel, the water molecule flips. Protons can’t do this.

Animation pointed to by 2003 Nobel chemistry prize announcement

High Resolution Forward and Inverse Earthquake Modeling on Terascale Computers

Volkan Akcelik, Jacobo Bielak, Ioannis Epanomeritakis

Antonio Fernandez, Omar Ghattas, Eui Joong Kim

Julio Lopez, David O'Hallaron, Tiankai Tu

Carnegie Mellon University

George Biros

University of Pennsylvania

John Urbanic

Pittsburgh Supercomputing Center

Complexity of earthquake ground motion simulation
  • Multiple spatial scales
    • wavelengths vary from O(10m) to O(1000m)
    • Basin/source dimensions are O(100km)
  • Multiple temporal scales
    • O(0.01s) to resolve highest frequencies of source
    • O(10s) to resolve shaking within the basin
  • So need unstructured grids, even though good parallel performance is harder to achieve
  • Highly irregular basin geometry
  • Highly heterogeneous soil material properties
  • Geology and source parameters observable only indirectly

Performance of forward earthquake modeling code on PSC Terascale system

Largest simulation

  • 28 Oct 2001 Compton aftershock in Greater LA Basin
  • maximum resolved frequency: 1.85 Hz
  • 100 m/s min shear wave velocity
  • physical size: 100 x 100 x 37.5 km^3
  • # of elements: 899,591,066
  • # of grid points: 1,023,371,641
  • # of slaves: 125,726,862
  • 25 sec wallclock/time step on 1024 PEs
  • 65 Gb input

lemieux at PSC
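
The figures above hint at why the earlier slide insists on unstructured grids. The sketch below computes the shortest wavelength from the quoted minimum shear velocity and maximum frequency, then estimates how many points a uniform grid would need if that resolution were applied everywhere. The rule of 10 grid points per wavelength is an assumption made here for illustration; the talk does not state the resolution criterion that was used.

```python
MIN_SHEAR_VELOCITY_M_S = 100.0        # from the slide
MAX_FREQUENCY_HZ = 1.85               # from the slide
DOMAIN_M = (100e3, 100e3, 37.5e3)     # 100 x 100 x 37.5 km, from the slide
ACTUAL_GRID_POINTS = 1_023_371_641    # from the slide
POINTS_PER_WAVELENGTH = 10            # ASSUMPTION, not stated in the talk


def shortest_wavelength(velocity_m_s, frequency_hz):
    """Wavelength of the highest resolved frequency in the softest material."""
    return velocity_m_s / frequency_hz


def uniform_grid_points(domain_m, spacing_m):
    """Point count for a regular grid with the same spacing everywhere."""
    points = 1.0
    for extent in domain_m:
        points *= extent / spacing_m
    return points


wavelength = shortest_wavelength(MIN_SHEAR_VELOCITY_M_S, MAX_FREQUENCY_HZ)
spacing = wavelength / POINTS_PER_WAVELENGTH
uniform = uniform_grid_points(DOMAIN_M, spacing)

print(f"shortest wavelength: {wavelength:.1f} m")
print(f"uniform grid points: {uniform:.2e}")
print(f"ratio to actual mesh: {uniform / ACTUAL_GRID_POINTS:.0f}x")
```

Under that assumption, a uniform grid would need on the order of a thousand times more points than the mesh actually used, since fine resolution is only needed where the soil is soft.
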

Role of PSC

Assistance in

  • Optimization
  • Efficient IO of terabyte size datasets
  • Expediting scheduling
  • Visualization
Inverse problem: Use records of past seismic events to improve velocity model

S. CA significant earthquakes since 1812

Seismometer locations and intensity map for Northridge earthquake

Major recognition

This entire effort won the Gordon Bell prize for special achievement, 2003, the premier prize for outstanding computations in HPC. The prize is given to the entry that uses innovative techniques to demonstrate the most dramatic gain in sustained performance for an important class of real-world applications.

  • Increased realism to confront experimental data
    • QCD – compelling evidence for the need to include quark virtual degrees of freedom
    • Improvements due to continued algorithmic development, access to major platforms and sustained effort over decades
Tele-immersion (real time)

Henry Fuchs, U. of North Carolina

Can process 6 frames/sec (640 x 480) from 10 camera triplets using 1800 processors.

Simulating Network Traffic (almost real time)

George Riley et al (Georgia Tech)

  • Simulating networks with > 5M elements.
    • modeled 106M packet transmissions in one second of wall clock time, using 1500 processors
  • Near real time web traffic simulation
    • Empirical HTTP Traffic model [Mah, Infocom ‘97]
    • 1.1M nodes, 1.0M web browsers, 20.5M TCP Connections
    • 541 seconds of wall clock time on 512 processors to simulate 300 seconds of network operation
  • Fastest detailed computer simulations of computer networks ever constructed
Where are grids in all this?
  • Grids aimed primarily at:
    • Availability- computing on demand
    • Reduce the effect of geographic distance
    • Make services more transparent
  • Motivated by remote data, on-line instruments, sensors, as well as computers
  • They also contribute to the highest end by aggregating resources.

“The emerging vision is to use cyberinfrastructure to build more ubiquitous, comprehensive digital environments that become interactive and functionally complete for research communities in terms of people, data, information, tools, and instruments and that operate at unprecedented levels of computational, storage, and data transfer capacity.”

NSF Blue Ribbon Panel on Cyberinfrastructure

DTF (2001)

IA-64 clusters at 4 sites

10 Gb/s point to point links

Can deliver 30 Gb/s between 2 sites

Physical Topology (Full Mesh)

Extensible Terascale Facility (2002)
  • Make network scalable, so introduce hubs
  • Allow heterogeneous architecture, and retain interoperability
  • First step is integration of PSC’s TCS machine
  • Many more computer science interoperability issues

3 new sites approved in 2003 (Texas, Oak Ridge, Indiana)

Examples of Science Drivers
  • GriPhyN - Particle physics - Large Hadron Collider at CERN
    • Overwhelming amount of data for analysis (>1 PB/year)
    • Find rare events resulting from the decays of massive new particles in a dominating background
    • Need new services to support world-wide data access and remote collaboration for coordinated management of distributed computation and data without centralized control
Examples of Science Drivers (continued)
  • NVO - National Virtual Observatory
    • Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogues, in different wavebands, from gamma- and X-rays, optical, infrared, through radio.
    • Soon it will be easier to "dial-up" a part of the sky than wait many months to access a telescope.
    • Need multi-terabyte on-line databases interoperating seamlessly, interlinked catalogues, sophisticated query engines
  • Research results from on-line data will be just as rich as those from "real" telescopes

UK – TeraGrid HPC-Grid Experiment

TeraGyroid: Lattice-Boltzmann simulations of defect dynamics in amphiphilic liquid crystals

  • Peter Coveney (University College London)
  • Richard Blake (Daresbury Lab)
  • Stephen Pickles (Manchester)
  • Bruce Boghosian (Tufts)


Project Partners

RealityGrid partners:

  • University College London (Application, Visualisation, Networking)
  • University of Manchester (Application, Visualisation, Networking)
  • Edinburgh Parallel Computing Centre (Application)
  • Tufts University (Application)

UK High-End Computing Services:

  • HPCx run by the University of Edinburgh and CCLRC Daresbury Laboratory (Compute, Networking, Coordination)
  • CSAR run by the University of Manchester and CSC (Compute and Visualisation)

  • TeraGrid sites at:
    • Argonne National Laboratory (Visualization, Networking)
    • National Center for Supercomputing Applications (Compute)
    • Pittsburgh Supercomputing Center (Compute, Visualisation)
    • San Diego Supercomputer Center (Compute)
Project explanation
  • Amphiphiles are chemicals with hydrophobic (water-avoiding) tails and hydrophilic (water-attracting) heads. When dispersed in solvents or oil/water mixtures, they self-assemble into complex shapes; some (gyroids) are of particular interest in biology.
  • Shapes depend on parameters like
    • abundance and initial distribution of each component
    • the strength of the surfactant-surfactant coupling
  • Desired structures can sometimes only be seen in very large systems. E.g. smaller regions form gyroids in different directions, and how they then interact is of major significance.
  • Project goal is to study defect pathways and dynamics in gyroid self-assembly

[Network diagram; surviving label: BT provision]
Distribution of function
  • Computations run at HPCx, CSAR, SDSC, PSC and NCSA (7 TB memory, 5K processors in integrated resource). One gigabit of LB3D data is generated per simulation time-step.
  • Visualisation run at Manchester/ UCL/ Argonne
  • Scientists steering calculations from UCL and Boston over Access Grid. Steering requires reliable near-real time data transport across the Grid to visualization engines.
  • Visualisation output and collaborations multicast to SC03 Phoenix and visualised on the show floor in the University of Manchester booth
Exploring parameter space through computational steering

Cubic micellar phase, low surfactant density gradient.

Cubic micellar phase, high surfactant density gradient.

Initial condition: Random water/ surfactant mixture.

Self-assembly starts.

Lamellar phase: surfactant bilayers between water layers.

Lamellar phase: surfactant bilayers between water layers.

Rewind and restart from checkpoint.

  • Linking these resources allowed computation of the largest set of lattice-Boltzmann (LB) simulations ever performed, involving lattices of over one billion sites.
How do upcoming developments deal with the major technical issues
  • Memory bandwidth
    • Old Crays: 2 loads and a store per clock = 12 B/flop
    • TCS (better than most commodity processors): 1 B/flop
    • Earth Simulator: 4 B/flop
  • Power
    • TCS, ~700 kW
    • Earth Simulator, ~4 MW
  • Space
    • TCS, ~2500 sq feet
    • ASCI Q: new machine room of ~40,000 sq feet
    • Earth Simulator, 3250 sq meters
  • Reliability
Short term responses
  • Livermore, BlueGene/L
  • Sandia, Red Storm
BlueGene/L (Livermore)
  • System on a chip
    • IBM PowerPC with reduced clock (700 MHz) for lower power consumption
      • 2 processors/node, each 2.8 GF peak
      • 256 MB/node (small, but allows up to 2 GB/node)
    • Memory on chip, to increase memory bandwidth to 2 bytes/flop
    • Communications processor on chip, speeds interprocessor communication (175 MB/s/link)
    • Total 360 Tf, 65536 nodes in 3D torus
    • Total power 1MW
    • floor space 2500 sq ft
    • very fault tolerant (expect 1 failure/week)
BlueGene/L Science
  • Protein folding (molecular dynamics needs small memory and large floating point capability)
  • Materials science (again molecular dynamics)
Red Storm (Sandia)
  • Inspired by the T3E- a true MPP
  • Opteron chip from AMD,
    • 2 GHz clock, 4 Gflop/s, 1.3 B/flop memory bandwidth
  • High-bandwidth proprietary interconnect (from Cray)
    • bandwidth of 6.4 GB/sec, as good as local memory
  • 10,000 cpus, 3-d torus
  • 40 Tf peak, 10TB memory
  • <2MW, < 3000 sq ft
  • Much emphasis on RAS (Reliability, Availability, Serviceability)
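
The peak and power figures quoted for TCS, BlueGene/L and Red Storm can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from these slides. Red Storm is taken as one processor per node at the 2 MW upper bound, so its efficiency figure is a lower bound. Function names are illustrative.

```python
def peak_tflops(nodes, procs_per_node, gflops_per_proc):
    """Aggregate peak in Tflop/s from per-processor peak."""
    return nodes * procs_per_node * gflops_per_proc / 1000.0


def gflops_per_kw(peak_tf, power_kw):
    """Peak Gflop/s delivered per kilowatt of power."""
    return peak_tf * 1000.0 / power_kw


bluegene_peak = peak_tflops(65536, 2, 2.8)   # slide quotes "Total 360 Tf"
red_storm_peak = peak_tflops(10000, 1, 4.0)  # slide quotes "40 Tf peak"

print(f"BlueGene/L peak: {bluegene_peak:.0f} Tf")
print(f"Red Storm peak:  {red_storm_peak:.0f} Tf")
print(f"TCS:        {gflops_per_kw(6.0, 700.0):6.1f} Gflop/s per kW")
print(f"BlueGene/L: {gflops_per_kw(360.0, 1000.0):6.1f} Gflop/s per kW")
print(f"Red Storm:  {gflops_per_kw(40.0, 2000.0):6.1f} Gflop/s per kW (lower bound)")
```

The per-processor figures give about 367 Tf for BlueGene/L, consistent with the rounded 360 Tf on the slide.
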
Scalability considerations
System               Node speed (Mflop/s)   Interconnect (MB/s)   Ratio (B/flop)
X1                   51200                  100000                1.95
Red Storm            4000                   6400                  1.6
ASCI Red             666                    800                   1.2
Cray T3E             1200                   1200                  1.0
BlueGene/L           5600                   1050                  0.19
Earth Simulator      64000                  12300                 0.19
ASCI Blue Mountain   64000                  1200                  .02
ASCI White           10000                  2000                  .083
LANL Pink            9600                   250                   .026
PSC Alpha Cluster    8000                   700                   .0875
ASCI Purple          218000                 32000                 .147

Red are MPPs, Green are SMPs
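
The ratio column appears to be interconnect bandwidth divided by node speed. The sketch below recomputes it from the other two columns as transcribed here and reports rows where the result differs from the listed value by more than a small tolerance (the tolerance is an arbitrary choice). With these figures, every row reproduces except ASCI White, where the recomputed value is 0.2 rather than .083, so one of the transcribed numbers in that row may be off.

```python
# (system, node speed in Mflop/s, interconnect in MB/s, ratio listed on the slide)
ROWS = [
    ("X1", 51200, 100000, 1.95),
    ("Red Storm", 4000, 6400, 1.6),
    ("ASCI Red", 666, 800, 1.2),
    ("Cray T3E", 1200, 1200, 1.0),
    ("BlueGene/L", 5600, 1050, 0.19),
    ("Earth Simulator", 64000, 12300, 0.19),
    ("ASCI Blue Mountain", 64000, 1200, 0.02),
    ("ASCI White", 10000, 2000, 0.083),
    ("LANL Pink", 9600, 250, 0.026),
    ("PSC Alpha Cluster", 8000, 700, 0.0875),
    ("ASCI Purple", 218000, 32000, 0.147),
]


def bytes_per_flop(node_mflops, interconnect_mb_s):
    """Interconnect bytes available per floating-point operation."""
    return interconnect_mb_s / node_mflops


def mismatches(rows, tolerance=0.005):
    """Names of rows whose recomputed ratio differs from the listed one."""
    return [name for name, speed, link, listed in rows
            if abs(bytes_per_flop(speed, link) - listed) > tolerance]


for name, speed, link, listed in ROWS:
    print(f"{name:<20} computed {bytes_per_flop(speed, link):6.3f}   listed {listed}")
print("rows that do not reproduce:", mismatches(ROWS))
```
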

Longer term responses
  • DARPA High Productivity Computing Systems
    • Now involves Cray, SUN and IBM (formerly also HP and SGI)
    • Ultimate time frame is 2010, but prototype design now due in 2007
    • New architectural innovations
    • Also stresses programmer productivity as much as raw machine speed
Longer Term Response: High Productivity Computing Systems
  • Goal:
    • Provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2007 – 2010)


HPCS Program Focus Areas

  • Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
  • Programmability (time-for-idea-to-first-solution): reduce cost and time of developing application solutions
  • Portability (transparency): insulate research and operational application software from system
  • Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, & programming errors

  • Applications:
    • Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology

Fill the Critical Technology and Capability Gap: Today (late 80’s HPC technology) to Future (Quantum/Bio Computing)

Materials Science Requirements

Electronic structures:

  • Current: ~300 atom: 0.5 Tflop/s, 100 Gbyte memory.
  • Future: ~3000 atom: 50 Tflop/s, 2 Tbyte memory.

Magnetic materials:

  • Current: ~2000 atom: 2.64 Tflop/s, 512 Gbytes
  • Future: hard drive simulation: 30 Tflop/s, 2 Tbyte

Molecular dynamics:

  • Current: 10^9 atoms, ns time scale: 1 Tflop/s, 50 Gbytes
  • Future: alloys, µs time scale: 20 Tflop/s, 4 Tbytes
Climate Modeling Requirements

Current state-of-the-art:

  • Atmosphere: 1 x 1.25 deg spacing, with 29 vertical layers.
  • Ocean: 0.25 x 0.25 degree spacing, 60 vertical layers.
  • Currently requires 52 seconds CPU time per simulated day.

Future requirements (to resolve ocean mesoscale eddies):

  • Atmosphere: 0.5 x 0.5 deg spacing.
  • Ocean: 0.125 x 0.125 deg spacing.
  • Computational requirement: 17 Tflop/s.

Future goal: resolve tropical cumulus clouds:

  • 2 to 3 orders of magnitude more than above.
Fusion Requirements

Tokamak simulation -- ion temperature gradient turbulence in ignition experiment:

  • Grid size: 3000 x 1000 x 64, or about 2 x 10^8 grid points.
  • Each grid cell contains 8 particles, for total of 1.6 x 10^9.
  • 50,000 time steps required.
  • Total cost: 3.2 x 10^17 flop, 1.6 Tbyte.
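
The tokamak figures can be checked step by step. The sketch below multiplies out the grid and particle counts and, reading the quoted total cost as a total operation count (an interpretation made here), derives the implied work per particle per time step. Variable names are illustrative.

```python
GRID = (3000, 1000, 64)     # from the slide
PARTICLES_PER_CELL = 8      # from the slide
TIME_STEPS = 50_000         # from the slide
TOTAL_FLOP = 3.2e17         # from the slide, read as a total operation count

grid_points = GRID[0] * GRID[1] * GRID[2]
particles = grid_points * PARTICLES_PER_CELL
particle_steps = particles * TIME_STEPS
flop_per_particle_step = TOTAL_FLOP / particle_steps

print(f"grid points: {grid_points:.2e}")
print(f"particles:   {particles:.2e}")
print(f"implied work: about {flop_per_particle_step:.0f} flop per particle per step")
```

The product gives 1.92 x 10^8 grid points and about 1.5 x 10^9 particles, matching the rounded values on the slide.
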

All-Orders Spectral Algorithm (AORSA) – to address effects of RF electromagnetic waves in plasmas.

  • 120,000 x 120,000 complex linear system.
  • 230 Gbyte memory.
  • 1.3 hours on 1 Tflop/s.
  • 300,000 x 300,000 linear system requires 8 hours.
  • Future: 6,000,000 x 6,000,000 system (576 Tbyte memory), 160 hours on 1 Pflop/s system.
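
The AORSA memory and time figures are consistent with storing and factoring a dense complex matrix. The sketch below makes two assumptions that are not stated on the slide: entries are double-precision complex (16 bytes each), and the solve is a dense LU factorization costing about (8/3) n^3 real floating-point operations. The 300,000 case is left out because the slide does not say which machine speed its 8 hours refers to.

```python
BYTES_PER_ENTRY = 16  # ASSUMPTION: double-precision complex entries


def dense_matrix_bytes(n):
    """Storage for a dense n x n matrix of complex entries."""
    return n * n * BYTES_PER_ENTRY


def lu_hours(n, flops_per_second):
    """Hours for a dense complex LU factorization.
    ASSUMPTION: about (8/3) * n^3 real floating-point operations."""
    return (8.0 / 3.0) * n ** 3 / flops_per_second / 3600.0


print(f"120,000 system:   {dense_matrix_bytes(120_000) / 1e9:.0f} Gbyte, "
      f"{lu_hours(120_000, 1e12):.1f} hours at 1 Tflop/s")
print(f"6,000,000 system: {dense_matrix_bytes(6_000_000) / 1e12:.0f} Tbyte, "
      f"{lu_hours(6_000_000, 1e15):.0f} hours at 1 Pflop/s")
```

Under these assumptions the estimates come out at about 230 Gbyte and 1.3 hours for the current problem, and 576 Tbyte and 160 hours for the future one, in line with the slide.
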
Accelerator Modeling Requirements

Current computations:

  • 128^3 to 512^3 cells, or 40 million to 2 billion particles.
  • Currently requires 10 hours on 256 CPUs.

Future computations:

  • Modeling intense beams in rings will be 100 to 1000 times more challenging.
Astrophysics Requirements

Supernova simulation:

  • 3-D understanding of Type Ia supernovae, the “standard candles” used in calculating distances to remote galaxies, requires 2,000,000 CPU-hours and more than 256 Gbyte of memory

Analysis of cosmic microwave background data:

  • MAXIMA data: 5.3 x 10^16 flops, 100 Gbyte memory
  • BOOMERANG data: 10^19 flops, 3.2 Tbyte memory
  • Future MAP data: 10^20 flops, 16 Tbyte memory
  • Future PLANCK data: 10^23 flops, 1.6 Pbyte memory
Take home lessons
  • Fielding capability computing takes considerable thought and expertise
  • Computing power continues to grow, and it will be accessible to you
  • Think of challenging problems to stress existing systems and justify more powerful ones