
Supercomputing Challenges at the National Center for Atmospheric Research

Dr. Richard Loft

Computational Science Section

Scientific Computing Division

National Center for Atmospheric Research

Boulder, CO USA



Talk Outline

  • Supercomputing Trends and Constraints

  • Observed NCAR Cluster Performance (Aggregate)

  • Microprocessor efficiency: what is possible?

  • Microprocessor efficiency: recent efforts to improve CAM2 performance.

  • Some RISC/Vector Cluster Comparisons

  • Conclusions


The demand high cost of science goals l.jpg

The Demand: High Cost of Science Goals

  • Climate scientists project a need for 150x more computing power over the next 5 years.

  • T42 -> T85: doubling the horizontal resolution increases computational cost roughly eightfold (see the sketch after this list).

  • Many additional constituents will be advected.

  • New physics: computational cost of CAM/CCM, holding resolution constant, has increased 4x since 1996. More coming…

  • Future: introducing super-parameterizations of moist processes would increase physics costs dramatically.
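The eightfold figure follows from a simple scaling argument: doubling horizontal resolution doubles the grid points in each of the two horizontal directions, and the CFL condition roughly halves the allowable timestep. A minimal sketch of that arithmetic (the cost model of points times timesteps is my own illustration, not taken from the talk):

    # Back-of-the-envelope cost of refining horizontal resolution.
    # Assumed model (not from the talk): cost ~ horizontal points x timesteps,
    # with the timestep shrinking in proportion to the grid spacing (CFL).
    def relative_cost(refinement_factor):
        horizontal_points = refinement_factor ** 2   # 2x in each horizontal direction
        timesteps = refinement_factor                # CFL: dt halves when dx halves
        return horizontal_points * timesteps

    print(relative_cost(2))   # -> 8, i.e. roughly 8x the cost of the baseline run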



Existing Infrastructure Limits at NCAR

  • Cooling Capacity

    • 450 tons (1.58 megawatts)

    • Most limiting

    • One P690 node ~ 7.9 kW ~ 2.5 tons of cooling (see the sketch after this list)

    • Balance cooling with power

  • Power ~ 1.2 MW without modifications

    • Second most limiting

    • The NCAR computer room currently draws 602 kW

    • About 400 kW of that is from the IBM clusters

  • Space ~ 14,000 sq. ft.

    • P690 ~ 196 W/sq. ft.

    • Least limiting based on current trends
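A minimal sketch of the cooling arithmetic behind the "most limiting" label, using the standard conversion of one ton of refrigeration to about 3.517 kW; the node count it implies is my own back-of-the-envelope figure, not a number from the talk:

    # Cooling-limited node count for the NCAR machine room.
    TON_TO_KW = 3.517                  # standard refrigeration-ton conversion

    cooling_kw = 450 * TON_TO_KW       # 450 tons of cooling ~ 1,583 kW
    node_kw = 7.9                      # one P690 node, per the slide

    print(f"cooling capacity: {cooling_kw:.0f} kW")
    print(f"P690 nodes supportable by cooling alone: {cooling_kw / node_kw:.0f}")  # ~200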



Mass Storage Growth

  • 1.3 Pbytes total

  • Adding ~3 Tbytes/day

  • Doubling times (over the past 5 years):

    • Unique files: 2.1 years

    • File size: 10.4 years

    • Media cost performance (GB/$): 1.9 years

  • Alarming trends

    • The MSS growth-rate doubling time has shortened over the past year; it is now 18 months (see the projection sketch after this list).

    • MSS costs are increasing…
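As an illustration of what an 18-month doubling time implies, the sketch below projects the archive forward under simple exponential growth; the 1.3 PB starting point and the 18-month doubling time are from the slides, while the projection itself is only a hypothetical extrapolation:

    # Archive size under exponential growth with a fixed doubling time.
    def projected_size_pb(start_pb, doubling_time_years, years_ahead):
        return start_pb * 2 ** (years_ahead / doubling_time_years)

    start_pb = 1.3          # current MSS holdings, per the slide
    doubling_years = 1.5    # 18-month doubling time, per the slide
    for years in (1, 3, 5):
        print(f"+{years} yr: {projected_size_pb(start_pb, doubling_years, years):.1f} PB")
    # -> roughly 2.1 PB, 5.2 PB, and 13.1 PB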



Observed Cluster Performance (Aggregate)



IBM Clusters at NCAR

  • Bluesky: 1,024-processor IBM 1.3 GHz Power-4 cluster

    • 32 P690 32-way compute servers

    • 736 processors in 92 eight-way “nodes” (bluesky8)

    • 288 processors in 9 thirty-two-way “nodes” (bluesky32)

    • Peak: 5.325 TFlops (see the peak-rate sketch after this list)

    • Dual “Colony” interconnect

  • Blackforest: IBM 375 MHz Power-3 cluster

    • 283 “Winterhawk” 4-way SMPs

    • Peak: 1.698 TFlops

    • TBMX interconnect
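A sketch of the peak-rate arithmetic for both clusters, assuming 4 floating-point operations per cycle per processor (two fused multiply-add units) on both Power-3 and Power-4; 1,024 x 1.3 GHz x 4 gives about 5.3 TFlops, consistent with the peak figure quoted later in the talk:

    # Peak rate = processors x clock x Flops per cycle.
    def peak_tflops(processors, clock_ghz, flops_per_cycle=4):
        return processors * clock_ghz * flops_per_cycle / 1000.0

    print(f"bluesky:     {peak_tflops(1024, 1.3):.3f} TFlops")       # ~5.325
    print(f"blackforest: {peak_tflops(283 * 4, 0.375):.3f} TFlops")  # ~1.698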



Observed IBM Cluster Efficiencies

  • Newer systems are less efficient.

  • Larger nodes are more efficient.

  • Max sustained performance: 320.3 GFlops



Why is workload efficiency low?

  • Computational character of the average workload:

    • L3 cache miss rate: 31%

    • Computational intensity: 0.8

  • Applications are memory bandwidth limited.

    • A simple bandwidth model predicts 5.5% of peak for bluesky32 (one possible form of such a model is sketched after this list).

  • A good metric of efficiency is Flop/cycle.

    • Factors out the dual FPUs.

    • Bluesky32: 0.18 Flop/cycle

    • Blackforest: 0.23 Flop/cycle
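The "simple BW model" mentioned above is not spelled out on the slide; the sketch below shows one plausible form of such a model, a purely bandwidth-bound estimate that treats computational intensity as Flops per 8-byte memory reference. The model form is an assumption on my part; inverting it shows what per-processor bandwidth share would reproduce the quoted 5.5% figure:

    # Bandwidth-bound efficiency estimate (assumed form, not from the talk):
    # sustained GFlops <= intensity * (per-CPU bandwidth / 8 bytes per operand).
    def bw_limited_fraction(intensity, per_cpu_bw_gbs, peak_gflops):
        return intensity * per_cpu_bw_gbs / 8.0 / peak_gflops

    intensity = 0.8        # from the slide
    peak_gflops = 5.2      # 1.3 GHz Power-4 at 4 Flops/cycle
    implied_bw = 0.055 * peak_gflops * 8.0 / intensity
    print(f"implied per-CPU bandwidth share: {implied_bw:.1f} GB/s")               # ~2.9 GB/s
    print(f"check: {bw_limited_fraction(intensity, implied_bw, peak_gflops):.1%}")  # 5.5%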



RISC Cluster Network Comparison

  • IBM Power-4 cluster with dual “Colony” network.

  • IBM Power-3 cluster with single TBMX network.

  • Compaq Alpha cluster with Quadrics network.

  • Bisection Bandwidth

    • Important for global communications

    • The XPAIR benchmark initiates all-to-all communication.

    • Dual Colony P690 local-to-global bandwidth ratio: 50:1

  • Global Reductions

    • For P processors, these should scale as log(P).

    • In practice, they scale roughly linearly with P (contrast sketched after this list).
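To make the contrast concrete, the sketch below compares ideal log2(P) growth of a tree reduction against straight linear growth, normalized so the two agree at P = 8; the numbers are purely illustrative, not measurements from these clusters:

    # Ideal tree reduction time grows as log2(P); the observed behavior on
    # these clusters grows roughly linearly in P. Illustrative numbers only.
    import math

    base_p = 8
    for p in (8, 64, 512):
        ideal = math.log2(p) / math.log2(base_p)
        linear = p / base_p
        print(f"P={p:4d}   ideal (log P): {ideal:4.1f}x   linear trend: {linear:5.1f}x")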



Cluster Network Performance



Microprocessor efficiency:

What is possible?



Example: 3-D FFT Performance

  • Hand-tuned, multithreaded 3-D FFT (STK)

  • Three 1-D FFT passes, one per axis, with transpositions between passes (a minimal sketch follows this list)

  • FFTs are memory bandwidth intensive

    • Both loads and Flops scale like N*log(N)

  • The FFT is not multiply-add dominated

  • The FFT butterfly is a non-local, strided calculation.

  • It becomes more non-local as the FFT size increases.

  • 1024^3 Transforms on P690 (IBM Power-4)
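A minimal NumPy sketch of the transform structure described above: three 1-D FFT passes with an explicit transposition between each, so that every pass works along a contiguous axis. This is only an illustration of the idea, not the hand-tuned STK benchmark code:

    # 3-D FFT assembled from 1-D FFTs along each axis, with explicit
    # transposes in between. Illustrative only -- not the STK implementation.
    import numpy as np

    def fft3d_by_passes(a):
        for _ in range(3):
            a = np.fft.fft(a, axis=-1)   # 1-D FFTs along the contiguous axis
            # physically transpose so the next pass is again contiguous
            a = np.ascontiguousarray(np.transpose(a, (2, 0, 1)))
        return a

    x = np.random.default_rng(0).standard_normal((32, 32, 32))
    assert np.allclose(fft3d_by_passes(x.astype(complex)), np.fft.fftn(x))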



Microprocessor efficiency:

Recent efforts to improve CAM2 performance…



CCM Benchmark Performance on Existing Multiprocessor Clusters



Some RISC/Vector Cluster Comparisons…



Processor Comparison



IBM P690 Cluster

  • 5.3 TFlops peak

  • 1,024 processors (32 thirty-two-way P690 nodes)

  • 5.2 GFlops/processor

  • Observed 4.1-4.5% of peak on NCAR codes

  • Max sustained on workload: 213.5 GFlops

  • Est. Peak Price Performance: $2.6/MFlops

  • Sustained Price Performance: $59/MFlops

  • Sustained Power Performance: 0.7 GFlops/kW (see the back-of-the-envelope sketch after this list)
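A hedged back-calculation from the quoted ratios; the implied system cost and power draw below are my own inferences from the figures on this slide, not numbers given in the talk:

    # Back-calculating implied cost and power draw of the P690 cluster
    # from the quoted performance ratios. Inferences only.
    peak_gflops = 5300.0
    sustained_gflops = 213.5              # max sustained on the workload
    sustained_price_per_mflops = 59.0     # $/MFlops
    power_perf_gflops_per_kw = 0.7

    print(f"sustained fraction of peak: {sustained_gflops / peak_gflops:.1%}")        # ~4.0%
    print(f"implied system cost: ${sustained_gflops * 1000 * sustained_price_per_mflops / 1e6:.1f}M")  # ~$12.6M
    print(f"implied power draw: {sustained_gflops / power_perf_gflops_per_kw:.0f} kW")  # ~305 kW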



Earth Simulator

  • 40.96 TFlops peak

  • 5,120 processors (640 eight-processor GS40 nodes)

  • 8 GFlops/processor

  • Estimated 30% of peak on NCAR codes

  • Est. Max sustained on workload: 12,200 GFlops

  • Est. Peak Price Performance: $8.5/MFlops

  • Est. Sustained Price Performance: $28/MFlops

  • Est. Sustained Power Performance: 1.525 GFlops/kW (ratios to the P690 figures are sketched after this list)
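Putting this slide beside the P690 slide connects the two to the roughly 2x cost-effectiveness figure in the conclusions; the individual ratios below are from the slides, and the comparison itself is simple arithmetic on them:

    # Ratio of the Earth Simulator estimates to the observed P690 figures.
    p690 = {"price_usd_per_mflops": 59.0, "power_gflops_per_kw": 0.7}
    es = {"price_usd_per_mflops": 28.0, "power_gflops_per_kw": 1.525}

    print(f"price-performance advantage: {p690['price_usd_per_mflops'] / es['price_usd_per_mflops']:.1f}x")  # ~2.1x
    print(f"power-performance advantage: {es['power_gflops_per_kw'] / p690['power_gflops_per_kw']:.1f}x")    # ~2.2x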



Power-4 die floor plan



Power-4 cache/CPU area comparison



Conclusions

  • Infrastructure limits (power, cooling, space) are becoming critical constraints.

  • NCAR IBM clusters sustain 4.1%-4.5% of peak.

  • Workload is memory bandwidth limited.

  • RISC cluster interconnects are weak, particularly for global communication (low bisection bandwidth, poorly scaling reductions).

  • We’re making steady progress learning how to program around these limitations.

  • At this point, vector systems appear to be about 2x more cost effective in both price and power performance.



Pentium-4 die floor plan



Pentium-4 cache/CPU comparison



Itanium II die floor plan



Itanium II CPU/cache area comparison

