An Inconvenient Question: Are We Going to Get the Algorithms and Computing Technology We Need to Make Critical Climate Predictions in Time?

Rich Loft

Director, Technology Development

Computational and Information Systems Laboratory

National Center for Atmospheric Research

[email protected]


Main Points

  • Nature of the climate system makes it a grand challenge computing problem.

  • We are at a critical juncture: we need regional climate prediction capabilities!

  • Computer clock/thread speeds are stalled: massive parallelism is the future of supercomputing.

  • Our best algorithms, parallelization strategies and architectures are inadequate to the task.

  • We need model acceleration improvements in all three areas if we are to meet the challenge.


Options for Application Acceleration

  • Scalability

    • Eliminate bottlenecks

    • Find more parallelism

    • Load balancing algorithms

  • Algorithmic Acceleration

    • Bigger Timesteps

      • Semi-Lagrangian Transport (see the sketch after this list)

      • Implicit or semi-implicit time integration – solvers

    • Fewer Points

      • Adaptive Mesh Refinement methods

  • Hardware Acceleration

    • More Threads

      • CMP, GP-GPU’s

    • Faster threads

      • device innovations (high-K)

    • Smarter threads

      • Architecture - old tricks, new tricks… magic tricks

        • Vector units, GPU’s, FPGA’s
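A toy illustration of the "Bigger Timesteps: Semi-Lagrangian Transport" entry above, a minimal 1-D sketch of my own (not code from any model mentioned in this talk): because each gridpoint interpolates the tracer at its upstream departure point, the step stays stable at Courant numbers well above the Eulerian limit of 1.

```python
def semi_lagrangian_step(q, u, dt, dx):
    """Advance tracer q one step on a periodic 1-D grid by linear
    interpolation at the departure points x_i - u*dt."""
    n = len(q)
    shift = u * dt / dx                    # departure distance in gridpoints
    q_new = []
    for i in range(n):
        xd = i - shift                     # departure point (fractional index)
        j = int(xd // 1)                   # lower neighbour on the grid
        w = xd - j                         # linear interpolation weight
        q_new.append((1.0 - w) * q[j % n] + w * q[(j + 1) % n])
    return q_new

# Courant number u*dt/dx = 2.5: fatal for a simple explicit Eulerian scheme,
# unremarkable for the semi-Lagrangian step above.
q = [1.0 if 10 <= i < 20 else 0.0 for i in range(50)]
for _ in range(20):
    q = semi_lagrangian_step(q, u=1.0, dt=2.5, dx=1.0)
```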


A Very Grand Challenge: Coupled Models of the Earth System

[Figure: schematic of a ~150 km grid cell with air column and water column (Viner, 2002)]

Typical Model Computation:

- 15-minute time steps

- 1 petaflop per model year

There are ~3.5 million timesteps in a century
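A quick check of that timestep count (my arithmetic, assuming 365-day years):

```python
steps_per_day = 24 * 60 // 15              # 15-minute steps -> 96 per day
steps_per_century = 100 * 365 * steps_per_day
print(steps_per_century)                   # 3,504,000 -> "~3.5 million timesteps"
```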


Multicomponent Earth System Model

[Diagram: Coupler connecting the Atmosphere, Ocean, Land, and Sea Ice components, with additional components/processes: C/N Cycle, Dynamic Vegetation, Land Use, Ecosystem & BGC, Gas Chemistry, Prognostic Aerosols, Upper Atmosphere, Ice Sheets]

  • Software Challenges:

    • Increasing Complexity

    • Validation and Verification

    • Understanding the Output

Key concept: A flexible coupling framework is critical!
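The "flexible coupling framework" point can be sketched in a few lines. This is a hypothetical toy (component and field names invented here, not the CCSM coupler API): components only declare what they export and import, and the coupler alone moves data between them, so adding a new component does not touch the others.

```python
class Component:
    """Toy model component holding scalar fields it exports."""
    def __init__(self, name, exports, imports):
        self.name, self.exports, self.imports = name, exports, imports
        self.state = {field: 0.0 for field in exports}

    def step(self, incoming):
        # a real component would advance its equations here;
        # the toy just records what the coupler delivered
        for field, value in incoming.items():
            self.state["seen_" + field] = value

class Coupler:
    """Moves exported fields to whichever components import them."""
    def __init__(self, components):
        self.components = components

    def exchange(self):
        pool = {f: c.state[f] for c in self.components for f in c.exports}
        for c in self.components:
            c.step({f: pool[f] for f in c.imports if f in pool})

atm = Component("atm", exports=["precip"], imports=["sst"])
ocn = Component("ocn", exports=["sst"], imports=["precip"])
Coupler([atm, ocn]).exchange()   # adding "ice" later would not change atm/ocn
```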


Climate Change

Credit: Caspar Ammann, NCAR


IPCC AR4 (2007)

  • IPCC AR4: “Warming of the climate system is unequivocal” …

  • …and it is “very likely” caused by human activities.

  • Most of the observed changes over the past 50 years are now simulated by climate models, adding confidence to future projections.

  • Model Resolutions: O(100 km)


Climate Change Research Epochs

Before IPCC AR4 (2007), curiosity driven:

  • Investigate climate change

  • Reproduce historical trends

  • Run IPCC scenarios

After AR4, policy driven:

  • Assess regional impacts

  • Simulate adaptation strategies

  • Simulate geoengineering solutions


Where We Want to Go: The Exascale Earth System Model Vision

Coupled Ocean-Land-Atmosphere Model

  • Atmosphere: ~1 km x ~1 km (cloud-resolving), 100 levels, whole atmosphere, unstructured adaptive grids

  • Land: ~100 m, 10 levels, landscape-resolving

  • Ocean: ~10 km x ~10 km (eddy-resolving), 100 levels, unstructured adaptive grids

Requirement: computing power enhancement by as much as a factor of 10^10 to 10^12

YIKES!


Compute Factors for Ultra-High Resolution Earth System Model

(courtesy of John Drake, ORNL)


Why Run Length Matters: Global Thermohaline Circulation Timescale is ~3,000 Years



Why High Resolution in the Ocean?

[Figure comparison: ocean component of CCSM (Collins et al., 2006) vs. eddy-resolving 0.1° POP (Maltrud & McClean, 2005)]



Performance Improvements Are Not Coming Fast Enough!

…suggests a 10^10 to 10^12 improvement will take 40 years


ITRS Roadmap: Feature Size Dropping 14%/Year

By 2050 the feature size reaches the size of an atom – oops!
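A rough check of that claim, using the slide's 14%/year shrink rate and my own starting values (~65 nm features around 2007, ~0.1 nm taken as "atomic" scale):

```python
import math

start_nm, atom_nm, shrink_per_year = 65.0, 0.1, 0.14
years = math.log(start_nm / atom_nm) / math.log(1.0 / (1.0 - shrink_per_year))
print(2007 + round(years))   # ~2050
```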


National Security Agency: “The power consumption of today's advanced computing systems is rapidly becoming the limiting factor with respect to improved/increased computational ability.”


Chip-Level Trends: Stagnant Clock Speed

  • Chip density is continuing to increase ~2x every 2 years

    • Clock speed is not

    • The number of cores is doubling instead

  • There is little or no additional hidden parallelism (ILP)

  • Parallelism must be exploited by software

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)


Moore’s Law -> More’s Law: Speed-Up Through Increasing Parallelism

How long can we double the number of cores per chip?


NCAR and University of Colorado Partner to Experiment with Blue Gene/L

  • Characteristics:

    • 2048 Processors/5.7 TF

    • PPC 440 (750 MHz)

    • Two processors/node

    • 512 MB memory per node

    • 6 TB file system

Dr. Henry Tufo and myself with “frost” (2005)



Current High-Resolution CCSM Runs

  • 0.25° ATM,LND + 0.1° OCN,ICE [ATLAS/LLNL]

    • 3280 processors

    • 0.42 simulated years/day (SYPD)

    • 187K CPU hours/year

  • 0.5° ATM,LND + 0.1° OCN,ICE [FRANKLIN/NERSC]

    • Current

      • 5416 processors

      • 1.31 SYPD

      • 99K CPU hours/year

    • Efficiency goal:

      • 4932 processors

      • 1.80 SYPD

      • 66K CPU hours/year
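The cost figures above follow from a simple identity (my restatement): CPU-hours per simulated year = processors × 24 / SYPD.

```python
def cpu_hours_per_sim_year(processors, sypd):
    """Wall-clock day has 24 hours; SYPD = simulated years per day."""
    return processors * 24 / sypd

print(round(cpu_hours_per_sim_year(3280, 0.42)))   # ~187K  (0.25 deg ATLAS run)
print(round(cpu_hours_per_sim_year(5416, 1.31)))   # ~99K   (0.5 deg FRANKLIN, current)
print(round(cpu_hours_per_sim_year(4932, 1.80)))   # ~66K   (efficiency goal)
```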


Current 0.5° CCSM “fuel efficient” configuration [FRANKLIN]

[Timing diagram: ATM [np=1664], OCN [np=3600], CPL [np=384], LND [np=16], ICE [np=1800]; component times of 168, 120, 91, 52, and 21 sec.; 5416 processors total]


Efficiency issues in current 0.5° CCSM configuration

[Same timing diagram as above: ATM [np=1664], OCN [np=3600], CPL [np=384], LND [np=16], ICE [np=1800]; 5416 processors]

Use Space-Filling Curves (SFC) in POP, reduce processor count by 13%.


Load Balancing: Partitioning with Space-Filling Curves

[Figure: partition for 3 processors]


Space-Filling Curve Partitioning for Ocean Model Running on 8 Processors

Static Load Balancing…

Key concept: no need to compute over land!
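A minimal sketch of the idea on these load-balancing slides (not the actual POP/CICE code; the Z-order curve choice and the toy mask are mine): order only the ocean blocks along a space-filling curve, drop land blocks entirely, and cut the curve into equal-length pieces, one per processor.

```python
def morton_key(i, j, bits=16):
    """Interleave the bits of block indices (i, j) to get a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b + 1)
        key |= ((j >> b) & 1) << (2 * b)
    return key

def sfc_partition(mask, nprocs):
    """mask[i][j] is True for ocean blocks; land blocks are simply dropped."""
    ocean = [(i, j) for i, row in enumerate(mask)
                    for j, wet in enumerate(row) if wet]
    ocean.sort(key=lambda ij: morton_key(*ij))       # walk the curve
    per_rank = -(-len(ocean) // nprocs)              # ceiling division
    return [ocean[r * per_rank:(r + 1) * per_rank] for r in range(nprocs)]

# Toy 4x4 block mask with a land corner, partitioned for 3 processors,
# echoing the "partition for 3 processors" figure.
mask = [[True, True, True, True],
        [True, True, True, False],
        [True, True, False, False],
        [True, True, True, True]]
for rank, blocks in enumerate(sfc_partition(mask, 3)):
    print(rank, blocks)
```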


Ocean Model 1/10 Degree Performance

Key concept: You need routine access to > 1k procs to discover true scaling behaviour!


Efficiency issues in current 0.5° CCSM configuration

[Same timing diagram as above: ATM [np=1664], OCN [np=3600], CPL [np=384], LND [np=16], ICE [np=1800]; 5416 processors]

Use weighted SFC (wSFC) in CICE, reduce execution time by 2x.


Static, Weighted Load Balancing Example: Sea Ice Model CICE4 @ 1° on 20 processors

  • Large domains @ low latitudes

  • Small domains @ high latitudes

Courtesy of John Dennis
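A sketch of the weighted variant (again my own illustration, not the CICE code): cut the space-filling curve by cumulative estimated work rather than by block count, so cheap low-latitude blocks form large domains and expensive high-latitude blocks form small ones.

```python
def weighted_sfc_partition(curve_blocks, work, nprocs):
    """curve_blocks: blocks already ordered along the space-filling curve;
    work[block]: estimated cost (e.g. climatological ice fraction)."""
    total = sum(work[b] for b in curve_blocks)
    parts = [[] for _ in range(nprocs)]
    rank, acc = 0, 0.0
    for b in curve_blocks:
        parts[rank].append(b)
        acc += work[b]
        # advance to the next rank once its share of the total work is met
        if acc >= total * (rank + 1) / nprocs and rank < nprocs - 1:
            rank += 1
    return parts
```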


Efficiency issues in current 0.5° CCSM configuration: coupler

[Same timing diagram as above: ATM [np=1664], OCN [np=3600], CPL [np=384], LND [np=16], ICE [np=1800]; 5416 processors]

Unresolved scalability issues in coupler – options: better interconnect, nested grids, PGAS language paradigm


Efficiency issues in current 0.5° CCSM configuration: atmospheric component

[Same timing diagram as above: ATM [np=1664], OCN [np=3600], CPL [np=384], LND [np=16], ICE [np=1800]; 5416 processors]

Scalability limitation in 0.5° fv-CAM [MPI] – shift to hybrid OpenMP/MPI version


Projected 0.5° CCSM “capability” configuration: 3.8 years/day

[Timing diagram: ATM [np=5200], OCN [np=6100], CPL [np=384], LND [np=40], ICE [np=8120]; component times of 62, 62, 31, 21, and 10 sec.; 19460 processors total]

Action: Run hybrid atmospheric model


Projected 0.5° CCSM “capability” configuration, version 2: 3.8 years/day

[Timing diagram: ATM [np=5200], OCN [np=6100], CPL [np=384], LND [np=40], ICE [np=8120]; component times of 62, 62, 31, 21, and 10 sec.; 14260 processors total]

Action: Thread ice model


Scalable Geometry Choice: Cube-Sphere

[Figure: Ne=16 cube-sphere, showing the degree of grid non-uniformity]

  • Sphere is decomposed into 6 identical regions using a central projection (Sadourny, 1972) with equiangular grid (Rancic et al., 1996).

  • Avoids pole problems, quasi-uniform.

  • Non-orthogonal curvilinear coordinate system with identical metric terms
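A hedged sketch of the central (gnomonic) projection just described, for one face only (the face centered on the +x axis; the other five follow by symmetry). This shows the geometric idea, not the HOMME source code.

```python
import math

def cube_face_point(alpha, beta):
    """Map equiangular coordinates alpha, beta in [-pi/4, pi/4] on one cube
    face to a point on the unit sphere via central (gnomonic) projection."""
    X, Y = math.tan(alpha), math.tan(beta)     # equiangular -> face coordinates
    r = math.sqrt(1.0 + X * X + Y * Y)         # distance from sphere center to face point
    return (1.0 / r, X / r, Y / r)             # project radially onto the sphere

# Corners of an Ne x Ne element grid on one face (Ne = 16 as in the figure).
Ne = 16
d = math.pi / (2 * Ne)
corners = [[cube_face_point(-math.pi / 4 + i * d, -math.pi / 4 + j * d)
            for j in range(Ne + 1)] for i in range(Ne + 1)]
```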


Scalable Numerical Method: High-Order Methods

  • Algorithmic Advantages of High Order Methods

    • h-p element-based method on quadrilaterals (Ne x Ne)

    • Exponential convergence in polynomial degree (N)

  • Computational Advantages of High Order Methods

    • Naturally cache-blocked N x N computations

    • Nearest-neighbor communication between elements (explicit)

    • Well suited to parallel µprocessor systems


HOMME: Computational Mesh

  • Elements:

    • A quadrilateral “patch” of N x N gridpoints

    • Gauss-Lobatto Grid

    • Typically N={4-8}

  • Cube

    • Ne = Elements on an edge

    • 6 x Ne x Ne elements total
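Some bookkeeping for the mesh just described (my arithmetic; the unique-point count assumes points on shared element edges are stored once, as for continuous elements, and the resolution estimate is deliberately crude):

```python
import math

def homme_mesh_stats(Ne, N, radius_km=6371.0):
    elements = 6 * Ne * Ne                           # quadrilateral elements
    points_per_element = N * N                       # Gauss-Lobatto patch
    unique_points = 6 * Ne * Ne * (N - 1) ** 2 + 2   # shared edges/corners merged
    # a cube-sphere face edge spans a quarter great circle
    avg_dx_km = (math.pi / 2) * radius_km / (Ne * (N - 1))
    return elements, points_per_element, unique_points, avg_dx_km

print(homme_mesh_stats(Ne=16, N=4))   # the Ne=16 mesh shown earlier
```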


Partitioning a cube-sphere on 8 processors


Partitioning a cubed-sphere on 8 processors


Aqua-Planet CAM/HOMME Dycore

Full CAM Physics/HOMME Dycore

Parallel I/O library used for physics aerosol input and input data

(This work could not have been done without parallel I/O.)

Work underway to couple to other CCSM components

5 years/day


Projected 0.25° CCSM “capability” configuration, version 2: 4.0 years/day

[Timing diagram: HOMME ATM [np=24000], OCN [np=6000], CPL [np=3840], LND [np=320], ICE [np=16240]; component times of 60, 60, 47, 8, and 5 sec.; 30000 processors total]

Action: insert scalable atmospheric dycore


Using a bigger parallel machine can’t be the only answer

  • Progress in the Top 500 list is not fast enough

  • Amdahl’s Law is a formidable opponent

  • Dynamical timestep goes like N^-1

    • Merciless effect of Courant limit

    • The cost of dynamics relative to physics increases as N

    • e.g. if dynamics takes 20% at 25 km it will take 86% of the time at 1 km

  • Traditional parallelization of the horizontal leaves N^2 per-thread cost (vertical x horizontal)

    • Must inevitably slow down with stalled thread speeds
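The resolution argument in the bullets above, written out (my restatement of the slide's arithmetic): refining the horizontal grid by a factor k shrinks the Courant-limited timestep by about k, so dynamics cost per column grows by about k relative to physics, whose per-column cost is roughly fixed.

```python
def dynamics_fraction(frac_at_coarse, refine):
    """Fraction of runtime spent in dynamics after refining the grid by
    `refine`, assuming dynamics cost scales with the refinement factor and
    per-column physics cost stays fixed."""
    dyn = frac_at_coarse * refine
    phys = 1.0 - frac_at_coarse
    return dyn / (dyn + phys)

# 20% dynamics at 25 km -> ~86% at 1 km (refinement factor 25), as quoted above.
print(round(dynamics_fraction(0.20, 25) * 100))
```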


Options for Application Acceleration

  • Scalability

    • Eliminate bottlenecks

    • Find more parallelism

    • Load balancing algorithms

  • Algorithmic Acceleration

    • Bigger Timesteps

      • Semi-Lagrangian Transport

      • Implicit or semi-implicit time integration – solvers

    • Fewer Points

      • Adaptive Mesh Refinement methods

  • Hardware Acceleration

    • More Threads

      • CMP, GP-GPU’s

    • Faster threads

      • device innovations (high-K)

    • Smarter threads

      • Architecture - old tricks, new tricks… magic tricks

        • Vector units, GPU’s, FPGA’s


Accelerator Research

  • Graphics Cards – Nvidia 9800/Cuda

    • Measured 109x on WRF microphysics on 9800GX2

  • FPGA – Xilinx (data flow model)

    • 21.7x simulated on sw-radiation code

  • IBM Cell Processor - 8 cores

  • Intel Larrabee


DG + NH + AMR (Discontinuous Galerkin, Non-Hydrostatic, Adaptive Mesh Refinement)

  • Curvilinear elements

  • Overhead of parallel AMR at each time-step: less than 1%

Idea based on Fischer, Kruse, Loth (02)

Courtesy of Amik St. Cyr


SLIM Ocean Model

  • Louvain-la-Neuve University

  • DG, implicit, AMR unstructured

To be coupled to prototype unstructured ATM model

(Courtesy of J-F Remacle LNU)


NCAR Summer Internships in Parallel Computational Science (SIParCS), 2007-2008

  • Open to:

    • Upper division undergrads

    • Graduate students

  • In Disciplines such as:

    • CS, Software Engineering

    • Applied Math, Statistics

    • Earth System (ES) Science

  • Support:

    • Travel, Housing, Per diem

    • 10 weeks salary

  • Number of interns selected:

    • 7 in 2007

    • 11 in 2008

http://www.cisl.ucar.edu/siparcs



The Size of the Interdisciplinary/Interagency Team Working on Climate Scalability

Contributors: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), A. St. Cyr (NCAR), J. Dennis (NCAR), J. Edwards (IBM), B. Fox-Kemper (MIT, CU), E. Hunke (LANL), B. Kadlec (CU), D. Ivanova (LLNL), E. Jedlicka (ANL), E. Jessup (CU), R. Jacob (ANL), P. Jones (LANL), S. Peacock (NCAR), K. Lindsay (NCAR), W. Lipscomb (LANL), R. Loy (ANL), J. Michalakes (NCAR), A. Mirin (LLNL), M. Maltrud (LANL), J. McClean (LLNL), R. Nair (NCAR), M. Norman (NCSU), T. Qian (NCAR), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), P. Worley (ORNL), M. Zhang (SUNYSB)

Funding:

  • DOE-BER CCPP Program Grants DE-FC03-97ER62402, DE-PS02-07ER07-06, DE-FC02-07ER64340 (B&R KP1206000)

  • DOE-ASCR (B&R KJ0101030)

  • NSF Cooperative Grant NSF01

  • NSF PetaApps Award

Computer Time:

  • Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL, Stony Brook & BNL

  • Cray XT3/4 time: ORNL, Sandia


Thanks! Any Questions?


Q. If you had a petascale computer, what would you do with it?

A. Use it as a prototype of an exascale computer.

