
Development and Acceleration of Parallel Chemical Transport Models


Presentation Transcript


  1. Paul Eller Development and Acceleration of Parallel Chemical Transport Models

  2. Overview • Motivation • GEOS-Chem KPP • GPU STEM • Background • Transport • Chemistry • Results • Conclusions

  3. Motivation • Industrialization has produced many chemicals that affect the atmosphere. • Scientists have studied the atmosphere and developed mathematical models of its chemistry and transport. • Multiprocessor systems provide access to large amounts of computing power. • Fast and accurate tools are needed to produce useful simulations.

  4. GEOS-Chem and KPP • GEOS-Chem: a state-of-the-science global 3-D model of atmospheric composition. Used for activities such as assessing intercontinental transport of pollution, evaluating air quality, and investigating tropospheric chemistry. Uses SMVGEARII as its native chemistry solver. • KPP: a software tool to assist computer simulations of chemical kinetic systems. Provides a simple natural language to describe chemical mechanisms and a comprehensive library of forward and adjoint integrators. Uses a preprocessing step to generate efficient output code.

  5. GEOS-Chem KPP • GEOS-Chem uses KPP for chemical integration. • Two Perl parsers and a modified version of KPP interface GEOS-Chem and KPP.

  6. Chemical Mechanism • The SMVGEARII file “globchem.dat” contains: • the chemical species list • the chemical reactions list • KPP requires a description of the chemical mechanism in terms of: • chemical species (*.spc) • chemical equations (*.eqn) • mechanism definitions (*.def) • The geos2kpp_parser.pl Perl parser translates the chemical mechanism from SMVGEARII format to KPP format.
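For illustration, a KPP mechanism description is split across these files roughly as in the sketch below. The species, reactions, and rate values are placeholders chosen for readability, not the actual contents of the translated "globchem.dat" mechanism.

```
{ Illustrative KPP-style species and equation definitions (not globchem.dat) }
#DEFVAR
  { species integrated by the solver }
  O3  = IGNORE ;
  NO  = IGNORE ;
  NO2 = IGNORE ;
#DEFFIX
  { species held at fixed concentrations }
  O2  = IGNORE ;
#EQUATIONS
  { reaction : rate-constant expression }
  NO2 + hv = NO + O3  : 8.9E-3 ;
  NO  + O3 = NO2 + O2 : 1.8E-14 ;
```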

  7. Interfacing GEOS-Chem and KPP • Modifying KPP: • Uses KPP input files generated by parser. • Added #GEOSCHEM input command. • Produces code with GEOS-Chem interface. • Modifying GEOS-Chem: • Use gckpp_parser.pl. • Shuffling routines map GEOS-Chem concentrations to KPP concentrations and vice versa. • Copy rate constants from GEOS-Chem to KPP.

  8. SMVGEARII and Rosenbrock Comparison • Difference between computed Ox concentrations (ppbv) for a 48 hour simulation. • Scatterplot of Ox concentrations (molecules/cm3) for a one week simulation.

  9. Work-precision Diagram • Work-precision diagram (significant digits of accuracy versus run time) for a 7-day chemistry-only simulation at RTOL = 1.0E-1, 3.0E-2, 1.0E-2, 3.0E-3, and 1.0E-3.

  10. Speedup Plot • Speedup plot for Rosenbrock Rodas3, Rosenbrock Rodas4, SMVGEARII, Runge-Kutta, and Sdirk for a 7 day chemistry-only simulation.

  11. Sulfur Transport Eulerian Model (STEM) • A simplified CTM used to experiment with new numerical methods and computing technologies. • Uses a KPP-generated chemical model. • Simplified by removing MPI calls and common blocks, and by reading checkpointed data. • Developed C versions of chemistry and transport. • Numerical solution:

  12. Graphics Processing Units (GPUs) • Highly parallel manycore processors with high computational horsepower and memory bandwidth. • Development driven by high market demand for high-definition real time graphics. • Capable of running scientific applications quickly and cheaply. • Use GPU programming languages based on traditional programming languages.

  13. Compute Unified Device Architecture (CUDA) • A parallel programming model and software environment to develop scalable parallel applications on GPUs. • Three key abstractions: • Hierarchy of thread groups • Shared memories • Barrier synchronization • Partition problem into subproblems that can be solved independently and pieces that can be solved cooperatively.
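As a concrete, hypothetical illustration of these abstractions, a minimal CUDA kernel might use a grid of thread blocks, a shared-memory tile per block, and a barrier, as sketched below; the names and the operation performed are made up for illustration, not taken from STEM.

```cuda
#include <cuda_runtime.h>

#define TILE 128

// Each block cooperatively stages a tile of concentrations in shared memory,
// synchronizes at a barrier, then each thread independently updates one value.
__global__ void scale_concentrations(const float *conc_in, float *conc_out,
                                     float factor, int n)
{
    __shared__ float tile[TILE];                      // shared within one thread block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)
        tile[threadIdx.x] = conc_in[idx];

    __syncthreads();                                  // barrier: tile fully loaded

    if (idx < n)
        conc_out[idx] = factor * tile[threadIdx.x];
}
```

Launched as `scale_concentrations<<<numBlocks, TILE>>>(...)`, the grid/block hierarchy maps directly onto the independent and cooperative pieces of the problem.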

  14. Transport • Uses two horizontal transport routines and one vertical transport routine. • Solves the advection-diffusion equation: y′ + ∇·(u y) = ∇·(k ∇y) + b • Where: • ∇·(u y) is the advection term. • ∇·(k ∇y) is the diffusion term. • b contains the boundary values. • Written as the linear system: y′ = A · y + b(t)

  15. Implicit Transport • Implicit Crank-Nicolson method: yn+1 = yn + (dt/2)·(A·yn+1 + A·yn) + (dt/2)·(b(tn) + b(tn+1)) where: A = Jacobian, yn = solution at step n, dt = time step, b(tn) = free term at step n • Simplified to: (I − (dt/2)·A)·yn+1 = (I + (dt/2)·A)·yn + (dt/2)·(b(tn) + b(tn+1))
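Since transport is applied through separate horizontal and vertical routines (slide 14), one way to realize this step is to form the right-hand side (I + (dt/2)·A)·yn + (dt/2)·(b(tn) + b(tn+1)) along a single grid line and solve the resulting tridiagonal system with the Thomas algorithm. The sketch below assumes a 1-D tridiagonal discretization of A; this layout is an assumption for illustration, not necessarily the actual STEM data structures.

```cuda
// Hypothetical sketch: solve (I - dt/2*A) * y_new = rhs for one grid line,
// where a, b, c hold the sub-, main, and super-diagonal of (I - dt/2*A).
// cp is a scratch array of length n; y_new receives the solution.
void crank_nicolson_line(int n, const double *a, const double *b,
                         const double *c, const double *rhs,
                         double *cp, double *y_new)
{
    cp[0]    = c[0]   / b[0];                // forward elimination
    y_new[0] = rhs[0] / b[0];
    for (int i = 1; i < n; ++i) {
        double m = b[i] - a[i] * cp[i - 1];
        cp[i]    = c[i] / m;
        y_new[i] = (rhs[i] - a[i] * y_new[i - 1]) / m;
    }
    for (int i = n - 2; i >= 0; --i)         // back substitution
        y_new[i] -= cp[i] * y_new[i + 1];
}
```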

  16. Explicit Transport • Rk2a method: y(1) = yn + dt · A · yn + dt · b(tn) y(2) = y(1) + dt · A · y(1) + dt · b(tn+1) yn+1 = ½(yn + y(2)) where: A = Jacobian yn = solution at step n dt = time step b(tn) = free term at step n
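A simple host-side sketch of one Rk2a step, treating A as a dense n-by-n matrix purely for readability (STEM applies its operators through its own transport routines), might look like this:

```cuda
// Hypothetical dense-matrix sketch of one Rk2a step (formulas above).
// A is n x n row-major; yn is the current solution; ynew receives y_{n+1};
// y1 and y2 are caller-provided work arrays of length n.
void rk2a_step(int n, double dt, const double *A,
               const double *bn, const double *bn1,
               const double *yn, double *ynew, double *y1, double *y2)
{
    for (int i = 0; i < n; ++i) {            // stage 1: y1 = yn + dt*(A*yn + b(tn))
        double ay = 0.0;
        for (int j = 0; j < n; ++j) ay += A[i * n + j] * yn[j];
        y1[i] = yn[i] + dt * (ay + bn[i]);
    }
    for (int i = 0; i < n; ++i) {            // stage 2: y2 = y1 + dt*(A*y1 + b(tn+1))
        double ay = 0.0;
        for (int j = 0; j < n; ++j) ay += A[i * n + j] * y1[j];
        y2[i] = y1[i] + dt * (ay + bn1[i]);
    }
    for (int i = 0; i < n; ++i)              // combine: y_{n+1} = (yn + y2)/2
        ynew[i] = 0.5 * (yn[i] + y2[i]);
}
```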

  17. GPU Transport • Pass all data from the CPU to the GPU. • Copy data from multi-dimensional STEM arrays into 1-D arrays. • Transport driver functions: • Initialize CUDA • Copy data between host and device • Set function parameters • Launch GPU kernel • Calculate thread index and load data. • Developed initialization, shuffling, and closing subroutines.
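The driver pattern above, in a hypothetical host-side CUDA sketch; the names, sizes, kernel signature, and placeholder update are illustrative, not the actual STEM code:

```cuda
#include <cuda_runtime.h>

// Hypothetical transport kernel: one thread per grid cell, placeholder update.
__global__ void transport_kernel(double *conc, const double *wind,
                                 double dt, int ncells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ncells)
        conc[i] += dt * wind[i] * conc[i];   // placeholder, not the real operator
}

// Copy flattened STEM arrays to the device, launch the kernel, copy back.
void gpu_transport_step(double *h_conc, const double *h_wind,
                        double dt, int ncells)
{
    double *d_conc, *d_wind;
    size_t bytes = (size_t)ncells * sizeof(double);

    cudaMalloc(&d_conc, bytes);                       // device allocations
    cudaMalloc(&d_wind, bytes);
    cudaMemcpy(d_conc, h_conc, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_wind, h_wind, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                                // thread-block size
    int blocks  = (ncells + threads - 1) / threads;
    transport_kernel<<<blocks, threads>>>(d_conc, d_wind, dt, ncells);
    cudaDeviceSynchronize();                          // wait for the kernel

    cudaMemcpy(h_conc, d_conc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_conc);
    cudaFree(d_wind);
}
```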

  18. Memory Optimizations • Global memory: • Pass simulation values between CPU and GPU. • Local memory: • Concentration values and temporary calculation data. • Shared memory: • Arrays used by many threads.

  19. Shared Memory Optimizations • Parallel computation of the Jacobian and (I − (dt/2)·A): many threads compute the arrays used to compute all concentrations. • Parallel computation of the inner transport loop: separate threads compute each iteration. • Parallel computation within the inner transport loop: multiple threads compute each iteration of the inner transport loop, but this made the code run slower.

  20. Rosenbrock Methods • s = number of stages • tn = discrete time moment • h = time step • yn = numerical solution • f() = function • J = Jacobian • A = System matrix • Ti = internal stage time moment • Yi = Internal stage solution • α, a, b, c, e, m = method coefficients
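The stage equations themselves appeared as images on the slide; with the symbols above, the standard KPP-style s-stage Rosenbrock formulation reads roughly as below. This is a reconstruction (with γ the diagonal method coefficient), not necessarily the exact notation used on the slide.

```latex
% Standard s-stage Rosenbrock step (KPP-style formulation); a reconstruction
% of the slide's equations using the symbols listed above.
\begin{aligned}
T_i &= t_n + \alpha_i h, \qquad
Y_i  = y_n + \sum_{j=1}^{i-1} a_{ij}\,k_j,\\
\Bigl(\tfrac{1}{h\gamma}I - J\Bigr)\,k_i
    &= f(T_i, Y_i) + \sum_{j=1}^{i-1} \tfrac{c_{ij}}{h}\,k_j,
    \qquad i = 1,\dots,s,\\
y_{n+1} &= y_n + \sum_{i=1}^{s} m_i\,k_i, \qquad
\widehat{\mathrm{err}}_{n+1} = \sum_{i=1}^{s} e_i\,k_i.
\end{aligned}
```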

  21. QSSA Methods • Use the split function evaluation y’j = Pj(y) – Dj(y)yj • Where: P(y) = Production term D(y) = Destruction term • A split function evaluation followed by a half-step approximation is computed twice per step.

  22. QSSA Method • Basic approximation: • For very small absolute values of Dj: • For small values of Dj: • For large positive Dj: • Error term: • Step size: Where yn = solution at step n, h = step size, V2 = half step approx, and V1 = full step approx.
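The approximation formulas on this slide were images; the standard QSSA update they typically correspond to, for y′_j = P_j − D_j·y_j over a step h, is sketched below. This is a hedged reconstruction; the exact case thresholds and the error/step-size formulas used in this work are not reproduced here.

```latex
% Standard QSSA update (reconstruction); the slide's exact thresholds may differ.
y_j^{n+1} \approx
\begin{cases}
  y_j^{n} + h\,\bigl(P_j - D_j\,y_j^{n}\bigr),
      & D_j h \text{ very small (explicit Euler limit)},\\[4pt]
  \dfrac{P_j}{D_j} + \Bigl(y_j^{n} - \dfrac{P_j}{D_j}\Bigr)\,e^{-D_j h},
      & D_j h \text{ moderate},\\[4pt]
  \dfrac{P_j}{D_j},
      & D_j h \text{ large (quasi-steady state)}.
\end{cases}
```

The error term then compares the half-step approximation V2 against the full-step approximation V1, and the step size h is adjusted from that estimate.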

  23. QSSA Exp2 Method • Basic approximation: • For very small values of γhA: • where γ, α, and b are method coefficients.

  24. GPU Chemistry • Developed rolled loops for function and Jacobian evaluations. • Pass all data from CPU to GPU at once. • Developed chemistry driver functions. • Initialize CUDA • Copy data between host and device • Set function parameters • Launch GPU kernel • Calculate thread index and load data.
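To illustrate what a rolled function evaluation can look like, the sketch below loops over reactions using index and stoichiometry arrays instead of the fully unrolled code KPP normally generates. The data layout (at most two reactants and four affected species per reaction) and all names are assumptions made for this sketch, not the actual generated code.

```cuda
// Hypothetical rolled ODE function evaluation on the device:
// each reaction's rate is its rate constant times the product of its
// reactant concentrations; the rate is scattered into dydt with
// stoichiometric coefficients. The index and stoichiometry arrays would
// live in constant memory in practice.
__device__ void fun_rolled(const double *conc, const double *rconst,
                           const int *reactant_idx, const int *n_reactants,
                           const int *target_idx, const double *stoich,
                           const int *n_targets,
                           int nreactions, int nvar, double *dydt)
{
    for (int j = 0; j < nvar; ++j) dydt[j] = 0.0;

    for (int r = 0; r < nreactions; ++r) {
        double rate = rconst[r];
        for (int k = 0; k < n_reactants[r]; ++k)      // multiply reactant concentrations
            rate *= conc[reactant_idx[2 * r + k]];
        for (int k = 0; k < n_targets[r]; ++k)        // accumulate production/loss terms
            dydt[target_idx[4 * r + k]] += stoich[4 * r + k] * rate;
    }
}
```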

  25. Rosenbrock Memory • Global memory: • Copy concentrations from GPU to CPU. • Texture memory: • Pass chemical concentrations, rate constants, and fixed concentrations from CPU to GPU. • Constant memory: • Arrays for rolled function/Jacobian evaluations, method parameters, and integrator parameters. • Local memory: • Holds temporary arrays used to calculate solution.

  26. Rosenbrock Optimizations • Original memory requirements: • GPU memory requirements: • Rolled vs. unrolled function/Jacobian evaluations: • Unrolled loops: very large register/local memory requirements; would not compile. The texture-memory version compiled but ran slowly. • Rolled loops: best performance.

  27. QSSA Optimizations • Memory requirements: • Global memory: Chemical concentrations and rate constants. • Local memory: Temporary arrays used to calculate solution. • Rolled vs. Unrolled Function Evaluations • Unrolled loops: • Compiled and easily fit into memory • Rolled loops: • Much slower performance • Can be used with shared memory

  28. Shared Memory Optimizations • Multiple threads per grid cell: up to NVAR (88) threads compute each grid cell, with up to 4 grid cells per thread block. Problems: limited threads per block, registers, and shared memory; unbalanced workload for function evaluations. • Multiple kernels: one kernel per loop, three kernels for each split function evaluation, and many threads per thread block. Problems: many reads/writes to global memory; unbalanced workload for function evaluations; few ways to use shared memory effectively.

  29. Transport Parameters • Threadblock size and register usage for the implicit and explicit transport methods (table shown on slide).

  30. CPU vs. GPU vs. OpenMP • Memory transfers take about 14 ms per iteration. • Occupancy is 28% for all methods.

  31. Transport Results • Results for 1 full iteration (ms): • GPUs reduce running times by a significant amount. • Outperforms OpenMP. • Problems: • Large memory requirements • Non-ideal GPU parameters • Limited use of shared memory

  32. Chemistry Parameters • Threadblock size and register usage for the QSSA and Rosenbrock chemistry methods (table shown on slide).

  33. Chemistry Results • CPU vs. GPU running times (seconds): • GPU vs. OpenMP running times (seconds):

  34. Chemistry Results • QSSA methods achieve higher speedup than Rosenbrock methods on GPU. • Rosenbrock: • Large memory footprint • Many function/Jacobian evaluations • Memory bound. • QSSA: • Small memory footprint • Simpler code structure • Computation bound

  35. Full STEM Results

  36. Full STEM Results • 6hr simulation (seconds): • Integrator Setup/Close: 127.20 seconds • Transport Setup/Close: 2.40 seconds

  37. Accuracy of STEM • Comparison panels: Rodas-3/Rodas-4, QSSA Exp2/QSSA, QSSA/Rodas-4, and QSSA Exp2/Rodas-4 (plots shown on slide).

  38. Accuracy of STEM • Panels: Rodas-4 (scale 0.00-0.03), QSSA (scale 0.00-0.45), and Difference (scale 0.00-0.45) (plots shown on slide).

  39. Conclusions • KPP Solvers produce accurate results for GEOS-Chem that scale well. • KPP Rodas3 and Rodas4 achieve a similar level of accuracy at a lower computational expense than SMVGEARII. • GPUs provide significant potential for accelerating CTMs. • Tradeoff between placing more data in fast memories or achieving high occupancies. • OpenMP STEM outperforms GPU STEM.

  40. Future Work • Develop accurate methods with a small memory footprint. • Simpler algorithms with a smaller memory footprint perform better. • Develop GPU versions of larger state-of-the-science CTMs such as GEOS-Chem. • Revisit this work after faster GPUs such as the GT300 GPUs have been released. • NVIDIA GT300 chips will have MIMD, improved double precision performance, and more memory and registers.

  41. The End • Questions?

  42. Accuracy of STEM • Ground level: Upper left: Rodas4 (scale 0.00-18×10^-3), Upper right: QSSA (scale 0.00-0.35), Lower right: Difference (scale 0.00-0.35).
