Reformulating the WRF Model for Graphics Processors

By John Ciolek
Local-scale NWP on a $5K PC?
16th Meeting of the DMCC

May 4, 2009

Video Gaming Industry
• Estimated size of the gaming industry
  • 2005: $31.3 Billion
  • 2006: $36.1 Billion
  • 2007: $42.8 Billion
• Trend toward more realistic images
  • Requires more powerful rendering hardware
  • Created explosive growth in graphics processors

Graphics Cards
• Meant to plug into standard computer bus
• Control rendering of pixels, voxels, facets, etc.
• Controlled by the central processing unit (CPU)
• Contain many processors
  • Graphics Processing Unit (GPU) (similar to CPU)
• Stream processing
  • Input set of data (stream)
  • Kernel operates on the stream
  • Performs one or more operations
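The stream-processing idea on this slide can be sketched in a few lines. This is only an illustration in plain Python, not GPU code: the names `kernel` and `run_kernel` and the arithmetic inside the kernel are hypothetical, and a sequential loop stands in for the many GPU processors that would each handle one element.

```python
from typing import Callable, List


def kernel(x: float) -> float:
    """Hypothetical kernel: performs one or more operations on one stream element."""
    return 2.0 * x + 1.0


def run_kernel(k: Callable[[float], float], stream: List[float]) -> List[float]:
    # On a GPU, each element would be processed by its own thread in parallel.
    # Here a plain loop applies the same kernel to every element in turn.
    return [k(x) for x in stream]


stream = [0.0, 1.0, 2.0, 3.0]          # input set of data (the stream)
result = run_kernel(kernel, stream)    # the kernel operates on the stream
print(result)
```

The key property is that the kernel touches one element at a time and needs nothing from its neighbors, which is what lets the hardware run many copies at once.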

GPUs
• Maximize number of processors
• Minimize cache and control structures

Memory Access
• Relies on localized memory
• Slower access to main system memory
• Note how threads are organized:
  • Grids
  • Blocks
  • Threads
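The grid/block/thread organization can be made concrete with the index arithmetic used in the one-dimensional CUDA case, where a thread's global index is `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below emulates that arithmetic in plain Python; the function names and the sizes (10 elements, 4 threads per block) are assumptions chosen for illustration.

```python
from typing import List


def blocks_needed(n_elements: int, block_dim: int) -> int:
    """Number of blocks in the grid, rounded up so every element gets a thread."""
    return (n_elements + block_dim - 1) // block_dim


def global_thread_ids(grid_dim: int, block_dim: int) -> List[int]:
    """Enumerate the global index of every thread in a 1-D grid of 1-D blocks."""
    ids = []
    for block_idx in range(grid_dim):          # blocks within the grid
        for thread_idx in range(block_dim):    # threads within a block
            ids.append(block_idx * block_dim + thread_idx)
    return ids


n = 10                                  # hypothetical problem size
block_dim = 4                           # hypothetical threads per block
grid_dim = blocks_needed(n, block_dim)
ids = global_thread_ids(grid_dim, block_dim)
active = [i for i in ids if i < n]      # threads past the end of the data do no work
print(grid_dim, len(ids), len(active))
```

Because the grid is rounded up to whole blocks, a kernel normally guards its work with a bounds check such as `i < n`, which is what the `active` list mimics.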

Programmer Accessibility
• Vendors created Application Programming Interfaces (APIs)
  • Programmers can access GPU’s capabilities
• Graphics card programming languages
  • Vendor specific
    • CUDA, Brook, Cell
  • Generic
    • OpenCL
• GPUs gained more programmer functionality
  • BLAS, FFT, PhysX

Explosive Growth in GPU Cores and Performance

Price/Performance Explosion
[Chart: price/performance in TeraFLOPS/$Million. Systems shown: NVIDIA Tesla (960 cores), PlayStation 3 Cluster (8 PS3s), Earth Simulator (5,120 procs), Blue Gene/L (65,536 procs), Roadrunner (19,440 procs), Cray 1 (1 proc), ASCI Red (4,510 procs), Cray Y-MP (8 procs)]

Current GPU Cost Examples

Serious Experimenters
• 23.2 TeraFLOPS!
• Running Folding@home
• 6,240 streaming processors
• 13 GTX 295 graphics cards
• 14 CPU cores
• Cost ~ $15,000
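The figures on this slide can be put on the same TeraFLOPS/$Million scale used in the price/performance chart. The arithmetic below uses only the slide's numbers; the cost is approximate on the slide ("~ $15,000"), so the result is a rough estimate and not a measured benchmark.

```python
cards = 13                      # GTX 295 graphics cards (from the slide)
streaming_processors = 6240     # total streaming processors (from the slide)
tflops = 23.2                   # TeraFLOPS quoted on the slide
cost_usd = 15_000               # approximate cost quoted on the slide

# Streaming processors per card implied by the slide's totals.
procs_per_card = streaming_processors / cards

# Price/performance in TeraFLOPS per million dollars.
tflops_per_million = tflops / (cost_usd / 1_000_000)

print(f"{procs_per_card:.0f} streaming processors per card")
print(f"{tflops_per_million:.0f} TeraFLOPS/$Million (approximate)")
```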

Serious Science
• Astrophysics
• Electrodynamics
• Life sciences
• Nanotechnology simulations
• Computational fluid dynamics
• Finance
• Chemistry
• Molecular dynamics
• Etc.

The WRF Connection
• John Michalakes (NCAR)
  • Formulating & optimizing WRF
  • Group working on reformulating WRF for GPUs
  • Mostly for CUDA on NVIDIA cards
• Claim: “Most recent performance improvements came from CPU speed increases”
  • No recoding was required
  • This will not continue to be the case

What’s the Catch?
• Need to identify segments of code that can be reformulated for stream processing
• Recode those segments
• Recompile & link (with optimization switches)
• Must manage memory access
  • Machine specific
• Need to use limited instruction set
• CUDA allows upward portability on NVIDIA devices

WRF Reformulation Process
• Identify target WRF packages
• Benchmark performance of current coding
• Identify quick improvement actions
  • Using CUDA compiler switches
  • CUDA intrinsic functions
• FORTRAN to C conversion
• Rewrite code
  • Rethink how to implement algorithms
  • Will take the most time
• Revalidate
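The first two steps, identifying targets and benchmarking the current code, amount to measuring where the run time goes. The sketch below shows one generic way to do that in Python. It is not how WRF itself is profiled: the stage names and the `fake_work` stand-in workloads are hypothetical, and a real study would time the actual model packages.

```python
import time
from typing import Callable, Dict


def time_stages(stages: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each stage once and record its elapsed wall-clock time in seconds."""
    timings = {}
    for name, fn in stages.items():
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings


def time_fractions(timings: Dict[str, float]) -> Dict[str, float]:
    """Convert elapsed times into each stage's share of the total."""
    total = sum(timings.values())
    return {name: t / total for name, t in timings.items()}


def fake_work(n: int) -> Callable[[], None]:
    """Hypothetical stand-in workload: a loop of n floating-point operations."""
    def run() -> None:
        s = 0.0
        for i in range(n):
            s += i * 0.5
    return run


# Hypothetical stage names and sizes, chosen only to make the example run.
stages = {
    "microphysics": fake_work(200_000),
    "advection": fake_work(100_000),
    "radiation": fake_work(50_000),
}
timings = time_stages(stages)
fractions = time_fractions(timings)
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {frac:.1%} of elapsed time")
```

Stages with a large share of elapsed time and a small share of the code are the natural candidates for reformulation.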

Early Successes
• Early work on microphysics kernel
  • 0.4% of code
  • 25% of elapsed time
• Results:
  • 5x to 20x speedup for this kernel
  • Translates to 1.25x to 1.3x overall improvement
  • Limited by Amdahl’s Law
• Based on simple rewrite
  • Did not attempt CUDA optimizations
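The overall improvement quoted on this slide follows from Amdahl's Law: if a fraction p of the run time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The check below plugs in the slide's numbers (p = 0.25, s from 5 to 20); the function name `amdahl_speedup` is just a label for this sketch.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of run time is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


p = 0.25                            # microphysics share of elapsed time (from the slide)
low = amdahl_speedup(p, 5.0)        # kernel 5x faster
high = amdahl_speedup(p, 20.0)      # kernel 20x faster
ceiling = 1.0 / (1.0 - p)           # limit as the kernel becomes infinitely fast

print(f"5x kernel:  {low:.2f}x overall")
print(f"20x kernel: {high:.2f}x overall")
print(f"upper bound: {ceiling:.2f}x overall")
```

With only 25% of the elapsed time in this kernel, no amount of kernel speedup can push the whole model past about 1.33x, which is why the other kernels also need to be reformulated.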

Microphysics Kernel Improvements
• Compiler switch: use_fast_math
• Eliminated temporary array storage
• Graph is based on recent results (March 2009)

Other Key Findings
• Need to:
  • Reduce transfers between memories
  • Maximize number of threads actively running
  • Enhance fine-grained parallelism
    • Supports “strong-scaling”
    • N times more threads ~ N times better performance
  • Explore hardware-specific optimization
• Work is continuing on WRF rewrite
  • Next WRF release will have GPU switch
  • Need additional help from community
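A simple cost model suggests why reducing transfers between memories matters. The sketch assumes the offloaded time is the kernel time divided by its speedup plus a fixed transfer time. Both this model and every number in it are hypothetical illustrations, not measurements from the WRF work.

```python
def offload_speedup(cpu_time: float, kernel_speedup: float, transfer_time: float) -> float:
    """Net speedup of an offloaded kernel once host/device transfer time is included."""
    gpu_time = cpu_time / kernel_speedup + transfer_time
    return cpu_time / gpu_time


cpu_time = 1.0          # hypothetical: seconds the kernel takes on the CPU
kernel_speedup = 20.0   # hypothetical: raw GPU speedup of the kernel

no_transfer = offload_speedup(cpu_time, kernel_speedup, 0.0)
with_transfer = offload_speedup(cpu_time, kernel_speedup, 0.2)  # hypothetical 0.2 s of copies

print(f"no transfers:   {no_transfer:.1f}x")
print(f"with transfers: {with_transfer:.1f}x")
```

In this toy case a transfer cost equal to one fifth of the original CPU time cuts a 20x kernel speedup to 4x, so keeping data resident on the card between kernels can matter as much as the kernel code itself.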

Target WRF Kernels
• Single Moment 5 Cloud Microphysics
• 5th Order Positive Definite Tracer Advection
• KPP-generated Chemical-kinetics Solver
• Long-wave Radiation Physics
• Short-wave Radiation Physics

Quote:
• “I wouldn’t recommend groups go out and buy GPU clusters just yet (to run WRF), but maybe by the end of the year…”
  • John Michalakes

The Beginning…
John Ciolek
jciolek@alphatrac.com
http://www.mmm.ucar.edu/wrf/WG2/GPU/
http://www.nvidia.com/page/home.html
