Reformulating the WRF Model for Graphics Processors

Presentation Transcript

  1. Reformulating the WRF Model for Graphics Processors • By John Ciolek • Local-scale NWP on a $5K PC? • 16th Meeting of the DMCC

  2. May 4, 2009

  3. Video Gaming Industry • Estimated size of the gaming industry • 2005: $31.3 Billion • 2006: $36.1 Billion • 2007: $42.8 Billion • Trend toward more realistic images • Requires more powerful rendering hardware • Created explosive growth in graphics processors

  4. Graphics Cards • Meant to plug into standard computer bus • Control rendering of pixels, voxels, facets, etc. • Controlled by the central processing unit (CPU) • Contain many processors • Graphics Processing Unit (GPU) (similar to CPU) • Stream processing • Input set of data (stream) • Kernel operates on the stream • Performs one or more operations
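
The stream-processing idea on this slide can be sketched in a few lines of plain Python. This is only an illustration of the concept, not GPU code: `run_kernel` and `saturate` are hypothetical names, and a real GPU would apply the kernel to many stream elements at the same time rather than one after another.

```python
from typing import Callable, Iterable, List


def run_kernel(kernel: Callable[[float], float], stream: Iterable[float]) -> List[float]:
    """Apply the same kernel independently to every element of the input stream."""
    return [kernel(x) for x in stream]


def saturate(x: float) -> float:
    """Hypothetical kernel: scale a value, then clamp it to the range [0, 1]."""
    return min(1.0, max(0.0, 2.0 * x))


# The "stream" is just an input set of data; here, four made-up pixel intensities.
pixels = [0.1, 0.4, 0.7, -0.2]
result = run_kernel(saturate, pixels)
print(result)
```

Because the kernel never looks at any element other than its own input, the elements can be processed in any order, which is what lets a GPU spread them across many processors.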

  5. GPUs • Maximize number of processors • Minimize cache and control structures

  6. Memory Access • Relies on localized memory • Slower access to main system memory • Note how threads are organized: • Grids • Blocks • Threads
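
The grid / block / thread organization can be emulated on a CPU to show how each thread finds the data element it owns. The sketch below assumes the usual one-dimensional CUDA convention (global index = block index × block size + thread index); the function names `global_index`, `launch`, and `double_in_place` are hypothetical, and the loops run sequentially where a GPU would run the threads in parallel.

```python
def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Map a (block, thread) pair to a position in the flat data array."""
    return block_idx * block_dim + thread_idx


def launch(grid_dim, block_dim, kernel, data):
    """Sequential stand-in for a kernel launch over grid_dim blocks of block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = global_index(block_idx, block_dim, thread_idx)
            # Guard: the last block may contain threads that fall past the end of the data.
            if i < len(data):
                kernel(i, data)


def double_in_place(i, data):
    """Hypothetical kernel body: each thread updates exactly one element."""
    data[i] *= 2


values = list(range(10))
block_dim = 4
# Round up so that every element is covered by some thread.
grid_dim = (len(values) + block_dim - 1) // block_dim
launch(grid_dim, block_dim, double_in_place, values)
print(grid_dim, values)
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block do nothing, which is why the bounds check is needed.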

  7. Programmer Accessibility • Vendors created Application Programming Interfaces (APIs) • Programmers can access GPU’s capabilities • Graphics card programming languages • Vendor specific • CUDA, Brook, Cell • Generic • OpenCL • GPUs gained more programmer functionality • BLAS, FFT, PhysX

  8. Explosive Growth in GPU Cores and Performance

  9. Price/Performance Explosion • [Chart: TeraFLOPS per $Million. Systems plotted: NVIDIA Tesla (960 cores), PlayStation 3 cluster (8 PS3s), Earth Simulator (5,120 procs), Blue Gene/L (65,536 procs), Roadrunner (19,440 procs), Cray 1 (1 proc), ASCI Red (4,510 procs), Cray Y-MP (8 procs)]

  10. Current GPU Cost Examples

  11. Serious Experimenters • 23.2 TeraFLOPS! • Running Folding@home • 6,240 streaming processors • 13 GTX 295 graphics cards • 14 CPU cores • Cost ~ $15,000
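
The figures on this slide can be turned into the TeraFLOPS/$Million metric used on the price/performance chart. The arithmetic below uses only the numbers quoted on the slide; the cost is approximate ("~ $15,000"), and the 23.2 TeraFLOPS figure is taken as quoted for Folding@home, so it is not necessarily measured the same way as the supercomputer figures.

```python
# Numbers as quoted on the slide.
teraflops = 23.2
stream_processors = 6240
cards = 13
cost_dollars = 15_000  # approximate cost from the slide

# Streaming processors per graphics card.
processors_per_card = stream_processors // cards

# Price/performance in the chart's units, and its inverse.
teraflops_per_million = teraflops / (cost_dollars / 1_000_000)
dollars_per_teraflops = cost_dollars / teraflops

print(processors_per_card)
print(round(teraflops_per_million, 1))
print(round(dollars_per_teraflops, 2))
```

Under these assumptions the system works out to 480 streaming processors per card and roughly 1,500 TeraFLOPS per $Million.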

  12. Serious Science • Astrophysics • Electrodynamics • Life sciences • Nanotechnology simulations • Computational fluid dynamics • Finance • Chemistry • Molecular dynamics • Etc.

  13. The WRF Connection • John Michalakes (NCAR) • Formulating & optimizing WRF • Group working on reformulating WRF for GPUs • Mostly for CUDA on NVIDIA cards • Claim: “Most recent performance improvements came from CPU speed increases” • No recoding was required • This will not continue to be the case

  14. What’s the Catch? • Need to identify segments of code that can be reformulated for stream processing • Recode those segments • Recompile & link (with optimization switches) • Must manage memory access • Machine specific • Need to use limited instruction set • CUDA allows upward portability on NVIDIA devices

  15. WRF Reformulation Process • Identify target WRF packages • Benchmark performance of current coding • Identify quick improvement actions • Using CUDA compiler switches • CUDA intrinsic functions • FORTRAN to C conversion • Rewrite code • Rethink how to implement algorithms • Will take the most time • Revalidate

  16. Early Successes • Early work on microphysics kernel • 0.4% of code • 25% of elapsed time • Results: • 5x to 20x speedup for this kernel • Translates to 1.25x to 1.3x overall improvement • Limited by Amdahl’s Law • Based on simple rewrite • Did not attempt CUDA optimizations
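
The overall figures on this slide follow from Amdahl's Law: if a fraction p of the elapsed time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The short check below plugs in the slide's numbers (p = 0.25, s = 5 and s = 20); the function name `amdahl_speedup` is just an illustrative choice.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


fraction = 0.25  # microphysics kernel: 25% of elapsed time (from the slide)

low = amdahl_speedup(fraction, 5.0)    # kernel runs 5x faster
high = amdahl_speedup(fraction, 20.0)  # kernel runs 20x faster

# Upper bound: even an infinitely fast kernel leaves the other 75% of the time untouched.
limit = 1.0 / (1.0 - fraction)

print(round(low, 3), round(high, 3), round(limit, 3))
```

The bound of about 1.33x is why accelerating a single kernel cannot go much further, and why the later slides list several additional kernels as targets.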

  17. Microphysics Kernel Improvements • Compiler switch: -use_fast_math • Eliminated temporary array storage • Graph is based on recent results (March 2009)

  18. Other Key Findings • Need to: • Reduce transfers between memories • Maximize number of threads actively running • Enhance fine-grained parallelism • Supports “strong-scaling” • N times more threads ~ N times better performance • Explore hardware-specific optimization • Work is continuing on WRF rewrite • Next WRF release will have GPU switch • Need additional help from community

  19. Target WRF Kernels • Single Moment 5 Cloud Microphysics • 5th Order Positive Definite Tracer Advection • KPP-generated Chemical-kinetics Solver • Long-wave Radiation Physics • Short-wave Radiation Physics

  20. Quote: • “I wouldn’t recommend groups go out and buy GPU clusters just yet (to run WRF), but maybe by the end of the year…” • John Michalakes

  21. The Beginning… John Ciolek jciolek@alphatrac.com http://www.mmm.ucar.edu/wrf/WG2/GPU/ http://www.nvidia.com/page/home.html