Reformulating the WRF Model for Graphics Processors



  1. Reformulating the WRF Model for Graphics Processors
     By John Ciolek
     Local-scale NWP on a $5K PC?
     16th Meeting of the DMCC

  2. May 4, 2009

  3. Video Gaming Industry
     • Estimated size of the gaming industry
       • 2005: $31.3 Billion
       • 2006: $36.1 Billion
       • 2007: $42.8 Billion
     • Trend toward more realistic images
       • Requires more powerful rendering hardware
       • Created explosive growth in graphics processors

  4. Graphics Cards
     • Meant to plug into standard computer bus
     • Control rendering of pixels, voxels, facets, etc.
     • Controlled by the central processing unit (CPU)
     • Contain many processors
       • Graphics Processing Unit (GPU) (similar to CPU)
     • Stream processing
       • Input set of data (stream)
       • Kernel operates on the stream
       • Performs one or more operations
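
The stream-processing model on this slide can be sketched in plain Python. This is only an illustration of the idea, not GPU code: the names `scale_and_offset` and `run_kernel` are hypothetical, and a real GPU would apply the kernel to many stream elements in parallel rather than one after another.

```python
from typing import Callable, Iterable, List


def scale_and_offset(x: float) -> float:
    """Hypothetical kernel: performs two operations on one stream element."""
    return 2.0 * x + 1.0


def run_kernel(kernel: Callable[[float], float], stream: Iterable[float]) -> List[float]:
    """Apply the same kernel independently to every element of the input stream.

    No element depends on any other, which is what lets a GPU hand
    different elements to different processors.
    """
    return [kernel(x) for x in stream]


input_stream = [0.0, 1.0, 2.0, 3.0]
output_stream = run_kernel(scale_and_offset, input_stream)
print(output_stream)  # [1.0, 3.0, 5.0, 7.0]
```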

  5. GPUs
     • Maximize number of processors
     • Minimize cache and control structures

  6. Memory Access
     • Relies on localized memory
     • Slower access to main system memory
     • Note how threads are organized:
       • Grids
       • Blocks
       • Threads
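
As a rough illustration of the grid/block/thread hierarchy, the sketch below emulates in Python how a one-dimensional launch is commonly mapped onto data: each thread derives a global index from its block index, the block size, and its index within the block. The function names and the sizes used are assumptions for illustration; in CUDA the corresponding built-in variables are `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```python
from typing import List


def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Global index of a thread in a 1-D grid (blockIdx.x * blockDim.x + threadIdx.x)."""
    return block_idx * block_dim + thread_idx


def emulate_launch(grid_dim: int, block_dim: int, n: int) -> List[int]:
    """Serially emulate a 1-D grid of blocks of threads.

    Returns the element index handled by each active thread, in launch order.
    """
    handled = []
    for block_idx in range(grid_dim):        # blocks within the grid
        for thread_idx in range(block_dim):  # threads within a block
            i = global_thread_index(block_idx, block_dim, thread_idx)
            if i < n:                        # guard: the last block may have spare threads
                handled.append(i)
    return handled


n = 10                                       # illustrative problem size
block_dim = 4                                # illustrative threads per block
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
indices = emulate_launch(grid_dim, block_dim, n)
print(grid_dim, indices)
```

With these illustrative sizes, three blocks of four threads are launched and the last two threads of the final block are idle, which is why the bounds check is needed.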

  7. Programmer Accessibility
     • Vendors created Application Programming Interfaces (APIs)
       • Programmers can access GPU’s capabilities
     • Graphics card programming languages
       • Vendor specific
         • CUDA, Brook, Cell
       • Generic
         • OpenCL
     • GPUs gained more programmer functionality
       • BLAS, FFT, PhysX

  8. Explosive Growth in GPU Cores and Performance

  9. Price/Performance Explosion
     [Chart of TeraFLOPS/$Million; systems shown:]
     • NVIDIA Tesla (960 cores)
     • Playstation 3 Cluster (8 PS3s)
     • Earth Simulator (5,120 procs)
     • Blue Gene/L (65,536 procs)
     • Roadrunner (19,440 procs)
     • Cray 1 (1 proc)
     • ASCI Red (4,510 procs)
     • Cray Y-MP (8 procs)

  10. Current GPU Cost Examples

  11. Serious Experimenters
      • 23.2 TeraFLOPS!
      • Running Folding@home
      • 6,240 streaming processors
      • 13 GTX 295 graphics cards
      • 14 CPU cores
      • Cost ~ $15,000
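
The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the price/performance value assumes the approximate $15,000 cost covers the whole system.

```python
# Figures quoted on the slide
total_tflops = 23.2
streaming_processors = 6240
graphics_cards = 13
cost_dollars = 15_000  # approximate, per the slide

# Derived quantities
processors_per_card = streaming_processors / graphics_cards
tflops_per_card = total_tflops / graphics_cards
tflops_per_million_dollars = total_tflops / (cost_dollars / 1_000_000)

print(f"{processors_per_card:.0f} streaming processors per card")
print(f"{tflops_per_card:.2f} TeraFLOPS per card")
print(f"{tflops_per_million_dollars:.0f} TeraFLOPS/$Million")
```

This works out to 480 streaming processors and roughly 1.78 TeraFLOPS per card, or about 1,547 TeraFLOPS/$Million, the same metric used in the price/performance comparison.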

  12. Serious Science
      • Astrophysics
      • Electrodynamics
      • Life sciences
      • Nanotechnology simulations
      • Computational fluid dynamics
      • Finance
      • Chemistry
      • Molecular dynamics
      • Etc.

  13. The WRF Connection
      • John Michalakes (NCAR)
        • Formulating & optimizing WRF
      • Group working on reformulating WRF for GPUs
        • Mostly for CUDA on NVIDIA cards
      • Claim: “Most recent performance improvements came from CPU speed increases”
        • No recoding was required
        • This will not continue to be the case

  14. What’s the Catch?
      • Need to identify segments of code that can be reformulated for stream processing
      • Recode those segments
      • Recompile & link (with optimization switches)
      • Must manage memory access
      • Machine specific
      • Need to use limited instruction set
      • CUDA allows upward portability on NVIDIA devices

  15. WRF Reformulation Process
      • Identify target WRF packages
      • Benchmark performance of current coding
      • Identify quick improvement actions
      • Using CUDA compiler switches
      • CUDA intrinsic functions
      • FORTRAN to C conversion
      • Rewrite code
      • Rethink how to implement algorithms
      • Will take the most time
      • Revalidate

  16. Early Successes
      • Early work on microphysics kernel
        • 0.4% of code
        • 25% of elapsed time
      • Results:
        • 5x to 20x speedup for this kernel
        • Translates to 1.25x to 1.3x overall improvement
        • Limited by Amdahl’s Law
      • Based on simple rewrite
        • Did not attempt CUDA optimizations
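
The overall figures on this slide follow from Amdahl's Law: if a fraction p of elapsed time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A minimal check, using the 25% share of elapsed time and the kernel speedups of 5 and 20 quoted on the slide (the function name `amdahl_speedup` is illustrative):

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the run time is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


kernel_fraction = 0.25  # microphysics kernel share of elapsed time (from the slide)

low = amdahl_speedup(kernel_fraction, 5.0)    # kernel runs 5 times faster
high = amdahl_speedup(kernel_fraction, 20.0)  # kernel runs 20 times faster
ceiling = 1.0 / (1.0 - kernel_fraction)       # limit as the kernel time goes to zero

print(f"{low:.2f}x to {high:.2f}x overall; ceiling {ceiling:.2f}x")
# 1.25x to 1.31x overall; ceiling 1.33x
```

The ceiling shows why the gain is limited: even an infinitely fast microphysics kernel could not push the whole model past about 1.33x while the other 75% of elapsed time is unchanged.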

  17. Microphysics Kernel Improvements
      • Compiler switch: use_fast_math
      • Eliminated temporary array storage
      • Graph is based on recent results (March 2009)

  18. Other Key Findings
      • Need to:
        • Reduce transfers between memories
        • Maximize number of threads actively running
        • Enhance fine-grained parallelism
          • Supports “strong-scaling”
          • N times more threads ~ N times better performance
        • Explore hardware-specific optimization
      • Work is continuing on WRF rewrite
        • Next WRF release will have GPU switch
        • Need additional help from community

  19. Target WRF Kernels
      • Single Moment 5 Cloud Microphysics
      • 5th Order Positive Definite Tracer Advection
      • KPP-generated Chemical-kinetics Solver
      • Long-wave Radiation Physics
      • Short-wave Radiation Physics

  20. Quote:
      • “I wouldn’t recommend groups go out and buy GPU clusters just yet (to run WRF), but maybe by the end of the year…”
      • John Michalakes

  21. The Beginning…
      John Ciolek
      jciolek@alphatrac.com
      http://www.mmm.ucar.edu/wrf/WG2/GPU/
      http://www.nvidia.com/page/home.html
