
High Performance Discrete Fourier Transforms on Graphics Processors


Presentation Transcript


  1. High Performance Discrete Fourier Transforms on Graphics Processors. Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, John Manferdelli. Microsoft Corporation

  2. Discrete Fourier Transforms (DFTs) • Given an input signal of N values f(n), project it onto a basis of complex exponentials • Often computed using Fast Fourier Transforms (FFTs) for efficiency • Fundamental primitive for signal processing • Convolutions, cryptography, computational fluid dynamics, large polynomial multiplications, image and audio processing, etc. • A popular HPC benchmark • HPC Challenge benchmark • NAS parallel benchmarks
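For reference, the projection onto complex exponentials can be written out explicitly. The sign and normalization convention below is the usual one and is an assumption, since the slide does not spell it out:

$$ F(k) = \sum_{n=0}^{N-1} f(n)\, e^{-2\pi i k n / N}, \qquad k = 0, 1, \dots, N-1 $$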

  3. DFT: Challenges • HPC Challenge 2008 • DFT on Cray XT3: 0.9 TFLOPS • HPL: 17 TFLOPS • Complex memory access patterns • Limited data reuse • For a balanced system, if compute-to-memory ratio doubles, the cache size needs to be squared for the system to be balanced again [Kung86] • Architectural issues • Cache associativity, memory banks

  4. GPU: Commodity Processor [images: game consoles, cell phones, the PSP handheld, desktops]

  5. Parallelism in GPUs. NVIDIA GTX280: peak 1 TFLOP performance. [diagram: multiple TPCs, each containing SPs and local memory, connected to GPU memory (DRAM) over a 140 GB/s interface]

  6. Programmability. [diagram: a thread execution manager maps a domain of thread blocks onto TPCs; each group of SPs has registers and local memory, backed by GPU memory (DRAM)]. High-level programming abstractions: Microsoft DirectX11, OpenCL, NVIDIA CUDA, AMD CAL, etc.

  7. Discrete Fourier Transforms • Objectives: • Efficiency: Achieve high performance by exploiting the memory hierarchy and high parallelism • Accuracy: Design algorithms that achieve numerical accuracy comparable to CPU libraries • Scalability: Demonstrate scalable performance based on underlying hardware capabilities • Focus on computing single-precision DFTs that fit in GPU memory • Demonstrate DFT performance of 100-300 GFLOPS per GPU for typical large sizes • Concepts applicable to double-precision algorithms

  8. FFT Overview

  9. FFT Overview [diagram: treat the data as a 2D array; FFT along columns, FFT along rows, and a transpose]
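The columns/rows picture corresponds to the standard Cooley-Tukey split, stated here as background (it is not written out on the slide). With N = N_1 N_2, index the input as n = n_1 + N_1 n_2 and the output as k = k_2 + N_2 k_1:

$$ F(k_2 + N_2 k_1) = \sum_{n_1=0}^{N_1-1} e^{-2\pi i n_1 k_1 / N_1} \left[ e^{-2\pi i n_1 k_2 / N} \sum_{n_2=0}^{N_2-1} f(n_1 + N_1 n_2)\, e^{-2\pi i n_2 k_2 / N_2} \right] $$

The inner sums are FFTs along one dimension, the bracketed exponential is a twiddle multiplication, the outer sums are FFTs along the other dimension, and the swapped output index corresponds to the transpose.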

  10. Memory hierarchy: registers (16K), shared memory (16 KB per multiprocessor), global memory (1 GB). Significant literature on FFT algorithms; detailed survey in [Van Loan 92]

  11. DFTs on GPUs: Challenges • Coalescing issues • Access contiguous blocks of data to achieve high DRAM bandwidth • Bank conflicts • Affine access patterns can map to same banks • Transpose overheads • Reduce memory access overheads • Occupancy • Require several threads to hide memory latency

  12. Outline • FFT Algorithms • Global Memory • Shared Memory • Hierarchical Memory • Other FFT algorithms • Experimental Results • Conclusions and Future Work

  13. Outline • FFT Algorithms • Global Memory • Shared Memory • Hierarchical Memory • Other FFT algorithms • Experimental Results • Conclusions and Future Work

  14. Overview • Global Memory Algorithm • Large N • Uses high memory bandwidth of GPUs • Shared Memory Algorithm • Small N • Data re-use in shared memory of GPU MPs • Hierarchical Algorithm • Intermediate sizes • Combines data transposes with shared memory algorithm
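A minimal host-side sketch of how a library might choose among the three algorithms by size. The thresholds, names, and the single-level cutoff for the hierarchical path are illustrative assumptions, not the library's actual dispatch logic:

```cuda
#include <cstddef>

// Hypothetical dispatch by transform size (illustrative only).
enum class FftPath { SharedMemory, Hierarchical, GlobalMemory };

FftPath choosePath(size_t N, size_t smemElems /* complex values that fit in shared memory */)
{
    if (N <= smemElems)             return FftPath::SharedMemory;  // small N: whole FFT in one MP's shared memory
    if (N <= smemElems * smemElems) return FftPath::Hierarchical;  // intermediate N: decompose into shared-memory FFTs
    return FftPath::GlobalMemory;                                  // large N: stream each step through global memory
}
```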

  15. Global memory algorithm • Proceeds in log_R N steps (radix R) • Decompose N into B blocks and T threads per block such that B*T = N/R • Each thread: • reads R values from global memory • multiplies by twiddle factors • performs an R-point FFT • writes R values back to global memory
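A minimal sketch of one such step for radix R = 2, assuming a Stockham-style index mapping; the kernel name and launch parameters are illustrative, not the library's actual code:

```cuda
// One radix-2 step of the global-memory algorithm (illustrative sketch, .cu file).
// N  = transform size, Ns = size of the sub-transforms already combined (R^j after j steps).
// Launch with N/2 threads total; each thread reads 2 values, twiddles, does a
// 2-point FFT, and writes 2 values back, as described on the slide.
__global__ void fftRadix2Step(const float2* in, float2* out, int N, int Ns)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= N / 2) return;

    // Read R = 2 values, strided by N/R.
    float2 a = in[t];
    float2 b = in[t + N / 2];

    // Twiddle the second value: w = exp(-2*pi*i * (t mod Ns) / (2*Ns)).
    float angle = -2.0f * 3.14159265f * (t % Ns) / (2.0f * Ns);
    float2 w  = make_float2(cosf(angle), sinf(angle));
    float2 bw = make_float2(b.x * w.x - b.y * w.y, b.x * w.y + b.y * w.x);

    // 2-point FFT (butterfly).
    float2 y0 = make_float2(a.x + bw.x, a.y + bw.y);
    float2 y1 = make_float2(a.x - bw.x, a.y - bw.y);

    // Stockham output indexing: interleave the results of the combined sub-transforms.
    int idx = (t / Ns) * (2 * Ns) + (t % Ns);
    out[idx]      = y0;
    out[idx + Ns] = y1;
}
```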

  16. Global Memory Algorithm [diagram: radix R = 4, step j = 1; threads 0-3 read data strided by N/R and write data strided by R^j] • If N/R > coalesce width (CW), no coalescing issues during reads • If R^j > CW, no coalescing issues during writes • If R^j <= CW, write to shared memory, rearrange data across threads, then write to global memory with coalescing

  17. Shared memory algorithm • Applied when the FFT is computed on data in the shared memory of an MP • Each block has N*M/R threads • M is the number of FFTs performed together in a block • Each MP performs M FFTs at a time • Similar to the global memory algorithm • Uses the Stockham formulation to reduce compute overheads
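A sketch of the data exchange between radix-R steps when the whole FFT lives in shared memory, with M FFTs packed per block as on the slide. The function name and layout are assumptions for illustration:

```cuda
// Illustrative shared-memory exchange between radix-R steps (not the library's code).
// blockDim.x = N*M/R threads; M independent size-N FFTs share one block.
extern __shared__ float2 smem[];                 // holds N*M complex values

__device__ void exchange(float2 v[], int R, int N, int Ns)
{
    int fft = threadIdx.x / (N / R);             // which of the M FFTs this thread works on
    int j   = threadIdx.x % (N / R);             // thread index within that FFT
    float2* s = smem + fft * N;

    int idxD = (j / Ns) * Ns * R + (j % Ns);     // Stockham write index
    for (int r = 0; r < R; ++r) s[idxD + r * Ns] = v[r];
    __syncthreads();
    for (int r = 0; r < R; ++r) v[r] = s[j + r * (N / R)];  // strided read for the next step
    __syncthreads();
}
```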

  18. Shared Memory Algorithm [diagram: radix R = 4, step j = 1; threads 0-3 read with stride N/R and write with stride R^j] • If N/R > number of banks, no bank conflicts during reads • If R^j > number of banks, no bank conflicts during writes

  19. Shared Memory Algorithm [diagram: threads 0-7 writing with a small stride R^j; without padding, multiple writes map to the same banks] • If R^j <= number of banks, add padding to avoid bank conflicts
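A small sketch of the padding idea: insert one dummy element after every bank's worth of values so that writes with a small stride no longer land in the same bank. The helper name and padding amount are illustrative assumptions:

```cuda
// Illustrative bank-conflict padding (not the library's actual layout).
#define NUM_BANKS 16                     // shared-memory banks on GTX280-class hardware

__device__ __forceinline__ int pad(int i)
{
    return i + i / NUM_BANKS;            // skip one slot after every NUM_BANKS elements
}

// Usage inside a shared-memory FFT step (s must be sized with the extra padding slots):
//   s[pad(idxD + r * Ns)] = v[r];       // conflict-free writes even when Ns < NUM_BANKS
//   v[r] = s[pad(j + r * (N / R))];
```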

  20. Hierarchical FFT • Decompose FFT into smaller-sized FFTs • Evaluate efficiently using shared memory algorithm • Combine transposes with FFT computation • Achieve memory coalescing

  21. Hierarchical FFT [diagram: data viewed as an H x W array with W = N/H; each multiprocessor loads CW columns of height H into shared memory and performs CW FFTs of size H in shared memory]

  22. Hierarchical FFT [diagram: H x W array, W = N/H] • Perform H FFTs of size W recursively • Transpose • In-place algorithm • The final set of transposes can also be combined with the FFT computation
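A recursive host-side sketch of the decomposition described on the last two slides. The helper names, the choice of H, and the twiddle/transpose ordering are assumptions for illustration; the slides note that the final transposes can be fused with the FFT kernels rather than run separately:

```cuda
#include <cstddef>

// Forward declarations of helper routines assumed to exist (hypothetical names).
void fftSharedMemory(float2* data, size_t N);
void fftColumns(float2* data, size_t H, size_t W);
void twiddle(float2* data, size_t H, size_t W);
void transpose(float2* data, size_t rows, size_t cols);

// Illustrative recursion for the hierarchical FFT (host code, not the library's API).
void fftHierarchical(float2* data, size_t N)
{
    const size_t SMEM_FFT_MAX = 1024;        // assumed largest size done entirely in shared memory
    if (N <= SMEM_FFT_MAX) {
        fftSharedMemory(data, N);            // base case: shared-memory algorithm
        return;
    }
    size_t H = SMEM_FFT_MAX;                 // view the data as an H x W array
    size_t W = N / H;

    fftColumns(data, H, W);                  // W FFTs of size H in shared memory (slide 21)
    twiddle(data, H, W);                     // twiddle multiply between the two passes
    transpose(data, H, W);                   // bring the size-W sub-problems into contiguous rows
    for (size_t row = 0; row < H; ++row)
        fftHierarchical(data + row * W, W);  // H FFTs of size W, recursively (slide 22)
    transpose(data, W, H);                   // final transpose to restore output order
}
```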

  23. Other FFTs • Non-power-of-two sizes • Mixed radix • Using powers of 2, 3, 5, etc. • Bluestein’s FFT • For large prime factors • Multi-dimensional FFTs • Perform FFTs independently along each dimension • Real FFTs • Exploit symmetry to improve performance • Transformed into a complex FFT problem • DCTs • Computed using a transformation to a complex FFT problem
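Bluestein's algorithm rests on the identity below (standard background, not spelled out on the slide), which rewrites a DFT of any length N, prime or not, as a convolution that can itself be computed with power-of-two FFTs:

$$ nk = \frac{n^2 + k^2 - (k-n)^2}{2} \quad\Longrightarrow\quad F(k) = e^{-\pi i k^2 / N} \sum_{n=0}^{N-1} \left[ f(n)\, e^{-\pi i n^2 / N} \right] e^{\pi i (k-n)^2 / N} $$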

  24. Microsoft DFT Library Key features supported in our GPU DFT library

  25. Outline • FFT Algorithms • Global Memory • Shared Memory • Hierarchical Memory • Other FFT algorithms • Experimental Results • Conclusions and Future Work

  26. Experimental Methodology • Hardware • Intel QX9650 3.0 GHz quad-core processor • Two dual core dies • Each pair of cores shares 6 MB L2 cache • NVIDIA GTX280 GPU • Driver version 177.41

  27. Experimental Methodology • Libraries • Our FFT library written in CUDA • Tested on various GPUs • NVIDIA’s CUFFT library (v. 1.1) • Results for GTX280 only • DX9FFT library [Lloyd et al. 2007] • Results for GTX280 only • Intel’s MKL (v. 10.0.2) • Run on CPU with 4 threads

  28. Experimental Methodology • Notation • N: size of the FFT • M: number of FFTs • Performance • GFLOPS: M · 5N lg(N) / time • Minimum time over multiple runs • Warm caches on CPU • Accuracy • Perform forward transform and inverse • Compare result to original input • Root mean square error (RMSE) / 2
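Written out, the performance metric above is the conventional FFT operation count of 5N log2 N floating-point operations per transform, times the number of transforms, divided by the measured time:

$$ \mathrm{GFLOPS} = \frac{5\, M\, N \log_2 N}{\text{time (s)} \times 10^{9}} $$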

  29. 1D Single FFT M = 1

  30. 1D Multi-FFT (entire FFT in shared memory kernel) [chart: GFLOPS (0-300) vs. log2 N for Ours GTX280* (driver 177.11), Ours GTX280, Ours 8800GTS, CUFFT, DX9FFT, and MKL; M = 2^23 / N]

  31. 1D Multi-FFT [chart annotated with speedups of 40x, 20x, and 5x; M = 2^23 / N]

  32. 1D Mixed Radix: N = 2^a 3^b 5^c, M = 2^23 / N

  33. 1D Primes: M = 2^20 / N

  34. 1D Large Primes: M = 2^22 / N

  35. RMSE Error (N = 2^a)

  36. RMSE Error (Mixed radix)

  37. RMSE Error (primes)

  38. Limitations • Current implementation • Works only on data in GPU memory • No multi-GPU support • No support for double precision • Hardware Issues • Large data sizes needed to fully utilize GPU • Slow data transfer between GPU and system memory • High accuracy twiddle factors are slow • Use a table (especially for double precision) • Need to virtualize block index • Fixed in Microsoft DirectX11

  39. Outline • FFT Algorithms • Global Memory • Shared Memory • Hierarchical Memory • Other FFT algorithms • Experimental Results • Conclusions and Future Work

  40. Conclusions • Several algorithms for performing FFTs on GPUs • Handle different sizes efficiently • Library chooses appropriate algorithms for a given size and hardware configuration • Optimized for memory performance • Combined transposes with FFT computation • Address numerical accuracy issues • High performance • Up to 300 GFLOPS on current high-end GPUs • Significantly faster than existing GPU-based libraries and CPU-based libraries for typical large sizes

  41. Future Work • More sophisticated auto-tuning • Add additional functionality: • Double precision • Multi-GPU support • Out-of-core support for very large FFTs • Port to DirectX11 using Compute Shaders

  42. Future of GPGPU • GPUs are becoming more general purpose • Fewer limitations. Microsoft DirectX11 API: • IEEE floating point support and optional double support • Integer instruction support • More programmable stages, etc. • Significant advance in performance • Higher level programming languages • Uniform abstraction layer over different hardware vendors

  43. Future of GPGPU • Widespread adoption of GPUs in commercial applications • Image and media processing, signal processing, finance, etc. • High performance computing • Can benefit from data-parallel programming • Many opportunities • Microsoft GPU Station at Booth number 1309

  44. Acknowledgments • Microsoft: Chas Boyd, Craig Mundie, Ken Oien • NVIDIA: Henry Moreton, Sumit Gupta, and David • Peter-Pike Sloan • Vasily Volkov

  45. Questions Contact: nagag@microsoft.com brandon.lloyd@microsoft.com
