1 / 31

Optimizing Radix-2 FFT Algorithm for PowerPC Architecture

Optimizing Radix-2 FFT Algorithm for PowerPC Architecture. 9/24/2007 PEREZ-SEVA Jean-Paul, THALES Computers. The authors. Jean-Paul Perez-Seva Thales Computers - www.thalescomputers.com Serge Tissot Thales Computers - www.thalescomputers.com Michel Cosnard INRIA - www.inria.fr

jaafar
Download Presentation

Optimizing Radix-2 FFT Algorithm for PowerPC Architecture

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Optimizing Radix-2 FFT Algorithm for PowerPC Architecture 9/24/2007 PEREZ-SEVA Jean-Paul, THALES Computers

  2. The authors • Jean-Paul Perez-Seva • Thales Computers - www.thalescomputers.com • Serge Tissot • Thales Computers - www.thalescomputers.com • Michel Cosnard • INRIA - www.inria.fr • Ghislain Oudinet • ISEN - www.isen.fr • Joseph Eicher • Thales Computers - www.thalescomputers.com

  3. Objectives

  4. Objectives • Highlight the advantages of PowerPC architectures • Smart and complete ISA • Powerful SIMD engine • Optimize a Fast Fourier Transform algorithm • Single-precision complex FFT algorithm • Show the efficiency of the Radix-2 algorithm • Its strengths • Its weaknesses • Find best fit FFT algorithm to architecture

  5. PowerPC Architecture

  6. PowerPC Architecture: 970 case www.arstechnica.com

  7. PowerPC Architecture: 970 case • Superscalar architecture • Up to 5 instructions dispatched per cycle • Out of order execution • PowerPC SIMD instructions close to DSP • SIMD instructions for intensive tasks • 4 way single-precision floating-point vector • Efficient instructions for signal processing • Fused Multiplication Addition instruction (FMA) • Permutation instructions • Up to 16 Gflops at 2 GHz

  8. Fast Fourier Transform Algorithm

  9. FFT Algorithm • The choice of the FFT algorithm • Radix-2 algorithm as best candidate • Efficient • Simple • Regular • Radix-2 complexity of calculation • 5NlogN operations • logN steps • N/2 butterflies per step • 10 floating-point operations per butterfly

  10. Our implementation • The optimization comes in three steps • The butterfly optimization • Reduce the number of operations • Or reduce the number of instructions • The SIMDization • Single-precision floating point calculations • 4 way parallelism • The pipeline • Smartly order instructions • Latency • Throughput • Dispatch process

  11. The radix-2 butterfly • A Radix-2 butterfly: • 10 floating point operations • 4 multiplications • 6 additions • 8 instructions • 4 multiplications additions, 2 additions & 2 subtractions • 8 multiplications additions

  12. The radix-2 butterfly • Our optimization: • New formula expression: • Only 6 fused multiplication addition instructions

  13. SIMDization • How to use the SIMD engine from PowerPC • Find data parallelism for at least 4 data points • All the butterflies from the same step are independent each other • Choice of complex data representation • Split complex data arrays • Real and Imaginary values stored in separate arrays • Single-precision floating-point

  14. SIMDization After bit reversing Example of 16-point FFT processing

  15. SIMDization step (n-1) step n Independent blocks of calculation before the last 2 steps

  16. SIMDization Each point of the 4 blocks corresponds to 4 aligned data points

  17. SIMDization • Benefits of this solution are : • Less permutations • Data flows in right order for direct SIMD processing • No permutation until the last two steps (step n-1 and n) • Bit reversing less problematic • No bit reverse to do in a vector • Only bit reverse with vector granularity • 4 times less indexes to calculate

  18. Pipelining • Some rules to follow in order to produce good code • Do not reuse data before the latency of previous generating instruction expires • Preserve the register resources • Superscalar behavior • Examples of possible parallel dispatch and execution • SIMD Load / SIMD FMA / Integer add • Load / SIMD permutation / SIMD FMA / Integer add • Out of order execution is not a concern

  19. Pipelining • Tips: Group steps 2 by 2 • Butterflies are composed of: • 6 fused multiplication addition instructions • 4 load and 4 store instructions • Memory accesses can’t be hidden behind calculations • Keep results as long as possible in registers • Eliminate unnecessary memory accesses • By grouping steps by 2 • 24 fused multiplication addition instructions • 8 load and 8 store instructions

  20. Performances and results

  21. Performances and results • Theoretical performance: • Reduce more than 8 times the number of instructions • Could achieve up to 13,3GFlops at 2GHz if SIMD Floating-point pipeline kept saturated • After implementation: • 10Gflops at 2GHz for 256 and 1k points • Better performances than other optimized PowerPC FFT libraries

  22. Performances and results • Comparison with vDSP Library on 2GHz PPC970 FX GFlops Complex Data points

  23. Conclusions

  24. Conclusions • Radix-2 FFT algorithm is a good fit for SIMD architecture with fused multiplication addition instructions • Improvements • Lack of automatic instruction ordering tool during design • Efficiency limited by the lack of registers • Solved in the Cell SPE architecture • Stockham FFT algorithm similar to radix-2 without bit reverse

  25. Thank you • Any question ?

  26. Back up Slides

  27. Back up slide 1 • Why do we choose the Radix-2 FFT Algorithm • 5 main FFT algorithm selection criteria • The butterfly expressed in multiplication addition • The parallelism capacity • The register consumption • The unrolling freedom • The pipeline behavior

  28. Back up slide 2

  29. Back up slide 3 • By dealing with independent butterflies • After grouping steps by 2

  30. Back up slide 4 • Standard Bit Reversing • After SIMD implementation

More Related