Optimizing Radix-2 FFT Algorithm for PowerPC Architecture

Optimizing Radix-2 FFT Algorithm for PowerPC Architecture 9/24/2007 PEREZ-SEVA Jean-Paul, THALES Computers

The authors • Jean-Paul Perez-Seva • Thales Computers - www.thalescomputers.com • Serge Tissot • Thales Computers - www.thalescomputers.com • Michel Cosnard • INRIA - www.inria.fr • Ghislain Oudinet • ISEN - www.isen.fr • Joseph Eicher • Thales Computers - www.thalescomputers.com

Objectives

Objectives • Highlight the advantages of PowerPC architectures • Smart and complete ISA • Powerful SIMD engine • Optimize a Fast Fourier Transform algorithm • Single-precision complex FFT algorithm • Show the efficiency of the Radix-2 algorithm • Its strengths • Its weaknesses • Find best fit FFT algorithm to architecture

PowerPC Architecture

PowerPC Architecture: 970 case www.arstechnica.com

PowerPC Architecture: 970 case • Superscalar architecture • Up to 5 instructions dispatched per cycle • Out of order execution • PowerPC SIMD instructions close to DSP • SIMD instructions for intensive tasks • 4 way single-precision floating-point vector • Efficient instructions for signal processing • Fused Multiplication Addition instruction (FMA) • Permutation instructions • Up to 16 Gflops at 2 GHz

Fast Fourier Transform Algorithm

FFT Algorithm • The choice of the FFT algorithm • Radix-2 algorithm as best candidate • Efficient • Simple • Regular • Radix-2 complexity of calculation • 5NlogN operations • logN steps • N/2 butterflies per step • 10 floating-point operations per butterfly

Our implementation • The optimization comes in three steps • The butterfly optimization • Reduce the number of operations • Or reduce the number of instructions • The SIMDization • Single-precision floating point calculations • 4 way parallelism • The pipeline • Smartly order instructions • Latency • Throughput • Dispatch process

The radix-2 butterfly • A Radix-2 butterfly: • 10 floating point operations • 4 multiplications • 6 additions • 8 instructions • 4 multiplications additions, 2 additions & 2 subtractions • 8 multiplications additions

The radix-2 butterfly • Our optimization: • New formula expression: • Only 6 fused multiplication addition instructions

SIMDization • How to use the SIMD engine from PowerPC • Find data parallelism for at least 4 data points • All the butterflies from the same step are independent each other • Choice of complex data representation • Split complex data arrays • Real and Imaginary values stored in separate arrays • Single-precision floating-point

SIMDization After bit reversing Example of 16-point FFT processing

SIMDization step (n-1) step n Independent blocks of calculation before the last 2 steps

SIMDization Each point of the 4 blocks corresponds to 4 aligned data points

SIMDization • Benefits of this solution are : • Less permutations • Data flows in right order for direct SIMD processing • No permutation until the last two steps (step n-1 and n) • Bit reversing less problematic • No bit reverse to do in a vector • Only bit reverse with vector granularity • 4 times less indexes to calculate

Pipelining • Some rules to follow in order to produce good code • Do not reuse data before the latency of previous generating instruction expires • Preserve the register resources • Superscalar behavior • Examples of possible parallel dispatch and execution • SIMD Load / SIMD FMA / Integer add • Load / SIMD permutation / SIMD FMA / Integer add • Out of order execution is not a concern

Pipelining • Tips: Group steps 2 by 2 • Butterflies are composed of: • 6 fused multiplication addition instructions • 4 load and 4 store instructions • Memory accesses can’t be hidden behind calculations • Keep results as long as possible in registers • Eliminate unnecessary memory accesses • By grouping steps by 2 • 24 fused multiplication addition instructions • 8 load and 8 store instructions

Performances and results

Performances and results • Theoretical performance: • Reduce more than 8 times the number of instructions • Could achieve up to 13,3GFlops at 2GHz if SIMD Floating-point pipeline kept saturated • After implementation: • 10Gflops at 2GHz for 256 and 1k points • Better performances than other optimized PowerPC FFT libraries

Performances and results • Comparison with vDSP Library on 2GHz PPC970 FX GFlops Complex Data points

Conclusions

Conclusions • Radix-2 FFT algorithm is a good fit for SIMD architecture with fused multiplication addition instructions • Improvements • Lack of automatic instruction ordering tool during design • Efficiency limited by the lack of registers • Solved in the Cell SPE architecture • Stockham FFT algorithm similar to radix-2 without bit reverse

Thank you • Any question ?

Back up Slides

Back up slide 1 • Why do we choose the Radix-2 FFT Algorithm • 5 main FFT algorithm selection criteria • The butterfly expressed in multiplication addition • The parallelism capacity • The register consumption • The unrolling freedom • The pipeline behavior

Back up slide 2

Back up slide 3 • By dealing with independent butterflies • After grouping steps by 2

Back up slide 4 • Standard Bit Reversing • After SIMD implementation

Optimizing Radix-2 FFT Algorithm for PowerPC Architecture

Optimizing Radix-2 FFT Algorithm for PowerPC Architecture

Presentation Transcript

PowerPC (RISC) Architecture

POWERPC

PowerPC 74xx Architecture 32-Bit Addressing Modes

PowerPC Architecture and Assembly Language

POWERPC ARCHITECTURE

A Systolic FFT Architecture for Real Time FPGA Systems

Optimizing genetic algorithm strategies for evolving networks

Reflecting Policy Limits in the FFT Algorithm

PowerPC 750

PowerPC

Reconfigurable FFT architecture

A HIGH PERFORMANCE VLSI FFT ARCHITECTURE

The PowerPC Architecture

Reconfigurable FFT architecture

Porting Plan 9 to the PowerPC Architecture

Data Aware Low Power High Throughput Modified Radix Fft Processor for Wpan Applications

RADIX

RADIX

RADIX

A Systolic FFT Architecture for Real Time FPGA Systems

PowerPC

POWERPC