
Hybrid Floating Point Technique Yields 1.2 Giga-sample per second 32 to 2048 point Floating Point FFT in a Single FPGA. HPEC 2006 Poster Session B.4, 20 September 2006. Ray Andraka, P.E., President, Andraka Consulting Group, Inc, ray@andraka.com


Presentation Transcript


1. Hybrid Floating Point Technique yields 1.2 Giga-sample per second 32 to 2048 point floating point FFT in a single FPGA
HPEC 2006 Poster Session B.4, 20 September 2006
Ray Andraka, P.E., President, Andraka Consulting Group, Inc
ray@andraka.com

2. Floating point addition & subtraction is resource intensive
[Block diagram: exponent A and exponent B feed an exponent difference that drives an exchange network and a denormalizing barrel shift of the smaller mantissa; the mantissa add/sub feeds a leading zeros detect, a renormalizing barrel shift, and rounding; an exponent adder produces the result exponent]
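To make the cost concrete, here is a software sketch of the steps the diagram names (an illustration with assumed field handling, not the poster's hardware; rounding is omitted):

```python
def fp_add(m_a, e_a, m_b, e_b, mant_bits=24):
    # each step below corresponds to one block in the diagram
    if e_a < e_b:                        # exchange network: larger exponent first
        m_a, e_a, m_b, e_b = m_b, e_b, m_a, e_a
    m_b >>= e_a - e_b                    # denormalizing barrel shift; LSBs truncated
    m = m_a + m_b                        # mantissa add (signed mantissas cover subtract)
    shift = m.bit_length() - mant_bits   # leading zeros detect (width check)
    m = m >> shift if shift > 0 else m << -shift  # renormalizing barrel shift
    return m, e_a + shift                # exponent adder
```

Every box in the diagram is real logic: two barrel shifters, a wide adder, and a priority encoder, which is why a single floating point add is so much more expensive than a fixed point one.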

3. Apply floating point to larger functions
• Floating point typically applied at add and multiply level operations
• Instead, construct higher order operations from fixed point operators
  • Phase rotator
  • FFT
• Apply floating point to those more complicated operators (see the sketch below)
  • Denormalize to convert mantissas to fixed point plus a common scale
  • Pass the exponent around the series of fixed point operations
  • Renormalize after several operations rather than after each one
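A hedged sketch of the pattern, using a dot product as a stand-in "larger function" (names and widths are illustrative, not from the poster):

```python
from math import frexp, ldexp

def hybrid_dot(xs, ys, mant_bits=24):
    # denormalize once: shift every value to the scale of the largest input
    e_common = max(frexp(v)[1] for v in xs + ys)
    fx = [round(v * 2**(mant_bits - e_common)) for v in xs]
    fy = [round(v * 2**(mant_bits - e_common)) for v in ys]
    # series of fixed point operations; the common exponent is just carried along
    acc = sum(a * b for a, b in zip(fx, fy))
    # renormalize once at the end instead of after every multiply and add
    return ldexp(acc, 2 * (e_common - mant_bits))

print(hybrid_dot([1.5, -2.0], [0.25, 3.0]))  # ~ -5.625
```

One denormalize and one renormalize amortized over many operations replaces the per-operation shifter/normalizer logic of slide 2.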

4. Apply floating point to larger functions
[Block diagram: the exponent difference from the max exponent drives a denormalizing barrel shift of the mantissas; a fixed point function replaces the per-operation mantissa datapath; leading zeros detect, a renormalizing barrel shift, and rounding produce the output mantissa, with an exponent adder producing the output exponent]

5. Add requires both addends to have the same scale
• Radix points must align; addition is inherently fixed point
• Smaller addend's mantissa is right shifted until its exponent matches the larger's
  • Exponent increments with each shift
  • Right shift truncates LSBs; truncated LSBs are lost
• Sum is left shifted to left justify, LSBs zero filled
  • No improvement to precision
• Example, different exponents (LSBs of B are lost; worked in code below):
  A = 1.101 * 2^5
  B = 1.101 * 2^3 = 0.01101 * 2^5, truncated to 0.011 * 2^5
  A + B = (1.101 + 0.011) * 2^5 = (10.000) * 2^5
• Example, renormalizing (sum LSBs are filled with 0's):
  A = 1.101 * 2^5
  B = 1.011 * 2^5
  A - B = (1.101 - 1.011) * 2^5 = (0.010) * 2^5 = (1.000) * 2^3
• Floating point sum has only as much precision as the larger addend
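The first example can be checked with integer mantissas (a small illustration; the 4 bit width matches the slide's 1.101 pattern):

```python
# A = 1.101b * 2^5, B = 1.101b * 2^3, as 4-bit mantissas (1 implied + 3 fraction bits)
m_a, e_a = 0b1101, 5
m_b, e_b = 0b1101, 3
m_b >>= e_a - e_b              # align radix points: 1.101 -> 0.011, trailing 01 lost
s = m_a + m_b                  # 0b1101 + 0b0011 = 0b10000 = 10.000b
print(bin(s), "* 2 **", e_a)   # 0b10000 * 2 ** 5 -> 64.0 (exact answer is 65.0)
```

The exact sum is 52 + 13 = 65, but the truncated alignment shift yields 64: the LSBs of B really are gone.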

6. Phase rotation does not change amplitude
• Re(y) = Re(x) * cos(w) - Im(x) * sin(w)
• Im(y) = Re(x) * sin(w) + Im(x) * cos(w)
• Magnitudes of individual I and Q components change, but complex magnitude is not altered
• No loss of precision by treating I and Q with a common exponent
  • Complex operation is limited to precision of the larger component anyway
• Using a common exponent for I and Q reduces hardware
  • Single copy of exponent logic
  • No rescaling of I with respect to Q
• Simplifies the rotator (see the sketch below)
  • Fixed point complex multiply (smaller of I or Q is denormalized)
  • Fixed point sines and cosines
  • Output renormalize is +/-1 bit shift
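A minimal sketch of such a rotator (illustrative; `i` and `q` are integer mantissas sharing one exponent, which is simply passed along unchanged):

```python
import math

def rotate(i, q, theta, tw_bits=14):
    # fixed point sine/cosine, quantized to tw_bits fractional bits
    c = round(math.cos(theta) * 2**tw_bits)
    s = round(math.sin(theta) * 2**tw_bits)
    # fixed point complex multiply on the integer mantissas
    yi = (i * c - q * s) >> tw_bits
    yq = (i * s + q * c) >> tw_bits
    # complex magnitude is unchanged, so renormalizing against the
    # shared exponent is at most a +/-1 bit shift
    return yi, yq
```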

7. FFT butterflies are only as precise as largest input
• Cooley-Tukey FFT butterfly (see the sketch below)
  • Sum and difference of a pair of complex inputs
  • One input is rotated by the "twiddle factor" phasor w^k = cos(w) + j*sin(w)
  • Rotation does not affect scale
• Smaller input right shifted to match scale
  • LSBs are lost
• Both outputs have same LSB weight before renormalizing
• Renormalizing does not add precision (zero fills LSBs)
• Output is 1 bit wider than input
  • Sum of similar sized addends
[Diagram: FFT butterfly with two complex inputs, a twiddle factor rotation on one input, and two complex outputs]
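In plain complex arithmetic the butterfly amounts to the following (a sketch of the math only, not the poster's fixed point implementation):

```python
import cmath

def butterfly(a, b, k, n):
    # rotate b by the twiddle factor w^k; |b| is unchanged by the rotation
    t = b * cmath.exp(-2j * cmath.pi * k / n)
    # sum and difference of two similar sized values: at most 1 bit of growth
    return a + t, a - t
```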

8. FFT output is only as precise as largest input
• Cascade of butterfly elements
  • Each output is essentially an adder tree with phase rotators
  • Rotators don't change scale
• Inputs right shifted to match scale of largest input
• Intermediate renormalizing not effective: each output has a term from every FFT input
• 1 bit growth per stage
  • Renormalize maintains width
  • Alternative: grow word width
• Similar effect in other FFTs (Winograd, Sande-Tukey, Singleton, etc.)
[Diagram: cascade of butterfly stages, each with twiddle factor multipliers w^k]

9. Fixed Point FFT Replaces Floating Point FFT
• Denormalize inputs
  • Shift each input right to match scale of the largest
• Perform fixed point FFT (see the sketch below)
  • Pass common exponent around it
  • Input width = mantissa bits
  • Maximum 1 bit growth per equivalent radix 2 stage
• Renormalize outputs
  • Add common exponent to delta exponent from renormalize
[Diagram: mantissas shifted right by n (denormalize) feed a fixed point FFT; outputs are shifted left (renormalize); the max exponent bypasses the FFT and recombines with the renormalize shift]
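A hedged end-to-end model of the wrapper, with a direct integer DFT standing in for the poster's fixed point FFT core (all widths and names here are illustrative assumptions):

```python
import numpy as np
from math import frexp

def fixed_point_dft(re, im, tw_bits=14):
    # integer DFT on mantissas; twiddles quantized to tw_bits fraction bits
    n = len(re)
    k = np.arange(n)
    ang = -2 * np.pi * np.outer(k, k) / n
    c = np.round(np.cos(ang) * 2**tw_bits).astype(np.int64)
    s = np.round(np.sin(ang) * 2**tw_bits).astype(np.int64)
    # drop the twiddle scaling; the truncation models hardware LSB loss
    return (c @ re - s @ im) >> tw_bits, (s @ re + c @ im) >> tw_bits

def hybrid_fft(x, mant_bits=20):
    x = np.asarray(x, dtype=complex)
    # denormalize: one common exponent, taken from the largest input
    _, e_max = frexp(abs(x).max())
    scale = 2.0 ** (mant_bits - e_max)
    re = np.round(x.real * scale).astype(np.int64)
    im = np.round(x.imag * scale).astype(np.int64)
    # fixed point FFT with the common exponent simply carried around it
    yr, yi = fixed_point_dft(re, im)
    # renormalize: reattach the common exponent to the outputs
    return (yr + 1j * yi) / scale

x = np.random.randn(16) + 1j * np.random.randn(16)
print(np.max(np.abs(hybrid_fft(x) - np.fft.fft(x))))  # small quantization error
```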

10. Advantages and Limitations
• Advantages
  • Large reduction in required hardware
  • Less complexity means higher clock rates, smaller parts
• Limitations
  • Word width grows for each radix 2 stage; becomes excessive for large FFTs
  • Max exponent needed at beginning of the set; a problem for large sequential FFTs
• Use periodic renormalization to manage word widths (sketched below)
  • A few bits of growth don't significantly affect timing
  • Word is not limited to specific widths in an FPGA
  • Fixed width assets like DSP48s limit practical word sizes
  • Find balance between precision, growth, and renormalizing stages
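A sketch of what one periodic renormalization step might look like on a block of integer mantissas (the names and the trigger policy are assumptions, not the poster's design):

```python
def renormalize(mants, e_common, mant_bits=24):
    # shrink grown mantissas back to mant_bits, folding the shift into the
    # common exponent; run every few stages rather than after every add
    width = max(abs(m) for m in mants).bit_length()
    shift = width - mant_bits
    if shift > 0:
        mants = [m >> shift for m in mants]   # truncate the grown LSBs
        e_common += shift
    return mants, e_common
```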

11. Small FFTs as building blocks
• Larger FFT constructed from small FFTs with a "mixed radix" algorithm (see the sketch below)
  • Similar to Cooley-Tukey decomposition
  • Arbitrarily large FFTs using small off-the-shelf kernels
• Combination uses FFT plus phase rotator and reorder memory
• "In-place" operation (results written to same memory locations)
[Diagram: fill matrix along rows → FFT down columns → multiply by e^(-j2*pi*k*n/N) → FFT along rows → read down columns]
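The five steps in the diagram are the classic mixed radix (four step) decomposition; a hedged numpy model of just the math (the hardware pipelines these steps through reorder memories):

```python
import numpy as np

def mixed_radix_fft(x, n1, n2):
    a = np.asarray(x, dtype=complex).reshape(n1, n2)   # fill along rows
    a = np.fft.fft(a, axis=0)                          # FFT down columns
    k1 = np.arange(n1).reshape(n1, 1)
    n = np.arange(n2).reshape(1, n2)
    a *= np.exp(-2j * np.pi * k1 * n / (n1 * n2))      # phase rotator (twiddles)
    a = np.fft.fft(a, axis=1)                          # FFT along rows
    return a.T.reshape(-1)                             # read down columns

x = np.random.randn(2048) + 1j * np.random.randn(2048)
assert np.allclose(mixed_radix_fft(x, 8, 256), np.fft.fft(x))  # 2K = 8 x 256
```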

12. Winograd FFT
• Different factorization minimizes multiplies
• Advantageous for hardware implementation
  • 74 adds and 18 real multiplies for 16 pt Winograd
  • 176 adds and 72 real multiplies for 16 pt Cooley-Tukey
• Irregular data sequence
  • Difficult for shared memory
  • Easy when reorder memory is distributed
[Diagram: Winograd FFT kernel as alternating reorder and weight (multiply) stages]

13. 32 to 2048 point mixed radix FFT
• 2K FFT is 8 x 256 mixed radix
• 256 point is 16 x 16 mixed radix
• Combined algorithms: 2K = 8 x 16 x 16
• Data arranged in a cube, FFT along each dimension
• Reorder at input and output (not shown)
• Kernel is a proprietary 1/4/8/16 point Winograd kernel
• Each kernel has a floating point wrapper
[Pipeline diagram: data reorder (4K sample BRAM) → 4/8/16 point FFT → phase rotator → data reorder (512 sample BRAM) → 8/16 point FFT → phase rotator → 1/8 point FFT; the stages after the first reorder form a 32/64/128/256 point FFT]

14. 32-2K point FFT statistics
• Speed: 400 MS/sec per FFT engine (3 in FPGA)
  • 400 MHz clock in XC4VSX55-10 (slowest speed grade)
  • 1 complex sample per clock in and out, continuous
• Latency: ~430 + 3 * FFT length + (32, 64, 128 or 256) clocks
• Utilization: less than 30% of XC4VSX55
  • DSP48s: 151
  • Slice flip-flops: 9707
  • RAMB16s: 69
  • LUTs: 7736 (4975 are SRL16)
• Precision
  • 30-35 bit mantissa internal, 8 bit exponents
  • IEEE single precision input and output
  • Matches Matlab FFT to +/- 1 LSB of output mantissa
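Taking the latency formula at face value for the largest case (a back-of-envelope illustration, not a figure from the poster): a 2048 point transform with the 256 clock term gives roughly 430 + 3*2048 + 256 ≈ 6830 clocks, or about 17 µs at the 400 MHz clock.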

15. 1.2 GSample/sec IEEE floating point FFT
• Three 400 MS/sec engines (see slide 14) operate in parallel to provide the 1.2 GSample/sec aggregate rate
[Block diagram: input buffer feeding three parallel 32 to 2K pt floating pt FFT engines, followed by an output buffer]

16. Who is Andraka Consulting Group?
• Exclusively FPGAs since 1994
• Leading industry expert on DSP in FPGAs
  • Charter Xilinx 'Xperts' partner
  • First published FIR filter in FPGAs (1992)
  • Fastest single threaded FFT kernel for FPGA
• Other current projects
  • Beamforming digital receiver: ten 25 MHz channels, 260 antennas, 500 MS/sec input sample rate
  • Cylindrical sonar array processor
  • Other digital receiver and radar projects

17. Floating Point Format
• Floating point dedicates part of the word to indicate scale (exponent)
  • Tracks radix point position as part of the data
  • Compare to fixed point, where the radix point is at an implied fixed location
• Trades precision for dynamic range
  • Useful when the data range is unknown or spans a large range
• The IEEE single precision floating point standard is a 32 bit word
  • Leftmost bit is the sign bit, S: '1' is negative, '0' is positive
  • Next 8 bits are the exponent E, excess 127 format
  • Right 23 bits are the fraction F. There is an implicit '1' bit to the left of the fraction except in special cases. The fraction's radix point is between the implied '1' and the leftmost bit of the fraction.
  • S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
  • Number = (-1)^S * 2^(E-127) * (1.F)
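A small sketch that unpacks these fields for a normal number (illustration only; field positions follow the layout above):

```python
import struct

def decode_ieee754(x):
    # pack the float into its 32-bit pattern, then pull out the fields
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = (bits >> 31) & 0x1        # S: 1 bit
    exponent = (bits >> 23) & 0xFF   # E: 8 bits, excess-127
    fraction = bits & 0x7FFFFF       # F: 23 bits
    # normal numbers only: implicit leading 1 on the fraction
    value = (-1)**sign * 2.0**(exponent - 127) * (1 + fraction / 2**23)
    return sign, exponent, fraction, value

print(decode_ieee754(-6.5))  # (1, 129, 5242880, -6.5): -1.101b * 2^2
```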
