A New Class of High Performance FFTs

A New Class of High Performance FFTs. Dr. J. Greg Nash Centar (www.centar.net) [email protected] High Performance Embedded Computing (HPEC) Workshop 19-21 September 2006. New Base-4 DFT Matrix Equation. Traditional DFT Matrix form: New Matrix form for DFT †


### A New Class of High Performance FFTs

Dr. J. Greg Nash

Centar (www.centar.net)

High Performance Embedded Computing (HPEC) Workshop, 19-21 September 2006

New Base-4 DFT Matrix Equation
• Traditional DFT matrix form: $Z = CX$
• New matrix form for DFT†: $Y = W_M \bullet C_{M1}X$, $Z = C_{M2}Y^t$, where "$\bullet$" denotes element-by-element multiplication
• $C_{M1}$ and $C_{M2}$ contain only elements from the set $\{1, -1, j, -j\}$
• $C_{M1}X$ and $C_{M2}Y^t$ involve only complex additions/subtractions
• Twiddle factor matrix $W_M$ is of size N/4 x N/4, rather than the N x N of $C$
• 16x fewer multiplies than the traditional DFT equation ($Z = CX$)

†J. G. Nash, “Computationally efficient systolic architecture for computing the discrete Fourier transform,” IEEE Transactions on Signal Processing, Volume 53, Issue 12, Dec. 2005, pp. 4640-4651.
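The factorization on this slide can be checked numerically. The sketch below is one index mapping that is consistent with the slide's description (additions-only products with $C_{M1}$ and $C_{M2}$, and an N/4 x N/4 twiddle matrix applied element by element). The specific layout of `X`, the function names, and the restriction to N divisible by 16 are assumptions made for this illustration, not details taken from the cited paper.

```python
import cmath

# Powers of the 4-point kernel exp(-2*pi*j/4): the only values C_M1 and C_M2 hold.
UNIT = (1, -1j, -1, 1j)

def direct_dft(x):
    """Reference: Z = C X using the full N x N matrix C."""
    N = len(x)
    return [sum(cmath.exp(-2j * cmath.pi * n * k / N) * x[n] for n in range(N))
            for k in range(N)]

def base4_dft(x):
    """Y = W_M . (C_M1 X), Z = C_M2 Y^t. This sketch assumes 16 divides N."""
    N = len(x)
    if N % 16:
        raise ValueError("sketch assumes N is a multiple of 16")
    M = N // 4
    # Assumed layout: X is 4 x M with X[q][r] = x[r + M*q].
    X = [x[q * M:(q + 1) * M] for q in range(4)]
    # C_M1 X, with C_M1[p][q] = UNIT[(p*q) % 4]. Multiplying by 1, -1, j or -j
    # is a sign change or real/imaginary swap, so hardware needs only adders.
    A = [[sum(UNIT[(p * q) % 4] * X[q][r] for q in range(4)) for r in range(M)]
         for p in range(M)]
    # Element-by-element multiply by W_M[p][r] = exp(-2*pi*j*p*r/N).
    # These (N/4)^2 products are the only general complex multiplications.
    Y = [[cmath.exp(-2j * cmath.pi * p * r / N) * A[p][r] for r in range(M)]
         for p in range(M)]
    # Z = C_M2 Y^t, with C_M2[s][r] = UNIT[(s*r) % 4]; Z[s][p] is bin p + M*s.
    Z = [[sum(UNIT[(s * r) % 4] * Y[p][r] for r in range(M)) for p in range(M)]
         for s in range(4)]
    return [Z[s][p] for s in range(4) for p in range(M)]
```

For N = 32, the matrix C has 32 x 32 = 1024 entries while W_M has 8 x 8 = 64, which is the factor of 16 quoted on the slide.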

Find Systolic Architecture Using SPADE†

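[Diagram: SPADE design flow. Labeled blocks: mathematical algorithm (input code); FPGA architectural constraints; objective functions; automatic search for space-time transformations, T; simulator and graphical outputs.]

Input code shown in the diagram (loop bodies not preserved):

for j to N/4 do
  for k to N/4 do
  od;
  for k to 4 do
  od
od;

Architecture notes from the diagram:

- 2-D mesh array
- fine-grained PEs (registers, adder, mux)
- linear arrays of multipliers, memory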

†Symbolic Parallel Algorithm Development Environment
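To make the idea of a space-time transformation T concrete, here is a generic sketch of how a linear map assigns each iteration of a two-level loop nest to a time step and a processing element (PE). It is not SPADE's search procedure, which the transcript does not describe; the function names, the 2 x 2 integer form of `T`, and the conflict check are assumptions chosen for illustration.

```python
from itertools import product

def apply_space_time(T, bounds):
    """Map each loop index (j, k) to (time, pe) using the rows of a 2 x 2 T."""
    (t_j, t_k), (s_j, s_k) = T
    return {(j, k): (t_j * j + t_k * k, s_j * j + s_k * k)
            for j, k in product(range(bounds[0]), range(bounds[1]))}

def is_conflict_free(schedule):
    """A valid mapping never places two iterations on one PE in the same step."""
    return len(set(schedule.values())) == len(schedule)

def latency(schedule):
    """Number of time steps spanned by the schedule."""
    times = [t for t, _ in schedule.values()]
    return max(times) - min(times) + 1
```

With `T = ((1, 1), (1, 0))`, iteration (j, k) runs at time j + k on PE j. For a 4 x 4 iteration space this mapping is conflict free and spans 7 time steps. A search tool would compare many such candidates against constraints and objective functions like those listed in the diagram.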

Functional Operation
• Processing flow for DFT of length N = N1 * N2
  • Stage 1: N2 column DFTs ($X_{ci}$) of length N1
  • Stage 2: Twiddle multiplication
  • Stage 3: N1 row DFTs ($X_{ri}$) of length N2
• Systolic adder arrays for matrix multiplication
  • N1/4 x 4 array for column multiplies $C_{M1}X_{ci}$ and $C_{M2}Y^t_{ci}$
  • N2/4 x 4 array for row multiplies $C_{M1}X_{ri}$ and $C_{M2}Y^t_{ri}$
  • N2/4 x 4 array is implemented virtually on one row of the N1/4 x 4 array
  • Uses systolic 1-D array matrix multiplication
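The three-stage flow above is the standard row/column decomposition of a length N = N1 * N2 DFT. The sketch below shows the data movement only. The helper `dft` is a plain direct DFT standing in for the base-4 column and row transforms, and the index conventions (input element (n1, n2) at x[N2*n1 + n2], output bin k = k1 + N1*k2) are assumptions for this example rather than the mapping used in the hardware.

```python
import cmath

def dft(v):
    """Direct DFT of a short sequence (stand-in for the column/row DFT units)."""
    n = len(v)
    return [sum(cmath.exp(-2j * cmath.pi * i * k / n) * v[i] for i in range(n))
            for k in range(n)]

def row_column_dft(x, N1, N2):
    """Length N1*N2 DFT via column DFTs, twiddle multiply, then row DFTs."""
    N = N1 * N2
    if len(x) != N:
        raise ValueError("len(x) must equal N1 * N2")
    # View x as an N1 x N2 matrix whose element (n1, n2) is x[N2*n1 + n2].
    # Stage 1: N2 column DFTs of length N1, giving cols[n2][k1].
    cols = [dft([x[N2 * n1 + n2] for n1 in range(N1)]) for n2 in range(N2)]
    # Stage 2: twiddle multiplication by exp(-2*pi*j*n2*k1/N), giving tw[k1][n2].
    tw = [[cmath.exp(-2j * cmath.pi * n2 * k1 / N) * cols[n2][k1]
           for n2 in range(N2)] for k1 in range(N1)]
    # Stage 3: N1 row DFTs of length N2, giving rows[k1][k2].
    rows = [dft(tw[k1]) for k1 in range(N1)]
    # Output bin k = k1 + N1*k2.
    return [rows[k % N1][k // N1] for k in range(N)]
```

Calling `row_column_dft(x, 3, 4)` on a 12-point input should match `dft(x)` to within rounding error. The N = 1024 example in the presentation corresponds to N1 = N2 = 32.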
FFT Systolic Architecture

Example architecture for N = 1024 (N1 = N2 = 32)

• Simple PEs, locally connected
• Higher clock speeds
• Easier design/test/maintainability
• Lower power
• Efficient use of FPGA fabric
• Simple control
• Small memory blocks (one per PE)
• Faster read/write times
• Lower power
• Linear structure (scales in N/S direction)
• Matches the FPGA fabric's linearly distributed embedded elements (e.g., memory and multipliers)
Enhanced Functionality
• Transform size N not restricted to powers of two
• N = 256n (n = 1, 2, 3, ...)
• More reachable points
• Uniform distribution of points
• Circuit is scalable
• Any DFT size can be computed on the same hardware with sufficient memory
• Larger FFT circuits constructed by replication of identical 4x4 PE array processing blocks
• Low computational latency
• Pipeline depth is small compared with traditional pipelined FFTs
• 1-D and 2-D transforms possible on the same circuit
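The "more reachable points" and "uniform distribution" bullets can be illustrated by listing the transform sizes available under each rule. In the sketch below the upper limit of 4096 and both function names are arbitrary choices for illustration, not values from the presentation.

```python
def sizes_256n(limit):
    """Transform sizes N = 256n that do not exceed limit."""
    return [256 * n for n in range(1, limit // 256 + 1)]

def power_of_two_sizes(limit, start=256):
    """Power-of-two transform sizes from start up to limit."""
    sizes, n = [], start
    while n <= limit:
        sizes.append(n)
        n *= 2
    return sizes
```

Up to N = 4096, the N = 256n rule gives 16 sizes spaced 256 apart, while powers of two give only 5 (256, 512, 1024, 2048, 4096), with gaps that double at each step.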
Block Floating Point/Floating Point Operation
• Multiple “regions” each with their own block floating point and floating point circuitry (32 regions in a 1024-point FFT)
• Column DFTs use block floating point and row DFTs use floating point
• Higher dynamic range and higher signal-to-noise ratio (lower roundoff noise)
• Number of regions increases with transform size
• Supports streaming FFTs
• Comparison of “single tone”, random frequency and phase data sets (DR = dynamic range, “noise” = roundoff noise) [comparison table not preserved]
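As background for the bullets above, block floating point stores one shared exponent per block of values alongside fixed-width integer mantissas. The sketch below is a generic illustration of that idea for real values. The 8-bit mantissa in the usage note, the rounding rule, and the function names are assumptions, and it does not model the per-region circuitry described in the presentation.

```python
import math

def bfp_quantize(block, mantissa_bits=16):
    """Quantize real values to signed integer mantissas sharing one exponent."""
    peak = max(abs(v) for v in block)
    if peak == 0:
        return [0] * len(block), 0
    # Largest magnitude a signed mantissa of this width can hold.
    limit = 2 ** (mantissa_bits - 1) - 1
    # Smallest power-of-two scale such that peak / scale fits within limit.
    exponent = math.ceil(math.log2(peak / limit))
    scale = 2.0 ** exponent
    return [round(v / scale) for v in block], exponent

def bfp_dequantize(mantissas, exponent):
    """Recover approximate real values from mantissas and the shared exponent."""
    return [m * 2.0 ** exponent for m in mantissas]
```

With `bfp_quantize([0.5, -0.25, 0.001, 0.75], mantissa_bits=8)` the shared exponent is -7 and the mantissas are `[64, -32, 0, 96]`: the small value 0.001 rounds to zero because the exponent is set by the largest value in the block. Using separate exponents for separate regions limits this effect to values that share a region.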
Performance Comparison: 256-point DFT
• Altera block floating point circuit
• “Streaming” (continuous data in and out)
• Comparable dynamic range and signal to (roundoff) noise ratio
• Both circuits mapped to Altera Stratix II EP2S15F484C3 FPGA
• Altera circuit from MegaCore FFT v2.2.0
• Results from timing analysis (Altera Quartus 5.1 software)
Preliminary Figure of Merit
• Altera block floating point circuits
• “Streaming” (continuous data in and out)
• Comparable dynamic range and signal to noise ratio
• Circuits mapped to Altera Stratix II FPGAs
• Altera circuit from MegaCore FFT v2.2.0

FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz)

*Estimate (no timing analysis or layout)
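The figure of merit defined above is simple to evaluate. Since cycles per DFT divided by clock frequency in MHz is the time per DFT in microseconds, the FOM is an area-time product, so a smaller value indicates a more efficient circuit under this definition. The function name and the numbers in the usage note are hypothetical and are not measurements from the presentation.

```python
def figure_of_merit(area_alms, cycles_per_dft, clock_mhz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) / Clock (MHz)."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    # cycles_per_dft / clock_mhz is microseconds per DFT, so the result
    # has units of ALM-microseconds.
    return area_alms * cycles_per_dft / clock_mhz
```

For a made-up circuit using 2000 ALMs that needs 256 cycles per DFT at 200 MHz, `figure_of_merit(2000, 256, 200)` evaluates to 2560.0.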

Comparative Features
• Transform size N not restricted to powers of two
• Circuit is scalable
• Uses block floating point and floating point
• Higher throughput
• Low computational latency
• Based on small, simple PE (adder), locally connected
• 1-D or 2-D transforms