

A New Class of High Performance FFTs

Dr. J. Greg Nash

Centar (www.centar.net)


High Performance Embedded Computing (HPEC) Workshop

19-21 September 2006

New Base-4 DFT Matrix Equation
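  • Traditional DFT matrix form: $Z = CX$, where $C$ is the $N \times N$ DFT matrix
  • New matrix form for DFT†: $Y = W_M \circ (C_{M1} X)$ and $Z = C_{M2}\, Y^{t}$, with $X$ and $Z$ arranged as $4 \times N/4$ arrays
  • CM1 and CM2 contain only elements from the set {±1, ±j}
    • CM1X and CM2Y^t involve only complex additions/subtractions
  • Twiddle factor matrix WM is N/4 × N/4, rather than N × N like C
    • 16× fewer multiplies than the traditional DFT equation (Z = CX)

$\circ$ = element-by-element multiply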

†J. G. Nash, “Computationally efficient systolic architecture for computing the discrete Fourier transform,” IEEE Transactions on Signal Processing, Volume 53, Issue 12, Dec. 2005, pp. 4640–4651.
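The new matrix form can be checked numerically. The Python sketch below is a minimal illustration, not the paper's derivation: the index mapping (x[n2 + M·n1] stored at X[n1][n2], DFT bin k1 + M·k2 read from Z[k2][k1], CM1[k1][n1] = (-j)^(k1·n1), CM2[k2][n2] = (-j)^(k2·n2), WM[k1][n2] = W^(k1·n2), with M = N/4 and W = e^(-j2π/N)) is our own assumption, chosen so that Y = WM ∘ (CM1 X) and Z = CM2 Y^t reproduce the DFT. With this particular mapping M must be a multiple of 4 (N a multiple of 16). The function names are hypothetical.

```python
import cmath

# (-j)^p for p = 0, 1, 2, 3. CM1 and CM2 contain only these values, so a
# "multiply" by one of them is just a sign change and/or real/imag swap.
NEG_J_POWERS = [1, -1j, -1, 1j]


def twiddle(e, n):
    """W^e with W = exp(-2*pi*j/n); the exponent is reduced mod n for accuracy."""
    return cmath.exp(-2j * cmath.pi * (e % n) / n)


def dft_direct(x):
    """Traditional DFT, Z = C x, with C[k][n] = W^(n*k): N^2 multiplies."""
    n = len(x)
    return [sum(x[t] * twiddle(t * k, n) for t in range(n)) for k in range(n)]


def base4_dft(x):
    """DFT via Y = WM o (CM1 X), Z = CM2 Y^t (assumed index mapping, M = N/4)."""
    n = len(x)
    assert n % 16 == 0, "this index mapping needs M = N/4 to be a multiple of 4"
    m = n // 4
    # X is 4 x M: X[n1][n2] = x[n2 + M*n1]
    X = [[x[n2 + m * n1] for n2 in range(m)] for n1 in range(4)]
    # CM1 is M x 4 and CM2 is 4 x M, filled with powers of -j only
    CM1 = [[NEG_J_POWERS[(k1 * n1) % 4] for n1 in range(4)] for k1 in range(m)]
    CM2 = [[NEG_J_POWERS[(k2 * n2) % 4] for n2 in range(m)] for k2 in range(4)]
    # WM is the M x M twiddle matrix: WM[k1][n2] = W^(k1*n2)
    WM = [[twiddle(k1 * n2, n) for n2 in range(m)] for k1 in range(m)]
    # Y = WM o (CM1 X): 4-term add/subtract sums, then M*M element-wise multiplies
    Y = [[WM[k1][n2] * sum(CM1[k1][n1] * X[n1][n2] for n1 in range(4))
          for n2 in range(m)] for k1 in range(m)]
    # Z = CM2 Y^t (4 x M): M-term add/subtract sums
    Z = [[sum(CM2[k2][n2] * Y[k1][n2] for n2 in range(m)) for k1 in range(m)]
         for k2 in range(4)]
    # Z[k2][k1] holds DFT bin k1 + M*k2
    return [Z[k // m][k % m] for k in range(n)]


x = [complex(t % 7 - 3, (t * t) % 5 - 2) for t in range(64)]
err = max(abs(a - b) for a, b in zip(base4_dft(x), dft_direct(x)))
print(f"max |base4 - direct| for N = 64: {err:.2e}")
```

Only the M × M = N²/16 entries of WM need general complex multiplies; every product with a CM1 or CM2 entry reduces to a sign change and/or a swap of real and imaginary parts, which is where the 16× reduction comes from.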

Find Systolic Architecture Using SPADE†

  • Inputs
    • Mathematical algorithm (input code, Maple syntax):

        for j to N/4 do
            for k to N/4 do
                # Y = WM ∘ (CM1 X): 4-term add/subtract sum, then one twiddle multiply
                Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
            od;
            for k to 4 do
                # Z = CM2 Y^t: (N/4)-term add/subtract sum
                Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..N/4);
            od
        od;

    • FPGA architectural constraints
      • 2-D mesh array
      • Fine-grained PEs (registers, adder, mux)
      • Linear arrays of multipliers, memory
    • Objective functions
  • Automatic search for space-time transformations, T
  • Outputs: simulator, graphical outputs

†Symbolic Parallel Algorithm Development Environment
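SPADE searches for the space-time transformation T automatically. The sketch below only checks one hand-picked candidate T for the Y loop nest, to show what such a transformation has to satisfy: no two computations on the same PE in the same time step, and every data dependence moving forward in time to a neighboring PE. The matrix T, the dependence vectors, and the helper names are our assumptions, not SPADE output or a SPADE API.

```python
from itertools import product

# Hypothetical candidate T (not SPADE output) for the Y loop nest.
# A loop index point p = (j, k, i) maps to T p = (PE row, PE column, time):
#   PE row = j, PE column = i, time step = j + k + i
T = [[1, 0, 0],
     [0, 0, 1],
     [1, 1, 1]]

# Dependence vectors of the Y computation (assumed):
DEPENDENCES = {
    "accumulate the sum over i": (0, 0, 1),
    "pass X[i,k] from PE row j to row j+1": (1, 0, 0),
}


def map_point(t, p):
    """Matrix-vector product T p."""
    return tuple(sum(t[r][c] * p[c] for c in range(3)) for r in range(3))


def check_space_time_map(t, n):
    """Check a candidate T on the index space 1 <= j, k <= N/4, 1 <= i <= 4."""
    m = n // 4
    points = list(product(range(1, m + 1), range(1, m + 1), range(1, 5)))
    images = [map_point(t, p) for p in points]
    # 1. No two computations land on the same PE in the same time step
    conflict_free = len(set(images)) == len(images)
    # 2. Each dependence moves forward in time, to the same or a neighboring PE
    local_and_causal = True
    for d in DEPENDENCES.values():
        d_row, d_col, d_time = map_point(t, d)
        if d_time < 1 or abs(d_row) + abs(d_col) > 1:
            local_and_causal = False
    n_pes = len({img[:2] for img in images})
    n_steps = max(img[2] for img in images) - min(img[2] for img in images) + 1
    return conflict_free, local_and_causal, n_pes, n_steps


print(check_space_time_map(T, 32))  # (True, True, 32, 18)
```

For N = 32 this candidate uses an N/4 × 4 = 8 × 4 grid of PEs; a real search would also score candidates against the objective functions.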

Functional Operation
  • Processing flow for DFT of length N = N1 * N2
    • Stage 1: N2 column DFTs (Xci) of length N1
    • Stage 2: Twiddle multiplication
    • Stage 3: N1 row DFTs (Xri) of length N2
  • Systolic adder arrays for matrix multiplication
    • N1/4 x 4 array for column multiplies CM1Xci and CM2Ytci
    • N2/4 x 4 array for row multiplies CM1Xri and CM2Ytri
      • N2/4 x 4 array is implemented virtually on one row of N1/4 x 4 array
      • Uses systolic 1-D array matrix multiplication
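The three-stage flow can be sketched in pure Python. The short column and row DFTs here call a direct DFT as a stand-in; in the described architecture they would use the base-4 matrix form on the systolic adder arrays. The storage order (x[N2·a + b] in row a, column b) and output order (bin k1 + N1·k2) are our assumption of one consistent choice, not taken from the slides.

```python
import cmath


def dft(v):
    """Direct DFT of any length, used here for the short column and row DFTs."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * (t * k % n) / n) for t in range(n))
            for k in range(n)]


def three_stage_dft(x, n1, n2):
    """Length-N DFT, N = n1 * n2, computed as columns, twiddles, then rows."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n1 x n2 array: row a, column b holds x[n2*a + b]
    # Stage 1: n2 column DFTs of length n1  ->  cols[b][k1]
    cols = [dft([x[n2 * a + b] for a in range(n1)]) for b in range(n2)]
    # Stage 2: twiddle multiplication by W_N^(b*k1)
    for b in range(n2):
        for k1 in range(n1):
            cols[b][k1] *= cmath.exp(-2j * cmath.pi * (b * k1 % n) / n)
    # Stage 3: n1 row DFTs of length n2  ->  rows[k1][k2]
    rows = [dft([cols[b][k1] for b in range(n2)]) for k1 in range(n1)]
    # DFT bin k1 + n1*k2 is found in row k1 at position k2
    return [rows[k % n1][k // n1] for k in range(n)]


x = [complex(t % 5 - 2, t % 3 - 1) for t in range(24)]
err = max(abs(a - b) for a, b in zip(three_stage_dft(x, 4, 6), dft(x)))
print(f"N = 24 (N1 = 4, N2 = 6): max error {err:.2e}")
```

Nothing in the decomposition itself requires N1 or N2 to be a power of two, as the N = 24 usage line shows.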
FFT Systolic Architecture

Example Architecture for N = 1024 (N1 = N2 = 32)

  • Simple PEs, locally connected
    • Higher clock speeds
    • Easier design/test/maintainability
    • Lower power
    • Efficient use of FPGA fabric
    • Simple control
  • Small memory blocks (one per PE)
    • Faster read/write times
    • Lower power
  • Linear structure (scales in the north-south direction)
    • Matches the FPGA fabric's linearly distributed embedded elements (e.g., memory and multipliers)
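A back-of-the-envelope tally for the example N = 1024 design, using the N1/4 × 4 adder-array size from the Functional Operation slide and the N/4 × N/4 size of WM. The multiply count is an assumption-laden upper bound: it charges one general complex multiply per WM entry and per Stage 2 twiddle, and does not remove trivial factors such as W^0 = 1. The helper name is hypothetical.

```python
def example_design_counts(n1, n2):
    """Rough bookkeeping for an N = n1 * n2 transform (assumptions noted inline)."""
    n = n1 * n2
    # Column-DFT adder array is N1/4 x 4 PEs
    pe_rows, pe_cols = n1 // 4, 4
    # Array built by replicating identical 4 x 4 PE blocks
    blocks_4x4 = (pe_rows * pe_cols) // 16
    # Upper bound on complex multiplies: (L/4)^2 WM entries per length-L DFT
    column_mults = n2 * (n1 // 4) ** 2   # Stage 1: n2 column DFTs of length n1
    twiddle_mults = n                    # Stage 2: one twiddle per element
    row_mults = n1 * (n2 // 4) ** 2      # Stage 3: n1 row DFTs of length n2
    return {
        "pe_array": (pe_rows, pe_cols),
        "blocks_4x4": blocks_4x4,
        "complex_mults": column_mults + twiddle_mults + row_mults,
        "direct_dft_mults": n * n,       # traditional Z = CX
    }


print(example_design_counts(32, 32))
# {'pe_array': (8, 4), 'blocks_4x4': 2, 'complex_mults': 5120, 'direct_dft_mults': 1048576}
```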
Enhanced Functionality
  • Transform size N not restricted to powers of two
    • N = 256n (n = 1, 2, 3, ...)
    • More reachable points
    • Uniform distribution of points
  • Circuit is scalable
    • Any DFT size can be computed on the same hardware with sufficient memory
    • Larger FFT circuits constructed by replication of identical 4x4 PE array processing blocks
  • Low computational latency
    • Small pipeline depth compared with traditional pipelined FFTs
  • 1-D and 2-D transforms possible on the same circuit
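A small count makes "more reachable points" and "uniform distribution of points" concrete. Up to an arbitrary cutoff of 4096, N = 256n gives sizes spaced 256 apart, while powers of two (starting at the same 256) double at each step. The function name and the cutoff are illustrative choices.

```python
def reachable_sizes(limit):
    """Transform sizes up to `limit`: N = 256n versus powers of two (from 256)."""
    multiples_of_256 = list(range(256, limit + 1, 256))
    powers_of_two = []
    size = 256
    while size <= limit:
        powers_of_two.append(size)
        size *= 2
    return multiples_of_256, powers_of_two


m256, pow2 = reachable_sizes(4096)
print(len(m256), len(pow2))  # 16 5
print(pow2)                  # [256, 512, 1024, 2048, 4096]
```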
Block Floating Point/Floating Point Operation
  • Multiple “regions”, each with its own block floating point and floating point circuitry (32 regions in a 1024-point FFT)
    • Column DFTs use block floating point and row DFTs use floating point
    • Higher dynamic range and higher signal-to-noise ratio (lower roundoff noise)
  • Number of regions increases with transform size
  • Supports streaming FFTs
  • Comparison using “single tone” data sets with random frequency and phase (DR = dynamic range, “noise” = roundoff noise)
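Block floating point can be illustrated with a generic sketch in which one exponent is shared by a whole block of values. It shows why splitting the data into multiple "regions", each with its own exponent, helps dynamic range: a small value that shares an exponent with a large one loses all of its mantissa bits. The 12-bit mantissa, the rounding rule, and the function name are illustrative assumptions, not the circuit's actual number format.

```python
import math


def bfp_quantize(block, mantissa_bits=12):
    """Quantize real values with ONE exponent shared by the whole block.

    mantissa_bits = 12 is an illustrative width, not the circuit's format.
    Returns the reconstructed values and the shared exponent.
    """
    peak = max(abs(v) for v in block)
    if peak == 0:
        return [0.0] * len(block), 0
    # peak = f * 2**exponent with 0.5 <= f < 1
    exponent = math.frexp(peak)[1]
    scale = 2.0 ** (mantissa_bits - 1 - exponent)
    limit = 2 ** (mantissa_bits - 1) - 1
    # Signed integer mantissas, clamped to the mantissa range
    mantissas = [max(-limit, min(limit, round(v * scale))) for v in block]
    return [q / scale for q in mantissas], exponent


# One block holding a large and a small value: the small one underflows to 0
shared, _ = bfp_quantize([1000.0, 0.001])
# Two "regions", each with its own exponent: the small value survives
separate = [bfp_quantize([v])[0][0] for v in (1000.0, 0.001)]
print(shared, separate)
```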
Performance Comparison: 256-point DFT
  • Altera block floating point circuit
  • “Streaming” (continuous data in and out)
  • Comparable dynamic range and signal to (roundoff) noise ratio
  • Both circuits mapped to Altera Stratix II EP2S15F484C3 FPGA
  • Altera circuit from Megacore FFT v2.2.0
  • Results from timing analysis (Altera Quartus 5.1 software)
Preliminary Figure of Merit
  • Altera block floating point circuits
  • “Streaming” (continuous data in and out)
  • Comparable dynamic range and signal to noise ratio
  • Circuits mapped to Altera Stratix II FPGAs
  • Altera circuit from Megacore FFT v2.2.0

FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz)

*Estimate (no timing analysis or layout)
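Because cycles per DFT divided by the clock frequency in MHz is the time per DFT in microseconds, the FOM is an area × time product, and a smaller value means less area-time per transform. A minimal helper follows; the inputs in the usage lines are made up for illustration and are not results from these slides.

```python
def figure_of_merit(area_alms, cycles_per_dft, clock_mhz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) / Clock (MHz).

    cycles / MHz = microseconds per DFT, so the FOM is an area-time product in
    ALM-microseconds: a smaller value means less area x time per transform.
    """
    return area_alms * cycles_per_dft / clock_mhz


# Made-up inputs, NOT measured results: doubling the clock halves the FOM
print(figure_of_merit(2000, 256, 200.0))  # 2560.0
print(figure_of_merit(2000, 256, 400.0))  # 1280.0
```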

Comparative Features
  • Transform size N not restricted to powers of two
  • Circuit is scalable
  • Uses block floating point and floating point
  • Higher throughput
  • Low computational latency
  • Based on small, simple PE (adder), locally connected
  • 1-D or 2-D transforms