
Automatic Generation of Customized Discrete Fourier Transform IPs

Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel. Carnegie Mellon University. This project is supported in part by NSF awards ITR/NGS-0325687 and SYS-0310941 and a DARPA DESA program. www.spiral.net


Presentation Transcript


  1. Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University This project is supported in part by NSF awards ITR/NGS-0325687 and SYS-0310941 and a DARPA DESA program www.spiral.net

  2. The Paradox of Reusable IPs • Boon to productivity • zero effort required • zero knowledge required • zero chance to introduce new bugs • Why repeat what has already been done? • Bane to optimality • finding the right functionality with the right interface • design tradeoffs: performance, area, power, accuracy, ... • Are you getting what you really wanted? • Solution: parameterized automatic IP generators • zero effort, knowledge or bugs • allows application-specific customization • facilitates design exploration

  3. Our Work: Discrete Fourier Transform IPs • Discrete Fourier Transform (DFT) • important building block in DSP applications • numerous design “cores” available • Current IP libraries support: • various sizes, number formats, data orderings • only a small number of microarchitecture choices • (Xilinx LogiCore DFT gives 3 choices) • We generate IPs with custom design tradeoffs • degree of parallelism in microarchitecture (min to max) • resource preference (e.g. BRAM vs. slices in FPGAs) • Extensible to other common linear DSP transforms

  4. Outline • Introduction • Formula-Driven Design Generation • Microarchitecture Parameterization • Generator User Interface • Experimental Results • Conclusions

  5. Transforms as Formulas [www.spiral.net] • Transform computation is represented as matrix-vector multiplication • Matrix-vector multiplication is O(n^2) operations • “Fast” algorithms factor the transform into a sequence of structured sparse matrices • O(n log n) operations • DFT: [formula not preserved] • FFT: [formula not preserved] • Datapath easily formed from factorized formulas
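The contrast on slide 5 between the O(n^2) matrix-vector product and an O(n log n) factorization can be sketched in Python. This is a minimal illustration, not the formulas from the slide (which did not survive in the transcript): `dft_naive`, `fft_recursive`, and `max_error` are names chosen here, and the fast version is the textbook radix-2 Cooley-Tukey recursion, which in SPIRAL-style notation is commonly written as DFT_{mn} = (DFT_m ⊗ I_n) T (I_m ⊗ DFT_n) L.

```python
import cmath

def dft_naive(x):
    """Direct product with the DFT matrix: about n^2 multiply-adds."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]

def fft_recursive(x):
    """Textbook radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_recursive(x[0::2])   # half-size DFT of even-indexed samples
    odd = fft_recursive(x[1::2])    # half-size DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def max_error(a, b):
    """Largest element-wise difference between two sequences."""
    return max(abs(u - v) for u, v in zip(a, b))
```

Calling `max_error(dft_naive(x), fft_recursive(x))` on any power-of-two-length input should return a value near machine precision, which confirms that the factorized form computes the same transform.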

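  6. Formula to Datapath • Given y = Mx, where M is: • a product of two matrices: apply one factor, then the other • a permutation: permute • a tensor product of an identity matrix with a matrix A: apply A, several times in parallel • a diagonal: scale • [Figure: example datapath built from blocks A and B with scaling elements; symbols not preserved]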
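The mapping on slide 6 from formula constructs to datapath elements can be sketched as a small set of Python combinators. All names here (`compose`, `permute`, `parallel`, `diagonal`, `butterfly`) are hypothetical and chosen for illustration; they are not the generator's API. The example assembles a 4-point DFT from the standard Cooley-Tukey factorization, assuming the twiddle convention ω = e^{-2πi/4} = -i.

```python
def compose(*ops):
    """Matrix product: the rightmost factor is applied first."""
    def run(x):
        for op in reversed(ops):
            x = op(x)
        return x
    return run

def permute(perm):
    """Permutation: output i takes input perm[i] (pure wiring in hardware)."""
    return lambda x: [x[j] for j in perm]

def parallel(count, op, width):
    """I_count tensor A: apply op to each of count consecutive blocks."""
    def run(x):
        out = []
        for b in range(count):
            out.extend(op(x[b * width:(b + 1) * width]))
        return out
    return run

def diagonal(scales):
    """Diagonal matrix: scale element i by scales[i]."""
    return lambda x: [s * v for s, v in zip(scales, x)]

def butterfly(x):
    """DFT_2: the two-point butterfly."""
    return [x[0] + x[1], x[0] - x[1]]

stride = permute([0, 2, 1, 3])          # stride permutation on 4 points
dft4 = compose(
    stride, parallel(2, butterfly, 2), stride,  # DFT_2 tensor I_2
    diagonal([1, 1, 1, -1j]),                   # twiddle factors
    parallel(2, butterfly, 2),                  # I_2 tensor DFT_2
    stride,                                     # input reordering
)
```

Reading the `compose` arguments from bottom to top gives the order in which data flows through the hardware, which mirrors the right-to-left application of the formula.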

  7. Outline • Introduction • Formula-Driven Design Generation • Microarchitecture Parameterization • Generator User Interface • Experimental Results • Conclusions

  8. Pease DFT • Simple regular structure embodied in formula • Example: k stages, each consisting of a permutation, a diagonal, and parallel butterflies • [Figure: annotated formula with stages 1 to 3; symbols not preserved]

  9. Pease DFT Example: DFT8 • datapath is built left to right (stage 1, stage 2, stage 3) • formula is applied from right to left • Repeating column structure allows hardware reuse without performance penalty
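The repeating column structure on slides 8 and 9 can be illustrated with a constant-geometry radix-2 FFT, in which every stage uses the same wiring and only the twiddle factors change. The sketch below is in the spirit of the Pease algorithm but is an assumption-laden variant, not the exact formula from the slides: each stage pairs element i with element i + n/2 and writes results to positions 2i and 2i + 1, and the result comes out in bit-reversed order before the final reordering. The names `bit_reverse` and `constant_geometry_fft` are chosen here.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def constant_geometry_fft(x):
    """Radix-2 FFT whose k stages all share one butterfly/shuffle pattern."""
    n = len(x)
    k = n.bit_length() - 1
    if n < 2 or (1 << k) != n:
        raise ValueError("length must be a power of two >= 2")
    half = n // 2
    w = [cmath.exp(-2j * cmath.pi * e / n) for e in range(half)]
    data = list(x)
    for s in range(k):                      # identical structure per stage
        nxt = [0j] * n
        for i in range(half):
            a, b = data[i], data[i + half]  # same pairing in every stage
            e = (i >> s) << s               # stage-dependent twiddle exponent
            nxt[2 * i] = a + b
            nxt[2 * i + 1] = (a - b) * w[e]
        data = nxt
    # Stages leave the spectrum in bit-reversed order; undo that here.
    return [data[bit_reverse(q, k)] for q in range(n)]
```

Because the inner loop body is the same for every value of `s`, a hardware implementation can in principle reuse one column of butterflies for all stages, which is the property the slide highlights.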

  10. Horizontal folding • our baseline design • degree of freedom: vertical parallelism • parameter p • [Figure labels: p, register, input bypass]

  11. Vertical (V-)folding according to p • [Figure: latency versus cost] • Fine-grained control over cost/latency tradeoff
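The cost/latency tradeoff on slides 10 and 11 can be approximated with a back-of-the-envelope model. The sketch below assumes, as a simplification not stated in the slides, that horizontal folding leaves one physical column reused for all log2(n) stages, that p is the number of radix-2 butterflies instantiated in that column, and that each butterfly completes one operation per cycle. Pipeline depth, I/O, and memory effects are ignored, and `folded_schedule` is a hypothetical name.

```python
import math

def folded_schedule(n, p):
    """Idealized butterfly count and cycle count for a folded radix-2 datapath.

    Assumptions (not from the slides): one column reused for every stage,
    p butterflies in the column, one butterfly operation per cycle each.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    if not 1 <= p <= n // 2:
        raise ValueError("p must be between 1 and n/2")
    stages = n.bit_length() - 1            # log2(n) passes through the column
    cycles_per_stage = math.ceil((n // 2) / p)
    return {"butterflies": p, "cycles": stages * cycles_per_stage}
```

Under this model, `folded_schedule(1024, 1)` gives 5120 cycles with one butterfly, while `folded_schedule(1024, 32)` gives 160 cycles with 32 butterflies, so each doubling of p roughly halves latency at roughly double the arithmetic cost.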

  12. Outline • Introduction • Formula-Driven Design Generation • Microarchitecture Parameterization • Generator User Interface • Experimental Results • Conclusions

  13. User Interface • common DFT options • customization options • http://www.spiral.net/hardware/dftgen.html

  14. Outline • Introduction • Formula-Driven Design Generation • Microarchitecture Parameterization • Generator User Interface • Experimental Results • Conclusions

  15. Evaluation • We compare against Xilinx LogiCore DFT Ver. 3.1 • radix-4 burst I/O interface We compare Xilinx’s fixed design against our variable generated designs • Comparison • DFT n = {64, 1024, 2048}; width = 16; bit-reversed output • Xilinx ISE ver. 6.1, Xilinx Virtex2-Pro XC2VP100-6

  16. DFT1024 relative to Xilinx • storage: 1.0 = 7 BRAMs • logic: 1.0 = 1955 slices • performance: 1.0 = 1 / 5.6 µsec • Performance and resources scale with p
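The normalization used in these plots can be written out explicitly. In the sketch below only the Xilinx DFT1024 baseline values (7 BRAMs, 1955 slices, 5.6 µsec per transform) come from the slides; the function name `relative_to_baseline` and the dictionary keys are hypothetical, and any design passed in is the caller's own data, not a result from this work.

```python
# Baseline from slide 16: Xilinx LogiCore DFT, n = 1024.
XILINX_DFT1024 = {"brams": 7, "slices": 1955, "usec_per_transform": 5.6}

def relative_to_baseline(design, baseline=XILINX_DFT1024):
    """Express a design's resources and throughput relative to a baseline.

    Performance is throughput (transforms per unit time), so a shorter
    transform time yields a relative performance above 1.0.
    """
    return {
        "storage": design["brams"] / baseline["brams"],
        "logic": design["slices"] / baseline["slices"],
        "performance": baseline["usec_per_transform"] / design["usec_per_transform"],
    }
```

By construction, `relative_to_baseline(XILINX_DFT1024)` returns 1.0 on all three axes, which is the reference point marked "Xilinx" in the plots.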

  17. Resource usage preferences • [Plots: relative BRAMs, relative slices, and speedup versus p = 1, 2, 4, 8, 16, 32, with the Xilinx design as reference] • storage: 1.0 = 7 BRAMs • logic: 1.0 = 1955 slices • performance: 1.0 = 1 / 5.6 µsec

  18. Resource usage preferences • storage: 1.0 = 7 BRAMs • logic: 1.0 = 1955 slices • performance: 1.0 = 1 / 5.6 µsec • exchange BRAM for slices • very little change in performance • Can control tradeoff between slices and BRAMs

  19. DFT64 and DFT2048 • DFT64 (Xilinx reference): 1.0 = 1 transform / 0.648 µsec; 1.0 = 8 BRAMs; 1.0 = 1743 slices • DFT2048 (Xilinx reference): 1.0 = 1 transform / 24.578 µsec; 1.0 = 7 BRAMs; 1.0 = 2140 slices • Trends hold for sizes 64, 2048

  20. Related Work • Kumhom, Johnson, Nagvajara, ASIC/SOC 2000 • universal FFT processor microarchitecture based on processing elements interconnected by an on-chip reconfigurable network • microarchitecture is scalable in the number of elements • supports both Cooley-Tukey and Pease • Choi, Scrofano, Prasanna, Jang, FPGA 2003 • mapped radix-4 Cooley-Tukey algorithm onto log_2(n)/2 DFT4 primitives • datapath scalable between 1 element and 4 elements at a time • show energy and performance improvements from scaling

  21. Conclusions • Parameterized DFT IP generator • matrix formula-driven synthesis • performance/cost tradeoff • fine-grained control over resources vs. latency • resource usage preference • can balance tradeoff between slices and BRAM • Key results • efficient: the Xilinx design point can be matched • customizable: design tradeoffs directly controllable • easy to use: simple yet powerful web interface

  22. Web Generator • http://www.spiral.net/hardware/dftgen.html • This work is part of the SPIRAL project, which aims to push the limits of automation in software and hardware development for DSP algorithms. For more information visit: www.spiral.net
