Generating FPGA-Accelerated DFT Libraries

Generating FPGA-Accelerated DFT Libraries. Chi-Li Yu Nov. 13, 2007. Overview. Application: 1D/2D Discrete Fourier Transform Problem: Hardware-Software Partitioning Acceleration Based on FPGA Results (compared to software-only solution): Up to 7.5 times higher performance


Presentation Transcript


  1. Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007

  2. Overview • Application: • 1D/2D Discrete Fourier Transform • Problem: • Hardware-Software Partitioning • Acceleration Based on FPGA • Results (compared to software-only solution): • Up to 7.5 times higher performance • Up to 2.5 times better energy efficiency

  3. Why DFT? • Discrete Fourier Transform (DFT) is an important primitive underlying many DSP applications. • Imaging/speech processing • Communication systems • Computation-intensive • Data/memory-intensive

  4. Review of DFT The direct computation requires N^2 complex multiplies and N(N-1) complex additions. When N is a power of two (N = 2^p), the FFT reduces this cost.
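As a minimal sketch of the direct computation reviewed on this slide, the following evaluates the DFT sum term by term and reports the operation counts quoted above. The function names and the negative-exponent sign convention are assumptions made for illustration; the slides do not state them.

```python
import cmath


def naive_dft(x):
    """Direct DFT: X[k] = sum over j of x[j] * exp(-2*pi*i*j*k/N)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def direct_op_counts(n):
    """Complex multiplies and complex additions of the direct sum: N^2 and N(N-1)."""
    return n * n, n * (n - 1)


if __name__ == "__main__":
    # A unit impulse transforms to a constant spectrum.
    print(naive_dft([1, 0, 0, 0]))
    print(direct_op_counts(8))
```

Each of the N outputs needs N multiplies and N-1 additions, which is where the two counts come from.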

  5. Pipelined streaming architecture of FFT Data flow diagram of Fast Fourier Transform (FFT) Pipelined streaming architecture (Throughput: 1 sample/clock)
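The data flow diagram on this slide is not preserved in the transcript. As a rough software illustration of the same structure, the sketch below is an iterative radix-2 decimation-in-time FFT with one loop pass per stage. It is an assumption about the general shape of such a data flow, not the architecture used in the paper; in a pipelined streaming design each stage would be a separate hardware block so that one sample can enter per clock.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # Reorder the inputs so every stage can work on adjacent blocks.
    a = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    size = 2
    while size <= n:  # log2(n) stages in total
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):  # n/2 butterflies per stage overall
                t = w * a[start + k + half]
                u = a[start + k]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        size *= 2
    return a
```

The outer `while` loop corresponds to the columns of a butterfly diagram, and the two inner loops to the butterflies inside one column.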

  6. Problem • Pure hardware implementation • N should be a power of two • N is usually fixed • Arbitrary-sized DFTs are hard to implement • Flexible programmability/Fast execution time • Hardware-Software heterogeneous architecture • HW-SW partitioning

  7. Principles of HW-SW partitioning Xilinx Virtex-II Pro Platform FPGA • Hardware: • The most computation intensive kernels that are conducive to hardware acceleration are extracted from an algorithm and realized as hardware. • Software: • Remaining computations are carried out in software. • Control-intensive part.

  8. Xilinx Virtex-II Pro Platform FPGA • Field-Programmable Gate Array (FPGA) • Process: 0.13 µm, 1.5 V • Flexible Logic Resources • Up to 1M gate-count capacity • Up to 8 Mb of True Dual-Port RAM • Embedded IBM PowerPC 405 RISC processor blocks • provide performance up to 400 MHz

  9. The way to achieve hardware acceleration for DFT When considering power-of-two problem sizes (i.e., DFTs on 2^p points), we only need to consider power-of-two-sized DFT kernels (i.e., DFT_{2^q}). By off-loading the appropriate kernels into hardware, the software receives the benefit of hardware acceleration and yet can still compute arbitrary-sized DFTs on top of the available kernels.
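A minimal sketch of this idea, assuming a single fixed kernel size: software recursively splits a 2^p-point DFT with radix-2 decimation in time until the sub-problems match the kernel size, then hands them to the kernel. `KERNEL_SIZE` and `kernel_dft` are hypothetical stand-ins; in the real system the kernel would be a generated hardware core, and the actual library may decompose the problem differently.

```python
import cmath

KERNEL_SIZE = 4  # hypothetical hardware core size 2^q


def kernel_dft(x):
    """Stand-in for the hardware DFT core; computes a KERNEL_SIZE-point DFT directly."""
    n = len(x)
    assert n == KERNEL_SIZE, "the core only handles its native size"
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_pow2(x):
    """DFT of 2^p points (2^p >= KERNEL_SIZE) built in software on top of kernel_dft."""
    n = len(x)
    if n == KERNEL_SIZE:
        return kernel_dft(x)
    assert n > KERNEL_SIZE and n % 2 == 0, "size must be a power-of-two multiple of the kernel"
    even = dft_pow2(x[0::2])  # DFT of even-indexed samples
    odd = dft_pow2(x[1::2])   # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle multiply, done in software
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

For a 16-point input this calls the 4-point kernel four times and performs the remaining two butterfly levels in software.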

  10. Research problem • Different kernels in hardware yield • Different performance (e.g., operations per second) • Different resource usage (e.g., logic, number of BRAMs, or power consumption). • DFT partitioning problem • Selecting the appropriate set of throughput-optimized, power-of-two-sized DFT cores to satisfy a given resource constraint (logic, power, energy) while maximizing a scalar metric, such as performance.
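The selection problem can be sketched as a small exhaustive search. Everything below is hypothetical: the core sizes, area costs, runtimes, and the cost model (each library size uses whichever selected core, or software, is fastest) are made-up illustrations, not data or the search method from the paper.

```python
from itertools import combinations

# Hypothetical area cost per candidate core size (arbitrary units).
CORE_AREA = {64: 1.0, 256: 2.0, 1024: 4.0}

# Hypothetical runtime of each library problem size when run in software only
# (key None) or with a given core available.
RUNTIME = {
    64: {None: 10.0, 64: 2.0, 256: 6.0, 1024: 9.0},
    256: {None: 50.0, 64: 30.0, 256: 8.0, 1024: 20.0},
    1024: {None: 240.0, 64: 160.0, 256: 90.0, 1024: 32.0},
}


def library_runtime(cores):
    """Total runtime when each problem size uses its fastest available option."""
    options = (None,) + tuple(cores)
    return sum(min(times[o] for o in options) for times in RUNTIME.values())


def best_core_set(area_budget):
    """Try every subset of cores that fits the budget and keep the fastest one."""
    best = ((), library_runtime(()))  # software-only baseline
    for r in range(1, len(CORE_AREA) + 1):
        for cores in combinations(sorted(CORE_AREA), r):
            if sum(CORE_AREA[c] for c in cores) <= area_budget:
                t = library_runtime(cores)
                if t < best[1]:
                    best = (cores, t)
    return best


if __name__ == "__main__":
    for budget in (0.0, 3.0, 4.0, 7.0):
        print(budget, best_core_set(budget))
```

With these made-up numbers, a budget of 3.0 selects the 64- and 256-point cores, while a budget of 4.0 prefers the single 1024-point core, which shows how the best set changes with the constraint.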

  11. Test platform based on the FPGA • Notice that the data cache of PowerPC is 16kB.

  12. Architecture of the generated hardware DFT IP cores FPGA

  13. DFT Performance (N is a power-of-two) • The highest performance is reached at the core’s native size. • Data does not fit into data cache at N = 8192. • Memory bandwidth becomes the main bottleneck and practically reduces all possible speedups.
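The cache observation can be checked with simple arithmetic. The 16 kB data cache size comes from the test-platform slide; the 4 bytes per complex sample (16-bit real and imaginary parts) is an assumption not stated in the slides, chosen here only because it is consistent with data no longer fitting at N = 8192. The sketch also ignores twiddle factors and any other data sharing the cache.

```python
DCACHE_BYTES = 16 * 1024  # PowerPC data cache size from the test-platform slide
BYTES_PER_SAMPLE = 4      # ASSUMPTION: 16-bit real + 16-bit imaginary per point


def fits_in_dcache(num_points, bytes_per_sample=BYTES_PER_SAMPLE):
    """True if the input vector alone fits in the data cache."""
    return num_points * bytes_per_sample <= DCACHE_BYTES


if __name__ == "__main__":
    for n in (2048, 4096, 8192):
        kib = n * BYTES_PER_SAMPLE // 1024
        print(n, "points:", kib, "kB, fits:", fits_in_dcache(n))
```

Under the same assumption a 64x128 two-dimensional problem also holds 8192 points, so it exceeds the cache in the same way.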

  14. DFT Performance (N is not a power-of-two) N = 3*2^k and N = 5*2^k. Radix-3 and radix-5 operations are done in software.
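One way to picture this split is a single Cooley-Tukey step: the input of length N = r*2^k is divided into r interleaved subsequences, each 2^k-point DFT goes to the power-of-two routine, and the radix-r combination runs in software. The sketch below assumes this textbook decomposition; `pow2_dft` is a hypothetical stand-in for the accelerated kernel, and the paper's generated code may be organized differently.

```python
import cmath


def pow2_dft(x):
    """Stand-in for a power-of-two DFT (for example a hardware core); direct sum here."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_radix_r(x, r):
    """DFT of N = r * 2^k points: power-of-two sub-DFTs plus a software radix-r step."""
    n = len(x)
    assert n % r == 0, "length must be a multiple of the radix"
    m = n // r
    # r interleaved subsequences, each transformed by the power-of-two routine.
    subs = [pow2_dft(x[j::r]) for j in range(r)]
    # Software radix-r combination: X[k] = sum_j W_N^(j*k) * Y_j[k mod m].
    return [
        sum(cmath.exp(-2j * cmath.pi * j * k / n) * subs[j][k % m] for j in range(r))
        for k in range(n)
    ]
```

Calling `dft_radix_r(x, 3)` on 12 samples uses three 4-point sub-DFTs, and `dft_radix_r(x, 5)` on 20 samples uses five.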

  15. DFT Precision

  16. 1D DFT with different core sizes Up to 7.5 times speedup. The best choice depends on the targeted applications. For small problem sizes, software is the most energy-efficient choice.

  17. 2D DFT with different core sizes Up to 4 times speedup. Again, for small problem sizes, software is the most energy-efficient choice. All sizes larger than or equal to 64x128 do not fit into data cache of PPC, which leads to a performance degradation.

  18. Area/performance There is also a 3 times variation in the power consumed by the DFT calculations. In other words, by allowing up to 3 times more power (or 4 times more area) to be consumed, one can speed up a whole library up to 4 times (averaged across the library).

  19. Power/performance There is a 4 times variation in both area consumption and normalized runtime across all possible.

  20. Conclusions In the experiments on a Xilinx Virtex-II Pro, the automatically partitioned and generated FPGA-accelerated library has between 2 and 7.5 times higher performance and up to 2.5 times better energy efficiency than the software-only version. We have integrated this approach in the “Spiral linear-transform code-generation framework” to support push-button automatic implementation.

  21. Conclusions Architectures with tightly integrated FPGAs and general purpose processors are starting to play an important role in both embedded and high performance computing settings. The tight integration makes it possible to offload fine and coarse grain functionalities from processors to the FPGA fabric, combining the strengths of both components.

  22. My critique of this paper • Strengths: • Detailed analysis of the HW-SW partitioning. • Comparisons of performance and energy efficiency are very valuable. • Weaknesses: • 2D DFT on this platform is not efficient. • Communication between the PPC and the FPGA slows down the whole operation.

  23. What is relevant to our class? A heterogeneous architecture combining two different cores: one RISC CPU and one programmable hardware fabric, the FPGA. The discussion of the power consumption of this kind of platform is interesting.

  24. What is relevant to our project? • The same application • Discrete Fourier Transform. • The same platform • Xilinx FPGA • Reduce the workload of the PPC. • Introduce the concept of multi-core architectures to our hardware design.

  25. Paper Paolo D’Alberto et al., “Generating FPGA-Accelerated DFT Libraries,” in Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'07), pp. 173-184, Napa Valley, CA, USA, April 23-25, 2007.
