Optimizing the fast fourier transform on a multi core architecture
Download
1 / 28

Optimizing the Fast Fourier Transform on a Multi-core Architecture - PowerPoint PPT Presentation


  • 168 Views
  • Uploaded on

Optimizing the Fast Fourier Transform on a Multi-core Architecture. Presentation by Tao Liu. Long Chen; Ziang Hu; Junmin Lin; Gao, G.R.; IEEE International Parallel and Distributed Processing Symposium, 2007. Outline. FFT Introduction FFT Implementation-State of Art Introduce this paper

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Optimizing the Fast Fourier Transform on a Multi-core Architecture' - trinh


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Optimizing the fast fourier transform on a multi core architecture

Optimizing the Fast Fourier Transform on a Multi-core Architecture

Presentation by Tao Liu

Long Chen; Ziang Hu; Junmin Lin; Gao, G.R.;

IEEE International Parallel and Distributed Processing Symposium, 2007.


Outline
Outline Architecture

  • FFT Introduction

  • FFT Implementation-State of Art

  • Introduce this paper

  • Summary

  • Relationship to our project

  • Q&A


Fft introduction
FFT Introduction Architecture

  • Media Application (2D)

    256* 256 256*256

    • Original picture and video require huge number of data

    • DFT can reconstruct the picture with less data.

  • Communication Application (1D)


Fft introduction cont
FFT Introduction (cont.) Architecture

  • Fundamental Mathematics

  • 2D

    N*N -> N*N

  • 1D Example


Fft introduction cont1
FFT Introduction (cont.) Architecture

  • 8-points 1D FFT


Outline1
Outline Architecture

  • FFT Introduction

  • FFT Implementation-State of Art

  • Introduce this paper

  • Summary

  • Relationship to our project

  • Q&A


Fft implementation state of art
FFT Implementation-State of Art Architecture

  • Hypercube architecture [9, 13, 17]

  • Arrays architecture [12]

  • Mesh architectures[18]

  • An instructive shared-memory FFT for Alliant FX/8 [19].

  • Additional performance studies on shared-memory FFT [3, 16]

  • Memory hierarchy to an effective FFT implementation[4]

  • CRAY-2 [5]

  • EARTH[20], a finegrained dataflow architecture.

  • etc


Outline2
Outline Architecture

  • FFT Introduction

  • FFT Implementation-State of Art

  • Introduce this paper

  • Summary

  • Relationship to our project

  • Q&A


About this paper
About this paper Architecture

  • Platform: IBM Cyclops-64 (C64)

  • Optimization on 1D FFT

  • Optimization on 2D FFT


Ibm cyclops 64 c64
IBM Cyclops-64 (C64) Architecture

  • 80 64-bit processors, each processor 1 floating point unit (FPU) + 2 thread units (TUs).

  • 64 64-bit registers and 32 KB SRAM.

  • 16 shared instruction caches (ICs)

  • 4 off-chip DRAM controllers,

  • Crossbar network with a large number of ports (96×96), which sustains a 4GB/s bandwidth per port, thus 384GB/s in total.

  • Memory: scratch-pad (SP) memory, on-chip global interleaved memory (GM), and off-chip DRAM

  • GigaBit Ethernet controller and other I/O devices

  • Etc.


Ibm cyclops 64 c641
IBM Cyclops-64 (C64) Architecture

  • 80 cores 160 threads


1d fft
1D FFT Architecture

  • Base Parallel Implementation

  • Optimal Work Unit

  • Special Handling of the First Stages

  • Eliminating Unnecessary Memory Operations

  • Loop Unrolling

  • Register Renaming and Instruction Scheduling (Manually)

  • Memory Hierarchy Aware Compilation


1d fft cont
1D FFT (Cont.) Architecture


Optimal work unit
Optimal Work Unit Architecture

  • 1 butterfly operation

    • 1. read 2-point data and the twiddle factor from GM

    • 2. perform a bufferfly operations upon them, then

    • 3. write the 2-point results back to GM

    • 1 butterfly operation -> Thread


Optimal work unit1
Optimal Work Unit Architecture

  • 1 butterfly operation

  • 4 butterfly operations


Optimal work unit2
Optimal Work Unit Architecture

  • Number of cycles per butterfly operation VS the the size of work unit (8 point is the best)

  • Register spilling for large WU (Need 112 for 16-point)


2d fft
2D FFT Architecture

  • Base Parallel Implementation

  • Load Balancing

  • Work Distribution and Data Reuse

  • Memory Hierarchy Aware Compilation


2d fft base parallel implementation
2D FFT –Base Parallel Implementation Architecture

  • Base Parallel Implementation

    • All row FFTs are independent of each other, so they can be computed in parallel.

    • Row -> Column

    • Work units are distributed to threads in the round-robin way

    • Some threads remain idle (e.g. 180 rows, 160 threads)


2d fft load balancing
2D FFT –Load Balancing Architecture

  • Load Balancing

  • entire row/column1D FFT as a work unit -> small tasks

  • Small task: 8-point work unit. (8 input<-> 8 output)

  • it needs more barriers to synchronize threads working on the same row/column FFT


Results of this paper
Results of This Paper Architecture

  • 1D FFT 2^16 points


Results of this paper1
Results of This Paper Architecture

  • 2D FFT 256*256


Outline3
Outline Architecture

  • FFT Introduction

  • FFT Implementation-State of Art

  • Introduce this paper

  • Summary

  • Relationship to our project

  • Q&A


Summary
Summary Architecture

  • Introduction

    • Implemented and optimized FFT on a specific architecture –IBM Cyclops-64

  • Strengths

    • Several optimization techniques are proposed. Achieved performances are over 20Gflops for both 1D FFT and 2D FFT, which is about 4 times of Intel Xeon Pentium 4 processor (5.5Glops) (Essentiality: reduce memory operation)

  • Weakness

    • The fast scratchpad memory (SP) for each thread unit has not been fully explored.

    • Memory issue of large FFT size where data cannot be fully stored in on-chip memories.


Relationship to our project
Relationship to our project Architecture

  • Our project

    • Implement FFT on a FPGA platform

    • Architecture is changeable.

    • #cores/Memory/cache/

    • Performance/power etc

  • What we can use from this paper

    • Task partition & Distribution

    • Data Reuse

    • etc


Q&A Architecture


Reference
Reference Architecture

  • [1] R. Agarwal and J. Cooley. Vectorized mixed radix discrete fourier transform algorithms. In Proceedings of the IEEE,volume 75, pages 1283– 1292, 1987.

  • [2] M. Ashworth and A. G. Lyne. A segmented FFT algorithm for vector computers. Parallel Computing, 6:217–224, 1988.

  • [3] A. Averbuch, E. Gabber, B. Gordissky, and Y. Medan. A parallel FFT on an MIMD machine. Parallel Computing, 15:61–74, 1990.

  • [4] D. H. Bailey. FFTs in external or hierarchical memory. In Proceedings of the Supercomputing 89, pages 234–242,1989.

  • [5] D. A. Carlson. Using local memory to boost the performance of FFT algorithms on the CRAY-2 supercomputer. J. Supercomput., 4:345–356, 1990.

  • [6] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex fourier series. Math. Comput.,19:297–301, 1965.

  • [7] J. del Cuvillo,W. Zhu, Z. Hu, and G. R. Gao. FAST: A functionally accurate simulation toolset for the Cyclops64 cellular architecture. In Workshop on Modeling, Benchmarking, and Simulation (MoBS2005), in conjuction with the 32nd Annual International Symposium on Computer Architecture (ISCA2005), Madison, Wisconsin, June 2005.

  • [8] M. Frigo and S. Johnson. FFTW.

  • [9] A. G. and P. I. Parallel implementation of 2-d FFT algorithms on a hypercube. In Proc. Parallel Computing Action, Workshop ISPRA, 1990.

  • [10] A. Gupta and V. Kumar. On the scalability of FFT on parallel


Reference1
Reference Architecture

  • computers. In FMPSC: Frontiers of Massively Parallel Scientific Computation. National Aeronautics and Space Administration NASA, IEEE Computer Society Press, 1990.

  • [11] J. Johnson, R. Johnson, D. Rodriguez, and R. Tolimieri. A methodology for designing, modifying, and implementing fourier transform algorithms on various architectures. Circuits, Systems, and Signal Processing, 9:449–500, 1990.

  • [12] S. Johnsson and D. Cohen. Computational arrays for the discrete Fourier transform. 1981.

  • [13] S. L. Johnsson and R. L. Krawitz. Cooley-tukey FFT on the connection machine. Parallel Computing, 18(11):1201– 1221, 1992.

  • [14] C. V. Loan. Computational framework for the fast Fourier transform. SIAM, Philadelphia, 1992.

  • [15] H. Nguyen and L. K. John. Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology. In International Conference on Supercomputing, pages 11–20, 1999.

  • [16] A. Norton and A. J. Silberger. Parallelization and performance

  • analysis of the cooley-tukey fft algorithm for sharedmemory architectures. IEEE Transactions on Computers, 36(5):581–591, 1987.

  • [17] D. M. S. L. Johnsson, R.L. Krawitz and R. Frye. A radix 2 FFT on the connection machine. In Proceedings of Supercomputing 89, pages 809–819, 1989.

  • [18] V. Singh, V. Kumar, G. Agha, and C. Tomlinson. Scalability of parallel sorting on mesh multicomputers. In International Parallel Processing Symposium, pages 92–101, 1991.


Reference2
Reference Architecture

  • [19] P. N. Swarztrauber. Multiprocessor FFTs. Parallel Computing, 5(1-2):197–210, 1987.

  • [20] K. B. Theobald. EARTH: An Efficient Architecture for Running Threads. PhD thesis, May 1999.

  • [21] P. Thulasiraman, K. B. Theobald, A. A. Khokhar, and G. R. Gao. Multithreaded algorithms for the fast fourier transform. In ACM Symposium on Parallel Algorithms and Architectures, pages 176–185, 2000


ad