1 / 17

Fast Fourier Transform Implementation on Cell Broadband Engine Architecture

Fast Fourier Transform Implementation on Cell Broadband Engine Architecture. Aarul Jain CSE520, Advanced Computer Architecture Fall 2007. OBJECTIVES.

Download Presentation

Fast Fourier Transform Implementation on Cell Broadband Engine Architecture

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fast Fourier Transform Implementation on Cell Broadband Engine Architecture Aarul Jain CSE520, Advanced Computer Architecture Fall 2007

  2. OBJECTIVES • Three versions of Fast Fourier Transform to be implemented on Cell BE simulator and their performance analyzed as the order of FFT is increased. • Fast Fourier Transform on PPE/single SPU. • Data/Task parallel on multiple SPUs. (single buffer v/s double buffer performance comparison.) • Pipelined implementation on multiple SPUs. • Performance : • FFT kernel • DMA data transfer

  3. CELL BE • PPE • 64bit Power architecture with VMX. • In-order, 2-way SMT. • 32KB L1, 512KB L2 Cache. • SPE • 256 KB local store. • In-order, No speculation. • 128 registers for all data types. • EIB • Four 16B data rings. • Over 100 outstanding requests.

  4. FFT on single PPU/SPU • FFT compute intensity O(nlogn) • Implementation on PPU • Cache based memory architecture – No software controlled memory. • Implementation on SPU • Software controlled memory. • Limited Local store memory decides the maximum size of the fft that can be implemented. (Data Structure Size = 16bytes * FFT size => 8K point FFT)

  5. RESULTS (single PPU/SPU) N v/s cycles

  6. Conclusions(single PPU/SPU) • Number of cycles on PPU and SPU scale with order NlogN. • Compute time on single SPU is greater than PPU due to cache misses in PPU. No cache for SPU -> direct local store access. • Very efficient DMA. • Thread creation on SPE very expensive. Thus SPUs need to be dedicated to a particular task for a period of time long enough to recoup the time it took to get it set up. • DIFFERENCE (col 8) TOO LARGE?? Exact reason unknown. Possible reasons: • Cycles for exiting the thread. (Upon exit are entries of Local Store invalidated?) • Profile tool problem. (IBM says that simulator is used for profiling SPEs and not PPEs. Does this mean intrinsics provided for measuring cycles on PPE (__mftb) are not accurate?)

  7. Data/Task parallel on multiple SPUs • Multiple FFTs running on each SPU and each SPU works on different data. • Limitation of local store memory. • Single buffer approach => 8K points • Double buffer approach => 4K points • Single buffer v/s double buffer. • Performance as number of active SPUs are increased.

  8. RESULTS(Data/Task parallel )

  9. Conclusion(Data/Task parallel ) • More compute power with multi-processors • For FFT -> almost 8 times if thread creation is not counted. • Using double buffering may not always give speed advantage. (Amdahl’s law) • Careful analysis of algorithm should be done to find out if its compute-intensive or memory-intensive with respect to Cell Architecture. • Matrix multiplication is memory-intensive but FFT will be memory-intensive only for very large orders where all FFT samples cannot fit into Cell Local Store.

  10. Comparison with published results • Reference http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/0AA2394A505EF0FB872570AB005BF0F1 No. of cycles for single 4K point FFT = 24688 No. of floating point operations = 4*1024*log(4*1024) = 49152 Frequency of system = 3.2Ghz No. of SPUs = 8 GFLOPS = (49152/24688) * 8 * 3.2G = 50.96Gflops/sec MY RESULTS IBM RESULTS

  11. Problems faced • CELL architecture and its programming environment is completely new. Unknown problems come up. • Runtime error -> “bus error”. Normally because of unaligned access. In my case I was making accesses more than 16K. • Profiling is tricky with simulator supporting multiple modes. Use of assembly intrinsics is required to measure actual cycles. Running in “CYCLE” mode is very slow. • Takes 2 days to run a 8K point fft. • Simulator crashing when mode is changed multiple times. • Debug support very complex.

  12. SUGGESTIONS • Use the forum alphaworks: excellent forum with quick response time. • To profile accurately run simulation in cycle mode. • Commands for profiling • __mftb() -> FOR PPE • spu_writech(), spu_readch() -> FOR SPE

  13. Future work • Pipelined implementation of FFT. • Standalone mode. • Higher order FFTs. • Compiler performance.

  14. References • http://www.ibm.com/developerworks/forums/thread.jspa?threadID=160216 • Cell Broadband Engine Architecture Reference Manual, Ver 1.02, October 11, 2007. • IBM Cell Broadband Engine Software Development Kit, http://alphaworks.ibm.com/tech/cellsw?open&S_TACT=105AGX16&S_CMP=DWPA • Kahle J. A. et. al., Introduction to the Cell multiprocessor, IBM Journal of Research and Development, September 2005. • Perrone M., Introduction to the Cell Processor (lecture), http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf • Krewell K., Cell Moves Into the Limelight, Microprocessor Report, February 2005. • Krewell K., Chips, Software, and Systems, Microprocessor Report, January 2005. • http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=182042

  15. Q/A

  16. Double buffer Code loop( mfc_get(&cb1+x*sizeof(cb1)/(FFT_SIZE/1024), argp+x*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0); mfc_write_tag_mask (1<<(y+10)); mfc_read_tag_status_all(); mfc_get(&cb2+y*sizeof(cb1)/(FFT_SIZE/1024), argp+y*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb2)/(FFT_SIZE/1024), y+10, 0, 0); mfc_write_tag_mask (1<<x); mfc_read_tag_status_all(); fft_float (FFT_SIZE,cb1.RealIn,cb1.ImagIn,cb1.RealOut,cb1.ImagOut); mfc_write_tag_mask (1<<(y+10)); mfc_read_tag_status_all(); mfc_put(&cb1+x*sizeof(cb1)/(FFT_SIZE/1024), argp+x*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0); fft_float (FFT_SIZE,cb2.RealIn,cb2.ImagIn,cb2.RealOut,cb2.ImagOut); mfc_write_tag_mask (1<<x); mfc_read_tag_status_all(); mfc_put(&cb2+y*sizeof(cb1)/(FFT_SIZE/1024), argp+y*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb2)/(FFT_SIZE/1024), y+10, 0, 0); ) mfc_write_tag_mask (1<<(y+10)); mfc_read_tag_status_all();

  17. BUS ERROR mfc_get(&cb1), argp, sizeof(cb1) x, 0, 0); => WONT WORK FOR cb1>16KB SHOULD BE RECODED AS for (x=0;x<FFT_SIZE/1024;x++) { mfc_get(&cb1+x*sizeof(cb1)/(FFT_SIZE/1024), argp+x*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0); }

More Related