
FFT Accelerator Project


Presentation Transcript


  1. FFT Accelerator Project 4th October, 2007 Rohit Prakash (2003CS10186) Anand Silodia (2003CS50210)

  2. FPGA: Overview • Work done • Structure of a sample program • Ongoing Work • Next Step

  3. FPGA: work done • Register handling and console IO • Modified simple.c • Implemented an adder • Used the VirtualBase member of ADMXRC2_SPACE_INFO • Registers can be indexed using bits (23 downto 2) of the LAD (local address/data) signal when it is used to address the FPGA
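
A minimal sketch of the register-access pattern slide 3 describes, assuming the window reported in the VirtualBase member of ADMXRC2_SPACE_INFO is mapped as an array of 32-bit registers (the helper names are illustrative, not from simple.c):

    #include <stdint.h>

    /* Pointer to the FPGA register window; in the real program this
     * would be initialized from ADMXRC2_SPACE_INFO.VirtualBase. */
    static volatile uint32_t *fpga_regs;

    /* Bits (23 downto 2) of LAD select the register, so register i
     * sits at 32-bit word offset i inside the mapped window. */
    static inline uint32_t reg_read(unsigned i)  { return fpga_regs[i]; }
    static inline void reg_write(unsigned i, uint32_t v) { fpga_regs[i] = v; }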

  4. Structure of simple.vhd

    entity simple is
      port(
        -- all the required local-bus signals
      );
    end simple;

    architecture ...

  5. Ongoing work: ZBT • The structure of zbt_main seems to be similar to simple.c • zbt.vhd is a wrapper for zbt_main.vhd • The same port names are defined in the same way and port-mapped to each other • We do not yet understand the reason for this wrapper • The C code is not available in the ADMXRC2 demos • Lalit’s code also uses ZBT and block RAMs, so we are looking at his C and VHDL code

  6. Next Step • To work with ZBT and block RAMs • FFT implementation on the FPGA

  7. Multiprocessor FFT Overview • Some improvements to the existing code • Improve the theoretical model • Compare the theoretical run time with the actual run time • Statistics for each processor • Further refinement: using the BSP model • Pointers for cache analysis

  8. Optimizations to the code • Removed the auxiliary arrays (reducing memory references considerably) • Twiddle factors • Bit-reversal addresses • Bit reversal made faster using bit operations: O(1) per address calculation • All multiplications/divisions by 2 implemented using shift operations: O(1) • Powers of two (2^n) computed in constant time using bit operations: O(1) (see the sketch below)
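
A hedged sketch of the bit tricks listed above (illustrative helpers, not the project's code; the bit-reversed counter shown is one common pattern, and is amortized rather than strictly O(1) per address):

    #include <stdint.h>

    /* Multiplication/division by 2 as shifts: O(1). */
    #define MUL2(x) ((x) << 1)
    #define DIV2(x) ((x) >> 1)

    /* 2^n in constant time via a shift. */
    static inline uint32_t pow2(unsigned n) { return (uint32_t)1 << n; }

    /* Advance a bit-reversed counter r for an n-point transform
     * (n a power of 2): flip bits from the MSB down until a 0 turns
     * into a 1. Amortized O(1) per call. */
    static inline uint32_t bitrev_next(uint32_t r, uint32_t n)
    {
        uint32_t mask = n >> 1;
        while (r & mask) {
            r &= ~mask;   /* clear leading 1s */
            mask >>= 1;
        }
        return r | mask;  /* set the first 0 found */
    }

For n = 8 this generates 0, 4, 2, 6, 1, 5, 3, 7, the radix-2 bit-reversed order.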

  9. Previously…

  10. Now…

  11. Improvement • For larger input sizes, our program (radix-2) is comparable to FFTW • Our program might surpass FFTW by • using SIMD • using a higher radix (e.g. 4, 8, 16; see the sketch below) • coding in C
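
To illustrate the higher-radix point: a radix-4 decimation-in-time butterfly produces four outputs with three twiddle multiplications, where the two radix-2 stages it replaces need four. A hedged C99 sketch (our illustration, not the project's code; in practice W^2k and W^3k would come from the precomputed twiddle table of slide 8):

    #include <complex.h>

    static void butterfly4(double complex *a, double complex *b,
                           double complex *c, double complex *d,
                           double complex w)   /* w = W_N^k */
    {
        double complex b1 = w * *b;
        double complex c1 = (w * w) * *c;
        double complex d1 = (w * w * w) * *d;
        double complex t0 = *a + c1, t1 = *a - c1;
        double complex t2 = b1 + d1, t3 = b1 - d1;
        *a = t0 + t2;       /* X[k]        */
        *b = t1 - I * t3;   /* X[k + N/4]  */
        *c = t0 - t2;       /* X[k + N/2]  */
        *d = t1 + I * t3;   /* X[k + 3N/4] */
    }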

  12. Redefining the execution time • For p processors, the total execution time is $\frac{T_N}{p} + \left(1 - \frac{1}{p}\right)\left(\frac{2N}{B} + K_N\right)$ • p is a power of 2 • This assumes the “RAM model”, i.e. a flat memory address space with unit-cost access to any memory location • We did not take the memory hierarchy into account • E.g. matrix multiplication can actually take $O(n^5)$ instead of the expected $O(n^3)$ [Alpern et al. 1994]

  13. Redefining the execution time • Some observations • If the number of processors is p, then the FFT actually computed is FFT(N/p), so the time taken is $T_{N/p}$ and NOT $T_N/p$ • The time taken to combine ($O(n)$ in the RAM model) should be taken as $\sum_{i=1}^{\log p} K_{N/2^i}$ • Synchronization time is NOT included • Currently we look at the execution time only from the perspective of the master processor • The overheads of establishing sends and receives have been neglected (measured with a ping-pong approach, this time was negligible)

  14. New Theoretical Formula • The time taken for parallel execution with p processors is $T_{N/p} + \left(1 - \frac{1}{p}\right)\frac{2N}{B} + \sum_{i=1}^{\log p} K_{N/2^i}$
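
To make the sum concrete (our reading, with $K_m$ the cost of the combine step at size $m$): for $p = 4$ the formula expands to

    $T_{N/4} + \frac{3}{4} \cdot \frac{2N}{B} + K_{N/2} + K_{N/4}$

i.e. the master computes one FFT of size N/4, transfers 3/4 of the data twice over the bus, and performs the two combine levels of the tree.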

  15. Execution Time: 16777216

  16. Input: 16777216 (p=2) • [timeline chart: P1 runs Send(2), FFT(N/2), Recv(2), Combine from T=0 to T=35.808; P2 runs Recv(1), FFT(N/2), Send(1), finishing near T=35.5]

  17. Load Distribution: Processor 1

  18. Load Distribution: Processor 2

  19. Input: 16777216 (p=4) • [timeline chart: per-processor traces for P1–P4 interleaving Send/Recv, FFT(N/4), and Combine phases; P1 runs from T=0 to T=40.120]

  20. Load Distribution: Processor 1

  21. Load Distribution: Processor 2

  22. Load Distribution: Processor 3

  23. Load Distribution: Processor 4

  24. Execution Time: 33554432

  25. Input: 33554432 (p=2) • [timeline chart: P1 runs Send(2), FFT(N/2), Recv(2), Combine from T=0 to T=133.851; P2 runs Recv(1), FFT(N/2), Send(1), finishing near T=133.3]

  26. Load Distribution: Processor 1

  27. Load Distribution: Processor 2

  28. Input: 33554432 (p=4) • [timeline chart: per-processor traces for P1–P4 interleaving Send/Recv, FFT(N/4), and Combine phases; P1 runs from T=0 to T=119.261]

  29. Load Distribution: Processor 1

  30. Load Distribution: Processor 2

  31. Load Distribution: Processor 3

  32. Load Distribution: Processor 4

  33. Execution Time: 67108864

  34. Input: 67108864 (p=2) • [timeline chart: P1 runs Send(2), FFT(N/2), Recv(2), Combine from T=0 to T=324.062; P2 runs Recv(1), FFT(N/2), Send(1), finishing near T=252.6]

  35. Load Distribution: Processor 1

  36. Load Distribution: Processor 2

  37. Input: 67108864 (p=4) • [timeline chart: per-processor traces for P1–P4 interleaving Send/Recv, FFT(N/4), and Combine phases; timestamps run from T=0 to T=544.529]

  38. Load Distribution: Processor 1

  39. Load Distribution: Processor 2

  40. Load Distribution: Processor 3

  41. Load Distribution: Processor 4

  42. Inference • The idle time is very small (for processor 1) • The theoretical model matches the actual results • But we need to find a closed-form solution for $T_N$ and $K_N$

  43. Calculating $T_N$ and $K_N$ • They depend upon • N: size of the input • A: cache associativity • L: cost incurred on a miss • M: size of the cache • B: number of bytes the cache transfers at a time (the block size)

  44. Contd… • Cache profilers give us the number of references made to each level of the cache, along with the number of misses • We have this table (computed over the summer) • Multiplying the numbers of references and misses by the cycle counts they cost gives an actual time estimate (see the sketch below)
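
A hedged sketch of that calculation (the structure and cycle-cost parameters are illustrative assumptions, not the actual profiler table):

    /* Estimate execution cycles from per-level cache statistics by
     * weighting hits at each level and misses to main memory. */
    typedef struct {
        unsigned long refs;    /* references reaching this level */
        unsigned long misses;  /* misses at this level           */
        double hit_cycles;     /* assumed cost of a hit here     */
    } cache_level;

    double estimate_cycles(const cache_level *lv, int nlevels,
                           double mem_cycles /* main-memory penalty */)
    {
        double total = 0.0;
        for (int i = 0; i < nlevels; i++)
            total += (double)(lv[i].refs - lv[i].misses) * lv[i].hit_cycles;
        /* misses at the last cache level are served by main memory */
        total += (double)lv[nlevels - 1].misses * mem_cycles;
        return total;
    }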

  45. Theoretical Verification • S. Sen et al. – “Towards a Theory of Cache-Efficient Algorithms” • It gives a formal method for analyzing algorithms in the cache model (taking the multi-level memory hierarchy into account) • Still reading it

  46. Modeling using BSP • The BSP (Bulk Synchronous Parallel) model treats • the whole job as a series of supersteps • In each superstep, all processors do local computations and send messages to other processors; these messages are not available until the next barrier synchronization has finished

  47. Modeling using BSP • The BSP model uses the following parameters • p, the number of processors (a power of 2 for us) • $w_t$, the maximum local work performed by any processor in superstep t • L, the time the machine needs for a barrier synchronization (determined experimentally) • g, the network bandwidth inefficiency (the reciprocal of B, determined experimentally)
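
For reference, these parameters enter the standard BSP cost formula (standard BSP background, not stated on the slide; $h_t$ is the maximum number of words any processor sends or receives in superstep $t$):

    $T = \sum_t \left( w_t + g\,h_t + L \right)$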

  48. Modeling using BSP • [superstep diagram: P1–P4 proceed through supersteps 0–6 separated by barriers, interleaving Send/Recv, FFT(N/4), and Combine phases]

  49. Execution time • Initial barrier: L • Step 0: L + max(time(Send(2)), time(Recv(1))) • Step 1: L + max(time(Send(3)), time(Send(4)), time(Recv(1)), time(Recv(2))) • Step 2: L + max(time(FFT_i(N/p))), 0 ≤ i ≤ p−1 • Step 3: L + max(time(Send(2)), time(Send(1)), time(Recv(3)), time(Recv(4))) • Step 4: L + max(time(combine_i(N/4))), i ∈ {1, 2} • Step 5: L + max(time(Send(1)), time(Recv(2))) • Step 6: L + time(combine(N/2))

  50. Generalizing this for p processors • event(t): • communications, for 0 ≤ t < log p • compute FFT(N/p), for t = log p • communications, for log p < t ≤ 3 log p (t − log p odd) • combine FFTs, for log p < t ≤ 3 log p (t − log p even)
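
A small sketch that enumerates this schedule (illustrative only; it prints which phase each superstep t belongs to, following the table above, for log p = 2, i.e. p = 4):

    #include <stdio.h>

    int main(void)
    {
        int logp = 2;                 /* p = 4 */
        for (int t = 0; t <= 3 * logp; t++) {
            if (t < logp)
                printf("step %d: communications (distribute)\n", t);
            else if (t == logp)
                printf("step %d: compute FFT(N/p)\n", t);
            else if ((t - logp) % 2 == 1)
                printf("step %d: communications (collect)\n", t);
            else
                printf("step %d: combine FFTs\n", t);
        }
        return 0;
    }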
