FFT Accelerator Project

Rohit Prakash (2003CS10186)

Anand Silodia (2003CS50210)

September 27th, 2007

- Multiprocessor Implementation
  - Problems faced
  - Solutions
  - Results

- FPGA IO
  - Work done
  - Problems faced
  - Possible solutions

- The previous code worked for some inputs but not for all
- The processes seemed to communicate correctly, but the program was still error-prone
- Lots of segmentation faults (even after the results had been produced)
- A serial debugger does not work on the parallel program
- Commercial parallel debuggers are available, but their evaluation versions are restricted to a single IP address and 30 days

- “Execution Environment does not match the compile environment”
- The same code worked with MPICH version 2 and GCC
- The complex datatype is NOT supported in the C bindings of MPI (although MPI_2COMPLEX seemed to work for me)
- Finally rewrote the code in C++ using complex<float> and MPI::COMPLEX (this worked)
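The C++ fix relies on MPI::COMPLEX. A common portable alternative, when no complex MPI datatype is usable, is to send a complex buffer as twice as many floats, because since C++11 std::complex<float> is guaranteed to be laid out as two floats with the real part first. The following is a minimal sketch of that idea only; it is not what the slides used, and the helper names as_interleaved and float_count are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <complex>
#include <cstddef>
#include <vector>

// Since C++11, std::complex<float> is layout-compatible with float[2]
// (real part first), so an array of n complex values can be viewed as
// 2*n interleaved floats: re0, im0, re1, im1, ...
inline float* as_interleaved(std::vector<std::complex<float>>& buf) {
    return reinterpret_cast<float*>(buf.data());
}

// Element count to pass when sending the buffer as plain floats, e.g.
// MPI_Send(as_interleaved(buf), float_count(buf.size()), MPI_FLOAT, dest, tag, comm).
// MPI counts are ints, hence the range check.
inline int float_count(std::size_t n_complex) {
    assert(n_complex <= static_cast<std::size_t>(INT_MAX) / 2);
    return static_cast<int>(2 * n_complex);
}
```

The receiving side would use the same view on its own buffer, so no packing or copying is needed in either direction.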

- Machine 1: Saveri
- Machine 2: Abhogi
- Machine 3: Sahana
- Machine 4: Jaunpuri
- System information:
  - Intel Pentium 4, 3.4 GHz
  - Cache size: 2048 KB
  - RAM: 1 GB
  - Operating system: Fedora Core 6
  - Compiler: mpic++
  - Flags: -O3 -march=pentium4
  - FFT: radix-2
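The slides state only that a radix-2 FFT was used. For reference, here is a minimal in-place iterative radix-2 (Cooley-Tukey, decimation-in-time) sketch on std::complex<float>, the type the final C++ version used. It is not the project's code, and the name fft_radix2 is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cfloat = std::complex<float>;

// In-place iterative radix-2 decimation-in-time FFT (forward transform,
// X[k] = sum_n x[n] * exp(-2*pi*i*n*k/N)). The size must be a power of two.
void fft_radix2(std::vector<cfloat>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0 && "size must be a power of two");

    // Bit-reversal permutation so that the butterflies can run in place.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }

    const double pi = std::acos(-1.0);
    // Each pass merges pairs of length-(len/2) transforms into length-len ones.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = -2.0 * pi / static_cast<double>(len);
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                // Twiddle factor computed in double to limit rounding error.
                const cfloat w(static_cast<float>(std::cos(ang * k)),
                               static_cast<float>(std::sin(ang * k)));
                const cfloat u = a[start + k];
                const cfloat v = a[start + k + len / 2] * w;
                a[start + k] = u + v;
                a[start + k + len / 2] = u - v;
            }
        }
    }
}
```

A transform of N points makes log2 N passes of N/2 butterflies each, which is the O(N log N) work that the parallel version splits into FFT(N/p) pieces.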

- For p processors, the total execution time is:
  (TN/p) + (1 - 1/p)(2N/B + KN)

- p is a power of 2
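The slides do not define T, B and K. The following minimal C++ sketch of the model assumes N is the number of input points, B the link bandwidth in points per unit time, T the per-point cost of the local FFT and K the per-point cost of combining. These readings are suggested by the step costs in the pipeline diagram, such as TN/4 for FFT(N/4) and N/2B for sending N/2 points. The struct and function names are hypothetical.

```cpp
#include <cassert>

// Model parameters; their interpretation is an assumption (see above).
struct ModelParams {
    double N;  // number of input points
    double T;  // per-point cost of the local FFT
    double B;  // link bandwidth, in points per unit time
    double K;  // per-point cost of a combine step
};

// t(p) = TN/p + (1 - 1/p)(2N/B + KN), for p a power of two.
double exec_time(const ModelParams& m, unsigned p) {
    assert(p != 0 && (p & (p - 1)) == 0 && "p must be a power of two");
    const double inv_p = 1.0 / p;
    return m.T * m.N * inv_p + (1.0 - inv_p) * (2.0 * m.N / m.B + m.K * m.N);
}

// Speedup relative to one processor; the model gives t(1) = TN.
double speedup(const ModelParams& m, unsigned p) {
    return exec_time(m, 1) / exec_time(m, p);
}

// Fraction of t(p) spent transferring data: (1 - 1/p)(2N/B) / t(p).
double comm_fraction(const ModelParams& m, unsigned p) {
    const double inv_p = 1.0 / p;
    return (1.0 - inv_p) * (2.0 * m.N / m.B) / exec_time(m, p);
}
```

For example, with N = 1024, T = 10, B = 1 and K = 2, the model gives t(1) = 10240 and t(4) = 5632, a speedup of about 1.8.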

- Sum of two terms:
  - (TN/p)
  - (1 - 1/p)(2N/B + KN)

- When (TN/p) dominates
- When (1 - 1/p)(2N/B + KN) dominates
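Which term dominates decides whether adding processors helps. Writing t(p) for the execution time (to avoid a clash with the constant T) and using t(1) = TN, which the formula gives at p = 1, a short derivation of the speedup condition for p ≥ 2 is:

```latex
\begin{align*}
t(p) &= \frac{TN}{p} + \left(1 - \frac{1}{p}\right)\left(\frac{2N}{B} + KN\right),
  \qquad t(1) = TN,\\
t(p) < t(1)
  &\iff \left(1 - \frac{1}{p}\right)\left(\frac{2N}{B} + KN\right)
        < TN\left(1 - \frac{1}{p}\right)\\
  &\iff T > \frac{2}{B} + K \qquad (p \ge 2).
\end{align*}
```

If T were a constant, N would cancel and the model alone would not produce a size-dependent breakeven. One plausible reading, not stated in the slides, is that the effective per-point FFT cost T grows with N; for a radix-2 FFT the work per point grows roughly as log2 N.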

- An input size of 33554432 (2^25) points is roughly the breakeven point (beyond it we start getting a speedup)
- Below this point
  - the execution time increases as the number of processors increases
  - the percentage of time spent in communication decreases as the number of processors increases

- Above this point
  - the execution time decreases as the number of processors increases
  - the percentage of time spent in communication increases as the number of processors decreases

- We measure real (wall-clock) time, which is affected by the load on a particular processor
- Network communication latency affects the time taken to establish a synchronous handshake
- The pipeline is not actually “perfect”

[Figure: timeline of the 4-processor pipeline (P1 to P4). Data is scattered with Send/Recv steps, each processor runs FFT(N/4), and partial results are sent back and merged in Combine steps. The steps are annotated with their costs (transfers of N/2B and N/4B, the local FFT of TN/4, and the combine costs). Note on the figure: the time taken by these steps can surpass the stage boundaries.]
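The closed-form model can be cross-checked against the 4-processor schedule by summing per-level costs of a binary tree. The following sketch rests on stated assumptions. Transfers and combines at the same tree level run fully in parallel, and steps never overlap; the figure's note about steps surpassing the boundaries is exactly where real runs depart from this. Combining two blocks into one of m points is assumed to cost Km/2, a choice that reproduces the closed form. The function name is hypothetical.

```cpp
#include <cassert>

// Idealized cost of a binary-tree FFT schedule on p = 2^d processors:
// halves are scattered down the tree, each processor runs FFT(N/p), then
// partial results are sent back up and combined level by level.
// Assumed costs: sending m points takes m/B; an FFT on m points takes T*m;
// combining two halves into m points takes K*m/2.
double tree_schedule_time(double N, double T, double B, double K, unsigned p) {
    assert(p != 0 && (p & (p - 1)) == 0 && "p must be a power of two");
    double total = T * N / p;  // local FFT(N/p), all processors in parallel
    for (unsigned parts = 2; parts != 0 && parts <= p; parts <<= 1) {
        const double m = N / parts;  // points moved at this tree level
        total += m / B;              // scatter step at this level
        total += m / B;              // matching send back on the way up
        total += K * m;              // combine two blocks of m into 2m: K*(2m)/2
    }
    return total;
}
```

For p = 4 the loop adds N/2B + N/2B + KN/2 and N/4B + N/4B + KN/4 to TN/4, giving TN/4 + (3/4)(2N/B + KN), the same as (TN/p) + (1 - 1/p)(2N/B + KN).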

- Rewrite the code with the new data type in C
- Optimize the code
- Try with more processors?
- Analyze using profilers?

- Built and ran the admxrc2 demos
- Studied the wrapper and VHDL code
- struct ADMXRC2_SPACE_INFO
  - The VirtualBase member is the address, in the application's address space, by which the region may be accessed using pointers.
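The VirtualBase description implies direct pointer access to a mapped region. This minimal C++ sketch assumes the region holds 32-bit registers at 4-byte-aligned offsets. SpaceInfoStub is a hypothetical stand-in for ADMXRC2_SPACE_INFO, and the SDK call that fills in the real struct is deliberately not shown. In the test, a plain array stands in for the mapped region so the sketch runs without a card.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the one ADMXRC2_SPACE_INFO field used here;
// the real struct has more members and is filled in by the card's SDK.
struct SpaceInfoStub {
    void* VirtualBase;  // application-space address of the mapped region
};

// Write a 32-bit word at a 4-byte-aligned byte offset within the region.
// volatile stops the compiler from caching, merging or eliding device accesses.
inline void write_reg32(const SpaceInfoStub& info, std::size_t byte_offset,
                        std::uint32_t value) {
    auto* base = static_cast<volatile std::uint8_t*>(info.VirtualBase);
    *reinterpret_cast<volatile std::uint32_t*>(base + byte_offset) = value;
}

// Read a 32-bit word at a 4-byte-aligned byte offset within the region.
inline std::uint32_t read_reg32(const SpaceInfoStub& info, std::size_t byte_offset) {
    auto* base = static_cast<volatile std::uint8_t*>(info.VirtualBase);
    return *reinterpret_cast<volatile std::uint32_t*>(base + byte_offset);
}
```

Whether the hardware accepts 32-bit accesses at a given offset depends on the FPGA design behind the region, so the register layout here is purely illustrative.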

- All the demo VHDL designs use the names of the card's standard signals as their inputs and outputs
- This approach makes the VHDL code card-dependent

- Another approach uses the ADMXRC2_Read and ADMXRC2_Write API calls
- Determine which of the two approaches is more useful and work with it
- DMA code by Parikshit Patidar (from his work on a hardware accelerator for ray tracing)
