FFT Accelerator Project

Rohit Prakash (2003CS10186)

Anand Silodia (2003CS50210)

September 27th, 2007


Overview

  • Multiprocessor Implementation

    • Problems faced

    • Solutions

    • Results

  • FPGA IO

    • Work done

    • Problems faced

    • Possible solutions


Multiprocessor FFT: Problems

  • The previous code worked for some inputs but not all

  • The program appeared to communicate correctly but was still error-prone

  • Many segmentation faults (some even after the results had been produced)

    • The serial debugger does not work

    • Commercial debuggers are available, but evaluation is restricted to a single IP address for 30 days


Suggested solutions (lam-mpi / Google Groups)

  • “Execution environment does not match the compile environment”

  • The same code worked with MPICH version 2 and GCC

  • The complex datatype is NOT supported in the C version (although MPI_2COMPLEX seemed to work for me)

  • Finally rewrote the code in C++ using std::complex<float> and MPI::COMPLEX (this worked)
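The datatype issue above comes down to how a buffer of complex values is described to the message-passing layer. As a minimal sketch, and not the authors' code, the block below shows one portable alternative that relies only on the C++ guarantee that std::complex<float> is laid out as two consecutive floats: the buffer can be viewed as 2n plain floats. The helper names (as_floats, float_count, check_layout) are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Since C++11, std::complex<float> is array-compatible with float[2]:
// real part first, imaginary part second.
static_assert(sizeof(std::complex<float>) == 2 * sizeof(float),
              "complex<float> must be exactly two floats");

// View a buffer of n complex values as 2*n contiguous floats, e.g. to hand it
// to a communication call that only understands a plain float datatype.
inline float* as_floats(std::vector<std::complex<float>>& buf) {
    return reinterpret_cast<float*>(buf.data());
}

// Number of floats in the flattened view.
inline std::size_t float_count(const std::vector<std::complex<float>>& buf) {
    return 2 * buf.size();
}

// Sanity check: element i's real and imaginary parts sit at floats 2i and 2i+1.
inline void check_layout(std::vector<std::complex<float>>& buf) {
    const float* f = as_floats(buf);
    for (std::size_t i = 0; i < buf.size(); ++i) {
        assert(f[2 * i] == buf[i].real());
        assert(f[2 * i + 1] == buf[i].imag());
    }
}
```

With a view like this, a send of n complex values could in principle be expressed as a send of float_count(buf) floats; whether that is preferable to MPI::COMPLEX depends on the MPI implementation in use.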


System Info (Identical for all)

  • Machine 1: Saveri

  • Machine 2: Abhogi

  • Machine 3: Sahana

  • Machine 4: Jaunpuri

  • System configuration:

    • Intel Pentium 4, 3.4 GHz

    • Cache size: 2048 KB

    • RAM: 1 GB

    • Operating system: Fedora Core 6

    • Compiler: mpic++

    • Flags: -O3 -march=pentium4

    • FFT: radix 2


Theoretical execution time

  • For p processors (p a power of 2), the total execution time is:

    T_N/p + (1 - 1/p)(2N/B + K_N)

  • T_N is the time taken to compute the FFT of an input of size N

  • K_N is the time taken to combine two N-point FFTs

  • B is the network bandwidth (bytes/sec)
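To make the model concrete, the following sketch evaluates the formula above for a given processor count. It is an illustration only: the struct and function names (ModelParams, predicted_time) are hypothetical, the slide's TN and KN are written T_N and K_N in the comments, and N is assumed to be expressed in units that match the bandwidth B.

```cpp
#include <cassert>

// Parameters of the slide's timing model. Names are illustrative and do not
// come from the original code.
struct ModelParams {
    double fft_time;      // T_N: time to compute the FFT of an input of size N
    double combine_time;  // K_N: time to combine two FFTs
    double n;             // N: input size, in units matching the bandwidth
    double bandwidth;     // B: network bandwidth
};

// Predicted total time on p processors:
//   T_N/p + (1 - 1/p) * (2N/B + K_N)
inline double predicted_time(const ModelParams& m, int p) {
    assert(p >= 1 && (p & (p - 1)) == 0);  // p must be a power of 2
    assert(m.bandwidth > 0.0);
    const double inv_p = 1.0 / p;
    return m.fft_time * inv_p
         + (1.0 - inv_p) * (2.0 * m.n / m.bandwidth + m.combine_time);
}
```

For example, with T_N = 8, K_N = 1, N = 100 and B = 100 (arbitrary made-up values), the model gives 8 for one processor, 5.5 for two and 4.25 for four.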


    Nature of this function

    • Sum of two functions:

      • T_N/p

      • (1 - 1/p)(2N/B + K_N)

    • When T_N/p dominates

    • When (1 - 1/p)(2N/B + K_N) dominates
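The two regimes can be made precise with a short derivation from the model. Here T(p) is shorthand introduced for the predicted time on p processors, C abbreviates the per-step communication and combine cost, and T_N and K_N are the slide's TN and KN; none of this notation appears in the original slides.

```latex
\begin{aligned}
C &= \frac{2N}{B} + K_N, \qquad
T(p) = \frac{T_N}{p} + \left(1 - \frac{1}{p}\right) C, \qquad T(1) = T_N \\
T(p) - T(1) &= \left(1 - \frac{1}{p}\right)\left(C - T_N\right) \\
T(p) < T(1) \ \text{for } p > 1 &\iff T_N > C = \frac{2N}{B} + K_N
\end{aligned}
```

Under this model, adding processors helps exactly when the single-processor FFT time exceeds the communication plus combine cost. As p grows, T(p) approaches C, so C is the floor on the achievable time.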


    Results for input sizes 8388608, 16777216, 33554432 and 67108864 (three slides per input size; the slide contents were not preserved in this transcript)


    Inference

    • An input size of 33554432 (2^25) is roughly the breakeven point; beyond it we start to see speedup

    • Below this point

      • the execution time increases as the number of processors increases

      • the percentage of time spent on communication decreases as the number of processors increases

    • Above this point

      • the execution time decreases as the number of processors increases

      • the percentage of time spent on communication increases as the number of processors decreases


    Possible errors

    • Wall-clock (real) time was measured, and it is affected by the load on each processor

    • Network latency affects the time taken to establish a synchronous handshake

    • The pipeline is not actually “perfect”
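One common way to reduce the effect of machine load on wall-clock measurements is to repeat the run and keep the minimum. The sketch below illustrates that idea with the standard C++ clock; it is not the measurement method used for the results in this presentation, and the function name min_wall_seconds is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Run `work` `repeats` times and return the smallest wall-clock duration in
// seconds. The minimum is less sensitive to transient load than a single run.
template <typename Work>
double min_wall_seconds(Work&& work, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

In an MPI program the same pattern would typically use MPI_Wtime on each rank, with the ranks synchronised before the timed region.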


    4 processor pipelined layout

    [Timing diagram not preserved in this transcript. It showed, for processors P1 to P4, the sequence of Send, Recv, FFT(N/4) and Combine operations, with stage durations labelled (N/2B), (N/4B), (TN/4), (KN/4B) and (KN/2B).]

    Time taken by these can surpass the boundaries
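The Combine stage in the layout merges two half-size FFT results into one. The slides do not show how the combine was implemented, so the sketch below assumes the textbook radix-2 decimation-in-time butterfly, with the input split into even-indexed and odd-indexed samples. The function name combine and the alias cfloat are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Combine the FFTs of the even-indexed and odd-indexed samples (each of
// length n/2) into the n-point FFT, using the radix-2 butterfly:
//   X[k]       = E[k] + w^k * O[k]
//   X[k + n/2] = E[k] - w^k * O[k],   w = exp(-2*pi*i / n)
inline std::vector<cfloat> combine(const std::vector<cfloat>& even,
                                   const std::vector<cfloat>& odd) {
    assert(even.size() == odd.size());
    const std::size_t half = even.size();
    const std::size_t n = 2 * half;
    const double pi = std::acos(-1.0);
    std::vector<cfloat> out(n);
    for (std::size_t k = 0; k < half; ++k) {
        const double angle =
            -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const cfloat twiddle(static_cast<float>(std::cos(angle)),
                             static_cast<float>(std::sin(angle)));
        const cfloat t = twiddle * odd[k];
        out[k] = even[k] + t;         // upper half of the butterfly
        out[k + half] = even[k] - t;  // lower half of the butterfly
    }
    return out;
}
```

As a check, for the input [1, 2, 3, 4] the even samples [1, 3] transform to [4, -2] and the odd samples [2, 4] to [6, -2]; combining them gives [10, -2+2i, -2, -2-2i], which is the 4-point FFT of the input.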


    Further Work

    • Rewrite the code in C with the new data type

    • Optimize the code

    • Try with more processors?

    • Analyze using profilers?


    FPGA: PCI IO

    • Built and ran the ADMXRC2 demos

    • Studied the wrapper and the VHDL code

    • struct ADMXRC2_SPACE_INFO

      • The VirtualBase member is the address, in the application's address space, by which the region may be accessed using pointers.
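Accessing a mapped region through a base pointer usually looks like ordinary pointer arithmetic with volatile accesses. The sketch below illustrates the idea with a HYPOTHETICAL stand-in struct: only the VirtualBase member name comes from the slide, while SpaceInfoStandIn, size_bytes, write_reg32 and read_reg32 are invented for illustration and are not part of the ADM-XRC SDK. In a real program the structure would be filled in by the SDK rather than by application code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// HYPOTHETICAL stand-in for the SDK's ADMXRC2_SPACE_INFO. Only the
// VirtualBase member name is taken from the slide.
struct SpaceInfoStandIn {
    void* VirtualBase;       // address of the region in the application's address space
    std::size_t size_bytes;  // illustrative: size of the mapped region in bytes
};

// Write a 32-bit word at a byte offset inside the mapped region.
// `volatile` keeps the compiler from caching or eliding the access.
inline void write_reg32(const SpaceInfoStandIn& s, std::size_t offset,
                        std::uint32_t value) {
    assert(offset % 4 == 0 && offset + 4 <= s.size_bytes);
    auto* base = static_cast<volatile std::uint32_t*>(s.VirtualBase);
    base[offset / 4] = value;
}

// Read a 32-bit word at a byte offset inside the mapped region.
inline std::uint32_t read_reg32(const SpaceInfoStandIn& s, std::size_t offset) {
    assert(offset % 4 == 0 && offset + 4 <= s.size_bytes);
    auto* base = static_cast<volatile std::uint32_t*>(s.VirtualBase);
    return base[offset / 4];
}
```

For experimentation without the card, VirtualBase can point at an ordinary array; the register offsets and their meaning on real hardware depend on the FPGA design.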


    Mapping to logical space

    • All the demo VHDL code is written using the names of the standard card signals as inputs and outputs

    • This approach makes the VHDL code card-dependent


    FPGA: Next step

    • Another approach uses the ADMXRC2_Read and ADMXRC2_Write API calls

    • Determine which of the two approaches is more useful and work with it

    • DMA code by Parikshit Patidar (from the work on a Hardware Accelerator for Ray Tracing)


    References

    • ADM-XRC-II user manual

    • www.forums.xilinx.com

    • www.fpga-faq.org


    Thank you

