1 / 18

Parallel FFT

Parallel FFT. Sathish Vadhiyar. Sequential FFT – Quick Review. Twiddle factor – primitive n th root of unity in complex plane -. Sequential FFT – Quick Review. (n/2) th root of unity 2 (n/2)-point DFTs.

jenski
Download Presentation

Parallel FFT

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parallel FFT Sathish Vadhiyar

  2. Sequential FFT – Quick Review • Twiddle factor – primitive nth root of unity in complex plane -

  3. Sequential FFT – Quick Review • (n/2)th root of unity • 2 (n/2)-point DFTs

  4. Sequential FFT – quick review X(0) Y(0) wn0 wn0 wn0 X(4) Y(1) wn2 wn1 wn4 X(2) Y(2) wn4 wn2 wn0 X(6) Y(3) wn6 wn3 wn4 X(1) Y(4) wn0 wn0 wn4 X(5) Y(5) wn4 wn2 wn5 X(3) Y(6) wn0 wn4 wn6 X(7) Y(7) wn4 wn6 wn7

  5. Sequential FFT – recursive solution 1. procedure R_FFT(X, Y, n, w) 2. if (n=1) then Y[0] := X[0] else 3. begin 4. R_FFT(<X(0), X(2), …, X[n-2]>, <Q[0], Q[1], …, Q[n/2]>, n/2, w2) 5. R_FFT(<X(1), X(3), …, X[n-1]>, <T[0], T[1], …, T[n/2]>, n/2, w2) 6. for i := 0 to n-1 do 7. Y[i] := Q[i mod (n/2)] + wiT(i mod (n/2)]; 8. end R_FFT

  6. Sequential FFT – iterative solution 1. procedure ITERATIVE_FFT(X, Y, n) 2. begin 3. r := log n; 4. for i:= 0 to n-1 do R[i] := X[i]; 5. for m:= 0 to r-1 do 6. begin 7. for i:= 0 to n-1 do S[i] := R[i]; 8. for i:= 0 to n-1 do 9. begin /* Let (b0, b1, b2, … br-1) be the binary representation of i */ 10. j := (b0 … bm-10bm+1 .. br-1); 11. k := (b0 … bm-11bm+1 .. br-1); 12. R[i] := S[j] + S[k] x w(bmbm-1..b00..0) ; 13. endfor; 14. endfor; 15. for i:= 0 to n-1 do Y[i] := R[i]; 16. end ITERATIVE_FFT

  7. Example of w calculation For a given m and i, the power of w is computed by reversing the order of the m+1 most significant bits and padding them by 0’s to the right.

  8. Parallel FFT – Binary exchange X(0) Y(0) 000 P0 001 X(1) Y(4) 010 X(2) Y(2) P1 011 X(3) Y(6) 100 X(4) Y(1) P2 101 X(5) Y(5) X(3) Y(3) 110 P3 111 X(7) Y(7) d r

  9. Binary Exchange • d – number of bits for representing processes; r – number of bits representing the elements • The d most significant bits of element i indicate the process that the element belongs to. • Only the first d of the r iterations require communication • In a given iteration, a process communicates with only one other process • Total execution time - ? (n/P)logN + logP(l) + (n/P)logP (b)

  10. Parallel FFT – 2D Transpose P0 P1 P2 P3 P0 P1 P2 P3 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 8 9 10 11 12 13 14 15 12 13 14 15 m = 0 m = 1 Phase 1 – FFTs along columns

  11. Parallel FFT – 2D Transpose P0 P1 P2 P3 P0 P1 P2 P3 0 1 2 3 0 4 8 12 4 5 6 7 1 5 9 13 8 9 10 11 2 6 10 14 12 13 14 15 3 7 11 15 Phase 2 – Transpose

  12. Parallel FFT – 2D Transpose P0 P1 P2 P3 P0 P1 P2 P3 0 4 8 12 0 4 8 12 1 5 9 13 1 5 9 13 2 6 10 14 2 6 10 14 3 7 11 15 3 7 11 15 m = 2 m = 3 Phase 3 – FFTs along columns

  13. 2D Transpose • In general, n elements arranged as √n x √n • p processes arranged along columns. Each process owns √n/p columns • Each process does √n/p FFTs of size √n each • Parallel runtime – 2(√n/p)√nlog√n + (p-1)(l)+ n/p(b)

  14. 3D Transpose • n1/3 x n1/3 x n1/3 elements • √p x √p processes • Steps ? • Parallel runtime – (n/p)logn(c) + 2(√p-1)(l) + 2(n/p)(b)

  15. In general • For q dimensions: • Parallel runtime – (n/p)logn + (q-1)(p1/(q-1) -1) [l]+ (q-1)(n/p) [b] • Time due to latency decreases; due to bandwidth increases • For implementation – only 2D and 3D transposes are feasible. Moreover, there are restrictions on n and p in terms of q.

  16. Choice of algorithm • Binary exchange – small latency, large bandwidth • 2D transpose – large latency, small bandwidth • Other transposes lie between binary exchange and 2D transpose • For a given parallel computer, based on l and b, different algorithms can give different performances for different problem sizes

  17. FFTW • Self-optimizing package • Adapting FFT computation to the details of the hardware • FFTW has codelet library. • Executor is the implementation of C-T algorithm composed of certain Codelets • Composition of executor is the plan. • FFTW has codelets for computing FFTs of integers upto 16 and powers of 2 upto 64.

  18. FFTW • E.g., for solving a 128-point FFT: • divide-and-conquer(128, 4) • divide-and-conquer(32,8) • solve(4) • Optimal plan depends on processor, memory architecture and compiler • Uses search space techniques to conduct experiments with a select set of plan and obtain the optimum.

More Related