1 / 53

Parallel Spectral Methods: Solving Elliptic Problems with FFTs

Parallel Spectral Methods: Solving Elliptic Problems with FFTs. Horst Simon hdsimon@eecs.berkeley.edu www.cs.berkeley.edu/~demmel/cs267_Spr09. Motifs. The Motifs (formerly “Dwarfs”) from “ The Berkeley View” ( Asanovic et al.) Motifs form key computational patterns.

Download Presentation

Parallel Spectral Methods: Solving Elliptic Problems with FFTs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parallel Spectral Methods:Solving Elliptic Problems with FFTs Horst Simon hdsimon@eecs.berkeley.edu www.cs.berkeley.edu/~demmel/cs267_Spr09 CS267 Lecture 20

  2. Motifs The Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.) Motifs form key computational patterns Topic of this lecture CS267 Lecture 21 2

  3. References • Previous CS267 lectures • Lecture by Geoffrey Fox: http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/cps615fft00.ppt • FFTW project http://www.fftw.org • Spiral project http://www.spiral.net CS267 Lecture 21

  4. Poisson’s equation arises in many models 3D: 2u/x2 + 2u/y2 + 2u/z2 = f(x,y,z) • Electrostatic or Gravitational Potential: Potential(position) • Heat flow: Temperature(position, time) • Diffusion: Concentration(position, time) • Fluid flow: Velocity,Pressure,Density(position,time) • Elasticity: Stress,Strain(position,time) • Variations of Poisson have variable coefficients f represents the sources; also need boundary conditions 2D: 2u/x2 + 2u/y2 = f(x,y) 1D: d2u/dx2 = f(x) CS267 Lecture 21

  5. Algorithms for 2D (3D) Poisson Equation (N = n2(n3) vars) Algorithm Serial PRAM Memory #Procs • Dense LU N3 N N2 N2 • Band LU N2 (N7/3) N N3/2 (N5/3) N (N4/3) • Jacobi N2 (N5/3) N (N2/3) N N • Explicit Inv. N2 log N N2 N2 • Conj.Gradients N3/2 (N4/3) N1/2(1/3) *log N N N • Red/Black SOR N3/2 (N4/3) N1/2 (N1/3) N N • Sparse LU N3/2 (N2)N1/2 N*log N (N4/3) N • FFT N*log N log N N N • Multigrid N log2 N N N • Lower bound N log N N PRAM is an idealized parallel model with zero cost communication Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997. CS267 Lecture 21

  6. Solving Poisson’s Equation with the FFT • Express any 2D function defined in 0  x,y  1 as a series(x,y) = SjSkjk sin(p jx) sin(p ky) • Here jk are called Fourier coefficient of (x,y) • The inverse of this is:jk = 4 (x,y) sin(p jx) sin(p ky) • Poisson’s equation 2  /x2 + 2  /y2 = f(x,y) becomes SjSk (-p2j2 - p2k2) jk sin(p jx) sin(p ky) = SjSk fjk sin(p jx) sin(p ky) • where fjk are Fourier coefficients of f(x,y) • andf(x,y) = SjSkfjk sin(p jx) sin(p ky) • This implies PDE can be solved exactly algebraically, jk = fjk /(-p2j2 - p2k2) CS267 Lecture 21

  7. Solving Poisson’s Equation with the FFT • So solution of Poisson’s equation involves the following steps • 1) Find the Fourier coefficients fjk of f(x,y) by performing integral • 2) Form the Fourier coefficients of  by jk = fjk /(-p2j2 - p2k2) • 3) Construct the solution by performing sum (x,y) • There is another version of this (Discrete Fourier Transform) which deals with functions defined at grid points and not directly the continuous integral • Also the simplest (mathematically) transform uses exp(-2pijx) not sin(p jx) • Let us first consider 1D discrete version of this case • PDE case normally deals with discretized functions as these needed for other parts of problem CS267 Lecture 21

  8. Serial FFT • Let i=sqrt(-1) and index matrices and vectors from 0. • The Discrete Fourier Transform of an m-element vector v is: • F*v • Where F is the m*m matrix defined as: • F[j,k] = v(j*k) • Where v is: • v = e (2pi/m) = cos(2p/m) + i*sin(2p/m) • v is a complex number with whose mth power vm =1 and is therefore called an mth root of unity • E.g., for m = 4: • v = i, v2 = -1, v3 = -i, v4 = 1, CS267 Lecture 21

  9. Using the 1D FFT for filtering • Signal = sin(7t) + .5 sin(5t) at 128 points • Noise = random number bounded by .75 • Filter by zeroing out FFT components < .25 CS267 Lecture 21

  10. Using the 2D FFT for image compression • Image = 200x320 matrix of values • Compress by keeping largest 2.5% of FFT components • Similar idea used by jpeg CS267 Lecture 21

  11. Related Transforms • Most applications require multiplication by both F and inverse(F). • Multiplying by F and inverse(F) are essentially the same. (inverse(F) is the complex conjugate of F divided by n.) • For solving the Poisson equation and various other applications, we use variations on the FFT • The sin transform -- imaginary part of F • The cos transform -- real part of F • Algorithms are similar, so we will focus on the forward FFT. CS267 Lecture 21

  12. Serial Algorithm for the FFT • Compute the FFT of an m-element vector v, F*v (F*v)[j] = S F(j,k) * v(k) = Sv(j*k) * v(k) = S (vj)k * v(k) = V(v j) • Where V is defined as the polynomial V(x) = S xk * v(k) m-1 k = 0 m-1 k = 0 m-1 k = 0 m-1 k = 0 CS267 Lecture 21

  13. Divide and Conquer FFT • V can be evaluated using divide-and-conquer V(x) = S (x)k * v(k) = v[0] + x2*v[2] + x4*v[4] + … + x*(v[1] + x2*v[3] + x4*v[5] + … ) = Veven(x2) + x*Vodd(x2) • V has degree m-1, so Veven and Vodd are polynomials of degree m/2-1 • We evaluate these at points (v j)2 for 0<=j<=m-1 • But this is really just m/2 different points, since (v(j+m/2) )2 = (vj *v m/2) )2 = v2j *v m = (vj)2 • So FFT on m points reduced to 2 FFTs on m/2 points • Divide and conquer! m-1 k = 0 CS267 Lecture 21

  14. Divide-and-Conquer FFT FFT(v, v, m) if m = 1 return v[0] else veven = FFT(v[0:2:m-2], v 2, m/2) vodd = FFT(v[1:2:m-1], v 2, m/2) v-vec = [v0, v1, … v(m/2-1) ] return [veven + (v-vec .* vodd), veven - (v-vec .* vodd) ] • The .* above is component-wise multiply. • The […,…] is construction an m-element vector from 2 m/2 element vectors This results in an O(m log m) algorithm. precomputed CS267 Lecture 21

  15. An Iterative Algorithm • The call tree of the d&c FFT algorithm is a complete binary tree of log m levels • An iterative algorithm that uses loops rather than recursion, goes each level in the tree starting at the bottom • Algorithm overwrites v[i] by (F*v)[bitreverse(i)] • Practical algorithms combine recursion (for memory hiearchy) and iteration (to avoid function call overhead) FFT(0,1,2,3,…,15) = FFT(xxxx) even odd FFT(0,2,…,14) = FFT(xxx0) FFT(1,3,…,15) = FFT(xxx1) FFT(xx00) FFT(xx10) FFT(xx01) FFT(xx11) FFT(x000) FFT(x100) FFT(x010) FFT(x110) FFT(x001) FFT(x101) FFT(x011) FFT(x111) FFT(0) FFT(8) FFT(4) FFT(12) FFT(2) FFT(10) FFT(6) FFT(14) FFT(1) FFT(9) FFT(5) FFT(13) FFT(3) FFT(11) FFT(7) FFT(15) CS267 Lecture 21

  16. Parallel 1D FFT • Data dependencies in 1D FFT • Butterfly pattern • A PRAM algorithm takes O(log m) time • each step to right is parallel • there are log m steps • What about communication cost? • See LogP paper for details CS267 Lecture 21

  17. Block Layout of 1D FFT • Using a block layout (m/p contiguous elts per processor) • No communication in last log m/p steps • Each step requires fine-grained communication in first log p steps CS267 Lecture 21

  18. Cyclic Layout of 1D FFT • Cyclic layout (only 1 element per processor, wrapped) • No communication in first log(m/p) steps • Communication in last log(p) steps CS267 Lecture 21

  19. Parallel Complexity • m = vector size, p = number of processors • f = time per flop = 1 • a = startup for message (in f units) • b = time per word in a message (in f units) • Time(blockFFT) = Time(cyclicFFT) = 2*m*log(m)/p + log(p) * a + m*log(p)/p * b CS267 Lecture 21

  20. FFT With “Transpose” • If we start with a cyclic layout for first log(p) steps, there is no communication • Then transpose the vector for last log(m/p) steps • All communication is in the transpose • Note: This example has log(m/p) = log(p) • If log(m/p) > log(p) more phases/layouts will be needed • We will work with this assumption for simplicity CS267 Lecture 21

  21. Why is the Communication Step Called a Transpose? • Analogous to transposing an array • View as a 2D array of n/p by p • Note: same idea is useful for uniprocessor caches CS267 Lecture 21

  22. Complexity of the FFT with Transpose • If no communication is pipelined (overestimate!) • Time(transposeFFT) = 2*m*log(m)/p same as before + (p-1) * a was log(p) * a + m*(p-1)/p2 * b was m* log(p)/p* b • If communication is pipelined, so we do not pay for p-1 messages, the second term becomes simply a, rather than (p-1)a. • This is close to optimal. See LogP paper for details. • See also following papers on class resource page • A. Sahai, “Hiding Communication Costs in Bandwidth Limited FFT” • R. Nishtala et al, “Optimizing bandwidth limited problems using one-sided communication” CS267 Lecture 21

  23. Comment on the 1D Parallel FFT • The above algorithm leaves data in bit-reversed order • Some applications can use it this way, like Poisson • Others require another transpose-like operation • Other parallel algorithms also exist • A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman) • Based on the Fast Multipole algorithm • Less communication for non-bit-reversed algorithm CS267 Lecture 21

  24. Higher Dimension FFTs • FFTs on 2 or 3 dimensions are define as 1D FFTs on vectors in all dimensions. • E.g., a 2D FFT does 1D FFTs on all rows and then all columns • There are 3 obvious possibilities for the 2D FFT: • (1) 2D blocked layout for matrix, using 1D algorithms for each row and column • (2) Block row layout for matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs • (3) Block row layout for matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns • Option 2 is best, if we overlap communication and computation • For a 3D FFT the options are similar • 2 phases done with serial FFTs, followed by a transpose for 3rd • can overlap communication with 2nd phase in practice CS267 Lecture 21

  25. FFTW – Fastest Fourier Transform in the West • www.fftw.org • Produces FFT implementation optimized for • Your version of FFT (complex, real,…) • Your value of n (arbitrary, possibly prime) • Your architecture • Close to optimal for serial, can be improved for parallel • Similar in spirit to PHIPAC/ATLAS/Sparsity • Won 1999 Wilkinson Prize for Numerical Software • Widely used for serial FFTs • Had parallel FFTs in version 2, but no longer supporting them • Layout constraints from users/apps + network differences are hard to support CS267 Lecture 21

  26. Bisection Bandwidth • FFT requires one (or more) transpose operations: • Ever processor send 1/P of its data to each other one • Bisection Bandwidth limits this performance • Bisection bandwidth is the bandwidth across the narrowest part of the network • Important in global transpose operations, all-to-all, etc. • “Full bisection bandwidth” is expensive • Fraction of machine cost in the network is increasing • Fat-tree and full crossbar topologies may be too expensive • Especially on machines with 100K and more processors • SMP clusters often limit bandwidth at the node level CS267 Lecture 21

  27. LogGP: no overlap P0 osend L orecv P1 Modified LogGP Model • LogGP: no overlap P0 g P1 EEL: end to end latency (1/2 roundtrip) g: minimum time between small message sends G: additional gap per byte for larger messages CS267 Lecture 21

  28. Historical Perspective ½ round-trip latency • Potential performance advantage for fine-grained, one-sided programs • Potential productivity advantage for irregular applications CS267 Lecture 21

  29. General Observations • The overlap potential is the difference between the gap and overhead • No potential if CPU is tied up throughout message send • E.g., no send size DMA • Grows with message size for machines with DMA (per byte cost is handled by network) • Because per-Byte cost is handled by NIC • Grows with amount of network congestion • Because gap grows as network becomes saturated • Remote overhead is 0 for machine with RDMA CS267 Lecture 21

  30. GASNet Communications System GASNet offers put/get communication • One-sided: no remote CPU involvement required in API (key difference with MPI) • Message contains remote address • No need to match with a receive • No implicit ordering required Compiler-generated code • Used in language runtimes (UPC, etc.) • Fine-grained and bulk xfers • Split-phase communication Language-specific runtime GASNet Network Hardware CS267 Lecture 21

  31. Performance of 1-Sided vs 2-sided Communication: GASNet vs MPI • Comparison on Opteron/InfiniBand – GASNet’s vapi-conduit and OSU MPI 0.9.5 • Up to large message size (> 256 Kb), GASNet provides up to 2.2X improvement in streaming bandwidth • Half power point (N/2) differs by one order of magnitude CS267 Lecture 21

  32. (up is good) GASNet: Performance for mid-range message sizes GASNet usually reaches saturation bandwidth before MPI - fewer costs to amortize Usually outperform MPI at medium message sizes - often by a large margin CS267 Lecture 21

  33. NAS FT Case Study • Performance of Exchange (Alltoall) is critical • Communication to computation ratio increases with faster, more optimized 1-D FFTs • Determined by available bisection bandwidth • Between 30-40% of the applications total runtime • Two ways to reduce Exchange cost 1. Use a better network (higher Bisection BW) 2. Overlap the all-to-all with communication (where possible) – “break up” the exchange Default NAS FT Fortran/MPI relies on #1 Our approach uses UPC/GASNet and builds on #2 • Started as CS267 project • 1D partition of 3D grid is a limitation • At most N processors for N^3 grid • HPC Challenge benchmark has large 1D FFT (can be viewed as 3D or more with proper roots of unity) CS267 Lecture 21

  34. 3D FFT Operation with Global Exchange • Single Communication Operation (Global Exchange) sends THREADS large messages • Separate computation and communication phases 1D-FFT Columns Transpose + 1D-FFT (Rows) 1D-FFT (Columns) Cachelines 1D-FFT Rows send to Thread 0 Exchange (Alltoall) Transpose + 1D-FFT send to Thread 1 Divide rows among threads send to Thread 2 Last 1D-FFT (Thread 0’s view) CS267 Lecture 21

  35. Communication Strategies for 3D FFT chunk = all rows with same destination • Three approaches: • Chunk: • Wait for 2nd dim FFTs to finish • Minimize # messages • Slab: • Wait for chunk of rows destined for 1 proc to finish • Overlap with computation • Pencil: • Send each row as it completes • Maximize overlap and • Match natural layout pencil = 1 row slab = all rows in a single plane with same destination Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea CS267 Lecture 21

  36. Decomposing NAS FT Exchange into Smaller Messages • Three approaches: • Chunk: • Wait for 2nd dim FFTs to finish • Slab: • Wait for chunk of rows destined for 1 proc to finish • Pencil: • Send each row as it completes • Example Message Size Breakdown for • Class D (2048 x 1024 x 1024) • at 256 processors CS267 Lecture 21

  37. Overlapping Communication • Goal: make use of “all the wires” • Distributed memory machines allow for asynchronous communication • Berkeley Non-blocking extensions expose GASNet’s non-blocking operations • Approach: Break all-to-all communication • Interleave row computations and row communications since 1D-FFT is independent across rows • Decomposition can be into slabs (contiguous sets of rows) or pencils (individual row) • Pencils allow: • Earlier start for communication “phase” and improved local cache use • But more smaller messages (same total volume) CS267 Lecture 21

  38. NAS FT: UPC Non-blocking MFlops • Berkeley UPC compiler support non-blocking UPC extensions • Produce 15-45% speedup over best UPC Blocking version • Non-blocking version requires about 30 extra lines of UPC code CS267 Lecture 21

  39. NAS FT Variants Performance Summary • Shown are the largest classes/configurations possible on each test machine • MPI not particularly tuned for many small/medium size messages in flight (long message matching queue depths) CS267 Lecture 21

  40. Pencil/Slab optimizations: UPC vs MPI • Same data, viewed in the context of what MPI is able to overlap • “For the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation” • On Infiniband, UPC overlaps almost all the time the MPI spends in communication • On Elan3, UPC obtains more overlap than MPI as the problem scales up CS267 Lecture 21

  41. Summary of Overlap in FFTs • One-sided communication has performance advantages • Better match for most networking hardware • Most cluster networks have RDMA support • Machines with global address space support (X1, Altix) shown elsewhere • Smaller messages may make better use of network • Spread communication over longer period of time • Postpone bisection bandwidth pain • Smaller messages can also prevent cache thrashing for packing • Avoid packing overheads if natural message size is reasonable CS267 Lecture 21

  42. free software:http://www.fftw.org/ the “Fastest Fourier Tranform in the West” FFTW • C library for real & complex FFTs (arbitrary size/dimensionality) (+ parallel versions for threads & MPI) • Computational kernels (80% of code) automatically generated • Self-optimizes for your hardware (picks best composition of steps) = portability + performance CS267 Lecture 21

  43. FFTW performancepower-of-two sizes, double precision 833 MHz Alpha EV6 2 GHz PowerPC G5 500 MHz Ultrasparc IIe 2 GHz AMD Opteron CS267 Lecture 21

  44. FFTW performancenon-power-of-two sizes, double precision unusual: non-power-of-two sizes receive as much optimization as powers of two 833 MHz Alpha EV6 2 GHz AMD Opteron …because we let the code do the optimizing CS267 Lecture 21

  45. FFTW performancedouble precision, 2.8GHz Pentium IV:2-way SIMD (SSE2) powers of two exploiting CPU-specific SIMD instructions (rewriting the code) is easy non-powers-of-two …because we let the code write itself CS267 Lecture 21

  46. 3 1 2 Why is FFTW fast?three unusual features FFTW implements many FFT algorithms: A planner picks the best composition by measuring the speed of different combinations. The resulting plan is executed with explicit recursion: enhances locality The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose “compiler” CS267 Lecture 21

  47. Key fact: usually, many transforms of same size are required. FFTW is easy to use { complex x[n]; plan p; p = plan_dft_1d(n, x, x, FORWARD, MEASURE); ... execute(p); /* repeat as needed */ ... destroy_plan(p); } CS267 Lecture 21

  48. Why is FFTW fast?three unusual features FFTW implements many FFT algorithms: A planner picks the best composition by measuring the speed of different combinations. 3 The resulting plan is executed with explicit recursion: enhances locality 1 The base cases of the recursion are codelets: highly-optimized dense code automatically generated by a special-purpose “compiler” 2 CS267 Lecture 21

  49. But traditional implementation is non-recursive, breadth-first traversal: log2n passes over whole array FFTW Uses Natural Recursion Size 8 DFT p = 2 (radix 2) Size 4 DFT Size 4 DFT Size 2 DFT Size 2 DFT Size 2 DFT Size 2 DFT CS267 Lecture 21

  50. Traditional cache solution: Blocking Size 8 DFT p = 2 (radix 2) Size 4 DFT Size 4 DFT Size 2 DFT Size 2 DFT Size 2 DFT Size 2 DFT breadth-first, but with blocks of size = cache …requires program specialized for cache size CS267 Lecture 21

More Related