Parallel FFT and Use in PDE Solvers

CPS615 Computational Science Spring Semester 2000 Geoffrey Fox Northeast Parallel Architectures Center Syracuse University 111 College Place Syracuse NY gcf@npac.syr.edu gcf@cs.fsu.edu Parallel FFT and Use in PDE Solvers cps615fft00

Abstract of CPS615 FFT Lectures • We start by motivating the FFT (Fast Fourier Transform) as a solver for Poisson’s equation • The we discuss sequential 1D discrete FFT in both DIF (Decimation in Frequency) and DIT (Decimation in Time) modes • We describe general N=N1*N2 case but soon specialize to binary (Cooley Tukey) FFT. • For binary case, we go through parallelism and use of cache giving a performance analysis • These lectures motivate later lectures on Fast Multipole method as general Green’s function solver which is more flexible than FFT cps615fft00

References for Fast Fourier Transform • Lecture 24 Demmel 1999 CS267 UC Berkeley • http://www.cs.berkeley.edu/~demmel/cs267_Spr99 • FFTW (Fastest Fourier Transform in the West) at http://www.fftw.org • Seems to be best (sequential) FFT routine and has a lot of links to other sequential and parallel one and multidimensional FFT’s • http://www.eptools.com by Engineering Productivity Tools Inc. has a good tutorial on the FFT • Aimed at Signal Processing but basic description generally useful • Chapter 11 of “Solving Problems on Concurrent Processors, Volume 1”, Prentice Hall 1988. (Chapter written by John Salmon) • Fox, Johnson, Lyzenga, Otto, Salmon, Walker cps615fft00

Solving Poisson’s Equation with the FFT I • Express any 2D function defined in 0  x,y  1 as a series(x,y) = SjSkjk sin(p jx) sin(p ky) • Here jk are called Fourier coefficient of (x,y) • The inverse of this is:jk = 4 (x,y) sin(p jx) sin(p ky) • Poisson’s equation 2  /x2 + 2  /y2 = f(x,y) becomes SjSk (-p2j2 - p2k2) jk sin(p jx) sin(p ky) = SjSk fjk sin(p jx) sin(p ky) • where fjk are Fourier coefficients of f(x,y) • andf(x,y) = SjSkfjk sin(p jx) sin(p ky) • This implies PDE can be solved exactly algebraically, jk = fjk /(-p2j2 - p2k2) cps615fft00

Solving Poisson’s Equation with the FFT II • So solution of Poisson’s equation involves the following steps • 1) Find the Fourier coefficients fjk of f(x,y) by performing integral • 2) Form the Fourier coefficients of  by jk = fjk /(-p2j2 - p2k2) • 3) Construct the solution by performing sum (x,y) • There is another version of this (Discrete Fourier Transform) which deals with functions defined at grid points and not directly the continuous integral • Also the simplest (mathematically) transform uses exp(-2pijx) not sin(p jx) • Let us first consider 1D discrete version of this case • PDE case normally deals with discretized functions as these needed for other parts of problem cps615fft00

Discrete Fourier Transform • Let f be in 1D a typical function defined on a grid and labeled with index m= 0 … N-1 • Let i = sqrt(-1) and index matrices and vectors from 0. • The Discrete Fourier TransformG(f) of an N-element vector f is another vector of length N given by Matrix Vector Multiplication  f • Where  is the N*N matrix with matrix elements: • km = v(-k*m) • and v is: • v = e (2pi/N) = cos(2p/N) + i*sin(2p/N) • This is a complex number with whose Nth power is 1 and is therefore called the Nth root of unity • E.g., for N = 4: • v = 0+1*i, v2 = -1+0*i, v3 = 0-1*i, v4 = 1+0*i, cps615fft00

Inverse and other FFT Transforms • f can be reconstructed from its discrete Fourier transformation G(f) by • f = *. G(f)/N where * denotes complex conjugation • * is anN byNmatrix and G(f) is a vector of length N • Most applications require both calculating FFT’s and reconstructing functions from their FFT’s • However these are essentially the same algorithm as seen above and so we only need to illustrate one case • Issues with parallelism and optimal performance are identical • For solving the Poisson equation and various other applications, we use variations on the FFT • The sin transform – use imaginary part of G • The cos transform – use real part of G cps615fft00

The 1D FFT in explicit detail I • The basic discrete FFT is: • This in a naïve analysis takes O(N2) complex floating point additions and multiplications. However we can group terms together to reduce this to a time O(NlogN). • This can be done in many ways and starts with a factorization N=N1N2 (product of integers) and a recursive iteration of factorization to values N1, N2 where special algorithms exist. • The power of this general approach is made clear at FFTW web site • eptools web site describes case N1=2, 4 and a general prime cps615fft00

The 1D FFT in explicit detail II • There are two special cases of particular importance • Decimation in frequency (DIF) N1=2 N2=N/2 where frequency space is GN( k,f ) • Decimation in time (DIT) N1=N/2 N2=2 where time domain is f(m) • These are used in cases where N=2dis a power of 2 and decimations can be applied recursively d times. • You can apply N=2dto a general value of N by taking say N=773 and adding 251 zeros to make it equal to 1024. • This can be wasteful especially in multi dimensional transforms where you waste space in x y and z directions • The binary FFT illustrates issues and has much simpler algebra -- FFTW uses a special purpose compiler to cope with the general case cps615fft00

DIF Manipulation of the exponential I • Consider the exponential term and divide indices k and m into two terms • Then term is just what we need for an FFT of size N/2 cps615fft00

DIF Manipulation of the exponential II • We now will see that factor S with give us a sum of two such FFT’s with particular algebraic forms for the functions to be Fourier transformed • Now is +1 when =0 • and -1 when =1 • Further cps615fft00

Recursive Formula for DIF • Define two functions fE fO of half the size of f ( i.e.vectors of length N/2) in terms of whose FFT we will express the FFT of f • fE fO are gotten by summing and multiplying as instructed by function S on previous foil cps615fft00

The Computational Complexity of DIF FFT • We can now apply the same idea recursively; chop off the lowest binary digit of k so as to derive a formula for each FFT in terms of two more half the size. • Let C(N) be computational complexity of the FFT of size N • Let Tbutterfly = 2T+ + T* be the time to calculate fEand fO for one value of m- • Here T+is time for a complexaddition andT* the time for a complex multiplication. • We assume that complex number TN(m-) has been precalculated and is available by table lookup • Then C(N) = 2C(N/2) + (N/2) Tbutterfly • This can be solved as C(N) =NlogNTbutterfly/2 cps615fft00

DIT Manipulation of the exponential I • Consider the exponential term again and divide indices k and m into two terms in a different way • Then again term is just what we need for an FFT of size N- =N/2 cps615fft00

DIT Manipulation of the exponential II • We now will see that factor S with give us a sum of two different (from DIF) N/2 length FFT’s • Now is +1 when =0 • and -1 when =1 • Further cps615fft00

Recursive Formula for DIT • Define two (different from DIF) functions fE fO of half the size of f ( i.e.vectors of length N/2) in terms of whose FFT we will express the FFT of f • fE fO are gotten by summing and multiplying as instructed by function S on previous foil cps615fft00

FFT(0,1,2,3,…,15) = FFT(xxxx) fE fO FFT(0,2,…,14) = FFT(xxx0) FFT(1,3,…,15) = FFT(xxx1) FFT(xx00) FFT(xx10) FFT(xx01) FFT(xx11) FFT(x000) FFT(x100) FFT(x010) FFT(x110) FFT(x001) FFT(x101) FFT(x011) FFT(x111) FFT(0) FFT(8) FFT(4) FFT(12) FFT(2) FFT(10) FFT(6) FFT(14) FFT(1) FFT(9) FFT(5) FFT(13) FFT(3) FFT(11) FFT(7) FFT(15) Recursive Structure for DIT • The recursive structure of the binary FFT algorithm is a complete binary tree of logN levels shown below for N=16 • At each phase, one relates the FFT of M numbers to FFT of even and odd subsets fEfO of this set -- these are of size M/2 • Diagram shows binary digit (XXXX) of relevant elements • e.g. XXX1 are all odd indices m; XX11 is set 4m-+3where m- lies between 0 and 3 cps615fft00

…… …… m2 kd-1 k2 md-1 md-2 m1 k1 kd-2 md-3 k0 kd-3 m0 What’s Going On with DIT/DIF I • DIT and DIF have identical computational complexity analyses at the superficial level we did so far • We will see some important features when we analyze where the data is stored and issues of good cache use and parallelism • First we will discuss a way of thinking about binary FFT’s which will make the future discussion much clearer • Critical idea is to represent indices by their binary representation shown below as d digit words for k and m: • e.g. k as a binary word m as a binary word cps615fft00

…… …… md-1 m2 md-1 m2 m1 md-2 m1 md-2 md-3 kd-1 m0 md-3 What’s Going On with DIT/DIF II • Here of course vector size N = 2d • For DIT, the first step, manipulates (in “butterfly fashion”), the lowest component m0 for time and creates the highest binary digit kd-1 for frequency domain. • Note that most FFT implementation overwrite f[m] by its FFT GN(k,f) as they go along • This implies that we start with f(m) where m labeled like • And end with that start of a recursive FFT labeled like: • Note characteristic bit reversal in storage of GN(k,f) “Wrong End” cps615fft00

…… …… kd-3 md-1 m2 k0 kd-2 k1 m1 md-2 k2 kd-1 md-3 m0 What’s Going On with DIT/DIF III • We iterate this idea with phases p running from 0 to d-1. At the p’th phase, one is manipulating the p’th digit of f(m) and forming the d-1-p’th index of GN(k,f). • Thus starting with f(m) with binary labeling: • One ends up with GN(k,f) with totally reversed index scheme • Note that in recursive mode, one starts forming recursion with phase p=0 but first computations are done at phase p=d-1 where one is manipulating f(m) with highest binary index md-1 and forming GN(k,f) with binary index k0 cps615fft00

Butterfly Patternin 1D DIT FFT • Data dependencies in 1D FFT show a characteristic butterfly structure • We show 4 processing phases p • At phase p for DIT, one manipulates p’th digit of f(m) Index m with binary labels Phase 3 2 1 0 cps615fft00

... … md-1 md-2 md-3 md-P+1 md-P m2 m1 m0 md-P-1 Basic Parallel FFT Algorithms I • Traditionally algorithms are iterative, going across each phase in the tree starting with all the N/2 2 point FFT’s at phase d-1 and ending with combining two N/2 point FFT’s at phase 0. • However FFTW papers at their web site points out that most such implementations make poor use of cache; recursive version can be better in cache use and we will describe this in general later • We can decompose in many ways but here we take the simplest case of block decomposition illustrated for N=16 on next foil. • Let Nproc = 2P be the number of processors in the parallel computer • Consider the d digits in binary representation of m • The bottom d-P digits md-P-1 through m0 representation location inside processor labeled by top P digits md-1 through md-P Pdigits labeling 2P processors d-P digits labeling position in processor cps615fft00

Basic Parallel FFT Algorithms II • Remembering that first computations are performed at phase d-1, we see that one needs communication for the first few (Pphases) as dependency lines cross boundaries. For the final phases all the information needed for FFT component is stored within processor. • So communication needed for phases d-1 through d-P • These are phases labeled by the digits used to label Processor and not position in the processor • No communication at all for phases d-P-1 through 0 • Note end result -- the FFT GN(k,f) -- is bit reversed: we will discuss later the communication cost of undoing bit reversal but • In many cases, bit reversal is unimportant and FFT does not need to be reversed. • For instance in solving Poisson’s equation, FFT is followed by inverse FFT and in final result, there is no bit reversal cps615fft00

Binary Representation of Index m Processor Number Phase 3 2 1 0 Parallelismin 1D FFT 0 • Consider 4 Processors and natural block decomposition • note phases 2 and 3 have communication as dependency lines cross processor boundaries Processor Boundary 1 Processor Boundary 2 Processor Boundary 3 cps615fft00

Performance of Simplest Parallel DIT FFT I • As discussed we have two types of phases • The sequential time Tsequential (phase p)is identical for each phase p at NTbutterfly/2 • Tparallel (phases 0 p d-P-1) = NTbutterfly/(2Nproc) as perfectly parallel (load balanced) • At the remaining stages, one must communicate in each of computations where fE and fO(DIT)are exchanged between two processors. This must be done for every one of the N/Nproc points stored in each processor cps615fft00

Performance of Simplest Parallel DIT FFT II • This shows how processor holding fE fetches fO and forms in-place (i.e. in location originally holding fE) the FFT component with kd-q-1 = 0 (stored in place mq=0). • Similarily processor holding fOforms the FFT component with kd-q-1 = 1. cps615fft00

Performance of Simplest Parallel DIT FFT III • Tparallel (phases d-P p< d) = N(Tcalc+Tcomm)/(Nproc) and again this is perfectly parallel (load balanced) • Tcommis time to swap one complex word between 2 processors (to reduce latency this would be done as a block transfer of N/Nproc words) • Tcalc =T+ + T* as each processor must multiply and add/subtract. Note this implies that one loses parallelism in multiplication as both processors must do it. • Communication overhead fcomm = Tparallel *Nproc/Tsequential -1 cps615fft00

Performance of Simplest Parallel DIT FFT IV • One can however for DIT (and DIF) get much better performance if one avoids the “owner computes rule” and changes the processor which calculates a given FFT component. • This can seen by examining a typical step in a phase of parallel FFT where communication is needed in current algorithm. Consider any two processors -- called a and b --- which need to swap data in current algorithm • ahas vector fa and b has a vector fb-- bothof length N/Nproc • Currently we swap N/Nproc numbers by sending fb to a and fato b cps615fft00

Performance of Simplest Parallel DIT FFT V • Alternatively we could send even members of fa to b and odd indexed entries in fb to a. • We would take resultant vector members in processors a and b and combine them in pairs to get FFT components • Communication overhead fcomm = Tparallel *Nproc/Tsequential -1 is now given by We have avoided load imbalance and halved the communication compared to simple algorithm. In later foils we will find even better methods that get rid of log2N term cps615fft00

What’s Going On with DIT/DIF IV • For DIF, the phases are of course reversed in the way the step through the binary digits of f(m) and GN(k,f) • One still has d phases labeled by p • At first recursionp=0 one is manipulating f(m) with highest binary index md-1and forming GN(k,f) with binary index k0 • The first DIF computations are done atend of recursion and this is “natural” m=0 and kd-1index for GN(k,f) • DIF has an identical parallel performance to DIT although some of the steps differ in detail cps615fft00

f(m) In location …… md-1 m2 m1 md-2 m0 md-3 …… mp(d-1) mp(d-2) mp(d-3) mp(2) mp(1) mp(0) Sequential and Parallel Performance I • Now to understand performance of general algorithms in parallel or in caches, one must understand storage of the arrays. • Note first that labeling is not necessarily same as storage • if p(l) is any permutation of l=0…d-1, then we can store • This can be used to improve sequential performance by ensuring that when a given digit l is being processed, its storage digit p(l) is in range 0 p(l) < L, where the cache can hold 2L entries cps615fft00

Sequential and Parallel Performance II • This implies a reordering of computations, so that L digits are done at a time and one fixes storage indices d-1, d-2, d-3, .. L and calculates 2L point FFT on lower indices • This 2L point FFT can be calculated by iterating 2 point FFT’s as in Cooley Tukey or by some custom FFT of this size as in FFTW • Clearly this makes excellent use of cache and each data point is used L times (multiplied by number of times used in basic single digit FFT -- 2.5 times on average per 1 floating point word) • So basic operation needed is one of reordering: • Move L digits from current positions to another set of L positions • Assume these are not overlapping • Time taken is 2(d-P).2(complex floats) . (1-2L) with one memory read and one write each • This re-ordering is also algorithm used to undo bit reversal if needed cps615fft00

.. ... ... …. mg(0) mg(1) mg(L-1) mL-1 mL-2 m0 md-1 md-2 md-3 Sequential and Parallel Performance III • Interesting point about swapping time is that it is independent of L except for the factor (1-2L) which is a factor of two increase from L=1 to largest L • After spending this time, one can then do “in-cache” computations even in an iterative algorithm • Computations are larger by a factor L (roughly log of Cache Size) compared to cost of re-ordering cps615fft00

... ... …. ... md-P mP-1 mP-2 mP-3 m0 md-1 md-2 md-3 Sequential and Parallel Performance IV • We can use the same idea to improve the parallel algorithm • Now the top P=logNproc digits label processor number and we transfer these to digits stored inside processor. This involves communication but after this transfer, all computations are internal to a processor • Total Communication time is 2(d-P). (1-2L) . Tcomm cps615fft00

Sequential and Parallel Performance V • An interesting point about the resultant communication overhead is that one no longer gets the logNprocdependence in the numerator • Remember that in D dimensions, we found fcommproportional to n-1/D where n is grain size • For FTT fcomm is proportional to 1/log2n which is “standard” formula with D= cps615fft00

…… ….. ….. xP-1 x2 xP-2 x1 x0 xP-3 x0 x0 x2 x2 xP-1 xP-1 xP-2 xP-2 x1 x1 1 0 Parallel FFT and Hypercubes I • In days gone by, We used to use hypercube topology for parallel machines • Such machines have 2P processor nodes labeled • and the communication channels are such that every node x is connected to P other nodes -- connections run to those nodes whose binary representations differs in just one digit. For instance and are connected cps615fft00

Parallel FFT and Hypercubes II • The relevance of this to Cooley Tukey FFT can be seen from basic recursive step which precisely involves linking Fourier coefficients which also just differ by one in a single binary digit • Hypercube had other useful features e.g. the 64 node 6 dimensional hypercube includes meshes (8 by 8, 4 by 4 by 4, 2 by 8 by 4 etc.) • However nowadays one builds general purpose switches (communication systems) and does not focus on particular communication topologies Seitz and Fox with Caltech 64 node Hypercube 1983 Hypercube Topology for 8 machines cps615fft00

…. …. md-d2 md1-1 md1-2 md1-3 m0 md-1 md-2 md-3 Multi Dimensional FFT’s • Multi-dimensional FFT’s are very common and the most important for parallel computing as they are more computationally intense • Images naturally give two dimensional FFT’s • Take a 2d2 by 2d1 FFT in two dimensions and label it by a d digit binary number with d=d1+d2 • Then all discussions are essentially identically to one dimensional FFT with dimension d • The algorithm is different but communication issues and strategy of re-ordering binary index are same • In particular one would normally first do the d1 transform in say low d1 bits and then swap upper d2 bits into lower part of d digit label -- now do d2 digit transform. This way one again only gets communication in swap and otherwise one has perfectly one dimensional parallel FFT’s inside each processor cps615fft00

Parallel FFT and Use in PDE Solvers

Parallel FFT and Use in PDE Solvers

Presentation Transcript

Parallelizing finite element PDE solvers in an object-oriented framework

CSE 690: GPGPU Lecture 8: Image Processing PDE Solvers

Use of SMT Solvers in Verification

Towards Optimal Solvers for PDE-Constrained Optimization

Scalable Solvers and Software for PDE Applications

A Software Framework for Easy Parallelization of PDE Solvers

Parallel FFT

11. Use FFT to Design DF

Fast Parallel and Adaptive Updates for Dual-Decomposition Solvers

Parallel Method for solving a Coupled PDE

Scalable Solvers and Software for PDE Applications

Real-time multi-core PDE-solvers in LabVIEW

FFT in Hardware and Software

Use Parallel Lines and Transversals

FFT

Polynomials and FFT

Parallel Processing Final Project Parallel FFT using to solve Poisson’s Equation

Scalable Solvers and Software for PDE Applications

Parallel FFT in Julia

Parallel FFT

A Software Framework for Easy Parallelization of PDE Solvers