Communication Lower Bound for the Fast Fourier Transform Michael Anderson

Communication Lower Bound for the Fast Fourier Transform Michael Anderson Communication-Avoiding Algorithms (CS294) Fall 2011

Sources
• J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In STOC '81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326--333, New York, NY, USA, 1981. ACM.
• J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In COCOON, pages 270--281, 1995.
• CS256 Applied Theory of Computation, Brown University. Lecture 18 (http://www.cs.brown.edu/courses/csci2560/lectures/lect.18.MemoryHierarchyIII.pdf)
• J. E. Savage. Models of Computation: Exploring the Power of Computing.
• A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116--1127, 1988.

Outline
• Fast Fourier Transform
• Lower bound
• Two-level pebble game
• S-span
• Upper bound
• Multilevel pebble game
• Open Problems

Discrete Fourier Transform Output Vector Input Vector
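For reference, one common convention for the DFT of an input vector x of length N, producing the output vector X, is written out here. The sign of the exponent and the absence of a normalization factor are assumptions, and the symbols x, X, and ω_N are introduced only for this illustration.

```latex
X_k \;=\; \sum_{j=0}^{N-1} x_j \, \omega_N^{jk},
\qquad \omega_N = e^{-2\pi i / N},
\qquad k = 0, 1, \dots, N-1
```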

Unroll Output Vector . . .

Unroll Input Vector

Phrase as Matrix-Vector Multiply INPUT VECTOR OUTPUT VECTOR
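The matrix-vector view can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `dft_matrix` and `dft` are hypothetical, and the sign convention exp(-2πi/N) is an assumption.

```python
import cmath

def dft_matrix(n):
    """Return the n x n DFT matrix F with F[j][k] = w**(j*k), w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)  # assumed sign convention
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def dft(x):
    """Compute the DFT as a dense matrix-vector product: O(n^2) multiply-adds."""
    n = len(x)
    F = dft_matrix(n)
    return [sum(F[j][k] * x[k] for k in range(n)) for j in range(n)]
```

Each output entry is a dot product of one matrix row with the input vector, which is the cost the FFT factorization reduces.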

Factorization
[Figure: a single DFT block mapping INPUT VECTOR to OUTPUT VECTOR]

Factorization
[Figure: INPUT VECTOR and OUTPUT VECTOR with sub-vectors x0 and x1, two DFT blocks, and add/multiply (+*) nodes]

Factorization
[Figure: INPUT VECTOR and OUTPUT VECTOR with four DFT blocks and add/multiply (+*) nodes]

FFT
[Figure: fully factored FFT from INPUT VECTOR to OUTPUT VECTOR, with a "Shuffle" step and a "Compute" step made of add/multiply (+*) nodes]
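The shuffle-then-compute structure can be sketched as an iterative radix-2 FFT. This is a minimal sketch under assumptions: the shuffle is taken to be a bit-reversal permutation, the compute step to be log2(N) levels of butterflies, and the names `bit_reverse_shuffle` and `fft` are hypothetical.

```python
import cmath

def bit_reverse_shuffle(x):
    """Shuffle step: move element i to the index whose bits are i's bits reversed."""
    n = len(x)
    bits = n.bit_length() - 1
    out = list(x)
    for i in range(n):
        r = int(format(i, "b").zfill(bits)[::-1], 2) if bits else 0
        out[r] = x[i]
    return out

def fft(x):
    """Compute step: log2(n) levels, each with n/2 butterflies (one multiply, two adds)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in bit_reverse_shuffle(x)]
    m = 2  # butterfly block size at the current level
    while m <= n:
        w_m = cmath.exp(-2j * cmath.pi / m)  # assumed sign convention
        for start in range(0, n, m):
            w = 1 + 0j
            for j in range(m // 2):
                t = w * a[start + j + m // 2]
                u = a[start + j]
                a[start + j] = u + t
                a[start + j + m // 2] = u - t
                w *= w_m
        m *= 2
    return a
```

The DAG analyzed in the rest of the talk has one vertex per value of `a` at each level, so N log2(N) non-input vertices in total.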


Red Blue (2-level) Pebble Game
• Used to analyze communication in straight-line programs (e.g. matrix multiply, FFT, matrix transpose)
• Played on a DAG. Vertices represent inputs, intermediate data, and operations. Edges represent data dependencies.
• Pebbles represent memory locations. Pebble color represents a distinct level of the memory hierarchy. Placing a pebble on a specific vertex means storing that data element at that level.
• Red Pebble (Fast Memory), Blue Pebble (Slow Memory)

Rules of the Red Blue Pebble Game
• (Initialization) A blue pebble can be placed on any input vertex at any time
• (Input) A red pebble may be placed on any vertex that contains a blue pebble
• (Output) A blue pebble may be placed on any vertex that contains a red pebble
• (Computation) A red pebble can be placed on any vertex if all of its immediate predecessors have red pebbles
• (Deletion) A pebble can be removed at any time
• (Goal) All output vertices contain blue pebbles

Playing the Game
• A pebbling strategy is a sequence of steps in which the rules on the previous slide are used to move pebbles
• The number of red pebbles (size of fast memory) is limited to S (assume infinitely many blue pebbles)
• A communication lower bound (or minimum I/O time) is obtained by bounding from below the number of times the (Input) and (Output) rules are invoked, over all possible pebbling strategies
• The total number of computation steps should also be minimized
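The rules and the I/O count can be made concrete with a small rule checker. This is a hedged sketch, not code from the talk: the class `RedBluePebbleGame`, its method names, and the helper `pebble_one_butterfly` are hypothetical, and the Computation rule is modeled as requiring a free red pebble (no sliding a pebble from a predecessor).

```python
class RedBluePebbleGame:
    """Tracks pebbles on a DAG and counts Input/Output rule uses (I/O time)."""

    def __init__(self, preds, inputs, outputs, S):
        self.preds = preds            # vertex -> list of immediate predecessors
        self.inputs = set(inputs)
        self.outputs = set(outputs)
        self.S = S                    # number of red pebbles (fast memory size)
        self.red = set()
        self.blue = set(inputs)       # Initialization rule applied up front
        self.io = 0

    def _place_red(self, v):
        if v not in self.red and len(self.red) >= self.S:
            raise ValueError("no free red pebble")
        self.red.add(v)

    def load(self, v):                # Input rule: blue -> red, one I/O
        if v not in self.blue:
            raise ValueError("vertex has no blue pebble")
        self._place_red(v)
        self.io += 1

    def store(self, v):               # Output rule: red -> blue, one I/O
        if v not in self.red:
            raise ValueError("vertex has no red pebble")
        self.blue.add(v)
        self.io += 1

    def compute(self, v):             # Computation rule
        if v in self.inputs or not all(u in self.red for u in self.preds[v]):
            raise ValueError("predecessors are not all red")
        self._place_red(v)

    def evict(self, v):               # Deletion rule (red pebbles only, for brevity)
        self.red.discard(v)

    def done(self):                   # Goal: every output carries a blue pebble
        return self.outputs <= self.blue


def pebble_one_butterfly(S):
    """Pebble a single butterfly (inputs a, b; outputs c, d) and return the game."""
    preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"]}
    g = RedBluePebbleGame(preds, inputs=["a", "b"], outputs=["c", "d"], S=S)
    g.load("a"); g.load("b")
    g.compute("c"); g.store("c"); g.evict("c")
    g.compute("d"); g.store("d")
    return g
```

With three red pebbles this strategy uses four I/O operations (two loads, two stores); with only two red pebbles the first `compute` call fails under the no-sliding assumption.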

S-span
• The S-span of DAG G, ρ(S,G), is the maximum number of vertices of G that can be pebbled with S red pebbles in the red pebble game, maximized over all initial placements of the S red pebbles.
• The red pebble game is like the red-blue game, but blue pebbles cannot be placed on intermediate vertices.
[Figure: example with S=6; legend: initial red pebble, red pebble]

Using S-span for Lower Bounds
Divide the computation into h consecutive sub-pebblings (C1, C2, ..., Ch), each of which communicates no more than S words between level 1 and level 2. Each sub-pebbling has at most 2S words available (S words initially in the cache plus at most S inputs). Therefore, each sub-pebbling can perform no more than ρ(2S,G) operations.
[Figure: sequence of sub-pebblings C1, C2, ..., Ch, each made of level-1 operations with input and output]

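Using S-span for Lower Bounds
• Theorem: For every pebbling P of G = (V,E) in the red-blue pebble game with S red pebbles, the I/O time used, T2(S,G,P), satisfies:
$\lceil T_2(S,G,P)/S \rceil \cdot \rho(2S,G) \ge |V| - |\mathrm{In}(G)|$
• $\lceil T_2(S,G,P)/S \rceil$: number of words moved (in batches of S words)
• $\rho(2S,G)$: upper bound on arithmetic intensity (number of operations per 2S words)
• $|V| - |\mathrm{In}(G)|$: total number of operations (vertices of G other than the inputs In(G))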

What is the S-span of the FFT DAG?
Lemma 1: The S-span of the FFT DAG on n inputs is no greater than 2 S log(S) when S ≤ n.
Proof: Let num(p) denote the number of moves currently allocated to pebble p. Consider a butterfly with lower-level nodes u1, u2 (carrying pebbles p1, p2) and upper-level nodes v1, v2. Both p1 and p2 are moved to the upper-level nodes v1 and v2. (Illegal, but it gives an upper bound.) If num(p1) = num(p2), then increment both. Otherwise increment the smaller. The total number of red pebbling moves is therefore bounded by:
$2 \sum_{i=1}^{S} \mathrm{num}(p_i)$
[Figure: butterfly with nodes u1, u2 (pebbles p1, p2) below and v1, v2 above]


What is the S-span of the FFT DAG?
Lemma 2: For each pebble p on node n in the FFT DAG, the number of nodes, N(p), that contained a red pebble in the initial configuration and that are connected by a directed path to n is at least $2^{\mathrm{num}(p)}$.
Proof (Induction):
Base case: num(p) = 1. In this case, the node n needed 2 inputs, so N(p) is at least $2 = 2^1$.
Inductive step: Assume that N(p) is at least $2^{e-1}$ whenever num(p) ≤ e-1. Show that N(p) becomes at least $2^e$ when num(p) is incremented to e during a butterfly operation.
Case 1: Pebbles p1 and p2 enter a butterfly operation with num(p1) = num(p2) = e-1. Since u1 and u2 are roots of disjoint trees, each with at least $2^{e-1}$ initial pebbles, the total number of initial pebbles is now at least $2 \cdot 2^{e-1} = 2^e$.
Case 2: num(p) < num(partner) in the butterfly. Then num(partner) ≥ e, so the partner must have been connected to at least $2^e$ initial pebbles, and those nodes are also connected to the node that p moves to.

What is the S-span of the FFT DAG?
There are S pebbles and each pebble can only cover one initial placement. Therefore num(p) ≤ log(S), because there must be at least $2^{\mathrm{num}(p)}$ initial pebbles (Lemma 2). By the accounting in the proof of Lemma 1, the total number of pebbling moves is bounded by:
$2 \sum_{i=1}^{S} \mathrm{num}(p_i) \le 2 S \log(S)$
So the S-span is at most 2 S log(S). QED

FFT Two-level Hierarchy Lower Bound
Number of words moved: $\Omega(N \log N / \log S)$
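One way to arrive at a bound of this form is sketched below. It assumes the S-span theorem in the form ⌈T2/S⌉ · ρ(2S,G) ≥ |V| − |In(G)|, the span bound ρ(S,G) ≤ 2 S log2(S) for the FFT DAG, and 2S ≤ N; the constants are illustrative and not taken from the slides.

```latex
\begin{align*}
|V| - |\mathrm{In}(G)| &= N \log_2 N
  && \text{($\log_2 N$ levels of $N$ vertices each)} \\
\rho(2S, G) &\le 2\,(2S)\log_2(2S) = 4S \log_2(2S)
  && \text{(span bound applied to $2S$ pebbles)} \\
\left\lceil \frac{T_2}{S} \right\rceil &\ge \frac{N \log_2 N}{4S \log_2(2S)}
  && \text{(S-span theorem)} \\
T_2 &> \frac{N \log_2 N}{4 \log_2(2S)} - S
  && \text{(using $\lceil x \rceil < x + 1$)}
\end{align*}
```

When the first term dominates S, this gives T2 = Ω(N log N / log S).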

Transpose FFT

Transpose FFT (Upper Bound)
Suppose the FFT size is a power of 2 ($N = 2^d$). There are log(N) levels in the FFT DAG. Divide the large FFT into many FFTs of size S, where S is the size of fast memory. Each size-S FFT covers log(S) levels, so there are log(N)/log(S) stages of independent size-S FFTs. After each stage, store the outputs in slow memory. Each stage moves O(N) words, for a total of O(N log(N)/log(S)) words moved between fast and slow memory, which matches the lower bound up to a constant factor.
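The staged scheme can be sketched as a blocked FFT that counts words moved. This is an illustrative sketch under assumptions rather than the algorithm from the slides: N and S are powers of two, the input is taken to sit in slow memory already in bit-reversed order, twiddle factors are recomputed instead of stored, and the names `bit_reverse` and `blocked_fft` are hypothetical.

```python
import cmath

def bit_reverse(x):
    """Return x permuted so that element i lands at the bit-reversed index of i."""
    n = len(x)
    bits = n.bit_length() - 1
    out = list(x)
    for i in range(n):
        out[int(format(i, "b").zfill(bits)[::-1], 2) if bits else 0] = x[i]
    return out

def blocked_fft(x, S):
    """FFT done in passes of log2(S) levels; returns (result, words moved)."""
    n, d = len(x), len(x).bit_length() - 1
    k = S.bit_length() - 1                       # levels handled per pass
    if n < 2 or n & (n - 1) or S < 2 or S & (S - 1):
        raise ValueError("n and S must be powers of two, at least 2")
    slow = [complex(v) for v in bit_reverse(x)]  # slow memory
    io, a = 0, 0                                 # a = first level of this pass
    while a < d:
        kk = min(k, d - a)
        for base in range(n):
            if (base >> a) & ((1 << kk) - 1):
                continue                         # not the first index of a group
            # Group: indices that differ from base only in bits a .. a+kk-1.
            idx = [base | (j << a) for j in range(1 << kk)]
            fast = [slow[i] for i in idx]        # load up to S words
            io += len(idx)
            for s in range(kk):                  # kk butterfly levels in fast memory
                half, m = 1 << s, 1 << (a + s + 1)
                for j in range(1 << kk):
                    if j & half:
                        continue
                    w = cmath.exp(-2j * cmath.pi * (idx[j] % (m // 2)) / m)
                    t = w * fast[j | half]
                    fast[j | half] = fast[j] - t
                    fast[j] = fast[j] + t
            for j, i in enumerate(idx):          # store the group back
                slow[i] = fast[j]
            io += len(idx)
        a += kk
    return slow, io
```

Under these assumptions every pass reads and writes each of the N words once, so the count is 2 N ⌈log N / log S⌉, within a constant factor of N log(N)/log(S).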

Multilevel Pebble Game
• Red/blue pebble game was for 2 levels (fast and slow)
• For the multilevel game, data begins and ends in the highest-level memory (the Lth) and can be transferred between consecutive levels (l-1 to l or vice versa)
• Level-1 (Registers), Level-2 (On-chip cache), . . ., Level-L (Main Memory)

Rules of the Multilevel Pebble Game
• (Initialization) A level-L pebble can be placed on any input vertex at any time
• (Computation) A first-level pebble can be placed on any vertex if all of its immediate predecessors have first-level pebbles
• (Deletion) Except for level-L pebbles on output vertices, a pebble at any level can be removed at any time
• (Input from level-l) For 2 ≤ l ≤ L-1, a level-(l-1) pebble can be placed on any vertex carrying a level-l pebble
• (Output to level-l) For 2 ≤ l ≤ L-1, a level-l pebble can be placed on any vertex carrying a level-(l-1) pebble
• (Goal) All output vertices contain level-L pebbles

Terminology
• Resource vector p = (p1, p2, p3, ..., pL-1), where pl is the number of pebbles at level l. (The highest level is assumed infinite.)
• sl = total number of pebbles available at level l or below
• A minimal pebbling minimizes the number of I/O operations at the highest level first, then the number of I/O operations at successively lower levels, and finally the number of computation steps.
• Tl = number of I/O operations at level l

Multilevel S-Span
Theorem: Consider a minimal pebbling of the DAG G = (V,E) in the standard memory hierarchy game with resource vector p using sl pebbles at level l or less. The following lower bound must be satisfied:
[Figure: level-l sub-pebblings C1, C2, ..., Ch, each made of level-(l-1) operations with input and output]

Relating Multilevel to 2-level
Theorem: The following inequality holds for 2 ≤ l ≤ L-1 when the graph G is pebbled in the L-level game with resource vector p.

Review
• The minimum I/O time for the FFT in the 2-level case is Θ(N log N / log S)
• This was determined by finding the S-span of the FFT graph and using it to bound the number of words transferred between memory levels
• The FFT, organized as the transpose FFT, achieves this lower bound up to a constant factor (so the lower bound is tight)
• Two-level lower bounds can be generalized to multi-level memory hierarchies

Open Problems
• Communication lower bounds for 2-D and 3-D FFTs
  • I suspect that the S-span argument also holds for the 2-D case
  • What if S is larger than one row?
• Determining the FFT lower bound for the parallel model described in this class
  • Lower bounds for a “parallel hierarchical memory model” using randomized sorting algorithms for communication can be found in: J. S. Vitter and E. A. M. Shriver. “Algorithms for Parallel Memory II: Hierarchical Multilevel Memories”
• Using the pebble game (S-span) method to analyze new algorithms
  • Matrix multiply, sorting, and several other examples can be found in the references listed earlier

Questions?
