
Communication costs of LU decomposition algorithms for banded matrices




  1. Communication costs of LU decomposition algorithms for banded matrices Razvan Carbunescu

  2. Outline (1/2) • Sequential general LU factorization (GETRF) and Lower Bounds • Definitions and Lower Bounds • LAPACK algorithm • Communication cost • Summary • Sequential banded LU factorization (GBTRF) and Lower Bounds • Definitions and Lower Bounds • Banded format • LAPACK algorithm • Communication cost • Summary • Sequential LU Summary

  3. Outline (2/2) • Parallel LU definitions and Lower Bounds • Parallel Cholesky algorithms (Saad, Schultz ‘85) • SPIKE Cholesky algorithm (Sameh ’85) • Parallel banded LU factorization (PGBTRF) • ScaLAPACK algorithm • Communication cost • Summary • Parallel banded LU and Cholesky Summary • Future Work • General Summary

  4. GETRF – Definitions and Lower Bounds • Variables: • n - size of the matrix • r - block size (panel width) • i - current panel number • M - size of fast memory • LU fits into the pattern of 3 nested loops and has the usual lower bounds:
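The formulas themselves are lost in this transcript; the "usual" three-nested-loop communication lower bounds (Ballard, Demmel, Holtz, Schwartz), which the slide presumably showed for fast memory of size M, are:

```latex
% Communication lower bounds for any algorithm fitting the
% three-nested-loop (matmul-like) pattern, with fast memory of size M:
W = \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right) \text{ words moved},
\qquad
S = \Omega\!\left(\frac{n^{3}}{M^{3/2}}\right) \text{ messages}.
```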

  5. GETRF - Communication assumptions • BLAS2 LU on an (m x n) matrix takes • TRSM on (n x m) with LL (n x n) takes • GEMM of (m x n) - (m x k) (k x n) takes [Figure: block dimensions for the BLAS2 LU (L, P, U), TRSM (A = LL⁻¹U), and GEMM (A − L·U) operations]

  6. GETRF – LAPACK algorithm • For each panel block: • Factorize panel (n x r) • Permute matrix • Compute U update (TRSM) of size r x (n-ir) with LL of size r x r • Compute GEMM update of size: • (n-ir) x (n-ir) - ((n-ir) x r ) * (r x (n-ir))
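The four steps the slide lists can be sketched in numpy; this is an illustrative dense-matrix sketch of the blocked right-looking scheme, not the actual LAPACK code, and the function name is mine:

```python
import numpy as np

def blocked_lu(A, r):
    """Blocked right-looking LU with partial pivoting (sketch of the
    GETRF steps: factorize panel, permute matrix, TRSM update of U,
    GEMM update of the Schur complement).
    Returns P, L, U with P @ A = L @ U; r is the panel width."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, r):
        kb = min(r, n - k)
        # 1) factorize the panel A[k:, k:k+kb] (unblocked, partial pivoting)
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:
                A[[j, p], :] = A[[p, j], :]      # 2) permute whole rows
                piv[[j, p]] = piv[[p, j]]
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        # 3) U update (TRSM): solve L11 * U12 = A12 for the r x (n-ir) block
        L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
        A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
        # 4) GEMM update of the trailing (n-ir) x (n-ir) block
        A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    P = np.eye(n)[piv]
    return P, L, U
```

The communication analysis on the following slides costs each of these four steps (panel, permute, TRSM, GEMM) separately per panel i and sums over i.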

  7. GETRF – LAPACK algorithm (1/4) • Factorize panel P • Words: • Total words : [Figure: panel P of width r and height n − (i−1)r, with its L and U parts]

  8. GETRF – LAPACK algorithm (2/4) • Permute matrix with pivot information from panel • Words: • Total words :

  9. GETRF – LAPACK algorithm (3/4) • Compute U update (TRSM) of size r x (n-ir) with LL of size r x r • Words: • Total words : [Figure: TRSM of the r x (n-ir) block of U with the r x r lower-triangular LL]

  10. GETRF – LAPACK algorithm (4/4) • Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir)) • Words: • Total words : [Figure: Schur-complement GEMM update of the trailing (n-ir) x (n-ir) block]

  11. GETRF – Communication cost • Communication cost • Simplified in big-O notation we get:

  12. GETRF - General LU Summary • General LU lower bounds are: • LAPACK LU algorithm gives :

  13. GBTRF - Banded LU factorization • Variables: • n - size of the matrix • b - matrix bandwidth • r - block size (panel width) • M - size of fast memory • Also fits into the 3-nested-loop pattern and its lower bounds:

  14. Banded Format • GBTRF uses a special “banded format” • Packed data format that stores mostly nonzero data and very few zeros • columns map to columns; diagonals map to rows • easy to retrieve a square block of the original A by using a leading dimension of lda – 1
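The mapping the slide describes ("columns map to columns; diagonals map to rows") can be sketched as a pack/unpack pair; this mirrors the LAPACK-style band layout where entry A[i, j] lives at AB[ku + i − j, j] (function names are mine):

```python
import numpy as np

def to_band(A, kl, ku):
    """Pack a dense banded matrix into band storage AB:
    column j of A stays in column j of AB, and diagonal i - j of A
    becomes a row of AB, so A[i, j] is stored at AB[ku + i - j, j]."""
    n = A.shape[0]
    AB = np.zeros((kl + ku + 1, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            AB[ku + i - j, j] = A[i, j]
    return AB

def from_band(AB, kl, ku):
    """Unpack band storage back into a dense n x n matrix."""
    n = AB.shape[1]
    A = np.zeros((n, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            A[i, j] = AB[ku + i - j, j]
    return A
```

Stepping one column right along a row of A moves one column right and one row up in AB, which is why a square block of the original A can be addressed inside the packed array with an effective leading dimension of lda − 1.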

  15. Banded Format • Because of the format, the update of U and of the Schur complement gets split into multiple stages for the parts of the band matrix near the edges of the storage array [Figure: conceptual vs. actual layout of the band updates]

  16. GBTRF Algorithm • For each panel block: • Factorize panel of size b x r • Permute rest of matrix affected by panel • Compute U update (TRSM) of size (b-2r) x r with LL of size r x r • Compute U update (TRSM) of size r x r with LL of size r x r • Compute 4 GEMM updates of sizes: • (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r)) • (b-2r) x r - ((b-2r) x r) * (r x r) • r x (b-2r) - (r x r) * (r x (b-2r)) • r x r - (r x r) * (r x r)
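Why the banded case needs only these small updates is easiest to see in the unblocked algorithm: without pivoting, each elimination step of a bandwidth-b matrix touches only a b x b square, and all fill-in stays inside the band. A minimal numpy sketch (dense storage for clarity, not the packed format; the function name is mine):

```python
import numpy as np

def banded_lu_nopivot(A, b):
    """Unblocked LU of an n x n matrix with bandwidth b, no pivoting.
    Each elimination step only updates the block within b rows and
    b columns of the pivot, so no fill-in occurs outside the band."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        m = min(n, k + b + 1)          # edge of the band
        A[k+1:m, k] /= A[k, k]
        A[k+1:m, k+1:m] -= np.outer(A[k+1:m, k], A[k, k+1:m])
    return np.tril(A, -1) + np.eye(n), np.triu(A)
```

With partial pivoting (as in GBTRF) row swaps can widen U's band up to 2b, which is why the packed storage reserves extra rows and why the blocked updates above split into the (b-2r) and r sized pieces.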

  17. GBTRF – LAPACK algorithm (1/8) • Factorize panel P • Words: • Total words : [Figure: b x r panel within the band]

  18. GBTRF – LAPACK algorithm (2/8) • Apply permutations • Words: • Total words :

  19. GBTRF – LAPACK algorithm (3/8) • Compute U update (TRSM) of size (b-2r) x r with LL of size r x r • Words: • Total words : [Figure: (b-2r) x r TRSM block with the r x r LL]

  20. GBTRF – LAPACK algorithm (4/8) • Compute U update (TRSM) of size r x r with LL of size r x r • Words: • Total words : [Figure: r x r TRSM block with the r x r LL]

  21. GBTRF – LAPACK algorithm (5/8) • Compute GEMM update of size (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r)) • Words: • Total words : [Figure: (b-2r) x (b-2r) GEMM update]

  22. GBTRF – LAPACK algorithm (6/8) • Compute GEMM update of size (b-2r) x r - ((b-2r) x r) * (r x r) • Words: • Total words : [Figure: (b-2r) x r GEMM update]

  23. GBTRF – LAPACK algorithm (7/8) • Compute GEMM update of size r x (b-2r) - (r x r) * (r x (b-2r)) • Words: • Total words : [Figure: r x (b-2r) GEMM update]

  24. GBTRF – LAPACK algorithm (8/8) • Compute GEMM update of size r x r - (r x r) * (r x r) • Words: • Total words : [Figure: r x r GEMM update]

  25. GBTRF communication cost • A full cost would be: • If we choose r < b/3 this simplifies the leading terms to: • Since r < b, the other option is b/3 < r < b; in this case we get:

  26. GBTRF - Banded LU Summary • Banded LU lower bounds are: • LAPACK banded LU algorithm gives :

  27. Sequential Summary

  28. Parallel banded LU - Definitions • Variables: • n - size of the matrix • p- number of processors • b - matrix bandwidth • M - size of fast memory

  29. Parallel banded LU – Lower Bounds • Assuming the banded matrix is distributed in a 1D layout across the p processors • Lower Bounds: [Figure: neighboring processors P(i-1) and P(i) in the 1D layout]

  30. Parallel banded algorithms – (Saad ‘85) • In (Saad, Schultz ’85) we are presented with a computation and communication analysis for banded Cholesky (LLT) solvers on a 1D ring, a 2D torus and an n-D hypercube, as well as a pipelined approach • While Cholesky is a different computation from LU, it can be viewed as a minimum cost for LU since it requires neither pivoting nor the computation of U, yet it is also used for Gaussian Elimination • Since most parallel banded algorithms also increase the amount of computation done, that will also be compared between the algorithms in terms of multiplicative factors on the leading term

  31. Parallel banded algorithms – RIGBE

  32. Parallel banded algorithms – BIGBE

  33. Parallel banded algorithms – HBGE • Same algorithm as BIGBE, but the 2D grid is embedded in the hypercube to allow for faster communication

  34. Parallel banded algorithms – WFGE • Uses the 2D cyclic layout and then performs operations diagonally

  35. Parallel banded algorithms – (Saad ‘85) • Parallel band LU lower bounds: • Banded Cholesky algorithms :

  36. Parallel banded algorithms – SPIKE (1/3) • Another parallel banded implementation is presented in the SPIKE algorithm (Lawrie, Sameh ‘84), a Cholesky solver; Cholesky is just a special case of Gaussian Elimination • This algorithm for factorization and solve is extended to a pivoting LU implementation in (Sameh ’05)

  37. Parallel banded algorithms – SPIKE (2/3)

  38. Parallel banded algorithms – SPIKE (3/3) • Parallel band LU lower bounds: • SPIKE Cholesky algorithm:
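The SPIKE idea (factor the diagonal blocks independently, then couple partitions through a small reduced system built from the "spikes") can be sketched for two partitions; this is a sequential illustration with dense solvers standing in for the per-partition banded factorizations, and all names are mine, not the SPIKE package's API:

```python
import numpy as np

def spike_solve(A, f, m, b):
    """Two-partition SPIKE-style solve of A x = f (sketch, no pivoting).
    A is n x n with bandwidth b; partitions are rows [0:m) and [m:n).
    In the real algorithm each partition is factored on its own
    processor; here both solves are done sequentially for clarity."""
    n = A.shape[0]
    A1, A2 = A[:m, :m], A[m:, m:]
    # coupling blocks: the only nonzeros outside the diagonal blocks
    B = A[m-b:m, m:m+b]            # bottom of partition 1 -> top of 2
    C = A[m:m+b, m-b:m]            # top of partition 2 -> bottom of 1
    # spikes: V = A1^{-1} [0; B],  W = A2^{-1} [C; 0]
    rhsV = np.zeros((m, b));     rhsV[m-b:, :] = B
    rhsW = np.zeros((n - m, b)); rhsW[:b, :] = C
    V = np.linalg.solve(A1, rhsV)
    W = np.linalg.solve(A2, rhsW)
    g1 = np.linalg.solve(A1, f[:m])
    g2 = np.linalg.solve(A2, f[m:])
    # reduced 2b x 2b system for s = x1[-b:] and t = x2[:b]
    R = np.block([[np.eye(b), V[m-b:, :]],
                  [W[:b, :],  np.eye(b)]])
    st = np.linalg.solve(R, np.concatenate([g1[m-b:], g2[:b]]))
    s, t = st[:b], st[b:]
    # back-substitute the spikes into each partition's local solution
    x1 = g1 - V @ t
    x2 = g2 - W @ s
    return np.concatenate([x1, x2])
```

The communication advantage is that only the 2b-wide spike tips and the reduced system cross partition boundaries, independent of the partition size m.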

  39. PGBTRF – Data Layout • Adopts the same banded layout as the sequential algorithm, with slightly higher bandwidth storage (4b instead of 3b) and a 1D block distribution [Figure: 1D block distribution of the n columns across processors P1–P4, with 2b-wide overlap regions]

  40. PGBTRF – Algorithm • Description from the ScaLAPACK code: • 1) Compute fully independent band LU factorizations of the submatrices located in local memory • 2) Pass the upper triangular matrix from the end of the local storage on to the next processor • 3) From the local factorization and the upper triangular matrix, form a reduced blocked bidiagonal system and store the extra data in Af (extra storage) • 4) Solve the reduced blocked bidiagonal system to compute the extra factors and store them in Af

  41. PGBTRF – Communication cost • Parallel band LU lower bounds: • ScaLAPACK band LU algorithm:

  42. Parallel Summary • Lower Bounds • (Saad’85) • SPIKE • ScaLAPACK

  43. Future Work • Checking the lower bounds and implementation details of applying CALU to the panel in the LAPACK algorithm • Investigate parallel band LU lower bounds for an exact cost • Heterogeneous analysis of the implemented MAGMA sgbtrf and lower bounds for a heterogeneous model • Looking at Nested Dissection as another Divide and Conquer method for parallel banded LU • Analysis of the cost of applying a parallel banded algorithm to the sequential model, to see if we can reduce communication by increasing computation

  44. General Summary

  45. Questions?
