
Communication costs of LU decomposition algorithms for banded matrices




  1. Communication costs of LU decomposition algorithms for banded matrices Razvan Carbunescu

  2. Outline (1/2) • Sequential general LU factorization (GETRF) and Lower Bounds • Definitions and Lower Bounds • LAPACK algorithm • Communication cost • Summary • Sequential banded LU factorization (GBTRF) and Lower Bounds • Definitions and Lower Bounds • Banded format • LAPACK algorithm • Communication cost • Summary • Sequential LU Summary

  3. Outline (2/2) • Parallel LU definitions and Lower Bounds • Parallel Cholesky algorithms (Saad, Schultz ‘85) • SPIKE Cholesky algorithm (Sameh ’85) • Parallel banded LU factorization (PGBTRF) • ScaLAPACK algorithm • Communication cost • Summary • Parallel banded LU and Cholesky Summary • Future Work • General Summary

  4. GETRF – Definitions and Lower Bounds • Variables: • n - size of the matrix • r - block size (panel width) • i - current panel number • M - size of fast memory • LU fits into the pattern of 3 nested loops and has the usual lower bounds:
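The formulas themselves are lost in this transcript; the "usual" three-nested-loop communication lower bounds (Ballard, Demmel, Holtz, Schwartz), which the slide presumably showed for fast memory of size M, are:

```latex
% Communication lower bounds for any algorithm fitting the
% three-nested-loop (matmul-like) pattern, with fast memory of size M:
W = \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right) \text{ words moved},
\qquad
S = \Omega\!\left(\frac{n^{3}}{M^{3/2}}\right) \text{ messages}.
```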

  5. GETRF - Communication assumptions • BLAS2 LU on an (m x n) matrix takes • TRSM on (n x m) with LL (n x n) takes • GEMM of (m x n) - (m x k) (k x n) takes [Figure: block dimensions for the BLAS2 LU (L, P, U), TRSM (A = LL⁻¹U), and GEMM (A − L·U) operations]

  6. GETRF – LAPACK algorithm • For each panel block: • Factorize panel (n x r) • Permute matrix • Compute U update (TRSM) of size r x (n-ir) with LL of size r x r • Compute GEMM update of size: • (n-ir) x (n-ir) - ((n-ir) x r ) * (r x (n-ir))
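The four steps the slide lists can be sketched in numpy; this is an illustrative dense-matrix sketch of the blocked right-looking scheme, not the actual LAPACK code, and the function name is mine:

```python
import numpy as np

def blocked_lu(A, r):
    """Blocked right-looking LU with partial pivoting (sketch of the
    GETRF steps: factorize panel, permute matrix, TRSM update of U,
    GEMM update of the Schur complement).
    Returns P, L, U with P @ A = L @ U; r is the panel width."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, r):
        kb = min(r, n - k)
        # 1) factorize the panel A[k:, k:k+kb] (unblocked, partial pivoting)
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:
                A[[j, p], :] = A[[p, j], :]      # 2) permute whole rows
                piv[[j, p]] = piv[[p, j]]
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        # 3) U update (TRSM): solve L11 * U12 = A12 for the r x (n-ir) block
        L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
        A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
        # 4) GEMM update of the trailing (n-ir) x (n-ir) block
        A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    P = np.eye(n)[piv]
    return P, L, U
```

The communication analysis on the following slides costs each of these four steps (panel, permute, TRSM, GEMM) separately per panel i and sums over i.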

  7. GETRF – LAPACK algorithm (1/4) • Factorize panel P • Words: • Total words : [Figure: panel P of width r and height n − (i−1)r, with its L and U parts]

  8. GETRF – LAPACK algorithm (2/4) • Permute matrix with pivot information from panel • Words: • Total words :

  9. GETRF – LAPACK algorithm (3/4) • Compute U update (TRSM) of size r x (n-ir) with LL of size r x r • Words: • Total words : [Figure: TRSM of the r x (n-ir) block of U with the r x r lower-triangular LL]

  10. GETRF – LAPACK algorithm (4/4) • Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir)) • Words: • Total words : [Figure: Schur-complement GEMM update of the trailing (n-ir) x (n-ir) block]

  11. GETRF – Communication cost • Communication cost • Simplified in big-O notation we get:

  12. GETRF - General LU Summary • General LU lower bounds are: • LAPACK LU algorithm gives :

  13. GBTRF - Banded LU factorization • Variables: • n - size of the matrix • b - matrix bandwidth • r - block size (panel width) • M - size of fast memory • Also fits into the 3-nested-loop pattern and its lower bounds:

  14. Banded Format • GBTRF uses a special “banded format” • Packed data format that stores mostly nonzero data and very few zeros • columns map to columns; diagonals map to rows • easy to retrieve a square block of the original A by using a leading dimension of lda – 1
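The mapping the slide describes ("columns map to columns; diagonals map to rows") can be sketched as a pack/unpack pair; this mirrors the LAPACK-style band layout where entry A[i, j] lives at AB[ku + i − j, j] (function names are mine):

```python
import numpy as np

def to_band(A, kl, ku):
    """Pack a dense banded matrix into band storage AB:
    column j of A stays in column j of AB, and diagonal i - j of A
    becomes a row of AB, so A[i, j] is stored at AB[ku + i - j, j]."""
    n = A.shape[0]
    AB = np.zeros((kl + ku + 1, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            AB[ku + i - j, j] = A[i, j]
    return AB

def from_band(AB, kl, ku):
    """Unpack band storage back into a dense n x n matrix."""
    n = AB.shape[1]
    A = np.zeros((n, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            A[i, j] = AB[ku + i - j, j]
    return A
```

Stepping one column right along a row of A moves one column right and one row up in AB, which is why a square block of the original A can be addressed inside the packed array with an effective leading dimension of lda − 1.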

  15. Banded Format • Because of the format, the update of U and of the Schur complement gets split into multiple stages for the parts of the band matrix near the edges of the storage array [Figure: conceptual vs. actual layout of the band updates]

  16. GBTRF Algorithm • For each panel block: • Factorize panel of size b x r • Permute rest of matrix affected by panel • Compute U update (TRSM) of size (b-2r) x r with LL of size r x r • Compute U update (TRSM) of size r x r with LL of size r x r • Compute 4 GEMM updates of sizes: • (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r)) • (b-2r) x r - ((b-2r) x r) * (r x r) • r x (b-2r) - (r x r) * (r x (b-2r)) • r x r - (r x r) * (r x r)
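Why the banded case needs only these small updates is easiest to see in the unblocked algorithm: without pivoting, each elimination step of a bandwidth-b matrix touches only a b x b square, and all fill-in stays inside the band. A minimal numpy sketch (dense storage for clarity, not the packed format; the function name is mine):

```python
import numpy as np

def banded_lu_nopivot(A, b):
    """Unblocked LU of an n x n matrix with bandwidth b, no pivoting.
    Each elimination step only updates the block within b rows and
    b columns of the pivot, so no fill-in occurs outside the band."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        m = min(n, k + b + 1)          # edge of the band
        A[k+1:m, k] /= A[k, k]
        A[k+1:m, k+1:m] -= np.outer(A[k+1:m, k], A[k, k+1:m])
    return np.tril(A, -1) + np.eye(n), np.triu(A)
```

With partial pivoting (as in GBTRF) row swaps can widen U's band up to 2b, which is why the packed storage reserves extra rows and why the blocked updates above split into the (b-2r) and r sized pieces.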

  17. GBTRF – LAPACK algorithm (1/8) • Factorize panel P • Words: • Total words : [Figure: b x r panel within the band]

  18. GBTRF – LAPACK algorithm (2/8) • Apply permutations • Words: • Total words :

  19. GBTRF – LAPACK algorithm (3/8) • Compute U update (TRSM) of size (b-2r) x r with LL of size r x r • Words: • Total words : [Figure: (b-2r) x r TRSM block with the r x r LL]

  20. GBTRF – LAPACK algorithm (4/8) • Compute U update (TRSM) of size r x r with LL of size r x r • Words: • Total words : [Figure: r x r TRSM block with the r x r LL]

  21. GBTRF – LAPACK algorithm (5/8) • Compute GEMM update of size (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r)) • Words: • Total words : [Figure: (b-2r) x (b-2r) GEMM update]

  22. GBTRF – LAPACK algorithm (6/8) • Compute GEMM update of size (b-2r) x r - ((b-2r) x r) * (r x r) • Words: • Total words : [Figure: (b-2r) x r GEMM update]

  23. GBTRF – LAPACK algorithm (7/8) • Compute GEMM update of size r x (b-2r) - (r x r) * (r x (b-2r)) • Words: • Total words : [Figure: r x (b-2r) GEMM update]

  24. GBTRF – LAPACK algorithm (8/8) • Compute GEMM update of size r x r - (r x r) * (r x r) • Words: • Total words : [Figure: r x r GEMM update]

  25. GBTRF communication cost • A full cost would be: • If we choose r < b/3 this simplifies the leading terms to: • Since r < b, the other option is b/3 < r < b; in this case we get:

  26. GBTRF - Banded LU Summary • Banded LU lower bounds are: • LAPACK banded LU algorithm gives :

  27. Sequential Summary

  28. Parallel banded LU - Definitions • Variables: • n - size of the matrix • p- number of processors • b - matrix bandwidth • M - size of fast memory

  29. Parallel banded LU – Lower Bounds • Assuming the banded matrix is distributed in a 1D layout across the p processors • Lower Bounds: [Figure: neighboring processors P(i-1) and P(i) in the 1D layout]

  30. Parallel banded algorithms – (Saad ‘85) • In (Saad, Schultz ’85) we are presented with a computation and communication analysis for banded Cholesky (LLT) solvers on a 1D ring, a 2D torus and an n-D hypercube, as well as a pipelined approach • While Cholesky is a different computation from LU, it can be viewed as a minimum cost for LU since it requires neither pivoting nor the computation of U, yet it is also used for Gaussian Elimination • Since most parallel banded algorithms also increase the amount of computation done, that will also be compared between the algorithms in terms of multiplicative factors on the leading term

  31. Parallel banded algorithms – RIGBE

  32. Parallel banded algorithms – BIGBE

  33. Parallel banded algorithms – HBGE • Same algorithm as BIGBE, but the 2D grid is embedded in the hypercube to allow for faster communication

  34. Parallel banded algorithms – WFGE • Uses the 2D cyclic layout and then performs operations diagonally

  35. Parallel banded algorithms – (Saad ‘85) • Parallel band LU lower bounds: • Banded Cholesky algorithms :

  36. Parallel banded algorithms – SPIKE (1/3) • Another parallel banded implementation is presented in the SPIKE algorithm (Lawrie, Sameh ‘84), a Cholesky solver; Cholesky is just a special case of Gaussian Elimination • This algorithm for factorization and solve is extended to a pivoting LU implementation in (Sameh ’05)

  37. Parallel banded algorithms – SPIKE (2/3)

  38. Parallel banded algorithms – SPIKE (3/3) • Parallel band LU lower bounds: • SPIKE Cholesky algorithm:
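The SPIKE idea (factor the diagonal blocks independently, then couple partitions through a small reduced system built from the "spikes") can be sketched for two partitions; this is a sequential illustration with dense solvers standing in for the per-partition banded factorizations, and all names are mine, not the SPIKE package's API:

```python
import numpy as np

def spike_solve(A, f, m, b):
    """Two-partition SPIKE-style solve of A x = f (sketch, no pivoting).
    A is n x n with bandwidth b; partitions are rows [0:m) and [m:n).
    In the real algorithm each partition is factored on its own
    processor; here both solves are done sequentially for clarity."""
    n = A.shape[0]
    A1, A2 = A[:m, :m], A[m:, m:]
    # coupling blocks: the only nonzeros outside the diagonal blocks
    B = A[m-b:m, m:m+b]            # bottom of partition 1 -> top of 2
    C = A[m:m+b, m-b:m]            # top of partition 2 -> bottom of 1
    # spikes: V = A1^{-1} [0; B],  W = A2^{-1} [C; 0]
    rhsV = np.zeros((m, b));     rhsV[m-b:, :] = B
    rhsW = np.zeros((n - m, b)); rhsW[:b, :] = C
    V = np.linalg.solve(A1, rhsV)
    W = np.linalg.solve(A2, rhsW)
    g1 = np.linalg.solve(A1, f[:m])
    g2 = np.linalg.solve(A2, f[m:])
    # reduced 2b x 2b system for s = x1[-b:] and t = x2[:b]
    R = np.block([[np.eye(b), V[m-b:, :]],
                  [W[:b, :],  np.eye(b)]])
    st = np.linalg.solve(R, np.concatenate([g1[m-b:], g2[:b]]))
    s, t = st[:b], st[b:]
    # back-substitute the spikes into each partition's local solution
    x1 = g1 - V @ t
    x2 = g2 - W @ s
    return np.concatenate([x1, x2])
```

The communication advantage is that only the 2b-wide spike tips and the reduced system cross partition boundaries, independent of the partition size m.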

  39. PGBTRF – Data Layout • Adopts the same banded layout as the sequential algorithm, with slightly higher bandwidth storage (4b instead of 3b) and a 1D block distribution [Figure: 1D block distribution of the n columns across processors P1–P4, with 2b-wide overlap regions]

  40. PGBTRF – Algorithm • Description from the ScaLAPACK code: • 1) Compute fully independent band LU factorizations of the submatrices located in local memory • 2) Pass the upper triangular matrix from the end of the local storage on to the next processor • 3) From the local factorization and the upper triangular matrix, form a reduced blocked bidiagonal system and store the extra data in Af (extra storage) • 4) Solve the reduced blocked bidiagonal system to compute the extra factors and store them in Af

  41. PGBTRF – Communication cost • Parallel band LU lower bounds: • ScaLAPACK band LU algorithm:

  42. Parallel Summary • Lower Bounds • (Saad’85) • SPIKE • ScaLAPACK

  43. Future Work • Checking the lower bounds and implementation details of applying CALU to the panel in the LAPACK algorithm • Investigate parallel band LU lower bounds for an exact cost • Heterogeneous analysis of the implemented MAGMA sgbtrf and lower bounds for a heterogeneous model • Looking at Nested Dissection as another Divide and Conquer method for parallel banded LU • Analysis of the cost of applying a parallel banded algorithm to the sequential model, to see if we can reduce communication by increasing computation

  44. General Summary

  45. Questions?
