Lecture 4 Sparse Factorization: Data-flow Organization



1. 4th Gene Golub SIAM Summer School, 7/22 – 8/7, 2013, Shanghai
Lecture 4 Sparse Factorization: Data-flow Organization
Xiaoye Sherry Li
Lawrence Berkeley National Laboratory, USA
xsli@lbl.gov
crd-legacy.lbl.gov/~xiaoye/G2S3/

2. Lecture outline
• Data-flow organization: left-looking, right-looking
• Blocking for high performance
• Supernode, multifrontal
• Triangular solution

3. Dense Cholesky
• Right-looking Cholesky
for k = 1,…,n do
  A(k,k) = sqrt(A(k,k))
  for i = k+1,…,n do
    A(i,k) = A(i,k) / A(k,k)
  end for
  for i = k+1,…,n do
    for j = k+1,…,i do
      A(i,j) = A(i,j) – A(i,k) * A(j,k)
    end for
  end for
end for
• Left-looking Cholesky
for k = 1,…,n do
  for i = k,…,n do
    for j = 1,…,k-1 do
      A(i,k) = A(i,k) – A(i,j) * A(k,j)
    end for
  end for
  A(k,k) = sqrt(A(k,k))
  for i = k+1,…,n do
    A(i,k) = A(i,k) / A(k,k)
  end for
end for
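To make the data flow concrete, here is a minimal runnable C sketch of the right-looking variant on a dense column-major array (the left-looking variant performs the same operations with the loops reordered). The function and variable names are illustrative, not taken from the lecture code.

#include <math.h>
#include <stdio.h>

/* Right-looking dense Cholesky on a column-major n-by-n array.
 * Overwrites the lower triangle of A with L such that A = L*L^T.
 * Minimal sketch: no pivoting, assumes A is symmetric positive definite. */
void cholesky_right_looking(int n, double *A)
{
    for (int k = 0; k < n; k++) {
        A[k + k*n] = sqrt(A[k + k*n]);          /* cdiv(k): scale column k */
        for (int i = k + 1; i < n; i++)
            A[i + k*n] /= A[k + k*n];
        for (int j = k + 1; j < n; j++)         /* cmod(j,k) for all j > k: */
            for (int i = j; i < n; i++)         /* rank-1 update of the      */
                A[i + j*n] -= A[i + k*n] * A[j + k*n];  /* trailing triangle */
    }
}

int main(void)
{
    /* 3x3 SPD test matrix, column-major */
    double A[9] = {4, 2, 2,  2, 5, 3,  2, 3, 6};
    cholesky_right_looking(3, A);
    printf("L(2,0) = %g, L(2,1) = %g, L(2,2) = %g\n",
           A[2 + 0*3], A[2 + 1*3], A[2 + 2*3]);  /* expect 1, 1, 2 */
    return 0;
}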

4. Sparse Cholesky
[Figure: graph G(A), elimination tree T(A), and matrix A for the reference case]
• Reference case: regular 3 x 3 grid ordered by nested dissection. Nodes in the separators are ordered last.
• Notation:
• cdiv(j): divide column j by a scalar
• cmod(j, k): update column j with column k
• struct(L(1:k, j)): the nonzero structure of the L(1:k, j) submatrix

5. Sparse left-looking Cholesky
for j = 1 to n do
  for k in struct(L(j, 1:j-1)) do
    cmod(j, k)
  end for
  cdiv(j)
end for
Before variable j is eliminated, column j is updated with all the columns that have a nonzero on row j. In the example above, struct(L(7, 1:6)) = {1, 3, 4, 6}.
• This corresponds to receiving updates from nodes lower in the subtree rooted at j
• The filled graph is necessary to determine the structure of each row
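In dense terms, where struct(L(j, 1:j-1)) is simply every k < j, the same left-looking driver can be written directly in terms of cdiv/cmod kernels. A minimal runnable C sketch, with illustrative names, assuming column-major storage:

#include <math.h>
#include <stdio.h>

/* cdiv(j): take the square root of the diagonal and scale column j. */
static void cdiv(int n, double *A, int j)
{
    A[j + j*n] = sqrt(A[j + j*n]);
    for (int i = j + 1; i < n; i++)
        A[i + j*n] /= A[j + j*n];
}

/* cmod(j, k): update column j with column k (k < j):
 * A(j:n, j) -= A(j:n, k) * A(j, k). */
static void cmod(int n, double *A, int j, int k)
{
    for (int i = j; i < n; i++)
        A[i + j*n] -= A[i + k*n] * A[j + k*n];
}

int main(void)
{
    double A[9] = {4, 2, 2,  2, 5, 3,  2, 3, 6};  /* SPD, column-major */
    /* Left-looking driver: in the sparse code the k-loop runs only
     * over struct(L(j, 1:j-1)); densely it is every k < j. */
    for (int j = 0; j < 3; j++) {
        for (int k = 0; k < j; k++)
            cmod(3, A, j, k);
        cdiv(3, A, j);
    }
    printf("L = [%g; %g %g; %g %g %g]\n",
           A[0], A[1], A[4], A[2], A[5], A[8]);   /* expect 2; 1 2; 1 1 2 */
    return 0;
}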

6. Sparse right-looking Cholesky
for k = 1 to n do
  cdiv(k)
  for j in struct(L(k+1:n, k)) do
    cmod(j, k)
  end for
end for
After variable k is eliminated, column k is used to update all the columns corresponding to nonzeros in column k. In the example above, struct(L(4:9, 3)) = {7, 8, 9}.
• This corresponds to sending updates to nodes higher in the elimination tree
• The filled graph is necessary to determine the structure of each column

7. Sparse LU
[Figure: matrix partitioned into DONE, ACTIVE, and NOT TOUCHED regions of L and U for each variant; column j is the active column]
• Right-looking: many more writes than reads
for k = 1 to n do
  cdiv(k)
  for j in struct(U(k, k+1:n)) do
    cmod(j, k)
  end for
end for
• Left-looking: many more reads than writes; the cmod loop computes the triangular solve U(1:j-1, j) = L(1:j-1, 1:j-1) \ A(1:j-1, j)
for j = 1 to n do
  for k in struct(U(1:j-1, j)) do
    cmod(j, k)
  end for
  cdiv(j)
end for

8. High Performance Issues: Reduce Cost of Memory Access & Communication
• Blocking to increase the number of floating-point operations performed for each memory access
• Aggregate small messages into one larger message
  • reduces cost due to latency
• Well done in LAPACK, ScaLAPACK
  • dense and banded matrices
• Adopted in the new generation of sparse software
  • performance is much more sensitive to latency in the sparse case
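As an illustration of the blocking idea, here is a minimal C sketch of a tiled matrix-matrix update: each tile is reused about B times once it is in cache, so the ratio of flops to memory accesses grows with the block size B. This is a sketch of the principle only; production codes call tuned Level 3 BLAS (e.g. dgemm) instead, and the block size is a tuning parameter.

#include <stdio.h>

#define B 64   /* block (tile) size: illustrative value, tune to cache */

/* Tiled C += A * Bm for n-by-n column-major arrays. Each B-by-B tile
 * is reused ~B times once loaded, raising flops per memory access
 * from O(1) to O(B). */
void gemm_tiled(int n, const double *A, const double *Bm, double *C)
{
    for (int jj = 0; jj < n; jj += B)
        for (int kk = 0; kk < n; kk += B)
            for (int ii = 0; ii < n; ii += B)
                for (int j = jj; j < jj + B && j < n; j++)
                    for (int k = kk; k < kk + B && k < n; k++)
                        for (int i = ii; i < ii + B && i < n; i++)
                            C[i + j*n] += A[i + k*n] * Bm[k + j*n];
}

int main(void)
{
    enum { N = 128 };
    static double A[N*N], Bm[N*N], C[N*N];
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; Bm[i] = 1.0; C[i] = 0.0; }
    gemm_tiled(N, A, Bm, C);
    printf("C[0] = %g (expect %d)\n", C[0], N);
    return 0;
}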

9. Blocking: supernode
• Use (blocked) CRS or CCS, and any ordering method
• Leave room for fill-ins! (symbolic factorization)
• Exploit "supernodal" (dense) structures in the factors
  • can use Level 3 BLAS
  • reduce inefficient indirect addressing (scatter/gather)
  • reduce graph traversal time using a coarser graph
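To make the supernodal idea concrete, here is a hypothetical C storage layout (all field names are illustrative; SuperLU's actual data structures differ in detail). All columns of a supernode share one row-index list and are stored as a single dense block, which is what enables Level 3 BLAS updates and avoids per-entry scatter/gather:

typedef struct {
    int     nsuper;     /* number of supernodes                              */
    int    *sup_start;  /* first column of each supernode; length nsuper+1   */
    int    *rowptr;     /* start of each supernode's row list; length nsuper+1 */
    int    *rowind;     /* shared row indices, one list per supernode        */
    double *nzval;      /* one dense (rows x cols) column-major block per supernode */
} SupernodalFactor;     /* hypothetical layout, for illustration only */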

10. [Figure: nonsymmetric supernodes in the original matrix A and in the factors L+U]

11. SuperLU speedup over unblocked code
• Matrices sorted in increasing "reuse ratio" = #flops / #nonzeros, roughly the arithmetic intensity
• Up to 40% of machine peak on large sparse matrices on IBM RS6000/590, MIPS R8000

12. Symmetric-pattern multifrontal factorization [John Gilbert's lecture]
[Figure: graph G(A), elimination tree T(A), and matrix A for the 3 x 3 grid example]

13. Symmetric-pattern multifrontal factorization
[Figure: G(A) and T(A) for the 3 x 3 grid example]
For each node of T, from leaves to root:
• Sum own row/col of A with children's Update matrices into Frontal matrix
• Eliminate current variable from Frontal matrix, to get Update matrix
• Pass Update matrix to parent

14. Symmetric-pattern multifrontal factorization
[Figure: node 1 forms its frontal matrix on rows/columns {1, 3, 7}: F1 = A1 => U1, with update matrix U1 on {3, 7}]
For each node of T, from leaves to root:
• Sum own row/col of A with children's Update matrices into Frontal matrix
• Eliminate current variable from Frontal matrix, to get Update matrix
• Pass Update matrix to parent

15. Symmetric-pattern multifrontal factorization
[Figure: F1 = A1 => U1 on rows/columns {1, 3, 7}; F2 = A2 => U2 on rows/columns {2, 3, 9}, with update matrix U2 on {3, 9}]
For each node of T, from leaves to root:
• Sum own row/col of A with children's Update matrices into Frontal matrix
• Eliminate current variable from Frontal matrix, to get Update matrix
• Pass Update matrix to parent

16. Symmetric-pattern multifrontal factorization
[Figure: node 3 assembles its frontal matrix on rows/columns {3, 7, 8, 9} by extend-add of its children's update matrices: F3 = A3 + U1 + U2 => U3]

17. Symmetric-pattern multifrontal factorization
[Figure: filled graph G+(A), elimination tree T(A), and filled matrix L+U]

18. Symmetric-pattern multifrontal factorization
[Figure: G(A) and T(A) for the 3 x 3 grid example]
• Variant of right-looking
• Really uses supernodes, not nodes
• All arithmetic happens on dense square matrices
• Needs extra memory for a stack of pending Update matrices
• Potential parallelism:
  • between independent tree branches
  • parallel dense ops on frontal matrix
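The per-node elimination step can be made concrete: once the dense frontal matrix is assembled, one step of dense Cholesky yields a column of L and the update matrix that is passed (extend-added) to the parent. A minimal runnable C sketch for the SPD case with one eliminated variable per front; all names are illustrative:

#include <math.h>
#include <stdio.h>

/* One multifrontal elimination step on a dense m-by-m frontal matrix F
 * (column-major), whose first row/column corresponds to the variable
 * being eliminated. Writes the L entries into L_col (length m) and the
 * (m-1)-by-(m-1) update matrix into U. Minimal sketch: symmetric
 * positive definite case, one variable per front. */
void eliminate_front(int m, const double *F, double *L_col, double *U)
{
    double d = sqrt(F[0]);              /* pivot: L(j,j)              */
    L_col[0] = d;
    for (int i = 1; i < m; i++)
        L_col[i] = F[i] / d;            /* L(i,j) = F(i,1) / L(j,j)   */
    for (int j = 1; j < m; j++)         /* update matrix:             */
        for (int i = 1; i < m; i++)     /* U = F22 - l21 * l21^T      */
            U[(i-1) + (j-1)*(m-1)] = F[i + j*m] - L_col[i] * L_col[j];
}

int main(void)
{
    /* 3x3 frontal matrix, column-major */
    double F[9] = {4, 2, 2,  2, 5, 3,  2, 3, 6};
    double Lcol[3], U[4];
    eliminate_front(3, F, Lcol, U);
    printf("L column: %g %g %g\n", Lcol[0], Lcol[1], Lcol[2]);
    printf("update U: [%g %g; %g %g]\n", U[0], U[2], U[1], U[3]);
    return 0;
}

In a real multifrontal code the front eliminates a whole supernode at once, and the update matrix is pushed on a stack until the parent front performs the extend-add; that stack is the extra memory noted in the bullets above.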

19. Sparse triangular solution
• Forward substitution for x = L \ b (back substitution for x = U \ b)
• Row-oriented = dot-product = left-looking
for i = 1 to n do
  x(i) = b(i);
  for j = 1 to i-1 do
    x(i) = x(i) – L(i, j) * x(j);   // dot-product
  end for
  x(i) = x(i) / L(i, i);
end for

20. Sparse triangular solution: x = L \ b
• Column-oriented = saxpy = right-looking
• Either way works in O(nnz(L)) time
x(1:n) = b(1:n);
for j = 1 to n do
  x(j) = x(j) / L(j, j);
  x(j+1:n) = x(j+1:n) – L(j+1:n, j) * x(j);   // saxpy
end for
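The column-oriented loop maps directly onto compressed sparse column (CSC) storage. A minimal runnable C sketch, assuming the diagonal entry is stored first in each column; names are illustrative:

#include <stdio.h>

/* Column-oriented (saxpy) forward solve x = L \ b with L in CSC form.
 * Assumes the diagonal entry is stored first in each column.
 * Runs in O(nnz(L)) time. */
void lsolve_csc(int n, const int *colptr, const int *rowind,
                const double *val, double *x /* in: b, out: x */)
{
    for (int j = 0; j < n; j++) {
        x[j] /= val[colptr[j]];                  /* divide by L(j,j)     */
        for (int p = colptr[j] + 1; p < colptr[j+1]; p++)
            x[rowind[p]] -= val[p] * x[j];       /* saxpy update below j */
    }
}

int main(void)
{
    /* L = [2 . .; 1 2 .; 1 1 2] in CSC; solve L*x = [2, 3, 4]^T */
    int    colptr[] = {0, 3, 5, 6};
    int    rowind[] = {0, 1, 2,  1, 2,  2};
    double val[]    = {2, 1, 1,  2, 1,  2};
    double x[]      = {2, 3, 4};
    lsolve_csc(3, colptr, rowind, val, x);
    printf("x = %g %g %g\n", x[0], x[1], x[2]);  /* expect 1 1 1 */
    return 0;
}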

21. Sparse right-hand side: x = L \ b, b sparse
[Figure: a triangular matrix A and its directed graph G(A)]
Use a directed acyclic graph (DAG):
• If A is triangular, G(A) has no cycles
• Lower triangular => edges directed from higher to lower numbers
• Upper triangular => edges directed from lower to higher numbers

22. Sparse right-hand side: x = L \ b, b sparse
[Figure: sparsity patterns of L, x, and b, and the graph G(L^T)]
b is sparse; x is also sparse, but may have fill-in
• Symbolic: predict the structure of x by depth-first search from the nonzeros of b
• Numeric: compute the values of x in topological order
• Time = O(flops)
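A minimal runnable C sketch of the two phases in this Gilbert-Peierls style (recursive DFS for brevity; an industrial code such as CSparse uses an iterative DFS and preallocated workspaces). L is in CSC with the diagonal first in each column; all names are illustrative:

#include <stdio.h>

/* Symbolic phase: DFS from a nonzero position of b follows the edges
 * j -> i of G(L^T) (i.e. L(i,j) != 0, i > j) and pushes nodes in
 * postorder; reverse postorder is a topological order of the nonzero
 * positions of x. */
static void dfs(int j, const int *colptr, const int *rowind,
                char *marked, int *order, int *top)
{
    marked[j] = 1;
    for (int p = colptr[j] + 1; p < colptr[j+1]; p++)
        if (!marked[rowind[p]])
            dfs(rowind[p], colptr, rowind, marked, order, top);
    order[(*top)++] = j;                /* postorder push */
}

int main(void)
{
    enum { N = 4 };
    /* L = [2 . . .; 1 2 . .; . 1 2 .; . . . 2], CSC storage */
    int    colptr[] = {0, 2, 4, 5, 6};
    int    rowind[] = {0, 1,  1, 2,  2,  3};
    double val[]    = {2, 1,  2, 1,  2,  2};

    /* sparse right-hand side b = 2*e0 (only row 0 is nonzero) */
    int    bind[]   = {0};
    double bval[]   = {2.0};
    int    nb = 1;

    char   marked[N] = {0};
    int    order[N], top = 0;
    double x[N] = {0};

    for (int k = 0; k < nb; k++) {      /* symbolic: reach of b's pattern */
        if (!marked[bind[k]])
            dfs(bind[k], colptr, rowind, marked, order, &top);
        x[bind[k]] = bval[k];           /* scatter b into x */
    }
    for (int t = top - 1; t >= 0; t--) {/* numeric, topological order: */
        int j = order[t];               /* touches only the reach, so  */
        x[j] /= val[colptr[j]];         /* total time is O(flops)      */
        for (int p = colptr[j] + 1; p < colptr[j+1]; p++)
            x[rowind[p]] -= val[p] * x[j];
    }
    printf("nonzeros of x:");
    for (int t = top - 1; t >= 0; t--)
        printf("  x[%d] = %g", order[t], x[order[t]]);
    printf("\n");                       /* expect x[0]=1 x[1]=-0.5 x[2]=0.25 */
    return 0;
}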

23. Recall: left-looking sparse LU
[Figure: matrix partitioned into DONE, ACTIVE, and NOT TOUCHED regions; column j is the active column]
• The sparse right-hand side solve U(1:j-1, j) = L(1:j-1, 1:j-1) \ A(1:j-1, j) is used in symbolic factorization to find the nonzeros in column j
for j = 1 to n do
  for k in struct(U(1:j-1, j)) do
    cmod(j, k)
  end for
  cdiv(j)
end for

24. References
• M. T. Heath, E. Ng, and B. W. Peyton, "Parallel Algorithms for Sparse Linear Systems", SIAM Review, Vol. 33 (3), pp. 420-460, 1991.
• E. Rothberg and A. Gupta, "Efficient Sparse Matrix Factorization on High-Performance Workstations: Exploiting the Memory Hierarchy", ACM Trans. Math. Software, Vol. 17 (3), pp. 313-334, 1991.
• E. Rothberg and A. Gupta, "An Efficient Block-Oriented Approach to Parallel Sparse Cholesky Factorization", SIAM J. Sci. Comput., Vol. 15 (6), pp. 1413-1439, 1994.
• J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, "A Supernodal Approach to Sparse Partial Pivoting", SIAM J. Matrix Analysis and Applications, Vol. 20 (3), pp. 720-755, 1999.
• J. W. H. Liu, "The Multifrontal Method for Sparse Matrix Solution: Theory and Practice", SIAM Review, Vol. 34 (1), pp. 82-109, 1992.

25. Exercises
• Study and run the OpenMP code of dense LU factorization in the Hands-On-Exercises/ directory
