
Use of Parallel SuperLU in Large-Scale Scientific Calculations


Presentation Transcript


  1. Use of Parallel SuperLU in Large-Scale Scientific Calculations X. Sherry Li Lawrence Berkeley National Lab SIAM Annual Meeting, July 12-16, 2004

  2. Outline • Algorithms and implementation in SuperLU_DIST • Compare with sequential SuperLU • In quantum mechanics (linear systems) • With M. Baertschy, C. W. McCurdy, T. N. Rescigno, W. A. Isaacs • In accelerator design (eigenvalue problems) • With P. Husbands, C. Yang, L. Lee • New developments

  3. Introduction • Direct method (Gaussian elimination) very useful for ill-conditioned systems • Uses of direct solver in large-scale applications: • Solve the system completely • How accurate is the solution? • Pivot growth, condition number, error bounds • Preconditioner for iterative solvers • linear system: M^{-1}A x = M^{-1}b • shift-and-invert eigensolver: Kx = λx → (K − σI)^{-1} x = μx, where μ = 1 / (λ − σ)
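
A one-line derivation of why the spectral transformation helps (context added here; it is not on the slide):

```latex
% If K x = \lambda x and (K - \sigma I) is nonsingular, then
% (K - \sigma I) x = (\lambda - \sigma) x, and applying the inverse gives
\[
  (K - \sigma I)^{-1} x \;=\; \frac{1}{\lambda - \sigma}\, x \;=\; \mu x,
  \qquad \mu = \frac{1}{\lambda - \sigma}.
\]
% Eigenvalues \lambda closest to the shift \sigma become the
% largest-magnitude \mu, which Krylov methods find first.
```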

  4. SuperLU (and _MT) vs. SuperLU_DIST
  SuperLU (and _MT) [Demmel/Gilbert/Li]:
  • Partial pivoting, with diagonal preference
  • Sparsity order: A'*A-based; fix column order
  • "BLAS-2.5"
  • Left-looking (fan-in): more "read" than "write"
  SuperLU_DIST [Li/Demmel]:
  • "Static" pivoting, based on A
  • Sparsity order: A'+A-based; fix row & column order
  • BLAS-3
  • Right-looking (fan-out): more parallelism
  • 2D block-cyclic layout

  5. How to distribute the matrices? • Matrices involved: • A, B (turned into X) – input; users manipulate them • L, U – output; users do not need to see them • A (sparse) and B (dense) are distributed by block rows; local A is stored in compressed-row format • Natural for users, and consistent with other popular packages: PETSc, Aztec, etc.
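
A minimal sketch of such a block-row split, assuming a contiguous partition of the rows (function and variable names are hypothetical, not SuperLU_DIST's API):

```c
#include <stdio.h>

/* Sketch: assign each of nprocs processes a contiguous block of rows.
 * Process p owns rows [first_row, first_row + nrow_local); the local
 * part of A is then stored in its own compressed-row (CSR) arrays. */
void block_row_range(int nrow_global, int nprocs, int p,
                     int *first_row, int *nrow_local)
{
    int base = nrow_global / nprocs;   /* rows every process gets      */
    int rem  = nrow_global % nprocs;   /* first 'rem' ranks get 1 more */
    *nrow_local = base + (p < rem ? 1 : 0);
    *first_row  = p * base + (p < rem ? p : rem);
}

int main(void)
{
    int first, n;
    for (int p = 0; p < 3; p++) {      /* e.g. P0, P1, P2 on 8 rows */
        block_row_range(8, 3, p, &first, &n);
        printf("P%d owns rows %d..%d\n", p, first, first + n - 1);
    }
    return 0;
}
```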

  6. Distribution of L & U • 2D (non-uniform) block-cyclic distribution over the process grid • Better for GE scalability and load balance • Provide a re-distribution routine to change A from the 1D block layout into the 2D block-cyclic layout
  [Figure: blocks of L and U mapped cyclically onto a 2x3 process grid, processes 0-5]
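
A toy illustration of the block-cyclic idea, assuming a uniform block size (the slide's actual distribution is non-uniform) and a 2x3 process grid:

```c
#include <stdio.h>

/* Sketch: map block (bi, bj) of the matrix onto a Pr x Pc process
 * grid with a uniform 2D block-cyclic distribution. */
int owner_of_block(int bi, int bj, int Pr, int Pc)
{
    return (bi % Pr) * Pc + (bj % Pc);   /* row-major process rank */
}

int main(void)
{
    const int Pr = 2, Pc = 3;            /* 2 x 3 grid: ranks 0..5 */
    for (int bi = 0; bi < 4; bi++) {
        for (int bj = 0; bj < 6; bj++)
            printf("%d ", owner_of_block(bi, bj, Pr, Pc));
        printf("\n");
    }
    return 0;
}
```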

  7. Quantum Mechanics • Scattering in a quantum system of three charged particles • Simplest example is ionization of a hydrogen atom by collision with an electron: e- + H → H+ + 2e- • Seek the particles' wave functions represented by the time-independent Schrödinger equation • First solution to this long-standing unsolved problem [Rescigno et al., Science, 24 Dec 1999]

  8. Quantum Mechanics, cont. • New Exterior Complex Scaling formulation to represent scattering states, which simplifies the boundary conditions • Becomes computationally tractable • Finite differences lead to complex, unsymmetric systems • Diagonal blocks have the structure of 2D finite-difference Laplacian matrices; very sparse: nonzeros per row <= 13 • Each off-diagonal block is a diagonal matrix • Between 6 and 24 blocks, each of order between 200K and 2M • Total dimension up to 8.4 M • Too much fill if a direct method is applied to the whole system . . .

  9. Matrix Problem • Example: N = 1024, NNZ = 12544, Cond. Num. 4e+03 • No iterative solver converges, either without a preconditioner or with simple preconditioners

  10. SuperLU_DIST as Preconditioner • Block-diagonal preconditioner for CGS iteration: M^{-1}A x = M^{-1}b, with M = diag(A11, A22, A33, …) • Use SuperLU_DIST for each diagonal block • Need to create 3 process grids with the same logical ranks (0:3) but different physical ranks • Each grid has its own MPI communicator (see the sketch below) • No pivoting, no iterative refinement
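
A minimal MPI sketch of carving several process grids out of MPI_COMM_WORLD so each diagonal block gets its own communicator with logical ranks 0..3 (my sketch, assuming 12 ranks and 4 ranks per grid; not code from the talk):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: 12 global ranks, split into 3 groups of 4. Each group
 * becomes one "process grid" handling one diagonal block A_kk;
 * within its communicator every group sees logical ranks 0..3. */
int main(int argc, char **argv)
{
    int world_rank, grid_rank;
    MPI_Comm grid_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int block = world_rank / 4;              /* which A_kk we own   */
    MPI_Comm_split(MPI_COMM_WORLD, block, world_rank, &grid_comm);

    MPI_Comm_rank(grid_comm, &grid_rank);    /* 0..3 in every grid  */
    printf("world %2d -> block %d, grid rank %d\n",
           world_rank, block, grid_rank);

    /* ... hand grid_comm to the per-block factorization here ...   */

    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}
```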

  11. Complex, unsymmetric • N = 2 M, NNZ = 26 M • Fill-ins using Metis: 1.3 G (50x fill) • Factorization speed: 10x speedup (4 to 128 processors), up to 30 Gflops • Timings on IBM SP

  12. Accelerator Cavity Design • Calculate cavity mode frequencies and field vectors • Solve the Maxwell equations for the electromagnetic field • Omega3P simulation code (SLAC)
  [Figures: Omega3P model of a 47-cell section of the 206-cell Next Linear Collider accelerator structure; individual cells used in the accelerator structure]

  13. Accelerator, cont. • Finite element methods lead to a large sparse generalized eigensystem Kx = λMx • Real symmetric for lossless cavities; complex symmetric when the cavities are lossy • Seek interior eigenvalues (tightly clustered) that are relatively small in magnitude

  14. Accelerator, cont. • Speed up Lanczos convergence by shift-invert: seek the largest, well-separated eigenvalues of the transformed system M (K − σM)^{-1} M x = μ M x, where μ = 1 / (λ − σ) • The Filtering algorithm [Y. Sun]: inexact shift-invert Lanczos + JOCC • Exact shift-invert Lanczos (ESIL): PARPACK for Lanczos, SuperLU_DIST for the shifted linear systems • No pivoting, no iterative refinement
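
To make the transformed operator concrete, here is a self-contained toy that applies shift-invert iteration to a 2x2 pencil; shift-invert power iteration stands in for Lanczos, and an explicit 2x2 inverse stands in for the SuperLU_DIST factorization of K − σM:

```c
#include <stdio.h>
#include <math.h>

/* Toy illustration of shift-invert for K x = lambda M x: factor
 * A = K - sigma*M once, then iterate w = A^{-1} (M v). Eigenvalues
 * lambda near sigma map to dominant mu = 1/(lambda - sigma). */
int main(void)
{
    double K[2][2] = {{2, 0}, {0, 5}};
    double M[2][2] = {{1, 0}, {0, 1}};
    double sigma = 1.9;                     /* shift near lambda = 2 */
    double A[2][2], v[2] = {1, 1}, w[2], t[2];

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            A[i][j] = K[i][j] - sigma * M[i][j];
    double det = A[0][0]*A[1][1] - A[0][1]*A[1][0];

    for (int it = 0; it < 20; it++) {
        t[0] = M[0][0]*v[0] + M[0][1]*v[1];          /* t = M v      */
        t[1] = M[1][0]*v[0] + M[1][1]*v[1];
        w[0] = ( A[1][1]*t[0] - A[0][1]*t[1]) / det; /* w = A^{-1} t */
        w[1] = (-A[1][0]*t[0] + A[0][0]*t[1]) / det;
        double nrm = sqrt(w[0]*w[0] + w[1]*w[1]);
        v[0] = w[0]/nrm; v[1] = w[1]/nrm;            /* normalize    */
    }
    /* Rayleigh-style estimate: mu ~ v'A^{-1}Mv, lambda = sigma + 1/mu */
    t[0] = M[0][0]*v[0] + M[0][1]*v[1];
    t[1] = M[1][0]*v[0] + M[1][1]*v[1];
    w[0] = ( A[1][1]*t[0] - A[0][1]*t[1]) / det;
    w[1] = (-A[1][0]*t[0] + A[0][0]*t[1]) / det;
    double mu = v[0]*w[0] + v[1]*w[1];
    printf("lambda ~ %.6f (exact 2.0)\n", sigma + 1.0/mu);
    return 0;
}
```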

  15. DDS47, Linear Elements • Total eigensolver time: N = 1.3 M, NNZ = 20 M

  16. Largest Problem • DDS47, quadratic elements • N = 7.5 M, NNZ = 304 M • 6 G fill-ins using Metis • 24 processors (8x3) • Factor: 3,347 s • 1 Solve: 61 s • Eigensolver: 9,259 s (~2.5 hrs) • 1 shift, 10 eigenvalues, 55 solves

  17. Parallel Symbolic Factorization [L. Grigori] • Parallel ordering with ParMETIS on G(A'+A) • Separator tree (binary) to guide the computation • Each step: one row of U, one column of L • Within each separator: 1D block-cyclic distribution • Send the necessary contributions to the parent processor • Preliminary results: • Reasonable speedup: up to 6x • Need to improve memory balance

  18. Improve Triangular Solve • Important in preconditioning . . . • "Switch to dense" towards the end • Dense part can be of order • Removes the overhead of the sparse data structure and control structure • On uniprocessors, achieved 1.8x speedup together with register blocking [Vuduc et al.] • Better speedup expected in the parallel code
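
A self-contained toy of the switch-to-dense idea in a forward solve (my sketch under assumed data, not the SuperLU kernel): leading rows use CSR indexing, the trailing rows use a dense buffer with no index indirection.

```c
#include <stdio.h>

#define N 6
#define SWITCH 4   /* rows >= SWITCH use the dense trailing block */

/* Sketch of "switch to dense" in a forward solve L x = b, with L
 * unit lower triangular. Rows 0..SWITCH-1 use sparse CSR storage;
 * trailing rows are stored densely. Matrix values are made up. */
int main(void)
{
    /* Sparse part: strictly-lower entries of rows 0..3 (CSR). */
    int    rowptr[SWITCH + 1] = {0, 0, 1, 2, 4};
    int    colind[] = {0, 1, 0, 2};
    double val[]    = {0.5, 0.25, 0.1, 0.3};
    /* Dense part: strictly-lower entries of rows 4..5, all columns. */
    double dense[N - SWITCH][N] = {
        {0.2, 0.0, 0.1, 0.0, 0.0, 0.0},
        {0.0, 0.3, 0.0, 0.2, 0.4, 0.0},
    };
    double x[N] = {1, 2, 3, 4, 5, 6};   /* starts as b, ends as x */

    for (int i = 0; i < SWITCH; i++)            /* sparse phase */
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            x[i] -= val[k] * x[colind[k]];
    for (int i = SWITCH; i < N; i++)            /* dense phase  */
        for (int j = 0; j < i; j++)
            x[i] -= dense[i - SWITCH][j] * x[j];

    for (int i = 0; i < N; i++) printf("x[%d] = %g\n", i, x[i]);
    return 0;
}
```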

  19. Improve Triangular Solve • Partitioned inverse [Alvarado/Pothen/Schreiber] • Preprocessing: find a partitioned representation L = P1 P2 … Pm • Solve phase: m sparse matrix-vector multiplies • Requirements: • Each Pi is invertible in place (no more fill generated) • Find m as small as possible by proper reordering • Trivial partitions: • m = n, i.e., each Pi is an elementary matrix from GE • m = # of supernodes, i.e., the supernode partition
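
A toy of the solve phase, assuming the inverted partitions inv(Pi) are precomputed (stored densely here for brevity; by the in-place requirement they keep the sparsity of Pi in practice):

```c
#include <stdio.h>
#include <string.h>

#define N 3
#define M 2   /* number of partitions */

/* Sketch of the partitioned-inverse solve: with L = P1*P2 and
 * inv(P1), inv(P2) precomputed, x = inv(P2)*inv(P1)*b is just
 * M mat-vecs -- no sequential substitution recurrence. */
static const double inv_part[M][N][N] = {
    {{1, 0, 0}, {-2, 1, 0}, { 0, 0, 1}},   /* inv(P1) */
    {{1, 0, 0}, { 0, 1, 0}, {-3, 0, 1}},   /* inv(P2) */
};

int main(void)
{
    double x[N] = {1, 1, 1}, y[N];          /* x starts as b      */
    for (int k = 0; k < M; k++) {           /* x = inv(Pk+1) * x  */
        for (int i = 0; i < N; i++) {
            y[i] = 0;
            for (int j = 0; j < N; j++)
                y[i] += inv_part[k][i][j] * x[j];
        }
        memcpy(x, y, sizeof x);
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}
```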

  20. Extras

  21. GE with Static Pivoting • Before factorization, scale and permute A to maximize the diagonal: A ← Pr Dr A Dc • MC64 [Duff/Koster '99] • Parallel auction algorithm on the way [Riedy] • Find a permutation Pc to preserve sparsity: A ← Pc A Pc' • While factorizing A = LU, replace tiny pivots by √ε ||A||, without changing the structures of L & U • Most matrices are insensitive to this perturbation • If needed, use iterative refinement or an iterative solver to improve the first solution
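
A toy dense Gaussian elimination with the slide's pivot-repair rule, assuming the √ε·||A|| threshold (values and scale made up; the real code does this inside the sparse supernodal factorization):

```c
#include <math.h>
#include <stdio.h>
#include <float.h>

#define N 3

/* Sketch of static pivoting's pivot repair: when a pivot is smaller
 * than sqrt(eps)*||A||, replace it by that threshold (keeping the
 * sign) instead of swapping rows, so the structures of L and U
 * never change. */
int main(void)
{
    double A[N][N] = {{1e-20, 2, 3}, {4, 5, 6}, {7, 8, 10}};
    double normA = 10.0;                       /* ~ max|a_ij| here   */
    double tau = sqrt(DBL_EPSILON) * normA;    /* pivot threshold    */

    for (int k = 0; k < N; k++) {
        if (fabs(A[k][k]) < tau)               /* tiny pivot: repair */
            A[k][k] = (A[k][k] < 0 ? -tau : tau);
        for (int i = k + 1; i < N; i++) {      /* eliminate below k  */
            double l = A[i][k] / A[k][k];
            A[i][k] = l;                       /* store L factor     */
            for (int j = k + 1; j < N; j++)
                A[i][j] -= l * A[k][j];
        }
    }
    printf("U(0,0) after repair: %g\n", A[0][0]);
    return 0;
}
```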

  22. Numerical Results • 68 unsymmetric matrices • If no pivoting at all: • 26 contain zeros on the diagonal to begin with • 2 create new zeros on the diagonal during elimination (bbmat, orsreg_1) • Most of the others get large errors due to pivot growth • Working-precision iterative refinement (64 bits) • Detailed table at www.nersc.gov/~xiaoye/SuperLU/GESP

  23. Sparse Backward Error

  24. Accuracy: GESP vs GEPP

  25. Refinement Steps

  26. Times of Various Phases

  27. Better Sparsity Ordering • SuperLU_DIST: fill-ins in L+U (10^6) • Diagonal Markowitz (based on A): 10-15% better [Amestoy/Li/Ng '03]

  28. BLAS 3 Kernel • Copying is almost free • 20-40% better than BLAS 2.5 on IBM SP
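
The point is that gathering scattered supernodal rows into a contiguous dense buffer costs little relative to the dense multiply it enables. A toy sketch of the gather-then-GEMM pattern (hand-rolled multiply so it stays self-contained; a real kernel would call BLAS dgemm on the gathered buffer):

```c
#include <stdio.h>
#include <string.h>

#define M 4
#define K 2
#define NCOL 3

/* Sketch: copy scattered rows of a sparse block into a contiguous
 * dense buffer (cheap, O(M*K) moves), then run a dense multiply on
 * it (O(M*K*NCOL) flops, cache-friendly). */
int main(void)
{
    double rows[8][K] = {{1,2},{0,0},{3,4},{0,0},
                         {5,6},{0,0},{7,8},{0,0}};
    int    rowind[M]  = {0, 2, 4, 6};   /* which rows are nonzero */
    double Agath[M][K], B[K][NCOL] = {{1,0,1},{0,1,1}}, C[M][NCOL];

    for (int i = 0; i < M; i++)          /* gather: almost free    */
        memcpy(Agath[i], rows[rowind[i]], sizeof Agath[i]);

    for (int i = 0; i < M; i++)          /* dense GEMM on buffer   */
        for (int j = 0; j < NCOL; j++) {
            C[i][j] = 0;
            for (int k = 0; k < K; k++)
                C[i][j] += Agath[i][k] * B[k][j];
        }
    printf("C[0][2] = %g\n", C[0][2]);   /* 1*1 + 2*1 = 3          */
    return 0;
}
```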

  29. Default Setting May Not Be Good • Diagonal perturbation is bad for 4 matrices: Ex11, fidap011, inaccura, raefsky4 • All work well with either of two remedies: • Turn off the diagonal perturbation, or • Use the LU factors as a preconditioner for GMRES [SPARSKIT]

  30. SuperLU_DIST as Preconditioner • SuperLU_DIST as a block-diagonal preconditioner for the CGS iteration: M^{-1}A x = M^{-1}b, with M = diag(A11, A22, A33, …) • Run multiple SuperLU_DIST instances simultaneously • No pivoting, no iterative refinement • 96 to 280 iterations at 1-2 minutes/iteration using 64 IBM SP processors • Total time ~1 to 8 hours

  31. SuperLU_DIST Enhancements • 64-bit addressing • Some integers need to be 64-bit for the largest problem (otherwise integer overflow) • Will large-memory SMP nodes help us? • IBM SP nodes: 16-way 375 MHz POWER3 • GB/node: 16, 32 (64 nodes), 64 (4 nodes) • "Colony" switch

  32. Use of Shared Memory • Symbolic factorization (7 GB for the largest problem) is still serial and replicated on all processors • Save communication by not having to broadcast the output • Better to perform one symbfact() per SMP node and share the results [Husbands] (see the sketch below) • Also used in High Performance Linpack
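
A sketch of "one symbfact() per SMP node" in MPI terms: split the world communicator into per-node groups, let node-local rank 0 do the replicated work once, and share it on the node. (MPI_Comm_split_type is MPI-3, which postdates this talk; the idea is the same.)

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm node_comm;
    int world_rank, node_rank, result = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group ranks that share a node's memory. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    if (node_rank == 0)
        result = 42;    /* stand-in for the symbolic factorization */

    /* Share within the node only (or place it in a shared window). */
    MPI_Bcast(&result, 1, MPI_INT, 0, node_comm);
    printf("world %d got %d\n", world_rank, result);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```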

  33. Some Open Problems • Parameter setting is configurable by users • Need an automatic strategy to choose a proper configuration: • Run the default configuration first (it mostly works) • Compute forward & backward error bounds • If the errors are large, invoke an iterative solver • How does extra-precision iterative refinement help? • Investigating dense LU using XBLAS • Guaranteed stability if variable precision is used during GE [Demmel]
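
For reference, the classic iterative-refinement loop being discussed, as a self-contained toy (a real code keeps the LU factors and, for extra-precision refinement, accumulates the residual in higher precision):

```c
#include <stdio.h>
#include <math.h>

/* Toy iterative refinement on a 2x2 system: r = b - A x, then
 * x += Ap^{-1} r, where Ap is a deliberately perturbed A that
 * mimics a factorization weakened by static pivoting. Refinement
 * recovers the accurate solution anyway. */
int main(void)
{
    double A[2][2]  = {{4, 1}, {1, 3}};
    double Ap[2][2] = {{4.1, 1}, {1, 2.9}};   /* perturbed factors */
    double b[2] = {9, 5}, x[2] = {0, 0};      /* exact x = (2, 1)  */
    double detp = Ap[0][0]*Ap[1][1] - Ap[0][1]*Ap[1][0];

    for (int it = 0; it < 10; it++) {
        double r0 = b[0] - (A[0][0]*x[0] + A[0][1]*x[1]);
        double r1 = b[1] - (A[1][0]*x[0] + A[1][1]*x[1]);
        x[0] += ( Ap[1][1]*r0 - Ap[0][1]*r1) / detp;  /* d = Ap^{-1} r */
        x[1] += (-Ap[1][0]*r0 + Ap[0][0]*r1) / detp;
        printf("iter %d: residual %.2e\n", it, fmax(fabs(r0), fabs(r1)));
    }
    printf("x = (%g, %g), exact (2, 1)\n", x[0], x[1]);
    return 0;
}
```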

  34. Future Work • Parallel symbolic factorization to enhance memory scalability • Better triangular solve (latency-bound) • Include ILU capability • Optimizations for SMP clusters • Hybrid approach with threads in each MPI task
