Yihua Bai Department of Mathematics and Computer Science Indiana State University

High-Performance Eigensolver for Real Symmetric Matrices: Parallel Implementations and Applications in Electronic Structure Calculation Yihua Bai Department of Mathematics and Computer Science Indiana State University

Contents • Current status of real symmetric eigensolvers • Motivation • BD&C algorithm – a high performance approximate eigensolver • Parallel implementations of BD&C algorithm • Applications in electronic structure calculation and numerical results • Summary and Future Work

Current Status of Dense Symmetric Eigensolvers [Figure: PDSYEVD, PDSYEVX, PDSYEVR]

Classical Three Steps to Decompose $A = X \Lambda X^T$
• Reduction to symmetric tridiagonal form: $A = H T H^T$
• Eigen-decomposition of the tridiagonal matrix: $T = V \Lambda V^T$
  – Cuppen’s divide-and-conquer
  – Bisection and inverse iteration
  – Multiple Relatively Robust Representations (MRRR)
• Back-transformation of the eigenvectors: $X = H V$
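
The bisection option for the tridiagonal step can be illustrated with a Sturm-sequence count. This is a minimal serial sketch, not the ScaLAPACK implementation; the function names `count_below`, `gershgorin_bounds`, and `kth_eigenvalue` are hypothetical, and the tridiagonal matrix is assumed to be given by its diagonal `d` and off-diagonal `e`.

```python
import math

def count_below(d, e, x):
    """Count eigenvalues of the symmetric tridiagonal matrix
    (diagonal d, off-diagonal e) that are strictly less than x."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        if q == 0.0:
            q = 1e-300  # nudge away from an exact zero pivot
        q = d[i] - x - off / q  # pivot of the LDL^T factorization of T - x I
        if q < 0.0:
            count += 1
    return count

def gershgorin_bounds(d, e):
    """Interval that is guaranteed to contain every eigenvalue."""
    n = len(d)
    lo, hi = math.inf, -math.inf
    for i in range(n):
        r = (abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
        lo = min(lo, d[i] - r)
        hi = max(hi, d[i] + r)
    return lo, hi

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) found by bisection."""
    lo, hi = gershgorin_bounds(d, e)
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid  # at least k + 1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Each eigenvalue is located independently of the others, which is why bisection parallelizes easily; the eigenvectors would then come from inverse iteration, which is not shown here.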

Bottleneck of Classical Approaches • The reduction to tridiagonal form is the time bottleneck [Figure panels: PDSYEVR, PDSYEVD] Source: Robert C. Ward and Yihua Bai, Performance of Parallel Eigensolvers on Electronic Structure Calculations II, Technical Report UT-CS-06-572, University of Tennessee, August 2006.

Limitation of Classical Approaches • They compute the eigensolution to full accuracy, although lower accuracy is frequently sufficient in electronic structure calculations. Questions: Can accuracy be traded for efficiency? How?

Motivation A high performance approximate eigensolver for electronic structure calculation

Schrödinger’s Equation: An Intrinsic Eigenvalue Problem

Computation of Electronic Structure
• Solve Schrödinger’s Equation efficiently
• Different approximation methods
  – Hartree-Fock approximation
  – Density functional theory
  – Configuration interaction
  – …, etc.
• Self-Consistent Field method
  – Solve generalized non-linear real symmetric eigenvalue problem iteratively
  – A standard linear eigenvalue problem is solved in each iteration
  – Typically the most time-consuming part of electronic structure calculation
  – Low accuracy suffices in earlier iterations
• Matrices from application problems may have locality properties

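Problem Definition Given a real symmetric matrix $A$ and an accuracy tolerance $\tau$, we want to compute $A \approx \hat{V} \hat{\Lambda} \hat{V}^T$, where $\hat{V}$ and $\hat{\Lambda}$ contain the approximate eigenvectors and eigenvalues, respectively, and satisfy a residual bound governed by $\tau$ while $\hat{V}$ remains numerically orthogonal.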

Block Algorithms for Approximate Eigensolver
1) Block-tridiagonal divide-and-conquer (BD&C) – the centerpiece
2) Block tridiagonalization (BT) – block tridiagonalization of sparse and “effectively” sparse matrices
3) Orthogonal reduction of full matrix to block-tridiagonal form (OBR) – orthogonal transformations to produce a block-tridiagonal matrix

1) BD&C Algorithm*
Decompose $M \approx \hat{V} \hat{\Lambda} \hat{V}^T$, where
• $\hat{V}$: numerically orthogonal eigenvector matrix
• $\hat{\Lambda}$: diagonal matrix of eigenvalues
• $M$: block tridiagonal matrix
• $\tau$: accuracy tolerance
• $q$: number of blocks
* W. N. Gansterer, R. C. Ward, R. P. Muller and W. A. Goddard III, Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices, SIAM J. Sci. Comput., 25 (2003), pp. 65–85.

Three Steps of BD&C
1. Subdivision
2. Solve sub-problems: decompose each sub-problem
3. Synthesis (the most time-consuming step): decompose, then multiply $V_i$ and $Z$
Complexity: a function of deflation, rank, and size
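
To make the synthesis step more concrete: in Cuppen-style divide-and-conquer, merging two solved sub-problems leads to eigenproblems of the form $D + \rho z z^T$, whose eigenvalues are the roots of a secular equation. The sketch below assumes that BD&C's synthesis reduces to such rank-one problems (one per retained rank of an off-diagonal block) and handles only the simplest case: $\rho > 0$, strictly increasing $d_i$, and no deflation. The function name `secular_roots` is hypothetical, and plain bisection stands in for the faster root finders used in production codes.

```python
def secular_roots(d, z, rho, iters=200):
    """Eigenvalues of diag(d) + rho * z z^T via the secular equation
    f(lam) = 1 + rho * sum(z_i^2 / (d_i - lam)) = 0.
    Assumes rho > 0, d strictly increasing, every z[i] nonzero."""
    def f(lam):
        return 1.0 + rho * sum(zi * zi / (di - lam) for di, zi in zip(d, z))

    # The largest eigenvalue is bounded by d[-1] + rho * ||z||^2.
    upper = d[-1] + rho * sum(zi * zi for zi in z)
    roots = []
    for i in range(len(d)):
        lo = d[i]
        hi = d[i + 1] if i + 1 < len(d) else upper
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid == lo or mid == hi:
                break  # interval has collapsed to adjacent floats
            if f(mid) < 0.0:  # f increases from -inf to +inf on each interval
                lo = mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return roots
```

Deflation removes components with negligible $z_i$ or nearly equal $d_i$ before this step, which shrinks the problem that has to be solved and the matrix products that follow; that is one reason the cost depends on deflation as well as rank and size.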

2) Block Tridiagonalization (BT)* • An approximation to the original full matrix • May require eigenvectors from previous iteration Complexity: * Y. Bai, W. N. Gansterer and R. C. Ward, Block-Tridiagonalization of “Effectively” Sparse Symmetric Matrices, ACM Trans. Math. Softw., 30 (2004), pp. 326 – 352.

3) Orthogonal Reduction to Block-Tridiagonal Matrix (OBR)* • A full matrix that cannot be sparsified • A sequence of Householder transformations Complexity:
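
The building block of such a reduction is a single Householder reflector that maps a vector onto a multiple of the first unit vector. The sketch below shows only that building block, not the OBR algorithm itself; the helper names `householder_vector` and `apply_reflector` are hypothetical.

```python
import math

def householder_vector(x):
    """Return (v, alpha) such that H = I - 2 v v^T / (v^T v)
    satisfies H x = alpha * e1."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    alpha = -math.copysign(norm, x[0])  # sign chosen to avoid cancellation
    v = list(x)
    v[0] -= alpha
    return v, alpha

def apply_reflector(v, y):
    """Compute H y without forming the matrix H."""
    vtv = sum(vi * vi for vi in v)
    if vtv == 0.0:
        return list(y)  # x was the zero vector, so H is the identity
    scale = 2.0 * sum(vi * yi for vi, yi in zip(v, y)) / vtv
    return [yi - scale * vi for vi, yi in zip(v, y)]
```

Applying reflectors of this kind from both sides, one block column at a time, zeroes the entries outside the target band while preserving symmetry and eigenvalues, because each reflector is orthogonal.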

Complexity of Major Components • message passing latency • time to transfer one floating point number • time for one floating point operation • ranks for off-diagonal blocks

Parallel Implementations
• Parallel block divide-and-conquer (PBD&C)*
• Preprocessing
  – Parallel block tridiagonalization (PBT)
  – Parallel orthogonal block-tridiagonal reduction (POBR)**
* Yihua Bai and Robert C. Ward, A Parallel Symmetric Block-Tridiagonal Divide-and-Conquer Algorithm, Technical Report UT-CS-06-571, University of Tennessee, December 2005. Submitted to ACM TOMS.
** Yihua Bai and Robert C. Ward, Parallel Block Tridiagonalization of Real Symmetric Matrices, Technical Report UT-CS-06-578, University of Tennessee, June 2006. Submitted to ACM TOMS.

Implementations of PBD&C Mixed data/task parallel implementation versus complete data parallel implementation

Mixed Parallel Implementation • Mixed parallelism – data/task • Data distribution and redistribution • Merging sequence and workload balance • Deflation

Matrix Distribution – Mixed Data/Task Parallelism • Divide processors into groups of sub-grids • Assign each sub-grid to a sub-problem Block-tridiagonal matrix with q diagonal blocks

Matrix Distribution – Example 2D block cyclic distribution on each sub-grid Each diagonal block assigned a sub-grid
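
The 2D block cyclic layout can be written down as an index map. This is a sketch of the standard ScaLAPACK-style mapping, assuming 0-based indices and that the first block lives on process (0, 0); the function name `block_cyclic_owner` and its parameter names are hypothetical.

```python
def block_cyclic_owner(i, j, mb, nb, prow, pcol):
    """Map global entry (i, j) to ((process row, process col),
    (local row, local col)) for mb x nb blocks on a prow x pcol grid."""
    def one_dim(g, blk, nproc):
        block = g // blk                   # global block index
        owner = block % nproc              # blocks are dealt out cyclically
        local = (block // nproc) * blk + g % blk
        return owner, local

    pr, li = one_dim(i, mb, prow)
    pc, lj = one_dim(j, nb, pcol)
    return (pr, pc), (li, lj)
```

With 2×2 blocks on a 2×2 sub-grid, for example, global rows 0, 1, 4, 5 land on process row 0 and rows 2, 3, 6, 7 on process row 1, so every process holds an equal share of an 8×8 diagonal block.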

Data Redistribution • Redistribute data from one sub-grid to another (subdivision step) Example: distribute from a 2×2 grid to a 3×3 grid

Data Redistribution (cont’d) • Redistribute data for each merging operation from two sub-grids to one super-grid (synthesis step) Example: distribute from a 2×2 grid and a 2×4 grid to a 3×4 grid

Merging Sequence [Figure: merging tree with Levels 0 to 4, subtree heights h_left and h_right, idle time, and the final merging operation] • The final merging operation accounts for up to 75% of the total computational cost. • For the final merge, consider low computational complexity and workload balance at the same time.

Problems • Subgrid construction • Example: subgrid 1: 2×2, subgrid 2: 5×5, supergrid: 1×29? • Many communicator handles • Can use up to 2k handles, where k = max(number of diagonal blocks, total number of processors) • Portability across different MPI implementations • Example: the code needs minor modification when using mpimx (Myrinet MPI)
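
The subgrid construction problem is easy to reproduce. The sketch below assumes the supergrid must use every process of the two merged subgrids and should be as close to square as possible; the helper name `most_square_grid` is hypothetical and is not taken from the PBD&C code.

```python
import math

def most_square_grid(p):
    """Return (rows, cols) with rows * cols == p, rows <= cols,
    and rows as large as possible."""
    rows = math.isqrt(p)
    while p % rows != 0:
        rows -= 1
    return rows, p // rows
```

Merging a 2×2 and a 2×4 subgrid gives 12 processes and a 3×4 supergrid, but merging a 2×2 and a 5×5 subgrid gives 29 processes, which is prime, so the only exact factorization is the degenerate 1×29 grid.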

Complete Data Parallel Implementation • Assign all processors to each block in the block-tridiagonal matrix Example: assume a 2×2 processor grid, assigned to $B_1, B_2, \dots, B_q$ and $C_1, C_2, \dots, C_{q-1}$. Block-tridiagonal matrix with q diagonal blocks

Advantages and Disadvantages • Advantages • One communicator • One processor grid • Portability to different MPI platforms • Disadvantages • Not all processors are involved in some steps • SVD of off-diagonal blocks • Decomposition of diagonal blocks • Merging of smaller sub-problems • Data redistribution is still needed for each merging operation

Numerical Results • Mixed data/task parallel BD&C subroutine PDSBTDC vs. ScaLAPACK PDSYEVD • Matrices with different eigenvalue distributions and different sizes • Banded application matrix • Complete data parallel BD&C subroutine PDSBTDCD vs. Mixed data/task parallel BD&C subroutine PDSBTDC

Machine Specifications IBM p690 System at ORNL

PDSBTDC vs. PDSYEVD on Matrices with Different Eigenvalue Distributions [Figure panels: arithmetically distributed eigenvalues; geometrically distributed eigenvalues] Tolerance = $10^{-6}$, $b = 20$

Accuracy of PDSBTDC • Residual • Departure from orthogonality
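
Both accuracy measures can be computed directly from a candidate eigendecomposition. The exact scaling used for these measurements is not given here, so the definitions below are assumptions: the residual is taken as $\max_i \|A \hat{v}_i - \hat{\lambda}_i \hat{v}_i\|_2 / \|A\|_F$ and the departure from orthogonality as $\max_i \|(\hat{V}^T \hat{V} - I) e_i\|_2 / n$. The function name `accuracy_metrics` is hypothetical, and the Frobenius norm is used only because it is simple to evaluate.

```python
import math

def accuracy_metrics(A, V, lam):
    """A: n x n symmetric matrix as a list of rows.
    V: matrix whose columns are approximate eigenvectors.
    lam: approximate eigenvalues, lam[j] belonging to column j of V.
    Returns (scaled residual, scaled departure from orthogonality)."""
    n = len(A)
    norm_A = math.sqrt(sum(a * a for row in A for a in row))  # Frobenius norm
    residual = 0.0
    orthogonality = 0.0
    for j in range(n):
        v = [V[i][j] for i in range(n)]
        # Residual vector A v_j - lam_j v_j
        r = [sum(A[i][k] * v[k] for k in range(n)) - lam[j] * v[i]
             for i in range(n)]
        residual = max(residual, math.sqrt(sum(x * x for x in r)))
        # j-th column of V^T V - I
        col = [sum(V[k][i] * V[k][j] for k in range(n)) - (1.0 if i == j else 0.0)
               for i in range(n)]
        orthogonality = max(orthogonality, math.sqrt(sum(x * x for x in col)))
    return residual / norm_A, orthogonality / n
```

For an approximate eigensolver one would expect the residual to track the requested tolerance, while the orthogonality measure should stay near machine precision regardless of the tolerance.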

PDSBTDC on Application Matrix Polyalanine matrix, n = 5027, b = 79 PDSBTDC with different tolerances

Performance Test on UT SInRG AMD Opteron Processor 240 Cluster Performance is similar, and scaling is slightly better.

PDSBTDC vs. PDSBTDCD Performance Block-tridiagonal matrix with arithmetically distributed eigenvalues, matrix size = 12000, block size = 20, tolerance = $10^{-6}$. The data parallel implementation scales poorly in the SVD of off-diagonal blocks and in solving sub-problems.

Application in Electronic Structure Calculation • Trans-Polyacetylene • Simple chemical structure • Semiconducting conjugated polymer • Light emitting devices, flexible • Fast nonlinear optical response • Strong nonlinear susceptibility

Matrix Generated from trans-PA Yihua Bai, Robert C. Ward, and Guoping Zhang, Parallel Divide-and-Conquer Algorithm for Computing Full Spectrum of Polyacetylene, Poster at the Division of Atomic, Molecular and Optical Physics (DAMOP) 2006 meeting, Knoxville, Tennessee.

Two Steps to Compute Approximate Eigen-Solution • Construct block-tridiagonal matrix from the original dense matrix H • M = H + E, where M is block tridiagonal • Algorithm: PBT • Compute eigensolutions to reduced accuracy • User defined accuracy, typically $10^{-6}$ • Algorithm: PBD&C
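
The relation M = H + E can be illustrated with a deliberately simplified stand-in for the first step. The sketch below is not the PBT algorithm: it just zeroes every entry of H outside a block-tridiagonal pattern with a uniform block size `b` (an assumption made for illustration) and reports the Frobenius norm of the discarded part E. The function name `truncate_to_block_tridiagonal` is hypothetical.

```python
import math

def truncate_to_block_tridiagonal(H, b):
    """Return (M, err): M keeps the entries of H that fall inside the
    block-tridiagonal pattern with block size b; err = ||M - H||_F."""
    n = len(H)
    M = [[0.0] * n for _ in range(n)]
    err_sq = 0.0
    for i in range(n):
        for j in range(n):
            if abs(i // b - j // b) <= 1:   # same or neighbouring block
                M[i][j] = H[i][j]
            else:
                err_sq += H[i][j] ** 2      # entry moved into E
    return M, math.sqrt(err_sq)
```

By Weyl's inequality each eigenvalue of M differs from the corresponding eigenvalue of H by at most $\|E\|_2 \le \|E\|_F$, so the reported norm gives a bound on the eigenvalue error introduced before the second step even starts. This is why matrices with strong locality, whose far off-diagonal entries are small, suit the approach.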

Compare Execution Time with ScaLAPACK PDSYEVD trans-(CH)$_{16000}$, $n = 16000$, tolerance = $10^{-6}$. With lower accuracy (i.e., $10^{-6}$), the savings in execution time are an order of magnitude.

Relative Execution Time with Fixed $n^2/p$ With a fixed per-processor problem size, the relative execution time for an $O(n^3)$ algorithm should follow the reference line. The curve for our new parallel algorithm shows a computational complexity between $O(n^2)$ and $O(n^3)$.
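
The reference line follows from simple arithmetic: if the processor count p grows so that $n^2/p$ stays fixed, an $O(n^k)$ algorithm with perfect parallel efficiency takes time proportional to $n^k / p \propto n^{k-2}$. The sketch below encodes that reasoning; the function name `relative_time` is hypothetical and perfect efficiency is an idealizing assumption.

```python
def relative_time(n, n_ref, order):
    """T(n) / T(n_ref) for an O(n**order) algorithm when the processor
    count grows with n**2 and parallel efficiency is perfect."""
    p_ratio = (n / n_ref) ** 2         # processors scale with n^2
    work_ratio = (n / n_ref) ** order  # total work scales with n^order
    return work_ratio / p_ratio
```

Under these assumptions, quadrupling n quadruples the run time of an $O(n^3)$ method and leaves an $O(n^2)$ method unchanged, so a measured curve that rises more slowly than the linear reference line indicates an effective complexity below cubic.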

Conclusion and Future Work

Conclusion • PBD&C: very efficient on block tridiagonal matrices with • Low ranks for off-diagonal blocks • High ratio of deflation • Comparison of PDSBTDC and PDSBTDCD • PDSBTDCD performs better when a smaller number of processors is in use • PDSBTDC scales better as the number of processors in use increases • PBD&C combined with PBT • Efficient on application matrices with specific locality properties

Future Work • Incorporate PBD&C and PBT into SCF for trans-PA • Fine tuning of PDSBTDCD • Alternative method for computation of eigenvectors • Approximation in sparse eigensolver • A Parallel Adaptive Eigensolver

End of Presentation Thank you!

Acknowledgement Dr. R. P. Muller Sandia National Laboratories Dr. G. Zhang Indiana State University

Task Flowchart Major efficiency improvements from • Reduced accuracy in early iterations of SCF • Reducing the reduction bottleneck • Eigenvectors may be required if efforts are made to improve efficiency

Complexity of Major Components • message passing latency • time to transfer one floating point number • time for one floating point operation • $n_b$: block size for parallel 2D matrix distribution
