Presentation Transcript



High-Performance Eigensolver for Real Symmetric Matrices: Parallel Implementations and Applications in Electronic Structure Calculation

Yihua Bai

Department of Mathematics and Computer Science

Indiana State University



Contents

  • Current status of real symmetric eigensolvers

  • Motivation

  • BD&C algorithm – a high performance approximate eigensolver

  • Parallel implementations of BD&C algorithm

  • Applications in electronic structure calculation and numerical results

  • Summary and Future Work



Current Status of Dense Symmetric Eigensolvers

  • PDSYEVD – divide-and-conquer

  • PDSYEVX – bisection and inverse iteration

  • PDSYEVR – MRRR



Classical Three Steps to Decompose A = XΛX^T

  • Reduction to symmetric tridiagonal form

    A = H T H^T

  • Eigen-decomposition of the tridiagonal matrix

    T = V Λ V^T

    • Cuppen’s divide-and-conquer

    • Bisection and inverse iteration

    • Multiple Relatively Robust Representations (MRRR)

  • Back-transformation of the eigenvectors

    X = H V
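These three steps can be illustrated with a short serial sketch; this is a minimal sketch using SciPy's dense routines (not the parallel ScaLAPACK code the talk compares against), relying on the fact that the Hessenberg form of a symmetric matrix is tridiagonal.

```python
# Illustrative serial sketch of the classical three-step symmetric
# eigensolver (not the parallel ScaLAPACK implementation).
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

def classical_eigh(A):
    """Decompose real symmetric A as A = X @ diag(lam) @ X.T."""
    # Step 1: reduction A = H T H^T; for symmetric A the Hessenberg
    # form is tridiagonal (up to roundoff).
    T, H = hessenberg(A, calc_q=True)
    d = np.diag(T)          # main diagonal of T
    e = np.diag(T, k=1)     # off-diagonal of T
    # Step 2: eigendecomposition of the tridiagonal matrix, T = V Lam V^T.
    lam, V = eigh_tridiagonal(d, e)
    # Step 3: back-transformation of the eigenvectors, X = H V.
    return lam, H @ V

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
A = (A + A.T) / 2
lam, X = classical_eigh(A)
print(np.linalg.norm(A @ X - X * lam))  # small residual
```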



Bottleneck of Classical Approaches

  • Reduction time is the bottleneck

(Figure: execution-time breakdown for PDSYEVR and PDSYEVD; the tridiagonal reduction dominates)

Robert C. Ward and Yihua Bai, Performance of Parallel Eigensolvers on Electronic Structure Calculations II, Technical Report UT-CS-06-572, University of Tennessee, August 2006.



Limitation of Classical Approaches

  • Classical solvers compute the eigensolution to full accuracy, while lower accuracy is frequently sufficient in electronic structure calculation

Questions:

Can we trade accuracy for efficiency? If so, how?



Motivation

A high-performance approximate eigensolver for electronic structure calculation



Schrödinger’s Equation: An Intrinsic Eigenvalue Problem

ĤΨ = EΨ: the time-independent Schrödinger equation is itself an eigenvalue problem for the Hamiltonian Ĥ, with the energy levels E as eigenvalues.



Computation of Electronic Structure

  • Solve Schrödinger’s Equation efficiently

  • Different approximation methods

    • Hartree-Fock approximation

    • density functional theory

    • configuration interaction

    • …, etc.

  • Self-Consistent Field method

    • Solves a generalized non-linear real symmetric eigenvalue problem iteratively

    • A standard linear eigenvalue problem is solved in each iteration

    • Typically the most time-consuming part of electronic structure calculation

    • Low accuracy suffices in earlier iterations (see the sketch after this list)

    • Matrices from application problems may have locality properties
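The role the eigensolver plays inside SCF can be seen in a toy fixed-point loop; this is a minimal sketch in which the density-dependent Hamiltonian H(D) = H0 + αD is an invented stand-in, not a real electronic-structure model.

```python
# Toy SCF fixed-point iteration (illustrative only; the coupling
# H(D) = H0 + alpha*D is invented, not a real electronic-structure model).
import numpy as np

rng = np.random.default_rng(1)
n, n_occ, alpha = 100, 10, 0.01
H0 = rng.standard_normal((n, n)); H0 = (H0 + H0.T) / 2

D = np.zeros((n, n))                       # initial density matrix
for it in range(100):
    H = H0 + alpha * D                     # density-dependent Hamiltonian
    # Each SCF iteration solves a *linear* symmetric eigenproblem.
    # An approximate eigensolver with a loose tolerance would suffice in
    # early iterations, since D is far from converged anyway.
    lam, C = np.linalg.eigh(H)
    D_new = C[:, :n_occ] @ C[:, :n_occ].T  # density from lowest n_occ states
    if np.linalg.norm(D_new - D) < 1e-10:  # self-consistency reached
        break
    D = D_new
print(f"converged after {it + 1} iterations")
```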



Problem Definition

Given a real symmetric matrix A and an accuracy tolerance τ, we want to compute

    A ≈ VΛV^T,

where V and Λ contain the approximate eigenvectors and eigenvalues, respectively, and satisfy

    ‖AV − VΛ‖ = O(τ‖A‖).


Block Algorithms for Approximate Eigensolver

1) Block-tridiagonal divide-and-conquer (BD&C) – the centerpiece

2) Block tridiagonalization (BT) – block tridiagonalization of sparse and “effectively” sparse matrices

3) Orthogonal reduction of a full matrix to block-tridiagonal form (OBR) – orthogonal transformations to produce a block-tridiagonal matrix



1) BD&C Algorithm *

Decompose

    M ≈ VΛV^T,

where

    V – numerically orthogonal eigenvector matrix
    Λ – diagonal matrix of eigenvalues
    M – block tridiagonal matrix
    τ – accuracy tolerance
    q – number of blocks

* W. N. Gansterer, R. C. Ward, R. P. Muller and W. A. Goddard III, Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices, SIAM J. Sci. Comput., 25 (2003), pp. 65 – 85.



Three Steps of BD&C

1. Subdivision

    Approximate each off-diagonal block by a lower-rank representation, Bi ≈ Ui Σi Wi^T with rank ri.

2. Solve sub-problems

    Decompose each corrected diagonal block: Mi = Vi Λi Vi^T, for i = 1, …, q.

3. Synthesis – the most time-consuming step

    Decompose the rank-modified merged system to obtain Z and Λ, then multiply Vi and Z to accumulate the eigenvectors.

Complexity: a function of deflation, rank, and size.
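The subdivision step amounts to tolerance-driven truncation of the SVD of each off-diagonal block; a minimal sketch, assuming the retained rank is set by comparing singular values against τ times the matrix norm.

```python
# Sketch of the BD&C subdivision idea: replace an off-diagonal block
# by a lower-rank factorization, keeping only singular values above a
# tolerance-derived cutoff. Illustrative, not the PDSBTDC code.
import numpy as np

def truncate_block(B, tau, norm_M):
    """Return Ui, Si, Wi with B ~= Ui @ diag(Si) @ Wi.T and rank ri."""
    U, s, Wt = np.linalg.svd(B, full_matrices=False)
    ri = max(1, int(np.sum(s > tau * norm_M)))  # retained rank
    return U[:, :ri], s[:ri], Wt[:ri, :].T

rng = np.random.default_rng(2)
B = rng.standard_normal((40, 5)) @ rng.standard_normal((5, 40))  # low rank
B += 1e-8 * rng.standard_normal((40, 40))                        # plus noise
Ui, Si, Wi = truncate_block(B, tau=1e-6, norm_M=np.linalg.norm(B, 2))
print(Si.size, np.linalg.norm(B - Ui @ np.diag(Si) @ Wi.T, 2))
```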



2) Block Tridiagonalization (BT)*

  • An approximation to the original full matrix

  • May require eigenvectors from previous iteration

Complexity:

* Y. Bai, W. N. Gansterer and R. C. Ward, Block-Tridiagonalization of “Effectively” Sparse Symmetric Matrices, ACM Trans. Math. Softw., 30 (2004), pp. 326 – 352.
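The effect of BT can be mimicked by keeping only a block-tridiagonal pattern of the matrix, giving M = A + E; this is a toy sketch with a fixed block size b, whereas the cited algorithm chooses blocks adaptively to control the size of the dropped part E.

```python
# Toy block tridiagonalization: keep only the block-tridiagonal part
# of A for a fixed block size b, so M = A + E. The real BT algorithm
# chooses the blocks adaptively to keep ||E|| below the tolerance.
import numpy as np

def block_tridiagonalize(A, b):
    n = A.shape[0]
    M = np.zeros_like(A)
    for i in range(0, n, b):
        lo, hi = i, min(i + b, n)
        nxt = min(i + 2 * b, n)
        M[lo:hi, lo:nxt] = A[lo:hi, lo:nxt]  # diagonal + super-diagonal block
        M[lo:nxt, lo:hi] = A[lo:nxt, lo:hi]  # diagonal + sub-diagonal block
    return M

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 200)); A = (A + A.T) / 2
# Impose decay away from the diagonal ("effectively sparse" matrix).
A *= np.exp(-0.2 * np.abs(np.subtract.outer(np.arange(200), np.arange(200))))
M = block_tridiagonalize(A, b=20)
print(np.linalg.norm(A - M) / np.linalg.norm(A))  # relative size of E
```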



3) Orthogonal Reduction to Block-Tridiagonal Matrix (OBR) *

  • For a full matrix that cannot be sparsified

  • A sequence of Householder transformations produces the block-tridiagonal form

Complexity:



Complexity of Major Components

α – message passing latency

β – time to transfer one floating point number

γ – time for one floating point operation

ri – ranks of the off-diagonal blocks



Parallel Implementations

  • Parallel block divide-and-conquer (PBD&C) *

  • Preprocessing

    • Parallel block tridiagonalization (PBT)

    • Parallel orthogonal block-tridiagonal reduction (POBR) **

* Yihua Bai and Robert C. Ward, A Parallel Symmetric Block-Tridiagonal Divide-and-Conquer Algorithm, Technical Report UT-CS-06-571, University of Tennessee, December 2005. Submitted to ACM TOMS

** Yihua Bai and Robert C. Ward, Parallel Block Tridiagonalization of Real Symmetric Matrices, Technical Report UT-CS-06-578, University of Tennessee, June 2006. Submitted to ACM TOMS



Implementations of PBD&C

Mixed data/task parallel implementation

versus

complete data parallel implementation



Mixed Parallel Implementation

  • Mixed parallelism – data/task

  • Data distribution and redistribution

  • Merging sequence and workload balance

  • Deflation



Matrix Distribution – Mixed Data/Task Parallelism

  • Divide processors into groups of sub-grids

  • Assign each sub-grid to a sub-problem

(Figure: block-tridiagonal matrix with q diagonal blocks)



Matrix Distribution – Example

(Figure: 2D block-cyclic distribution on each sub-grid; each diagonal block is assigned to a sub-grid)
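For reference, the owner of a global entry under a 2D block-cyclic distribution can be computed directly; a minimal sketch with hypothetical parameters nb (block size) and a Pr × Pc grid, mirroring the usual ScaLAPACK convention with zero source offsets.

```python
# Owner process of global entry (i, j) under a 2D block-cyclic
# distribution on a Pr x Pc grid with block size nb (0-based indices,
# zero row/column source offsets assumed).
def owner(i, j, nb, Pr, Pc):
    return ((i // nb) % Pr, (j // nb) % Pc)

# Example: 8x8 matrix in 2x2 blocks on a 2x2 grid; the blocks cycle
# over the grid in both dimensions.
for i in range(0, 8, 2):
    print([owner(i, j, nb=2, Pr=2, Pc=2) for j in range(0, 8, 2)])
```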



Data Redistribution

  • Redistribute data from one sub-grid to another (subdivision step)

(Figure: redistribution from a 2×2 grid to a 3×3 grid)



Data Redistribution (cont’d)

  • Redistribute data for each merging operation from two sub-grids to one super-grid (synthesis step)

(Figure: redistribution from a 2×2 grid and a 2×4 grid to a 3×4 grid)


Merging Sequence

(Figure: merging tree with levels 0–4, showing idle time and the subtree heights h_left and h_right)

The final merging operation accounts for up to 75% of the total computational cost, so the merging sequence must consider low computational complexity and workload balance at the same time for the final merge.



Problems

  • Sub-grid construction

    • Example: sub-grid 1 is 2×2 and sub-grid 2 is 5×5; the merged super-grid is 1×29?

  • Many communicator handles (see the mpi4py sketch after this list)

    • Can use up to 2k handles, where k = max(number of diagonal blocks, total number of processors)

  • Portability across different MPI implementations

    • Example: minor code modifications are needed when using MPI-MX (Myrinet MPI)
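How the handle count grows can be sketched with mpi4py's communicator splitting; this is an illustrative sketch, with the mapping color = rank % q an invented placeholder rather than the PDSBTDC assignment.

```python
# Sketch of per-sub-problem communicator creation with mpi4py.
# The mapping color = rank % q is an invented placeholder, not the
# PDSBTDC assignment. Run under MPI: mpiexec -n 8 python subgrids.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

q = 4                                   # number of diagonal blocks (toy value)
color = rank % q                        # sub-problem this process works on
subcomm = comm.Split(color=color, key=rank)
print(f"rank {rank}: sub-grid {color}, local rank {subcomm.Get_rank()}")
# Every merge level needs additional super-grid communicators, which is
# why the handle count can approach 2k, k = max(#blocks, #processors).
subcomm.Free()
```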



Complete Data Parallel Implementation

  • Assign all processors to each block in block-tridiagonal matrix

Assume a 2×2 processor grid, assigned in turn to B1, B2, …, Bq and C1, C2, …, Cq−1.

(Figure: block-tridiagonal matrix with q diagonal blocks)



Advantages and Disadvantages

  • Advantages

    • One communicator

    • One processor grid

    • Portability across different MPI platforms

  • Disadvantages

    • Not all processors involved in some steps

      • SVD of off-diagonal blocks

      • Decomposition of diagonal blocks

      • Merge smaller sub-problems

    • Still need data redistribution for each merging operation



Numerical Results

  • Mixed data/task parallel BD&C subroutine PDSBTDC vs. ScaLAPACK PDSYEVD

    • Matrices with different eigenvalue distributions and different sizes

    • Banded application matrix

  • Complete data parallel BD&C subroutine PDSBTDCD vs. Mixed data/task parallel BD&C subroutine PDSBTDC



Machine Specifications: IBM p690 System at ORNL



PDSBTDC vs. PDSYEVD on Matrices with Different Eigenvalue Distributions

(Figures: execution times for matrices with arithmetically distributed eigenvalues and with geometrically distributed eigenvalues; τ = 10⁻⁶, b = 20)



Accuracy of PDSBTDC

Residual: ‖AV − VΛ‖

Departure from orthogonality: ‖V^T V − I‖
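Both metrics can be evaluated directly from a computed eigenpair set; a small sketch, where the normalization by n‖A‖ is an assumption for illustration rather than the definition used in the talk.

```python
# Accuracy metrics for an approximate eigendecomposition A ~= V L V^T.
# The normalization by n*||A|| (and by n) is an assumption here.
import numpy as np

def accuracy_metrics(A, lam, V):
    n = A.shape[0]
    residual = np.linalg.norm(A @ V - V * lam, 2) / (n * np.linalg.norm(A, 2))
    departure = np.linalg.norm(V.T @ V - np.eye(n), 2) / n
    return residual, departure

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 100)); A = (A + A.T) / 2
lam, V = np.linalg.eigh(A)
print(accuracy_metrics(A, lam, V))
```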



PDSBTDC on Application Matrix

(Figure: PDSBTDC with different tolerances on a polyalanine matrix, n = 5027, b = 79)



Performance Test on UT SInRG AMD Opteron Processor 240 Cluster

Similar performance, scaling a little better.



PDSBTDC vs. PDSBTDCD Performance

Block-tridiagonal matrix with arithmetically distributed eigenvalues; matrix size = 12000, block size = 20, tolerance = 10⁻⁶.

The data parallel implementation scales down in the SVD of the off-diagonal blocks and in solving the sub-problems.



Application in Electronic Structure Calculation

  • Trans-Polyacetylene

  • Simple chemical structure

  • Semiconducting conjugated polymer

  • Light-emitting devices, flexible

  • Fast nonlinear optical response

  • Strong nonlinear susceptibility



Matrix Generated from trans-PA

Yihua Bai, Robert C. Ward, and Guoping Zhang, Parallel Divide-and-Conquer Algorithm for Computing Full Spectrum of Polyacetylene, Poster at the Division of Atomic, Molecular and Optical Physics (DAMOP) 2006 meeting, Knoxville, Tennessee.



Two Steps to Compute Approximate Eigen-Solution

  • Construct block-tridiagonal matrix from the original dense matrix H

    • M = H + E, where M is block tridiagonal

    • Algorithm: PBT

  • Compute eigensolutions to reduced accuracy

    • User-defined accuracy, typically 10⁻⁶

    • Algorithm: PBD&C



Compare Execution Time with ScaLAPACK PDSYEVD

Trans-(CH)₁₆₀₀₀: n = 16000, τ = 10⁻⁶.

With lower accuracy (i.e., 10⁻⁶), execution time is reduced by an order of magnitude.


Relative Execution Time with Fixed n²/p

With fixed per-processor problem size n²/p, the relative execution time of an O(n³) algorithm should grow linearly in n, as the reference line shows (see the derivation below). The curve for our new parallel algorithm shows a computational complexity between O(n²) and O(n³).
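The slope of the reference line follows from a one-line scaling argument; a sketch of the arithmetic, with (n₀, p₀) denoting the baseline problem size and processor count.

```latex
% Relative execution time at fixed n^2/p for an O(n^3) algorithm.
% Assume t(n,p) = c n^3 / p and keep n^2/p fixed, so p/p_0 = (n/n_0)^2:
\[
\frac{t(n,p)}{t(n_0,p_0)}
  = \frac{n^3/p}{n_0^3/p_0}
  = \left(\frac{n}{n_0}\right)^{3}\frac{p_0}{p}
  = \left(\frac{n}{n_0}\right)^{3}\left(\frac{n_0}{n}\right)^{2}
  = \frac{n}{n_0}.
\]
% Hence the reference line grows linearly in n, while the measured curve
% for the new algorithm lies between the O(n^2) and O(n^3) trends.
```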



Conclusion and Future Work



Conclusion

  • PBD&C: very efficient on block tridiagonal matrices with

    • Low ranks for off-diagonal blocks

    • High ratio of deflation

  • Comparison of PDSBTDC and PDSBTDCD

    • PDSBTDCD performs better when a smaller number of processors is in use

    • PDSBTDC scales better as the number of processors increases

  • PBD&C combined with PBT

    • Efficient on application matrices with specific locality property



Future Work

 Incorporate PBD&C and PBT into SCF for trans-PA

 Fine tuning of PDSBTDCD

 Alternative method for computation of eigenvectors

 Approximation in sparse eigensolver

 A Parallel Adaptive Eigensolver



End of Presentation

Thank you!



Acknowledgement

Dr. R. P. Muller, Sandia National Laboratories

Dr. G. Zhang, Indiana State University



Task Flowchart

(Figure: task flowchart)

Major efficiency improvements come from:

  • Reduced accuracy in early iterations of SCF

  • Reducing the reduction bottleneck

  • Eigenvectors may be required if efforts are made to improve efficiency



Complexity of Major Components

α – message passing latency

β – time to transfer one floating point number

γ – time for one floating point operation

nb – block size for the parallel 2D matrix distribution

