Enhancing LAPACK and ScaLAPACK: Meeting the Future Challenges

The Future of LAPACK and ScaLAPACKwww.netlib.org/lapack-dev Jim Demmel UC Berkeley 27 March 2006

Outline • Motivation for new Sca/LAPACK • Challenges (or research opportunities…) • Goals of new ScaLAPACK • Highlights of progress

Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >56M web hits @ Netlib (incl. CLAPACK, LAPACK95)

Impact (with NERSC, LBNL) Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000). ScaLAPACK

Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >56M web hits @ Netlib (incl. CLAPACK, LAPACK95) • Many ways to improve them, based on • Own algorithmic research • Enthusiastic participation of research community • User/vendor survey • Opportunities and demands of new architectures, programming languages • New releases planned (NSF support)

Participants • UC Berkeley: • Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads… • U Tennessee, Knoxville • Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov • Other Academic Institutions • UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara • TU Berlin, FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb • Research Institutions • CERFACS, LBL • Industrial Partners • Cray, HP, Intel, MathWorks, NAG, SGI

Challenges • For all large scale computing, not just linear algebra! • Example …

Parallelism in the Top500

Challenges • For all large scale computing, not just linear algebra! • Example … your laptop

CPU Trends • Relative processing power will continue to double every 18 months • 256 logical processors per chip in late 2010

Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency

Commodity Processor Trends Will our algorithms run at a high fraction of peak? Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Floating point time << 1/Network BW << Network Latency

Parallel Processor Trends Will our algorithms scale up to more processors? Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Floating point time << 1/Network BW << Network Latency • Heterogeneity (performance and semantics) • Asynchrony • Unreliability

What do users want? • High performance, ease of use, … • Survey results at www.netlib.org/lapack-dev • Small but interesting sample • What matrix sizes do you care about? • 1000s: 34% • 10,000s: 26% • 100,000s or 1Ms: 26% • How many processors, on distributed memory? • >10: 34%, >100: 31%, >1000: 19% • Do you use more than double precision? • Sometimes or frequently: 16% • Would Automatic Memory Allocation help? • Very useful: 72%, Not useful: 14%

Goals of next Sca/LAPACK • Better algorithms • Faster, more accurate • Expand contents • More functions, more parallel implementations • Automate performance tuning • Improve ease of use • Better software engineering • Increased community involvement

Goal 1: Better Algorithms • Faster • But provide “usual” accuracy, stability • More accurate • But provide “usual” speed • Or at any cost

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana, Bischof / Lang , Howell / Fulton • Recursive blocked layouts for packed formats: • Gustavson / Kågström / Elmroth / Jonsson/

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Faster and more accurate than previous algorithms • New sequential, first parallel versions out in 2006

Timing of Eigensolvers(1.2 GHz Athlon, only matrices where time > .1 sec)

Timing of Eigensolvers(only matrices where time > .1 sec)

Accuracy Results (old vs new Grail) || QQT – I || / (n e ) maxi ||Tqi – li qi || / ( n e )

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Faster and more accurate than previous algorithms • New sequential, first parallel versions out in 2006 • Numerical evidence shows DC faster if It “deflates” often, which is hard to predict in advance. So having both algorithms is important.

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • SIAM SIAG/LA Prize in 2003 • Sequential version out in 2006 • More on performance later

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner • LAPACK Working Note (LAWN) #173 • On 26 real test matrices, speedups up to 11.9x, 4.4x average

Comparison of ScaLAPACK QR and new parallel multishift QZ Execution times in secsfor 4096 x 4096 random problems Ax = sx and Ax = sBx, using processor grids including 1-16 processors. Note: work(QZ) > 2 * work(QR) but Time(// QZ) << Time (//QR)!! Times include cost for computing eigenvalues and transformation matrices. Adlerborn-Kågström-Kressner, SIAM PP’2006

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana, Howell / Fulton, Bischof / Lang • Full nonsymmetric eigenproblem: n=1500: 3.43x faster • HQR: 5x faster, Reduction: 14% faster • Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster • Sequential versions out in 2006

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana, Howell / Fulton, Bischof / Lang • Recursive blocked layouts for packed formats: • Gustavson / Kågström / Elmroth / Jonsson/ • SIAM Review Article 2004

Recursive Layouts and Algorithms Still merges multiple elimination steps into a few BLAS 3 operations Best speedups for packed storage of symmetric matrices

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n2) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Arbitrary precision versions of everything • Using your favorite multiple precision package

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n2) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN#165)

With extra precise iterative refinement More Accurate: Solve Ax=b Conventional Gaussian Elimination 1/e e e = n1/22-24

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n2) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN#165) • “Guarantees” based on condition number estimates • Condition estimate < 1/ e  reliable answer and tiny error bounds • No bad bounds in 6.2M tests • Can condition estimators lie?

Can condition estimators lie? • Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization • Demmel/Diament/Malajovich (FCM2001) • But what if matrix multiply costs O(n2)? • More later

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n2) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN#165) • “Guarantees” based on condition number estimates • Get tiny componentwise bounds too • Each xi accurate • Slightly different condition number • Extends to Least Squares • Release in 2006

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Promise the right answer for O(n2) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • LAWNS # 169, 170 • Can be arbitrarily more accurate on tiny singular values • Yet faster than QR iteration!

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Promise the right answer for O(n2) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Arbitrary precision versions of everything • Using your favorite multiple precision package • Quad, Quad-double, ARPREC, MPFR, … • Using Fortran 95 modules

Iterative Refinement: for speed • What if double precision much slower than single? • Cell processor in Playstation 3 • 256 GFlops single, 25 GFlops double • Pentium SSE2: single twice as fast as double • Given Ax=b in double precision • Factor in single, do refinement in double • If k(A) < 1/esingle, runs at speed of single • 1.9x speedup on Intel-based laptop • Applies to many algorithm, if difference large

Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible

Missing Drivers in Sca/LAPACK

More missing drivers

Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible • New functions (highlights) • Updating / downdating of factorizations: • Stewart, Langou • More generalized SVDs: • Bai , Wang

New GSVD Algorithm Given m x n A and p x n B, factor A = U ∑a X and B = V ∑b X Bai et al, UC Davis PSVD, CSD on the way

Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible • New functions (highlights) • Updating / downdating of factorizations: • Stewart, Langou • More generalized SVDs: • Bai , Wang • More generalized Sylvester/Lyapunov eqns: • Kågström, Jonsson, Granat • Structured eigenproblems • O(n2) version of roots(p) • Gu, Chandrasekaran, Bindel et al • Selected matrix polynomials: • Mehrmann • How should we prioritize missing functions?

C(p)= -p1 -p2 … -pd 1 0 … 0 0 1 … 0 … … … … 0 … 1 0 New algorithm for roots(p) • To find roots of polynomial p • Roots(p) does eig(C(p)) • Costs O(n3), stable, reliable • O(n2) Alternatives • Newton, Jenkins-Traub, Laguerre, … • Stable? Reliable? • New: Exploit “semiseparable” structure of C(p) • Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration • Complexity drops from O(n3) to O(n2), stable in practice • Related work: Gemignani, Bini, Pan, et al • Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel

Goal 3 – Automate Performance Tuning • Widely used in performance tuning of Kernels • ATLAS (PhiPAC) – BLAS - www.netlib.org/atlas • FFTW – Fast Fourier Transform – www.fftw.org • Spiral – signal processing - www.spiral.net • OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski

Optimizing blocksizes for mat-mul Finding a Needle in a Haystack – So Automate

Enhancing LAPACK and ScaLAPACK: Meeting the Future Challenges