The Future of LAPACK and ScaLAPACK netlib/lapack-dev

The Future of LAPACK and ScaLAPACKwww.netlib.org/lapack-dev Jim Demmel UC Berkeley 23 Feb 2007

Outline • Motivation for new Sca/LAPACK • Challenges (or research opportunities…) • Goals of new Sca/LAPACK • Highlights of progress • With some excursions …

Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >68M web hits @ Netlib (incl. CLAPACK, LAPACK95) • 35K hits/day • Many ways to improve them, based on • Own algorithmic research • Enthusiastic participation of research community • User/vendor survey • Opportunities and demands of new architectures, programming languages • New releases planned (NSF support)

Participants • UC Berkeley: • Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, undergrads… • U Tennessee, Knoxville • Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak • Other Academic Institutions • UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara • TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb • Research Institutions • CERFACS, LBL • Industrial Partners • Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, SGI

Challenges: Increasing Parallelism In the Top500: In your Laptop (Intel just announced an 80-core, 1 Teraflop chip)

Challenges • Increasing parallelism • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Improving 59%/year vs 23%/year vs 5.5%/year • Floating point time << 1/Network BW << Network Latency • Improving 59%/year vs 26%/year vs 15%/year • Heterogeneity (performance and semantics) • Asynchrony • Unreliability

What do users want? • High performance, ease of use, … • Survey results at www.netlib.org/lapack-dev • Small but interesting sample • What matrix sizes do you care about? • 1000s: 34% • 10,000s: 26% • 100,000s or 1Ms: 26% • How many processors, on distributed memory? • >10: 34%, >100: 31%, >1000: 19% • Do you use more than double precision? • Sometimes or frequently: 16% • Would Automatic Memory Allocation help? • Very useful: 72%, Not useful: 14%

Goals of next Sca/LAPACK • Better algorithms • Faster, more accurate • Expand contents • More functions, more parallel implementations • Automate performance tuning • Improve ease of use • Better software engineering • Increased community involvement

Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible (possible class projects)

Missing Routines in Sca/LAPACK

More missing routines

Goal 1: Better Algorithms • Faster • But provide “usual” accuracy, stability • … Or accurate for an important subclass • More accurate • But provide “usual” speed • … Or at any cost

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana, Howell / Fulton, Bischof / Lang • Extensions to QZ: • Kågström / Kressner • Recursive blocked layouts for packed formats: • Gustavson / Kågström / Elmroth / Jonsson/

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Faster and more accurate than previous algorithms • SIAM SIAG/LA Prize in 2006 • New sequential, first parallel versions out in 2006

Flop Counts of Eigensolvers(2.2 GHz Opteron + ACML)

Flop Count Ratios of Eigensolvers(2.2 GHz Opteron + ACML)

Run Time Ratios of Eigensolvers(2.2 GHz Opteron + ACML)

MFlop Rates of Eigensolvers(2.2 GHz Opteron + ACML)

Exploiting GPUs • Numerous emerging co-processors • Cell, SSE, Grape, GPU, “physics coprocessor,” … • When can we exploit them? • LIttle help if memory is bottleneck • Various attempts to use GPUs for dense linear algebra • Bisection on GPUs for symmetric tridiagonal eigenproblem • Evaluate Count(x) = #(evals < x) for many x • Very little memory traffic • Speedups up to 100x (Volkov) • 43 Gflops on ATI Radeon X1900 vs running on 2.8 GHz Pentium 4 • Overall eigenvalue solver 6.8x faster

Parallel Runtimes of Eigensolvers(2.4 GHz Xeon Cluster + Ethernet)

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n2) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Arbitrary precision versions of everything • Using your favorite multiple precision package

Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n2) additional cost • Iterative refinement with extra-precise residuals • Newton’s method applied to solving f(x) = A*x-b = 0 • Extra-precise BLAS needed (LAWN#165)

With extra precise iterative refinement More Accurate: Solve Ax=b Conventional Gaussian Elimination 1/e e e = n1/22-24

Iterative Refinement: for speed • What if double precision much slower than single? • Cell processor in Playstation 3 • 256 GFlops single, 25 GFlops double • Pentium SSE2: single twice as fast as double • Given Ax=b in double precision • Factor in single, do refinement in double • If k(A) < 1/esingle, runs at speed of single • 1.9x speedup on Intel-based laptop • Applies to many algorithms, if difference large

Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible • New functions (highlights) • Updating / downdating of factorizations: • Stewart, Langou • More generalized SVDs: • Bai , Wang • More generalized Sylvester/Lyapunov eqns: • Kågström, Jonsson, Granat • Structured eigenproblems • O(n2) version of roots(p) • Gu, Chandrasekaran, Bindel et al • Selected matrix polynomials: • Mehrmann

C(p)= -p1 -p2 … -pd 1 0 … 0 0 1 … 0 … … … … 0 … 1 0 New algorithm for roots(p) • To find roots of polynomial p • Roots(p) does eig(C(p)) • Costs O(n3), stable, reliable • O(n2) Alternatives • Newton, Jenkins-Traub, Laguerre, … • Stable? Reliable? • New: Exploit “semiseparable” structure of C(p) • Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration • Complexity drops from O(n3) to O(n2), stable in practice • Related work: Gemignani, Bini, Pan, et al • Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel

Goal 3 – Automate Performance Tuning • Widely used in performance tuning of Kernels • ATLAS (PhiPAC) – BLAS - www.netlib.org/atlas • FFTW – Fast Fourier Transform – www.fftw.org • Spiral – signal processing - www.spiral.net • OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski

Optimizing blocksizes for mat-mul Finding a Needle in a Haystack – So Automate

Goal 3 – Automate Performance Tuning • Widely used in performance tuning of Kernels • 1300 calls to ILAENV() to get block sizes, etc. • Never been systematically tuned • Extend automatic tuning techniques of ATLAS, etc. to these other parameters • Automation important as architectures evolve • Convert ScaLAPACK data layouts on the fly • Important for ease-of-use too

ScaLAPACK Data Layouts 1D Cyclic 1D Block 2D Block Cyclic 1D Block Cyclic

Speedups for using 2D processor grid range from 2x to 8x Cost of redistributing from 1D to best 2D layout 1% - 10% Times obtained on: 60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect 2GB Memory

Conclusions • Lots to do in Dense Linear Algebra • New numerical algorithms • Continuing architectural challenges • Parallelism, performance tuning • Ease of use, software engineering • Grant support, but success depends on contributions from community • www.netlib.org/lapack-dev • www.cs.berkeley.edu/~demmel

Extra Slides

Fast Matrix Multiplication (1) (Cohn, Kleinberg, Szegedy, Umans) • Can think of fast convolution of polynomials p, q as • Map p (q) into group algebra Si pi ziC[G] of cyclic group G = { zi } • Multiply elements of C[G] (use divide&conquer = FFT) • Extract coefficients • For matrix multiply, need non-abelian group satisfying triple product property • There are subsets X, Y, Z of G where xyz = 1 with x  X, y  Y, z Z  x = y = z = 1 • Map matrix A into group algebra via SxyAxy x-1y, B into Sy’zBy’z y’-1z. • Since x-1y y’-1z = x-1z iff y = y’ we get SyAxy Byz = (AB)xz • Search for fast algorithms reduced to search for groups with certain properties • Fastest algorithm so far is O(n2.38), same as Coppersmith/Winograd

Fast Matrix Multiplication (2)(Cohn, Kleinberg, Szegedy, Umans) • Embed A, B in group algebra (exact) • Perform FFT (roundoff) • Reorganize results into new matrices (exact) • Multiple new matrices recursively (roundoff) • Reorganize results into new matrices (exact) • Perform IFFT (roundoff) • Extract C = AB from group algebra (exact)

Fast Matrix Multiplication (3)(Demmel, Dumitriu, Holtz, Kleinberg) • Thm 1: Any algorithm of this class for C = AB is “numerically stable” • || Ccomp - C || <= c • nd•e• || A || • || B || + O(e2) • c and d are “modest” constants • Like Strassen • Let  be the exponent of matrix multiplication, i.e. no algorithm is faster than O(n). • Thm 2: For all >0 there exists an algorithm with complexity O(n+) that is numerically stable in the sense of Thm 1.

Commodity Processor Trends Will our algorithms run at a high fraction of peak? Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Floating point time << 1/Network BW << Network Latency

Parallel Processor Trends Will our algorithms scale up to more processors? Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Faster and more accurate than previous algorithms • New sequential, first parallel versions out in 2006

Timing of Eigensolvers(1.2 GHz Athlon, only matrices where time > .1 sec)

Timing of Eigensolvers(only matrices where time > .1 sec)

Accuracy Results (old vs new MRRR) || QQT – I || / (n e ) maxi ||Tqi – li qi || / ( n e )

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana, Howell / Fulton, Bischof / Lang • Full nonsymmetric eigenproblem: n=1500: 3.43x faster • HQR: 5x faster, Reduction: 14% faster • Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster • Sequential versions out in 2006

Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: • Byers / Mathias / Braman • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana, Howell / Fulton, Bischof / Lang • Extensions to QZ: • Kågström / Kressner • LAPACK Working Note (LAWN) #173 • On 26 real test matrices, speedups up to 11.9x, 4.4x average

The Future of LAPACK and ScaLAPACK netlib/lapack-dev