James Demmel Computer Science and Mathematics Departments UC Berkeley SciDAC 2005

James Demmel Computer Science and Mathematics Departments UC Berkeley SciDAC 2005 The Future of Numerical Linear Algebra Libraries:Automatic Tuning of Sparse Matrix KernelsThe Next LAPACK and ScaLAPACK

Richard Vuduc (LLNL), James Demmel, Katherine Yelick Hormozd Gahvari, Mark Hoemmen, Ankit Jain, Ben Lee, Scott Lindeneau, Rajesh Nishtala, Wei Tu Berkeley Benchmarking and OPtimization (BeBOP) Project bebop.cs.berkeley.edu EECS Department, University of California, Berkeley OSKI: Optimized Sparse Kernel Interface

Motivation for Automatic Performance Tuning • Sparse matrix operations consume most solver time • Sparse matrix-vector multiply (SpMV) • SpMV: runs at 10% of machine peak or less • Improving SpMV’s performance is hard • Performance depends on machine, kernel, matrix • Matrix known at run-time • Best data structure + implementation can be surprising • Tuning becoming more difficult over time • Our approach: empirical modeling and search • Up to 4x speedups and 31% of peak for SpMV • Several other kernels: triangular solve, ATA*x, Ak*x • Proof-of-concept: Demonstrate with Omega3P, T3P • OSKI Library released, will be integrated into PETSc

8x8 dense substructure Motivation: Tuning is difficult • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem

Mflop/s Best: 4x2 Reference Mflop/s Speedups on Itanium 2: The Need for Search

SpMV Performance—raefsky3

Symmetric peak = 612 MFlops Opteron (1.4GHz, 2.8GFlop peak) Nonsymmetric peak = 504 MFlops Beats ATLAS DGEMV’s 365 Mflops

Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill-in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Flop rate 2.25x higher

OSKI: Optimized Sparse Kernel Interface • Sparse kernels tuned for user’s matrix & machine • Hides complexity of run-time tuning • Low-level BLAS-style functionality • Sparse matrix-vector multiply (SpMV), triangular solve (TrSV), … • Includes fast locality-aware kernels: ATAx, … • Initial target: cache-based superscalar uniprocessors • For “advanced” users & solver library writers • Available as stand-alone open-source library (BSD license) • PETSc extension in progress • Written in C (can call from Fortran)

Optimizations Available in the Initial Release • Optimizations for SpMV (bold heuristics) • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 3x over CSR • Multiple vectors (SpMM): 7x over CSR • And combinations… • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • AAT*x, ATA*x: 4x over CSR, 1.8x over RB • A*x: 2x over CSR, 1.5x over RB

How OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) 1. Build for Target Arch. 2. Benchmark Workload from program monitoring History Matrix Generated code variants Benchmark data 1. Evaluate Models Heuristic models 2. Select Data Struct. & Code To user: Matrix handle for kernel calls

Cost of Tuning • Non-trivial run-time tuning cost: up to ~40 mat-vecs • Dominated by conversion time • User calls “tune” routine explicitly • Exposes cost • Tuning time limited using estimated workload • Provided by user or inferred by library • User may save tuning results • To apply on future runs with similar matrix • Stored in “human-readable” format

Features • Explicit Hints • Can suggest particular tuning technique • Implicit Tuning: Ask library to infer workload • Library profiles all kernel calls • May periodically re-tune • Scripting language for selecting customized transformations • Mechanism to save/restore transformations • “Plug-in” extensibility • Very advanced users may customize library (at run-time)

Example applications • T3P – Accelerator Design – Ko • Register blocking, Symmetric Storage, Multiple vector • 1.68x faster on Itanium 2 for one vector • 4.4x faster for 8 vectors • Omega3P – Accelerator Design – Ko • Register blocking, Symmetric storage, Reordering • 2.1x faster on Power4 • Semiconductor Industry: • 1.9x speedup over SPOOLES in CG at design firm

Current and Future Work (1) • Release 1.0 and docs available at bebop.cs.berkeley.edu/oski • Comments on interface welcome! • Future work • PETSc integration • Port to additional architectures • Vectors • SMPs • Distributed memory • Additional tuning heuristics

Current and Future Work (2) • Incorporation into HPCS Benchmark • Evaluate platforms based on tuned SpMV performance • Tuning higher level algorithms and kernels like [Ax,A2x,…,Akx] • Models indicate large speedups possible • Tuning collective communication routines • 2x speedups in NAS FT using UPC/GASNet

Awards • Best Paper, Intern. Conf. Parallel Processing, 2004 • “Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply” • Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries,2003 • Best Student Presentation too, to Richard Vuduc • “Automatic performance tuning and analysis of sparse triangular solve” • Finalist, Best Student Paper, Supercomputing 2002 • To Richard Vuduc • “Performance Optimization and Bounds for Sparse Matrix-vector Multiply” • Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000 • To Richard Vuduc • “Statistical Modeling of Feedback Data in an Automatic Tuning System”

The Future of LAPACK and ScaLAPACKwww.netlib.org/lapack-dev Jim Demmel - UC Berkeley Jack Dongarra – U Tennessee Knoxville

Outline • Motivation • Participants • Goals • Better numerics (faster and more accurate algorithms) • Expand contents (more functions, more parallel implementations) • Improve ease of use • Better software engineering • Automate performance tuning • Better maintenance and support • Increase community involvement • Questions for the audience

Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >50M web hits @ Netlib (incl. CLAPACK, LAPACK95) • Many ways to improve them, based on • Own algorithmic research • Enthusiastic participation of research community • On-going user/vendor survey (url below) • Opportunities and demands of new architectures, programming languages • DOE SciDAC needs • New releases planned (NSF support) • Your feedback desired • www.netlib.org/lapack-dev

Success Stories (with NERSC, LBNL) >250 users of ScaLAPACK and LAPACK at NERSC, 2005 ScaLAPACK Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code

Participants • UC Berkeley: • Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads… • U Tennessee, Knoxville • Jack Dongarra, Victor Eijkhout, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov • Other Academic Institutions • UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara • TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb • Research Institutions • CERFACS, LBL • Industrial Partners • Cray, HP, Intel, MathWorks, NAG, SGI

Goal 1 – Better Numerics • Fastest algorithm providing “standard” backward stability • MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: Byers / Mathias / Braman • Extensions to QZ: Kågström / Kressner • Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang , Howell / Fulton • Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson/

Goal 1 – Better Numerics • Fastest algorithm providing “standard” backward stability • MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: Byers / Mathias / Braman • Extensions to QZ: Kågström / Kressner • Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang , Howell / Fulton • Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson/ • New: Most accurate algorithm providing “standard” speed • Iterative refinement for Ax=b, least squares • Assume availability of Extra Precise BLAS (Li/Hida/…) • www.netlib.org/blas/blast-forum/ • Retirement of QR-based SVD: Drmac/Veselic

Goal 1 – Better Numerics • Fastest algorithm providing “standard” backward stability • MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems • Up to 10x faster HQR: Byers / Mathias / Braman • Extensions to QZ: Kågström / Kressner • Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang , Howell / Fulton • Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson/ • New: Most accurate algorithm providing “standard” speed • Iterative refinement for Ax=b, least squares • Assume availability of Extra Precise BLAS (Li/Hida/…) • www.netlib.org/blas/blast-forum • Retirement of QR - based SVD: Drmac/Veselic • What is not fast or accurate enough?

Perhaps Automate? What goes into Sca/LAPACK? For all linear algebra problems For all mathematical structures For all data structures and types For all programming models (SW & HW) Produce a (stable, tuned) algorithm (including condition estimates, etc) Need to prioritize

Goal 2 – Expanded Content • Ideal: Make content of ScaLAPACK mirror LAPACK as much as possible • Add New functions (examples) • Updating / downdating of factorizations: Stewart, Langou • More generalized SVDs: Bai , Wang • More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat • Structured eigenproblems • O(n2) version of roots(p) – Gu, Chandrasekaran, Zhu et al • Selected matrix polynomials: Mehrmann, Bai • How should we prioritize missing functions?

Goal 3: Improved Ease of Use (1) • Which interface do you prefer? A \ B CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO) • It depends on who you ask.

Goal 3: Improved Ease of Use (2) • Support both experts and “typical” users (more of latter!) • Easy interfaces vs access to details & performance • No universal agreement on “easiest interface” • Leave decision to higher level packages • Expert users want access to details • Higher performance, own storage management, numerical issues • “Typical” users want simplicity, even if lower performance • Tentative decision (depends on community feedback) • Keep simple driver / expert driver / computational routines • Coordinate with higher level tool builders for usability • PETSc, Matlab, … • Add matrix redistribution routines to ScaLAPACK to handle general input layouts without performance penalties

Goal 4: Better Software Engineering (1) • Ideal • Express all algorithms in “high level form” • Generate all implementations (different HW & SW) automatically • Automatic tuning too • Reality • Need good tools now to produce useful code in 3-4 years • No fully adequate tools exist (open research question) • Telescoping languages, FLAME, Bernoulli, …

Goal 4: Better Software Engineering (2) • Tentative Proposal for new LAPACK • F95 core using subset of F95 features • “Friendly” wrappers in multiple languages • Use Babel, CCA • How much performance would we lose by relegating all SMP / Multicore parallelism to BLAS? • Need experimental data

Goal 4: Better Software Engineering (3) • ScaLAPACK is more demanding • LAPACK on large and changing set of architectures and communication networks • Large software design space • Languages: F95, CAF, UPC, … • Communication substrates: MPI (many flavors), GASNet, … • Programming styles: BSP, aggressive overlap of communication & computation, … • Many architectures (should track hardware) • Sequential, multicore, SMP, Distributed memory • As function of algorithm, n, p, architecture, software • What is tradeoff between performance and programming effort? • Collecting data to decide what to do • How should we tradeoff performance and SWE? • Some users value performance more than others • Higher performance may mean fewer routines produced

Goal 5 – Automate Performance Tuning • Not just BLAS • Many calls to ILAENV() to get block sizes, etc. • Not systematically tuned for new architectures • Extend automatic tuning techniques of ATLAS, etc. to these other parameters • Automation important as architectures evolve

Conclusions and Questions • Lots to do in Dense Linear Algebra • New numerical algorithms • Continuing challenges • Parallelism, performance tuning, software engineering • Questions • What (new) functions are most important to include in Sca/LAPACK? • How should we tradeoff performance / accuracy / ease of SWE?

The End bebop.cs.berkeley.edu www.netlib.org/lapack-dev

James Demmel Computer Science and Mathematics Departments UC Berkeley SciDAC 2005