BeBOP: Automatic Performance Tuning of Sparse Matrix Kernels
Richard Vuduc, James Demmel, Katherine Yelick, Yozo Hida, Michael deLorimier, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee
Berkeley Benchmarking and OPtimization Group
www.cs.berkeley.edu/~richie/bebop
Context
The performance of many applications is dominated by a few computational kernels.
Extensions to New Kernels
Preliminary results for symmetric A, AᵀA, and triangular solve.
Integrating with applications, new kernels, further automation, and understanding architectural implications.
An Approach to Automatic Tuning
For each kernel, identify and generate a space of implementations, and search for the best one.
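As a concrete illustration of the search step, the sketch below times a set of candidate implementations and keeps the fastest. The candidate table and timing routine are hypothetical stand-ins for the actual code generators and benchmarking harness.

```c
/* Hypothetical sketch: time each candidate implementation of a kernel
 * and keep the fastest one.  The candidate table and timing routine
 * are illustrative, not part of the SPARSITY/BeBOP code. */
#include <float.h>
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(const void *args);

static double time_kernel(kernel_fn f, const void *args, int trials)
{
    clock_t t0 = clock();
    for (int i = 0; i < trials; i++)
        f(args);
    return (double)(clock() - t0) / CLOCKS_PER_SEC / trials;
}

kernel_fn pick_best(kernel_fn *candidates, size_t n,
                    const void *args, int trials)
{
    kernel_fn best = NULL;
    double best_t = DBL_MAX;
    for (size_t i = 0; i < n; i++) {
        double t = time_kernel(candidates[i], args, trials);
        if (t < best_t) { best_t = t; best = candidates[i]; }
    }
    return best;
}
```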
Tuning Sparse Matrix-Vector Multiply
The SPARSITY system (Im & Yelick, 1999) applies the methodology to y=Ax, where A is sparse.
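For reference, a minimal untuned y = Ax in compressed sparse row (CSR) format, the kind of baseline the tuned kernels are compared against (a sketch, not the SPARSITY-generated code):

```c
/* Reference y = A*x with A stored in compressed sparse row (CSR) form:
 * row_ptr[i]..row_ptr[i+1]-1 index the nonzeros of row i,
 * col_idx[] holds their column indices, val[] their values. */
void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            yi += val[k] * x[col_idx[k]];
        y[i] = yi;
    }
}
```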
Dense and sparse linear algebra
“Cryptokernels” (e.g., modular exponentiation)
Block-size selection heuristic: choose r×c to maximize Mflops_dense(r,c) / Fill(r,c).
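A minimal sketch of how such a heuristic could be applied, assuming the machine profile Mflops_dense(r,c) and the per-matrix fill estimate Fill(r,c) are available as small tables (names and array bounds are illustrative):

```c
/* Hypothetical sketch: choose the register block size r x c that
 * maximizes Mflops_dense(r,c) / Fill(r,c).
 * mflops_dense[r][c] is the one-time machine profile;
 * fill[r][c] is the estimated fill ratio for this matrix (>= 1). */
#define RMAX 8
#define CMAX 8

void choose_block_size(const double mflops_dense[RMAX][CMAX],
                       const double fill[RMAX][CMAX],
                       int *r_best, int *c_best)
{
    double best = 0.0;
    *r_best = *c_best = 1;
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= CMAX; c++) {
            double est = mflops_dense[r - 1][c - 1] / fill[r - 1][c - 1];
            if (est > best) { best = est; *r_best = r; *c_best = c; }
        }
}
```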
Multiple vectors—Significant speedups are possible when multiplying by several vectors.
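The reason is reuse: each nonzero of A is loaded once and applied to all k vectors, so memory traffic per flop drops. A sketch, assuming the k vectors are stored as row-major blocks X and Y:

```c
/* Hypothetical sketch: multiply A (CSR, m x n) by k vectors at once.
 * X is n x k and Y is m x k, both row-major; each nonzero A(i,j) is
 * loaded once and reused k times. */
void spmv_csr_multi(int m, int k, const int *row_ptr, const int *col_idx,
                    const double *val, const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        for (int v = 0; v < k; v++)
            Y[i * k + v] = 0.0;
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
            double a = val[p];
            const double *xj = &X[col_idx[p] * k];
            for (int v = 0; v < k; v++)
                Y[i * k + v] += a * xj[v];
        }
    }
}
```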
Register blocking example—Portion of a sparse matrix with a 4×3 register block grid superimposed.
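A sketch of the corresponding register-blocked kernel in 4×3 block CSR (BCSR) format; a real code generator would fully unroll the block loops so the four destination values stay in registers:

```c
/* Hypothetical sketch: y = A*x with A in 4x3 block CSR (BCSR) format.
 * brow_ptr/bcol_idx index block rows and block columns; val stores each
 * 4x3 block contiguously, row by row. */
enum { R = 4, C = 3 };

void spmv_bcsr_4x3(int mb,                   /* number of block rows */
                   const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        double yreg[R] = { 0.0, 0.0, 0.0, 0.0 };
        for (int p = brow_ptr[ib]; p < brow_ptr[ib + 1]; p++) {
            const double *a  = &val[p * (R * C)];
            const double *xj = &x[bcol_idx[p] * C];
            for (int i = 0; i < R; i++)
                for (int j = 0; j < C; j++)
                    yreg[i] += a[i * C + j] * xj[j];
        }
        for (int i = 0; i < R; i++)
            y[ib * R + i] = yreg[i];
    }
}
```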
AᵀA times a vector—The matrix A is brought through the memory hierarchy only once.
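The trick is to interleave the two products: for each row a_i of A, compute t = a_i·x and immediately accumulate y += t·a_i while the row is still in cache. A CSR sketch (not the tuned BeBOP code):

```c
/* Hypothetical sketch: y = (A^T A) x with A (m x n) in CSR format,
 * reading each row of A only once: t = a_i . x, then y += t * a_i. */
void ata_times_vec(int m, int n, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)  /* t = a_i . x */
            t += val[k] * x[col_idx[k]];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)  /* y += t * a_i */
            y[col_idx[k]] += t * val[k];
    }
}
```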
Cache blocking—Performance at various cache block sizes on a latent semantic indexing matrix.
Sparse triangular solve—Implementation design space includes SSE2 instructions and “switch-to-dense.”
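A baseline sketch of the forward solve Lx = b in CSR, assuming each row stores its diagonal entry last; a "switch-to-dense" variant would instead hand the dense trailing triangle that typically appears in sparse LU factors to a dense solver:

```c
/* Hypothetical sketch: forward solve L*x = b with lower-triangular L in
 * CSR format, assuming each row's diagonal entry is stored last. */
void sparse_lower_solve(int n, const int *row_ptr, const int *col_idx,
                        const double *val, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double xi = b[i];
        int last = row_ptr[i + 1] - 1;          /* diagonal entry of row i */
        for (int k = row_ptr[i]; k < last; k++)
            xi -= val[k] * x[col_idx[k]];
        x[i] = xi / val[last];
    }
}
```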
Exploiting new structures—Symmetric matrix from a fluid flow modeling problem (left); triangular matrix from LU factorization (right).
Exploiting symmetry—When A = Aᵀ, only half of the matrix need be stored, and each stored element is used twice.
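A sketch assuming only the lower triangle (diagonal included) is stored in CSR; each stored off-diagonal entry contributes to both y[i] and y[j]:

```c
/* Hypothetical sketch: y = A*x for symmetric A, storing only the lower
 * triangle (diagonal included) in CSR.  Each stored off-diagonal entry
 * A(i,j) is used twice: once for y[i] and once for y[j]. */
void spmv_symmetric(int n, const int *row_ptr, const int *col_idx,
                    const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            int j = col_idx[k];
            y[i] += val[k] * x[j];
            if (j != i)
                y[j] += val[k] * x[i];   /* transpose contribution */
        }
    }
}
```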
SPARSITY—Performance improvement after run-time block-size selection.
Why do these four profiles look so different?—We hope to answer this question and understand the implications for current and future architectures and applications.
Register blocking profile—One-time characterization of the machine (Mflop/s).
Observations and Experience
Performance tuning is tedious and time-consuming work.
This approach has been applied successfully in dense linear algebra (PHiPAC ‘97; ATLAS ‘98) and signal processing (FFTW ‘98; SPIRAL ‘00), among others.
(SPARSITY system optimizations also apply to these kernels!)
Platform variability—Distribution of performance over dense matrix multiply implementations on 8 different platforms (architecture + compiler). Tuning done for one platform must be redone for each new platform.
Needle in a haystack—Planar slice of a large space of mathematically equivalent dense matrix multiply implementations: Each square is an implementation color-coded by its performance (Mflop/s) on a 333 MHz Sun Ultra-IIi based workstation. It is not obvious how to model this space.