- 79 Views
- Uploaded on
- Presentation posted in: General

Automatic Performance Tuning of Sparse Matrix Kernels

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

BeBOP

Automatic Performance Tuning of Sparse Matrix Kernels

Richard Vuduc

James Demmel

Katherine Yelick

Yozo Hida

Michael deLorimier

Shoaib Kamil

Rajesh Nishtala

Benjamin Lee

Berkeley Benchmarking and OPtimization Group

www.cs.berkeley.edu/~richie/bebop

Context

The performance of many applications is dominated by a few computational kernels.

Extensions to New Kernels

Preliminary results for symmetric A, ATA, and triangular solve.

Future Work

Integrating with applications, new kernels, further automation, understanding architectural implications.

An Approach to Automatic Tuning

For each kernel, identify and generate a space of implementations, and search for the best one.

Tuning Sparse Matrix-Vector Multiply

The SPARSITY system (Im & Yelick, 1999) applies the methodology to y=Ax, where A is sparse.

- Applications need fast kernels
- Scientific computing; information retrieval
Dense and sparse linear algebra

- Multimedia; audio and image processing
Fast transforms

- Databases
Sorting

- Security
“Cryptokernels” (e.g., modular exponentiation)

- Scientific computing; information retrieval
- Hardware complexity is increasing
- Microprocessor performance difficult to model
- Widening processor-memory gap; deep memory hierarchies

- Implementation space
- Conceptually, the set of “interesting” implementations
- Depend on kernel and input
- May vary:
- Instruction mix and order
- Memory access patterns
- Data structures and precisions
- Mathematical formulation

- Searching
- How?
- Exhaustively
- Heuristically, guided by models

- When?
- Once per kernel and platform
- At compile time
- At run-time
- Hybrid approaches

- How?

- Sparse matrix data structures
- Store only non-zeros
- Data structure overhead + irregular memory access

- Implementation space
- Register blocking: exploit existing dense blocks
- Cache blocking: create better locality in x, y
- Multiple vectors: reuse elements of A

- Searching example: selecting a register block size
- Off-line: One-time characterization of performance on a dense matrix stored in sparse format for all r, c
- At run-time: Estimate rxc fill, and choose r, c to maximize
Mflopsdense(r,c) / Fill(r,c)

Multiple vectors—Significant speedups are possible when multiplying by several vectors.

Register blocking example—Portion of a sparse matrix with a 4x3 register block grid superimposed.

ATA times a vector—The matrix A is brought through the memory hierarchy only once.

Cache blocking—Performance at various cache block sizes on a latent semantic indexing matrix.

Sparse triangular solve—Implementation design space includes SSE2 instructions and “switch-to-dense.”

Exploiting new structures—Symmetric matrix from a fluid flow modeling problem (left); triangular matrix from LU factorization (right).

Exploiting symmetry—When A=AT, only half the matrix need be stored, and each element used twice.

SPARSITY—Performance improvement after run-time block-size selection.

Why do these four profiles look so different?—We hope to answer this question and understand the implications for current and future architectures and applications.

Register blocking profile—One-time characterization of the machine (Mflop/s).

Observations and Experience

Performance tuning is tedious and time-consuming work.

This approach has been applied successfully in dense linear algebra (PHiPAC ‘97; ATLAS ‘98) and signal processing (FFTW ‘98; SPIRAL ‘00), among others.

- Symmetric sparse matrix-vector multiply
- Only store half of the non-zeros
- Reuse each stored element twice

- Sparse triangular solve
- Compute T-1x, where T is a sparse triangular matrix
- Exploit naturally occurring dense structure when T comes from certain applications (LU factorization)

- Multiplying ATA by a vector
- A can be brought through the memory hierarchy once
- Arises in linear programming problems, among others
(SPARSITY system optimizations also apply to these kernels!)

- Integration with applications and existing software libraries
- Creating (via reordering) or exploiting other matrix structures
- New sparse kernels (e.g., powers Ak, triple product ATMA)
- Further automation: generating implementation generators
- Understanding performance in terms of underlying architectures

Platform variability—Distribution of performance over dense matrix multiply implementations on 8 different platforms (architecture + compiler): Performance tuning for any one platform must be redone for each new platform.

Needle in a haystack—Planar slice of a large space of mathematically equivalent dense matrix multiply implementations: Each square is an implementation color-coded by its performance (Mflop/s) on a 333 MHz Sun Ultra-IIi based workstation. It is not obvious how to model this space.