Performance Tuning

[Figure: SpMV performance in Mflop/s — best tuned implementation vs. the reference (CSR) vs. a dense matrix stored in sparse format (90% of non-zeros).]

TOPS is providing applications with highly efficient implementations of common sparse matrix computational kernels, automatically tuned for a user’s kernel, matrix, and machine.
Trends and The Need for Automatically Tuned Sparse Kernels
Less than 10% of peak: Typical untuned sparse matrix-vector multiply (SpMV) performance is below 10% of peak on modern cache-based superscalar machines. With careful tuning, 2x speedups and 30% of peak or more are possible.
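To make the baseline concrete, a minimal sketch of the reference (untuned, unblocked) SpMV in compressed sparse row (CSR) format follows; the function name and example matrix are illustrative, not from the poster. The indirect, irregular loads of x in the inner loop are the main reason this code sits so far below peak.

```python
# Reference SpMV y = A*x in compressed sparse row (CSR) format.
# Untuned code like this typically runs below 10% of machine peak:
# each nonzero costs one indirect, cache-unfriendly load of x and
# only two flops.

def spmv_csr(n_rows, row_ptr, col_idx, vals, x):
    """y[i] = sum of vals[k] * x[col_idx[k]] over row i's nonzeros."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]  # indirect access to x
        y[i] = acc
    return y

# 3x3 example: A = [[2, 0, 1],
#                   [0, 3, 0],
#                   [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0]
print(spmv_csr(3, row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```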
The optimal choice of tuning parameters can be surprising: (Left) A matrix that naturally contains 8x8 dense blocks. (Right) On an Itanium 2, the optimal block size of 4x2 achieves 1.1 Gflop/s (31% of peak) and is over 4x faster than the conventional unblocked (1x1) implementation.
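The register-blocked format behind these numbers stores dense r x c blocks so the inner loop reuses x from registers and amortizes one column index over r*c entries. A minimal sketch of block compressed sparse row (BCSR) SpMV, with hypothetical names and a toy example:

```python
# SpMV in block compressed sparse row (BCSR) format with r x c
# register blocks: one column index per block, r*c values stored
# row-major, so x[j0..j0+c-1] can be reused across the block's rows.

def spmv_bcsr(n_rows, r, c, brow_ptr, bcol_idx, bvals, x):
    y = [0.0] * n_rows
    for bi in range(n_rows // r):          # block rows
        for bk in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j0 = bcol_idx[bk] * c          # first column of this block
            base = bk * r * c              # start of block's values
            for ii in range(r):
                acc = 0.0
                for jj in range(c):
                    acc += bvals[base + ii * c + jj] * x[j0 + jj]
                y[bi * r + ii] += acc
    return y

# One 2x2 block holding A = [[1, 2], [3, 4]]:
print(spmv_bcsr(2, 2, 2, [0, 1], [0],
                [1.0, 2.0, 3.0, 4.0], [1.0, 1.0]))  # [3.0, 7.0]
```

An unrolled kernel is generated per (r, c) in practice; the surprise in the figure is that the best (r, c) need not match the matrix's natural block size.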
Extra work can improve performance: Filling in explicit zeros (shown as x) followed by 3x3 blocking increases the number of flops by 1.5x for this matrix, yet SpMV still runs 1.5x faster than the unblocked code on a Pentium III, because the raw speed in Mflop/s increases by 2.25x.
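The trade-off can be quantified with a fill ratio: the number of stored entries (true non-zeros plus explicit zeros) divided by the true non-zeros. A small sketch, with hypothetical function names, of the fill-ratio computation and the arithmetic behind the claim above:

```python
# Fill ratio of an r x c blocking imposed on a sparsity pattern:
# stored entries (including explicit zeros) / true nonzeros.

def fill_ratio(coords, r, c):
    """coords: list of (i, j) positions of true nonzeros."""
    blocks = {(i // r, j // c) for (i, j) in coords}
    return len(blocks) * r * c / len(coords)

# A diagonal of 3 nonzeros forced into one 3x3 block stores 9 entries:
print(fill_ratio([(0, 0), (1, 1), (2, 2)], 3, 3))  # 3.0

# The trade-off in the text: blocking inflates flops by the fill
# ratio (1.5x here) but raises raw speed 2.25x, so net time shrinks:
fill, mflops_speedup = 1.5, 2.25
time_ratio = fill / mflops_speedup  # new_time / old_time
print(time_ratio)                   # ~0.667, i.e. SpMV runs ~1.5x faster
```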
Search-based Methodology for Automatic Performance Tuning
Complex combinations of dense substructures arise in practice. We are developing tunable data structures and implementations, and automated tuning parameter selection techniques.
Off-line benchmarking characterizes the machine: For r x c register blocking, performance as a function of r and c varies across platforms. (Left) Ultra 3, 1.8 Gflop/s peak. (Right) Itanium 2, 3.6 Gflop/s peak.
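The way these off-line profiles are used can be sketched as follows: benchmark a dense matrix in sparse blocked format once per machine to get Mflop/s for each (r, c), then at run time estimate the fill ratio of the user's matrix and pick the (r, c) maximizing predicted speed over fill. The function and the numbers below are hypothetical illustrations, not measured data.

```python
# Sketch of a heuristic block-size selection: combine a one-time,
# per-machine benchmark profile with a per-matrix fill estimate.

def choose_block_size(bench_mflops, est_fill):
    """bench_mflops, est_fill: dicts keyed by (r, c).
    Maximize effective speed = raw Mflop/s / fill ratio."""
    return max(bench_mflops, key=lambda rc: bench_mflops[rc] / est_fill[rc])

# Hypothetical numbers for illustration only:
bench = {(1, 1): 100.0, (2, 2): 250.0, (4, 2): 330.0}
fill  = {(1, 1): 1.0,   (2, 2): 1.2,   (4, 2): 1.4}
print(choose_block_size(bench, fill))  # (4, 2)
```

The off-line step is paid once per machine; only the cheap fill estimate runs per matrix, which is what makes the tuning practical at run time.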
Impact on Applications and Evaluation of Architectures
Current and Future Work
[Figure: matrix non-zero pattern before (green + red) and after (green + blue) reordering.]
Potential improvements to Tau3P/T3P/Omega3P, SciDAC accelerator cavity design applications by Ko et al. at the Stanford Linear Accelerator Center (SLAC): (Left) Reordering matrix rows and columns, based on approximately solving a Traveling Salesman Problem (TSP), improves locality by creating dense block structure. (Right) Combining TSP reordering, symmetric storage, and register-level blocking leads to uniprocessor speedups of 1.5–3.3x over a naturally ordered, non-symmetric blocked implementation.
Evaluating SpMV performance across architectures: Using a combination of analytical modeling of performance bounds and benchmarking tools being developed by SciDAC-PERC, we are studying the impact of architecture on sparse kernel performance.
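The flavor of such an analytical bound can be sketched with a crude streaming model: SpMV performs 2 flops per nonzero, so an upper bound follows from dividing the flop count by the minimum time to move the matrix and vectors through memory. The function, byte counts, and machine numbers below are illustrative assumptions, not the project's actual model.

```python
# Crude memory-bandwidth upper bound on CSR SpMV performance.
# Assumes values (8 B) and column indices (4 B) plus the source,
# destination, and result vectors each stream through memory once
# at full sustained bandwidth; ignores caches, latency, row pointers.

def spmv_mflops_upper_bound(nnz, n, bandwidth_gbs,
                            bytes_val=8, bytes_idx=4):
    bytes_moved = nnz * (bytes_val + bytes_idx) + 3 * n * bytes_val
    time_s = bytes_moved / (bandwidth_gbs * 1e9)
    return 2.0 * nnz / time_s / 1e6  # Mflop/s

# e.g. 1M nonzeros, n = 100,000, 4 GB/s sustained bandwidth:
print(round(spmv_mflops_upper_bound(1_000_000, 100_000, 4.0)))  # 556
```

Because the bound depends on bandwidth rather than peak flop rate, comparing measured SpMV speed against it shows how much of the gap is architectural rather than a tuning deficiency.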
for more information ...