
CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers


Presentation Transcript


  1. CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers. Kathy Yelick, www.cs.berkeley.edu/~yelick/cs267_sp07. CS267 Lecture 16

  2. Phillip Colella’s “Seven dwarfs”. High-end simulation in the physical sciences = 7 numerical methods:
  • Structured Grids (including locally structured grids, e.g. AMR)
  • Unstructured Grids
  • Fast Fourier Transform
  • Dense Linear Algebra
  • Sparse Linear Algebra
  • Particles
  • Monte Carlo
  • Add 4 for embedded: 8. Search/Sort, 9. Finite State Machine, 10. Filter, 11. Combinational logic; this set then covers all 41 EEMBC benchmarks
  • Revise 1 for SPEC: 7. Monte Carlo => easily parallel (add ray tracing); this set then covers 26 SPEC benchmarks
  Well-defined targets from the algorithmic, software, and architecture standpoints. Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004. CS267 Lecture 16

  3. ODEs and Sparse Matrices
  • All these problems reduce to sparse matrix problems
  • Explicit methods: sparse matrix-vector multiplication (SpMV)
  • Implicit methods: solve a sparse linear system
    • direct solvers (Gaussian elimination)
    • iterative solvers (use sparse matrix-vector multiplication)
  • Eigenvalue/eigenvector algorithms may also be explicit or implicit
  • Conclusion: SpMV is key to many ODE problems
  • Relatively simple algorithm to study in detail
  • Two key problems: locality and load balance
  CS267 Lecture 16

  4. SpMV in Compressed Sparse Row (CSR) Format. CSR format is one of many possibilities. [Figure: representation of A, x, and y.]
  Matrix-vector multiply kernel: y(i) = y(i) + A(i,j)*x(j)
    for each row i
      for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
  CS267 Lecture 16
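A runnable C rendering of the kernel above may make the index conventions concrete. This is a plain sketch, not OSKI's implementation; array names mirror the slide's pseudocode, with ptr of length n+1.

    /* CSR sparse matrix-vector multiply: y = y + A*x.
     * ptr[i]..ptr[i+1]-1 index the nonzeros of row i in val[] and ind[]. */
    void spmv_csr(int n, const int *ptr, const int *ind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {                  /* for each row i */
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i + 1]; k++)  /* nonzeros of row i */
                yi += val[k] * x[ind[k]];              /* y(i) += A(i,j)*x(j) */
            y[i] = yi;
        }
    }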

  5. Motivation for Automatic Performance Tuning of SpMV • Historical trends • Sparse matrix-vector multiply (SpMV): 10% of peak or less • Performance depends on machine, kernel, matrix • Matrix known at run-time • Best data structure + implementation can be surprising • Our approach: empirical performance modeling and algorithm search CS267 Lecture 16

  6. SpMV Historical Trends: Fraction of Peak CS267 Lecture 16

  7. Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem CS267 Lecture 16

  8. Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure CS267 Lecture 16

  9. Taking advantage of block structure in SpMV
  • Bottleneck is the time to get the matrix from memory; only 2 flops for each nonzero in the matrix
  • Don’t store each nonzero with its own index; instead store each nonzero r-by-c block with one index
  • Storage drops by up to 2x if r*c >> 1 (assuming all 32-bit quantities), so the time to fetch the matrix from memory decreases
  • Change both the data structure and the algorithm: need to pick r and c, and change the algorithm accordingly
  • In the example, is r=c=8 the best choice? It minimizes storage, so it looks like a good idea…
  CS267 Lecture 16
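To make the data-structure change concrete, here is a hedged sketch of a register-blocked (BCSR) kernel for fixed r = c = 2; the array names (b_ptr, b_ind, b_val) are illustrative, the number of rows is assumed even, and a tuned library would generate one such unrolled routine per (r, c).

    /* 2x2 BCSR SpMV: b_ptr/b_ind index dense 2x2 blocks; each block stores
     * 4 values contiguously (row-major), including any explicit fill zeros. */
    void spmv_bcsr_2x2(int nb_rows, const int *b_ptr, const int *b_ind,
                       const double *b_val, const double *x, double *y)
    {
        for (int I = 0; I < nb_rows; I++) {            /* block row I -> rows 2I, 2I+1 */
            double y0 = y[2 * I], y1 = y[2 * I + 1];
            for (int K = b_ptr[I]; K < b_ptr[I + 1]; K++) {
                const double *b = &b_val[4 * K];
                int j = 2 * b_ind[K];                  /* first column of this block */
                y0 += b[0] * x[j] + b[1] * x[j + 1];   /* unrolled 2x2 multiply */
                y1 += b[2] * x[j] + b[3] * x[j + 1];
            }
            y[2 * I] = y0;
            y[2 * I + 1] = y1;
        }
    }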

  10. Speedups on Itanium 2: The Need for Search. [Register-profile plot in Mflop/s over r-by-c block sizes; the reference (1x1) point and the best block size, 4x2, are marked.] CS267 Lecture 16

  11. Register Profile: Itanium 2. [Heatmap of SpMV performance over register block sizes, ranging from 190 Mflop/s to 1190 Mflop/s.] CS267 Lecture 16

  12. SpMV Performance (Matrix #2): Generation 2. [Four plots, best vs. reference Mflop/s: Ultra 2i, 9% of peak (63 vs. 35); Ultra 3, 5% (109 vs. 53); Pentium III, 19% (96 vs. 42); Pentium III-M, 15% (120 vs. 58).] CS267 Lecture 16

  13. Register Profiles: Sun and Intel x86. [Four plots, best vs. reference Mflop/s: Ultra 2i, 11% of peak (72 vs. 35); Ultra 3, 5% (90 vs. 50); Pentium III, 21% (108 vs. 42); Pentium III-M, 15% (122 vs. 58).] CS267 Lecture 16

  14. SpMV Performance (Matrix #2): Generation 1. [Four plots, best vs. reference Mflop/s: Power3, 13% of peak (195 vs. 100); Power4, 14% (703 vs. 469); Itanium 1, 7% (225 vs. 103); Itanium 2, 31% (1.1 Gflop/s vs. 276 Mflop/s).] CS267 Lecture 16

  15. Register Profiles: IBM and Intel IA-64. [Four plots, best vs. reference Mflop/s: Power3, 17% of peak (252 vs. 122); Power4, 16% (820 vs. 459); Itanium 1, 8% (247 vs. 107); Itanium 2, 33% (1.2 Gflop/s vs. 190 Mflop/s).] CS267 Lecture 16

  16. Another example of tuning challenges • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M CS267 Lecture 16

  17. Zoom in to top corner • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M CS267 Lecture 16

  18. 3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • But would lead to lots of “fill-in” CS267 Lecture 16

  19. Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill in explicit zeros • Unroll the 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup, so the actual Mflop rate is 1.5^2 = 2.25x higher CS267 Lecture 16

  20. Automatic Register Block Size Selection. Selecting the r x c block size:
  • Off-line benchmark: precompute Mflops(r,c) using dense A for each r x c, once per machine/architecture
  • Run-time “search”: sample A to estimate Fill(r,c) for each r x c
  • Run-time heuristic model: choose r, c to minimize estimated time ~ Fill(r,c) / Mflops(r,c)
  CS267 Lecture 16
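A hedged sketch of that heuristic in C follows; the 8x8 search range, the array layout, and the function name are assumptions for illustration, not OSKI's actual code.

    /* Pick (r, c) minimizing estimated time ~ Fill(r,c) / Mflops(r,c).
     * Mflops comes from the off-line dense benchmark, Fill from sampling A. */
    #define RMAX 8
    #define CMAX 8
    void choose_block_size(const double Mflops[RMAX][CMAX],
                           const double Fill[RMAX][CMAX],
                           int *best_r, int *best_c)
    {
        double best_time = 1e300;
        for (int r = 1; r <= RMAX; r++) {
            for (int c = 1; c <= CMAX; c++) {
                double t = Fill[r - 1][c - 1] / Mflops[r - 1][c - 1];
                if (t < best_time) { best_time = t; *best_r = r; *best_c = c; }
            }
        }
    }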

  21. Accurate and Efficient Adaptive Fill Estimation
  • Idea: sample the matrix
  • Fraction of the matrix to sample: s ∈ [0,1]
  • Cost ~ O(s * nnz); control the cost by controlling s
  • The search happens at run-time, so the constant matters!
  • Control s automatically by computing statistical confidence intervals; idea: monitor the variance
  • Cost of tuning
    • Lower bound: converting the matrix costs the equivalent of 5 to 40 unblocked SpMVs
    • Heuristic: 1 to 11 SpMVs
  CS267 Lecture 16
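A hedged sketch of sampled fill estimation for one block size (r, c): visit a fraction s of the block rows, count how many r-by-c blocks their nonzeros would occupy, and return stored entries divided by true nonzeros. The adaptive part (confidence intervals, variance monitoring) is omitted; a fixed s and all names here are illustrative assumptions.

    #include <stdlib.h>

    double estimate_fill(int m, int n, const int *ptr, const int *ind,
                         int r, int c, double s)
    {
        int nbcols = (n + c - 1) / c;
        char *touched = calloc(nbcols, 1);      /* block columns hit in this block row */
        long long nnz_seen = 0, blocks = 0;
        int step = s > 0 ? (int)(1.0 / s) : 1;  /* sample every step-th block row */
        if (step < 1) step = 1;

        for (int I = 0; I < (m + r - 1) / r; I += step) {
            int row_end = (I + 1) * r < m ? (I + 1) * r : m;
            for (int i = I * r; i < row_end; i++) {
                for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                    int J = ind[k] / c;
                    if (!touched[J]) { touched[J] = 1; blocks++; }
                    nnz_seen++;
                }
            }
            for (int i = I * r; i < row_end; i++)       /* reset only the marks we set */
                for (int k = ptr[i]; k < ptr[i + 1]; k++)
                    touched[ind[k] / c] = 0;
        }
        free(touched);
        return nnz_seen ? (double)(blocks * r * c) / nnz_seen : 1.0;
    }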

  22. Accuracy of the Tuning Heuristics (1/4) See p. 375 of Vuduc’s thesis for matrices CS267 Lecture 16 NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)

  23. Accuracy of the Tuning Heuristics (2/4) CS267 Lecture 16

  24. Accuracy of the Tuning Heuristics (3/4) CS267 Lecture 16

  25. Accuracy of the Tuning Heuristics (4/4), with DGEMV shown for comparison CS267 Lecture 16

  26. Upper Bounds on Performance for blocked SpMV
  • P = (flops) / (time); flops = 2 * nnz(A)
  • Lower bound on time: two main assumptions
    • 1. Count memory operations only (streaming)
    • 2. Count only compulsory and capacity misses: ignore conflict misses
  • Account for cache line sizes
  • Account for matrix size and nnz
  • Charge the minimum access “latency” alpha_i at the L_i cache and alpha_mem at memory, e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
  CS267 Lecture 16
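To show the shape of such a bound, here is a simplified sketch: it counts only compulsory matrix and vector traffic for r-by-c BCSR and charges a single streaming bandwidth, whereas the slide's model charges per-cache-level latencies. The single-bandwidth simplification and all names are my assumptions.

    /* Rough Mflop/s upper bound for r-by-c blocked SpMV under a pure
     * streaming-bandwidth model (not the slide's per-level latency model). */
    double spmv_mflops_upper_bound(long long nnz, int m, int n, int r, int c,
                                   double fill_ratio, double bandwidth_GBps)
    {
        double nnz_stored = (double)nnz * fill_ratio;    /* includes explicit zeros */
        double bytes = 8.0 * nnz_stored                  /* block values            */
                     + 4.0 * (nnz_stored / (r * c))      /* block column indices    */
                     + 4.0 * (m / r + 1)                 /* block row pointers      */
                     + 8.0 * (n + m);                    /* read x, write y         */
        double seconds = bytes / (bandwidth_GBps * 1e9);
        return (2.0 * nnz / seconds) / 1e6;              /* flops / time, in Mflop/s */
    }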

  27. Example: L2 Misses on Itanium 2 CS267 Lecture 16 Misses measured using PAPI [Browne ’00]

  28. Example: Bounds on Itanium 2 CS267 Lecture 16

  29. Example: Bounds on Itanium 2 CS267 Lecture 16

  30. Example: Bounds on Itanium 2 CS267 Lecture 16

  31. Summary of Other Performance Optimizations
  • Optimizations for SpMV
    • Register blocking (RB): up to 4x over CSR
    • Variable block splitting: 2.1x over CSR, 1.8x over RB
    • Diagonals: 2x over CSR
    • Reordering to create dense structure + splitting: 2x over CSR
    • Symmetry: 2.8x over CSR, 2.6x over RB
    • Cache blocking: 2.8x over CSR
    • Multiple vectors (SpMM): 7x over CSR
    • And combinations…
  • Sparse triangular solve
    • Hybrid sparse/dense data structure: 1.8x over CSR
  • Higher-level kernels
    • A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
    • A^2*x: 2x over CSR, 1.5x over RB
  CS267 Lecture 16

  32. SpMV for Shared Memory and Multicore
  • Data structure transformations
    • Thread blocking
    • Cache blocking
    • Register blocking
    • Format selection
    • Index size reduction
  • Kernel optimizations
    • Prefetching
    • Loop structure
  CS267 Lecture 16

  33. Thread Blocking • Load balancing: evenly divide the number of nonzeros • Exploit NUMA memory systems on multi-socket SMPs: must pin threads to cores AND pin data to sockets CS267 Lecture 16
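As a concrete illustration of the pinning point, here is a minimal Linux-specific sketch using pthread_setaffinity_np; having each pinned thread then initialize its own rows of A (first-touch page placement) is one common way to also keep data on the local socket. The helper name and core numbering are assumptions.

    /* Pin the calling thread to one core (Linux, GNU extensions). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);              /* allow only the given core */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }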

  34. Naïve Approach • R x C processor grid • Each covers the same number of rows and columns. • Potentially unbalanced CS267 Lecture 16

  35. Load Balanced Approach
  • R x C processor grid
  • First, block into rows: the same number of nonzeros in each of the R blocked rows
  • Second, block within each blocked row: not only should each block within a row have ~the same number of nonzeros, but all blocks should have ~the same number of nonzeros
  • Third, prune unneeded rows & columns
  • Fourth, re-encode the column indices to be relative to each thread block
  CS267 Lecture 16
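A minimal sketch of the first step, splitting rows so each of the R row blocks gets roughly the same number of nonzeros, assuming a CSR row-pointer array; the function and array names are illustrative.

    /* row_start[t] = first row owned by block t; row_start has length R+1. */
    void partition_rows_by_nnz(const int *row_ptr, int n, int R, int *row_start)
    {
        long long nnz = row_ptr[n];
        int t = 0;
        row_start[0] = 0;
        for (int i = 0; i < n && t + 1 < R; i++) {
            /* start the next block once this one holds its fair share of nnz */
            if ((long long)row_ptr[i + 1] * R >= (long long)(t + 1) * nnz)
                row_start[++t] = i + 1;
        }
        while (t + 1 < R) row_start[++t] = n;   /* degenerate cases (tiny matrices) */
        row_start[R] = n;
    }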

  36. Memory Optimizations
  • Cache blocking: performed for each thread block; chop into blocks so the entire source vector fits in cache
  • Prefetching: insert explicit prefetch operations to mask latency to memory; tune the prefetch distance/time using search
  • Register blocking: as in OSKI, but done separately per cache block, with a simpler heuristic: choose the block size that minimizes total storage
  • Index compression: use 16-bit integers for column indices in blocks less than 64K wide
  CS267 Lecture 16
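For the prefetching item, a hedged sketch of a CSR loop using GCC/Clang's __builtin_prefetch; the prefetch distance PF_DIST is exactly the knob the slide says should be tuned by search (64 is only a placeholder, and prefetches that run past the end of an array are harmless hints).

    #define PF_DIST 64
    void spmv_csr_prefetch(int n, const int *ptr, const int *ind,
                           const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                __builtin_prefetch(&val[k + PF_DIST], 0, 0);   /* values ahead  */
                __builtin_prefetch(&ind[k + PF_DIST], 0, 0);   /* indices ahead */
                yi += val[k] * x[ind[k]];
            }
            y[i] = yi;
        }
    }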

  37. 1-thread performance (preliminary) for memplus.rua and raefsky3.rua on a quad-socket, single-core Opteron @ 2.4 GHz and a dual-socket, dual-core Opteron @ 2.2 GHz. [Bar charts comparing naive CSR, register blocking (annotated speedups of 1.4x to 2x), and software prefetch (annotated speedups of 1.4x to 3.2x); the individual Mflop/s figures are not reliably recoverable from this transcript.] CS267 Lecture 16

  38. 2-thread performance (preliminary), same matrices and machines. [Bar charts comparing naive and thread/register-blocked variants with software prefetch; best annotated rates of 495, 755, 1284, and 1639 Mflop/s with speedups of 0.96x, 1.6x, 1.85x, and 1.2x respectively.] CS267 Lecture 16

  39. 4-thread performance (preliminary), same matrices and machines. [Bar charts; best annotated rates of 985, 1369, 1911, and 3148 Mflop/s with speedups of 2.0x, 3.0x, 2.75x, and 2.3x respectively.] CS267 Lecture 16

  40. Speedup for the best combination of number of threads, blocking, prefetching, etc., for memplus.rua and raefsky3.rua on the two Opteron platforms: 3.8x, 4.2x, 6.4x, and 7.3x over the naive baselines. [Naive rates annotated at 258 to 430 Mflop/s; best tuned rates at 985 to 3148 Mflop/s.] CS267 Lecture 16

  41. Distributed Memory SpMV
  • y = A*x, where A is a sparse n x n matrix
  • Questions: which processors store y[i], x[i], and A[i,j]? which processors compute y[i] = sum over j of A[i,j] * x[j] = (row i of A) * x … a sparse dot product?
  • Partitioning: partition the index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) * x
  • “Owner computes” rule: processor k computes the y[i]s it owns; this may require communication to obtain remote entries of x
  [Figure: 1D row layout of x, A, and y across processors P1 through P4.]
  CS267 Lecture 16
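A minimal MPI sketch of the "owner computes" rule with a 1D row partition. For simplicity it gathers all of x on every rank; a real code would only exchange the x entries each rank's column pattern needs. All names, and the caller-provided counts/displs and x_full buffer, are assumptions for illustration.

    #include <mpi.h>

    /* Each rank owns n_local rows of A in CSR form (global column indices),
     * its slice x_local of x, and its slice y_local of y. */
    void spmv_1d(int n_local, const int *ptr, const int *ind, const double *val,
                 const double *x_local, double *y_local,
                 int *counts, int *displs, double *x_full, MPI_Comm comm)
    {
        /* Gather the full source vector so every x[j] is available locally. */
        MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, comm);
        for (int i = 0; i < n_local; i++) {
            double yi = 0.0;
            for (int k = ptr[i]; k < ptr[i + 1]; k++)
                yi += val[k] * x_full[ind[k]];   /* sparse dot product with row i */
            y_local[i] = yi;
        }
    }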

  42. Two Layouts
  • The partitions should be by nonzero counts, not by rows/columns
  • 1D partition: most popular, but for algorithms (e.g. NAS CG) that do reductions on y, the reductions scale with log P
  • 2D partition: reductions scale with log sqrt(P), but it needs ~equal nonzero counts per block for load balance
  [Figure: 1D and 2D partitions of x, A, and y across processors P1 through P4.]
  CS267 Lecture 16

  43. Summary • Sparse matrix-vector multiply is critical to many applications • Performance is limited by memory systems (and perhaps the network) • Cache blocking, register blocking, and prefetching are all important • Autotuning can be used, but it needs the matrix structure, which is only known at run-time CS267 Lecture 16

  44. Extra Slides Including: How to use OSKI CS267 Lecture 16
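Since the extra slides mention how to use OSKI, here is a short usage sketch reconstructed from the OSKI user's guide as I recall it; treat the exact call signatures (the property arguments of oski_CreateMatCSR in particular) as assumptions to be checked against oski/oski.h.

    #include <oski/oski.h>

    /* Typical pattern: create tunable handles, give workload hints, tune,
     * then call the tuned SpMV inside the iterative solver loop. */
    void solve_loop(int *Aptr, int *Aind, double *Aval, int n,
                    double *x, double *y, int num_iters)
    {
        oski_Init();
        oski_matrix_t A  = oski_CreateMatCSR(Aptr, Aind, Aval, n, n,
                                             SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
        oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
        oski_vecview_t yv = oski_CreateVecView(y, n, STRIDE_UNIT);

        oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, num_iters);
        oski_TuneMat(A);                         /* may pick a blocked format */

        for (int it = 0; it < num_iters; it++)
            oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);   /* y = A*x */

        oski_DestroyMat(A);
        oski_DestroyVecView(xv);
        oski_DestroyVecView(yv);
        oski_Close();
    }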

  45. Example: Sparse Triangular Factor. Raefsky4 (structural problem) + SuperLU + colmmd. N = 19779, nnz = 12.6 M. Dense trailing triangle: dim = 2268, 20% of total nonzeros (can be as high as 90+%). Hybrid sparse/dense structure: 1.8x over CSR. CS267 Lecture 16

  46. Cache Optimizations for A*A^T*x [figure annotations: “dot product”, “axpy”] • Cache-level: interleave multiplication by A and A^T, so A is only fetched from memory once • Register-level: treat a_i^T as an r x c block row, or a diagonal row CS267 Lecture 16
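The cache-level idea can be written as a single pass that pairs a dot product and an axpy on each stored row. The sketch below computes y = M^T*(M*x) for a CSR matrix M; applying it with M = A^T (i.e., A stored by columns) gives the slide's A*A^T*x kernel. This is an unblocked illustration, not the tuned register-level version, and the names are mine.

    /* One pass over the stored rows of M: t = (row i of M) . x  (dot product),
     * then y += t * (row i of M)  (axpy). Caller must zero y beforehand. */
    void mtm_times_x(int n_rows, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n_rows; i++) {
            double t = 0.0;
            for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* dot product */
                t += val[k] * x[ind[k]];
            for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* axpy        */
                y[ind[k]] += t * val[k];
        }
    }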

  47. Example: Combining Optimizations • Register blocking, symmetry, multiple (k) vectors • Three low-level tuning parameters: r, c, v [Figure: blocked Y += A * X, with r x c register blocks and k right-hand-side vectors processed v at a time.] CS267 Lecture 16

  48. Example: Combining Optimizations • Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB] • Symmetric, blocked, 1 vector: up to 2.6x over nonsymmetric, blocked, 1 vector • Symmetric, blocked, k vectors: up to 2.1x over nonsymmetric, blocked, k vectors, and up to 7.3x over nonsymmetric, nonblocked, 1 vector • Symmetric storage: up to 64.7% savings CS267 Lecture 16
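To show where the symmetric-storage savings come from, here is a minimal unblocked sketch assuming only the diagonal and upper triangle of A are stored in CSR (names are illustrative). Each off-diagonal entry is applied twice, halving matrix traffic at the cost of scattered updates to y.

    /* Symmetric SpMV, upper-triangle CSR storage (column indices j >= i).
     * Caller initializes y (e.g. to zero); rows are processed in order. */
    void spmv_sym_csr(int n, const int *ptr, const int *ind,
                      const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];                 /* already holds mirrored updates from earlier rows */
            for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                int j = ind[k];
                yi += val[k] * x[j];
                if (j != i)
                    y[j] += val[k] * x[i];    /* mirror the off-diagonal entry */
            }
            y[i] = yi;
        }
    }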

  49. Potential Impact on Applications: T3P • Application: accelerator design [Ko] • 80% of time spent in SpMV • Relevant optimization techniques • Symmetric storage • Register blocking • On Single Processor Itanium 2 • 1.68x speedup • 532 Mflops, or 15% of 3.6 GFlop peak • 4.4x speedup with multiple (8) vectors • 1380 Mflops, or 38% of peak CS267 Lecture 16

  50. Potential Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques • Symmetric storage • Register blocking • Reordering • Reverse Cuthill-McKee ordering to reduce bandwidth • Traveling Salesman Problem-based ordering to create blocks • Nodes = columns of A • Weights(u, v) = no. of nz u, v have in common • Tour = ordering of columns • Choose maximum weight tour • See [Pinar & Heath ’97] • 2.1x speedup on Power 4, but SPMV not dominant CS267 Lecture 16
