Sparse LA Sathish Vadhiyar
Motivation • Sparse computations much more challenging than dense due to complex data structures and memory references • Many physical systems produce sparse matrices
Cache in GPUs • Fermi has a 16 KB/48 KB L1 cache per SM and a global 768 KB L2 cache • Each cache line is 128 bytes, to provide high memory bandwidth • Thus, when the 48 KB configuration is used, only 48 KB / 128 bytes = 384 cache lines can be held at a time • A GPU executes up to 1536 threads per SM • If all threads access different cache lines, 384 of them get cache hits and the rest get cache misses • Thus, threads in the same block should work on the same cache lines
Compressed Sparse Row (CSR) Format • SpMV (Sparse Matrix-Vector Multiplication) using CSR
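As a reference point, here is a minimal host-side sketch of the CSR layout and a serial SpMV loop; the array names (rowPtr, colIdx, val) are illustrative, not taken from the slides.

```cuda
// Minimal CSR container and a serial reference SpMV computing y = A*x.
struct CSRMatrix {
    int n;          // number of rows
    int *rowPtr;    // size n+1: row i occupies indices [rowPtr[i], rowPtr[i+1])
    int *colIdx;    // size nnz: column index of each stored nonzero
    float *val;     // size nnz: value of each stored nonzero
};

void spmv_serial(const CSRMatrix &A, const float *x, float *y)
{
    for (int row = 0; row < A.n; ++row) {
        float sum = 0.0f;
        for (int j = A.rowPtr[row]; j < A.rowPtr[row + 1]; ++j)
            sum += A.val[j] * x[A.colIdx[j]];
        y[row] = sum;
    }
}
```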
Naïve CUDA Implementation • Assign one thread to each row • Define the block size as a multiple of the warp size, e.g., 128 • Can handle at most 128 x 65535 = 8,388,480 rows, where 65535 is the maximum number of blocks in one grid dimension
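A minimal sketch of such a scalar (one-thread-per-row) CSR kernel, using the block size of 128 from the slide; names are illustrative.

```cuda
// One thread per row: thread 'row' computes y[row] = A[row,:] * x.
// rowPtr, colIdx, val are device pointers holding the matrix in CSR form.
__global__ void spmv_csr_scalar(int n, const int *rowPtr, const int *colIdx,
                                const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];
        y[row] = sum;
    }
}

// Launch with a block size of 128 (a multiple of the warp size):
//   spmv_csr_scalar<<<(n + 127) / 128, 128>>>(n, rowPtr, colIdx, val, x, y);
```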
Naïve CUDA Implementation - Drawbacks • For large matrices with several elements per row, the implementation suffers from a high cache miss rate, since the cache cannot hold all the cache lines being used • Thus coalescing/caching is poor for long rows • If the number of non-zeros per row has high variance, warp divergence occurs
Thread Cooperation • Multiple threads can be assigned to work on the same row • Cooperating threads act on adjacent elements of the row; perform multiplication with elements of vector x; add up their results in shared memory using reduction • First thread of the group writes the result to vector y • If the number of cooperating threads, coop, is less than warp size, the synchronization between cooperating threads is implicit
Analysis • The same cache lines are used by cooperating threads • Improves coalescing/caching • If the length of the row is not a multiple of 32, this can lead to warp divergence and loss in performance [Figure: example with coop = 4, showing coalesced accesses and good caching, and where warp divergence occurs]
Granularity • If a group of cooperating threads acts on only one row, the number of blocks required for the entire matrix may exceed 65535 • Thus, more than one row can be processed per cooperating group • The number of rows processed by a cooperating group is denoted repeat • A thread block processes repeat*blockSize/coop consecutive rows • The algorithm can thus be parametrized by blockSize, coop and repeat (see the sketch below)
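A hedged sketch of such a cooperative kernel is given below. coop and repeat are compile-time template parameters here purely for simplicity, and the shared-memory reduction relies on the implicit intra-warp synchronization the slides assume (coop at most the warp size, on Fermi-class GPUs); this illustrates the scheme rather than reproducing the exact kernel of Reguly and Giles.

```cuda
// Cooperative CSR SpMV: each group of COOP threads handles REPEAT consecutive
// rows; a block therefore covers REPEAT * blockDim.x / COOP consecutive rows.
template <int COOP, int REPEAT>
__global__ void spmv_csr_coop(int n, const int *rowPtr, const int *colIdx,
                              const float *val, const float *x, float *y)
{
    extern __shared__ float smem[];              // blockDim.x floats
    volatile float *partial = smem;              // volatile for warp-synchronous use
    const int lane   = threadIdx.x % COOP;       // position within the group
    const int group  = threadIdx.x / COOP;       // group index within the block
    const int groups = blockDim.x / COOP;        // cooperating groups per block
    int row = (blockIdx.x * groups + group) * REPEAT;   // first row of this group

    for (int r = 0; r < REPEAT && row < n; ++r, ++row) {
        float sum = 0.0f;
        // Adjacent lanes read adjacent nonzeros of the row -> coalesced loads.
        for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += COOP)
            sum += val[j] * x[colIdx[j]];
        partial[threadIdx.x] = sum;
        // Intra-group reduction in shared memory; with COOP <= warp size all
        // lanes of a group sit in one warp, so (on the Fermi-era GPUs assumed
        // in the slides) no explicit synchronization is needed.
        for (int s = COOP / 2; s > 0; s >>= 1)
            if (lane < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        if (lane == 0) y[row] = partial[threadIdx.x];   // first thread writes result
    }
}

// Example launch with blockSize = 128, coop = 4, repeat = 2:
//   int rowsPerBlock = 2 * 128 / 4;
//   spmv_csr_coop<4, 2><<<(n + rowsPerBlock - 1) / rowsPerBlock, 128,
//                         128 * sizeof(float)>>>(n, rowPtr, colIdx, val, x, y);
```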
References • Efficient Sparse Matrix-Vector Multiplication on cache-based GPUs. Reguly, Giles. InPar 2012.
Parallel solution of linear systems with special matrices: Tridiagonal Matrices
The coefficient matrix has a_i on the diagonal, h_i on the superdiagonal and g_i on the subdiagonal:
[ a_1 h_1                ] [x_1]   [b_1]
[ g_2 a_2 h_2            ] [x_2]   [b_2]
[     g_3 a_3 h_3        ] [x_3] = [b_3]
[          ...           ] [ . ]   [ . ]
[              g_n a_n   ] [x_n]   [b_n]
In general: g_i x_{i-1} + a_i x_i + h_i x_{i+1} = b_i
Substituting for x_{i-1} and x_{i+1} in terms of {x_{i-2}, x_i} and {x_i, x_{i+2}} respectively:
G_i x_{i-2} + A_i x_i + H_i x_{i+2} = B_i
Tridiagonal Matrices
After the substitution, equation i couples x_{i-2}, x_i and x_{i+2}; the new matrix has A_i on the diagonal and G_i, H_i two positions to its left and right:
[ A_1     H_1             ] [x_1]   [B_1]
[     A_2     H_2         ] [x_2]   [B_2]
[ G_3     A_3     H_3     ] [x_3] = [B_3]
[     G_4     A_4     H_4 ] [ . ]   [ . ]
[           ...           ] [ . ]   [ . ]
[         G_n         A_n ] [x_n]   [B_n]
Reordering:
Tridiagonal Matrices
Reordering the unknowns and equations as (2, 4, ..., n, 1, 3, ..., n-1) makes the matrix block diagonal: the even equations (coefficients G_i, A_i, H_i for even i, right-hand sides B_2, B_4, ..., B_n) form one tridiagonal system in x_2, x_4, ..., x_n, and the odd equations form an independent tridiagonal system in x_1, x_3, ..., x_{n-1} with right-hand sides B_1, B_3, ..., B_{n-1}.
Tridiagonal Systems • Thus the problem of size n has been split into even and odd equations, each a system of size n/2 • This is odd-even reduction • For parallelization, each process can divide the problem into subproblems of smaller size and solve the subproblems • This is a divide-and-conquer technique
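A host-side sketch of one odd-even reduction step, obtained by eliminating x_{i-1} and x_{i+1} from equation i using equations i-1 and i+1; the function and array names are illustrative.

```cuda
// One odd-even (cyclic) reduction step for the tridiagonal system
//   g[i]*x[i-1] + a[i]*x[i] + h[i]*x[i+1] = b[i],   i = 1..n (index 0 unused).
// Produces G, A, H, B such that  G[i]*x[i-2] + A[i]*x[i] + H[i]*x[i+2] = B[i].
void odd_even_step(int n, const double *g, const double *a, const double *h,
                   const double *b, double *G, double *A, double *H, double *B)
{
    for (int i = 1; i <= n; ++i) {
        double alpha = (i > 1) ? -g[i] / a[i - 1] : 0.0;  // eliminates x[i-1]
        double beta  = (i < n) ? -h[i] / a[i + 1] : 0.0;  // eliminates x[i+1]
        G[i] = (i > 1) ? alpha * g[i - 1] : 0.0;
        H[i] = (i < n) ? beta  * h[i + 1] : 0.0;
        A[i] = a[i] + ((i > 1) ? alpha * h[i - 1] : 0.0)
                    + ((i < n) ? beta  * g[i + 1] : 0.0);
        B[i] = b[i] + ((i > 1) ? alpha * b[i - 1] : 0.0)
                    + ((i < n) ? beta  * b[i + 1] : 0.0);
    }
    // Keeping only the even i (or only the odd i) gives an independent
    // tridiagonal system of about half the size, as on the previous slides.
}
```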
Tridiagonal Systems - Parallelization • At each stage one representative process of the domain of processes is chosen • This representative performs the odd-even reduction of a problem of size m into two problems of size m/2 • The two subproblems are distributed to 2 representatives [Figure: recursion tree of problem sizes n, n/2, n/4, n/8 with the representative processes 1-8 responsible for each subproblem]
Sparse Cholesky • To solve Ax = b • Most of the research, and the base case considered here, concerns sparse symmetric positive definite matrices • A = LLT; Ly = b; LTx = y • Cholesky factorization introduces fill-in
Fill-in
[Figure: a graph on nodes 1-10 and its matrix before and after elimination, with the new nonzeros highlighted]
Fill: new nonzeros in the factor
Permutation Matrix or Ordering • Thus we order to reduce fill or to enhance numerical stability • Choose a permutation matrix P so that the Cholesky factor L’ of PAPT has less fill than L • Triangular solves: L’y = Pb; L’Tz = y; x = PTz • The fill can be predicted in advance • A static data structure can be used – symbolic factorization
Steps • Ordering: • Find a permutation P of matrix A, • Symbolic factorization: • Set up a data structure for the Cholesky factor L of PAPT, • Numerical factorization: • Decompose PAPT into LLT, • Triangular system solution: • Ly = Pb; LTz = y; x = PTz.
Sparse Matrices and Graph Theory
[Figure: the graph G(A) of a 7-node sparse matrix and the sequence of elimination graphs obtained as vertices 1, 2, 3, ... are eliminated]
Sparse Matrices and Graph Theory (contd.)
[Figure: the filled graph F(A), i.e., G(A) together with the fill edges created during elimination]
Ordering • The above order of elimination is “natural” • The first heuristic is minimum degree ordering • Simple and effective • But efficiency depends on the tie-breaking strategy • Difficult to parallelize!
Minimum degree ordering for the previous matrix
[Figure: the 7-node graph eliminated in the order {2, 4, 5, 7, 3, 1, 6}]
Ordering – {2, 4, 5, 7, 3, 1, 6}. No fill-in!
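A minimal sketch of the basic minimum degree heuristic (no quotient-graph or multiple-elimination refinements; ties are broken arbitrarily here, which is exactly the aspect the previous slide flags as important).

```cuda
#include <set>
#include <vector>

// Basic minimum degree ordering on an undirected graph stored as adjacency
// sets: repeatedly eliminate a vertex of minimum current degree and connect
// its remaining neighbours into a clique (the fill its elimination creates).
std::vector<int> minimum_degree_order(std::vector<std::set<int>> adj)
{
    const int n = (int)adj.size();
    std::vector<char> eliminated(n, 0);
    std::vector<int> order;
    for (int step = 0; step < n; ++step) {
        int v = -1;                                   // pick a min-degree vertex
        for (int u = 0; u < n; ++u)
            if (!eliminated[u] && (v < 0 || adj[u].size() < adj[v].size()))
                v = u;
        order.push_back(v);
        eliminated[v] = 1;
        for (int a : adj[v])                          // clique among the neighbours
            for (int b : adj[v])
                if (a != b) adj[a].insert(b);
        for (int a : adj[v]) adj[a].erase(v);         // remove v from the graph
        adj[v].clear();
    }
    return order;
}
```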
Ordering • Another ordering is nested dissection (divide-and-conquer) • Find a separator S of nodes whose removal (along with incident edges) divides the graph into 2 disjoint pieces • Variables in each piece are numbered contiguously and variables in S are numbered last • Leads to a bordered block diagonal non-zero pattern • Can be applied recursively (a recursive numbering sketch follows the illustrations below) • Can be parallelized using a divide-and-conquer approach
Nested Dissection Illustration
A 5 x 5 grid graph numbered naturally (column by column); the middle column is the separator S:
 1   6  11  16  21
 2   7  12  17  22
 3   8  13  18  23
 4   9  14  19  24
 5  10  15  20  25
Nested Dissection Illustration
The same grid with nested dissection numbering: each half is numbered contiguously and the separator S (middle column) is numbered last:
 1   2  21  11  12
 3   4  22  13  14
 5   6  23  15  16
 7   8  24  17  18
 9  10  25  19  20
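A recursive numbering skeleton for nested dissection is sketched below, assuming some separator-finding heuristic is supplied; the findSeparator callback is a placeholder for any such heuristic (for instance the K-L based one described later), not an API from the slides.

```cuda
#include <functional>
#include <vector>

// findSeparator splits `nodes` into two disjoint pieces and a node separator.
// It is assumed to always return strictly smaller pieces.
using Split = std::function<void(const std::vector<int> &nodes,
                                 std::vector<int> &part1,
                                 std::vector<int> &part2,
                                 std::vector<int> &sep)>;

// Appends a nested-dissection ordering of `nodes` to `order`:
// each piece is numbered contiguously, separator nodes are numbered last.
void nested_dissection(const std::vector<int> &nodes, const Split &findSeparator,
                       std::vector<int> &order)
{
    if (nodes.size() <= 2) {                      // small piece: number directly
        order.insert(order.end(), nodes.begin(), nodes.end());
        return;
    }
    std::vector<int> p1, p2, sep;
    findSeparator(nodes, p1, p2, sep);
    nested_dissection(p1, findSeparator, order);  // recurse on each piece
    nested_dissection(p2, findSeparator, order);
    order.insert(order.end(), sep.begin(), sep.end());  // separator numbered last
}
```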
Numerical Factorization • cmod(j, k): modification of column j by column k, k < j • cdiv(j): division of column j by a scalar
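To make the two primitives concrete, here is a dense left-looking column-Cholesky sketch written entirely in terms of cmod and cdiv; sparsity is ignored for clarity (in the sparse case, column j is modified only by the columns k with L(j, k) != 0).

```cuda
#include <cmath>

// cmod(j, k): subtract a multiple of column k from column j (k < j).
static void cmod(int n, double *A, int j, int k)
{
    for (int i = j; i < n; ++i)
        A[i * n + j] -= A[i * n + k] * A[j * n + k];
}

// cdiv(j): divide column j by the square root of its diagonal entry.
static void cdiv(int n, double *A, int j)
{
    double d = std::sqrt(A[j * n + j]);
    for (int i = j; i < n; ++i)
        A[i * n + j] /= d;
}

// Left-looking column Cholesky: the lower triangle of the row-major
// n x n matrix A is overwritten with the factor L.
void column_cholesky(int n, double *A)
{
    for (int j = 0; j < n; ++j) {
        for (int k = 0; k < j; ++k)
            cmod(n, A, j, k);
        cdiv(n, A, j);
    }
}
```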
Elimination Tree • T(A) has an edge between two vertices i and j, with i > j, if i = p(j), i.e., L(i, j) is the first non-zero entry below the diagonal in column j; i is the parent of j [Figure: a 10-node elimination tree for the example matrix]
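A sketch of computing the parent pointers directly from this definition, given the (filled) pattern of L in compressed-column form with row indices sorted within each column; the names are illustrative.

```cuda
// parent[j] = row index of the first nonzero below the diagonal in column j
// of L (the definition above); -1 marks a root of the elimination forest.
void etree_from_factor(int n, const int *Lcolptr, const int *Lrowidx, int *parent)
{
    for (int j = 0; j < n; ++j) {
        parent[j] = -1;
        for (int p = Lcolptr[j]; p < Lcolptr[j + 1]; ++p) {
            if (Lrowidx[p] > j) {        // first entry strictly below the diagonal
                parent[j] = Lrowidx[p];
                break;
            }
        }
    }
}
```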
Parallelization of Sparse Cholesky • Most of the parallel algorithms are based on elimination trees • Work associated with two disjoint subtrees can proceed independently • Same steps associated with sequential sparse factorization • One additional step: assignment of tasks to processors
Ordering • 2 issues: • Performing the ordering in parallel • Finding an ordering that will help parallelization in the subsequent steps
Ordering in Parallel – Nested dissection • Nested dissection can be carried out in parallel • It also leads to elimination trees that can be parallelized during the subsequent factorization • But parallelism is available only in the later levels of the dissection • Can be applied to only a limited class of problems • More later….
Ordering for Parallel Factorization
No agreed objective for ordering for parallel factorization: not all orderings that reduce fill-in provide scope for parallelization.
[Figure: a 7-node example in natural order whose elimination tree is a chain (7-6-5-4-3-2-1): no fill, but no scope for parallelization]
Example (Contd..)
[Figure: the same 7-node example under a nested dissection order and the resulting elimination tree: some fill, but scope for parallelization]
Ordering for parallel factorization – Tree restructuring • Decouple fill reducing ordering and ordering for parallel elimination • Determine a fill reducing ordering P of G(A) • Form the elimination tree T(PAPT) • Transform this tree T(PAPT) to one with smaller height and record the corresponding equivalent reordering, P’
Ordering for parallel factorization – Tree restructuring • Efficiency depends on whether such an equivalent reordering can be found • Also on the limitations of the initial ordering, P • Only minor modifications are made to the initial ordering, hence only limited improvement in parallelism • Algorithm by Liu (1989) based on elimination tree rotations to reduce the height • Algorithm by Jess and Kees (1982) based on chordal graphs to reduce the height
Height and Parallel Completion time • Not all elimination trees with minimum height give rise to small parallel completion times • Let each node v in the elimination tree be associated with a pair (x, y) • x – time[v], the time to factor column v • y – level[v] • level[v] = time[v] if v is the root of the elimination tree, and time[v] + level[parent of v] otherwise • level[v] represents the minimum time to completion starting at node v • Parallel completion time – the maximum level value among all nodes
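A small sketch computing the level values and the parallel completion time from the parent array of the elimination tree; it uses the fact that parent(j) > j, so a single sweep from the last column downward visits every parent before its children.

```cuda
#include <algorithm>

// level[j] = time[j] + level[parent[j]]  (level of a root = its time).
// Returns the parallel completion time = max level over all nodes.
double parallel_completion_time(int n, const int *parent, const double *time,
                                double *level)
{
    double tmax = 0.0;
    for (int j = n - 1; j >= 0; --j) {
        level[j] = time[j] + (parent[j] >= 0 ? level[parent[j]] : 0.0);
        tmax = std::max(tmax, level[j]);
    }
    return tmax;
}
```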
Height and Parallel Completion time
[Figure: two elimination trees on nodes a-i, each node annotated with its (time[v], level[v]) pair; the tree with the smaller maximum level has the smaller parallel completion time]
Minimization of Cost • Thus some algorithms (Weng-Yang Lin, J. of Supercomputing, 2003) pick, at each step of the ordering, the node with the minimum cost (a greedy approach)
Nested Dissection Algorithms • Use a graph partitioning heuristic to obtain a small edge separator of the graph • Transform the small edge separator into a small node separator • Number nodes of the separator last and recursively apply
K-L for ND • Form a random initial partition • Form an edge separator by applying K-L to the partitions P1 and P2 • Let V1 be the set of nodes in P1 incident on at least one edge of the separator set; similarly V2 • V1 U V2 is a wide node separator; V1 or V2 alone is a narrow node separator (Gilbert and Zmijewski, 1987)
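A minimal sketch of turning the K-L edge separator into node separators; the data layout is an assumption (side[v] gives the partition of vertex v, cutEdges lists the separator edges).

```cuda
#include <utility>
#include <vector>

// V1 = P1-endpoints of separator edges, V2 = P2-endpoints.
// V1 U V2 is the wide node separator; V1 or V2 alone is a narrow one.
void node_separators(const std::vector<std::pair<int, int>> &cutEdges,
                     const std::vector<int> &side,   // 0 = P1, 1 = P2
                     std::vector<int> &V1, std::vector<int> &V2)
{
    std::vector<char> in1(side.size(), 0), in2(side.size(), 0);
    for (const auto &e : cutEdges) {
        int ends[2] = { e.first, e.second };
        for (int v : ends) {
            if (side[v] == 0 && !in1[v]) { in1[v] = 1; V1.push_back(v); }
            if (side[v] == 1 && !in2[v]) { in2[v] = 1; V2.push_back(v); }
        }
    }
}
```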
Step 2: Mapping Problems onto Processors • Based on elimination trees • Various strategies to map columns to processors based on elimination trees • Two algorithms: • Subtree-to-Subcube • Bin-Pack by Geist and Ng
Naïve Strategy
[Figure: an elimination tree whose nodes are assigned to processors 0-3]