
yaSpMV: Yet Another SpMV Framework on GPUs


Presentation Transcript


  1. yaSpMV: Yet Another SpMV Framework on GPUs Shengen Yan, Chao Li, Yunquan Zhang, Huiyang Zhou

  2. Introduction • Sparse Matrix-Vector Multiplication (SpMV) • SpMV is a very important linear algebra algorithm • The serial implementation is quite simple:
     // A*x = y, where A is stored in the CSR format.
     for (i = 0; i < m; ++i) {
       double y0 = y[i];
       for (k = rowptr[i]; k < rowptr[i+1]; ++k)
         y0 += value[k] * x[column_index[k]];
       y[i] = y0;
     }
  • A lot of work has gone into optimizing SpMV on both CPUs and GPUs • Many storage formats have been proposed.

  3. Introduction • Parallel implementation faces two challenges • Bandwidth • the upper bound on the flop:byte ratio is 0.25 • Load imbalance • different numbers of non-zeros in different rows • worse on GPUs [figure: a small matrix-vector product A*x illustrating rows with different numbers of non-zeros]
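
A rough justification of the 0.25 bound (my own back-of-the-envelope arithmetic, assuming double-precision values and counting only the matrix value stream): each non-zero contributes one multiply and one add but requires at least 8 bytes of value data, so

    flop:byte <= 2 flops / 8 bytes = 0.25

Column indices, row pointers, and accesses to x and y only lower the ratio further, which is why SpMV is memory-bandwidth bound.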

  4. Executive Summary • BCCOO format • addresses the bandwidth challenge • Customized efficient segmented scan/sum • addresses the load imbalance problem • very efficient • Results (GTX 680) • vs. CUSPARSE V5.0: up to 229% and 65% on average improvement • vs. clSpMV: up to 195% and 70% on average improvement

  5. Outline • Introduction • Formats for SpMV • addressing the bandwidth challenge • Efficient Segmented Sum/Scan for SpMV • Auto-Tuning Framework • Experimentation • Conclusions

  6. COO format • Value = [3 6 9 5 1 4 7 2 3 5 4 7 1 3 8 4] • Row index = [0 0 0 1 1 1 2 2 2 2 3 3 3 3 3 3] • Column index = [2 6 7 2 3 6 4 5 6 7 0 1 4 5 6 7] • COO format of matrix A
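
For reference, a minimal serial COO SpMV in C (an illustrative sketch, not the paper's GPU code):

    // Serial SpMV for the COO format: y += A*x, where A is given as
    // (value[i], row_index[i], column_index[i]) triples, one per non-zero.
    void spmv_coo(int nnz, const double *value, const int *row_index,
                  const int *column_index, const double *x, double *y) {
        for (int i = 0; i < nnz; ++i)
            y[row_index[i]] += value[i] * x[column_index[i]];
    }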

  7-10. Blocked COO (BCOO) format (block size 2x2, built up over four animation slides) • Matrix A is partitioned into 2x2 blocks and only the non-zero blocks are stored • Value blocks = [3 0; 5 1], [6 9; 4 0], [0 0; 4 7], [7 2; 1 3], [3 5; 8 4] • Block row index = [0 0 1 1 1] • Block column index = [1 3 0 2 3]

  11. Blocked compressed COO (BCCOO) format (block size 2x2) • The block row index array is compressed into a bit flag array, stored as packed bits • Row index = [0 0 1 1 1] → Difference value = [0 1 0 0 1] → Bit Flag (flipped) = [1 0 1 1 0] • Compression ratio: 1/32 of a 32-bit integer per block • The value blocks and the block column index [1 3 0 2 3] are kept as in BCOO
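
A small sketch of how the bit flag array can be derived from the BCOO block row index array (my reading of the slide: a 0 flag marks the last block of a block row, and packing 32 such flags into one integer yields the 1/32 compression ratio; the function name is mine):

    // Build the (flipped) bit flag array from a BCOO block row index array.
    // flag[i] == 1 means block i+1 belongs to the same block row as block i;
    // flag[i] == 0 marks the last block of a block row (a segment boundary).
    void build_bit_flags(int num_blocks, const int *block_row, unsigned char *flag) {
        for (int i = 0; i < num_blocks; ++i)
            flag[i] = (i + 1 < num_blocks && block_row[i + 1] == block_row[i]) ? 1 : 0;
    }

For the example above, block_row = [0 0 1 1 1] produces flag = [1 0 1 1 0], matching the flipped bit flag on the slide.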

  12. Formats for SpMV • Extensions of the BCCOO format • BCCOO+ format • rearranges the non-zero blocks • relieves irregular accesses to the vector • Column index compression • applies a difference (delta) function to the column index array (see the sketch below)
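
As an illustration of the column index compression idea (a minimal sketch under my own assumptions, not necessarily the exact encoding used in the paper):

    // Delta-encode block column indices: out[0] = col[0], out[i] = col[i] - col[i-1].
    // Small deltas can then be stored in narrower integer types to save bandwidth.
    void delta_encode_columns(int n, const int *col, int *out) {
        int prev = 0;
        for (int i = 0; i < n; ++i) {
            out[i] = col[i] - prev;
            prev = col[i];
        }
    }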

  13. Example matrix • Assume there are 4 threads • Bit Flag = [1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0] BCCOO format of matrix B (block size 1x1)

  14. Auxiliary Information for SpMV: Result Entry • For each thread, find the location in the output array of the first result it generates, i.e., the row index that the thread's first result belongs to • This only requires counting the zeros in the bit flag chunks of the preceding threads • Bit Flag = [1 1 1 1 | 0 1 0 1 | 1 0 1 1 | 1 1 1 0] (Thread 0 | Thread 1 | Thread 2 | Thread 3) • Result entries: Thread 0 → 0, Thread 1 → 0, Thread 2 → 2, Thread 3 → 3
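
A serial sketch of this computation (assuming each thread owns a contiguous, equally sized chunk of the bit flag array; the function name is illustrative):

    // entry[t] = row index of the first result produced by thread t,
    // i.e. the number of 0 flags (row boundaries) owned by threads 0..t-1.
    void result_entries(int num_threads, int chunk_size,
                        const unsigned char *flag, int *entry) {
        int zeros = 0;
        for (int t = 0; t < num_threads; ++t) {
            entry[t] = zeros;
            for (int i = t * chunk_size; i < (t + 1) * chunk_size; ++i)
                if (flag[i] == 0) ++zeros;
        }
    }

With the bit flag array above and chunk_size = 4, this yields entries 0, 0, 2, 3 for threads 0 to 3.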

  15. Outline • Introduction • Formats for SpMV • Efficient Segmented Sum/Scan for SpMV • addressing the load imbalance problem • Auto-Tuning Framework • Experimentation • Conclusions

  16. Even workload partition • The non-zero blocks are partitioned evenly across workgroups and, within each workgroup, evenly across threads, so there is no workload imbalance [figure: non-zero blocks mapped to workgroups 0-3 and threads T0-T3]

  17. Efficient Segmented Sum/Scan for SpMV • Three logical steps • Read the data and multiply it with the corresponding vector values • Perform a segmented sum/scan using the bit flag array from our BCCOO/BCCOO+ format • Combine the results and write them back to global memory • All three steps are implemented in a single kernel

  18. Step 1: Read the data and multiply with vector values • Ex: 4 threads • Bit Flag = [1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0] BCCOO format of matrix B

  19. Step 1: Read the data and multiply with vector values • Ex: 4 threads • Problem: B*x = ? Assume x = [2 9 6 5 4 8 7 3] [figure: each thread reads its non-zero values and multiplies them by the vector elements selected by the column indices]

  20. Step 2: Segmented sum/scan • Three types of rows in our algorithm • All the non-zeros of a row fall within one thread → serial segmented sum/scan within the thread • A row spans multiple threads → additionally, a parallel segmented sum/scan among threads • A row spans multiple workgroups → additionally, cross-workgroup synchronization (details in the paper)

  21. Step 2: Segmented sum/scan • 1) Serial segmented sum/scan within each thread • Problem: B*x = ? Assume x = [2 9 6 5 4 8 7 3] • Serial segmented scan (with intermediate[-1] = 0): intermediate[i] = intermediate[i-1] * BitFlag[i-1] + intermediate[i] • Bit Flag = [1 1 1 1 | 0 1 0 1 | 1 0 1 1 | 1 1 1 0]; each thread scans its own chunk [figure: per-thread intermediate results; the slide shows the values 27, 5, 33, 36]
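
A plain-C rendering of that recurrence for one thread's chunk (a sketch only; the function name is mine, and the products are assumed to be already stored in intermediate[]):

    // Serial segmented inclusive scan over one thread's chunk of products.
    // A 0 bit flag at position i-1 ends a row and cuts the running sum.
    void serial_segmented_scan(int n, const unsigned char *flag, double *intermediate) {
        for (int i = 1; i < n; ++i)
            if (flag[i - 1])  // previous element belongs to the same row
                intermediate[i] += intermediate[i - 1];
    }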

  22-23. Step 2: Segmented sum/scan • 2) Generate each thread's last partial sum and perform a parallel segmented scan among threads over these last partial sums • Each thread also produces a head flag ("does a 0 exist in my bit flag chunk?") that delimits the segments of the cross-thread scan [figure: per-thread last partial sums (78, 36, 99, 0) and head flags; after the parallel segmented scan an accumulated value of 135 appears for a row that spans threads]
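
The cross-thread step can be written sequentially to show the logic (a sketch only; the paper performs it as a parallel segmented scan in on-chip memory, and the names below are mine):

    // Carry propagation over the per-thread last partial sums.
    // has_zero[t] is 1 if thread t's bit flag chunk contains a row boundary.
    // carry[t] is the value thread t must add to its first, still-open row.
    void cross_thread_carries(int num_threads, const double *last_partial,
                              const unsigned char *has_zero, double *carry) {
        carry[0] = 0.0;
        for (int t = 1; t < num_threads; ++t)
            carry[t] = last_partial[t - 1] + (has_zero[t - 1] ? 0.0 : carry[t - 1]);
    }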

  24. Step 3: Results combination and write-back to global memory • Problem: B*x = ? Assume x = [2 9 6 5 4 8 7 3] • Each thread adds the partial sum carried in from the segmented scan to its own per-row partial results and writes the combined results to the output array at the positions given by its result entry [figure: partial sums 78, 36, 99, 0 combined into the final output 105, 92, 33, 196]

  25. Auto-Tuning Framework • To generate the optimal kernel code, we employ auto-tuning to search for the best parameters • Average auto-tuning time: 13 seconds • Auto-tuning speed: ~1 million non-zeros per second [table: tunable parameters]

  26. Experiments • Experimental Methodology • We implemented our proposed scheme in OpenCL • We evaluated it on a GTX 480 and a GTX 680 using 20 real-world matrices • Comparison libraries • CUSPARSE V5.0 (NVIDIA's official SpMV library) • CUSP (SC '09) • clSpMV (ICS '12)

  27. Used Matrices [table: the 20 real-world matrices used in the evaluation]

  28-32. Performance results on Kepler (GTX 680) [figure: GFLOPS of yaSpMV vs. CUSPARSE, CUSP, and clSpMV across the test matrices] • Average performance improvement: 65% over CUSPARSE, 70% over clSpMV COCKTAIL, 88% over clSpMV best single, 150% over CUSP

  33-37. Performance breakdown on Kepler (GTX 680) [figure: GFLOPS as optimizations are added incrementally] • Average performance improvement vs. the COO format: +BCCOO: 66%; +Efficient segmented sum/scan: 192%; +Adjacent synchronization: 212%; +Fine-grain optimizations: 257%

  38. Relative memory footprint of different formats [figure] • Average memory footprint: vs. COO: 60%; vs. ELL: 19%; vs. Cocktail: 79%; vs. best single: 69%

  39. Conclusions • The BCCOO format • addresses the memory bandwidth problem • The customized matrix-based segmented sum/scan algorithms • address the workload imbalance problem • only one kernel invocation is needed • very efficient: many optimizations applied • Results (GTX 680) • vs. CUSPARSE V5.0: up to 229% and 65% on average improvement • vs. clSpMV: up to 195% and 70% on average improvement • Code is available online: http://code.google.com/p/yaspmv/

  40. Thanks & Questions?

  41-43. Backup: COO format (built up step by step) • Non-zeros of matrix A are appended one at a time to the value, row index, and column index arrays, e.g., value [3], row [0], column [2], then value [3 6], row [0 0], column [2 6], and so on

  44. Step 2: Segmented sum/scan • 3) Accumulate partial sums across workgroups using adjacent synchronization • Each workgroup (P0-P3) generates its partial sums, waits for the carry from its left neighbor, and then proceeds to step 3 (results combination) [figure: workgroups P0-P3 chained by adjacent synchronization]
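
A conceptual host-side model of adjacent synchronization (my own sketch using C11 atomics, not the paper's OpenCL kernel): each workgroup spins until its left neighbor publishes a carry, consumes it, and publishes its own carry for the right neighbor.

    #include <stdatomic.h>

    // Hypothetical per-workgroup routine: workgroup g waits for the carry of
    // workgroup g-1, derives its own carry-in, and publishes a carry-out.
    void publish_and_wait(int g, double my_last_partial, int my_has_zero,
                          _Atomic int *ready, double *carry_out, double *carry_in) {
        double in = 0.0;
        if (g > 0) {
            // Spin until the left neighbor has published its carry.
            while (atomic_load_explicit(&ready[g - 1], memory_order_acquire) == 0)
                ;
            in = carry_out[g - 1];
        }
        *carry_in = in;  // added to this workgroup's first open row in step 3
        // If this workgroup's chunk has no row boundary, the right neighbor
        // also needs everything received so far.
        carry_out[g] = my_last_partial + (my_has_zero ? 0.0 : in);
        atomic_store_explicit(&ready[g], 1, memory_order_release);
    }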

  45. Fine-grained optimizations • Texture memory for vector reads • Cut the adjacent synchronization chain as early as possible • Remove the parallel segmented scan when possible • If the number of columns is smaller than 65535, a short (16-bit) column index array may help decrease memory traffic

  46. Performance results on Fermi (GTX 480) [figure: GFLOPS] • Average performance improvement: 42% over CUSPARSE, 40% over clSpMV COCKTAIL, 60% over clSpMV best single, 74% over CUSP

  47. Absolute memory footprint of the COO, BCOO, and BCCOO formats [figure]
