Fast Sparse Matrix-Vector Multiplication on GPUs : Implications for Graph Mining

Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining Xintian Yang, SrinivasanParthasarathy and P. Sadayappan Department of Computer Science The Ohio State University

Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work

Introduction • Sparse Matrix-Vector Multiplication (SpMV) • y = Ax, where A is a sparse matrix and x is a dense vector. • Dominant cost when solving large-scale linear systems or eigenvalue problems in iterative methods. • Focus of much research • Scientific Applications, e.g. finite element method • Graph Mining algorithms • PageRank, Random Walk with Restart, HITS • Industrial Strength Efforts • CPUs, Clusters • GPUs (focus of this talk)

Why GPUs • High Performance • Figure on right • High Productivity • CUDA (now) vs. OpenGL (other complications)

Background: CUDA Architecture • Programming Model (logical hierarchy): • Grid • Block • Thread • Kernel

Background: CUDA Architecture • Hardware (Physical): • A set of multiprocessors • 8 processors and 1 Instruction Unit • A warp = 32 threads, concurrently run the same instructions • Conditional divergence • Different warps: time-shared • Memory System • Global memory: coalescing • Shared memory: 16KB per block • Constant/Texture memory • Constant value, cached • 16KB constant cache; 6~8KB texture cache per multiprocessor • Registers

Power-law Graphs and Challenges • Power-law graphs • Large number of nodes with low degree • Few nodes with very high degree • Challenges for GPU based computation of SpMV on such graphs • Load balancing • Inefficient memory access • Conditional divergence • Problem Statement • Can we do better than competing industrial strength efforts for processing matrices representing such graphs? • Does it yield end-to-end improvements in graph mining application (e.g. PageRank) ?

Key Insights from Benchmarking Texture cache size was not available Estimated to be 250 KB (=64,000 columns) Note entire X cannot fit on texture cache • Three kinds of memory accesses • Accesses to A, Accesses to x, and Writes to y • Previous methods have optimized accesses to A. • Observation 1: Each row accesses random elements in vector x • Observation 2: The accesses are non-coalesced – poor locality • Solution 1: Tile A by columns and store x on the texture cache

Tiling • Observation 3: Column lengths are power-law distribution • Many short columns, little re-use of x in these columns • No benefit from tiling • Solution 2: Reorder by column length and partially tile A • Un-tiled elements are computed separately.

Composite Storage of Tiles • Observation 4: Rows (non-zeros) in each tile follow power law! • Observation 5: Within each tile performance is limited by • load imbalance • non-coalesced global memory accesses • conditional thread divergence  serialization • Solution 3: Composite tile storage scheme • Basic observation from benchmarking study • Row major storage performs well on long rows (16 threads per row). • Column major storage performs well on short rows (1 thread per row).

Row and Column Composite Storage • Reorder the rows in each tile from long to short. • Rows are partitioned into workloads with similar size. • A thread warp is assigned to compute one workload. • A workload is a rectangle area of non-zeros with width w and height h • If w > h, row major storage • Else, column major storage

Experiments • Hardware Configuration • GPU: NVIDIA Tesla C1060, 30 multiprocessors, 240 processor cores, 4GB global memory • CPU: dual core Opteron 2.6GHz, 8GB of 667 MHz DDR2 main memory • All experiments are run with single process single GPU. Datasets

SpMV Kernel Power-law matrices

SpMV Kernel Unstructured matrices: non-power-law

Data Mining Applications • Given directed graph G = (V, E) , and adjacency matrix A • PageRank: • W is row normalization of A • c = 0.85, U is a n by n matrix with all elements set to 1/n. • Random Walk with Restart (RWR): given a query node i, compute the relevance score from all other nodes to node i. • W is column normalization of A • c = 0.9, the ith element in is 1, the others are all 0. • HITS: each web page is assigned an authority score and a hub score.

PageRank

Conclusions and Future Work • Architecture conscious optimizations for SpMV • Architecture features of GPU • Characteristics of graph mining applications • Significant performance improvement on power-law graph datasets. • Future work • Parameter auto-tuning based on non-zero distribution • Blocking and loop unrolling • Extension to distributed systems to handle larger datasets.

Thank you • Questions? • Acknowledgements: • Grants: • DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611 • NSF CAREER Grant IIS-0347662

Backup slides: additional experiment results

Random Walk with Restart

HITS

Outline • Motivation and Background • Limitations of Previous Approach • Methods • Experiments • Conclusions and Future work

Limitations of Previous Work CSR: Imbalanced workload amongst threads, non- coalesced memory accesses. CSR-vector: many short rows, waste of threads • NVIDIA’s SpMV Library based on different storage formats of matrix A. • CSR • CSR kernel • CSR-vector kernel • Optimized CSR-vector Baskaran et al.

Limitation of Previous Work warp0 warp1 COO: thread divergence, low thread level parallelism • COO kernel • Each warp works on one interval • Warps run in parallel • With in one warp, threads do binary reduction, need to check whether two operands are from the same row

Limitation of Previous Work ELL: long rows can’t be bounded HYB: ELL part only covers small amount of computation, COO part is slow, increasing the ratio of ELL part introduces memory overhead. • ELL kernel • Requires row lengths are bounded by a small number k, 0s are padded if a row is shorter than k. • Data and index matrices are stored in column major, each thread works on one row. • HYB kernel: ELL + COO

Fast Sparse Matrix-Vector Multiplication on GPUs : Implications for Graph Mining

Fast Sparse Matrix-Vector Multiplication on GPUs : Implications for Graph Mining

Presentation Transcript

Dense Matrix Algorithms

Scalable Data Mining

Chapter 2 Data Mining

Automatic Performance Tuning and Sparse-Matrix-Vector-Multiplication (SpMV)

Mining Billion-Node Graphs - Patterns and Algorithms

Data Mining Tools

Fast Food

Automatic Performance Tuning of Sparse Matrix Kernels: Recent Progress

Web Mining : A Bird ’ s Eye View

Graph

Sensor data mining and forecasting

Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Sparse Representation for Image Reconstruction, and Face Recognition?

Mining Billion-node Graphs: Patterns, Generators and Tools

Mining Complex Types of Data

Higher Unit 3

Statistical Models in S

Automatic Performance Tuning Sparse Matrix Kernels

Practicing Multiplication