
Fast Sparse Matrix-Vector Multiplication on GPUs : Implications for Graph Mining



Presentation Transcript


  1. Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining Xintian Yang, Srinivasan Parthasarathy and P. Sadayappan Department of Computer Science The Ohio State University

  2. Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work

  3. Introduction • Sparse Matrix-Vector Multiplication (SpMV) • y = Ax, where A is a sparse matrix and x is a dense vector. • Dominant cost when solving large-scale linear systems or eigenvalue problems in iterative methods. • Focus of much research • Scientific Applications, e.g. finite element method • Graph Mining algorithms • PageRank, Random Walk with Restart, HITS • Industrial Strength Efforts • CPUs, Clusters • GPUs (focus of this talk)
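
To make the operation concrete, below is a minimal CSR SpMV kernel in CUDA (one thread per row). It is an illustrative sketch of y = Ax, not the optimized kernel presented in this talk, and all identifiers are invented.

```cuda
// Minimal CSR SpMV sketch: y = A*x with one thread per row.
// Illustrative only -- not the optimized kernel described in this talk.
__global__ void spmv_csr_scalar(int n_rows,
                                const int *row_ptr,   // size n_rows + 1
                                const int *col_idx,   // size nnz
                                const float *vals,    // size nnz
                                const float *x,       // dense input vector
                                float *y)             // dense output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];           // gather from x
        y[row] = sum;
    }
}

// Host-side launch (sketch):
// spmv_csr_scalar<<<(n_rows + 255) / 256, 256>>>(n_rows, d_row_ptr,
//                                                d_col_idx, d_vals, d_x, d_y);
```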

  4. Why GPUs • High Performance (performance figure in the original slide) • High Productivity • CUDA (now) vs. OpenGL and other graphics APIs (previously, with added complications)

  5. Background: CUDA Architecture • Programming Model (logical hierarchy): • Grid • Block • Thread • Kernel

  6. Background: CUDA Architecture • Hardware (physical): • A set of multiprocessors • Each has 8 processors and 1 instruction unit • A warp = 32 threads that concurrently run the same instruction • Conditional divergence serializes branches within a warp • Different warps are time-shared • Memory system • Global memory: coalescing matters • Shared memory: 16KB per block • Constant/Texture memory • Read-only values, cached • 16KB constant cache; 6~8KB texture cache per multiprocessor • Registers
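
As a side illustration (not from the slides), the toy kernel below shows the two hazards just listed: coalesced vs. scattered global loads, and a data-dependent branch that makes a warp diverge. All names are invented.

```cuda
// Toy kernel contrasting coalesced and non-coalesced loads, plus a
// data-dependent branch that causes warp divergence. Illustrative only.
__global__ void access_patterns(const float *a, const int *perm,
                                float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = a[i];        // coalesced: consecutive threads, consecutive addresses
    float w = a[perm[i]];  // typically non-coalesced: gather through an index

    if (v > 0.0f)          // threads of one warp may take different paths:
        out[i] = v + w;    //   the warp executes both branches serially (divergence)
    else
        out[i] = w;
}
```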

  7. Power-law Graphs and Challenges • Power-law graphs • Large number of nodes with low degree • A few nodes with very high degree • Challenges for GPU-based computation of SpMV on such graphs • Load balancing • Inefficient memory access • Conditional divergence • Problem statement • Can we do better than competing industrial-strength efforts at processing matrices representing such graphs? • Does it yield end-to-end improvements in graph mining applications (e.g. PageRank)?

  8. Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work

  9. Key Insights from Benchmarking • Three kinds of memory accesses: accesses to A, accesses to x, and writes to y • Previous methods have optimized accesses to A • Observation 1: Each row accesses random elements of vector x • Observation 2: These accesses are non-coalesced and have poor locality • Solution 1: Tile A by columns and store x in the texture cache • Note: the texture cache size is not published; we estimate it at about 250 KB (= 64,000 columns), so the entire x cannot fit in the texture cache
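
Below is a sketch of the data path behind Solution 1, using the legacy texture-reference API available on Tesla C1060-class hardware (newer GPUs would use texture objects or __ldg). This is an assumption about the mechanism, not the paper's code, and the names are illustrative.

```cuda
// Sketch: read x through the texture cache (legacy texture-reference API,
// as exposed on Tesla C1060-class GPUs). Illustrative, not the paper's code.
texture<float, 1, cudaReadModeElementType> x_tex;   // bound to the tile of x

__global__ void spmv_csr_tex(int n_rows, const int *row_ptr,
                             const int *col_idx, const float *vals, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * tex1Dfetch(x_tex, col_idx[j]);  // cached gather from x
        y[row] = sum;
    }
}

// Host side (sketch): cudaBindTexture(NULL, x_tex, d_x_tile, tile_bytes);
```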

  10. Tiling • Observation 3: Column lengths follow a power-law distribution • Many short columns, with little re-use of x in those columns • No benefit from tiling them • Solution 2: Reorder columns by length and partially tile A • Untiled elements are computed separately.
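
A host-side sketch of the reordering step in Solution 2, assuming per-column non-zero counts are already available; the helper name and layout are hypothetical.

```cuda
// Host-side sketch (hypothetical helper): order columns from longest to
// shortest so that only the dense columns, where x is reused, get tiled.
#include <vector>
#include <numeric>
#include <algorithm>

std::vector<int> order_columns_by_length(const std::vector<int> &col_nnz)
{
    std::vector<int> order(col_nnz.size());
    std::iota(order.begin(), order.end(), 0);              // 0, 1, ..., n-1
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return col_nnz[a] > col_nnz[b]; });
    return order;   // tile a prefix of this order; compute the rest untiled
}
```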

  11. Composite Storage of Tiles • Observation 4: Row lengths (non-zero counts) in each tile also follow a power law • Observation 5: Within each tile, performance is limited by • load imbalance • non-coalesced global memory accesses • conditional thread divergence, which causes serialization • Solution 3: Composite tile storage scheme • Basic observations from the benchmarking study • Row-major storage performs well on long rows (16 threads per row). • Column-major storage performs well on short rows (1 thread per row).

  12. Row and Column Composite Storage • Reorder the rows in each tile from long to short. • Rows are partitioned into workloads of similar size. • A thread warp is assigned to compute one workload. • A workload is a rectangular area of non-zeros with width w and height h. • If w > h, use row-major storage; otherwise, use column-major storage (see the sketch below).
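
A minimal sketch of the layout rule just stated; the struct and names are invented for illustration.

```cuda
// Sketch of the per-workload layout decision: each warp gets a roughly
// equal-sized rectangle of non-zeros (width w, height h) from a tile.
struct Workload { int w; int h; };          // names are illustrative

enum Layout { ROW_MAJOR, COL_MAJOR };

inline Layout choose_layout(Workload wl)
{
    // Wide, shallow rectangles (a few long rows) -> row-major storage,
    // with several threads cooperating on each row.
    // Tall, narrow rectangles (many short rows) -> column-major storage,
    // one thread per row, giving coalesced accesses and no divergence.
    return (wl.w > wl.h) ? ROW_MAJOR : COL_MAJOR;
}
```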

  13. Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work

  14. Experiments • Hardware configuration • GPU: NVIDIA Tesla C1060, 30 multiprocessors, 240 processor cores, 4GB global memory • CPU: dual-core Opteron 2.6GHz, 8GB of 667 MHz DDR2 main memory • All experiments are run with a single process and a single GPU. • Datasets: see the table in the original slides.

  15. SpMV Kernel: power-law matrices (performance figures in the original slides)

  16. SpMV Kernel: unstructured (non-power-law) matrices (performance figures in the original slides)

  17. Data Mining Applications • Given a directed graph G = (V, E) and its adjacency matrix A • PageRank: • W is the row normalization of A • c = 0.85; U is an n-by-n matrix with all elements set to 1/n. • Random Walk with Restart (RWR): given a query node i, compute the relevance score from all other nodes to node i. • W is the column normalization of A • c = 0.9; the i-th element of the restart vector is 1, the others are all 0. • HITS: each web page is assigned an authority score and a hub score.
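
The slide's equation images did not survive the transcript; the standard iterative formulations consistent with the normalizations and constants described above are assumed below.

```latex
% PageRank (W = row-normalized A, c = 0.85, U = (1/n) \mathbf{1}\mathbf{1}^\top):
\mathbf{p}^{(t+1)} = c\,W^{\top}\mathbf{p}^{(t)} + (1-c)\,\tfrac{1}{n}\mathbf{1}

% Random Walk with Restart (W = column-normalized A, c = 0.9,
% e_i = restart vector with 1 at query node i and 0 elsewhere):
\mathbf{r}^{(t+1)} = c\,W\mathbf{r}^{(t)} + (1-c)\,\mathbf{e}_i

% HITS (authority and hub scores, normalized after each step):
\mathbf{a}^{(t+1)} = A^{\top}\mathbf{h}^{(t)}, \qquad
\mathbf{h}^{(t+1)} = A\,\mathbf{a}^{(t)}
```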

  18. PageRank

  19. Outline • Motivation and Background • Methods • Experiments • Conclusions and Future work

  20. Conclusions and Future Work • Architecture conscious optimizations for SpMV • Architecture features of GPU • Characteristics of graph mining applications • Significant performance improvement on power-law graph datasets. • Future work • Parameter auto-tuning based on non-zero distribution • Blocking and loop unrolling • Extension to distributed systems to handle larger datasets.

  21. Thank you • Questions? • Acknowledgements: • Grants: • DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611 • NSF CAREER Grant IIS-0347662

  22. Backup slides: additional experiment results

  23. Random Walk with Restart

  24. HITS

  25. Outline • Motivation and Background • Limitations of Previous Approach • Methods • Experiments • Conclusions and Future work

  26. Limitations of Previous Work • NVIDIA's SpMV library is based on different storage formats of matrix A • CSR format: CSR kernel and CSR-vector kernel • CSR kernel: imbalanced workload among threads, non-coalesced memory accesses • CSR-vector kernel: many short rows waste threads (most of each warp sits idle) • Optimized CSR-vector (Baskaran et al.)
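
For reference, here is a sketch in the style of the CSR-vector kernel (one 32-thread warp per row with an intra-warp reduction), which shows why short rows leave most of a warp idle. It follows the commonly published formulation, not necessarily the library's exact code, and assumes a pre-Volta GPU where implicit warp synchrony holds.

```cuda
// CSR-vector style sketch: one warp per row, strided gather, then an
// intra-warp tree reduction in shared memory. Illustrative only.
__global__ void spmv_csr_vector(int n_rows, const int *row_ptr,
                                const int *col_idx, const float *vals,
                                const float *x, float *y)
{
    __shared__ volatile float sdata[256];          // assumes blockDim.x == 256
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / 32;                           // one warp per row
    int lane = threadIdx.x & 31;

    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
            sum += vals[j] * x[col_idx[j]];        // strided gather within the row
        sdata[threadIdx.x] = sum;

        // Intra-warp tree reduction (relies on implicit warp synchrony,
        // valid on pre-Volta GPUs such as the Tesla C1060).
        if (lane < 16) sdata[threadIdx.x] += sdata[threadIdx.x + 16];
        if (lane <  8) sdata[threadIdx.x] += sdata[threadIdx.x +  8];
        if (lane <  4) sdata[threadIdx.x] += sdata[threadIdx.x +  4];
        if (lane <  2) sdata[threadIdx.x] += sdata[threadIdx.x +  2];
        if (lane <  1) sdata[threadIdx.x] += sdata[threadIdx.x +  1];

        if (lane == 0) y[row] = sdata[threadIdx.x];  // lane 0 holds the row sum
    }
}

// Host-side launch (sketch), one warp per row:
// spmv_csr_vector<<<(n_rows * 32 + 255) / 256, 256>>>(...);
```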

  27. Limitations of Previous Work • COO kernel: thread divergence, low thread-level parallelism • Each warp works on one interval of non-zeros • Warps run in parallel • Within one warp, threads perform a binary reduction and must check whether two operands come from the same row
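
As a much simpler illustration of the COO format only (this is not the library's warp-level segmented reduction described above), a kernel can let each thread handle one non-zero and accumulate with an atomic add; float atomics to global memory require newer hardware than the C1060.

```cuda
// Simplified COO SpMV using atomics, only to illustrate the format; the
// library kernel instead performs a warp-level segmented reduction.
// y must be zero-initialized beforehand (e.g., with cudaMemset).
__global__ void spmv_coo_atomic(int nnz, const int *row_idx, const int *col_idx,
                                const float *vals, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nnz)
        atomicAdd(&y[row_idx[i]], vals[i] * x[col_idx[i]]);
}
```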

  28. Limitations of Previous Work • ELL kernel • Requires row lengths to be bounded by a small number k; zeros are padded if a row is shorter than k. • Data and index matrices are stored in column-major order; each thread works on one row. • Limitation: long power-law rows cannot be bounded by a small k. • HYB kernel: ELL + COO • Limitation: the ELL part covers only a small amount of the computation, the COO part is slow, and increasing the ELL ratio introduces memory (padding) overhead.
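
A sketch of an ELL-style kernel matching the description above (data and index arrays padded to k entries per row, stored column-major, one thread per row); the names and the padding convention (-1 marking empty slots) are assumptions.

```cuda
// ELL-style SpMV sketch: padded, column-major storage, one thread per row.
// Padded slots are assumed to carry the sentinel column index -1.
__global__ void spmv_ell(int n_rows, int k, const int *col_idx,
                         const float *vals, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int i = 0; i < k; ++i) {
            int idx = i * n_rows + row;        // column-major layout
            int col = col_idx[idx];
            if (col >= 0)                      // skip padded slots
                sum += vals[idx] * x[col];
        }
        y[row] = sum;
    }
}
```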
