Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Xintian Yang, Srinivasan Parthasarathy and P. Sadayappan

Department of Computer Science

The Ohio State University


Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work


  • Sparse Matrix-Vector Multiplication (SpMV)

    • y = Ax, where A is a sparse matrix and x is a dense vector.

    • Dominant cost of the iterative methods used to solve large-scale linear systems and eigenvalue problems.

  • Focus of much research

    • Scientific Applications, e.g. finite element method

    • Graph Mining algorithms

      • PageRank, Random Walk with Restart, HITS

    • Industrial Strength Efforts

      • CPUs, Clusters

      • GPUs (focus of this talk)
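
The operation y = Ax defined above can be sketched in a few lines. This is a minimal, sequential illustration that assumes the matrix is stored in CSR form (one of the formats discussed in the backup slides); the array names `row_ptr`, `col_idx` and `values` follow common CSR convention and are not taken from the slides.

```python
# Minimal CSR sparse matrix-vector product y = A x (illustrative sketch).

def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Non-zeros of this row live in positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]  # irregular read of x
    return y

# A = [[10, 0, 0],
#      [ 0, 0, 2],
#      [ 3, 4, 0]]
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 1]
values = [10.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [10.0, 6.0, 11.0]
```

Only the non-zeros are stored and touched, which is why the cost is driven by memory access patterns rather than arithmetic.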

Why GPUs

  • High Performance

    • Figure on right

  • High Productivity

    • CUDA (now) vs. OpenGL (earlier; extra complications)

Background: CUDA Architecture

  • Programming Model (logical hierarchy):

    • Grid

    • Block

    • Thread

    • Kernel

Background: CUDA Architecture

  • Hardware (Physical):

    • A set of multiprocessors

    • Each multiprocessor has 8 processors and 1 instruction unit

    • A warp = 32 threads that concurrently execute the same instruction

    • Conditional divergence

    • Different warps: time-shared

  • Memory System

    • Global memory: coalescing

    • Shared memory: 16KB per block

    • Constant/Texture memory

      • Constant value, cached

      • 16KB constant cache; 6 to 8KB texture cache per multiprocessor

    • Registers

Power-law Graphs and Challenges

  • Power-law graphs

    • Large number of nodes with low degree

    • Few nodes with very high degree

  • Challenges for GPU-based computation of SpMV on such graphs

    • Load balancing

    • Inefficient memory access

    • Conditional divergence

  • Problem Statement

    • Can we do better than competing industrial strength efforts for processing matrices representing such graphs?

    • Does it yield end-to-end improvements in graph mining applications (e.g. PageRank)?
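
The load-balancing challenge can be made concrete with a rough model. The sketch below assumes one thread per row and that a warp stays busy for as long as its longest row; this is a simplification, and the function name and the example degree lists are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the background slides

def warp_utilization(row_lengths, warp_size=WARP_SIZE):
    """Fraction of issued thread-steps that do useful work when each
    thread processes one row (simplified model, not a measurement)."""
    useful = sum(row_lengths)
    issued = 0
    for start in range(0, len(row_lengths), warp_size):
        warp = row_lengths[start:start + warp_size]
        issued += warp_size * max(warp)  # every lane waits for the longest row
    return useful / issued

balanced = [8] * 32            # every row has 8 non-zeros
skewed = [1] * 31 + [1000]     # power-law-like: one very long row

print(warp_utilization(balanced))  # 1.0
print(warp_utilization(skewed))    # roughly 0.03
```

In this model a single high-degree node leaves about 97% of the warp's thread-steps idle, which is the imbalance the problem statement refers to.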


Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work

Key Insights from Benchmarking

Texture cache size was not available; it is estimated to be 250 KB (= 64,000 columns)

Note: the entire x cannot fit in the texture cache

  • Three kinds of memory accesses

    • Accesses to A, Accesses to x, and Writes to y

    • Previous methods have optimized accesses to A.

      • Observation 1: Each row accesses random elements in vector x

      • Observation 2: The accesses are non-coalesced – poor locality

    • Solution 1: Tile A by columns and keep each tile's segment of x in the texture cache
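
Solution 1 can be illustrated with a small sketch. It assumes 4-byte (single-precision) elements of x, under which the estimated 250 KB cache corresponds to the 64,000 columns quoted above. The function names are hypothetical, and a Python list slice stands in for the cached segment of x.

```python
def tile_width_from_cache(cache_bytes, bytes_per_element=4):
    """Number of x elements that fit in a cache of the given size."""
    return cache_bytes // bytes_per_element

def split_into_column_tiles(entries, tile_width):
    """Group (row, col, value) triples by column tile index."""
    tiles = {}
    for row, col, value in entries:
        tiles.setdefault(col // tile_width, []).append((row, col, value))
    return tiles

def spmv_tiled(entries, x, num_rows, tile_width):
    """y = A x computed one column tile at a time."""
    y = [0.0] * num_rows
    tiles = split_into_column_tiles(entries, tile_width)
    for tile_id in sorted(tiles):
        lo = tile_id * tile_width
        x_segment = x[lo:lo + tile_width]  # the only part of x this tile reads
        for row, col, value in tiles[tile_id]:
            y[row] += value * x_segment[col - lo]
    return y

estimated_width = tile_width_from_cache(250 * 1024)
print(estimated_width)  # 64000

entries = [(0, 0, 1.0), (0, 5, 2.0), (1, 2, 3.0), (2, 4, 4.0), (2, 5, 5.0)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_tiled = spmv_tiled(entries, x, num_rows=3, tile_width=2)
print(y_tiled)  # [13.0, 9.0, 50.0]
```

Each tile contributes partial sums to y, so a row that spans several tiles is updated once per tile.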


  • Observation 3: Column lengths follow a power-law distribution

    • Many short columns, little re-use of x in these columns

      • No benefit from tiling

    • Solution 2: Reorder by column length and partially tile A

    • Un-tiled elements are computed separately.
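
A possible reading of Solution 2 is sketched below: columns are sorted by their non-zero count, the longest ones are grouped into a fixed number of tiles, and everything else is left un-tiled. The parameter `num_tiles` and the tie-breaking rule are assumptions made for the illustration; the slides do not say how the cut-off is chosen.

```python
from collections import Counter

def partial_tiling(entries, tile_width, num_tiles):
    """Reorder columns by length (longest first) and tile only the
    first num_tiles * tile_width of them. Hypothetical helper."""
    col_len = Counter(col for _, col, _ in entries)
    # Longest columns first; ties broken by column index for determinism.
    order = sorted(col_len, key=lambda c: (-col_len[c], c))
    new_pos = {col: pos for pos, col in enumerate(order)}

    tiled_cols = tile_width * num_tiles
    tiles = [[] for _ in range(num_tiles)]
    untiled = []
    for row, col, value in entries:
        pos = new_pos[col]
        if pos < tiled_cols:
            tiles[pos // tile_width].append((row, col, value))
        else:
            untiled.append((row, col, value))  # handled by a separate pass
    return order, tiles, untiled

# Column 3 has three non-zeros, column 0 has two, columns 1 and 2 have one each.
entries = [(0, 3, 1.0), (1, 3, 1.0), (2, 3, 1.0),
           (0, 0, 1.0), (3, 0, 1.0), (1, 1, 1.0), (2, 2, 1.0)]
order, tiles, untiled = partial_tiling(entries, tile_width=1, num_tiles=2)
print(order)                    # [3, 0, 1, 2]
print([len(t) for t in tiles])  # [3, 2]
print(len(untiled))             # 2
```

The long columns, where elements of x are re-used many times, end up in tiles; the short tail gains nothing from caching and is kept out of them.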

Composite Storage of Tiles

  • Observation 4: Row lengths (non-zeros per row) in each tile follow a power law!

  • Observation 5: Within each tile performance is limited by

    • load imbalance

    • non-coalesced global memory accesses

    • conditional thread divergence → serialization

  • Solution 3: Composite tile storage scheme

    • Basic observation from benchmarking study

      • Row major storage performs well on long rows (16 threads per row).

      • Column major storage performs well on short rows (1 thread per row).

Row and Column Composite Storage

  • Reorder the rows in each tile from long to short.

  • Rows are partitioned into workloads of similar size.

  • A thread warp is assigned to compute one workload.

  • A workload is a rectangular area of non-zeros with width w and height h

    • If w > h, row major storage

    • Else, column major storage
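
The steps above can be sketched as follows. The rule for choosing the storage order (w > h) comes from the slide; the way rows are grouped here, using a target non-zero budget `workload_size` and padding each group to the width of its longest row, is an assumption made only for the illustration.

```python
def build_workloads(row_lengths, workload_size):
    """Return a list of (w, h, layout) rectangles, one per warp.
    The grouping heuristic is hypothetical, not the authors' method."""
    lengths = sorted(row_lengths, reverse=True)  # reorder rows from long to short
    workloads = []
    start = 0
    while start < len(lengths) and lengths[start] > 0:
        w = lengths[start]                       # width: longest row in the group
        h = max(1, workload_size // w)           # height: rows that fit the budget
        h = min(h, len(lengths) - start)
        layout = "row-major" if w > h else "column-major"
        workloads.append((w, h, layout))
        start += h
    return workloads

row_lengths = [1, 1, 2, 64, 1, 2, 1, 1, 16, 1]
workloads = build_workloads(row_lengths, workload_size=32)
for w, h, layout in workloads:
    print(w, h, layout)
# 64 1 row-major
# 16 2 row-major
# 2 7 column-major
```

Long rows form wide, short rectangles that suit row-major storage, while the many short rows form tall, narrow rectangles that suit column-major storage with one thread per row.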


Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work


  • Hardware Configuration

    • GPU: NVIDIA Tesla C1060, 30 multiprocessors, 240 processor cores, 4GB global memory

    • CPU: dual core Opteron 2.6GHz, 8GB of 667 MHz DDR2 main memory

    • All experiments are run as a single process on a single GPU.


SpMV Kernel: Power-law matrices

SpMV Kernel: Unstructured (non-power-law) matrices

Data Mining Applications

  • Given directed graph G = (V, E) , and adjacency matrix A

  • PageRank:

    • W is the row normalization of A

    • c = 0.85; U is an n-by-n matrix with all elements set to 1/n.

  • Random Walk with Restart (RWR): given a query node i, compute the relevance score from all other nodes to node i.

    • W is the column normalization of A

    • c = 0.9; the i-th element of the restart vector is 1, and all others are 0.

  • HITS: each web page is assigned an authority score and a hub score.
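
The update formulas for these algorithms are not reproduced in the transcript. As a hedged illustration of why SpMV dominates their cost, the sketch below assumes the standard PageRank power iteration p ← c·Wᵀp + (1 − c)/n, which is what the definitions of W, c and U give when p sums to 1. The helper names are hypothetical, and the example graph has no dangling nodes, so no special handling is included.

```python
def row_normalize(edges):
    """Return W as (row, col, weight) triples; each row sums to 1."""
    out_degree = {}
    for src, _ in edges:
        out_degree[src] = out_degree.get(src, 0) + 1
    return [(src, dst, 1.0 / out_degree[src]) for src, dst in edges]

def spmv_transpose(entries, p, n):
    """y = W^T p for W given as (row, col, weight) triples."""
    y = [0.0] * n
    for row, col, weight in entries:
        y[col] += weight * p[row]
    return y

def pagerank(edges, n, c=0.85, iterations=50):
    """Power iteration; one SpMV per iteration is the dominant cost."""
    w = row_normalize(edges)
    p = [1.0 / n] * n
    for _ in range(iterations):
        wp = spmv_transpose(w, p, n)
        p = [c * v + (1.0 - c) / n for v in wp]
    return p

# Directed edges (source, destination); every node has at least one out-edge.
edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
ranks = pagerank(edges, n=3)
print(ranks)
```

Node 1 receives links from both other nodes and ends up with the highest score. RWR and HITS follow the same pattern of repeated sparse matrix-vector products with different normalizations of A.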



Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work

Conclusions and Future Work

  • Architecture-conscious optimizations for SpMV

    • Architecture features of GPU

    • Characteristics of graph mining applications

  • Significant performance improvement on power-law graph datasets.

  • Future work

    • Parameter auto-tuning based on non-zero distribution

    • Blocking and loop unrolling

    • Extension to distributed systems to handle larger datasets.

Thank you

  • Questions?

  • Acknowledgements:

    • Grants:

      • DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611

      • NSF CAREER Grant IIS-0347662

Backup slides: additional experiment results

Random Walk with Restart



Outline

  • Motivation and Background

  • Limitations of Previous Approach

  • Methods

  • Experiments

  • Conclusions and Future work

Limitations of Previous Work

CSR: imbalanced workload among threads, non-coalesced memory accesses.

CSR-vector: many short rows, so threads are wasted.

  • NVIDIA’s SpMV library provides kernels based on different storage formats of matrix A.

    • CSR

      • CSR kernel

      • CSR-vector kernel

      • Optimized CSR-vector (Baskaran et al.)

Limitations of Previous Work

COO: thread divergence, low thread-level parallelism

  • COO kernel

    • Each warp works on one interval

    • Warps run in parallel

    • Within one warp, threads perform a binary reduction and must check whether the two operands come from the same row
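
The same-row check in the reduction can be illustrated with a segmented scan, a closely related technique that is used here as a stand-in. The sketch simulates the lanes of one warp sequentially and assumes the COO entries are sorted by row; the function names are hypothetical.

```python
def segmented_scan_by_row(rows, products):
    """Inclusive scan of products that never adds across a row boundary."""
    vals = list(products)
    stride = 1
    while stride < len(vals):
        prev = list(vals)  # all lanes read before any lane writes
        for lane in range(stride, len(vals)):
            if rows[lane] == rows[lane - stride]:  # the same-row check
                vals[lane] = prev[lane] + prev[lane - stride]
        stride *= 2
    return vals

def row_sums(rows, products):
    """Pick the total of each row from the last lane that belongs to it."""
    vals = segmented_scan_by_row(rows, products)
    sums = {}
    for lane, row in enumerate(rows):
        if lane == len(rows) - 1 or rows[lane + 1] != row:
            sums[row] = vals[lane]
    return sums

# rows[k] is the row index of the k-th non-zero, products[k] = value * x[col].
rows = [0, 0, 0, 1, 2, 2, 2, 2]
products = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
sums = row_sums(rows, products)
print(sums)  # {0: 6.0, 1: 4.0, 2: 26.0}
```

The comparison inside the loop is a data-dependent branch, so lanes of the same warp take different paths, which is the thread divergence noted above.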

Limitations of Previous Work

ELL: long rows can’t be bounded

HYB: the ELL part covers only a small amount of the computation and the COO part is slow; increasing the ratio of the ELL part introduces memory overhead.

  • ELL kernel

    • Requires that row lengths be bounded by a small number k; rows shorter than k are padded with zeros.

    • Data and index matrices are stored in column-major order; each thread works on one row.

  • HYB kernel: ELL + COO
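
The ELL layout described above can be sketched as follows. This is a sequential illustration with hypothetical function names; padded slots store the value 0 and, as an assumption, column index 0 so that they remain safe to read.

```python
def rows_to_ell(rows, k, pad_col=0):
    """rows: list of lists of (col, value). Returns column-major
    data and index arrays of size k * n, padded with zeros."""
    if any(len(r) > k for r in rows):
        raise ValueError("a row exceeds the bound k")
    n = len(rows)
    data = [0.0] * (k * n)
    index = [pad_col] * (k * n)
    for r, row in enumerate(rows):
        for j, (col, value) in enumerate(row):
            data[j * n + r] = value   # slot j of all rows is contiguous
            index[j * n + r] = col
    return data, index

def spmv_ell(data, index, x, n, k):
    """y = A x; the outer loop corresponds to one GPU thread per row."""
    y = [0.0] * n
    for r in range(n):
        for j in range(k):
            y[r] += data[j * n + r] * x[index[j * n + r]]
    return y

# Same 3 x 3 matrix as a list of rows; the longest row has 2 non-zeros.
rows = [[(0, 10.0)], [(2, 2.0)], [(0, 3.0), (1, 4.0)]]
data, index = rows_to_ell(rows, k=2)
print(data)   # [10.0, 2.0, 3.0, 0.0, 0.0, 4.0]
print(index)  # [0, 2, 0, 0, 0, 1]
print(spmv_ell(data, index, [1.0, 2.0, 3.0], n=3, k=2))  # [10.0, 6.0, 11.0]
```

Two of the six stored slots are padding here. On a power-law matrix, k would have to match the longest row, which is why ELL alone does not fit and HYB moves the overflow to COO.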
