
Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining



Presentation Transcript



Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Xintian Yang, Srinivasan Parthasarathy and P. Sadayappan

Department of Computer Science

The Ohio State University



Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work



Introduction

  • Sparse Matrix-Vector Multiplication (SpMV)

    • y = Ax, where A is a sparse matrix and x is a dense vector (a minimal reference sketch follows this list).

    • Dominant cost when solving large-scale linear systems or eigenvalue problems in iterative methods.

  • Focus of much research

    • Scientific Applications, e.g. finite element method

    • Graph Mining algorithms

      • PageRank, Random Walk with Restart, HITS

    • Industrial Strength Efforts

      • CPUs, Clusters

      • GPUs (focus of this talk)
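
For concreteness, here is a minimal, generic reference implementation of y = Ax for a sparse A stored in the common CSR (compressed sparse row) format. This is a sketch to fix notation, not code from the presentation:

  // y = A * x, with A in CSR format:
  //   row_ptr : n_rows + 1 offsets into col_idx / vals
  //   col_idx : column index of each non-zero
  //   vals    : value of each non-zero
  #include <vector>

  void spmv_csr_reference(int n_rows,
                          const std::vector<int>& row_ptr,
                          const std::vector<int>& col_idx,
                          const std::vector<float>& vals,
                          const std::vector<float>& x,
                          std::vector<float>& y)
  {
      for (int r = 0; r < n_rows; ++r) {
          float sum = 0.0f;
          for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
              sum += vals[k] * x[col_idx[k]];   // irregular, data-dependent reads of x
          y[r] = sum;
      }
  }

In iterative methods such as PageRank, this product is recomputed every iteration, which is why its cost dominates.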



Why GPUs

  • High Performance

    • Peak throughput and memory bandwidth well above contemporary CPUs (GPU vs. CPU comparison figure in the original slides)

  • High Productivity

    • CUDA today vs. earlier GPGPU programming through graphics APIs such as OpenGL, which added many complications



Background: CUDA Architecture

  • Programming Model (logical hierarchy):

    • Grid

    • Block

    • Thread

    • Kernel
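
As a minimal illustration of this hierarchy (not taken from the slides): a kernel is launched over a grid of blocks, each block holds threads, and each thread derives a unique global index. The vector-add example and the block size of 256 are arbitrary choices:

  // Kernel: runs on the GPU, one instance per thread.
  __global__ void vec_add(const float* a, const float* b, float* c, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // block and thread IDs -> global index
      if (i < n)
          c[i] = a[i] + b[i];
  }

  // Host: launch a grid with enough 256-thread blocks to cover n elements.
  // vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);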



Background: CUDA Architecture

  • Hardware (Physical):

    • A set of multiprocessors

    • Each multiprocessor: 8 scalar processors and 1 instruction unit

    • A warp = 32 threads that execute the same instruction concurrently

    • Conditional divergence

    • Different warps: time-shared

  • Memory System

    • Global memory: coalescing (illustrated after this list)

    • Shared memory: 16KB per block

    • Constant/Texture memory

      • Constant value, cached

      • 16KB constant cache; 6~8KB texture cache per multiprocessor

    • Registers
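
A small illustrative sketch of the coalescing point above: when the 32 threads of a warp read consecutive addresses, the loads are combined into a few wide transactions; strided access breaks this. The two kernels below exist only to contrast the patterns:

  // Coalesced: thread i reads element i, so a warp touches one contiguous segment.
  __global__ void read_coalesced(const float* in, float* out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i];
  }

  // Non-coalesced: neighbouring threads read addresses 'stride' elements apart,
  // so the warp's loads are scattered over many separate transactions.
  __global__ void read_strided(const float* in, float* out, int n, int stride)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      size_t j = ((size_t)i * stride) % n;
      if (i < n) out[i] = in[j];
  }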



Power-law Graphs and Challenges

  • Power-law graphs

    • Large number of nodes with low degree

    • Few nodes with very high degree

  • Challenges for GPU-based computation of SpMV on such graphs

    • Load balancing

    • Inefficient memory access

    • Conditional divergence

  • Problem Statement

    • Can we do better than competing industrial strength efforts for processing matrices representing such graphs?

    • Does it yield end-to-end improvements in graph mining applications (e.g., PageRank)?



Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work



Key Insights from Benchmarking

Note: the texture cache size is not documented; it was estimated empirically at about 250 KB (≈64,000 columns), so the entire vector x cannot fit in the texture cache.

  • Three kinds of memory accesses

    • Accesses to A, Accesses to x, and Writes to y

    • Previous methods have optimized accesses to A.

      • Observation 1: Each row accesses random elements in vector x

      • Observation 2: The accesses are non-coalesced – poor locality

    • Solution 1: Tile A by columns and keep x in texture memory so that reads of x hit the texture cache
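
A sketch of what Solution 1 can look like in code of that hardware generation: the vector x is bound to a 1-D texture so that reads of x go through the texture cache. The legacy texture-reference API shown here (texture, cudaBindTexture, tex1Dfetch) is what was available on C1060-class GPUs and has since been deprecated; the column-tiling itself (launching over one column range of A at a time so the needed slice of x stays cache-resident) is omitted. Illustrative only, not the authors' implementation:

  // Legacy 1-D texture reference over the dense vector x.
  texture<float, 1, cudaReadModeElementType> tex_x;

  __global__ void spmv_csr_x_in_texture(int n_rows, const int* row_ptr,
                                        const int* col_idx, const float* vals,
                                        float* y)
  {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < n_rows) {
          float sum = 0.0f;
          for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
              sum += vals[k] * tex1Dfetch(tex_x, col_idx[k]);  // x read through the texture cache
          y[row] = sum;
      }
  }

  // Host side (sizes/pointers illustrative):
  // cudaBindTexture(0, tex_x, d_x, n_cols * sizeof(float));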



Tiling

  • Observation 3: Column lengths follow a power-law distribution

    • Many short columns, little re-use of x in these columns

      • No benefit from tiling

    • Solution 2: Reorder by column length and partially tile A

    • Un-tiled elements are computed separately.
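
A host-side sketch of the kind of preprocessing Solution 2 implies: measure column lengths, reorder columns from long to short, and keep only the heaviest columns in the tiled part. The threshold and data structures are hypothetical, not the paper's exact scheme:

  #include <algorithm>
  #include <numeric>
  #include <vector>

  // Decide which columns are worth tiling. col_len[j] is the number of
  // non-zeros in column j; columns with too little reuse of x[j] stay untiled.
  std::vector<int> pick_tiled_columns(const std::vector<int>& col_len,
                                      int min_reuse /* hypothetical threshold */)
  {
      std::vector<int> order(col_len.size());
      std::iota(order.begin(), order.end(), 0);
      // Reorder columns from longest (most reuse of x) to shortest.
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return col_len[a] > col_len[b]; });
      std::vector<int> tiled;
      for (int j : order)
          if (col_len[j] >= min_reuse) tiled.push_back(j);
      return tiled;   // the remaining columns go to the separate, un-tiled kernel
  }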



Composite Storage of Tiles

  • Observation 4: Row lengths (non-zeros per row) in each tile also follow a power law!

  • Observation 5: Within each tile performance is limited by

    • load imbalance

    • non-coalesced global memory accesses

    • conditional thread divergence → serialization

  • Solution 3: Composite tile storage scheme

    • Basic observation from benchmarking study

      • Row major storage performs well on long rows (16 threads per row).

      • Column major storage performs well on short rows (1 thread per row).



Row and Column Composite Storage

  • Reorder the rows in each tile from long to short.

  • Rows are partitioned into workloads of similar size.

  • A thread warp is assigned to compute one workload.

  • A workload is a rectangular area of non-zeros with width w and height h

    • If w > h, row major storage

    • Else, column major storage
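
A simplified host-side sketch of the decision just described: rows (already sorted from long to short within a tile) are packed into roughly equal-sized workloads, and each workload records whether its non-zeros should be laid out row major (wide, shallow rectangles) or column major (tall, narrow rectangles). Names and the packing heuristic are illustrative; the paper's actual scheme may differ in its details:

  #include <algorithm>
  #include <vector>

  struct Workload {
      int first_row, num_rows;   // h: rows covered by this rectangle
      int width;                 // w: (maximum) row length inside the rectangle
      bool row_major;            // storage order chosen for this rectangle
  };

  // row_len must be sorted in non-increasing order; target is the approximate
  // number of non-zeros each warp should process.
  std::vector<Workload> pack_workloads(const std::vector<int>& row_len, int target)
  {
      std::vector<Workload> out;
      int r = 0, n = (int)row_len.size();
      while (r < n) {
          int w = row_len[r];                              // widest row in this rectangle
          int h = std::max(1, w > 0 ? target / w : 1);     // rows needed for ~target non-zeros
          if (r + h > n) h = n - r;
          // Wide-and-shallow rectangles: row major (several threads per row);
          // tall-and-narrow rectangles: column major (one thread per row).
          out.push_back({r, h, w, w > h});
          r += h;
      }
      return out;
  }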



Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work



Experiments

  • Hardware Configuration

    • GPU: NVIDIA Tesla C1060, 30 multiprocessors, 240 processor cores, 4GB global memory

    • CPU: dual-core Opteron, 2.6 GHz, with 8 GB of 667 MHz DDR2 main memory

    • All experiments are run with a single process and a single GPU.

Datasets



SpMV Kernel

Power-law matrices



SpMV Kernel

Unstructured matrices: non-power-law



Data Mining Applications

  • Given directed graph G = (V, E) , and adjacency matrix A

  • PageRank:

    • W is row normalization of A

    • c = 0.85; U is an n × n matrix with all elements set to 1/n.

  • Random Walk with Restart (RWR): given a query node i, compute the relevance score from all other nodes to node i.

    • W is column normalization of A

    • c = 0.9; the i-th element of the restart vector is 1, all others are 0.

  • HITS: each web page is assigned an authority score and a hub score.
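
The update formulas on this slide were images and did not survive the transcript; the following are standard reconstructions consistent with the definitions above (the slides' exact notation may differ):

  PageRank:  $p^{(t+1)} = \bigl(c\,W + (1-c)\,U\bigr)^{\top} p^{(t)}$, with W the row-normalized A and $U_{ij} = 1/n$.
  RWR:       $r_i^{(t+1)} = c\,\widetilde{W}\, r_i^{(t)} + (1-c)\, e_i$, with $\widetilde{W}$ the column-normalized A and $e_i$ the indicator vector of query node i.
  HITS:      $a^{(t+1)} = A^{\top} h^{(t)}$ and $h^{(t+1)} = A\, a^{(t+1)}$, with both vectors renormalized each iteration.

Each iteration of all three algorithms is dominated by one or two SpMV operations on the (normalized) adjacency matrix, which is why kernel-level SpMV speedups translate into end-to-end speedups.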



PageRank



Outline

  • Motivation and Background

  • Methods

  • Experiments

  • Conclusions and Future work



Conclusions and Future Work

  • Architecture conscious optimizations for SpMV

    • Architecture features of GPU

    • Characteristics of graph mining applications

  • Significant performance improvement on power-law graph datasets.

  • Future work

    • Parameter auto-tuning based on non-zero distribution

    • Blocking and loop unrolling

    • Extension to distributed systems to handle larger datasets.



Thank you

  • Questions?

  • Acknowledgements:

    • Grants:

      • DOE Early Career Principal Investigator Award

        No. DE-FG02-04ER25611

      • NSF CAREER Grant IIS-0347662



Backup Slides: Additional Experiment Results



Random Walk with Restart



HITS



Outline

  • Motivation and Background

  • Limitations of Previous Approach

  • Methods

  • Experiments

  • Conclusions and Future work



Limitations of Previous Work

CSR: imbalanced workload among threads, non-coalesced memory accesses.

CSR-vector: many short rows waste most of the threads in each warp.

  • NVIDIA’s SpMV Library based on different storage formats of matrix A.

    • CSR

      • CSR kernel

      • CSR-vector kernel

      • Optimized CSR-vector (Baskaran et al.)
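
Simplified sketches in the spirit of the scalar CSR and CSR-vector kernels named above (block sizes, caching of x, and other library details omitted); they illustrate why the callouts at the top of this slide hold:

  // CSR (scalar): one thread per row. In a power-law matrix the rows handled by
  // one warp have very different lengths (load imbalance), and neighbouring
  // threads read vals/col_idx at unrelated offsets (non-coalesced).
  __global__ void spmv_csr_scalar(int n_rows, const int* row_ptr,
                                  const int* col_idx, const float* vals,
                                  const float* x, float* y)
  {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < n_rows) {
          float sum = 0.0f;
          for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
              sum += vals[k] * x[col_idx[k]];
          y[row] = sum;
      }
  }

  // CSR-vector: one 32-thread warp per row, partial sums combined in shared
  // memory. Reads of long rows are coalesced, but on the many short rows of a
  // power-law matrix most lanes of the warp have nothing to do.
  __global__ void spmv_csr_vector(int n_rows, const int* row_ptr,
                                  const int* col_idx, const float* vals,
                                  const float* x, float* y)
  {
      __shared__ volatile float sums[256];          // assumes blockDim.x == 256
      int tid  = blockIdx.x * blockDim.x + threadIdx.x;
      int row  = tid / 32;                          // one warp per row
      int lane = tid & 31;
      if (row >= n_rows) return;
      float sum = 0.0f;
      for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += 32)
          sum += vals[k] * x[col_idx[k]];
      sums[threadIdx.x] = sum;
      // Warp-synchronous tree reduction, as in the pre-Fermi-era kernels;
      // current GPUs would additionally need __syncwarp().
      for (int off = 16; off > 0; off /= 2)
          if (lane < off) sums[threadIdx.x] += sums[threadIdx.x + off];
      if (lane == 0) y[row] = sums[threadIdx.x];
  }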



Limitation of Previous Work

COO: thread divergence, low thread level parallelism

  • COO kernel

    • Each warp works on one interval

    • Warps run in parallel

    • Within one warp, threads perform a binary (segmented) reduction and must check whether the two operands come from the same row
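
For reference, the COO layout itself, shown with a deliberately simplified kernel: one thread per non-zero, accumulated into y with an atomic add. The library kernel described above instead assigns an interval of non-zeros to each warp and combines products with a warp-level segmented reduction (checking row boundaries), which avoids atomics; this sketch only illustrates the data layout and the per-non-zero parallelism:

  // COO layout: non-zero k is the triple (row[k], col[k], val[k]).
  // Simplified: one thread per non-zero, combined with atomicAdd
  // (float atomics in global memory need compute capability 2.0+).
  __global__ void spmv_coo_atomic(int nnz, const int* row, const int* col,
                                  const float* val, const float* x, float* y)
  {
      int k = blockIdx.x * blockDim.x + threadIdx.x;
      if (k < nnz)
          atomicAdd(&y[row[k]], val[k] * x[col[k]]);
  }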



Limitation of Previous Work

ELL: for power-law matrices, the longest rows cannot be bounded by a small k.

HYB: the ELL part covers only a small share of the computation, the COO part is slow, and increasing the ELL share introduces padding (memory) overhead.

  • ELL kernel

    • Requires row lengths to be bounded by a small number k; rows shorter than k are padded with zeros.

    • Data and index matrices are stored in column-major order; each thread works on one row.

  • HYB kernel: ELL + COO
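
A sketch of the ELL kernel just described (names illustrative): rows are padded to a common length k and the data/index arrays are stored column major so that the one-thread-per-row access is coalesced. It also makes the HYB trade-off visible: padding every row to the length of the longest one is hopeless for power-law matrices, hence the split into an ELL part plus a COO part.

  // ELL layout: n_rows x k arrays 'vals' and 'col_idx', padded and stored
  // column major; one thread per row.
  __global__ void spmv_ell(int n_rows, int k, const int* col_idx,
                           const float* vals, const float* x, float* y)
  {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < n_rows) {
          float sum = 0.0f;
          for (int j = 0; j < k; ++j) {
              // Entry (row, j) lives at j * n_rows + row, so in every iteration a
              // warp loads a contiguous run of 'vals' and 'col_idx' (coalesced).
              float v = vals[j * n_rows + row];
              if (v != 0.0f)
                  sum += v * x[col_idx[j * n_rows + row]];
          }
          y[row] = sum;
      }
  }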

