Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Xintian Yang, Srinivasan Parthasarathy, and P. Sadayappan

Department of Computer Science

The Ohio State University

Outline
  • Motivation and Background
  • Methods
  • Experiments
  • Conclusions and Future Work
Introduction
  • Sparse Matrix-Vector Multiplication (SpMV)
    • y = Ax, where A is a sparse matrix and x is a dense vector.
    • Dominant cost when solving large-scale linear systems or eigenvalue problems in iterative methods.
  • Focus of much research
    • Scientific Applications, e.g. finite element method
    • Graph Mining algorithms
      • PageRank, Random Walk with Restart, HITS
    • Industrial Strength Efforts
      • CPUs, Clusters
      • GPUs (focus of this talk)
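For concreteness, here is a minimal CUDA sketch of the baseline operation y = Ax with A in compressed sparse row (CSR) format, one thread per row. The kernel and array names are illustrative, not code from the paper:

```cuda
// Minimal CSR SpMV sketch: y = A*x, one thread per row.
// row_ptr[n+1], col_idx[nnz], val[nnz] are the standard CSR arrays.
__global__ void spmv_csr_scalar(int n, const int *row_ptr,
                                const int *col_idx, const float *val,
                                const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   // random access into x
        y[row] = sum;
    }
}
```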
Why GPUs
  • High Performance
    • [GPU vs. CPU performance comparison figure in the original slides]
  • High Productivity
    • CUDA (now) vs. OpenGL (previously; expressing general-purpose computation through a graphics API was far more cumbersome)
Background: CUDA Architecture
  • Programming Model (logical hierarchy):
    • Grid
    • Block
    • Thread
    • Kernel
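A brief sketch of how these levels appear in host code, using the illustrative spmv_csr_scalar kernel above (the launch parameters are assumptions):

```cuda
// Illustrative launch configuration: a grid of blocks, each block a set
// of threads, all executing one kernel.
void launch_spmv(int n, const int *row_ptr, const int *col_idx,
                 const float *val, const float *x, float *y)
{
    dim3 block(256);                          // threads per block
    dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover n rows
    spmv_csr_scalar<<<grid, block>>>(n, row_ptr, col_idx, val, x, y);
    cudaDeviceSynchronize();                  // wait for the grid to finish
}
```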
Background: CUDA Architecture
  • Hardware (Physical):
    • A set of multiprocessors
    • Each multiprocessor has 8 scalar processors and 1 instruction unit
    • A warp = 32 threads that execute the same instruction concurrently
    • Conditional divergence within a warp serializes execution
    • Different warps are time-shared on a multiprocessor
  • Memory System
    • Global memory: coalescing
    • Shared memory: 16KB per block
    • Constant/Texture memory
      • Constant value, cached
      • 16KB constant cache; 6–8KB texture cache per multiprocessor
    • Registers
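A small sketch contrasting coalesced and non-coalesced global memory access patterns (illustrative, not from the slides):

```cuda
// Coalesced: thread i touches element i, so a warp's 32 loads fall into
// a few contiguous memory transactions.
__global__ void copy_coalesced(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i];
}

// Non-coalesced: thread i strides by 32, scattering each warp's loads
// across many separate transactions.
__global__ void copy_strided(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n) b[i * 32] = a[i * 32];
}
```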
Power-law Graphs and Challenges
  • Power-law graphs
    • Large number of nodes with low degree
    • Few nodes with very high degree
  • Challenges for GPU-based computation of SpMV on such graphs:
    • Load balancing
    • Inefficient memory access
    • Conditional divergence
  • Problem Statement
    • Can we do better than competing industrial-strength efforts at processing matrices representing such graphs?
    • Does it yield end-to-end improvements in graph mining applications (e.g., PageRank)?
Outline
  • Motivation and Background
  • Methods
  • Experiments
  • Conclusions and Future Work
Key Insights from Benchmarking

Note: the texture cache size was not available; it was estimated to be 250KB (= 64,000 columns), so the entire vector x cannot fit in the texture cache.

  • Three kinds of memory accesses
    • Accesses to A, Accesses to x, and Writes to y
    • Previous methods have optimized accesses to A.
      • Observation 1: Each row accesses random elements in vector x
      • Observation 2: The accesses are non-coalesced – poor locality
    • Solution 1: Tile A by columns and store x on the texture cache
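A sketch of Solution 1 using the legacy CUDA texture reference API that was current for the Tesla C1060 generation (this API was removed in CUDA 12; modern code would read x through the read-only cache instead). Names are illustrative:

```cuda
// Bind x to a 1D texture so repeated reads within a column tile hit the
// per-multiprocessor texture cache (legacy texture reference API).
texture<float, 1, cudaReadModeElementType> tex_x;

__global__ void spmv_csr_tex(int n, const int *row_ptr, const int *col_idx,
                             const float *val, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * tex1Dfetch(tex_x, col_idx[j]);  // cached read of x
        y[row] = sum;
    }
}

// Host side, before the launch:
//   cudaBindTexture(0, tex_x, d_x, n * sizeof(float));
```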
Tiling
  • Observation 3: Column lengths follow a power-law distribution
    • Many short columns, little re-use of x in these columns
      • No benefit from tiling
    • Solution 2: Reorder by column length and partially tile A
    • Un-tiled elements are computed separately.
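A host-side sketch of Solution 2, assuming a simple re-use threshold; the function and parameter names are hypothetical:

```cuda
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical host-side helper: order columns by decreasing length and
// tile only the columns dense enough to give re-use of x in the cache.
std::vector<int> partial_tile_order(const std::vector<int> &col_len,
                                    int reuse_threshold, int &num_tiled)
{
    std::vector<int> order(col_len.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return col_len[a] > col_len[b]; });
    // Short columns see little re-use of x, so they stay un-tiled and are
    // computed separately.
    num_tiled = 0;
    while (num_tiled < (int)order.size() &&
           col_len[order[num_tiled]] >= reuse_threshold)
        ++num_tiled;
    return order;   // the first num_tiled columns are grouped into tiles
}
```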
Composite Storage of Tiles
  • Observation 4: Rows (non-zeros) in each tile follow power law!
  • Observation 5: Within each tile performance is limited by
    • load imbalance
    • non-coalesced global memory accesses
    • conditional thread divergence → serialization
  • Solution 3: Composite tile storage scheme
    • Basic observation from benchmarking study
      • Row major storage performs well on long rows (16 threads per row).
      • Column major storage performs well on short rows (1 thread per row).
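Hedged sketches of these two storage patterns (data layouts and names are assumptions; the reduction uses the modern shuffle intrinsic for clarity, whereas hardware of that era used warp-synchronous shared memory):

```cuda
// Long rows (row-major storage): 16 threads cooperate on one row and the
// partial sums are reduced inside the warp.
__global__ void spmv_row_major_16(int n, const int *row_ptr,
                                  const int *col_idx, const float *val,
                                  const float *x, float *y)
{
    int lane = threadIdx.x & 15;   // position within the 16-thread group
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) >> 4;
    float sum = 0.0f;
    if (row < n)
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 16)
            sum += val[j] * x[col_idx[j]];
    // Reduce the 16 partials per group; all warp lanes participate.
    for (int off = 8; off > 0; off >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, off, 16);
    if (lane == 0 && row < n) y[row] = sum;
}

// Short rows (column-major storage): one thread per row; storing the tile
// column-major makes consecutive threads read consecutive addresses.
__global__ void spmv_col_major_1(int n, const int *len, const int *col_idx,
                                 const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = 0; j < len[row]; ++j)
            sum += val[j * n + row] * x[col_idx[j * n + row]];
        y[row] = sum;
    }
}
```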
Row and Column Composite Storage
  • Reorder the rows in each tile from long to short.
  • Rows are partitioned into workloads with similar size.
  • A thread warp is assigned to compute one workload.
  • A workload is a rectangular area of non-zeros with width w and height h
    • If w > h, row major storage
    • Else, column major storage
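A host-side sketch of this partitioning rule, with the workload size target and names as assumptions:

```cuda
#include <vector>

// One rectangular workload of non-zeros inside a tile.
struct Workload { int first_row; int h; int w; bool row_major; };

// Hypothetical partitioner: rows are already sorted long-to-short; cut
// them into workloads of roughly target_nnz non-zeros each.
std::vector<Workload> partition_tile(const std::vector<int> &row_len,
                                     int target_nnz)
{
    std::vector<Workload> out;
    int r = 0, n = (int)row_len.size();
    while (r < n) {
        int first = r, nnz = 0;
        while (r < n && nnz < target_nnz) nnz += row_len[r++];
        Workload wl;
        wl.first_row = first;
        wl.h = r - first;              // height: rows in the rectangle
        wl.w = row_len[first];         // width: bounded by the longest row
        wl.row_major = (wl.w > wl.h);  // wide rectangles -> row major
        out.push_back(wl);
    }
    return out;   // one thread warp computes each workload
}
```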
Outline
  • Motivation and Background
  • Methods
  • Experiments
  • Conclusions and Future Work
Experiments
  • Hardware Configuration
    • GPU: NVIDIA Tesla C1060, 30 multiprocessors, 240 processor cores, 4GB global memory
    • CPU: dual-core Opteron 2.6GHz, 8GB of 667MHz DDR2 main memory
    • All experiments are run with a single process and a single GPU.

Datasets: [table in the original slides lists the power-law graph and unstructured matrix datasets]

SpMV Kernel

Power-law matrices: [performance comparison figures in the original slides]

SpMV Kernel

Unstructured (non-power-law) matrices: [performance comparison figures in the original slides]

Data Mining Applications

  • Given a directed graph G = (V, E) with adjacency matrix A
  • PageRank: the rank vector p satisfies p = (cW^T + (1−c)U)p
    • W is the row normalization of A
    • c = 0.85; U is an n by n matrix with all elements set to 1/n
  • Random Walk with Restart (RWR): given a query node i, compute the relevance score of every other node to node i via r = cWr + (1−c)q
    • W is the column normalization of A
    • c = 0.9; the ith element of the restart vector q is 1, the others are all 0
  • HITS: each web page is assigned an authority score and a hub score (a = A^T h, h = Aa, normalized each iteration)
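As a sketch of how SpMV dominates these applications, here is a hypothetical PageRank driver built on the illustrative spmv_csr_scalar kernel above (the constants follow the slide; everything else is assumed):

```cuda
#include <utility>

// Hypothetical elementwise helper: y = a*y + b.
__global__ void axpb(int n, float a, float *y, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * y[i] + b;
}

// Power-iteration driver: p <- c*W^T*p + (1-c)/n. The (1-c)/n term uses
// Up = (1/n)*1 when p sums to 1. The d_* arrays hold W^T in CSR form on
// the device.
void pagerank(int n, const int *d_row_ptr, const int *d_col_idx,
              const float *d_val, float *d_p, float *d_p_next, int iters)
{
    const float c = 0.85f;
    dim3 block(256), grid((n + 255) / 256);
    for (int t = 0; t < iters; ++t) {
        spmv_csr_scalar<<<grid, block>>>(n, d_row_ptr, d_col_idx,
                                         d_val, d_p, d_p_next);  // W^T * p
        axpb<<<grid, block>>>(n, c, d_p_next, (1.0f - c) / n);   // damping
        std::swap(d_p, d_p_next);   // next iteration reads the new vector
    }
}
```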
Outline
  • Motivation and Background
  • Methods
  • Experiments
  • Conclusions and Future Work
Conclusions and Future Work
  • Architecture conscious optimizations for SpMV
    • Architecture features of GPU
    • Characteristics of graph mining applications
  • Significant performance improvement on power-law graph datasets.
  • Future work
    • Parameter auto-tuning based on non-zero distribution
    • Blocking and loop unrolling
    • Extension to distributed systems to handle larger datasets.
Thank you
  • Questions?
  • Acknowledgements:
    • Grants:
      • DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611
      • NSF CAREER Grant IIS-0347662
Outline

  • Motivation and Background
  • Limitations of Previous Work
  • Methods
  • Experiments
  • Conclusions and Future Work
Limitations of Previous Work

CSR: imbalanced workload among threads, non-coalesced memory accesses.
CSR-vector: many short rows waste threads.

  • NVIDIA’s SpMV library is based on different storage formats of matrix A.
    • CSR
      • CSR kernel
      • CSR-vector kernel
      • Optimized CSR-vector (Baskaran et al.)
Limitation of Previous Work

COO: thread divergence, low thread-level parallelism

  • COO kernel
    • Each warp works on one interval
    • Warps run in parallel
    • Within one warp, threads perform a binary reduction and must check whether two operands come from the same row
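A sketch of that per-warp segmented reduction, written in the warp-synchronous style of that hardware generation (volatile shared arrays; modern code would add explicit warp synchronization). Names are assumptions:

```cuda
// Per-warp segmented reduction for COO SpMV: each lane holds one product
// (vals[lane]) and its row index (rows[lane]). Partial sums are combined
// only when both operands belong to the same row, so the last lane of
// each row segment ends up holding that row's sum. Relies on the
// lockstep warp execution of that era's hardware.
__device__ void coo_segmented_reduce(volatile float *vals,
                                     volatile int *rows, int lane)
{
    if (lane >= 1  && rows[lane] == rows[lane - 1])  vals[lane] += vals[lane - 1];
    if (lane >= 2  && rows[lane] == rows[lane - 2])  vals[lane] += vals[lane - 2];
    if (lane >= 4  && rows[lane] == rows[lane - 4])  vals[lane] += vals[lane - 4];
    if (lane >= 8  && rows[lane] == rows[lane - 8])  vals[lane] += vals[lane - 8];
    if (lane >= 16 && rows[lane] == rows[lane - 16]) vals[lane] += vals[lane - 16];
}
```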
Limitation of Previous Work

ELL: long rows cannot be bounded.
HYB: the ELL part covers only a small fraction of the computation, the COO part is slow, and increasing the ELL ratio introduces memory overhead.

  • ELL kernel
    • Requires row lengths to be bounded by a small number k; rows shorter than k are padded with 0s.
    • Data and index matrices are stored in column-major order; each thread works on one row.
  • HYB kernel: ELL + COO
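A minimal ELL-style kernel sketch under the layout just described (names assumed):

```cuda
// ELL SpMV sketch: every row is padded to k entries; the data and index
// arrays are column-major so each warp's accesses to A are coalesced.
__global__ void spmv_ell(int n, int k, const int *col_idx, const float *val,
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = 0; j < k; ++j) {
            float v = val[j * n + row];          // padded slots hold 0.0f
            sum += v * x[col_idx[j * n + row]];
        }
        y[row] = sum;
    }
}
```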