- 84 Views
- Uploaded on
- Presentation posted in: General

Fast Sparse Matrix-Vector Multiplication on GPUs : Implications for Graph Mining

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Xintian Yang, SrinivasanParthasarathy and P. Sadayappan

Department of Computer Science

The Ohio State University

- Motivation and Background
- Methods
- Experiments
- Conclusions and Future work

- Sparse Matrix-Vector Multiplication (SpMV)
- y = Ax, where A is a sparse matrix and x is a dense vector.
- Dominant cost when solving large-scale linear systems or eigenvalue problems in iterative methods.

- Focus of much research
- Scientific Applications, e.g. finite element method
- Graph Mining algorithms
- PageRank, Random Walk with Restart, HITS

- Industrial Strength Efforts
- CPUs, Clusters
- GPUs (focus of this talk)

- High Performance
- Figure on right

- High Productivity
- CUDA (now) vs. OpenGL (other complications)

- Programming Model (logical hierarchy):
- Grid
- Block
- Thread
- Kernel

- Hardware (Physical):
- A set of multiprocessors
- 8 processors and 1 Instruction Unit
- A warp = 32 threads, concurrently run the same instructions
- Conditional divergence
- Different warps: time-shared

- Memory System
- Global memory: coalescing
- Shared memory: 16KB per block
- Constant/Texture memory
- Constant value, cached
- 16KB constant cache; 6~8KB texture cache per multiprocessor

- Registers

- Power-law graphs
- Large number of nodes with low degree
- Few nodes with very high degree

- Challenges for GPU based
computation of SpMV on such graphs

- Load balancing
- Inefficient memory access
- Conditional divergence

- Problem Statement
- Can we do better than competing industrial strength efforts for processing matrices representing such graphs?
- Does it yield end-to-end improvements in graph mining application (e.g. PageRank) ?

- Motivation and Background
- Methods
- Experiments
- Conclusions and Future work

Texture cache size was not available

Estimated to be 250 KB (=64,000 columns)

Note entire X cannot fit on texture cache

- Three kinds of memory accesses
- Accesses to A, Accesses to x, and Writes to y
- Previous methods have optimized accesses to A.
- Observation 1: Each row accesses random elements in vector x
- Observation 2: The accesses are non-coalesced – poor locality

- Solution 1: Tile A by columns and store x on the texture cache

- Observation 3: Column lengths are power-law distribution
- Many short columns, little re-use of x in these columns
- No benefit from tiling

- Solution 2: Reorder by column length and partially tile A
- Un-tiled elements are computed separately.

- Many short columns, little re-use of x in these columns

- Observation 4: Rows (non-zeros) in each tile follow power law!
- Observation 5: Within each tile performance is limited by
- load imbalance
- non-coalesced global memory accesses
- conditional thread divergence serialization

- Solution 3: Composite tile storage scheme
- Basic observation from benchmarking study
- Row major storage performs well on long rows (16 threads per row).
- Column major storage performs well on short rows (1 thread per row).

- Basic observation from benchmarking study

- Reorder the rows in each tile from long to short.
- Rows are partitioned into workloads with similar size.
- A thread warp is assigned to compute one workload.
- A workload is a rectangle area of non-zeros with width w and height h
- If w > h, row major storage
- Else, column major storage

- Motivation and Background
- Methods
- Experiments
- Conclusions and Future work

- Hardware Configuration
- GPU: NVIDIA Tesla C1060, 30 multiprocessors, 240 processor cores, 4GB global memory
- CPU: dual core Opteron 2.6GHz, 8GB of 667 MHz DDR2 main memory
- All experiments are run with single process single GPU.

Datasets

Power-law matrices

SpMV Kernel

Unstructured matrices: non-power-law

Data Mining Applications

- Given directed graph G = (V, E) , and adjacency matrix A
- PageRank:
- W is row normalization of A
- c = 0.85, U is a n by n matrix with all elements set to 1/n.

- Random Walk with Restart (RWR): given a query node i, compute the relevance score from all other nodes to node i.
- W is column normalization of A
- c = 0.9, the ith element in is 1, the others are all 0.

- HITS: each web page is assigned an authority score and a hub score.

PageRank

- Motivation and Background
- Methods
- Experiments
- Conclusions and Future work

- Architecture conscious optimizations for SpMV
- Architecture features of GPU
- Characteristics of graph mining applications

- Significant performance improvement on power-law graph datasets.
- Future work
- Parameter auto-tuning based on non-zero distribution
- Blocking and loop unrolling
- Extension to distributed systems to handle larger datasets.

- Questions?
- Acknowledgements:
- Grants:
- DOE Early Career Principal Investigator Award
No. DE-FG02-04ER25611

- NSF CAREER Grant IIS-0347662

- DOE Early Career Principal Investigator Award

- Grants:

Random Walk with Restart

HITS

Outline

- Motivation and Background
- Limitations of Previous Approach
- Methods
- Experiments
- Conclusions and Future work

CSR: Imbalanced workload amongst threads, non-

coalesced memory accesses.

CSR-vector: many short rows, waste of threads

- NVIDIA’s SpMV Library based on different storage formats of matrix A.
- CSR
- CSR kernel
- CSR-vector kernel
- Optimized CSR-vector
Baskaran et al.

- CSR

warp0

warp1

COO: thread divergence, low thread level parallelism

- COO kernel
- Each warp works on one interval
- Warps run in parallel
- With in one warp, threads do binary reduction, need to check whether two operands are from the same row

ELL: long rows can’t be bounded

HYB: ELL part only covers small amount of computation,

COO part is slow, increasing the ratio of ELL part

introduces memory overhead.

- ELL kernel
- Requires row lengths are bounded by a small number k, 0s are padded if a row is shorter than k.
- Data and index matrices are stored in column major, each thread works on one row.

- HYB kernel: ELL + COO