CUBLAS and CUSPARSE MVM Timing

Gavin Harrison

SMVM Algorithm
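A minimal baseline sketch of sparse matrix-vector multiplication, assuming the matrix is stored in compressed sparse row (CSR) form and one thread handles one row. The storage format, the kernel name `csr_spmv_scalar`, and the host helpers `to_device` and `spmv_scalar` are assumptions made for illustration; they are not taken from the slides.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per row: y[row] = sum of val[j] * x[col_idx[j]] over the row's nonzeros.
__global__ void csr_spmv_scalar(int n_rows, const int* row_ptr, const int* col_idx,
                                const float* val, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}

// Hypothetical helper: copy a host vector into freshly allocated device memory.
// Error checking is omitted to keep the sketch short.
template <typename T>
T* to_device(const std::vector<T>& h) {
    T* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(T));
    cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper: upload CSR arrays, launch the kernel, download y.
std::vector<float> spmv_scalar(const std::vector<int>& row_ptr,
                               const std::vector<int>& col_idx,
                               const std::vector<float>& val,
                               const std::vector<float>& x) {
    int n_rows = static_cast<int>(row_ptr.size()) - 1;
    int* d_row = to_device(row_ptr);
    int* d_col = to_device(col_idx);
    float* d_val = to_device(val);
    float* d_x = to_device(x);
    float* d_y = nullptr;
    cudaMalloc(&d_y, n_rows * sizeof(float));

    int threads = 128;
    int blocks = (n_rows + threads - 1) / threads;
    csr_spmv_scalar<<<blocks, threads>>>(n_rows, d_row, d_col, d_val, d_x, d_y);

    std::vector<float> y(n_rows);
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(y.data(), d_y, n_rows * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_row); cudaFree(d_col); cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);
    return y;
}
```

With one thread per row, neighbouring threads read from different rows, so their global-memory accesses are far apart; this is the access pattern that GPU-specific tuning tries to improve.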

NVIDIA Memory Hierarchy

• Global memory: large, but high latency.
• Shared memory: on-chip memory shared by each set of processors (one multiprocessor).
• Constant and texture memory: read-only regions of global memory backed by an on-chip cache.
  • Constant memory is faster, but has only one port.
  • Texture memory does not suffer greatly from irregular access, and it benefits from 2D spatial locality.

Tuning SMVM for GPU (GTX 280)

• Use multiple threads per row; synchronize with __syncthreads() and combine the partial results.
• Access memory at a stride.
  • Each half warp then accesses sequential addresses.
  • This allows fewer reads from global memory.
• Align rows.
  • This also helps decrease the number of reads from global memory.
• Use texture memory for the input vector.
  • The input vector is reused.
  • Texture reads are cached and benefit from spatial locality.
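The first two tuning points can be sketched as follows, assuming CSR storage and 16 threads (one half warp) per row. The constants `THREADS_PER_ROW` and `BLOCK_SIZE`, the kernel name, and the host helpers are hypothetical. Row alignment and the texture binding for the input vector are left out of this sketch; `x` is read with ordinary global loads instead.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS_PER_ROW 16   // assumption: one half warp cooperates on each row
#define BLOCK_SIZE 128       // must be a multiple of THREADS_PER_ROW

__global__ void csr_spmv_vector(int n_rows, const int* row_ptr, const int* col_idx,
                                const float* val, const float* x, float* y) {
    __shared__ float partial[BLOCK_SIZE];
    int tid  = threadIdx.x;
    int lane = tid % THREADS_PER_ROW;                          // position within the row group
    int row  = (blockIdx.x * BLOCK_SIZE + tid) / THREADS_PER_ROW;

    // Strided walk: in each step the 16 lanes read 16 consecutive nonzeros.
    float sum = 0.0f;
    if (row < n_rows) {
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += THREADS_PER_ROW)
            sum += val[j] * x[col_idx[j]];
    }
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction of the partial sums belonging to each row.
    for (int s = THREADS_PER_ROW / 2; s > 0; s >>= 1) {
        if (lane < s) partial[tid] += partial[tid + s];
        __syncthreads();   // reached by every thread in the block
    }
    if (lane == 0 && row < n_rows) y[row] = partial[tid];
}

// Hypothetical helper: copy a host vector to the device (no error checking).
template <typename T>
T* to_device(const std::vector<T>& h) {
    T* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(T));
    cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper around the kernel above.
std::vector<float> spmv_vector(const std::vector<int>& row_ptr,
                               const std::vector<int>& col_idx,
                               const std::vector<float>& val,
                               const std::vector<float>& x) {
    int n_rows = static_cast<int>(row_ptr.size()) - 1;
    int* d_row = to_device(row_ptr);
    int* d_col = to_device(col_idx);
    float* d_val = to_device(val);
    float* d_x = to_device(x);
    float* d_y = nullptr;
    cudaMalloc(&d_y, n_rows * sizeof(float));

    int rows_per_block = BLOCK_SIZE / THREADS_PER_ROW;
    int blocks = (n_rows + rows_per_block - 1) / rows_per_block;
    csr_spmv_vector<<<blocks, BLOCK_SIZE>>>(n_rows, d_row, d_col, d_val, d_x, d_y);

    std::vector<float> y(n_rows);
    cudaMemcpy(y.data(), d_y, n_rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_row); cudaFree(d_col); cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);
    return y;
}
```

Because consecutive lanes read consecutive entries of `val` and `col_idx`, the hardware can combine a half warp's loads into fewer global-memory transactions. Padding each row so that it starts on an aligned boundary would improve this further, at the cost of extra storage.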

Improvements in Fermi (GTX 580)

• General L1/L2 cache structure.
  • L1 cache and shared memory share 64 KB per SM and can be configured as 48 KB / 16 KB in either direction.
  • L2 is 768 KB.
• Improved support for double-precision floating point.
• Added support for 32-bit integer multiplication.
• 32 SPs per SM.
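The configurable L1/shared-memory split is selected per kernel through the CUDA runtime call `cudaFuncSetCacheConfig`. A minimal sketch, assuming a hypothetical gather-style kernel that uses no shared memory and so might benefit from the larger L1 setting:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel with irregular reads of x and no shared memory use.
__global__ void gather_kernel(int n, const int* idx, const float* x, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[idx[i]];
}

// Ask the runtime to prefer the 48 KB L1 / 16 KB shared memory split for this kernel.
// The setting is a preference; devices without a configurable split ignore it.
cudaError_t prefer_l1_for_gather() {
    return cudaFuncSetCacheConfig(gather_kernel, cudaFuncCachePreferL1);
}

// The opposite preference, for kernels that stage data in shared memory.
cudaError_t prefer_shared_for_gather() {
    return cudaFuncSetCacheConfig(gather_kernel, cudaFuncCachePreferShared);
}
```

Which setting is faster for a given SMVM kernel depends on how much shared memory the kernel needs and how irregular its reads are, so it is worth timing both.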

CUSPARSE SMVM Performance

CUSPARSE SMVM Speedup Over OSKI (single precision)

CUBLAS MVM Performance
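A minimal sketch of how a dense matrix-vector product can be timed on the GPU with CUDA events, assuming the cuBLAS v2 API (`cublasSgemv`) and a square matrix filled with ones. The function name `time_sgemv_ms`, its parameters, and the test data are hypothetical; this is not the measurement code behind the results in these slides.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Returns the average time in milliseconds of one y = A * x call for an n x n
// matrix, and writes y[0] to *y0_out so the result can be checked.
// Error checking is omitted to keep the sketch short.
float time_sgemv_ms(int n, int reps, float* y0_out) {
    std::vector<float> hA(static_cast<size_t>(n) * n, 1.0f), hx(n, 1.0f);
    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time initialization is not included in the timing.
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // kernel launches are asynchronous; wait for the last one

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(y0_out, dy, sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return ms / reps;
}
```

The events are recorded on the same stream as the cuBLAS calls, so the elapsed time covers only GPU execution and excludes host-to-device transfers. Build with something like `nvcc file.cu -lcublas`. Whether transfers should be counted depends on what the comparison against a CPU library is meant to show.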

CUBLAS MVM Speedup over ATLAS
