
CUBLAS and CUSPARSE MVM Timing



Presentation Transcript


  1. CUBLAS and CUSPARSE MVM Timing. Gavin Harrison.

  2. SMVM Algorithm
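The transcript keeps only the title of this slide, so the formulation the author actually presented is not recoverable. As a point of reference, the sketch below shows sparse matrix-vector multiplication (SMVM) over the compressed sparse row (CSR) layout, one of the formats cuSPARSE accepts. The choice of CSR and the names `csr_matvec`, `row_ptr`, `col_idx`, and `values` are assumptions made for illustration.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] is the slice of col_idx/values that holds
    the nonzeros of row i. (Illustrative names, not from the slides.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read at a data-dependent index: the irregular access
            # pattern that makes SMVM memory-bound.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix:
#   [[4, 0, 1],
#    [0, 2, 0],
#    [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = csr_matvec(row_ptr, col_idx, values, x)
```

The indirect read `x[col_idx[k]]` is the part of the loop whose addresses cannot be predicted from the row index alone, which is why the later slides pay so much attention to where the input vector is stored.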

  3. NVIDIA Memory Hierarchy
     • Global memory: large, high latency.
     • Shared memory: shared cache for each set of processors.
     • Constant/texture memory: read-only regions of global memory, backed by on-chip caches.
     • Constant memory is faster, but has only one port.
     • Texture memory doesn’t suffer greatly from irregular access, and it benefits from 2D spatial locality.

  4. Tuning SMVM for GPU (GTX 280)
     • Use multiple threads per row; synchronize with __syncthreads() and combine the partial results.
     • Access memory at a stride.
        • Threads in a half warp access sequential addresses.
        • This allows fewer reads from global memory.
     • Align rows.
        • This also helps decrease reads from global memory.
     • Use texture memory for the input vector.
        • The input vector is reused.
        • Texture reads are cached and benefit from spatial locality.
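The thread mapping described on this slide can be modeled on the CPU. The sketch below is not the author's kernel and does not run on a GPU: it is a sequential Python model, assuming a CSR matrix, in which `threads_per_row` simulated threads ("lanes") share one row, each lane walks the row at a stride of `threads_per_row`, and the per-lane partial sums are combined with a tree reduction. The function name, the `partial` buffer (a stand-in for shared memory), and the power-of-two restriction are all assumptions for illustration.

```python
def csr_matvec_threads_per_row(row_ptr, col_idx, values, x, threads_per_row=4):
    """Sequential model of a multiple-threads-per-row SMVM kernel.

    threads_per_row must be a power of two so the tree reduction works.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        start, end = row_ptr[row], row_ptr[row + 1]
        # Stand-in for a shared-memory buffer with one slot per lane.
        partial = [0.0] * threads_per_row
        for lane in range(threads_per_row):
            # Lane t reads elements start+t, start+t+T, start+t+2T, ...
            # At any given step, neighbouring lanes therefore touch
            # neighbouring addresses in values/col_idx.
            for k in range(start + lane, end, threads_per_row):
                partial[lane] += values[k] * x[col_idx[k]]
        # Tree reduction of the partial sums. On a GPU, a barrier such as
        # __syncthreads() would separate the steps of this loop.
        stride = threads_per_row // 2
        while stride > 0:
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y[row] = partial[0]
    return y


# Same 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = csr_matvec_threads_per_row(row_ptr, col_idx, values, x, threads_per_row=4)
```

Rows shorter than `threads_per_row` leave some lanes idle, so the best number of threads per row depends on the typical row length of the matrix.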

  5. Improvements in Fermi (GTX 580)
     • General-purpose L1/L2 cache hierarchy.
     • L1 cache and shared memory split 64 KB per SM, configurable as 48 KB / 16 KB or 16 KB / 48 KB.
     • L2 is 768 KB.
     • Improved support for double-precision floating point.
     • Added support for 32-bit integer multiplication.
     • 32 SPs per SM.

  6. CUSPARSE SMVM Performance

  7. CUSPARSE SMVM Speedup Over OSKI (single precision)

  8. CUBLAS MVM Performance

  9. CUBLAS MVM Speedup over ATLAS
