
Exploiting Parallelism on GPUs

Matt Mukerjee, David Naylor


Presentation Transcript


  1. Exploiting Parallelism on GPUs. Matt Mukerjee, David Naylor

  2. Parallelism on GPUs
     • $100 NVIDIA video card → 192 cores
     • (Build Blacklight for ~$2000 ???)
     • Incredibly low power
     • Ubiquitous
     • Question: Use for general computation?
     • General Purpose GPU (GPGPU)

  3. GPU Hardware
     • Very specific constraints
     • Designed to be SIMD (e.g. shaders)
     • Zero-overhead thread scheduling
     • Little caching (compared to CPUs)
     • Constantly stalled on memory access
     • MASSIVE # of threads / core
     • Much finer-grained threads ("kernels")
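
The hardware points above (constant memory stalls, zero-overhead scheduling, a massive number of threads per core) fit together in a back-of-the-envelope model. The sketch below is not from the slides: the function names, the 400-cycle latency, and the 10 cycles of work per access are hypothetical, and the model assumes every thread simply alternates between a burst of work and a memory stall.

```python
import math

def threads_to_hide_latency(mem_latency_cycles: int, compute_cycles: int) -> int:
    """Threads one core needs so that some thread is always ready to run.

    Simplified model (an assumption, not a hardware spec): each thread
    alternates `compute_cycles` of work with a memory stall of
    `mem_latency_cycles`, and switching threads costs nothing.
    """
    # While one thread stalls, the others must supply enough work
    # to cover the whole stall.
    return 1 + math.ceil(mem_latency_cycles / compute_cycles)

def utilization(num_threads: int, mem_latency_cycles: int, compute_cycles: int) -> float:
    """Fraction of cycles the core spends computing under the same model."""
    period = compute_cycles + mem_latency_cycles  # one thread's full cycle
    return min(1.0, num_threads * compute_cycles / period)

# Hypothetical numbers: 400-cycle memory latency, 10 cycles of work per access.
needed = threads_to_hide_latency(400, 10)
print(needed)                                   # 41
print(utilization(1, 400, 10))                  # a single thread leaves the core mostly idle
print(utilization(needed, 400, 10))             # 1.0
```

Under these assumed numbers a single thread keeps the core busy for only a few percent of its cycles, which is one way to read why the design favors many fine-grained threads over large caches.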

  4. CUDA Architecture

  5. Thread Blocks
     • GPUs are SIMD
     • How does multithreading work?
     • Threads that branch are halted, then run
     • Single Instruction Multiple ...?

  6. CUDA is an SIMT architecture
     • Single Instruction Multiple Thread
     • Threads in a block execute the same instruction
     (Figure label: Multi-threaded Instruction Unit)

  7. Observation: Fitting the data structures needed by the threads in one multiprocessor requires application-specific tuning.

  8. Example: MapReduce on CUDA. Too big for cache on one SM!
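
The tuning problem in the observation and example above is at heart a capacity calculation. The sketch below is an illustration only: the 48 KiB on-chip capacity, the 256-thread block size, and the per-thread footprints are assumed values, not figures from the slides or from any particular GPU.

```python
def max_resident_blocks(shared_bytes_per_sm: int, bytes_per_thread: int,
                        threads_per_block: int) -> int:
    """How many thread blocks fit in one SM's on-chip memory at once."""
    bytes_per_block = bytes_per_thread * threads_per_block
    return shared_bytes_per_sm // bytes_per_block

SHARED_BYTES = 48 * 1024   # assumed on-chip capacity per SM; varies by GPU
THREADS_PER_BLOCK = 256    # hypothetical launch configuration

# Try three hypothetical per-thread working-set sizes (in bytes).
for bytes_per_thread in (64, 192, 256):
    blocks = max_resident_blocks(SHARED_BYTES, bytes_per_thread, THREADS_PER_BLOCK)
    print(bytes_per_thread, blocks)
# 64 3
# 192 1
# 256 0
```

A result of 0 corresponds to the "too big for cache on one SM" case: the per-thread data structure has to shrink, or the block has to use fewer threads, before the kernel fits.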

  9. Problem: Only one code branch within a block executes at a time.
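
The cost of this restriction can be seen in a small software model of lockstep execution. The sketch below is a plain Python simulation, not CUDA code and not the presenters' method; `run_simt` and the even/odd branch are hypothetical. On NVIDIA hardware the group that shares an instruction unit is a 32-thread warp; the list here stands in for such a group.

```python
def run_simt(values):
    """Execute `if v % 2 == 0: v // 2 else: 3 * v + 1` in lockstep.

    Returns (results, passes), where `passes` counts how many times the
    group had to step through a branch body.
    """
    mask_even = [v % 2 == 0 for v in values]
    results = list(values)
    passes = 0

    # Pass 1: threads on the "even" path run; the rest are masked off.
    if any(mask_even):
        passes += 1
        for i, active in enumerate(mask_even):
            if active:
                results[i] = values[i] // 2

    # Pass 2: the previously halted threads run the "odd" path.
    if not all(mask_even):
        passes += 1
        for i, active in enumerate(mask_even):
            if not active:
                results[i] = 3 * values[i] + 1

    return results, passes

print(run_simt([2, 4, 6, 8]))   # ([1, 2, 3, 4], 1): uniform, one pass
print(run_simt([1, 2, 3, 4]))   # ([4, 1, 10, 2], 2): divergent, two passes
```

When every thread takes the same path the group pays for one branch body; when the paths differ it pays for both in sequence, with part of the group idle during each pass.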

  10. Enhancing SIMT

  11. Problem: If two multiprocessors share a cache line, there are more memory accesses than necessary.

  12. Data Reordering
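
The transcript does not say which reordering scheme the slide shows, so the sketch below is only one plausible reading: lay the data out so that each multiprocessor's items are contiguous and no cache line is split between two multiprocessors. The function name, the two-multiprocessor setup, and the four-items-per-line figure are all hypothetical.

```python
def line_fetches(owner_of_item, items_per_line):
    """Count distinct (multiprocessor, cache line) pairs, i.e. the line
    fetches needed if each multiprocessor fetches every line it touches."""
    touched = {(owner, idx // items_per_line)
               for idx, owner in enumerate(owner_of_item)}
    return len(touched)

ITEMS_PER_LINE = 4  # hypothetical: four items per cache line

# owner_of_item[i] is the multiprocessor that processes item i.
interleaved = [i % 2 for i in range(16)]   # SM 0 and SM 1 alternate
# Reorder the data so each multiprocessor's items sit next to each other.
reordered = sorted(interleaved)

print(line_fetches(interleaved, ITEMS_PER_LINE))  # 8
print(line_fetches(reordered, ITEMS_PER_LINE))    # 4
```

In the interleaved layout both multiprocessors touch all four lines, so eight fetches are needed; after reordering each line belongs to exactly one multiprocessor and the count drops to four.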
