Exploiting Parallelism on GPUs

Matt Mukerjee, David Naylor

Parallelism on GPUs
• $100 NVIDIA video card → 192 cores
• (Build Blacklight for ~$2000 ???)
• Incredibly low power
• Ubiquitous
• Question: Use for general computation?
• General Purpose GPU (GPGPU)

GPU Hardware
• Very specific constraints
• Designed to be SIMD (e.g. shaders)
• Zero-overhead thread scheduling
• Little caching (compared to CPUs)
• Constantly stalled on memory access
• MASSIVE # of threads / core
• Much finer-grained threads (“kernels”)
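The link between "constantly stalled on memory access" and "MASSIVE # of threads / core" can be made concrete with a rough back-of-the-envelope model. The sketch below is an illustration only, not a description of any particular GPU: it assumes each thread alternates a fixed number of compute cycles with one memory access, and that switching threads costs nothing. The function names and the cycle counts (10 compute cycles, 400 cycles of memory latency) are hypothetical.

```python
def core_utilization(num_threads, compute_cycles, memory_latency):
    """Fraction of cycles a core does useful work under a simple model.

    Assumption: each thread repeats `compute_cycles` of work followed by a
    memory access that takes `memory_latency` cycles, and switching between
    threads is free (zero-overhead scheduling).
    """
    busy = num_threads * compute_cycles
    period = compute_cycles + memory_latency
    return min(1.0, busy / period)


def threads_to_hide_latency(compute_cycles, memory_latency):
    """Smallest thread count that keeps the core fully busy in this model."""
    period = compute_cycles + memory_latency
    return -(-period // compute_cycles)  # ceiling division


if __name__ == "__main__":
    # Hypothetical numbers: 10 cycles of work per 400-cycle memory access.
    for n in (1, 8, 32, 64):
        print(n, "threads ->", round(core_utilization(n, 10, 400), 3))
    print("threads needed:", threads_to_hide_latency(10, 400))
```

Under these assumed numbers a single thread leaves the core idle almost all the time, which is one way to see why a design with little caching relies on many resident threads per core.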

CUDA Architecture

Thread Blocks
• GPUs are SIMD
• How does multithreading work?
• Threads that branch are halted, then run
• Single Instruction Multiple…?

CUDA is a SIMT architecture
• Single Instruction Multiple Thread
• Threads in a block execute the same instruction (in hardware, in lockstep groups of 32 threads called warps)
• Figure: Multi-threaded Instruction Unit

Observation: Fitting the data structures needed by the threads in one multiprocessor requires application-specific tuning.

Example: MapReduce on CUDA
• Too big for cache on one SM!
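One common form this tuning takes is sizing the work so that each piece fits in the on-chip memory of a single multiprocessor. The sketch below illustrates the arithmetic only; it is not the MapReduce implementation the slide refers to. The function names, the 16 KB budget, and the per-thread and per-item sizes are all assumptions chosen for the example.

```python
def max_threads_per_block(on_chip_bytes, bytes_per_thread, group_size=32):
    """Largest thread count, rounded down to a multiple of `group_size`,
    whose per-thread data fits in `on_chip_bytes` (hypothetical budget)."""
    if bytes_per_thread <= 0:
        raise ValueError("bytes_per_thread must be positive")
    fit = on_chip_bytes // bytes_per_thread
    return (fit // group_size) * group_size


def split_into_tiles(num_items, item_bytes, on_chip_bytes):
    """Split `num_items` into [start, end) ranges that each fit on chip."""
    per_tile = on_chip_bytes // item_bytes
    if per_tile == 0:
        raise ValueError("a single item does not fit in on-chip memory")
    return [(start, min(start + per_tile, num_items))
            for start in range(0, num_items, per_tile)]


if __name__ == "__main__":
    budget = 16 * 1024  # assumed on-chip memory per multiprocessor, in bytes
    print(max_threads_per_block(budget, bytes_per_thread=100))
    print(len(split_into_tiles(1_000_000, item_bytes=8, on_chip_bytes=budget)))
```

The right budget and item size depend on the application and the device, which is the point of the observation: the numbers have to be chosen per workload.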

Problem: Only one code branch within a block executes at a time.
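The cost of branching under SIMT can be sketched with a small simulation. This is a toy model written for illustration, not CUDA code: a group of threads that share one instruction stream runs the "then" side with the non-matching threads masked off, then the "else" side with the mask inverted. The name `run_simt` and its parameters are hypothetical.

```python
def run_simt(values, predicate, then_fn, else_fn):
    """Run an if/else over a group of lockstep threads, one branch at a time.

    Each element of `values` stands for one thread's data. Returns
    (results, passes), where `passes` counts the serialized branch passes:
    for a non-empty group, 1 if every thread takes the same side, else 2.
    """
    mask = [predicate(v) for v in values]
    results = list(values)
    passes = 0

    if any(mask):  # at least one thread takes the "then" side
        passes += 1
        for i, active in enumerate(mask):
            if active:  # threads with a false mask sit idle in this pass
                results[i] = then_fn(values[i])

    if not all(mask):  # at least one thread takes the "else" side
        passes += 1
        for i, active in enumerate(mask):
            if not active:
                results[i] = else_fn(values[i])

    return results, passes


if __name__ == "__main__":
    is_even = lambda x: x % 2 == 0
    print(run_simt([1, 2, 3, 4], is_even, lambda x: x * 10, lambda x: -x))
    print(run_simt([2, 4, 6, 8], is_even, lambda x: x * 10, lambda x: -x))
```

When the threads disagree, the group pays for both sides of the branch while only part of it does useful work in each pass; when they agree, only one pass is needed.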

Enhancing SIMT

Problem: If two multiprocessors share a cache line, there are more memory accesses than necessary.

Data Reordering
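The effect of sharing cache lines between multiprocessors, and of reordering data to avoid it, can be counted in a toy model. The sketch below is an assumption-laden illustration rather than the technique from the talk: it treats memory as an array of elements, assumes each multiprocessor fetches every distinct line it touches exactly once, and uses hypothetical helper names throughout.

```python
def lines_touched(indices, line_size):
    """Set of cache-line numbers that cover the given element indices."""
    return {i // line_size for i in indices}


def total_line_fetches(assignment, line_size):
    """Sum over multiprocessors of the distinct lines each one must fetch.

    `assignment[k]` lists the element indices used by multiprocessor k.
    """
    return sum(len(lines_touched(idx, line_size)) for idx in assignment)


def reorder(assignment):
    """Lay out each multiprocessor's elements contiguously.

    Returns the assignment expressed in the new element positions.
    """
    new_assignment = []
    next_pos = 0
    for idx in assignment:
        new_assignment.append(list(range(next_pos, next_pos + len(idx))))
        next_pos += len(idx)
    return new_assignment


if __name__ == "__main__":
    line_size = 4  # elements per cache line (assumed)
    # Two multiprocessors working on interleaved elements of a 16-element array.
    interleaved = [list(range(0, 16, 2)), list(range(1, 16, 2))]
    print(total_line_fetches(interleaved, line_size))
    print(total_line_fetches(reorder(interleaved), line_size))
```

In the interleaved layout both multiprocessors touch every line, so each line is fetched twice; after reordering, each line belongs to one multiprocessor. If a multiprocessor's share is not a multiple of the line size, one boundary line can still be shared in this model.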
