
Analyzing CUDA Workloads Using a Detailed GPU Simulator


Presentation Transcript


  1. Analyzing CUDA Workloads Using a Detailed GPU Simulator
  Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt, University of British Columbia

  2. GPUs and CPUs on a collision course
  • 1st GPUs with programmable shaders appeared in 2001
  • Today: a TeraFLOP on a single card; Turing complete; highly accessible: senior undergraduate students can learn to program CUDA in a few weeks (though not high-performance code)
  • Rapidly growing set of CUDA applications (209 listed on NVIDIA’s CUDA website in February)
  • With OpenCL, we can safely expect the number of non-graphics applications written for GPUs to explode
  • GPUs are massively parallel systems: multicore + SIMT + fine-grained multithreading

  3. No detailed academic simulator for studying this?!?

  4. GPGPU-Sim
  • An academic, detailed (“cycle-level”) timing simulator developed from the ground up at the University of British Columbia (UBC) for modeling a modern GPU running non-graphics workloads
  • Relatively accurate, even though no effort was expended on tuning it to match real hardware more closely

  5. GPGPU-Sim
  • Currently supports CUDA version 1.1 applications “out of the box”
  • Microarchitecture model:
    • Based on the notion of “shader cores”, which approximate the “Streaming Multiprocessors” of the NVIDIA GeForce 8 series and above
    • Shader cores connect to memory controllers through a detailed network-on-chip simulator (Dally & Towles’ booksim)
    • Detailed DRAM timing model (everything except refresh)
  • GPGPU-Sim v2.0b available: www.gpgpu-sim.org

  6. Rest of this talk
  • Obligatory brief introduction to CUDA
  • GPGPU-Sim internals (100,000’ view)
  • Simulator software overview
  • Modeled microarchitecture
  • Some results from the paper

  7. CUDA Example

  // Runs on the CPU:
  main() {
    …
    cudaMalloc((void**) &d_idata, bytes);
    cudaMalloc((void**) &d_odata, maxNumBlocks*sizeof(int));
    cudaMemcpy(d_idata, h_idata, bytesin, cudaMemcpyHostToDevice);
    // kernel launch configuration: <<<grid dimension, CTA dimension, shared memory bytes>>>
    reduce<<<nblocks, nthreads, smemSize>>>(d_idata, d_odata);
    cudaThreadSynchronize();
    // copy the result back: destination first, then source
    cudaMemcpy(h_odata, d_odata, bytesout, cudaMemcpyDeviceToHost);
    …
  }

  // nthreads x nblocks copies of this kernel run in parallel on the GPU:
  __global__ void reduce(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];                       // each thread loads one element into shared memory
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {   // tree reduction in shared memory
      if ((tid % (2*s)) == 0)
        sdata[tid] += sdata[tid + s];
      __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // thread 0 writes the CTA’s partial result
  }

  8. Normal CUDA Flow
  • Applications are written in a mixture of C/C++ and CUDA
  • “nvcc” takes the CUDA (.cu) files and generates host C code plus “Parallel Thread eXecution” assembly language (PTX)
  • The PTX is passed to the assembler/optimizer “ptxas” to generate machine code, which is packed into a C array (not human readable)
  • The whole thing is combined and linked against the CUDA runtime API using a regular C/C++ compiler and linker
  • Run your app on the GPU

  9. GPGPU-Sim Flow
  • Uses CUDA’s nvcc to generate CPU C code and PTX
  • A flex/bison parser reads in the PTX
  • The host (CPU) code and the simulator are linked together into one binary
  • CUDA API calls are intercepted using a custom libcuda that implements the functions declared in the header files that come with CUDA (see the sketch below)
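
  To make the interception step concrete, here is a minimal sketch of what such a replacement library can look like: it defines the same entry points that the CUDA headers declare, but routes each call into the simulator rather than a real driver, so the application needs no source changes. The simulator-side functions (simulator_alloc_global, simulator_memcpy) are hypothetical names used only for illustration, not GPGPU-Sim’s actual internal API.

    // Sketch of a replacement "libcuda": same signatures as the CUDA runtime
    // headers, but every call is routed into the timing simulator.
    #include <cuda_runtime_api.h>   // declarations the stubs must match
    #include <cstddef>

    // Assumed simulator entry points (illustrative names only).
    void *simulator_alloc_global(size_t size);
    void  simulator_memcpy(void *dst, const void *src, size_t count, int kind);

    extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
        // Reserve space in the simulated GPU's global memory image.
        *devPtr = simulator_alloc_global(size);
        return cudaSuccess;
    }

    extern "C" cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                                      cudaMemcpyKind kind) {
        // Copy between host memory and the simulated device memory image.
        simulator_memcpy(dst, src, count, static_cast<int>(kind));
        return cudaSuccess;
    }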

  10. GPGPU-Sim Microarchitecture
  • A set of “shader cores” connected to a set of memory controllers via a detailed interconnection network model (booksim)
  • Memory controllers reorder requests to reduce activate/precharge overheads (see the sketch below)
  • The topology and bandwidth of the interconnect can be varied
  • Caches for global memory operations
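
  The slides do not spell out the memory controller’s scheduling policy, but the general idea of reordering to cut activate/precharge overhead can be sketched as follows: serve the oldest pending request that hits a bank’s already-open row, and fall back to plain oldest-first when no request does. The types and function name below are illustrative, not GPGPU-Sim internals.

    // Sketch of out-of-order DRAM request scheduling that favours row hits.
    #include <deque>
    #include <optional>
    #include <cstdint>

    struct MemRequest { uint32_t bank; uint32_t row; uint64_t addr; };
    struct BankState  { bool row_open = false; uint32_t open_row = 0; };

    std::optional<MemRequest> schedule_next(std::deque<MemRequest> &queue,
                                            const BankState banks[]) {
        if (queue.empty()) return std::nullopt;
        // First pass: oldest request whose target row is already open (row hit),
        // so no precharge/activate commands are needed.
        for (auto it = queue.begin(); it != queue.end(); ++it) {
            const BankState &b = banks[it->bank];
            if (b.row_open && b.open_row == it->row) {
                MemRequest hit = *it;
                queue.erase(it);
                return hit;
            }
        }
        // No row hit: fall back to the oldest request, which pays the
        // precharge + activate overhead to open a new row.
        MemRequest oldest = queue.front();
        queue.pop_front();
        return oldest;
    }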

  11. Shader Core Details
  • A shader core is roughly a “Streaming Multiprocessor” in NVIDIA terminology
  • Sets of scalar threads are grouped together into SIMD units called “warps” (NVIDIA uses 32 threads per warp on current hardware); warps are grouped into CTAs, and CTAs into “grids”
  • The set of warps on a core is fine-grain interleaved on the pipeline to hide off-chip memory access latency
  • Threads in one CTA can communicate via an on-chip 16KB “shared memory” (see the sketch below)
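
  As a concrete illustration of the terminology above, the following CUDA kernel (not from the paper) shows how scalar thread IDs map onto warps and how threads in one CTA exchange data through the on-chip shared memory across a __syncthreads() barrier. The kernel name and the 256-thread CTA size are assumptions made for this example.

    // Assumes CTAs of at most 256 threads, a multiple of the warp size.
    #define WARP_SIZE 32   // warp width used by current NVIDIA hardware (per the slide)

    __global__ void cta_exchange(const int *in, int *out) {
        __shared__ int buf[256];            // lives in the core's 16KB shared memory

        unsigned int tid     = threadIdx.x;                    // scalar thread id within the CTA
        unsigned int warp_id = tid / WARP_SIZE;                // warp this thread is grouped into
        unsigned int lane    = tid % WARP_SIZE;                // position within that warp
        unsigned int gid     = blockIdx.x * blockDim.x + tid;  // global id across the grid

        buf[tid] = in[gid];                 // each thread deposits one value
        __syncthreads();                    // CTA-wide barrier: all writes now visible to all warps

        // Read a value written by a thread in a different warp of the same CTA;
        // this cross-warp communication is only possible through shared memory.
        unsigned int partner = ((warp_id + 1) * WARP_SIZE + lane) % blockDim.x;
        out[gid] = buf[partner];
    }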

  12. Interconnection Network (figure): baseline mesh topology and memory controller placement; variations: crossbar, ring, torus

  13. Are more threads better?
  • More CTAs on a core: helps hide latency when some warps are waiting at barriers, and can increase memory latency tolerance, but requires more per-core resources (see the sketch below)
  • Fewer CTAs on a core: less contention in the interconnection network and memory system
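
  The per-core resource point can be made concrete with a small host-side sketch: the number of CTAs that run concurrently on a core is the tightest of several resource limits. The thread-slot and register-file figures below are illustrative placeholders; only the 16KB shared memory size comes from the slides.

    // Sketch: how per-core resources bound the number of concurrent CTAs.
    #include <algorithm>
    #include <cstdio>

    int max_concurrent_ctas(int threads_per_cta, int regs_per_thread, int shmem_per_cta,
                            int core_threads = 1024,    // assumed thread slots per core
                            int core_regs    = 16384,   // assumed register file entries
                            int core_shmem   = 16384)   // 16KB shared memory (from the slides)
    {
        int by_threads = core_threads / threads_per_cta;
        int by_regs    = core_regs / (threads_per_cta * regs_per_thread);
        int by_shmem   = shmem_per_cta ? core_shmem / shmem_per_cta : by_threads;
        return std::min({by_threads, by_regs, by_shmem});   // the tightest limit wins
    }

    int main() {
        // e.g. 256-thread CTAs, 16 registers per thread, 4KB of shared memory per CTA
        std::printf("%d CTAs per core\n", max_concurrent_ctas(256, 16, 4096));
        return 0;
    }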

  14. Memory Access Coalescing
  • Grouping accesses from multiple, concurrently issued, scalar threads into a single access to a contiguous memory region (illustrated below)
  • Always done for the threads within a single warp
  • Coalescing among multiple warps: we explore its performance benefits; it is more expensive to implement
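
  For illustration, the two CUDA kernels below (not from the paper) contrast an access pattern that coalesces within a warp with one that does not; the kernel names and the stride parameter are made up for this example.

    // Coalesced: consecutive threads in a warp read consecutive words, so the
    // hardware can merge the warp's loads into one access to a contiguous region.
    __global__ void copy_coalesced(const float *in, float *out) {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Uncoalesced: consecutive threads read words that are 'stride' elements apart,
    // so the warp's loads hit scattered regions and cannot be merged into one access.
    __global__ void copy_strided(const float *in, float *out, int stride) {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }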

  15. Simulation setup

  16. Benchmark Selection
  • Applications developed by 3rd-party researchers
  • With reported speedups of less than 50x
  • Plus some applications from the CUDA SDK

  17. Benchmarks (more info in paper)

  18. Interconnection Network Latency Sensitivity
  • A slight increase in interconnection latency has no severe effect on overall performance
  • No need to overdesign the interconnect to minimize latency

  19. Interconnection Network Bandwidth Sensitivity
  • Low bandwidth (the 8B channel configuration) decreases performance a lot
  • Very high bandwidth just moves the bottleneck elsewhere

  20. Effects of varying the number of CTAs
  • Most benchmarks do not benefit substantially
  • Some benchmarks even perform better with fewer concurrent threads (e.g. AES) because of less contention in DRAM

  21. More insights and data in the paper…

  22. Summary
  • GPGPU-Sim: a novel GPU simulator capable of simulating CUDA applications (www.gpgpu-sim.org)
  • Performance of the simulated applications is more sensitive to bisection bandwidth and less sensitive to (zero-load) latency
  • Sometimes running fewer CTAs can improve performance (less DRAM contention)

  23. Interconnect Topology (Fig 9)

  24. ICNT Latency and BW sensitivity (Fig 10-11)

  25. Mem Controller Optimization Effects (Fig 12)

  26. DRAM Utilization and Efficiency (Fig 13-14)

  27. L1 / L2 Cache (Fig 15)

  28. Varying CTAs (Fig 16)

  29. Inter-Warp Coalescing (Fig 17)
