GPU Architecture & Implications

David Luebke, NVIDIA Research

GPU Architecture
• CUDA provides a parallel programming model
• The Tesla GPU architecture implements this
• This talk will describe the characteristics, goals, and implications of that architecture

G80 GPU Implementation: Tesla C870
• 681 million transistors
• 470 mm² in 90 nm CMOS
• 128 thread processors
  • 518 GFLOPS peak
  • 1.35 GHz processor clock
• 1.5 GB DRAM
  • 76 GB/s peak
  • 800 MHz GDDR3 clock
  • 384 pin DRAM interface
• ATX form factor card
• PCI Express x16
• 170 W max with DRAM
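The peak figures on this slide can be roughly reproduced from the clock rates. The sketch below is a back-of-the-envelope check, not vendor data: it assumes 3 floating-point operations per thread processor per clock (a multiply-add plus a multiply), reads the 384 pin interface as a 384-bit data bus, and assumes double data rate on the GDDR3 clock. None of those three assumptions is stated on the slide.

```python
# Back-of-the-envelope check of the Tesla C870 peak numbers (plain Python).
# ASSUMPTIONS, not from the slide: 3 flops per processor per clock,
# a 384-bit data bus, and 2 transfers per memory clock (double data rate).

def peak_gflops(processors, clock_ghz, flops_per_clock=3):
    """Peak arithmetic rate in GFLOPS."""
    return processors * clock_ghz * flops_per_clock


def peak_bandwidth_gb_s(bus_bits, mem_clock_mhz, transfers_per_clock=2):
    """Peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


if __name__ == "__main__":
    print(f"whole chip: {peak_gflops(128, 1.35):.1f} GFLOPS")
    print(f"one 8-processor SM: {peak_gflops(8, 1.35):.1f} GFLOPS")
    print(f"DRAM: {peak_bandwidth_gb_s(384, 800):.1f} GB/s")
```

Under these assumptions the results land within rounding of the 518 GFLOPS and 76 GB/s quoted above.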

Block Diagram Redux
• G80 (launched Nov 2006)
• 128 Thread Processors execute kernel threads
• Up to 12,288 parallel threads active
• Per-block shared memory (PBSM) accelerates processing
[Block diagram: Host, Input Assembler, Thread Execution Manager, eight groups of Thread Processors with sixteen PBSM blocks, Load/store, Global Memory]

Streaming Multiprocessor (SM)
• Processing elements
  • 8 scalar thread processors (SP)
  • 32 GFLOPS peak at 1.35 GHz
  • 8192 32-bit registers (32KB)
    • ½ MB total register file space!
  • usual ops: float, int, branch, …
• Hardware multithreading
  • up to 8 blocks resident at once
  • up to 768 active threads in total
• 16KB on-chip memory
  • low latency storage
  • shared amongst threads of a block
  • supports thread communication
[Figure: SM containing MT IU, SPs, and Shared Memory, running threads t0, t1, … tB]

Goal: Scalability
• Scalable execution
  • Program must be insensitive to the number of cores
  • Write one program for any number of SM cores
  • Program runs on any size GPU without recompiling
• Hierarchical execution model
  • Decompose problem into sequential steps (kernels)
  • Decompose each kernel into parallel blocks
  • Decompose each block into parallel threads
• Hardware distributes independent blocks to SMs as available
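The decomposition into kernels, blocks, and threads can be illustrated with a toy model. The sketch below is plain Python, not CUDA, and every name in it (`vector_add_kernel`, `launch`, `num_sms`) is hypothetical. It runs a vector add as independent blocks handed out to however many SMs exist, and the result does not depend on that count.

```python
# Toy model of kernel -> blocks -> threads (plain Python, NOT CUDA).
# All names here are hypothetical; num_sms stands in for the GPU size.

def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Work done by a single thread: one element of out = a + b."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):                        # guard the partial last block
        out[i] = a[i] + b[i]


def launch(kernel, num_blocks, block_dim, num_sms, *args):
    """Hand independent blocks to SMs as they become available."""
    pending = list(range(num_blocks))
    while pending:
        # One wave: each SM takes the next block that is still waiting.
        wave, pending = pending[:num_sms], pending[num_sms:]
        for block_idx in wave:
            for thread_idx in range(block_dim):
                kernel(block_idx, thread_idx, block_dim, *args)


def vector_add(a, b, block_dim=4, num_sms=2):
    out = [0] * len(a)
    num_blocks = (len(a) + block_dim - 1) // block_dim  # round up
    launch(vector_add_kernel, num_blocks, block_dim, num_sms, a, b, out)
    return out
```

Because no block reads another block's output, the program is the same whether `num_sms` is 1 or 16. Only the number of waves changes.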

Blocks Run on Multiprocessors
[Figure: kernel launched by host onto the device processor array; each multiprocessor has an MT IU, SPs, and Shared Memory; all multiprocessors sit above Device Memory]

Goal: easy to program
• Strategies:
  • Familiar programming language mechanics
    • C/C++ with small extensions
  • Simple parallel abstractions
    • Simple barrier synchronization
    • Shared memory semantics
    • Hardware-managed hierarchy of threads

Hardware Multithreading
• Hardware allocates resources to blocks
  • blocks need: thread slots, registers, shared memory
  • blocks don’t run until resources are available
• Hardware schedules threads
  • threads have their own registers
  • any thread not waiting for something can run
  • context switching is (basically) free – every cycle
• Hardware relies on threads to hide latency
  • i.e., parallelism is necessary for performance
[Figure: SM with MT IU, SPs, and Shared Memory]
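The resource rule above (a block runs only when thread slots, registers, and shared memory are all available) can be turned into a small estimate of how many blocks fit on one SM. This is a simplified sketch using the per-SM limits quoted in this talk. The function name is hypothetical, and real hardware may round allocations to some granularity that is ignored here.

```python
# Simplified estimate of resident blocks per SM (plain Python).
# Limits are the per-SM G80 figures from this talk; allocation
# granularity on real hardware is NOT modeled (assumption).
MAX_BLOCKS = 8            # blocks resident at once
MAX_THREADS = 768         # active threads in total
REGISTERS = 8192          # 32-bit registers
SHARED_BYTES = 16 * 1024  # on-chip shared memory


def resident_blocks(threads_per_block, regs_per_thread, shared_per_block):
    """Blocks that fit on one SM; 0 means the block can never run."""
    limits = [
        MAX_BLOCKS,
        MAX_THREADS // threads_per_block,
        REGISTERS // (regs_per_thread * threads_per_block),
    ]
    if shared_per_block > 0:
        limits.append(SHARED_BYTES // shared_per_block)
    return min(limits)
```

For example, 256-thread blocks using 10 registers per thread and 4 KB of shared memory are limited to 3 resident blocks by both thread slots and registers. Raising register use to 16 per thread drops that to 2.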

Goal: Performance per millimeter
• For GPUs, performance == throughput
• Strategy: hide latency with computation, not cache
  • Heavy multithreading – already discussed by Kevin
• Implication: need many threads to hide latency
  • Occupancy – typically need 128 threads/SM minimum
  • Multiple thread blocks/SM good to minimize effect of barriers
• Strategy: Single Instruction Multiple Thread (SIMT)
  • Balances performance with ease of programming
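Why many threads are needed can be seen from a rough issue-rate model. The sketch below is an illustration under stated assumptions, not a description of the real scheduler. It assumes one warp instruction occupies the 8 SPs for 32 / 8 = 4 clocks, and that a warp issues a fixed number of independent instructions before stalling. The latency values in the usage note are hypothetical.

```python
import math

WARP_SIZE = 32   # threads per warp (from the talk)
SPS_PER_SM = 8   # scalar thread processors per SM (from the talk)
MAX_WARPS = 24   # 768 threads per SM / 32 threads per warp


def warps_needed(latency_cycles, independent_instrs=1):
    """Warps per SM needed to keep the SPs busy while one warp stalls.

    ASSUMPTIONS (hypothetical model): a warp instruction occupies the
    SPs for WARP_SIZE / SPS_PER_SM clocks, and every warp issues
    `independent_instrs` instructions and then waits `latency_cycles`.
    """
    busy_per_warp = (WARP_SIZE // SPS_PER_SM) * independent_instrs
    # The stalled warp itself, plus enough others to cover its wait.
    return 1 + math.ceil(latency_cycles / busy_per_warp)
```

With a hypothetical 400-cycle stall and one instruction per warp, the model asks for 101 warps, far more than the 24 an SM can hold. With five independent instructions per warp it asks for 21. In this model, more work per thread between stalls can substitute for more threads.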

SIMT Thread Execution
• Groups of 32 threads formed into warps
  • always executing same instruction
  • shared instruction fetch/dispatch
  • some become inactive when code path diverges
  • hardware automatically handles divergence
• Warps are the primitive unit of scheduling
  • pick 1 of 24 warps for each instruction slot
• SIMT execution is an implementation choice
  • sharing control logic leaves more space for ALUs
  • largely invisible to programmer
  • must understand for performance, not correctness
[Figure: SM with MT IU, SPs, and Shared Memory]
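The cost of divergence can be shown with a small counting model. The sketch below is plain Python with a hypothetical function name, assuming a simple if/else in which each path is a fixed number of instructions. A warp issues a path only if at least one of its 32 threads takes it, and threads on the other path sit inactive.

```python
WARP_SIZE = 32  # threads per warp (from the talk)


def warp_issue_slots(takes_branch, then_len, else_len):
    """Instruction slots one warp spends on an if/else.

    takes_branch: one boolean per thread in the warp.
    then_len / else_len: instruction counts of the two paths
    (hypothetical inputs for illustration).
    """
    if len(takes_branch) != WARP_SIZE:
        raise ValueError("expected one predicate per thread in the warp")
    slots = 0
    if any(takes_branch):      # at least one thread runs the 'then' path
        slots += then_len
    if not all(takes_branch):  # at least one thread runs the 'else' path
        slots += else_len
    return slots
```

A warp whose threads all agree pays for one path only. A single dissenting thread makes the whole warp pay for both. That affects performance, not correctness.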

SIMT Multithreaded Execution
• Weaving: the original parallel thread technology is about 10,000 years old
• Warp: a set of 32 parallel threads that execute a SIMD instruction
• SM hardware implements zero-overhead warp and thread scheduling
• Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads
• Threads can execute independently
• SIMD warp automatically diverges and converges when threads branch
• Best efficiency and performance when threads of a warp execute together
• SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency
[Figure: SM multithreaded instruction scheduler issuing over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]

Memory Architecture
• Direct load/store access to device memory
  • treated as the usual linear sequence of bytes (i.e., not pixels)
• Texture & constant caches are read-only access paths
• On-chip shared memory shared amongst threads of a block
  • important for communication amongst threads
  • provides low-latency temporary storage (latency ~100x lower than DRAM)
[Figure: SM (MT IU, SP, I Cache, Constant Cache, Texture Cache, Shared Memory) connected to Device Memory, which connects to Host Memory over PCIe]

Myths of GPU Computing
• GPUs layer normal programs on top of graphics. NO: CUDA compiles directly to the hardware
• GPU architectures are:
  • Very wide (1000s) SIMD machines… NO: warps are 32-wide
  • …on which branching is impossible or prohibitive… NOPE
  • …with 4-wide vector registers. NO: scalar thread processors
• GPUs are power-inefficient. NO: 4-10x perf/W advantage, up to 89x reported for some studies
• GPUs don’t do real floating point

GPU Floating Point Features

Do GPUs Do Real IEEE FP?
• G8x GPU FP is IEEE 754
  • Comparable to other processors / accelerators
  • More precise / usable in some ways
  • Less precise in other ways
• GPU FP getting better every generation
  • Double precision support shortly
  • Goal: best of class by 2009

Questions?
David Luebke, dluebke@nvidia.com

Applications & Sweet Spots

GPU Computing Sweet Spots
• Applications:
  • High arithmetic intensity: Dense linear algebra, PDEs, n-body, finite difference, …
  • High bandwidth: Sequencing (virus scanning, genomics), sorting, database…
  • Visual computing: Graphics, image processing, tomography, machine vision…

GPU Computing Example Markets
• Computational Finance
• Computational Chemistry
• Computational Geoscience
• Computational Medicine
• Computational Modeling
• Computational Biology
• Computational Science
• Image Processing

Applications - Condensed
3D image analysis • Adaptive radiation therapy • Acoustics • Astronomy • Audio • Automobile vision • Bioinformatics • Biological simulation • Broadcast • Cellular automata • Computational Fluid Dynamics • Computer Vision • Cryptography • CT reconstruction • Data Mining • Digital cinema/projections • Electromagnetic simulation • Equity training • Film • Financial - lots of areas • Languages • GIS • Holographics cinema • Imaging (lots) • Mathematics research • Military (lots) • Mine planning • Molecular dynamics • MRI reconstruction • Multispectral imaging • n-body • Network processing • Neural network • Oceanographic research • Optical inspection • Particle physics • Protein folding • Quantum chemistry • Ray tracing • Radar • Reservoir simulation • Robotic vision/AI • Robotic surgery • Satellite data analysis • Seismic imaging • Surgery simulation • Surveillance • Ultrasound • Video conferencing • Telescope • Video • Visualization • Wireless • X-ray

GPU Computing Sweet Spots
• From cluster to workstation
  • The “personal supercomputing” phase change
• From lab to clinic
• From machine room to engineer, grad student desks
• From batch processing to interactive
• From interactive to real-time
• GPU-enabled clusters
• A 100x or better speedup changes the science
  • Solve at different scales
  • Direct brute-force methods may outperform cleverness
  • New bottlenecks may emerge
  • Approaches once inconceivable may become practical

New Applications
• Real-time options implied volatility engine
• Ultrasound imaging
• Swaption volatility cube calculator
• HOOMD Molecular Dynamics
• Manifold 8 GIS
• SDK: Mandelbrot, computer vision
• Also…
  • Image rotation/classification
  • Graphics processing toolbox
  • Microarray data analysis
  • Data parallel primitives
  • Astrophysics simulations
  • Seismic migration

The Future of GPUs
• GPU Computing drives new applications
  • Reducing “Time to Discovery”
  • 100x Speedup changes science and research methods
• New applications drive the future of GPUs and GPU Computing
  • Drives new GPU capabilities
  • Drives hunger for more performance
• Some exciting new domains:
  • Vision, acoustic, and embedded applications
  • Large-scale simulation & physics

Accuracy & Performance

CUDA Performance Advantages
• Performance:
  • BLAS1: 60+ GB/sec
  • BLAS3: 127 GFLOPS
  • FFT: 52 benchFFT* GFLOPS
  • FDTD: 1.2 Gcells/sec
  • SSEARCH: 5.2 Gcells/sec
  • Black Scholes: 4.7 GOptions/sec
  • VMD: 290 GFLOPS
• How:
  • Leveraging shared memory
  • GPU memory bandwidth
  • GPU GFLOPS performance
  • Custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), …
• All benchmarks are compiled code!

GPGPU vs. GPU Computing

Problem: GPGPU
• OLD: GPGPU – trick the GPU into general-purpose computing by casting problem as graphics
  • Turn data into images (“texture maps”)
  • Turn algorithms into image synthesis (“rendering passes”)
• Promising results, but:
  • Tough learning curve, particularly for non-graphics experts
  • Potentially high overhead of graphics API
  • Highly constrained memory layout & access model
  • Need for many passes drives up bandwidth consumption

Solution: CUDA
• NEW: GPU Computing with CUDA
  • CUDA = Compute Unified Device Architecture
  • Co-designed hardware & software for direct GPU computing
• Hardware: fully general data-parallel architecture
  • General thread launch
  • Global load-store
  • Parallel data cache
  • Scalar architecture
  • Integers, bit operations
  • Double precision (soon)
• Software: program the GPU in C
  • Scalable data-parallel execution/memory model
  • C with minimal yet powerful extensions

Graphics Programming Model
Graphics Application → Vertex Program → Rasterization → Fragment Program → Display

Streaming GPGPU Programming
OpenGL Program to Add A and B
Pipeline: Vertex Program → Rasterization → Fragment Program
• Start by creating a quad
• “Programs” created with raster operation
• Read textures as input to OpenGL shader program
• Write answer to texture memory as a “color”
• CPU reads texture memory for results
All this just to do A + B

What’s Wrong With GPGPU?
Pipeline: Application → Vertex Program → Rasterization → Fragment Program → Display
• APIs are specific to graphics
• Limited texture size and dimension
• Limited instruction set
• No thread communication
• Limited local storage
• Limited shader outputs
• No scatter
[Figure: Fragment Program with Input Registers, Texture, Constants, Temp Registers, Output Registers]

Building a Better Pixel
[Figure: Fragment Program with Input Registers, Texture, Constants, Registers, Output Registers]

Building a Better Pixel Thread
[Figure: Thread Program with Thread Number, Texture, Constants, Registers, Output Registers]
Features
• Millions of instructions
• Full Integer and Bit instructions
• No limits on branching, looping
• 1D, 2D, or 3D thread ID allocation

Global Memory
[Figure: Thread Program with Thread Number, Texture, Constants, Registers, Global Memory]
Features
• Fully general load/store to GPU memory
• Untyped, not fixed texture types
• Pointer support
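The difference that fully general load/store makes can be sketched as gather versus scatter. The example below is plain Python with hypothetical function names. A gather lets each thread read from a computed address and write to its own fixed output slot. A scatter lets each thread write to a computed address, which the GPGPU model listed earlier as unavailable.

```python
# Gather vs. scatter, one loop iteration per "thread" (plain Python).
# Function names are hypothetical and chosen only for illustration.

def gather(src, index):
    """Thread i READS src[index[i]] and writes its own slot out[i]."""
    return [src[i] for i in index]


def scatter(src, index, size):
    """Thread i reads its own src[i] and WRITES to out[index[i]]."""
    out = [0] * size
    for value, i in zip(src, index):
        out[i] = value  # computed write address: needs general stores
    return out
```

If two threads scatter to the same address, the result depends on ordering. This sketch does not address that case.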

Parallel Data Cache
[Figure: Thread Program with Thread Number, Texture, Constants, Registers, Parallel Data Cache, Global Memory]
Features
• Dedicated on-chip memory
• Shared between threads for inter-thread communication
• Explicitly managed
• As fast as registers

Example Algorithm - Fluids
Goal: Calculate PRESSURE in a fluid
Pressure = Sum of neighboring pressures
Pn’ = P1 + P2 + P3 + P4
So the pressure for each particle is…
Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
Pressure depends on neighbors
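The four sums above follow one pattern: output k adds four consecutive inputs, and each window starts two inputs after the previous one. A minimal plain-Python rendering of that pattern, with a hypothetical function name, looks like this:

```python
def pressures(p):
    """Sliding sum from the fluid example (window 4, stride 2).

    p[0] holds P1. Output k (0-based) is p[2k] + ... + p[2k + 3],
    so Pressure1 = P1..P4, Pressure2 = P3..P6, and so on.
    """
    num_outputs = max(0, (len(p) - 2) // 2)
    return [sum(p[2 * k : 2 * k + 4]) for k in range(num_outputs)]
```

Every output is independent of the others, so each one could be assigned to its own thread. Neighboring outputs share two inputs, which is where on-chip shared storage can help.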

GPU Computing with CUDA: Example Fluid Algorithm
[Figure: three ways to compute Pn’ = P1 + P2 + P3 + P4]
• CPU: single thread out of cache (Control, ALU, Cache, DRAM)
• GPGPU: multiple passes through video memory (Control and ALU pairs, Video Memory)
• GPU Computing with CUDA: parallel execution through cache (Thread Execution Manager, Control and ALU pairs, Parallel Data Cache holding shared data P1 to P5, DRAM)
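The benefit of executing through the parallel data cache can be approximated by counting DRAM reads. The sketch below is plain Python with a hypothetical function name and a simplified cost model (one DRAM read per staged element). One block stages its inputs into a local tile once, and every thread then sums from the tile. It uses the window-of-4, stride-of-2 rule from the fluid example.

```python
def block_pressures(p, first_out, num_out):
    """One thread block of the fluid example, staged through a tile.

    p[0] holds P1. The block produces outputs first_out .. first_out +
    num_out - 1 (0-based). Returns (outputs, dram_reads).
    Cost model is an ASSUMPTION: one DRAM read per staged element.
    """
    lo = 2 * first_out
    tile = p[lo : lo + 2 * (num_out - 1) + 4]  # each input loaded once
    dram_reads = len(tile)
    # On a GPU a barrier would sit here, so that every load has landed
    # before any thread starts reading the tile.
    outputs = [sum(tile[2 * t : 2 * t + 4]) for t in range(num_out)]
    return outputs, dram_reads
```

For the four pressures in the example, the staged block reads 10 values from DRAM. Four threads each fetching their own four inputs would read 16.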
