
CUDA


Presentation Transcript


  1. CUDA Slides by David Kirk

  2. What is GPGPU ? • General Purpose computation using GPU in applications other than 3D graphics • GPU accelerates critical path of application • Data parallel algorithms leverage GPU attributes • Large data arrays, streaming throughput • Fine-grain SIMD parallelism • Low-latency floating point (FP) computation • Applications – see GPGPU.org • Game effects (FX) physics, image processing • Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

  3. Previous GPGPU Constraints • Dealing with graphics API • Working with the corner cases of the graphics API • Addressing modes • Limited texture size/dimension • Shader capabilities • Limited outputs • Instruction sets • Lack of integer & bit ops • Communication limited • Between pixels • Scatter: a[i] = p [Diagram: per-thread / per-shader / per-context pipeline — Input Registers, Fragment Program, Texture, Constants, Temp Registers, Output Registers, FB Memory]

  4. CUDA • “Compute Unified Device Architecture” • General purpose programming model • User kicks off batches of threads on the GPU • GPU = dedicated super-threaded, massively data parallel co-processor • Targeted software stack • Compute oriented drivers, language, and tools • Driver for loading computation programs into GPU • Standalone driver - optimized for computation • Interface designed for compute - graphics-free API • Data sharing with OpenGL buffer objects • Guaranteed maximum download & readback speeds • Explicit GPU memory management

  5. Parallel Computing on a GPU • NVIDIA GPU Computing Architecture • Via a separate HW interface • In laptops, desktops, workstations, servers • 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications • GPU parallelism is doubling every year • Programming model scales transparently • Programmable in C with CUDA tools • Multithreaded SPMD model uses application data parallelism and thread parallelism [Images: GeForce 8800, Tesla D870, Tesla S870]

  6. Extended C
     __device__ float filter[N];
     __global__ void convolve (float *image) {
       __shared__ float region[M];
       ...
       region[threadIdx] = image[i];
       __syncthreads();
       ...
       image[j] = result;
     }
     // Allocate GPU memory
     void *myimage = cudaMalloc(bytes);
     // 100 blocks, 10 threads per block
     convolve<<<100, 10>>>(myimage);
  • Declspecs: global, device, shared, local, constant • Keywords: threadIdx, blockIdx • Intrinsics: __syncthreads • Runtime API: memory, symbol, execution management • Function launch

  7. CUDA Programming Model: A Highly Multithreaded Coprocessor • The GPU is viewed as a compute device that: • Is a coprocessor to the CPU or host • Has its own DRAM (device memory) • Runs many threads in parallel • Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads • Differences between GPU and CPU threads • GPU threads are extremely lightweight • Very little creation overhead • GPU needs 1000s of threads for full efficiency • Multi-core CPU needs only a few

  8. Thread Batching: Grids and Blocks • A kernel is executed as a grid of thread blocks • All threads share data memory space • A thread block is a batch of threads that can cooperate with each other by: • Synchronizing their execution • For hazard-free shared memory accesses • Efficiently sharing data through a low latency shared memory • Two threads from two different blocks cannot cooperate [Diagram: host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block, e.g. Block (1, 1), is an array of threads] Courtesy: NVIDIA
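
A minimal sketch (not from the slides; the kernel name scale and the 256-thread block size are illustrative) of how each thread in such a grid picks its own data element from the block and thread IDs:

     // Hypothetical example: each thread in the grid handles one array element.
     __global__ void scale(float *data, float factor, int n)
     {
         // Global index built from the block and thread IDs described above.
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n)                      // guard against a partially filled last block
             data[i] = data[i] * factor;
     }

     // Host side: launch enough 256-thread blocks to cover n elements.
     // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);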

  9. Block and Thread IDs • Threads and blocks have IDs • So each thread can decide what data to work on • Block ID: 1D or 2D • Thread ID: 1D, 2D, or 3D • Simplifies memory addressing when processing multidimensional data • Image processing • Solving PDEs on volumes • … [Diagram: a grid of 2D blocks, with Block (1, 1) expanded into its 2D array of threads] Courtesy: NVIDIA
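
For 2D data such as images, the same idea extends to 2D block and thread indices. A hypothetical sketch (the kernel name brighten and the 16 x 16 block shape are assumptions, not from the slides):

     // Hypothetical 2D example: one thread per pixel.
     __global__ void brighten(float *img, int width, int height, float delta)
     {
         int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
         int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
         if (x < width && y < height)
             img[y * width + x] += delta;
     }

     // Host side: 16x16-thread blocks tiling the image.
     // dim3 block(16, 16);
     // dim3 grid((width + 15) / 16, (height + 15) / 16);
     // brighten<<<grid, block>>>(d_img, width, height, 0.1f);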

  10. CUDA Device Memory Space Overview • Each thread can: • R/W per-thread registers • R/W per-thread local memory • R/W per-block shared memory • R/W per-grid global memory • Read only per-grid constant memory • Read only per-grid texture memory • The host can R/W global, constant, and texture memories [Diagram: (Device) Grid with two blocks, each with shared memory plus per-thread registers and local memory; the host accesses global, constant, and texture memory]

  11. Global, Constant, and Texture Memories (Long Latency Accesses) • Global memory • Main means of communicating R/W data between host and device • Contents visible to all threads • Texture and constant memories • Constants initialized by host • Contents visible to all threads [Diagram: same device memory layout as the previous slide] Courtesy: NVIDIA

  12. CUDA – API

  13. CUDA Highlights: Easy and Lightweight • The API is an extension to the ANSI C programming language → low learning curve • The hardware is designed to enable lightweight runtime and driver → high performance

  14. CUDA Device Memory Allocation • cudaMalloc() • Allocates object in the device Global Memory • Requires two parameters • Address of a pointer to the allocated object • Size of allocated object • cudaFree() • Frees object from device Global Memory • Pointer to freed object [Diagram: same device memory layout as before; cudaMalloc()/cudaFree() manage objects in global memory]

  15. CUDA Device Memory Allocation (cont.) • Code example: • Allocate a 64 * 64 single precision float array • Attach the allocated storage to d_matrix • “d” is often used to indicate a device data structure
     const int BLOCK_SIZE = 64;
     float *d_matrix;
     int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
     cudaMalloc((void**)&d_matrix, size);
     cudaFree(d_matrix);
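
Both calls return a cudaError_t, so the host can check whether the allocation succeeded. A small, hypothetical sketch of that pattern (the error handling is an addition, not shown on the slide):

     // Hypothetical sketch: checking the error code returned by the allocation call.
     #include <cstdio>
     #include <cuda_runtime.h>

     int main()
     {
         float *d_matrix = NULL;
         size_t size = 64 * 64 * sizeof(float);

         cudaError_t err = cudaMalloc((void**)&d_matrix, size);
         if (err != cudaSuccess) {
             fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
             return 1;
         }
         // ... use d_matrix in kernels ...
         cudaFree(d_matrix);
         return 0;
     }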

  16. CUDA Host-Device Data Transfer • cudaMemcpy() • Memory data transfer • Requires four parameters • Pointer to destination • Pointer to source • Number of bytes copied • Type of transfer • Host to Host • Host to Device • Device to Host • Device to Device • Asynchronous in CUDA 1.1 [Diagram: same device memory layout as before; cudaMemcpy() moves data between host memory and device global memory]

  17. CUDA Host-Device Data Transfer (cont.) • Code example: • Transfer a 64 * 64 single precision float array • h_matrix is in host memory and d_matrix is in device memory • cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
     cudaMemcpy(d_matrix, h_matrix, size, cudaMemcpyHostToDevice);
     cudaMemcpy(h_matrix, d_matrix, size, cudaMemcpyDeviceToHost);
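
Putting the last two slides together, a hypothetical end-to-end sketch (the function name round_trip is illustrative) of allocating device memory, copying a host array in, and copying results back out:

     // Hypothetical round trip: host buffer -> device -> back to host.
     #include <cuda_runtime.h>

     void round_trip(float *h_matrix, int n_elems)
     {
         float *d_matrix = NULL;
         size_t size = n_elems * sizeof(float);

         cudaMalloc((void**)&d_matrix, size);
         cudaMemcpy(d_matrix, h_matrix, size, cudaMemcpyHostToDevice);  // host -> device
         // ... launch kernels that read/write d_matrix ...
         cudaMemcpy(h_matrix, d_matrix, size, cudaMemcpyDeviceToHost);  // device -> host
         cudaFree(d_matrix);
     }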

  18. Calling a Kernel Function – Thread Creation • A kernel function must be called with an execution configuration:
     __global__ void KernelFunc(...);
     dim3 DimGrid(100, 50);        // 5000 thread blocks
     dim3 DimBlock(4, 8, 8);       // 256 threads per block
     size_t SharedMemBytes = 64;   // 64 bytes of shared memory
     KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
  • Any call to a kernel function is asynchronous from CUDA 1.1 on; explicit synchronization is needed for blocking
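
Because the launch returns immediately, the host must synchronize before it can safely read results computed by the kernel (a cudaMemcpy from the device also implies this wait). A hedged sketch: in current CUDA the blocking call is cudaDeviceSynchronize(), while releases contemporary with these slides used cudaThreadSynchronize() for the same purpose:

     // Hypothetical sketch: launch, then block the host until the kernel finishes.
     KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(/* args */);
     cudaDeviceSynchronize();   // wait for all preceding device work to complete
     // (older CUDA releases: cudaThreadSynchronize();)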

  19. Memory Model

  20. Why Use the GPU for Computing ? • The GPU has evolved into a very flexible and powerful processor: • It’s programmable using high-level languages • It supports 32-bit floating point precision • It offers lots of GFLOPS • GPU in every PC and workstation [Chart: GFLOPS over successive NVIDIA GPUs — G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

  21. What is Behind such an Evolution? • The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about) • So, more transistors can be devoted to data processing rather than data caching and flow control • The fast-growing video game industry exerts strong economic pressure that forces constant innovation [Diagram: CPU die dominated by control logic and cache vs. GPU die dominated by ALUs]

  22. split personality n. Two distinct personalities in the same entity, each of which prevails at a particular time.

  23. G80 Thread Computing Pipeline • The future of GPUs is programmable processing • So – build the architecture around the processor • Processors execute computing threads • Alternative operating mode specifically for computing [Diagram: G80 block diagram — host, input assembler, thread execution manager with vertex/geometry/pixel thread issue, arrays of SPs with texture units (TF) and L1 caches, parallel data caches, L2 caches, load/store units, and the FB/global memory]

  24. GeForce 8800 Series Technical Specs • Maximum number of threads per block: 512 • Maximum size of each dimension of a grid: 65,535 • Number of streaming multiprocessors (SM): • GeForce 8800 GTX: 16 @ 675 MHz • GeForce 8800 GTS: 12 @ 600 MHz • Device memory: • GeForce 8800 GTX: 768 MB • GeForce 8800 GTS: 640 MB • Shared memory per multiprocessor: 16 KB divided into 16 banks • Constant memory: 64 KB • Warp size: 32 threads (16 Warps/Block)
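
Rather than hard-coding these limits, a program can query them at runtime. A hypothetical sketch using cudaGetDeviceProperties() (the printed subset of fields is an arbitrary choice):

     // Hypothetical sketch: querying device limits instead of hard-coding them.
     #include <cstdio>
     #include <cuda_runtime.h>

     int main()
     {
         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, 0);   // properties of device 0
         printf("%s: %d SMs, %zu bytes shared mem/block, warp size %d, max %d threads/block\n",
                prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock,
                prop.warpSize, prop.maxThreadsPerBlock);
         return 0;
     }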

  25. What is the GPU Good at? • The GPU is good at data-parallel processing • The same computation executed on many data elements in parallel – low control flow overhead with high SP floating point arithmetic intensity • Many calculations per memory access • Currently also need high floating point to integer ratio • High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches – Still need to avoid bandwidth saturation!

  26. Drawbacks of (legacy) GPGPU Model: Hardware Limitations • Memory accesses are done as pixels • Only gather: can read data from other pixels • No scatter: can only write to one pixel • Less programming flexibility [Diagram: SIMD control/ALU/cache clusters over DRAM locations d0–d7; threads can read from arbitrary locations (gather) but each can write only its own output pixel (no scatter)]

  27. Drawbacks of (legacy) GPGPU Model: Hardware Limitations • Applications can easily be limited by DRAM memory bandwidth • Waste of computation power due to data starvation [Diagram: ALU clusters stalled waiting on DRAM locations d0–d7]

  28. CUDA Highlights: Scatter • CUDA provides generic DRAM memory addressing • Gather and scatter: no longer limited to writing one pixel • More programming flexibility [Diagram: ALU clusters reading from and writing to arbitrary DRAM locations d0–d7]

  29. CUDA Highlights: On-Chip Shared Memory • CUDA enables access to a parallel on-chip shared memory for efficient inter-thread data sharing • Big memory bandwidth savings [Diagram: ALU clusters staging DRAM data d0–d7 through per-cluster shared memory]

  30. A Common Programming Pattern • Local and global memory reside in device memory (DRAM) - much slower access than shared memory • So, a profitable way of performing computation on the device is to block data to take advantage of fast shared memory: • Partition data into data subsets that fit into shared memory • Handle each data subset with one thread block by: • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism • Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element • Copying results from shared memory to global memory
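
A hypothetical sketch of this pattern (the kernel name, tile size, and the doubling computation are illustrative placeholders, not from the slides):

     // Hypothetical sketch: stage a tile in shared memory, compute on it, write back.
     #define TILE 256

     __global__ void process(float *g_in, float *g_out, int n)
     {
         __shared__ float tile[TILE];                 // fast on-chip storage for one subset
         int i = blockIdx.x * TILE + threadIdx.x;

         if (i < n)
             tile[threadIdx.x] = g_in[i];             // 1) load subset from global memory
         __syncthreads();                             //    wait until the whole tile is loaded

         float result = 0.0f;
         if (i < n)
             result = tile[threadIdx.x] * 2.0f;       // 2) compute on the subset (illustrative)

         if (i < n)
             g_out[i] = result;                       // 3) copy results back to global memory
     }

     // Launched with one TILE-thread block per tile:
     // process<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);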

  31. A Common Programming Pattern (cont.) • Texture and constant memory also reside in device memory (DRAM) - much slower access than shared memory • But… cached! • Highly efficient access for read-only data • Carefully divide data according to access patterns • R/O, no structure → constant memory • R/O, array structured → texture memory • R/W, shared within block → shared memory • R/W registers spill to local memory • R/W inputs/results → global memory

  32. G80 Hardware Implementation: A Set of SIMD Multiprocessors • The device is a set of 16 multiprocessors • Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture – shared instruction unit • At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp • The number of threads in a warp is the warp size [Diagram: device containing Multiprocessors 1..N, each with Processors 1..M sharing one instruction unit]

  33. Hardware Implementation: Memory Architecture • The local, global, constant, and texture spaces are regions of device memory • Each multiprocessor has: • A set of 32-bit registers per processor • On-chip shared memory • Where the shared memory space resides • A read-only constant cache • To speed up access to the constant memory space • A read-only texture cache • To speed up access to the texture memory space [Diagram: each multiprocessor with per-processor registers, shared memory, constant cache, and texture cache, backed by device memory holding the global, constant, and texture spaces]

  34. Hardware Implementation: Execution Model (review) • Each thread block of a grid is split into warps, each of which gets executed by one multiprocessor (SM) • The device processes only one grid at a time • Each thread block is executed by one multiprocessor • So that the shared memory space resides in the on-chip shared memory • A multiprocessor can execute multiple blocks concurrently • Shared memory and registers are partitioned among the threads of all concurrent blocks • So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

  35. Threads, Warps, Blocks • There are (up to) 32 threads in a Warp • Only <32 when there are fewer than 32 total threads • There are (up to) 16 Warps in a Block • Each Block (and thus, each Warp) executes on a single SM • G80 has 16 SMs • At least 16 Blocks required to “fill” the device • More is better • If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM

  36. Language Extensions:Built-in Variables • dim3 gridDim; • Dimensions of the grid in blocks (gridDim.z unused) • dim3 blockDim; • Dimensions of the block in threads • dim3 blockIdx; • Block index within the grid • dim3 threadIdx; • Thread index within the block
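
A hypothetical sketch that uses all four built-in variables: a "grid-stride" loop in which a fixed-size grid covers an array of arbitrary length:

     // Hypothetical sketch: grid-stride loop built from the four built-in variables.
     __global__ void add_one(float *data, int n)
     {
         int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
         int stride = gridDim.x * blockDim.x;                 // total threads in the grid
         for (int i = idx; i < n; i += stride)
             data[i] += 1.0f;
     }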

  37. Common Runtime Component • Provides: • Built-in vector types • A subset of the C runtime library supported in both host and device codes

  38. Common Runtime Component:Built-in Vector Types • [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4] • Structures accessed with x, y, z, w fields: uint4 param; int y = param.y; • dim3 • Based on uint3 • Used to specify dimensions
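
A hypothetical sketch of the vector types and of dim3 in a launch configuration (the kernel and its values are illustrative, not from the slides):

     // Hypothetical sketch: built-in vector types and the make_* constructors.
     __global__ void init_points(float4 *points, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n)
             points[i] = make_float4(0.0f, 1.0f, 2.0f, 3.0f);  // fields: x, y, z, w
     }

     // Host side: dim3 (based on uint3) specifies launch dimensions;
     // unspecified components default to 1.
     // dim3 block(256);
     // dim3 grid((n + 255) / 256);
     // init_points<<<grid, block>>>(d_points, n);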

  39. Common Runtime Component:Mathematical Functions • pow, sqrt, cbrt, hypot • exp, exp2, expm1 • log, log2, log10, log1p • sin, cos, tan, asin, acos, atan, atan2 • sinh, cosh, tanh, asinh, acosh, atanh • ceil, floor, trunc, round • Etc. • When executed on the host, a given function uses the C runtime implementation if available • These functions are only supported for scalar types, not vector types

  40. Host Runtime Component:Memory Management • Device memory allocation • cudaMalloc(), cudaFree() • Memory copy from host to device, device to host, device to device • cudaMemcpy(), cudaMemcpy2D(), cudaMemcpyToSymbol(), cudaMemcpyFromSymbol() • Memory addressing • cudaGetSymbolAddress()
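
A hypothetical sketch of a symbol-based copy: uploading host data into a __constant__ device variable with cudaMemcpyToSymbol() (the symbol name coeffs and its size are assumptions):

     // Hypothetical sketch: copying host data into a __constant__ device symbol.
     #include <cuda_runtime.h>

     __constant__ float coeffs[16];          // read-only, per-grid constant memory

     void upload_coeffs(const float *h_coeffs)
     {
         // Copy 16 floats from host memory to the device symbol 'coeffs'.
         cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
     }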

  41. Device Runtime Component:Mathematical Functions • Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x)) • __pow • __log, __log2, __log10 • __exp • __sin, __cos, __tan
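
In the CUDA toolkit these single-precision device intrinsics carry an f suffix (__sinf, __cosf, __expf, __logf, __powf). A hypothetical sketch contrasting the fast and the accurate versions; nvcc's -use_fast_math option rewrites the standard calls into the fast intrinsics automatically:

     // Hypothetical sketch: fast intrinsic vs. the standard math function.
     __global__ void waveforms(float *out_fast, float *out_accurate, const float *x, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) {
             out_fast[i]     = __sinf(x[i]);   // fast, lower-accuracy device intrinsic
             out_accurate[i] = sinf(x[i]);     // slower, more accurate version
         }
     }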

  42. Device Runtime Component:Synchronization Function • void __syncthreads(); • Synchronizes all threads in a block • Once all threads have reached this point, execution resumes normally • Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory • Allowed in conditional constructs only if the conditional is uniform across the entire thread block
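
A hypothetical sketch of the hazard __syncthreads() prevents: every thread writes one shared-memory element, then reads an element written by a different thread (the kernel name and block size are illustrative; it assumes a launch with exactly BLOCK threads per block and a data length that is a multiple of BLOCK):

     // Hypothetical sketch: __syncthreads() separates the shared-memory writes
     // from the reads, avoiding a read-after-write hazard.
     #define BLOCK 256

     __global__ void reverse_within_block(float *data)
     {
         __shared__ float buf[BLOCK];
         int i = blockIdx.x * BLOCK + threadIdx.x;

         buf[threadIdx.x] = data[i];              // every thread writes one element
         __syncthreads();                         // wait for all writes to finish

         data[i] = buf[BLOCK - 1 - threadIdx.x];  // now safe to read another thread's element
     }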

  43. Some Useful Information on Tools

  44. Compilation • Any source file containing CUDA language extensions must be compiled with nvcc • nvcc is a compiler driver • Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ... • nvcc can output: • Either C code • That must then be compiled with the rest of the application using another tool • Or object code directly

  45. Linking • Any executable with CUDA code requires two dynamic libraries: • The CUDA runtime library (cudart) • The CUDA core library (cuda)

  46. Debugging Using the Device Emulation Mode • An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime • No need for a device or the CUDA driver • Each device thread is emulated with a host thread • When running in device emulation mode, one can: • Use host native debug support (breakpoints, inspection, etc.) • Access any device-specific data from host code and vice-versa • Call any host function from device code (e.g. printf) and vice-versa • Detect deadlock situations caused by improper usage of __syncthreads

  47. Device Emulation Mode Pitfalls • Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results • Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode • Results of floating-point computations will slightly differ because of: • Different compiler outputs, instruction sets • Use of extended precision for intermediate results • There are various options to force strict single precision on the host
