This resource discusses high-performance processors, focusing on Graphics Processing Units (GPUs) and general-purpose computing on graphics processing units (GPGPU). The course covers execution models, contrasting CPU and GPU designs, and higher-level programming abstractions such as CUDA and Thrust. Key topics include memory architecture, streaming multiprocessors, and kernel launching in CUDA, providing practical examples like vector sorting and summation using the Thrust library. This information is essential for understanding advanced GPU architectures and programming approaches.
ECE 569 High Performance Processors and Systems (06 Feb 2014)
• Administrative
  • HW2 available, due next Thursday 2/13 @ start of class
• GPUs
• Thrust library
• Execution model
CPU vs. GPU
• Fundamentally different designs: the GPU trades the CPU's large control logic and cache for simpler, slower, massively-more cores
[Figure: block diagram of a CPU (Control, ALUs, Cache, DRAM) next to a GPU (many small ALUs, DRAM)]
GPGPU languages
• CUDA
• OpenCL
• Microsoft DirectCompute
• …
Higher-level Abstractions
• Microsoft AMP
• Thrust
• …
Thrust
• STL-like approach to GPU programming
• A template library for CUDA
• Installed as part of CUDA 4.0 and newer
• 80-20 rule: make 80% of CUDA programming easier…

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
      int N = 100;

      thrust::host_vector<int> h_vec(N);
      thrust::generate(h_vec.begin(), h_vec.end(), rand);       // generate data on host
      thrust::device_vector<int> d_vec = h_vec;                 // copy to device
      thrust::sort(d_vec.begin(), d_vec.end());                 // sort on device
      thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());  // copy results back to host

      return 0;
    }
Demo
• Sorting in Thrust…
Demo
• Summing a vector in Thrust…

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <stdio.h>

    int main()
    {
      int N = 100;

      thrust::host_vector<int> h_vec(N);
      for (int i = 0; i < N; i++)        // fill with 1, 2, 3, ..., N:
        h_vec[i] = (i + 1);

      thrust::device_vector<int> d_vec = h_vec;   // copy to device

      int sum = thrust::reduce(d_vec.begin(), d_vec.end(),
                               0, thrust::plus<int>());   // parallel reduction on device

      printf("** Sum: %d\n\n", sum);
      return 0;
    }
High-level GPU Architecture
[Diagram: host memory on the CPU side, connected to the GPU and its global memory]
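To make the host-memory / global-memory split concrete, here is a minimal CUDA sketch (not from the slides; the array name and size are arbitrary) that allocates a buffer in the GPU's global memory and copies data back and forth from host memory:

    #include <cuda_runtime.h>

    int main()
    {
      const int N = 100;
      int h_data[N];                          // lives in host (CPU) memory
      for (int i = 0; i < N; i++) h_data[i] = i;

      int *d_data = NULL;                     // will point into GPU global memory
      cudaMalloc((void**)&d_data, N * sizeof(int));

      // host memory -> global memory
      cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

      // ... kernels launched here would read/write d_data in global memory ...

      // global memory -> host memory
      cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

      cudaFree(d_data);
      return 0;
    }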
Streaming Multiprocessor (SM)
• 8 Streaming Processors (SP)
• 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
  • 1 to 512 threads active
  • Shared instruction fetch per 32 threads
  • Cover latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• DRAM texture and memory access
[Diagram: SM with Instruction L1, Data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs, connected to global memory]
Memory Architecture
• The local, global, constant, and texture spaces are regions of device memory (DRAM)
• Each multiprocessor has:
  • A set of 32-bit registers per processor
  • On-chip shared memory
  • A read-only constant cache
  • A read-only texture cache
  • Data cache (Fermi only)
[Diagram: device with Streaming Multiprocessors 1..N; each SM contains registers per processor (Processor 1..M), shared memory, an instruction unit, constant and texture caches (plus a data cache on Fermi), all backed by device memory holding global, constant, and texture memory]
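As a rough illustration (my sketch, not from the slides; all names are made up) of where these memory spaces show up in CUDA C, a kernel might touch registers, shared memory, constant memory, and global memory like this:

    __constant__ float c_scale;                 // constant memory, served by the constant cache

    __global__ void scaleIntoTile(const float *in, float *out, int n)
    {
      __shared__ float tile[256];               // on-chip shared memory, one copy per block
                                                // (assumes the block has at most 256 threads)

      int i = blockIdx.x * blockDim.x + threadIdx.x;   // i is held in a per-thread register
      if (i < n)
        tile[threadIdx.x] = in[i] * c_scale;    // in[] and out[] live in global memory (DRAM)
      __syncthreads();

      if (i < n)
        out[i] = tile[threadIdx.x];
    }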
Terminology
• device = GPU = set of multiprocessors
• Streaming Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
• Warp = a subset of a thread block (typically 32 threads) that forms the basic unit of scheduling
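To tie the terminology to source code, here is a minimal sketch (mine, not from the slides; the kernel name and sizes are arbitrary): one kernel is launched as a grid of 40 thread blocks, each block holding 256 threads, i.e. 8 warps of 32:

    #include <cuda_runtime.h>

    __global__ void addOne(int *data, int n)             // kernel = the GPU program
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;     // this thread's position in the grid
      if (i < n)
        data[i] += 1;
    }

    int main()
    {
      const int n = 40 * 256;
      int *d_data;
      cudaMalloc((void**)&d_data, n * sizeof(int));
      cudaMemset(d_data, 0, n * sizeof(int));

      addOne<<<40, 256>>>(d_data, n);                    // grid of 40 blocks, 256 threads per block
      cudaDeviceSynchronize();

      cudaFree(d_data);
      return 0;
    }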
NVIDIA GPU Execution Model
I. SIMD execution of a warp
II. Multithreaded execution across different warps / blocks
III. Each thread block mapped to a single SM
[Diagram: thread blocks mapped onto SMs, all backed by global memory]
SIMT = Single-Instruction Multiple Threads
• Coined by Nvidia
• Combines SIMD execution within a warp with SPMD execution across warps
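One practical consequence of SIMD execution within a warp is branch divergence. The sketch below (my illustration, not from the slides) has threads of the same 32-thread warp take different branches, so the hardware executes the two paths one after another with the inactive lanes masked off:

    __global__ void divergentBranch(int *out, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;

      // Even and odd lanes of the same warp follow different paths;
      // the warp serializes the two paths, masking off whichever lanes are inactive.
      if (threadIdx.x % 2 == 0)
        out[i] = 2 * i;
      else
        out[i] = 3 * i;
    }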
CUDA Thread Block Overview
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  • Block size: 1 to 512 concurrent threads
  • Block shape: 1D, 2D, or 3D
  • Block dimensions in threads
• Threads have thread id numbers within the block
  • Thread program uses thread id to select work and address shared data (see the sketch after this slide)
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
• Each block can execute in any order relative to other blocks!
[Diagram: CUDA thread block with thread ids 0, 1, 2, 3, …, m, each running the thread program]
Courtesy: John Nickolls, NVIDIA
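As a concrete (hypothetical) example of threads using their ids to select work and cooperating through shared memory within a block, here is a partial-sum kernel; the names and the fixed block size of 256 (assumed to be a power of two) are assumptions of the sketch:

    #define BLOCK_SIZE 256

    __global__ void blockSum(const int *in, int *blockSums, int n)
    {
      __shared__ int partial[BLOCK_SIZE];               // shared by every thread in this block

      int i = blockIdx.x * blockDim.x + threadIdx.x;    // thread id selects this thread's element
      partial[threadIdx.x] = (i < n) ? in[i] : 0;
      __syncthreads();                                  // wait until all threads have stored a value

      // Tree reduction within the block; only threads in the same block can cooperate like this.
      for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
          partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
      }

      if (threadIdx.x == 0)                             // one result per block
        blockSums[blockIdx.x] = partial[0];
    }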
Launching a Kernel Function
• A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);

    dim3   DimGrid(100, 50);        // 5000 thread blocks
    dim3   DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;     // 64 bytes of shared memory

    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

• The shared-memory size argument is only for data that is not statically allocated
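The third launch parameter sizes shared memory that is allocated dynamically at launch; inside the kernel it is picked up through an extern __shared__ declaration. A minimal sketch, assuming a made-up kernel and sizes (not the slide's KernelFunc):

    #include <cuda_runtime.h>

    // scratch[] has no compile-time size; the 3rd <<<...>>> argument supplies its size in bytes.
    __global__ void scaleViaShared(float *data, float factor)
    {
      extern __shared__ float scratch[];
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      scratch[threadIdx.x] = data[i];
      __syncthreads();
      data[i] = scratch[threadIdx.x] * factor;
    }

    int main()
    {
      const int threadsPerBlock = 64, blocks = 16;
      float *d_data;
      cudaMalloc((void**)&d_data, blocks * threadsPerBlock * sizeof(float));
      cudaMemset(d_data, 0, blocks * threadsPerBlock * sizeof(float));

      size_t sharedBytes = threadsPerBlock * sizeof(float);   // one float of scratch per thread
      scaleViaShared<<<blocks, threadsPerBlock, sharedBytes>>>(d_data, 2.0f);
      cudaDeviceSynchronize();

      cudaFree(d_data);
      return 0;
    }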
Example: Thread Scheduling on G80
• Each block is executed as 32-thread warps
• Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  • A total of 768 threads (the maximum)
  • Each block consists of 256 / 32 = 8 warps
  • There are 8 * 3 = 24 warps
[Diagram: Block 1, 2, and 3 warps (each warp = threads t0 t1 t2 … t31) feeding the SM's instruction fetch/dispatch, shared memory, SPs, and SFUs]
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  • Warps whose next instruction has its operands ready for consumption are eligible for execution
  • Eligible warps are selected for execution by a prioritized scheduling policy
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
• If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency
  • (Each warp executes 4 instructions * 4 cycles = 16 cycles between memory accesses, and 200 / 16 = 12.5, so 13 warps keep the SM busy while any one warp is waiting)
[Diagram: the SM's multithreaded warp scheduler interleaving instructions over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96, ...]
How is context switching so efficient?
• Large register file (16K registers/block)
• Each thread is assigned a "window" of physical registers
• Works if the entire thread block's registers do not exceed capacity (otherwise, the compiler fails)
• Similarly, shared memory requirements must not exceed capacity for all blocks simultaneously scheduled
[Diagram: one register file partitioned into windows for Block 0 Thread 0, Thread 1, …, Thread 256, up through Block 8 Thread 0, Thread 1, …, Thread 256]
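A quick back-of-the-envelope check using the numbers on these slides (my arithmetic, not from the original): with a 16K-entry register file and up to 768 resident threads (24 warps * 32 threads, as in the G80 example above), each thread can use on average at most about 16,384 / 768, roughly 21 registers, before the full set of blocks no longer fits and fewer blocks can be scheduled on the SM at once.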
[Diagram: SM 0 and SM 1, each with an MT instruction-issue unit, SPs, and shared memory, each running the threads t0 t1 t2 … tm of its assigned blocks]
• Threads run concurrently
• SM maintains thread/block id #s
• SM manages/schedules thread execution