This resource discusses high-performance processors, focusing on Graphics Processing Units (GPUs) and general-purpose computing on graphics processing units (GPGPU). The course covers execution models, contrasting CPU and GPU designs, and higher-level programming abstractions such as CUDA and Thrust. Key topics include memory architecture, streaming multiprocessors, and kernel launching in CUDA, providing practical examples like vector sorting and summation using the Thrust library. This information is essential for understanding advanced GPU architectures and programming approaches.
ECE 569 High Performance Processors and Systems (06 Feb 2014)
• Administrative
  • HW2 available, due next Thursday 2/13 @ start of class
• GPUs
• Thrust library
• Execution model
CPU vs. GPU
• Fundamentally different designs: the GPU trades the CPU's large control logic and cache for simpler, slower, massively-more cores
[Figure: block diagram of a CPU (Control, ALUs, Cache, DRAM) next to a GPU (many small ALUs, DRAM)]
GPGPU languages
• CUDA
• OpenCL
• Microsoft DirectCompute
• …
Higher-level Abstractions
• Microsoft AMP
• Thrust
• …
Thrust
• STL-like approach to GPU programming
• A template library for CUDA
• Installed as part of CUDA 4.0 and newer
• 80-20 rule: make 80% of CUDA programming easier…

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
      int N = 100;

      thrust::host_vector<int> h_vec(N);
      thrust::generate(h_vec.begin(), h_vec.end(), rand);       // generate data on host
      thrust::device_vector<int> d_vec = h_vec;                 // copy to device
      thrust::sort(d_vec.begin(), d_vec.end());                 // sort on device
      thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());  // copy results back to host

      return 0;
    }
Demo
• Sorting in Thrust…
Demo
• Summing a vector in Thrust…

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <stdio.h>

    int main()
    {
      int N = 100;

      thrust::host_vector<int> h_vec(N);
      for (int i = 0; i < N; i++)        // fill with 1, 2, 3, ..., N:
        h_vec[i] = (i + 1);

      thrust::device_vector<int> d_vec = h_vec;   // copy to device

      int sum = thrust::reduce(d_vec.begin(), d_vec.end(),
                               0, thrust::plus<int>());   // parallel reduction on device

      printf("** Sum: %d\n\n", sum);
      return 0;
    }
High-level GPU Architecture
[Diagram: host memory on the CPU side, connected to the GPU and its global memory]
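To make the host-memory / global-memory split concrete, here is a minimal CUDA sketch (not from the slides; the array name and size are arbitrary) that allocates a buffer in the GPU's global memory and copies data back and forth from host memory:

    #include <cuda_runtime.h>

    int main()
    {
      const int N = 100;
      int h_data[N];                          // lives in host (CPU) memory
      for (int i = 0; i < N; i++) h_data[i] = i;

      int *d_data = NULL;                     // will point into GPU global memory
      cudaMalloc((void**)&d_data, N * sizeof(int));

      // host memory -> global memory
      cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

      // ... kernels launched here would read/write d_data in global memory ...

      // global memory -> host memory
      cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

      cudaFree(d_data);
      return 0;
    }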
Streaming Multiprocessor (SM)
• 8 Streaming Processors (SP)
• 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
  • 1 to 512 threads active
  • Shared instruction fetch per 32 threads
  • Cover latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• DRAM texture and memory access
[Diagram: SM with Instruction L1, Data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs, connected to global memory]
Memory Architecture
• The local, global, constant, and texture spaces are regions of device memory (DRAM)
• Each multiprocessor has:
  • A set of 32-bit registers per processor
  • On-chip shared memory
  • A read-only constant cache
  • A read-only texture cache
  • Data cache (Fermi only)
[Diagram: device with Streaming Multiprocessors 1..N; each SM contains registers per processor (Processor 1..M), shared memory, an instruction unit, constant and texture caches (plus a data cache on Fermi), all backed by device memory holding global, constant, and texture memory]
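As a rough illustration (my sketch, not from the slides; all names are made up) of where these memory spaces show up in CUDA C, a kernel might touch registers, shared memory, constant memory, and global memory like this:

    __constant__ float c_scale;                 // constant memory, served by the constant cache

    __global__ void scaleIntoTile(const float *in, float *out, int n)
    {
      __shared__ float tile[256];               // on-chip shared memory, one copy per block
                                                // (assumes the block has at most 256 threads)

      int i = blockIdx.x * blockDim.x + threadIdx.x;   // i is held in a per-thread register
      if (i < n)
        tile[threadIdx.x] = in[i] * c_scale;    // in[] and out[] live in global memory (DRAM)
      __syncthreads();

      if (i < n)
        out[i] = tile[threadIdx.x];
    }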
Terminology
• device = GPU = set of multiprocessors
• Streaming Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
• Warp = a subset of a thread block (typically 32 threads) that forms the basic unit of scheduling
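To tie the terminology to source code, here is a minimal sketch (mine, not from the slides; the kernel name and sizes are arbitrary): one kernel is launched as a grid of 40 thread blocks, each block holding 256 threads, i.e. 8 warps of 32:

    #include <cuda_runtime.h>

    __global__ void addOne(int *data, int n)             // kernel = the GPU program
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;     // this thread's position in the grid
      if (i < n)
        data[i] += 1;
    }

    int main()
    {
      const int n = 40 * 256;
      int *d_data;
      cudaMalloc((void**)&d_data, n * sizeof(int));
      cudaMemset(d_data, 0, n * sizeof(int));

      addOne<<<40, 256>>>(d_data, n);                    // grid of 40 blocks, 256 threads per block
      cudaDeviceSynchronize();

      cudaFree(d_data);
      return 0;
    }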
NVIDIA GPU Execution Model
I. SIMD execution of a warp
II. Multithreaded execution across different warps / blocks
III. Each thread block mapped to a single SM
[Diagram: thread blocks mapped onto SMs, all backed by global memory]
SIMT = Single-Instruction Multiple Threads
• Coined by Nvidia
• Combines SIMD execution within a warp with SPMD execution across warps
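One practical consequence of SIMD execution within a warp is branch divergence. The sketch below (my illustration, not from the slides) has threads of the same 32-thread warp take different branches, so the hardware executes the two paths one after another with the inactive lanes masked off:

    __global__ void divergentBranch(int *out, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;

      // Even and odd lanes of the same warp follow different paths;
      // the warp serializes the two paths, masking off whichever lanes are inactive.
      if (threadIdx.x % 2 == 0)
        out[i] = 2 * i;
      else
        out[i] = 3 * i;
    }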
CUDA Thread Block Overview
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  • Block size: 1 to 512 concurrent threads
  • Block shape: 1D, 2D, or 3D
  • Block dimensions in threads
• Threads have thread id numbers within the block
  • Thread program uses thread id to select work and address shared data (see the sketch after this slide)
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
• Each block can execute in any order relative to other blocks!
[Diagram: CUDA thread block with thread ids 0, 1, 2, 3, …, m, each running the thread program]
Courtesy: John Nickolls, NVIDIA
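As a concrete (hypothetical) example of threads using their ids to select work and cooperating through shared memory within a block, here is a partial-sum kernel; the names and the fixed block size of 256 (assumed to be a power of two) are assumptions of the sketch:

    #define BLOCK_SIZE 256

    __global__ void blockSum(const int *in, int *blockSums, int n)
    {
      __shared__ int partial[BLOCK_SIZE];               // shared by every thread in this block

      int i = blockIdx.x * blockDim.x + threadIdx.x;    // thread id selects this thread's element
      partial[threadIdx.x] = (i < n) ? in[i] : 0;
      __syncthreads();                                  // wait until all threads have stored a value

      // Tree reduction within the block; only threads in the same block can cooperate like this.
      for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
          partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
      }

      if (threadIdx.x == 0)                             // one result per block
        blockSums[blockIdx.x] = partial[0];
    }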
Launching a Kernel Function
• A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);

    dim3   DimGrid(100, 50);        // 5000 thread blocks
    dim3   DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;     // 64 bytes of shared memory

    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

• The shared-memory size argument is only for data that is not statically allocated
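The third launch parameter sizes shared memory that is allocated dynamically at launch; inside the kernel it is picked up through an extern __shared__ declaration. A minimal sketch, assuming a made-up kernel and sizes (not the slide's KernelFunc):

    #include <cuda_runtime.h>

    // scratch[] has no compile-time size; the 3rd <<<...>>> argument supplies its size in bytes.
    __global__ void scaleViaShared(float *data, float factor)
    {
      extern __shared__ float scratch[];
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      scratch[threadIdx.x] = data[i];
      __syncthreads();
      data[i] = scratch[threadIdx.x] * factor;
    }

    int main()
    {
      const int threadsPerBlock = 64, blocks = 16;
      float *d_data;
      cudaMalloc((void**)&d_data, blocks * threadsPerBlock * sizeof(float));
      cudaMemset(d_data, 0, blocks * threadsPerBlock * sizeof(float));

      size_t sharedBytes = threadsPerBlock * sizeof(float);   // one float of scratch per thread
      scaleViaShared<<<blocks, threadsPerBlock, sharedBytes>>>(d_data, 2.0f);
      cudaDeviceSynchronize();

      cudaFree(d_data);
      return 0;
    }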
Example: Thread Scheduling on G80
• Each block is executed as 32-thread warps
• Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  • A total of 768 threads (the maximum)
  • Each block consists of 256 / 32 = 8 warps
  • There are 8 * 3 = 24 warps
[Diagram: Block 1, 2, and 3 warps (each warp = threads t0 t1 t2 … t31) feeding the SM's instruction fetch/dispatch, shared memory, SPs, and SFUs]
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  • Warps whose next instruction has its operands ready for consumption are eligible for execution
  • Eligible warps are selected for execution by a prioritized scheduling policy
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
• If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency
  • (Each warp executes 4 instructions * 4 cycles = 16 cycles between memory accesses, and 200 / 16 = 12.5, so 13 warps keep the SM busy while any one warp is waiting)
[Diagram: the SM's multithreaded warp scheduler interleaving instructions over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96, ...]
How is context switching so efficient?
• Large register file (16K registers/block)
• Each thread is assigned a "window" of physical registers
• Works if the entire thread block's registers do not exceed capacity (otherwise, the compiler fails)
• Similarly, shared memory requirements must not exceed capacity for all blocks simultaneously scheduled
[Diagram: one register file partitioned into windows for Block 0 Thread 0, Thread 1, …, Thread 256, up through Block 8 Thread 0, Thread 1, …, Thread 256]
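A quick back-of-the-envelope check using the numbers on these slides (my arithmetic, not from the original): with a 16K-entry register file and up to 768 resident threads (24 warps * 32 threads, as in the G80 example above), each thread can use on average at most about 16,384 / 768, roughly 21 registers, before the full set of blocks no longer fits and fewer blocks can be scheduled on the SM at once.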
[Diagram: SM 0 and SM 1, each with an MT instruction-issue unit, SPs, and shared memory, each running the threads t0 t1 t2 … tm of its assigned blocks]
• Threads run concurrently
• SM maintains thread/block id #s
• SM manages/schedules thread execution