
Introduction to GPU Programming



  1. Introduction to GPU Programming Volodymyr (Vlad) Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT)

  2. Tutorial Goals • Become familiar with NVIDIA GPU architecture • Become familiar with the NVIDIA GPU application development flow • Be able to write and run a simple NVIDIA GPU application in CUDA

  3. Tutorial Outline • Introduction (15 minutes) • Why use Graphics Processing Units (GPUs) for general-purpose computing • Modern GPU architecture • NVIDIA • GPU programming • Libraries, CUDA C, OpenCL, PGI x64+GPU • Hands-on: getting started with NCSA GPU cluster (15 minutes) • Cluster architecture overview • How to login and check out a node • How to compile and run an existing application • Hands-on: Anatomy of a GPU application (25 minutes) • Host side • Device side • CUDA programming model • Hands-on: Porting matrix multiplier to GPU (25 minutes) • More on CUDA programming (40 minutes)

  4. Introduction • Why use Graphics Processing Units (GPUs) for general-purpose computing • Modern GPU architecture • NVIDIA • GPU programming • Libraries • CUDA C • OpenCL • PGI x64+GPU

  5. Why GPUs? Raw Performance Trends • [Graph courtesy of NVIDIA: raw compute performance rising steeply across GPU generations, from the 5800 and 5950 Ultra through the 6800 Ultra to the 7800 GTX]

  6. Why GPUs? Memory Bandwidth Trends • [Graph courtesy of NVIDIA: memory bandwidth growing across GPU generations, from the 5800, 5950 Ultra, and 6800 Ultra through the 7800 GTX and 7900 GTX to the 8800 GTX, 8800 Ultra, GTX 280, and GTX 285]

  7. GPU vs. CPU Silicon Use Graph is courtesy of NVIDIA

  8. NVIDIA GPU Architecture • A scalable array of multithreaded Streaming Multiprocessors (SMs), each SM consists of • 8 Scalar Processor (SP) cores • 2 special function units for transcendentals • A multithreaded instruction unit • On-chip shared memory • GDDR3 SDRAM • PCIe interface Figure is courtesy of NVIDIA

  9. NVIDIA GeForce 9400M G GPU • 16 streaming processors arranged as 2 streaming multiprocessors • At 0.8 GHz this provides • 54 GFLOPS in single-precision (SP) • 128-bit interface to off-chip GDDR3 memory • 21 GB/s bandwidth • [Block diagram: one TPC with a geometry controller and SMC; two SMs, each with I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory; texture units with a texture L1; a 128-bit interconnect to ROP/L2 and DRAM]

  10. NVIDIA Tesla C1060 GPU • 240 streaming processors arranged as 30 streaming multiprocessors • At 1.3 GHz this provides • 1 TFLOPS SP • 86.4 GFLOPS DP • 512-bit interface to off-chip GDDR3 memory • 102 GB/s bandwidth • [Block diagram: 10 TPCs (TPC 1 through TPC 10), each with a geometry controller, SMC, and 3 SMs, where each SM has I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory; texture units and a texture L1 per TPC; a 512-bit memory interconnect to ROP/L2 and 8 DRAM partitions]

  11. NVIDIA Tesla S1070 Computing Server • 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM • [Block diagram courtesy of NVIDIA: two pairs of Tesla GPUs, each pair behind an NVIDIA switch that connects to the host over a PCIe x16 interface; power supply, thermal management, and system monitoring]

  12. GPU Use/Programming • GPU libraries • NVIDIA’s CUDA BLAS and FFT libraries • Many 3rd party libraries • Low abstraction lightweight GPU programming toolkits • CUDA C • OpenCL • High abstraction compiler-based tools • PGI x64+GPU

  13. Getting Started with NCSA GPU Cluster • Cluster architecture overview • How to login and check out a node • How to compile and run an existing application

  14. NCSA AC GPU Cluster

  15. GPU Cluster Architecture • Servers: 32 • CPU cores: 128 • Accelerator Units: 32 • GPUs: 128 • [Diagram: head node ac plus compute nodes ac01 through ac32]

  16. GPU Cluster Node Architecture • HP xw9400 workstation • Dual-socket, dual-core 2.4 GHz AMD Opteron 2216 • 8 GB DDR2 • QDR InfiniBand • Tesla S1070 1U GPU Computing Server • 1.3 GHz Tesla T10 processors • 4 x 4 GB GDDR3 SDRAM • [Diagram: compute node with QDR IB; the HP xw9400 workstation connects over two PCIe x16 links to the S1070's PCIe interfaces, which serve four T10 GPUs, each with its own DRAM]

  17. Accessing the GPU Cluster • Use a Secure Shell (SSH) client to access AC • `ssh USER@ac.ncsa.uiuc.edu` (User: gpu001 - gpu200; Password: CHiPS-09) • You will see something like this printed out: See machine details and a technical report at: http://www.ncsa.uiuc.edu/Projects/GPUcluster/ Machine Description and HOW TO USE. See: /usr/local/share/ac.readme CUDA wrapper readme: /usr/local/share/cuda_wrapper.readme *IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how the cuda_wrapper changes this system!! … July 03, 2009 Nvidia compute exclusive mode made default. If this breaks your application, "touch ~/FORCE_NORMAL" to create an override for all your jobs. Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu [gpuXYZ@ac ~]$ _

  18. Installing Tutorial Examples • Run this sequence to retrieve and install the tutorial examples: cd ; cp /tmp/chips_tutorial.tgz . ; tar -xvzf chips_tutorial.tgz ; cd chips_tutorial ; ls (shows src1 src2 src3)

  19. Accessing the GPU Cluster • [Diagram: laptops 1 through 30 connect to the head node ac ("You are here"); compute nodes ac01 through ac32 sit behind it]

  20. Requesting a Cluster Node for Interactive Use • Run `qstat` to see what other users are doing, just for fun • Run `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive use • You will see something like this printed out: qsub: waiting for job 64424.acm to start qsub: job 64424.acm ready [gpuXYZ@acAB ~]$ _

  21. Requesting a Cluster Node • [Diagram: laptops 1 through 30, head node ac, and compute nodes ac01 through ac32; the "You are here" marker is now on a compute node]

  22. Checking GPU Characteristics • Run `deviceQuery` CUDA Device Query (Runtime API) version (CUDART static linking) There is 1 device supporting CUDA Device 0: "Tesla C1060" CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 4294705152 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes … Clock rate: 1.30 GHz Compute mode: Exclusive (only one host thread at a time can use this device)
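
The same information can also be queried programmatically through the CUDA runtime API. A minimal sketch (not part of the tutorial sources; device 0 is assumed):

    // query_device.cu: print a few properties of CUDA device 0
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        printf("CUDA devices found: %d\n", count);

        cudaDeviceProp prop;                  // filled in by the runtime
        cudaGetDeviceProperties(&prop, 0);    // device 0 assumed
        printf("Name: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Global memory: %lu bytes\n", (unsigned long)prop.totalGlobalMem);
        printf("Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
        printf("Clock rate: %.2f GHz\n", prop.clockRate / 1.0e6);   // clockRate is reported in kHz
        return 0;
    }

Build and run it with, e.g., nvcc query_device.cu -o query_device && ./query_device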

  23. Compiling and Running an Existing Application • cd chips_tutorial/src1 • vecadd.c - reference C implementation • vecadd.cu – CUDA implementation • Compile & run the CPU version: gcc vecadd.c -o vecadd_cpu ./vecadd_cpu Running CPU vecAdd for 16384 elements C[0]=2147483648.00 ... • Compile & run the GPU version: nvcc vecadd.cu -o vecadd_gpu ./vecadd_gpu Running GPU vecAdd for 16384 elements C[0]=2147483648.00 ...

  24. nvcc • Any source file containing CUDA C language extensions must be compiled with nvcc • nvcc is a compiler driver that invokes many other tools to do the job • Basic nvcc usage • nvcc <filename>.cu [-o <executable>] • Builds release mode • nvcc -deviceemu <filename>.cu • Builds device emulation mode (all code runs on the CPU) • The -g flag builds debug mode for the gdb debugger
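
For instance, the tutorial's vecadd.cu could be built in each of the three modes like this (illustrative command lines that simply combine the flags above):

    nvcc vecadd.cu -o vecadd_gpu                  # release build, kernel runs on the GPU
    nvcc -deviceemu vecadd.cu -o vecadd_emu       # device emulation, all code runs on the CPU
    nvcc -g -deviceemu vecadd.cu -o vecadd_dbg    # emulation with debug info for gdb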

  25. Anatomy of a GPU Application • Host side • Device side • CUDA programming model

  26. CPU-Only Version void vecAdd(int N, float* A, float* B, float* C) { for (int i = 0; i < N; i++) C[i] = A[i] + B[i]; } int main(int argc, char **argv) { int N = 16384; // default vector size float *A = (float*)malloc(N * sizeof(float)); float *B = (float*)malloc(N * sizeof(float)); float *C = (float*)malloc(N * sizeof(float)); vecAdd(N, A, B, C); // call compute kernel free(A); free(B); free(C); } • [Callouts: computational kernel, memory allocation, kernel invocation, memory de-allocation]

  27. Adding GPU support int main(int argc, char **argv) { int N = 16384; // default vector size float *A = (float*)malloc(N * sizeof(float)); float *B = (float*)malloc(N * sizeof(float)); float *C = (float*)malloc(N * sizeof(float)); float *devPtrA, *devPtrB, *devPtrC; cudaMalloc((void**)&devPtrA, N * sizeof(float)); cudaMalloc((void**)&devPtrB, N * sizeof(float)); cudaMalloc((void**)&devPtrC, N * sizeof(float)); cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice); • [Callouts: memory allocation on the GPU card; copying data from the CPU (host) memory to the GPU (device) memory]

  28. Adding GPU support vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC); cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost); cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC); free(A); free(B); free(C); } • [Callouts: kernel invocation; copying results from device memory to host memory; device memory de-allocation]

  29. GPU Kernel • CPU version void vecAdd(int N, float* A, float* B, float* C) { for (int i = 0; i < N; i++) C[i] = A[i] + B[i]; } • GPU version __global__ void vecAdd(float* A, float* B, float* C) { int i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i]; }
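
The GPU version above assumes N is an exact multiple of the block size (16384 = 32 x 512 here). A hedged variant, shown only as a sketch and not part of the tutorial sources, passes N in and adds a bounds check so the last, partially filled block cannot write past the end of the arrays:

    // sketch: vecAdd with a bounds check, for N not divisible by the block size
    __global__ void vecAddChecked(int N, float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                 // threads past the end simply do nothing
            C[i] = A[i] + B[i];
    }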

  30. CUDA Programming Model • A CUDA kernel is executed by an array of threads • All threads run the same code (SPMD) • Each thread has an ID that it uses to compute memory addresses and make control decisions • Threads are arranged as a grid of thread blocks • Threads within a block have access to a segment of shared memory • [Figure: every thread runs the same snippet, float x = input[threadID]; float y = func(x); output[threadID] = y; and the grid consists of Thread Block 0 through Thread Block N-1, each with its own shared memory]

  31. Kernel Invocation Syntax • vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC); launches a grid of 32 thread blocks (the first <<<>>> argument, the grid dimensionality) with 512 threads per block (the second argument, the thread block dimensionality) • Inside the kernel, int i = blockIdx.x * blockDim.x + threadIdx.x; combines the block ID within the grid (blockIdx.x), the number of threads per block (blockDim.x), and the thread ID within the thread block (threadIdx.x) into a global element index • [Figure: Thread Block 0 through Thread Block N-1, each with its own shared memory]
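
A common companion to this syntax is rounding the grid size up with a ceiling division so every element is covered even when the vector length is not a multiple of the block size; a sketch using the bounds-checked kernel from the previous slide's note:

    int threadsPerBlock = 512;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
    vecAddChecked<<<blocksPerGrid, threadsPerBlock>>>(N, devPtrA, devPtrB, devPtrC);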

  32. Mapping Threads to the Hardware • Blocks of threads are transparently assigned to SMs • A block of threads executes on one SM & does not migrate • Several blocks can reside concurrently on one SM • Each block can execute in any order relative to other blocks • [Figure: the same kernel grid of blocks 0 through 7 scheduled over time on two devices with different numbers of SMs] • Slide is courtesy of NVIDIA

  33. CUDA Programming Model • A kernel is executed as a grid of thread blocks • A grid of blocks can be 1- or 2-dimensional • Thread blocks can be 1-, 2-, or 3-dimensional • Different kernels can have different grid/block configurations • Threads from the same block have access to a shared memory and their execution can be synchronized • [Figure: the host launches Kernel 1 on Grid 1, a 3x2 arrangement of blocks (0,0) through (2,1), and Kernel 2 on Grid 2; Block (1,1) is expanded into its individual threads, indexed (x,y,z)] • Slide is courtesy of NVIDIA
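
For reference, a 2-D grid of 2-D blocks is configured the same way as the 1-D examples earlier; a minimal sketch (the kernel name, WIDTH, HEIGHT, and devPtr are illustrative, not from the tutorial):

    // sketch: one thread per element of a HEIGHT x WIDTH matrix stored row by row
    __global__ void scale2d(float* data, int width, int height, float s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (x < width && y < height)
            data[y * width + x] *= s;
    }

    // host side: 16x16 threads per block, grid rounded up to cover the whole matrix
    dim3 block(16, 16);
    dim3 grid((WIDTH + block.x - 1) / block.x, (HEIGHT + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(devPtr, WIDTH, HEIGHT, 2.0f);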

  34. GPU Memory Hierarchy • [Figure: on the host side, the CPU and chipset with host DRAM; on the device side, the GPU with its multiprocessors (registers, shared memory, constant and texture caches) and device DRAM holding local, global, constant, and texture memory]

  35. Porting matrix multiplier to GPU • cd ../chips_tutorial/src2 • Compile & run CPU version icc -O3 mmult.c -o mmult ./mmult 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... ... msec = 2215478 GFLOPS = 0.969

  36. int main(int argc, char* argv[]) { int N = 1024; struct timeval t1, t2, ta, tb; long msec1, msec2; float flop, mflop, gflop; float *a = (float *)malloc(N*N*sizeof(float)); float *b = (float *)malloc(N*N*sizeof(float)); float *c = (float *)malloc(N*N*sizeof(float)); minit(a, b, c, N); gettimeofday(&t1, NULL); mmult(a, b, c, N); // a = b * c gettimeofday(&t2, NULL); mprint(a, N, 5); free(a); free(b); free(c); msec1 = t1.tv_sec * 1000000 + t1.tv_usec; msec2 = t2.tv_sec * 1000000 + t2.tv_usec; msec2 -= msec1; flop = N*N*N*2.0f; mflop = flop / msec2; gflop = mflop / 1000.0f; printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop); } // a = b * c void mmult(float *a, float *b, float *c, int N) { for (int j = 0; j < N; j++) for (int k = 0; k < N; k++) for (int i = 0; i < N; i++) a[i+j*N] += b[i+k*N]*c[k+j*N]; } void minit(float *a, float *b, float *c, int N) { for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) { a[i+N*j] = 0.0f; b[i+N*j] = 1.0f; c[i+N*j] = 1.0f; } } void mprint(float *a, int N, int M) { int i, j; for (j = 0; j < M; j++) { for (i = 0; i < M; i++) printf("%.2f ", a[i+N*j]); printf("...\n"); } printf("...\n"); }

  37. Matrix Representation in Memory for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) a[i+n*j] += b[i+n*k] * c[k+n*j]; • Matrices are stored in column-major order • For reference, the jki-ordered version runs at 1.7 GFLOPS on a 3 GHz Intel Xeon (single core)
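
To make the layout concrete: in column-major order, element (row i, column j) of an n x n matrix is stored at offset i + n*j, so consecutive elements of a single column sit next to each other in memory. A tiny illustrative snippet (not from the tutorial sources):

    /* column-major indexing: a[i + n*j] is element (row i, column j), here n = 2 */
    float a[4];
    a[0 + 2*0] = 1.0f;   /* row 0, col 0                     */
    a[1 + 2*0] = 2.0f;   /* row 1, col 0, adjacent in memory */
    a[0 + 2*1] = 3.0f;   /* row 0, col 1                     */
    a[1 + 2*1] = 4.0f;   /* row 1, col 1                     */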

  38. Map this code: for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) a[i+n*j] += b[i+n*k] * c[k+n*j]; computing A = B * C (for example, a(1,2) = b(1,1)*c(1,2) + b(1,2)*c(2,2) + ...) into this (logical) architecture: a grid of thread blocks in which blockIdx.x identifies the block, blockDim.x is the number of threads per block, threadIdx.x is the thread index within the block, and blockIdx.x * blockDim.x + threadIdx.x is the global index • [Figure: with 3 blocks of 6 threads each, threadIdx.x runs 0 through 5 in every block and the global index runs 0 through 17]

  39. dim3 grid(1024/32, 1024); dim3 threads(32); defines a 32x1024 grid of thread blocks, each a block of 32x1x1 threads; a block is addressed by (blockIdx.x, blockIdx.y) and a thread within it by (threadIdx.x) • Each thread executes { int i = blockIdx.x*32 + threadIdx.x; int j = blockIdx.y; float sum = 0.0f; for (int k = 0; k < n; k++) sum += b[i+n*k] * c[k+n*j]; a[i+n*j] = sum; } • [Figure: 1024 thread blocks along j, 32 thread blocks along i, 32 threads per block along i]

  40. Kernel • Original CPU kernel: void mmult(float *a, float *b, float *c, int N) { for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) for (int k = 0; k < N; k++) a[i+j*N] += b[i+k*N]*c[k+j*N]; } • GPU kernel: __global__ void mmult(float *a, float *b, float *c, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; int j = blockIdx.y; float sum = 0.0; for (int k = 0; k < N; k++) sum += b[i+N*k] * c[k+N*j]; a[i+N*j] = sum; } • Launch: dim3 dimGrid(32, 1024); dim3 dimBlock(32); mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

  41. int main(int argc, char* argv[]) { int N = 1024; struct timeval t1, t2; long msec1, msec2; float flop, mflop, gflop; float *a = (float *)malloc(N*N*sizeof(float)); float *b = (float *)malloc(N*N*sizeof(float)); float *c = (float *)malloc(N*N*sizeof(float)); minit(a, b, c, N); // allocate device memory float *devPtrA, *devPtrB, *devPtrC; cudaMalloc((void**)&devPtrA, N*N*sizeof(float)); cudaMalloc((void**)&devPtrB, N*N*sizeof(float)); cudaMalloc((void**)&devPtrC, N*N*sizeof(float)); // copy input arrays to the device memory cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);

  42. gettimeofday(&t1, NULL); msec1 = t1.tv_sec * 1000000 + t1.tv_usec; // define grid and thread block sizes dim3 dimGrid(32, 1024); dim3 dimBlock(32); // launch GPU kernel mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N); // check for errors cudaError_t err = cudaGetLastError(); if (cudaSuccess != err) { fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString( err) ); exit(EXIT_FAILURE); } // wait until GPU kernel is done cudaThreadSynchronize(); gettimeofday(&t2, NULL); msec2 = t2.tv_sec * 1000000 + t2.tv_usec;

  43. // copy results to host cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost); mprint(a, N, 5); // free device memory cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC); free(a); free(b); free(c); msec2 -= msec1; flop = N*N*N*2.0f; mflop = flop / msec2; gflop = mflop / 1000.0f; printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop); }

  44. Porting matrix multiplier to GPU • Compile & run GPU version nvcc mmult.cu -o mmult ./mmult 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... 1024.00 1024.00 1024.00 1024.00 1024.00 ... ... msec = 91363 GFLOPS = 23.505

  45. More on CUDA Programming • Language extensions • Function type qualifiers • Variable type qualifiers • Execution configuration • Built-in variables • Common runtime components • Built-in vector types • Device runtime components • Intrinsic functions • Synchronization and memory fencing functions • Atomic functions • Host runtime components (runtime API only) • Device management • Memory management • Error handling • Debugging in the device emulation mode • Exercise

  46. Function Type Qualifiers • __device__ float DeviceFunc() is executed on the device and only callable from the device • __global__ void KernelFunc() is executed on the device and only callable from the host • __host__ float HostFunc() is executed on the host and only callable from the host • __device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, and cannot have a variable number of arguments • __device__ functions cannot have their address taken • __host__ and __device__ qualifiers can be used together, in which case the function is compiled for both • __global__ and __host__ qualifiers cannot be used together • A __global__ function must have void return type, its execution configuration must be specified, and the call is asynchronous
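
A short sketch that puts the qualifiers side by side (function names are illustrative only):

    // sketch: function type qualifiers in use
    __device__ float square(float x)              // device-only helper
    {
        return x * x;
    }

    __host__ __device__ float twice(float x)      // compiled for both host and device
    {
        return 2.0f * x;
    }

    __global__ void apply(int N, float* data)     // kernel: void return type, launched from the host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            data[i] = twice(square(data[i]));
    }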

  47. Variable Type Qualifiers • __shared__ and __constant__ variables have implied static storage • __device__, __shared__ and __constant__ variables cannot be defined using the extern keyword • __device__ and __constant__ variables are only allowed at file scope • __constant__ variables cannot be assigned to from the device; they are initialized from the host only • __shared__ variables cannot have an initialization as part of their declaration
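
A minimal sketch of the variable qualifiers (all names and sizes are illustrative; assumes blocks of at most 256 threads):

    __constant__ float coeff[16];        // file scope; set from the host, e.g. with cudaMemcpyToSymbol
    __device__   int   callCount;        // file scope; lives in device global memory

    __global__ void filter(int N, float* data)
    {
        __shared__ float tile[256];      // one copy per thread block; no initializer allowed
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            tile[threadIdx.x] = data[i];
        __syncthreads();                 // every thread of the block reaches this barrier
        if (i < N)
            data[i] = tile[threadIdx.x] * coeff[0];
    }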

  48. Execution Configuration • A function declared as • __global__ void kernel(float* param); • must be called like this: • kernel<<<Dg, Db, Ns, S>>>(param); • where • Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched; • Db (type dim3) specifies the dimension and size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block; • the optional Ns (type size_t) specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated memory; • the optional S (type cudaStream_t) specifies the stream associated with this kernel call
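
A hedged example of the full configuration, including dynamically allocated shared memory and a stream (kernel name and sizes are illustrative). Inside the kernel, the dynamic allocation is picked up through an unsized extern __shared__ array, the documented exception to the extern rule on the previous slide:

    // sketch: kernel that uses Ns bytes of dynamically sized shared memory
    __global__ void kernel(float* param)
    {
        extern __shared__ float buf[];       // size set by the Ns launch argument
        buf[threadIdx.x] = param[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();
        // ... work on buf ...
    }

    // host side
    dim3 Dg(64);                              // 64 blocks
    dim3 Db(128);                             // 128 threads per block
    size_t Ns = Db.x * sizeof(float);         // dynamic shared memory per block, in bytes
    cudaStream_t S;
    cudaStreamCreate(&S);
    kernel<<<Dg, Db, Ns, S>>>(param);         // asynchronous launch on stream S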

  49. Built-in Variables • gridDim (type dim3): dimensions of the grid • blockIdx (type uint3): block index within the grid • blockDim (type dim3): dimensions of the block • threadIdx (type uint3): thread index within the block • warpSize (type int): warp size in threads • It is not allowed to take the address of any of the built-in variables • It is not allowed to assign values to any of the built-in variables
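
A small sketch that only reads the built-in variables, as required (never assigning to them or taking their address):

    // sketch: flattening the built-in indices into one global thread number
    __global__ void whereAmI(int* out)
    {
        int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
        int blockId  = blockIdx.y * gridDim.x + blockIdx.x;
        int threadId = threadIdx.z * blockDim.y * blockDim.x
                     + threadIdx.y * blockDim.x + threadIdx.x;
        out[blockId * threadsPerBlock + threadId] = threadId;
    }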

  50. Built-in Vector Types • Vector types derived from basic integer and float types • char1, char2, char3, char4 • uchar1, uchar2, uchar3, uchar4 • short1, short2, short3, short4 • ushort1, ushort2, ushort3, ushort4 • int1, int2, int3, int4 • uint1, uint2, uint3 (dim3), uint4 • long1, long2, long3, long4 • ulong1, ulong2, ulong3, ulong4 • longlong1, longlong2 • float1, float2, float3, float4 • double1, double2 They are all structures, like this: typedef struct { float x,y,z,w; } float4; They all come with a constructor function of the form make_<type name>, e.g., int2 make_int2(int x, int y);
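
A short sketch using the constructor functions and dim3 (unspecified dim3 components default to 1):

    // sketch: built-in vector types and their make_ constructors
    int2   ij  = make_int2(3, 7);                         // ij.x == 3, ij.y == 7
    float4 v   = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    float  len = sqrtf(v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w);

    dim3 grid(32, 1024);   // z defaults to 1
    dim3 block(32);        // y and z default to 1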
