
A short introduction to nVidia's CUDA


Presentation Transcript


  1. A short introduction to nVidia's CUDA Alexander Heinecke Technical University of Munich http://home.in.tum.de/~heinecke/fa2007

  2. Overview • Differences CPU – GPU 3 • General CPU/GPU properties • Compare specifications • CUDA Programming Model 10 • Application stack • Thread implementation • Memory Model • CUDA API 13 • Extension of the C/C++ Programming Lang. • Example structure of a CUDA application • Examples 15 • Matrix Addition • Matrix Multiplication • Jacobi & Gauß – Seidel • Benchmark Results 21

  3. Differences between CPU and GPU • GPU: nearly all transistors are ALUs • CPU: most of the transistors are Cache (taken from [NV1])

  4. AMD Opteron Dieshot

  5. Intel Itanium2 Dual-Core Dieshot

  6. Intel Core Architecture Pipeline / Simple Example (taken from [IN1])

     Stage \ cycle        1     2     3     4     5     6     7
     Step 1 (IFETCH)     #1    #2    #3    #4    #5    #6    #7
     Step 2 (IDEC)             #1    #2    #3    #4    #5    #6
     Step 3 (OFETCH)                 #1    #2    #3    #4    #5
     Step 4 (EXEC)                         #1    #2    #3    #4
     Step 5 (RET)                                #1    #2    #3

  7. nVidia G80 Pipeline

  8. Properties of CPU and GPU

  9. History: Power of GPUs in the last four years (taken from [NV1])

  10. Application stack of CUDA (taken from [NV1])

  11. Thread organization in CUDA (taken from [NV1])

  12. Memory organization in CUDA (taken from [NV1])

  13. Extensions to C (functions and variables) • CUDA code is saved in special files (*.cu) • These are precompiled by nvcc (the NVIDIA compiler) • There are some function type qualifiers, which decide the execution place: • __host__ (CPU only, called by CPU) • __global__ (GPU only, called by CPU) • __device__ (GPU only, called by GPU) • For variables: __device__, __constant__, __shared__
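  A minimal illustration of these qualifiers (a sketch, not taken from the slides; all names are made up):

     __host__   void  hostOnly() { }            // runs on the CPU, called from CPU code
     __device__ float deviceHelper(float x) {   // runs on the GPU, callable only from GPU code
         return x * x;
     }
     __global__ void kernelEntry(float* data) { // runs on the GPU, launched from the CPU
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         data[i] = deviceHelper(data[i]);
     }

     __constant__ float c_coeff[16];            // constant memory, written from the host
     __device__   float g_scratch[256];         // global device memory
     // __shared__ is used inside kernels for per-block shared memory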

  14. Example structure of a CUDA application • at least two functions, to isolate the CUDA code from your app. • First function: • Init CUDA • Copy data to device • Call kernel with execution settings • Copy data to host and shut down (automatic) • Second function (kernel): • Contains the problem for ONE thread (a generic sketch follows below)
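  A hypothetical skeleton of this two-function structure (all names are placeholders; the real example follows on slides 16 and 17):

     // kernel: the work of ONE thread
     __global__ void myKernel(float* d_out, const float* d_in, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) d_out[i] = 2.0f * d_in[i];
     }

     // host-side wrapper: init, copy to device, launch, copy back
     void runMyKernel(float* h_out, const float* h_in, int n) {
         size_t bytes = n * sizeof(float);
         float *d_in, *d_out;
         cudaMalloc((void**)&d_in,  bytes);
         cudaMalloc((void**)&d_out, bytes);
         cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // data to device
         dim3 threads(256);
         dim3 grid((n + threads.x - 1) / threads.x);               // execution settings
         myKernel<<<grid, threads>>>(d_out, d_in, n);              // kernel call
         cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // result to host
         cudaFree(d_in);
         cudaFree(d_out);
     }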

  15. Tested Algorithms (2D Arrays) All tested algorithms operate on 2D arrays: • Matrix Addition • Matrix Multiplication • Jacobi & Gauß-Seidel (iterative solvers)

  16. Example Matrix Addition (Init function)

     CUT_DEVICE_INIT();
     // allocate device memory
     float* d_A;
     CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
     …
     // copy host memory to device
     CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size, cudaMemcpyHostToDevice));
     …
     cudaBindTexture(0, texRef_MaA, d_A, mem_size);   // texture binding
     …
     dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
     dim3 grid(n_dim / threads.x, n_dim / threads.y);
     // execute the kernel
     cuMatrixAdd_kernel<<< grid, threads >>>(d_C, n_dim);
     cudaUnbindTexture(texRef_MaA);                   // texture unbinding
     …
     // copy result from device to host
     CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size, cudaMemcpyDeviceToHost));
     …
     CUDA_SAFE_CALL(cudaFree(d_A));
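  The texture references bound above (texRef_MaA, and texRef_MaB used by the kernel on the next slide) have to be declared at file scope in the .cu file; those declarations are not shown on the slide. A minimal sketch using the texture reference API of that CUDA generation:

     // assumed file-scope declarations (not on the original slide)
     texture<float, 1, cudaReadModeElementType> texRef_MaA;
     texture<float, 1, cudaReadModeElementType> texRef_MaB;

  Note that grid(n_dim / threads.x, n_dim / threads.y) assumes n_dim is a multiple of BLOCK_SIZE_GPU.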

  17. Example Matrix Addition (kernel)

     // Block index
     int bx = blockIdx.x;
     int by = blockIdx.y;
     // Thread index
     int tx = threadIdx.x;
     int ty = threadIdx.y;

     int start = (n_dim * by * BLOCK_SIZE_GPU) + bx * BLOCK_SIZE_GPU;

     C[start + (n_dim * ty) + tx] = tex1Dfetch(texRef_MaA, start + (n_dim * ty) + tx)
                                  + tex1Dfetch(texRef_MaB, start + (n_dim * ty) + tx);
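  Each thread writes exactly one matrix element: start addresses the block's tile, (n_dim * ty) + tx the element inside it. For comparison, the same computation as a plain CPU loop over row-major n_dim x n_dim arrays (a sketch; ma_a, ma_b, ma_c are the host arrays from the init function):

     for (int row = 0; row < n_dim; ++row)
         for (int col = 0; col < n_dim; ++col)
             ma_c[row * n_dim + col] = ma_a[row * n_dim + col] + ma_b[row * n_dim + col];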

  18. Example Matrix Multiplication (kernel)

     int tx2 = tx + BLOCK_SIZE_GPU;
     int ty2 = n_dim * ty;
     float Csub1 = 0.0f;
     float Csub2 = 0.0f;
     int b = bBegin;
     for (int a = aBegin; a <= aEnd; a += aStep) {
         __shared__ float As[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];
         AS(ty, tx) = A[a + ty2 + tx];
         __shared__ float B1s[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU*2];
         B1S(ty, tx)  = B[b + ty2 + tx];
         B1S(ty, tx2) = B[b + ty2 + tx2];
         __syncthreads();
         Csub1 += AS(ty, 0) * B1S(0, tx);
         // more calcs
         b += bStep;
     }
     __syncthreads();
     // Write result back
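  The fragment relies on index variables and access macros that were elided on the slide. A plausible reconstruction, modeled on the blocked matrix-multiplication sample from the CUDA SDK (the exact values in the original code are an assumption; each thread computes two results, Csub1 and Csub2, so a block covers a double-width tile of B and C):

     // assumed setup, not shown on the slide
     #define AS(i, j)  As[i][j]
     #define B1S(i, j) B1s[i][j]

     int aBegin = n_dim * BLOCK_SIZE_GPU * by;   // first tile of A processed by this block
     int aEnd   = aBegin + n_dim - 1;            // last tile of A
     int aStep  = BLOCK_SIZE_GPU;                // step to the next tile of A
     int bBegin = 2 * BLOCK_SIZE_GPU * bx;       // first (double-width) tile of B
     int bStep  = BLOCK_SIZE_GPU * n_dim;        // step to the next tile of B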

  19. Example Jacobi (kernel), no internal loops

     // Block index
     int bx = blockIdx.x;
     int by = blockIdx.y;
     // Thread index
     int tx = threadIdx.x + 1;
     int ty = threadIdx.y + 1;

     int ustart = ((by * BLOCK_SIZE_GPU) * n_dim) + (bx * BLOCK_SIZE_GPU);

     float res = tex1Dfetch(texRef_MaF, ustart + (ty * n_dim) + tx) * qh;
     res += tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx - 1)
          + tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx + 1);
     res += tex1Dfetch(texRef_MaU, ustart + ((ty+1) * n_dim) + tx)
          + tex1Dfetch(texRef_MaU, ustart + ((ty-1) * n_dim) + tx);
     res = 0.25f * res;
     ma_u[ustart + (ty * n_dim) + tx] = res;
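  The kernel performs one Jacobi relaxation step of a 5-point stencil; qh is presumably the squared mesh width h² scaling the right-hand side f. For reference, a CPU sketch of the same sweep (function and array names are placeholders, not from the slides):

     // one Jacobi sweep over the interior points of an n_dim x n_dim grid
     void jacobi_sweep_cpu(float* u_new, const float* u, const float* f,
                           int n_dim, float qh) {
         for (int i = 1; i < n_dim - 1; ++i) {
             for (int j = 1; j < n_dim - 1; ++j) {
                 float res = f[i * n_dim + j] * qh;
                 res += u[i * n_dim + (j - 1)] + u[i * n_dim + (j + 1)];
                 res += u[(i - 1) * n_dim + j] + u[(i + 1) * n_dim + j];
                 u_new[i * n_dim + j] = 0.25f * res;
             }
         }
     }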

  20. Example Jacobi (kernel), internal loops

     int tx = threadIdx.x + 1;
     int ty = threadIdx.y + 1;
     // *some more inits*

     // load to calc u_ij
     __shared__ float Us[BLOCK_SIZE_GPU+2][BLOCK_SIZE_GPU+2];
     US(ty, tx) = tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx);
     // *init edge u*
     …
     for (unsigned int i = 0; i < n_intern_loops; i++) {
         res = funk;
         res += US(ty, tx - 1) + US(ty, tx + 1);
         res += US(ty - 1, tx) + US(ty + 1, tx);
         res = 0.25f * res;
         __syncthreads();   // not used in parallel jacobi
         US(ty, tx) = res;
     }
     ma_u[ustart + (ty * n_dim) + tx] = res;

  21. Performance Results (1)

  22. Performance Results (2)

  23. Performance Results (3)

  24. Performance Results (4)

  25. Conclusion (Points to take care of) Be careful of / you should: • minimize the number of memory accesses • use unrolling instead of for loops (see the sketch below) • use blocking algorithms • implement with CUDA only algorithms that are not extremely memory bound (NOT matrix addition) • try to avoid if statements and other control-flow statements (they are slow)
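  One way to follow the unrolling advice is #pragma unroll on loops with a fixed trip count; manual unrolling (as the matrix multiplication kernel appears to do) works as well. A hypothetical fragment, not from the slides:

     __global__ void scale_kernel(float* data, float factor) {
         int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
         #pragma unroll                 // fixed trip count -> nvcc unrolls the loop completely
         for (int k = 0; k < 4; ++k)
             data[base + k] *= factor;
     }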

  26. Appendix - References [NV1] NVIDIA CUDA Compute Unified Device Architecture, Programming Guide; NVIDIA Corporation, Version 1.0, 23.06.2007 [IN1/2/3] Intel Architecture Handbook, Version November 2006 [NR] Numerical Recipes (online generated PDF) http://home.in.tum.de/~heinecke/fa2007
