
A short introduction to nVidia's CUDA


Presentation Transcript


  1. A short introduction to nVidia's CUDA Alexander Heinecke Technical University of Munich http://home.in.tum.de/~heinecke/fa2007

  2. Overview • Differences CPU – GPU 3 • General CPU/GPU properties • Compare specifications • CUDA Programming Model 10 • Application stack • Thread implementation • Memory Model • CUDA API 13 • Extension of the C/C++ Programming Lang. • Example structure of a CUDA application • Examples 15 • Matrix Addition • Matrix Multiplication • Jacobi & Gauß – Seidel • Benchmark Results 21

  3. Differences between CPU and GPU • GPU: nearly all transistors are ALUs • CPU: most of the transistors are Cache (taken from [NV1])

  4. AMD Opteron Dieshot

  5. Intel Itanium2 Dual-Core Dieshot

  6. Intel Core Architecture Pipeline / Simple Example (taken from [IN1])

     Stage \ cycle        1     2     3     4     5     6     7
     Step 1 (IFETCH)     #1    #2    #3    #4    #5    #6    #7
     Step 2 (IDEC)             #1    #2    #3    #4    #5    #6
     Step 3 (OFETCH)                 #1    #2    #3    #4    #5
     Step 4 (EXEC)                         #1    #2    #3    #4
     Step 5 (RET)                                #1    #2    #3

  7. nVidia G80 Pipeline

  8. Properties of CPU and GPU

  9. History: Power of GPUs in the last four years (taken from [NV1])

  10. Application stack of CUDA (taken from [NV1])

  11. Thread organization in CUDA (taken from [NV1])

  12. Memory organization in CUDA (taken from [NV1])

  13. Extensions to C (functions and variables) • CUDA code is saved in special files (*.cu) • These are precompiled by nvcc (the NVIDIA compiler) • There are some function type qualifiers, which decide the execution place: • __host__ (CPU only, called by CPU) • __global__ (GPU only, called by CPU) • __device__ (GPU only, called by GPU) • For variables: __device__, __constant__, __shared__
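  A minimal illustration of these qualifiers (a sketch, not taken from the slides; all names are made up):

     __host__   void  hostOnly() { }            // runs on the CPU, called from CPU code
     __device__ float deviceHelper(float x) {   // runs on the GPU, callable only from GPU code
         return x * x;
     }
     __global__ void kernelEntry(float* data) { // runs on the GPU, launched from the CPU
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         data[i] = deviceHelper(data[i]);
     }

     __constant__ float c_coeff[16];            // constant memory, written from the host
     __device__   float g_scratch[256];         // global device memory
     // __shared__ is used inside kernels for per-block shared memory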

  14. Example structure of a CUDA application • at least two functions, to isolate the CUDA code from your app. • First function: • Init CUDA • Copy data to device • Call kernel with execution settings • Copy data to host and shut down (automatic) • Second function (kernel): • Contains the problem for ONE thread (a generic sketch follows below)
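  A hypothetical skeleton of this two-function structure (all names are placeholders; the real example follows on slides 16 and 17):

     // kernel: the work of ONE thread
     __global__ void myKernel(float* d_out, const float* d_in, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) d_out[i] = 2.0f * d_in[i];
     }

     // host-side wrapper: init, copy to device, launch, copy back
     void runMyKernel(float* h_out, const float* h_in, int n) {
         size_t bytes = n * sizeof(float);
         float *d_in, *d_out;
         cudaMalloc((void**)&d_in,  bytes);
         cudaMalloc((void**)&d_out, bytes);
         cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // data to device
         dim3 threads(256);
         dim3 grid((n + threads.x - 1) / threads.x);               // execution settings
         myKernel<<<grid, threads>>>(d_out, d_in, n);              // kernel call
         cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // result to host
         cudaFree(d_in);
         cudaFree(d_out);
     }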

  15. Tested Algorithms (2D Arrays) All tested algorithms operate on 2D arrays: • Matrix Addition • Matrix Multiplication • Jacobi & Gauß-Seidel (iterative solvers)

  16. Example Matrix Addition (Init function)

     CUT_DEVICE_INIT();
     // allocate device memory
     float* d_A;
     CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
     …
     // copy host memory to device
     CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size, cudaMemcpyHostToDevice));
     …
     cudaBindTexture(0, texRef_MaA, d_A, mem_size);   // texture binding
     …
     dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
     dim3 grid(n_dim / threads.x, n_dim / threads.y);
     // execute the kernel
     cuMatrixAdd_kernel<<< grid, threads >>>(d_C, n_dim);
     cudaUnbindTexture(texRef_MaA);                   // texture unbinding
     …
     // copy result from device to host
     CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size, cudaMemcpyDeviceToHost));
     …
     CUDA_SAFE_CALL(cudaFree(d_A));
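  The texture references bound above (texRef_MaA, and texRef_MaB used by the kernel on the next slide) have to be declared at file scope in the .cu file; those declarations are not shown on the slide. A minimal sketch using the texture reference API of that CUDA generation:

     // assumed file-scope declarations (not on the original slide)
     texture<float, 1, cudaReadModeElementType> texRef_MaA;
     texture<float, 1, cudaReadModeElementType> texRef_MaB;

  Note that grid(n_dim / threads.x, n_dim / threads.y) assumes n_dim is a multiple of BLOCK_SIZE_GPU.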

  17. Example Matrix Addition (kernel)

     // Block index
     int bx = blockIdx.x;
     int by = blockIdx.y;
     // Thread index
     int tx = threadIdx.x;
     int ty = threadIdx.y;

     int start = (n_dim * by * BLOCK_SIZE_GPU) + bx * BLOCK_SIZE_GPU;

     C[start + (n_dim * ty) + tx] = tex1Dfetch(texRef_MaA, start + (n_dim * ty) + tx)
                                  + tex1Dfetch(texRef_MaB, start + (n_dim * ty) + tx);
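  Each thread writes exactly one matrix element: start addresses the block's tile, (n_dim * ty) + tx the element inside it. For comparison, the same computation as a plain CPU loop over row-major n_dim x n_dim arrays (a sketch; ma_a, ma_b, ma_c are the host arrays from the init function):

     for (int row = 0; row < n_dim; ++row)
         for (int col = 0; col < n_dim; ++col)
             ma_c[row * n_dim + col] = ma_a[row * n_dim + col] + ma_b[row * n_dim + col];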

  18. Example Matrix Multiplication (kernel)

     int tx2 = tx + BLOCK_SIZE_GPU;
     int ty2 = n_dim * ty;
     float Csub1 = 0.0f;
     float Csub2 = 0.0f;
     int b = bBegin;
     for (int a = aBegin; a <= aEnd; a += aStep) {
         __shared__ float As[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];
         AS(ty, tx) = A[a + ty2 + tx];
         __shared__ float B1s[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU*2];
         B1S(ty, tx)  = B[b + ty2 + tx];
         B1S(ty, tx2) = B[b + ty2 + tx2];
         __syncthreads();
         Csub1 += AS(ty, 0) * B1S(0, tx);
         // more calcs
         b += bStep;
     }
     __syncthreads();
     // Write result back
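  The fragment relies on index variables and access macros that were elided on the slide. A plausible reconstruction, modeled on the blocked matrix-multiplication sample from the CUDA SDK (the exact values in the original code are an assumption; each thread computes two results, Csub1 and Csub2, so a block covers a double-width tile of B and C):

     // assumed setup, not shown on the slide
     #define AS(i, j)  As[i][j]
     #define B1S(i, j) B1s[i][j]

     int aBegin = n_dim * BLOCK_SIZE_GPU * by;   // first tile of A processed by this block
     int aEnd   = aBegin + n_dim - 1;            // last tile of A
     int aStep  = BLOCK_SIZE_GPU;                // step to the next tile of A
     int bBegin = 2 * BLOCK_SIZE_GPU * bx;       // first (double-width) tile of B
     int bStep  = BLOCK_SIZE_GPU * n_dim;        // step to the next tile of B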

  19. Example Jacobi (kernel), no internal loops

     // Block index
     int bx = blockIdx.x;
     int by = blockIdx.y;
     // Thread index
     int tx = threadIdx.x + 1;
     int ty = threadIdx.y + 1;

     int ustart = ((by * BLOCK_SIZE_GPU) * n_dim) + (bx * BLOCK_SIZE_GPU);

     float res = tex1Dfetch(texRef_MaF, ustart + (ty * n_dim) + tx) * qh;
     res += tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx - 1)
          + tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx + 1);
     res += tex1Dfetch(texRef_MaU, ustart + ((ty+1) * n_dim) + tx)
          + tex1Dfetch(texRef_MaU, ustart + ((ty-1) * n_dim) + tx);
     res = 0.25f * res;
     ma_u[ustart + (ty * n_dim) + tx] = res;
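  The kernel performs one Jacobi relaxation step of a 5-point stencil; qh is presumably the squared mesh width h² scaling the right-hand side f. For reference, a CPU sketch of the same sweep (function and array names are placeholders, not from the slides):

     // one Jacobi sweep over the interior points of an n_dim x n_dim grid
     void jacobi_sweep_cpu(float* u_new, const float* u, const float* f,
                           int n_dim, float qh) {
         for (int i = 1; i < n_dim - 1; ++i) {
             for (int j = 1; j < n_dim - 1; ++j) {
                 float res = f[i * n_dim + j] * qh;
                 res += u[i * n_dim + (j - 1)] + u[i * n_dim + (j + 1)];
                 res += u[(i - 1) * n_dim + j] + u[(i + 1) * n_dim + j];
                 u_new[i * n_dim + j] = 0.25f * res;
             }
         }
     }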

  20. Example Jacobi (kernel), internal loops

     int tx = threadIdx.x + 1;
     int ty = threadIdx.y + 1;
     // *some more inits*

     // load to calc u_ij
     __shared__ float Us[BLOCK_SIZE_GPU+2][BLOCK_SIZE_GPU+2];
     US(ty, tx) = tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx);
     // *init edge u*
     …
     for (unsigned int i = 0; i < n_intern_loops; i++) {
         res = funk;
         res += US(ty, tx - 1) + US(ty, tx + 1);
         res += US(ty - 1, tx) + US(ty + 1, tx);
         res = 0.25f * res;
         __syncthreads();   // not used in parallel jacobi
         US(ty, tx) = res;
     }
     ma_u[ustart + (ty * n_dim) + tx] = res;

  21. Performance Results (1)

  22. Performance Results (2)

  23. Performance Results (3)

  24. Performance Results (4)

  25. Conclusion (Points to take care of) Be careful of / you should: • minimize the number of memory accesses • use unrolling instead of for loops (see the sketch below) • use blocking algorithms • implement with CUDA only algorithms that are not extremely memory bound (NOT matrix addition) • try to avoid if statements and other control-flow statements (they are slow)
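  One way to follow the unrolling advice is #pragma unroll on loops with a fixed trip count; manual unrolling (as the matrix multiplication kernel appears to do) works as well. A hypothetical fragment, not from the slides:

     __global__ void scale_kernel(float* data, float factor) {
         int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
         #pragma unroll                 // fixed trip count -> nvcc unrolls the loop completely
         for (int k = 0; k < 4; ++k)
             data[base + k] *= factor;
     }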

  26. Appendix - References [NV1] NVIDIA CUDA Compute Unified Device Architecture, Programming Guide; NVIDIA Corporation, Version 1.0, 23.06.2007 [IN1/2/3] Intel Architecture Handbook, Version November 2006 [NR] Numerical Recipes (online generated PDF) http://home.in.tum.de/~heinecke/fa2007
