
Optimization Techniques for GPU Computing


Presentation Transcript


  1. Optimization Techniques for GPU Computing Using CUDA

  2. What is the Fermi Architecture?

  3. What is Fermi’s SM? • It is one of 16 streaming multiprocessors on the chip. • Each SM contains 32 CUDA cores.

  4. What’s a CUDA Core? • The 32 CUDA cores of an SM work in tandem, and each can be time-shared by up to 48 threads, for a total of 1536 resident threads per streaming multiprocessor. • Threads are grouped together into thread blocks. • The hardware schedules threads in warps of no more than 32 threads each (so up to 48 warps per SM).
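
As a minimal sketch of how this hierarchy shows up in code (the kernel name is illustrative, not from the slides), each thread can compute its global index and its position in a warp from the built-in index variables:

__global__ void whereAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index across the whole grid
    int warpId   = threadIdx.x / warpSize;                 // which warp of the block this thread belongs to (warpSize is 32)
    int laneId   = threadIdx.x % warpSize;                 // position of the thread within its warp
}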

  5. Fermi Architecture

  6. How Do You Use CUDA? • CUDA is C with some Extras:

  7. What is <<<THIS>>>() All About? • CUDA code is C and C++ plus CUDA-specific keywords. • CUDA compiles with NVCC, which uses g++ for the host code. • All CUDA code is stored in .cu files, but not all code in .cu files runs on the GPU. • <<<this>>>() is how we launch GPU kernels.
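
As a minimal sketch of the launch syntax (the vecAdd kernel and launch sizes are illustrative, not from the slides), a .cu file mixes ordinary C/C++ with kernel definitions, and <<<blocks, threads>>>() starts a kernel on the GPU:

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];                 // one element per thread, runs on the GPU
}

// Host-side launch, compiled by NVCC along with the kernel:
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);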

  8. Host, Global, and Device • Functions use the CUDA keywords __global__, __host__, and __device__ to specify scope and accessibility. • __host__ → code runs on the CPU. • __global__ → code runs on the GPU and is the entry point from the CPU (must return void). • __device__ → code runs on the GPU and can only be called from other functions running on the GPU.

  9. Host, Global, and Device: Example
__device__ float blahBlahBlah(float param) { return param; }
__global__ void blahBlah() { printf("%f\n", blahBlahBlah(3.1415f)); }  // device-side printf needs a format string
__host__ void blah() { blahBlah<<<16, 1024>>>(); }
The first launch argument is the number of blocks in the grid; the second is the number of threads per block.

  10. Global Memory Coalescing Using CUDA

  11. Global Memory Coalescing • Bandwidth, bandwidth, bandwidth! • Global memory: coalesce accesses between the kernel and global memory. • Shared memory: how a warp (a group of 32 threads) accesses the shared memory banks.

  12. Global Memory Coalescing

  13. Matrix Sum: Global Memory Not Coalesced

  14. Matrix Sum: Global Memory Coalesced

  15. Matrix Sum: Global Memory Coalescing
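
The slides’ own code is not in the transcript, but a minimal sketch of the idea looks like this (kernel and variable names are illustrative). With a row-major matrix, letting each thread sum a column means that at every loop step the 32 threads of a warp read consecutive addresses, which the hardware coalesces into few transactions; letting each thread sum a row makes the warp’s reads sit a whole row apart, so they cannot be coalesced:

// Not coalesced: thread i walks across row i, so at each step the warp's
// threads touch addresses that are `cols` elements apart.
__global__ void sumRows(const float *m, float *out, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows) {
        float s = 0;
        for (int j = 0; j < cols; ++j) s += m[i * cols + j];
        out[i] = s;
    }
}

// Coalesced: thread j walks down column j, so at each step the warp's
// threads touch consecutive addresses in the same row of the matrix.
__global__ void sumCols(const float *m, float *out, int rows, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < cols) {
        float s = 0;
        for (int i = 0; i < rows; ++i) s += m[i * cols + j];
        out[j] = s;
    }
}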

  16. Shared Memory & Bank Conflict Using CUDA

  17. Shared Memory & Bank Conflict • Application: matrix multiplication. • A bank is 32 bits wide, so in most cases there is no conflict for the float (32-bit) data type. • We attempt to generate bank conflicts and compare the performance differences for the char (8-bit) and double (64-bit) data types. • NVIDIA compute capability is also considered.

  18. Global Memory Implementation: kernel code
__global__ void kernel(Matrix A, Matrix B, Matrix C) {
    int y = threadIdx.x + blockIdx.x * blockDim.x;
    int x = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0;
    if (x < C.row && y < C.col) {
        for (int k = 0; k < A.col; k++) {
            sum += A.GetValue(x, k) * B.GetValue(k, y);   // every operand is read directly from global memory
        }
        C.SetValue(x, y, sum);
    }
}
The shared memory implementation is shown on the next slide.

  19. Shared Memory Implementation: kernel code
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C.
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    float Cvalue = 0;
    int row = threadIdx.y;
    int col = threadIdx.x;
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Each thread loads one element of each sub-matrix into shared memory.
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        __syncthreads();
        for (int e = 0; e < BLOCK_SIZE; ++e) {
            Cvalue += As[row][e] * Bs[e][col];
        }
        __syncthreads();
    }
    // Write Csub to device memory. Each thread writes one element.
    SetElement(Csub, row, col, Cvalue);
}

  20.–28. Shared Memory Access (figure sequence illustrating how the threads of a warp access the shared memory banks)

  29.–30. Runtime overview: comparing various data types with basic settings (charts)

  31. Implementation for the example: type char
(Figure: shared memory banks are 32 bits wide, so shared[0], shared[1], shared[2], and shared[3] of a char array fall in the same bank.)
• 8-bit and 16-bit accesses typically generate bank conflicts. For example:
__shared__ char shared[32];
char data = shared[BaseIndex + tid];        // shared[0], shared[1], shared[2], and shared[3] belong to the same bank
char data = shared[BaseIndex + 4 * tid];    // striding by 4 puts each access in a different bank

  32. Implementation for the example: type char
for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
    Matrix Asub = GetSubMatrix(A, blockRow, m);
    Matrix Bsub = GetSubMatrix(B, m, blockCol);
    // Pad the shared arrays and stride the indices by 4 so the accessed chars land in different banks.
    __shared__ char As[BLOCK_SIZE * 4][BLOCK_SIZE * 4];
    __shared__ char Bs[BLOCK_SIZE * 4][BLOCK_SIZE * 4];
    As[row * 4][col * 4] = GetElement(Asub, row, col);
    Bs[row * 4][col * 4] = GetElement(Bsub, row, col);
    __syncthreads();
    for (int e = 0; e < BLOCK_SIZE; ++e) {
        Cvalue += As[row * 4][e * 4] * Bs[e * 4][col * 4];
    }
    __syncthreads();
}

  33. Slight Improvement... However...

  34. On Hydra, the Performance Is Worse

  35. NVIDIA Compute Capability • Compiler settings are another factor to consider when tuning performance. • Compute capability 1.x (old architecture): -arch sm_11 • Compute capability 2.x (Fermi architecture; examples: Tesla T20 (Hydra), GTX 560 Ti): -arch sm_20 • The previous run on the GTX 560 Ti was compiled for compute capability 1.1. • If we compile the matrix multiplication app (char) with the sm_20 switch...

  36. The new compilation option is much worse... (On Hydra, we see no difference between sm_11 and sm_20.)
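
To check which compute capability a device actually reports, a small query program can be used (a minimal sketch against the standard CUDA runtime API; not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}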

  37. Matrix Multiplication: type double
• A workaround for older devices (obsolete; requires -arch sm_13) splits each double into two 32-bit halves:
__shared__ int shared_lo[32];
__shared__ int shared_hi[32];
shared_lo[BaseIndex + tid] = __double2loint(dataIn);
shared_hi[BaseIndex + tid] = __double2hiint(dataIn);
double dataOut = __hiloint2double(shared_hi[BaseIndex + tid],
                                  shared_lo[BaseIndex + tid]);
• For devices of compute capability 2.x, there are no bank conflicts for arrays of double:
__shared__ double shared[32];
double data = shared[BaseIndex + tid];

  38.–39. Matrix Multiplication: type double (runtime comparison charts)

  40. Conclusion on Bank Conflicts in Shared Memory (so far, from the data collected) • On current hardware, bank conflicts are becoming increasingly irrelevant as the architecture advances, probably because the overhead of trying to avoid them is greater than the cost of the conflicts themselves. • Select a compiler option to suit the application: -arch sm_xx.

  41. General Guidelines on Shared Memory • The trend in parallel architecture design is towards an inverse memory hierarchy: the number of registers is increasing relative to cache and shared memory. • GTX 560 Ti: 32768 registers available per block; 49152 bytes of shared memory per block; 524288-byte L2 cache. • Tesla T20 (Hydra): 32768 registers available per block; 49152 bytes of shared memory per block; 786432-byte L2 cache.

  42. General Guidelines on Shared Memory • Contrary to early NVIDIA documentation, shared memory is not as fast as register memory. • On most current hardware, the L1 cache and shared memory share the same physical storage. • The shared memory hardware in the Fermi architecture is a step backward [RF01]. • Use registers instead of shared memory whenever possible [VV01].

  43. Experimental Analysis of CPU-GPU Data Transfer Optimization using CUDA

  44. Optimizations in CUDA • Pinned (non-pageable) memory: decreases the time to copy data between CPU and GPU. • Multiple streams: hide transfer time by overlapping kernel execution with memory transfers.

  45. Pinned Memory (Review) • What is pinned memory? Pinned memory enables: • concurrent data transfer and kernel execution; • mapping of host memory into the device address space; • higher memory bandwidth, because the driver deals with physical rather than virtual addresses and does not need an intermediate copy buffer.

  46. Allocating Page-Locked Memory in CUDA • cudaMallocHost(void **ptr, size_t size): allocates page-locked host memory that is accessible to the device. • cudaHostAlloc(void **ptr, size_t size, unsigned int flags): allocates page-locked host memory that is accessible to the device, with additional flags to control its behaviour.
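
A minimal sketch of the two allocation paths (buffer names and size are illustrative); in both cases the memory must be released with cudaFreeHost, not free:

float *p1, *p2;
size_t bytes = 10 * 1024 * 1024;

cudaMallocHost((void**)&p1, bytes);                       // page-locked host memory, default behaviour
cudaHostAlloc((void**)&p2, bytes, cudaHostAllocDefault);  // same effect here; flags such as cudaHostAllocMapped
                                                          // or cudaHostAllocWriteCombined can be passed instead

/* ... use the buffers ... */

cudaFreeHost(p1);
cudaFreeHost(p2);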

  47. Experiment on Pinned Memory
#define SIZE (10*1024*1024)   // number of bytes in the arrays: 10 MB
cudaMalloc((void**)&dev_a, SIZE);                          // allocate memory on the device
/* ---------------- COPY USING PINNED MEMORY ---------------- */
cudaHostAlloc((void**)&a, SIZE, cudaHostAllocDefault);     // allocate page-locked memory on the CPU
for (i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);    // copy to device
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);    // copy back to host
}
/* ---------------- COPY USING REGULAR MEMORY --------------- */
a = (int*) malloc(SIZE);                                   // allocate regular (pageable) memory on the host
for (i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);    // copy to device
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);    // copy back to host
}

  48. Results of Pinned vs. Regular Memory on Hydra (chart: time to copy in ms vs. data transferred in bytes)

  49. CUDA Streams (Review) • A CUDA stream is a sequence of operations (commands) that execute in order. • Multiple CUDA streams can be created and executed together, with their operations interleaved. • Streams provide a mechanism to overlap memory transfers with computation.

  50. Creating a Stream • Create a stream object and associate it with a series of CUDA commands; that series then becomes the stream.
cudaStream_t stream1;
cudaStreamCreate(&stream1);
cudaMemcpyAsync(..., stream1);
MyKernel<<<grid, block, 0, stream1>>>(...);   // the stream is the fourth launch parameter
cudaMemcpyAsync(..., stream1);
• Regular cudaMemcpy cannot be used with streams; the asynchronous variants are needed for concurrent operation.
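
A minimal sketch of the overlap this enables (names, sizes, and the kernel are illustrative; assumes pinned host buffers and a device that supports concurrent copy and execution): the work is split into chunks and issued round-robin on two streams, so one stream’s kernel can run while the other stream’s copy is in flight:

cudaStream_t stream[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

for (int c = 0; c < numChunks; ++c) {
    int s = c % 2;                                 // alternate between the two streams
    int offset = c * chunkSize;
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
    myKernel<<<blocks, threads, 0, stream[s]>>>(d_in + offset, d_out + offset, chunkSize);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
}

for (int s = 0; s < 2; ++s) cudaStreamSynchronize(stream[s]);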
