This article provides an overview of GPU computing optimizations using CUDA, focusing on the Fermi architecture. It explains the structure of a Streaming Multiprocessor (SM), which contains 32 CUDA cores and can keep up to 1536 threads resident at once. Key topics include the organization of threads into warps, memory access patterns that improve performance, and the use of global and shared memory. You'll also learn how to structure CUDA code with its specific keywords, along with practical examples such as matrix multiplication and CPU-GPU data transfer.
Optimization Techniques for GPU Computing Using CUDA
What is Fermi's SM?
• It is one of 16 Streaming Multiprocessors on the chip.
• Each SM contains 32 CUDA cores.
What's a CUDA Core?
• The 32 CUDA cores of an SM work in tandem; the SM can keep up to 48 threads per core resident, for a total of 1536 threads per streaming multiprocessor.
• Threads are grouped together into thread blocks.
• Within an SM, threads execute in groups of 32 called warps (so, up to 48 resident warps per SM); see the warp-index sketch after the launch example below.
How Do You Use CUDA?
• CUDA is C with some extras:
What is <<<THIS>>>() All About?
• CUDA is C and C++ plus CUDA-specific keywords.
• CUDA code is compiled with NVCC, which uses G++ for the host code.
• All CUDA code is stored in .cu files, but not all code in .cu files runs on the GPU.
• The <<<THIS>>>() syntax is how we launch calls on the GPU.
Host, Global, and Device
• Functions use the CUDA qualifiers __global__, __host__, and __device__ to specify where they run and from where they can be called.
• __host__ : code runs on the CPU.
• __global__ : code runs on the GPU and is the entry point from the CPU (must return void).
• __device__ : code runs on the GPU and can only be called by other functions running on the GPU.
Host, Global, and Device: Example

__device__ float blahBlahBlah(float param) { return param; }

__global__ void blahBlah() {
    printf("%f\n", blahBlahBlah(3.1415f));   // device-side printf requires compute capability 2.0+
}

__host__ void blah() {
    blahBlah<<<16, 1024>>>();
}

The first launch argument is the number of thread blocks; the second is the number of threads per block.
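To make the warp grouping described earlier concrete, here is a minimal sketch (not from the original slides; the kernel name and launch sizes are illustrative) in which the first thread of each warp reports where its warp starts:

#include <cstdio>

// Each thread works out which warp and lane of its block it belongs to.
// warpSize is a CUDA built-in variable (32 on Fermi).
__global__ void whoAmI() {
    int tid  = threadIdx.x;        // thread index within the block
    int warp = tid / warpSize;     // which warp of the block this thread is in
    int lane = tid % warpSize;     // position within that warp
    if (lane == 0)
        printf("block %d: warp %d starts at thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    whoAmI<<<2, 128>>>();          // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();       // wait for the kernel (and its printf output)
    return 0;
}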
Global Memory Coalescing Using CUDA
Global Memory Coalescing
• Bandwidth, bandwidth, bandwidth!
• Global memory: coalesce accesses between the kernel and global memory.
• Shared memory: organize each warp's (group of 32 threads) access to the shared memory banks.
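The slides do not show a coalescing example, so here is a small illustrative sketch (the kernel names are my own) contrasting a coalesced access pattern, where consecutive threads of a warp read consecutive 32-bit words, with a strided pattern that scatters a warp's request across many memory transactions:

// Coalesced: thread i of each warp reads element i -> one wide transaction per warp.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighbouring threads touch addresses `stride` elements apart,
// so a single warp's read is split across many memory transactions.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}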
Shared Memory & Bank Conflict Using CUDA
Shared Memory & Bank Conflict
• Application: matrix multiplication.
• A shared memory bank is 32 bits wide, so there is no conflict in most cases for the float (32-bit) data type.
• We attempt to generate bank conflicts and compare the performance differences for:
  • data type: char (8 bit)
  • data type: double (64 bit)
• NVIDIA Compute Capability
Global Memory Implementation: kernel code

__global__ void kernel(Matrix A, Matrix B, Matrix C) {
    int y = threadIdx.x + blockIdx.x * blockDim.x;
    int x = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0;
    if (x < C.row && y < C.col) {
        for (int k = 0; k < A.col; k++) {
            sum += A.GetValue(x, k) * B.GetValue(k, y);
        }
        C.SetValue(x, y, sum);
    }
}

The shared-memory implementation is shown after the launch sketch below.
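For context, a possible host-side launch of this kernel is sketched here; the 16x16 block size and the use of C.row/C.col for the grid bounds are assumptions based on the kernel above, not part of the original slides:

// One thread per element of C; a 16x16 block size is assumed here.
dim3 block(16, 16);
dim3 grid((C.col + block.x - 1) / block.x,    // x covers columns (the kernel's y index)
          (C.row + block.y - 1) / block.y);   // y covers rows (the kernel's x index)
kernel<<<grid, block>>>(A, B, C);
cudaDeviceSynchronize();                      // wait for the result before using C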
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C.
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    float Cvalue = 0;
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over the sub-matrices of A and B needed to compute Csub.
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Each thread loads one element of each sub-matrix into shared memory.
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        __syncthreads();
        for (int e = 0; e < BLOCK_SIZE; ++e) {
            Cvalue += As[row][e] * Bs[e][col];
        }
        __syncthreads();
    }
    // Write Csub to device memory. Each thread writes one element.
    SetElement(Csub, row, col, Cvalue);
}
Runtime overview – comparing various data types with basic settings – [results chart]
Implementation for the example: type char
• 8-bit and 16-bit accesses typically generate bank conflicts. For example:

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

(shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank, since each bank is 32 bits wide.)

• There is no bank conflict, however, if the array is accessed with a four-byte stride, which spreads the accesses across different banks:

char data = shared[BaseIndex + 4 * tid];
Implementation for the example: type char

for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
    Matrix Asub = GetSubMatrix(A, blockRow, m);
    Matrix Bsub = GetSubMatrix(B, m, blockCol);
    __shared__ char As[BLOCK_SIZE * 4][BLOCK_SIZE * 4];
    __shared__ char Bs[BLOCK_SIZE * 4][BLOCK_SIZE * 4];
    As[row * 4][col * 4] = GetElement(Asub, row, col);
    Bs[row * 4][col * 4] = GetElement(Bsub, row, col);
    __syncthreads();
    for (int e = 0; e < BLOCK_SIZE; ++e) {
        Cvalue += As[row * 4][e * 4] * Bs[e * 4][col * 4];
    }
    __syncthreads();
}
NVIDIA Compute Capability
• Compiler settings are another factor to consider when tuning performance.
• Compute Capability 1.x (old architecture): -arch sm_11
• Compute Capability 2.x (Fermi architecture; examples: Tesla T20 (Hydra), GTX 560 Ti): -arch sm_20
• The previous run on the GTX 560 Ti was compiled for Compute Capability 1.1.
• What if we compile the matrix multiplication app (char) with the sm_20 switch (compile commands sketched below)?
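For example (the file and output names here are illustrative), the two builds differ only in the architecture switch passed to nvcc:

nvcc -arch=sm_11 matmul_char.cu -o matmul_char_sm11    # Compute Capability 1.x build
nvcc -arch=sm_20 matmul_char.cu -o matmul_char_sm20    # Fermi (Compute Capability 2.x) build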
The new compilation option is much worse... (On Hydra, we don't see any difference between sm_11 and sm_20.)
Matrix Multiplication: type double
• A solution (obsolete; using -arch sm_13) is to split each double into two 32-bit halves:

__shared__ int shared_lo[32];
__shared__ int shared_hi[32];
shared_lo[BaseIndex + tid] = __double2loint(dataIn);
shared_hi[BaseIndex + tid] = __double2hiint(dataIn);
double dataOut = __hiloint2double(shared_hi[BaseIndex + tid],
                                  shared_lo[BaseIndex + tid]);

• For devices of compute capability 2.x, there are no bank conflicts for arrays of double:

__shared__ double shared[32];
double data = shared[BaseIndex + tid];
Conclusion on Bank Conflict in Shared Memory (so far, from the data collected)
• Bank conflicts are becoming increasingly irrelevant as the hardware advances.
• This is probably because the overhead of trying to avoid them is greater than the cost of the conflicts themselves.
• Select a compiler option (-arch sm_xx) to suit the application.
General guideline on Shared Memory
• The trend in parallel architecture design is towards an inverse memory hierarchy, where the number of registers grows relative to cache and shared memory.

GTX 560 Ti:
• Total number of registers available per block: 32768
• Total amount of shared memory per block: 49152 bytes
• L2 cache size: 524288 bytes

Tesla T20 (Hydra):
• Total number of registers available per block: 32768
• Total amount of shared memory per block: 49152 bytes
• L2 cache size: 786432 bytes
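These per-device numbers can also be queried at runtime; the short sketch below (not from the original slides) reads them with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Registers per block : %d\n",        prop.regsPerBlock);
    printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("L2 cache size       : %d bytes\n",  prop.l2CacheSize);
    return 0;
}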
General guideline on Shared Memory
• Contrary to early NVIDIA documentation, shared memory is not as fast as register memory.
• On most current hardware, the L1 cache and shared memory share the same on-chip storage.
• Current shared memory hardware on the Fermi architecture is a step backward [RF01].
• Use registers instead of shared memory whenever possible [VV01].
Experimental Analysis of CPU-GPU Data Transfer Optimization using CUDA
Optimizations by CUDA
• Pinned (non-pageable) memory optimization
  • Decreases the time to copy data between the CPU and the GPU.
• Optimization through multiple streams
  • Hides transfer time by overlapping kernel execution with memory transfers.
Pinned Memory (Review)
• What is pinned memory?
• Pinned memory enables:
  • Concurrent data transfer and kernel execution
  • Mapping of the host memory into the device address space
  • Higher memory bandwidth
• It deals with real rather than virtual addresses.
• It does not need intermediate copy buffering.
Allocating Page-Locked Memory in CUDA
• cudaMallocHost(void** ptr, size_t size)
  • Allocates page-locked host memory that is accessible to the device.
• cudaHostAlloc(void** ptr, size_t size, unsigned int flags)
  • Allocates page-locked host memory that is accessible to the device; its flags argument offers additional options.
Experiment on Pinned Memory

#define SIZE (10 * 1024 * 1024)   // number of bytes in the arrays: 10 MB

cudaMalloc((void**)&dev_a, SIZE);   // allocate memory on the device

/* ---------------- COPY USING PINNED MEMORY ---------------- */
cudaHostAlloc((void**)&a, SIZE, cudaHostAllocDefault);   // allocate page-locked memory on the CPU
for (i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);   // copy to device
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);   // copy back to host
}

/* ---------------- COPY USING REGULAR MEMORY ---------------- */
a = (int*)malloc(SIZE);   // allocate regular (pageable) memory on the host
for (i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);   // copy to device
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);   // copy back to host
}
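The runtimes reported below were measured around these copy loops; one common way to time them, sketched here as an assumption rather than the authors' actual code, is with CUDA events:

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // mark the beginning of the copy loop
for (int i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);
}
cudaEventRecord(stop, 0);                  // mark the end
cudaEventSynchronize(stop);                // wait until the stop event completes
cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);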
Results of Pinned vs. Regular Memory on Hydra [chart: time to copy (ms) vs. data transferred (bytes)]
CUDA Streams (Review)
• A CUDA stream is a sequence of operations (commands) that are executed in order.
• Multiple CUDA streams can be created and executed together, with their operations interleaved.
• Streams provide a mechanism to overlap memory transfers with computation.
Creating a Stream
• A stream is used by creating a stream object and associating it with a series of CUDA commands, which then become the stream.

cudaStream_t stream1;
cudaStreamCreate(&stream1);
cudaMemcpyAsync(…, stream1);
MyKernel<<<grid, block, 0, stream1>>>(…);
cudaMemcpyAsync(…, stream1);

You cannot use regular cudaMemcpy with streams; asynchronous commands are needed for concurrent operation.
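As a usage sketch (not from the original slides; the buffer names, chunk size, and kernel signature are assumptions), splitting the work across two streams lets one stream's copies overlap the other's kernel execution; note that cudaMemcpyAsync requires page-locked host buffers:

const int N = 1024 * 1024;                 // elements per chunk (illustrative)
cudaStream_t stream[2];
for (int s = 0; s < 2; s++)
    cudaStreamCreate(&stream[s]);

for (int s = 0; s < 2; s++) {
    int offset = s * N;
    // host_a and host_b must have been allocated with cudaHostAlloc (pinned).
    cudaMemcpyAsync(dev_a + offset, host_a + offset, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    MyKernel<<<N / 256, 256, 0, stream[s]>>>(dev_a + offset, dev_b + offset, N);
    cudaMemcpyAsync(host_b + offset, dev_b + offset, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);
}

for (int s = 0; s < 2; s++) {
    cudaStreamSynchronize(stream[s]);      // wait for each stream to finish
    cudaStreamDestroy(stream[s]);
}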