This article provides an overview of GPU computing optimizations using CUDA, focusing on the Fermi architecture. It explains the structure of a Streaming Multiprocessor (SM), which contains 32 CUDA cores and can keep up to 1536 threads resident at once. Key topics include the organization of threads into warps, memory access patterns that improve performance, and the use of global and shared memory. You'll also learn how to structure CUDA code with its specific keywords, along with practical examples such as matrix multiplication and CPU-GPU data transfer.
Optimization Techniques for GPU Computing Using CUDA
What is Fermi's SM?
• It is one of 16 Streaming Multiprocessors on the chip.
• Each SM contains 32 CUDA cores.
What's a CUDA Core?
• The 32 CUDA cores of an SM work in tandem; the SM can keep up to 48 threads per core resident, for a total of 1536 threads per streaming multiprocessor.
• Threads are grouped together into thread blocks.
• Within an SM, threads execute in groups of 32 called warps (so, up to 48 resident warps per SM); see the warp-index sketch after the launch example below.
How Do You Use CUDA?
• CUDA is C with some extras:
What is <<<THIS>>>() All About?
• CUDA is C and C++ plus CUDA-specific keywords.
• CUDA code is compiled with NVCC, which uses G++ for the host code.
• All CUDA code is stored in .cu files, but not all code in .cu files runs on the GPU.
• The <<<THIS>>>() syntax is how we launch calls on the GPU.
Host, Global, and Device
• Functions use the CUDA qualifiers __global__, __host__, and __device__ to specify where they run and from where they can be called.
• __host__ : code runs on the CPU.
• __global__ : code runs on the GPU and is the entry point from the CPU (must return void).
• __device__ : code runs on the GPU and can only be called by other functions running on the GPU.
Host, Global, and Device: Example

__device__ float blahBlahBlah(float param) { return param; }

__global__ void blahBlah() {
    printf("%f\n", blahBlahBlah(3.1415f));   // device-side printf requires compute capability 2.0+
}

__host__ void blah() {
    blahBlah<<<16, 1024>>>();
}

The first launch argument is the number of thread blocks; the second is the number of threads per block.
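To make the warp grouping described earlier concrete, here is a minimal sketch (not from the original slides; the kernel name and launch sizes are illustrative) in which the first thread of each warp reports where its warp starts:

#include <cstdio>

// Each thread works out which warp and lane of its block it belongs to.
// warpSize is a CUDA built-in variable (32 on Fermi).
__global__ void whoAmI() {
    int tid  = threadIdx.x;        // thread index within the block
    int warp = tid / warpSize;     // which warp of the block this thread is in
    int lane = tid % warpSize;     // position within that warp
    if (lane == 0)
        printf("block %d: warp %d starts at thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    whoAmI<<<2, 128>>>();          // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();       // wait for the kernel (and its printf output)
    return 0;
}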
Global Memory Coalescing Using CUDA
Global Memory Coalescing
• Bandwidth, bandwidth, bandwidth!
• Global memory: coalesce accesses between the kernel and global memory.
• Shared memory: organize each warp's (group of 32 threads) access to the shared memory banks.
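The slides do not show a coalescing example, so here is a small illustrative sketch (the kernel names are my own) contrasting a coalesced access pattern, where consecutive threads of a warp read consecutive 32-bit words, with a strided pattern that scatters a warp's request across many memory transactions:

// Coalesced: thread i of each warp reads element i -> one wide transaction per warp.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighbouring threads touch addresses `stride` elements apart,
// so a single warp's read is split across many memory transactions.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}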
Shared Memory & Bank Conflict Using CUDA
Shared Memory & Bank Conflict
• Application: matrix multiplication.
• A shared memory bank is 32 bits wide, so there is no conflict in most cases for the float (32-bit) data type.
• We attempt to generate bank conflicts and compare the performance differences for:
  • data type: char (8 bit)
  • data type: double (64 bit)
• NVIDIA Compute Capability
Global Memory Implementation: kernel code

__global__ void kernel(Matrix A, Matrix B, Matrix C) {
    int y = threadIdx.x + blockIdx.x * blockDim.x;
    int x = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0;
    if (x < C.row && y < C.col) {
        for (int k = 0; k < A.col; k++) {
            sum += A.GetValue(x, k) * B.GetValue(k, y);
        }
        C.SetValue(x, y, sum);
    }
}

The shared-memory implementation is shown after the launch sketch below.
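For context, a possible host-side launch of this kernel is sketched here; the 16x16 block size and the use of C.row/C.col for the grid bounds are assumptions based on the kernel above, not part of the original slides:

// One thread per element of C; a 16x16 block size is assumed here.
dim3 block(16, 16);
dim3 grid((C.col + block.x - 1) / block.x,    // x covers columns (the kernel's y index)
          (C.row + block.y - 1) / block.y);   // y covers rows (the kernel's x index)
kernel<<<grid, block>>>(A, B, C);
cudaDeviceSynchronize();                      // wait for the result before using C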
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C.
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    float Cvalue = 0;
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over the sub-matrices of A and B needed to compute Csub.
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Each thread loads one element of each sub-matrix into shared memory.
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        __syncthreads();
        for (int e = 0; e < BLOCK_SIZE; ++e) {
            Cvalue += As[row][e] * Bs[e][col];
        }
        __syncthreads();
    }
    // Write Csub to device memory. Each thread writes one element.
    SetElement(Csub, row, col, Cvalue);
}
Runtime overview – comparing various data types with basic settings – [results chart]
Implementation for the example: type char
• 8-bit and 16-bit accesses typically generate bank conflicts. For example:

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

(shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank, since each bank is 32 bits wide.)

• There is no bank conflict, however, if the array is accessed with a four-byte stride, which spreads the accesses across different banks:

char data = shared[BaseIndex + 4 * tid];
Implementation for the example: type char

for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
    Matrix Asub = GetSubMatrix(A, blockRow, m);
    Matrix Bsub = GetSubMatrix(B, m, blockCol);
    __shared__ char As[BLOCK_SIZE * 4][BLOCK_SIZE * 4];
    __shared__ char Bs[BLOCK_SIZE * 4][BLOCK_SIZE * 4];
    As[row * 4][col * 4] = GetElement(Asub, row, col);
    Bs[row * 4][col * 4] = GetElement(Bsub, row, col);
    __syncthreads();
    for (int e = 0; e < BLOCK_SIZE; ++e) {
        Cvalue += As[row * 4][e * 4] * Bs[e * 4][col * 4];
    }
    __syncthreads();
}
NVIDIA Compute Capability
• Compiler settings are another factor to consider when tuning performance.
• Compute Capability 1.x (old architecture): -arch sm_11
• Compute Capability 2.x (Fermi architecture; examples: Tesla T20 (Hydra), GTX 560 Ti): -arch sm_20
• The previous run on the GTX 560 Ti was compiled for Compute Capability 1.1.
• What if we compile the matrix multiplication app (char) with the sm_20 switch (compile commands sketched below)?
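For example (the file and output names here are illustrative), the two builds differ only in the architecture switch passed to nvcc:

nvcc -arch=sm_11 matmul_char.cu -o matmul_char_sm11    # Compute Capability 1.x build
nvcc -arch=sm_20 matmul_char.cu -o matmul_char_sm20    # Fermi (Compute Capability 2.x) build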
The new compilation option is much worse... (On Hydra, we don't see any difference between sm_11 and sm_20.)
Matrix Multiplication: type double
• A solution (obsolete; using -arch sm_13) is to split each double into two 32-bit halves:

__shared__ int shared_lo[32];
__shared__ int shared_hi[32];
shared_lo[BaseIndex + tid] = __double2loint(dataIn);
shared_hi[BaseIndex + tid] = __double2hiint(dataIn);
double dataOut = __hiloint2double(shared_hi[BaseIndex + tid],
                                  shared_lo[BaseIndex + tid]);

• For devices of compute capability 2.x, there are no bank conflicts for arrays of double:

__shared__ double shared[32];
double data = shared[BaseIndex + tid];
Conclusion on Bank Conflict in Shared Memory (so far, from the data collected)
• Bank conflicts are becoming increasingly irrelevant as the hardware advances.
• This is probably because the overhead of trying to avoid them is greater than the cost of the conflicts themselves.
• Select a compiler option (-arch sm_xx) to suit the application.
General guideline on Shared Memory
• The trend in parallel architecture design is towards an inverse memory hierarchy, where the number of registers grows relative to cache and shared memory.

GTX 560 Ti:
• Total number of registers available per block: 32768
• Total amount of shared memory per block: 49152 bytes
• L2 cache size: 524288 bytes

Tesla T20 (Hydra):
• Total number of registers available per block: 32768
• Total amount of shared memory per block: 49152 bytes
• L2 cache size: 786432 bytes
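These per-device numbers can also be queried at runtime; the short sketch below (not from the original slides) reads them with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Registers per block : %d\n",        prop.regsPerBlock);
    printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("L2 cache size       : %d bytes\n",  prop.l2CacheSize);
    return 0;
}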
General guideline on Shared Memory
• Contrary to early NVIDIA documentation, shared memory is not as fast as register memory.
• On most current hardware, the L1 cache and shared memory share the same on-chip storage.
• Current shared memory hardware on the Fermi architecture is a step backward [RF01].
• Use registers instead of shared memory whenever possible [VV01].
Experimental Analysis of CPU-GPU Data Transfer Optimization using CUDA
Optimizations by CUDA
• Pinned (non-pageable) memory optimization
  • Decreases the time to copy data between the CPU and the GPU.
• Optimization through multiple streams
  • Hides transfer time by overlapping kernel execution with memory transfers.
Pinned Memory (Review)
• What is pinned memory?
• Pinned memory enables:
  • Concurrent data transfer and kernel execution
  • Mapping of the host memory into the device address space
  • Higher memory bandwidth
• It deals with real rather than virtual addresses.
• It does not need intermediate copy buffering.
Allocating Page-Locked Memory in CUDA
• cudaMallocHost(void** ptr, size_t size)
  • Allocates page-locked host memory that is accessible to the device.
• cudaHostAlloc(void** ptr, size_t size, unsigned int flags)
  • Allocates page-locked host memory that is accessible to the device; its flags argument offers additional options.
Experiment on Pinned Memory

#define SIZE (10 * 1024 * 1024)   // number of bytes in the arrays: 10 MB

cudaMalloc((void**)&dev_a, SIZE);   // allocate memory on the device

/* ---------------- COPY USING PINNED MEMORY ---------------- */
cudaHostAlloc((void**)&a, SIZE, cudaHostAllocDefault);   // allocate page-locked memory on the CPU
for (i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);   // copy to device
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);   // copy back to host
}

/* ---------------- COPY USING REGULAR MEMORY ---------------- */
a = (int*)malloc(SIZE);   // allocate regular (pageable) memory on the host
for (i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);   // copy to device
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);   // copy back to host
}
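The runtimes reported below were measured around these copy loops; one common way to time them, sketched here as an assumption rather than the authors' actual code, is with CUDA events:

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // mark the beginning of the copy loop
for (int i = 0; i < 100; i++) {
    cudaMemcpy(dev_a, a, SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(a, dev_a, SIZE, cudaMemcpyDeviceToHost);
}
cudaEventRecord(stop, 0);                  // mark the end
cudaEventSynchronize(stop);                // wait until the stop event completes
cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);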
Results of Pinned vs. Regular Memory on Hydra [chart: time to copy (ms) vs. data transferred (bytes)]
CUDA Streams (Review)
• A CUDA stream is a sequence of operations (commands) that are executed in order.
• Multiple CUDA streams can be created and executed together, with their operations interleaved.
• Streams provide a mechanism to overlap memory transfers with computation.
Creating a Stream
• A stream is used by creating a stream object and associating it with a series of CUDA commands, which then become the stream.

cudaStream_t stream1;
cudaStreamCreate(&stream1);
cudaMemcpyAsync(…, stream1);
MyKernel<<<grid, block, 0, stream1>>>(…);
cudaMemcpyAsync(…, stream1);

You cannot use regular cudaMemcpy with streams; asynchronous commands are needed for concurrent operation.
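As a usage sketch (not from the original slides; the buffer names, chunk size, and kernel signature are assumptions), splitting the work across two streams lets one stream's copies overlap the other's kernel execution; note that cudaMemcpyAsync requires page-locked host buffers:

const int N = 1024 * 1024;                 // elements per chunk (illustrative)
cudaStream_t stream[2];
for (int s = 0; s < 2; s++)
    cudaStreamCreate(&stream[s]);

for (int s = 0; s < 2; s++) {
    int offset = s * N;
    // host_a and host_b must have been allocated with cudaHostAlloc (pinned).
    cudaMemcpyAsync(dev_a + offset, host_a + offset, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    MyKernel<<<N / 256, 256, 0, stream[s]>>>(dev_a + offset, dev_b + offset, N);
    cudaMemcpyAsync(host_b + offset, dev_b + offset, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);
}

for (int s = 0; s < 2; s++) {
    cudaStreamSynchronize(stream[s]);      // wait for each stream to finish
    cudaStreamDestroy(stream[s]);
}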