The Missouri S&T CS GPU Cluster

The Missouri S&T CS GPU Cluster Cyriac Kandoth

Pretext • NVIDIA ( ) is a manufacturer of graphics processor technologies that has begun to promote their GPUs as general purpose devices (GPGPUs) • They donated 8 Tesla GPUs to the Missouri S&T Computer Science Department • This presentation will introduce you to the concept of creating applications that take advantage of the massive parallelism inherent in GPUs

Tech Specs • 4 desktop-style cases, each housing: • Intel Core i7 920 @2.66GHz (Quad core with HyperThreading i.e. 8 logical processors) • 8GB DDR3 1333 @1066MHz (The Core i7 under-clocks the RAM to a speed that it supports) • 500 GB 3.0Gbps 7200rpm SATA Hard disk drive • Two Tesla C1060 cards on PCI-Express 2.0 x16 (A Tesla C1060 is a compute-only GPU that has no video output ports) • ATI Radeon HD2400 on standard PCI (for display)

The Tesla C1060

Cluster networking Gigabit Switch • The 4 nodes are named gpu0 thru gpu3 • gpu0 is the frontend that acts as the gateway into the cluster from the mst.edudomain • At the moment only gpu1 thru gpu3 are available eth1 eth1 eth1 eth1 The Missouri S&T Network eth0 gpu0 gpu1 gpu2 gpu3

The CUDA programming model • CUDA (Compute Unified Device Architecture) is a C/C++ programming model and API (Application Programming Interface) introduced by NVIDIA to enable software developers to code general purpose apps that run on the massively parallel hardware on GPUs. • GPUs are optimal for data parallel apps aka SIMD (Single Instruction Multiple Data). CUDA allows us to also code MIMD apps, but at a reduced efficiency. • Threads running in parallel use extremely fast shared memory for communication. There is no MPI_Send(), but the equivalent of MPI_Barrier() is __syncthreads().

The CUDA programming model • In your code, you can create a kernel (a function) that will run many instances of itself on parallel threads on the GPU. Threads running in parallel are collectively known as a grid. • Kernels are run on the device (GPU) while the rest of the code runs on the host (CPU).

The CUDA programming model • A grid is organized into blocks, and each block is organized into threads. • Only threads within the same block can communicate via shared memory; and sync. • This type of organization helps the GPU parallelize thread execution using it’s inbuilt hardware protocols.

Built-in Variables accessible in a Kernel dim3gridDim • Contains the dimensions of blocks in the grid as specified during kernel invocation. gridDim.x, gridDim.y (.z is unused) uint3blockIdx • Contains the block index within the grid. blockIdx.x, blockIdx.y (.z is unused) dim3blockDim • Contains the dimensions of threads in a block (blockDim.x, blockDim.y, and blockDim.z) uint3threadIdx • Contains the thread index within the block (threadIdx.x, threadIdx.y, and threadIdx.z)

E.g. Host invokes kernel on a device // Kernel definition, runs a copy on every thread __global__ voidvectorAdd( float* A, float* B, float* C ) { ... } int main(intargc, char** argv) { dim3blockSize(16, 16); // 256 threads per block (up to 3D) dim3gridSize(4, 2); // 8 blocks in the grid (up to 2D) // Invoke the kernel on the device (GPU) vectorAdd<<<gridSize, blockSize>>>(A, B, C); ... // Continue running on host (CPU) when device is done }

CUDA Type Qualifiers Function type qualifiers __device__ • Executed on the device • Callable from the device only __global__ • Executed on the device • Callable from the host only __host__ • Executed on the host • Callable from the host only • Default type if unspecified • Variable type qualifiers • __device__ • global memory space • Is accessible from all the threads within the grid • __constant__ • constant memory space • Is accessible from all the threads within the grid • __shared__ • space of a thread block • Is only accessible from all the threads within the block

Template of a typical main() int main(intargc, char** argv) { // Allocate memory on the host for input data - malloc() // Initialize input data from file, user input, etc. // Allocate memory on the device - cudaMalloc() // Send input data to the device - cudaMemcpy() // Set up grid and block dimensions - dim3 variables // Invoke the kernel on the device (GPU) - kernelName<<<gridSize, blockSize>>>(input_params); // Copy results from device to host - cudaMemcpy() // Free up device memory - cudaFree() // Print results at the host, because device can’t. // printf() from kernel only works in emulation mode }

CUDA apps in emulation mode • Compile the program with the emu parameter enabled: make emu=1 • The program emulates a GPU on the host CPU. Usually much slower. • Helps with debugging, because you are allowed to use printf() statements in device code (from CUDA kernels)

Types of shared memory • Registers: Fastest form of memory on the GPU. Is only accessible by individual threads and has the lifetime of a thread. We don’t need to deal with it directly (but we can). • Shared Memory: Can be as fast as a register when there are no bank conflicts (when threads read from the same address). Accessible by any thread of the block from which it was created. Has the lifetime of the block. • Global memory: Potentially 150x slower than register or shared memory because of un-coalesced reads and writes. Accessible from either the host or device. Has the lifetime of the application. Read-only global memory is called constant memory. • Local memory: Resides in global memory and can be 150x slower than register/shared memory. Is only accessible by the thread. Has the lifetime of the thread.

A few CUDA API functions • cudaSetDevice(int dev) - Sets the device to run the kernel. • __syncthreads()- Blocks execution of all threads within a block until they synchronize. • cudaMalloc(void** devPtr, size_t count)- Allocates count bytes in GPU memory and returns a pointer to it in the parameter *devPtr. • cudaMemcpy(void* dst, const void* src,size_t count, enumcudaMemcpyKind kind) - copies count bytes from src to dstwhere kind is A complete listing of the CUDA API functions can be found in the Reference Manual.

Tips for speedy code • Have the kernel use the whole card - Have a multiple of 32 threads per block and at least as many blocks as multiprocessors (240 on the Tesla C1060s). • Access global memory properly. Coalescing - Memory read by consecutive threads are combined by the hardware into several, wide memory reads. • Avoid shared memory bank conflicts. • Have as few branching conditional loops as possible. • Have small loops unrolled. • Have no unnecessary __syncthreads() calls. • See the CUDA Programming Guide for further discussion on all of the above.

Questions?

The Missouri S&T CS GPU Cluster