
CUDA

Presentation Transcript


  1. CUDA Antonyus Pyetro do Amaral Ferreira

  2. The problem • The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. • The challenge is to develop application software that transparently scales its parallelism.

  3. A solution • CUDA is a parallel programming model and software environment. • A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.

  4. CPU vs. GPU

  5. CPU vs. GPU • The GPU is especially well-suited to address problems that can be expressed as data-parallel computations. • Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control.

  6. Applications? • From general signal processing or physics simulation to computational finance or computational biology. • The latest generation of NVIDIA GPUs, based on the Tesla architecture, supports the CUDA programming model.

  7. CUDA - Hello world
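
The hello-world slide itself is an image and is not reproduced in the transcript. Below is a minimal sketch of what such a program typically looks like for this generation of CUDA (the kernel name helloKernel and the launch configuration are illustrative, not taken from the slides):

    #include <stdio.h>

    // An (empty) kernel: every launched CUDA thread executes this body on the device.
    __global__ void helloKernel(void)
    {
    }

    int main(void)
    {
        // Launch one block of four threads on the device, then print from the host.
        helloKernel<<<1, 4>>>();
        cudaThreadSynchronize();   // wait for the device to finish
        printf("Hello world from CUDA!\n");
        return 0;
    }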

  8. What is CUDA? • CUDA extends C by allowing the programmer to define C functions, called kernels, that are executed N times in parallel by N different CUDA threads. • Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

  9. CUDA Sum of vectors
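
The vector-sum code on this slide is an image; below is a sketch along the lines of the programming guide's vector-addition example (the size N, the variable names, and the final printf are illustrative):

    #include <stdio.h>

    #define N 256

    // Kernel: each of the N threads adds one pair of elements, selected by its thread ID.
    __global__ void vecAdd(float *A, float *B, float *C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main(void)
    {
        float hA[N], hB[N], hC[N];
        float *dA, *dB, *dC;
        size_t size = N * sizeof(float);

        for (int i = 0; i < N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

        // Allocate device memory and copy the inputs from host to device.
        cudaMalloc((void**)&dA, size);
        cudaMalloc((void**)&dB, size);
        cudaMalloc((void**)&dC, size);
        cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);

        // Launch one block of N threads, matching the vecAdd<<<1, N>>> form shown later.
        vecAdd<<<1, N>>>(dA, dB, dC);

        // Copy the result back to host memory and release the device memory.
        cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);
        printf("hC[10] = %f (expected 30.0)\n", hC[10]);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }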

  10. Concurrency • Threads within a block can cooperate among themselves by sharing data through some shared memory. • __syncthreads() acts as a barrier at which all threads in the block must wait before any are allowed to proceed.
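
A small sketch of cooperation through shared memory (the kernel name and block size are illustrative): each thread writes one element into a per-block shared array, and __syncthreads() guarantees that all writes have completed before any thread reads an element written by another thread.

    #define BLOCK 64

    // Reverse BLOCK elements in place using per-block shared memory.
    __global__ void reverseBlock(float *d)
    {
        __shared__ float s[BLOCK];
        int t = threadIdx.x;

        s[t] = d[t];
        __syncthreads();            // barrier: every thread has stored its element
        d[t] = s[BLOCK - 1 - t];    // now safe to read another thread's element
    }

    // Launched, for example, as: reverseBlock<<<1, BLOCK>>>(d_data);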

  11. Process Hierarchy

  12. Memory Hierarchy • Per-thread local memory • Per-block shared memory

  13. Memory Hierarchy Global Memory

  14. Host and Device • CUDA assumes that the CUDA threads may execute on a physically separate device that operates as a coprocessor to the host. • CUDA also assumes that both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively.
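
A sketch of the resulting programming pattern (buffer name and size are illustrative): data starts in host memory, is copied explicitly into device memory before kernels use it, and is copied back afterwards.

    int main(void)
    {
        float hostData[1024];                   // lives in host (CPU) DRAM
        float *deviceData;
        size_t bytes = 1024 * sizeof(float);

        cudaMalloc((void**)&deviceData, bytes); // allocate in device (GPU) DRAM
        cudaMemcpy(deviceData, hostData, bytes,
                   cudaMemcpyHostToDevice);     // explicit host -> device copy
        // ... launch kernels that read and write deviceData ...
        cudaMemcpy(hostData, deviceData, bytes,
                   cudaMemcpyDeviceToHost);     // copy results back to the host
        cudaFree(deviceData);
        return 0;
    }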

  15. Software Stack • Layers, from top to bottom: Host Application → CUDA Libraries → CUDA Runtime → CUDA Driver → Device

  16. Language Extensions • Function type qualifiers to specify whether a function executes on the host or on the device and whether it is callable from the host or from the device. • Variable type qualifiers to specify the memory location on the device of a variable.

  17. Language Extensions • A new directive to specify how a kernel is executed on the device from the host, for example: vecAdd<<<1, N>>>(A, B, C); • Four built-in variables that specify the grid and block dimensions and the block and thread indices.

  18. Function Type Qualifiers • __device__: executed on the device, callable from the device only. • __global__: executed on the device, callable from the host only. • __host__: executed on the host, callable from the host only; this is the default when no qualifier is given.
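
A short sketch showing the three qualifiers together (function names are illustrative):

    // __device__: runs on the device, callable only from device code.
    __device__ float square(float x) { return x * x; }

    // __global__: a kernel; runs on the device, launched from the host with <<< >>>.
    __global__ void squareAll(float *data)
    {
        int i = threadIdx.x;
        data[i] = square(data[i]);
    }

    // __host__: runs on the host; this qualifier is implicit when none is given.
    __host__ void launchSquareAll(float *d_data, int n)
    {
        squareAll<<<1, n>>>(d_data);
    }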

  19. Variable Type Qualifiers • __device__: resides in the global memory space; accessible from all the threads within the grid. • __constant__: resides in the constant memory space; accessible from all the threads within the grid. • __shared__: resides in the shared memory space of a thread block; accessible only from the threads within the block.
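
A sketch placing one variable in each memory space (names, sizes, and the assumed 64-thread block are illustrative):

    __device__   float d_results[256];   // global memory: visible to every thread in the grid
    __constant__ float d_coeffs[16];     // constant memory: read-only in kernels, visible grid-wide

    __global__ void applyCoeffs(float *in)
    {
        __shared__ float tile[64];        // shared memory: one copy per thread block
        int t = threadIdx.x;

        tile[t] = in[t];
        __syncthreads();                  // make the block's loads visible to all its threads
        d_results[t] = tile[t] * d_coeffs[t % 16];
    }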

  20. Execution Configuration • Any call to a __global__ function must specify the execution configuration for that call. • The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device.

  21. Execution Configuration • <<< Dg, Db, Ns, S >>> • Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z is unused; • Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block; • Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array; Ns is an optional argument which defaults to 0; • S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
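
A sketch of a launch that spells out all four arguments (the kernel and function names are illustrative):

    __global__ void fillKernel(float *out)
    {
        extern __shared__ float buf[];   // sized at launch time by the Ns argument
        // ... kernel body omitted ...
    }

    void launchFill(float *d_out)
    {
        dim3 Dg(4, 2);                       // Dg.x * Dg.y = 8 blocks (Dg.z unused)
        dim3 Db(8, 8, 4);                    // Db.x * Db.y * Db.z = 256 threads per block
        size_t Ns = 256 * sizeof(float);     // bytes of dynamic shared memory per block
        cudaStream_t S = 0;                  // default stream

        fillKernel<<<Dg, Db, Ns, S>>>(d_out);
    }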

  22. Built-in Variables • gridDim This variable is of type dim3 and contains the dimensions of the grid. • blockIdx This variable is of type uint3 and contains the block index within the grid. • blockDim This variable is of type dim3 and contains the dimensions of the block. • threadIdx This variable is of type uint3 and contains the thread index within the block.
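
These variables are typically combined to give each thread a unique global index, so a grid of many blocks can cover an array larger than a single block (the kernel name and sizes are illustrative):

    __global__ void scale(float *data, int n)
    {
        // Global index = block offset within the grid + thread offset within the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = 2.0f * data[i];
    }

    // Host-side launch: enough 256-thread blocks to cover n elements.
    // int blocks = (n + 255) / 256;
    // scale<<<blocks, 256>>>(d_data, n);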

  23. Example – Matrix multiplication • Task: C = A(hA, wA) x B(wA, wB) • Each thread block is responsible for computing one square sub-matrix Csub of C; • Each thread within the block is responsible for computing one element of Csub.

  24. Example – Matrix multiplication Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (wA, block_size) and the sub-matrix of B of dimension (block_size, wA). These two rectangular matrices are divided into as many square matrices of dimension block_size as necessary, and Csub is computed as the sum of the products of these square matrices.

  25. Example – Matrix multiplication
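
The code on this slide is an image; below is a sketch of the tiled kernel the preceding slides describe (BLOCK_SIZE, the names, and the assumption that all matrix dimensions are multiples of BLOCK_SIZE are illustrative simplifications):

    #define BLOCK_SIZE 16

    // C = A * B, with A of size hA x wA and B of size wA x wB, all row-major.
    // Each block computes one BLOCK_SIZE x BLOCK_SIZE sub-matrix Csub of C;
    // each thread computes one element of Csub, accumulating over the tiles of A and B.
    __global__ void matMul(float *A, float *B, float *C, int wA, int wB)
    {
        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;   // row of C for this thread
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;   // column of C for this thread
        float sum = 0.0f;

        // Walk over the square tiles of A and B that contribute to this Csub.
        for (int m = 0; m < wA / BLOCK_SIZE; ++m) {
            __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
            __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

            // Each thread loads one element of the current tile of A and one of B.
            As[threadIdx.y][threadIdx.x] = A[row * wA + m * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * wB + col];
            __syncthreads();                               // tiles fully loaded

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                               // done reading these tiles
        }

        C[row * wB + col] = sum;                           // one element of Csub
    }

    // Launch (illustrative), one thread per element of C:
    // dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    // dim3 grid(wB / BLOCK_SIZE, hA / BLOCK_SIZE);
    // matMul<<<grid, threads>>>(dA, dB, dC, wA, wB);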

  26. Compilation with NVCC • In MS Visual Studio 2005: Tools > Options > Projects and Solutions > VC++ Directories > Include files / Library files. • Point these to: C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc

  27. CUDA Interoperability • OpenGL • Direct3D
