
Introduction to GPU Programming for EDA



Presentation Transcript


  1. Introduction to GPU Programming for EDA John F. Croix Cadence Design Systems, Inc. Sunil P. Khatri Texas A&M University Acknowledgements: NVIDIA, Nascentric Inc., Accelicon Inc. Students: Kanupriya Gulati, Vinay Karkala, Kalyana Bollapalli

  2. Outline • GPU Architecture Overview • GPU Programming • Algorithm Acceleration Guidelines • Case Studies • Conclusion • Q&A

  3. Outline • GPU Architecture Overview – Evolution and architecture – Peak performance – GPU and CPU interaction: practical considerations • GPU Programming • Algorithm Acceleration Guidelines • Case Studies • Conclusion • Q&A

  4. GPU Evolution • In the early days, graphics accelerators were primitive • Acceleration of graphics rendering tasks for (CRT) displays • Many hardwired graphics acceleration units • With VLSI technology scaling, the GPU was born • Many programmable processors to handle graphics rendering tasks • Increased peak memory bandwidths and peak performance • Goal was faster and more realistic rendering for gaming applications • Recently, several scientific communities began to leverage these GPUs • Initially used graphics APIs like OpenGL and DirectX for these tasks • GPU vendors recognized this interest • Development of C-like programming environments such as CUDA • Development of GPU architectures tuned for scientific computations

  5. GPU Introduction • A GPU is essentially a commodity stream processor • Highly parallel (100s of processor cores) • Very fast (>900 GFLOPS of peak performance) • Operates in a SIMD manner; this is a key restriction • Multiple processors operate in lock-step (same instruction) but on different data • GPUs, owing to their massively parallel architecture, have been used to accelerate • Image/stream processing, data compression, numerical algorithms • Recently they have been used to accelerate CAD algorithms as well • Inexpensive, off-the-shelf cards such as the NVIDIA Quadro FX and GTX 280 achieve impressive performance • 933 GFLOPS peak performance • 240 SIMD cores partitioned into 30 multiprocessors (MPs) • 4 GB (Quadro) and 1 GB (GTX 280) device memory with 142 GB/s bandwidth • 1.4 GHz GPU operating frequency • Programmed with the Compute Unified Device Architecture (CUDA) framework

  6. GPU Architecture • In the GTX 280, there are 10 Thread Processing Clusters (TPCs) • Each has 3 Streaming Multiprocessors (SMs), which we will refer to as multiprocessors (MPs) • Each MP has 8 Streaming Processors (SPs) or Thread Processors (TPs); we will refer to these as processors • 240 processors and 30 MPs in all! • One double-precision FP unit per SM (Source: NVIDIA)

  7. GPU vs CPU: NVIDIA GTX 280 vs Intel i7 860 [comparison table not reproduced in the transcript] Footnotes: (1) http://ark.intel.com/Product.aspx?id=41316 (2) TPC = Thread Processing Cluster (24 cores) (3) 30 multiprocessors in a GTX 280

  8. GPU vs CPU Peak Performance Trends • GPU peak performance has grown aggressively • Hardware has kept up with Moore's law (Source: NVIDIA)

  9. GPU Programming Model • The GPU is viewed as a compute device that: • Is a coprocessor (slave) to the CPU (host) • Has its own DRAM (device memory) but no virtual memory, so an entire design instance may not fit on the GPU! • A kernel is a CPU-callable function; a thread is an instance of a kernel • The GPU runs many threads in parallel [Figure: host (CPU) connected over PCIe to the device (GPU) and its device memory, on which kernel threads run]
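
As a concrete illustration of this model, the sketch below (ours, not from the slides; the kernel name and sizes are arbitrary) shows a host program launching many instances of a kernel on the device:

    #include <cuda_runtime.h>

    // Kernel: a CPU-callable function; each GPU thread executes one instance of it.
    __global__ void scaleArray(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index of this thread
        if (i < n) {
            data[i] *= factor;                           // each instance handles one element
        }
    }

    int main()
    {
        const int n = 1 << 20;
        float *dData;                                    // pointer into device (GPU) DRAM
        cudaMalloc(&dData, n * sizeof(float));
        cudaMemset(dData, 0, n * sizeof(float));

        // Launch many threads in parallel: 256 per block, enough blocks to cover n.
        scaleArray<<<(n + 255) / 256, 256>>>(dData, 2.0f, n);
        cudaThreadSynchronize();                         // host blocks here until the GPU finishes

        cudaFree(dData);
        return 0;
    }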

  10. Data Transfers (CPU ↔ GPU) • GPUs and CPUs communicate via a PCIe bus • This communication is expensive and should be minimized for target applications • Graphics applications usually require • Initial data to be sent from CPU to GPU • A single transfer of processed data from GPU to CPU • General-purpose computations usually require • Multiple transfers between CPU and GPU (since conditional checks run on the CPU) • Possibility of saturating the PCIe bus and reducing the achievable performance

  11. GPU Threads vs. CPU Threads • GPU threads: • Lightweight, small creation and scheduling overhead, extremely fast hardware context switching • Need to issue 1000s of GPU threads to hide global memory latencies (600-800 cycles) • CPU threads: • Heavyweight, large scheduling overhead, slow context switching • Multi-GPU usage requires invocation of multiple CPU threads • Each CPU thread creates a GPU context • Context swapping is required for a CPU thread to access GPU memory allocated by another CPU thread

  12. Device Memory Space Overview • Each thread runs on an SP and has: • R/W per-thread registers (on-chip); limit usage (max 16K/MP) • R/W per-thread local memory (off-chip) • R/W per-block shared memory (on-chip); need to avoid bank conflicts • R/W per-grid global memory (off-chip); not cached, 600-800 cycle reads; latency hidden by parallelism and fast context switches; main means of data transfer between host and device; coalescing recommended • RO per-grid cached constant and texture memory (off-chip) • The host can R/W global, constant and texture memories (visible to all threads) [Figure: grid of blocks with per-thread registers and local memory, per-block shared memory, and global/constant/texture memory accessible to the host — Source: "NVIDIA CUDA Programming Guide" version 1.1]
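
To make these spaces concrete, here is a small illustrative kernel of our own (assuming a 256-thread block; names are hypothetical) that stages a coalesced global-memory read into shared memory and reduces it there:

    // Illustrative use of the memory spaces listed above; assumes blockDim.x == 256.
    __global__ void blockSum(const float *gIn, float *gOut, int n)
    {
        __shared__ float sData[256];                 // per-block shared memory (on-chip)

        int tid = threadIdx.x;                       // held in a per-thread register
        int gid = blockIdx.x * blockDim.x + tid;

        // Coalesced read from per-grid global memory; the 600-800 cycle latency is
        // hidden by running many blocks and threads concurrently.
        sData[tid] = (gid < n) ? gIn[gid] : 0.0f;
        __syncthreads();

        // Tree reduction inside the block; the strided access pattern avoids
        // shared-memory bank conflicts.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) {
                sData[tid] += sData[tid + stride];
            }
            __syncthreads();
        }

        if (tid == 0) {
            gOut[blockIdx.x] = sData[0];             // single global write per block
        }
    }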

  13. Outline • GPU Architecture Overview • GPU Programming – CPU threads – Conditional and loop processing – Floating point – General GPU program structure – CUDA and OpenCL • Algorithm Acceleration Guidelines • Case Studies • Conclusion • Q&A

  14. CPU Threading • CPU • All threads are equivalent • Read/write concurrently to the same memory • Synchronization primitives required to avoid collisions • GPU (NVIDIA) • Each CPU thread maintains a unique context • GPU resources (e.g. memory, code modules, address space) are context-specific • Each CPU thread can access only a single context at once • Contexts must be exchanged between CPU threads to share GPU resources between CPU threads • Contexts use reference counting and are automatically destroyed

  15. SIMD Conditional Processing • Unlike threads in a CPU-based program, SIMD programs cannot follow different execution paths • Ideal scenario: all GPU threads follow the same execution path, so all processors are active continuously • In divergent paths, some processors execute the then-block and others the else-block • Program flow cannot actually diverge: all instructions are executed, and the then- and else-blocks are both executed • A bit is used to enable/disable processors based on the block being executed • Parallelism is reduced, impacting performance
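
A minimal sketch of our own to show the difference: in the first kernel the even and odd lanes of a warp disagree on the branch, so both paths are executed with the inactive lanes masked off; in the second the condition is uniform across each 32-thread warp, so no lanes are disabled:

    // Divergent version: even and odd lanes of the same warp take different paths,
    // so the hardware runs both blocks and masks off the inactive processors.
    __global__ void divergentScale(float *d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0) {
            d[i] = d[i] * 2.0f;
        } else {
            d[i] = d[i] + 1.0f;
        }
    }

    // Uniform version: the condition is identical for all 32 threads of a warp,
    // so only one path is executed per warp and no processors are disabled.
    __global__ void uniformScale(float *d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0) {
            d[i] = d[i] * 2.0f;
        } else {
            d[i] = d[i] + 1.0f;
        }
    }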

  16. Idle Processors • Idle CPU processors can be dynamically rescheduled by the OS • SIMD processors are not actually idle • All processors scheduled are following identical execution paths • Disabled (idle) processors are unavailable for other work and cannot be rescheduled • Effective utilization of processors is the programmer’s responsibility • Scheduling is an art, not necessarily a science • Techniques will vary from chip to chip

  17. Conditional Processing … if (condition) { … } else { … } …

  18. Nested Conditional Processing … if (condition) { if (condition2) { … } else { … } } else { … } …

  19. Loop Processing … while (condition) { if (cond2) { … } } …

  20. The Cost of Memory Access • Registers are extremely fast, but are a limited resource • Cached memories also tend to be small • For large data sets, global memory provides read & write access • Accesses take between 600 and 800 clock cycles • Accesses are *not* cached • To hide memory latency, the hardware provides fast context switches when memory is accessed • However, there must be enough computational work to hide the high cost of memory access • Programmers need to be smart • Compilers often don’t provide the necessary optimizations when optimizing for speed instead of code size • It can sometimes be cheaper to recompute a result than to perform a memory read/write
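
As a hedged illustration of the last point (ours; the window-function example and names are hypothetical), recomputing a value in registers can beat fetching a precomputed copy from uncached global memory:

    // Lookup version: every thread pays an extra uncached global-memory read.
    __global__ void applyWindowLookup(float *data, const float *sinTable, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] *= sinTable[i];                     // 600-800 cycle access
        }
    }

    // Recompute version: a few ALU instructions in registers replace the read.
    __global__ void applyWindowRecompute(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] *= __sinf(3.14159265f * i / n);     // recomputed on the fly
        }
    }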

  21. Conditional Processing Two placements of the same memory access, before the conditional versus inside each branch: … float a = someVar; if (condition) { … } else { … } … versus ... if (condition) { ... float a = someVar; ... } else { ... float a = someVar; ... } ... [The original slide annotates each memory access as an "Access & Swap" (access plus context switch) point]

  22. Floating Point • GPUs are optimized for 32-bit accesses • 64-bit double-precision values are fetched from memory as two 32-bit quantities • May impact performance in the event of memory bank conflicts • One double-precision unit per multiprocessor (source: http://www.ddj.com/hpc-high-performance-computing/210102115)

  23. OpenCL vs CUDA • CUDA uses early code binding • Code is compiled together with normal C/C++/FORTRAN (beta) source code • Need the CUDA occupancy calculator to determine the number of threads based on resource utilization • Library support: BLAS & FFT & DPT • OpenCL • Late binding of OpenCL code to the executable • OpenCL compiler/linker embedded within the application • No need for the CUDA occupancy calculator • Only supports C • No libraries

  24. CUDA Occupancy Calculator

  25. OpenCL vs CUDA

  26. General Program Structure • Initialize GPU • Create GPU context • Build GPU program • Allocate GPU memory • Transfer data from CPU to GPU • Invoke GPU functions • Transfer data from GPU to CPU • Deallocate GPU memory • Finalize GPU usage
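
A minimal CUDA host-side sketch of this sequence (ours; the kernel and sizes are placeholders — the CUDA runtime creates the context implicitly, as slide 27 notes):

    #include <cuda_runtime.h>
    #include <vector>

    __global__ void doubleAll(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1024;
        std::vector<float> host(n, 1.0f);

        cudaSetDevice(0);                                  // initialize GPU (context created implicitly)

        float *dev = NULL;
        cudaMalloc(&dev, n * sizeof(float));               // allocate GPU memory

        cudaMemcpy(dev, host.data(), n * sizeof(float),    // transfer data from CPU to GPU
                   cudaMemcpyHostToDevice);

        doubleAll<<<(n + 255) / 256, 256>>>(dev, n);       // invoke GPU function

        cudaMemcpy(host.data(), dev, n * sizeof(float),    // transfer data from GPU to CPU
                   cudaMemcpyDeviceToHost);

        cudaFree(dev);                                     // deallocate GPU memory
        cudaThreadExit();                                  // finalize GPU usage
        return 0;
    }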

  27. Create GPU Context • CUDA • Context creation is implicit in single-threaded programs • Multiple contexts can be explicitly created • Each thread maintains a context stack; the top context is the current context • Threads • Contexts can be swapped between threads • A thread can only have one context active at a time (stack) • A context cannot be shared simultaneously between threads • OpenCL • All commands are explicitly associated with a context • Must create a command queue to invoke work on the device

  28. Initialize GPU

CUDA::CUDA(int Device) : Base()
{
  mValid = false;
  int DeviceCount;
  cudaGetDeviceCount( &DeviceCount );
  if (!DeviceCount) {
    return;
  }
  Device = Device == -1 ? DeviceCount - 1 : Device;
  cudaSetDevice( Device );
  mValid = true;
}

• CUDA
• cudaGetDeviceCount()
• cudaSetDevice()
• cudaGetDeviceProperties()

  29. Initialize GPU

OpenCL::OpenCL(int Device) : Base()
{
  init();  // Initialize class pointers to NULL
  cl_int RC;
  mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
  size_t Bytes;
  RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
  int NumDevices = Bytes / sizeof( cl_device_id );
  cl_device_id *Devices = new cl_device_id[ NumDevices ];
  RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
  mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
  size_t MaxWorkItemSizes[ 256 ];
  RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                        sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
  mMaxWorkItems = MaxWorkItemSizes[ 0 ];
  mMaxWorkItemsMask = ~(mMaxWorkItems - 1);

• OpenCL
• A context must be built before anything can be done on the GPU
• All commands are with respect to a given context

  30. Build GPU Program • CUDA • GPU code is compiled using the nvcc compiler • Object code is statically bound to the CPU executable • GPU code is intrinsically part of the program • Mapping of the problem to threads is performed at compile time

  31. Build GPU Program

// Continued from the constructor
char *code = shrFindFilePath( "code.cl", "." );
size_t CodeLength = 0;
char *Source = oclLoadProgSource( code, "", &CodeLength );
const char *SourceCode = Source;
mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode, &CodeLength, &RC );
RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
std::free( code );
std::free( Source );
mValid = RC == CL_SUCCESS;
}

• OpenCL
• GPU code is bound at runtime to the GPU
• The OpenCL compiler is part of the executable
• Code can be source code or object code
• Source code can be dynamically generated by the program
• Can be stored in an external file

  32. Allocate/Deallocate GPU Memory

void *CUDA::malloc(size_t Bytes)
{
  void *Memory;
  cudaError_t RC = cudaMalloc( &Memory, Bytes );
  return( RC == cudaSuccess ? Memory : NULL );
}

void CUDA::free(void *Memory)
{
  if (Memory) {
    cudaFree( Memory );
  }
}

• CUDA
• Most frequently used allocator: cudaMalloc()
• Returns a memory pointer to GPU memory
• The memory pointer cannot be used by the CPU directly
• It is passed to GPU calls

  33. Allocate/Deallocate GPU Memory

void *OpenCL::malloc(size_t NumBytes)
{
  size_t Size = (NumBytes + 31) & ~size_t( 31 );  // round up to a multiple of 32 bytes
  cl_int RC;
  cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE, Size, NULL, &RC );
  return( RC == CL_SUCCESS ? Memory : NULL );
}

void OpenCL::free(void *Memory)
{
  if (Memory) {
    cl_mem Ptr = reinterpret_cast<cl_mem>( Memory );
    clReleaseMemObject( Ptr );
  }
}

• OpenCL
• Like all things, memory allocation is explicitly performed within a context

  34. CPU/GPU Data Transfer • Data is moved across the PCIe bus • CUDA • Data transfer is accomplished via the cudaMemcpy() routine • Implicit synchronization point • Non-blocking copies are available • Direction is determined by an enumeration • cudaMemcpyHostToDevice • cudaMemcpyDeviceToHost • Allocated memory can be bound to texture memory • cudaBindTexture • OpenCL • Memory transfer via clEnqueueWriteBuffer() and clEnqueueReadBuffer() • Synchronization controlled by parameters to the calls • Default is non-blocking
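
The CUDA side of this, in a small self-contained sketch of our own (array sizes and names are arbitrary), showing both copy directions and the optional legacy texture binding mentioned above:

    #include <cuda_runtime.h>

    // Legacy texture reference, declared at file scope as the old API requires.
    texture<float, 1, cudaReadModeElementType> texIn;

    __global__ void copyViaTexture(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tex1Dfetch(texIn, i);          // read through the texture cache
    }

    int main()
    {
        const int n = 4096;
        const size_t bytes = n * sizeof(float);
        float hIn[4096], hOut[4096];
        for (int i = 0; i < n; ++i) hIn[i] = float(i);

        float *dIn, *dOut;
        cudaMalloc(&dIn, bytes);
        cudaMalloc(&dOut, bytes);

        // Direction is selected by the enumeration; this call blocks by default.
        cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

        // Optionally bind the allocation to texture memory.
        cudaBindTexture(NULL, texIn, dIn, bytes);

        copyViaTexture<<<(n + 255) / 256, 256>>>(dOut, n);

        cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

        cudaUnbindTexture(texIn);
        cudaFree(dIn);
        cudaFree(dOut);
        return 0;
    }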

  35. Call GPU Functions (Kernels) • Functions on the CPU are executed when invoked • GPU function calls from the CPU are placed on an execution queue • The CPU does not wait until the GPU function completes – the command is simply queued • The GPU executes commands on the queue using its own ordering • Synchronization points cause the CPU to stall and wait for the GPU to return • CUDA • cudaThreadSynchronize()

  36. GPU Function Calls • GPU function calls have an associated dimensionality (which can be 1D, 2D or 3D) • CUDA • Extends the language syntax to include the problem dimension • Syntax: function<<<dimGrid, dimBlock>>>( arguments ); • OpenCL • Must explicitly put function arguments into the context via clSetKernelArg() • Invoke the kernel using the context • The kernel retrieves arguments from the context automatically

  37. GPU Cleanup/Termination • CUDA • Manages most cleanup operations automatically as a context is destroyed • OpenCL • Provides low-level APIs for deallocation of all resources • Invoked in the order opposite to invocation: clReleaseKernel(), clReleaseProgram(), clReleaseCommandQueue(), clReleaseContext()

  38. Thread Batching: Grids and Blocks • A kernel is executed as a grid of thread blocks (aka blocks) • A thread block is a batch of threads that can cooperate with each other by: • Synchronizing their execution (diverging execution results in performance loss) • Efficiently sharing data through a low-latency shared memory • Two threads from two different blocks cannot cooperate [Figure: host launching Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of blocks, and each block contains a 2D arrangement of threads — Source: "NVIDIA CUDA Programming Guide" version 1.1]

  39. Block and Thread IDs • Threads and blocks have IDs, so each thread can identify what data it will operate on • Block ID: 1D or 2D • Thread ID: 1D, 2D, or 3D • Simplifies memory addressing when processing multidimensional data • Image processing • Solving PDEs on volumes • Other problems with underlying 1D, 2D or 3D geometry [Figure: a 2D grid of blocks, with one block expanded into its 2D arrangement of threads — Source: "NVIDIA CUDA Programming Guide" version 1.1]
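
A small sketch of our own (the image-brightening kernel is purely illustrative) of how block and thread IDs map to 2D coordinates:

    // Each thread derives a 2D pixel coordinate from its block ID and thread ID.
    __global__ void brighten(float *image, int width, int height, float delta)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            image[y * width + x] += delta;   // row-major pixel address
        }
    }

    // Host side: a 16x16 block of threads and a 2D grid that covers the image.
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // brighten<<<grid, block>>>(dImage, width, height, 0.1f);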

  40. GPU Kernels • Each function is passed data to create a unique ID • The data typically specifies the “spatial coordinates” of the executing thread within the hardware • The ID is used to coordinate data access • Ensures that two threads’ accesses do not collide • CUDA function types • __global__ • Callable by the CPU • Cannot be called by the GPU • __device__ • Callable by other GPU functions • Cannot be called by the CPU • CUDA expands these as inline functions via nvcc • Adds to function resource utilization
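
For instance (our sketch, with hypothetical names): a __device__ helper callable only from GPU code, used by a __global__ kernel launched from the CPU:

    // __device__ function: callable only from GPU code; expanded inline by nvcc.
    __device__ float squared(float x)
    {
        return x * x;
    }

    // __global__ function: callable from the CPU, executed on the GPU.
    __global__ void squareAll(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique ID from the thread's coordinates
        if (i < n) {
            out[i] = squared(in[i]);                     // GPU-to-GPU call
        }
    }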

  41. OpenCL Kernel Invocation

Use C++ templates to simplify argument handling:

template<typename T>
inline cl_int setArg(cl_kernel Kernel, unsigned Pos, T Arg)
{
  return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
}

template<>
inline cl_int setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
{
  // This routine, unlike the others, sets up shared memory by passing
  // NULL in as the pointer to the variable.
  return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
}

template<>
inline cl_int setArg(cl_kernel Kernel, unsigned Pos, int Arg)
{
  cl_int ArgInt = Arg;
  return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
}

template<>
inline cl_int setArg(cl_kernel Kernel, unsigned Pos, float Arg)
{
  cl_float ArgFloat = Arg;
  return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
}
...
template<typename T0>
inline cl_int setArgs(cl_kernel Kernel, T0 Arg0)
{
  return( setArg( Kernel, 0, Arg0 ) );
}

template<typename T0, typename T1>
inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
{
  return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );
}

template<typename T0, typename T1, typename T2>
inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
{
  return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) | setArg( Kernel, 2, Arg2 ) );
}
...

  42. OpenCL Kernel Invocation

void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
{
  if (!checkBLASKernel( &mSswapKernel, "Sswap" )) {
    return;
  }
  mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
  if (x && y) {
    if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS) {
      executeBLASKernel( mSswapKernel, n );
    }
  }
}

• BLAS-like example
• CUDA provides a BLAS library; OpenCL doesn’t
• Must write your own BLAS routines in OpenCL to port between the two easily
• The swap() function swaps the contents of 2 vectors with differing vector strides

  43. OpenCL Kernel Invocation

BLAS support functions:

bool OpenCL::checkBLASKernel(cl_kernel *Kernel, const char *KernelName)
{
  if (!mValid) {
    mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
    return( false );
  }
  if (!(*Kernel)) {
    cl_int RC;
    *Kernel = clCreateKernel( mProgram, KernelName, &RC );
    if (RC != CL_SUCCESS) {
      mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
      return( false );
    }
  }
  return( true );
}

inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
{
  size_t Size = n;
  size_t GlobalWorkSize = Size & mMaxWorkItemsMask;
  if (Size & ~mMaxWorkItemsMask) {
    GlobalWorkSize += mMaxWorkItems;
  }
  cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL, &GlobalWorkSize,
                                      &mMaxWorkItems, 0, NULL, NULL );
  clFinish( mCommandQueue );
  mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS : Base::BLAS_EXECUTION_FAILED;
}

  44. OpenCL Kernels

BLAS SSWAP example (see http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf); note that scalar arguments are passed by value and must not carry the __global qualifier:

__kernel void Sswap(int n, __global float *x, int incx,
                    __global float *y, int incy)
{
  const unsigned GID = get_global_id( 0 );
  if (GID < n) {
    int lx = (incx >= 0) ? 0 : ((1 - n) * incx);
    int ly = (incy >= 0) ? 0 : ((1 - n) * incy);
    float temp = y[ ly + GID * incy ];
    y[ ly + GID * incy ] = x[ lx + GID * incx ];
    x[ lx + GID * incx ] = temp;
  }
}

  45. CUDA Kernels

CPU:

#include "kernel.cu"
...
{
  const unsigned int size_x = 256;
  const unsigned int size_y = 4096;
  ...
  dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
  dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
  transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
  cudaThreadSynchronize();
  ...
}

GPU (kernel.cu):

#define BLOCK_DIM 16

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
  unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
  if (xIndex < width && yIndex < height) {
    unsigned int index_in = xIndex + width * yIndex;
    unsigned int index_out = yIndex + height * xIndex;
    odata[index_out] = idata[index_in];
  }
}

  46. Outline • GPU Architecture Overview • GPU Programming • Algorithm Acceleration Guidelines – Streams and pinned memory – Thread scheduling – Parallel reduction – Program partitioning – Simultaneous graphics and algorithm processing • Case Studies • Conclusion • Q&A

  47. Streams • A stream is a sequence of commands that execute serially • Streams allow overlapping of memory transfers and kernel computations from different streams • Hides the data transfer cost • Implementable on CUDA devices with compute capability ≥ 1.1 • Host memory must be of type ‘pinned’ [Figure: timeline comparing serial execution with two-stream execution, in which the H→D transfers, kernel computation, and D→H transfers for Data1 and Data2 overlap]

  48. Pinned Memory • Memory on the host that is mapped to the device’s address space and thus accessible directly by a kernel • Has several advantages • There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel • Bandwidth between host and device memories is higher • Write-combining memory • A type of pinned memory where individual writes are aggregated into a larger write operation • Avoids internal L1/L2 cache writes, making more cache available to the rest of the application • Is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
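
A sketch of our own (with arbitrary sizes) combining the two ideas: pinned host memory allocated with cudaMallocHost enables asynchronous copies, and two streams let the transfer of one half of the data overlap with kernel execution on the other half:

    #include <cuda_runtime.h>

    __global__ void scale(float *d, float f, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= f;
    }

    int main()
    {
        const int n = 1 << 20;
        const int half = n / 2;
        const size_t halfBytes = half * sizeof(float);

        // Pinned (page-locked) host memory, required for asynchronous copies.
        float *hData;
        cudaMallocHost((void **)&hData, n * sizeof(float));
        for (int i = 0; i < n; ++i) hData[i] = 1.0f;

        float *dData;
        cudaMalloc(&dData, n * sizeof(float));

        cudaStream_t streams[2];
        cudaStreamCreate(&streams[0]);
        cudaStreamCreate(&streams[1]);

        // Each stream transfers and processes its half of the data; copies in one
        // stream overlap with kernel execution in the other (compute capability >= 1.1).
        for (int s = 0; s < 2; ++s) {
            float *hChunk = hData + s * half;
            float *dChunk = dData + s * half;
            cudaMemcpyAsync(dChunk, hChunk, halfBytes, cudaMemcpyHostToDevice, streams[s]);
            scale<<<(half + 255) / 256, 256, 0, streams[s]>>>(dChunk, 2.0f, half);
            cudaMemcpyAsync(hChunk, dChunk, halfBytes, cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaThreadSynchronize();     // wait for both streams to drain

        cudaStreamDestroy(streams[0]);
        cudaStreamDestroy(streams[1]);
        cudaFree(dData);
        cudaFreeHost(hData);
        return 0;
    }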

  49. Threads and Scheduling in GPU • The GPU consists of “multiprocessors”, each of which has many processors • A kernel is executed as a grid of blocks • A thread block is a batch of threads that cooperate with each other by: • Synchronizing their execution (diverging execution results in performance loss) • Efficiently sharing data through a low-latency shared memory • All threads of a block reside on the same multiprocessor (max 1024/MP) • The number of blocks a multiprocessor can process at once depends on the register and shared memory usage per thread [Figure: host launching Kernel 1 and Kernel 2 as grids of blocks, with one block expanded into its threads — Source: "NVIDIA CUDA Programming Guide" version 1.1]

  50. Threads and Scheduling in GPU (contd.) • Before execution, a block is split into warps • A warp is a set of 32 threads which execute the same instruction on an MP • A half-warp is either the first 16 or the second 16 threads of a warp • Full efficiency is realized when all 16 threads of a half-warp agree on their execution path • Branch divergence occurs if threads of a half-warp diverge via a data-dependent conditional branch • The half-warp serially executes each branch path taken, ignoring the results from threads that are not on that path • This increases kernel execution time • Warps of the same block are executed in a time-sliced fashion
