ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013 GPUMemories

GPU Memories These notes will introduce: • The basic memory hierarchy in the NVIDIA GPU • global memory, shared memory, register file, constant memory • How to declare variables for each memory • Memory coalescing • Cache memory and making most effective in program ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013 GPUMemories.ppt

Host-Device Connection Host (CPU) Device (GPU) Memory bus limited by memory and processor-memory connection bandwidth Hypertransport and Intel’s Quickpath currently 25.6 GB/s GPU bus C2050 1030.4 GB/s GTX 280 141.7 GB/s PCIe x16 4 GB/s PCIe x16 Gen2 8 GB/s peak Device Global Memory GDDR5 230 GB/s Host Memory DDR 400 3.2 GB/s Note transferring between host and GPU much slower that between device and global memory Hence need to minimize host-device transfers GPU on a laptop such as Mac pro may share the system memory.

GPU Memory Hierarchy Global memory is off-chip on the GPU card. Even though global memory an order of magnitude faster than CPU memory, still relatively slow and a bottleneck for performance GPU provided with faster on-chip memory although data has to be transferred explicitly into shared memory – Pointers created with cudaMalloc() point to global memory. Two principal levels on-chip: shared memory and registers

Scope of global memory, shared memory, and registers Host Grid Block Threads Registers Shared memory Local memory Host memory Global memory Constant memory For storing global constants see later. Also a read-only global memory called texture memory.

Currently can only transfer data from host to global (and constant memory) and not host directly to shared. Constant memory used for data that does not change (i.e. read-only by GPU) Shared memory is said to provide up to 15 x speed of global memory Register similar speed to shared memory if reading same address or no bank conflicts.

Lifetimes Global/constant memory –- lifetime of application Shared memory -– lifetime of a kernel Registers –- lifetime of a kernel Scope Global/constant memory –- Grid Shared memory –- Block Registers –- Thread

Declaring program variables for registers, shared memory and global memory Memory Declaration Scope Lifetime Registers Automatic variables* Thread Kernel other than arrays Local Automatic array variables Thread Kernel Shared __shared__ Block Kernel Global __device__ Grid Application Constant __constant__ Grid Application *Automatic variables allocated automatically when entering scope of variable and de-allocated when leaving scope. In C, all variables declared within a block are “automatic” by default, see http://en.wikipedia.org/wiki/Automatic_variable

Global Memory __device__ #include <stdio.h> #include <stdlib.h> #include <cuda.h> #define N 1000 … __device__ int A[N]; __global__ kernel() { int tid = blockIdx.x * blockDim.x + threadIdx.x; A[tid] = … … } main { … } For data available to all threads in device. Declared outside function bodies Scope of Grid and lifetime of application

Issues with using Global memory • Long delays, slow • Access congestion • Cannot synchronize accesses • Need to ensure no conflicts of accesses between threads

Shared Memory Shared memory is on the GPU chip and very fast Separate data available to all threads in one block. Declared inside function bodies Scope of block and lifetime of kernel call So each block would have its own array A[N] #include <stdio.h> #include <stdlib.h> #include <cuda.h> #define N 1000 … __global__ kernel() { __shared__ int A[N]; int tid = threadIdx.x; A[tid] = … … } main { … }

Transferring data to shared memory int A[N][N]; //to be copied into device from host with cudamalloc __global__ void myKernel (int *A_global) { __shared__ int A_sh[n][n]; // declare shared memory int row = … int col = … A_sh[i][j] = *A_global[row + col*N]; //copy from global to shared … } main () { … cudaMalloc((void**)dev_ A, size); // allocate global memory cudoMemcpy(dev_A, A, size, cudaMemcpyHostTo Device); //copy to global memory myKernel<<G,B>>(dev_A) … }

Issues with Shared Memory Shared memory is not immediately synchronized after access. Usually it is the writes that matter. Use __syncthreads()before you read data that has been altered. Shared memory is very limited (Fermi has up to 48KB per GPU core, NOT per block) Hence may have to divide your data into “chunks”

Example uses of shared data Where the data can be divided into independent parts: Image processing - Image can be divided into blocks and placed into shared memory for processing Block matrix multiplication • Sub-matrices can be stored in shared memory • (Slides to follow on this)

Registers Compiler will place variables declared in kernel in registers when possible Limit to the number of registers Fermi has 32768 32-bit registers Registers divided across “warps” (group of 32 threads that will operate in the SIMT mode) and have the lifetime of the warps __global__ kernel() { int x, y, z; … }

Arrays declared within kernel (Automatic array variables) Generally stored in global memory but private copy made for each thread.* Can be as slow access as global memory, except cached, see later If array indexed with a constant value, compiler may use registers __global__ kernel() { int A[10]; … } * Global “local” memory, see later

Constant Memory __constant__ #include <stdio.h> #include <stdlib.h> #include <cuda.h> … __constant__ int n; __global__ kernel() { … } main { n = … … } For data not altered by device. Although stored in global memory, cached and has fast access Declared outside function bodies Scope of grid and lifetime of application Size currently limited to 65536 bytes

Local memory Resides in device memory space (global memory) and is slow except that organized such that consecutive 32-bit words accessed by consecutive threadIDs for best coalesced accesses when possible. For compute capability 2.x, cached in L1 and L2 caches on-chip Used to hold arrays if not indexed with a constant value and for variables when there are no more register available for them

Cache memory More recent GPUs have L1 and L2 (data) cache memory, but apparently without cache coherence so up to the programmer to ensure that. Make sure each thread accesses different locations Ideally arrange accesses to be in same cache lines Compute capability 1.3 Tesla’s do not have cache memory Compute capability 2.0 Fermi’s have L1/L2 caches

Fermi Caches Streaming processors (SM) Streaming processors (SM’s) Register file L2 cache L1 cache/ shared memory

Fermi Cache Sizes L2 • Unified 384kB L2 cache for all SM’s • 384-bit memory bus from device memory to L2 cache • Up to 160 GB/s bandwidth • 128 bytes cache line (32 32-bit integers or floats, or 16 doubles) L1 • Each SM has 16kB or 48kB of L1 cache (64kB split 16/48 or 48/16 between L1 cache and shared memory) • No global cache coherency!

Poor Performance from Poor Data Layout __global__ void kernel(int *A) { int i = threadIdx.x + blockDim.x*blockIdx.x; A[1000*i] = … } Very Bad! Each thread accesses a location on a different line. Fermi line size is 32 integers or floats

Taking Advantage of Cache __global__ void kernel(int *A) { int i = threadIdx.x + blockDim.x*blockIdx.x; A[i] = … } Good! Groups of 32 accesses by consecutive threads on same line. Threads will be in same warp Fermi line size is 32 integers or floats

Warp A “warp’ in CUDA is a group of 32 threads that will operate in the SIMT mode A “half warp” (16 threads) actually execute simultaneously (current GPUs) Using knowledge of warps and how the memory is laid out can improve code performance

Memory Banks Device (GPU) A[0] A[1] A[2] A[3] Memory 1 Memory 2 Memory 3 Memory 4 Consecutive locations on successive memory banks Device can fetch A[0], A[1], A[2], A[3] … A[B-1] at the same time, where there are B banks.

Shared Memory Banks Shared memory divided into 16 or 32 banks of 32-bit width. Banks can be accessed simultaneously Compute cap. 1.x has 16 banks accesses processed per half warp Compute cap. 2.x and 3.0/3.5 has 32 banks accesses processed per warp Banks can be accessed simultaneously To achieve maximum bandwidth, threads in a half warp should access different banks of shared memory Exception: all threads read the same location which results in a broadcast operation *coit-grid06 and coit-grid07 C2050 compute capability 2.0 has 32 banks)

Global memory banks Global memory is also partitioned into banks depending upon the version of the GPU 200 series and 10 series NVIDIA GPUs have 8 partitions of 256 bytes wide C2050 has ??

Achieving best data access patterns Requires a lot of thought – will consider in detail for specific problems Generally Padding data to make data aligned For matrix operations Tiling Pre-transpose operations Padding – adding columns/rows

Memory Coalescing Aligned memory accesses Threads can read 4, 8, or 16 bytes at a time from global memory but only if accesses are aligned. That is: A 4-byte read must start at address …xxxxx00 A 8 byte read must start at address …xxxx000 A 16 byte read must start at address …xxx0000 Then access is much faster (twice?)

Ideally try to arrange for threads to access different memory modules at the same time, and consecutive addresses A bad case would be: • Thread 0 to access A[0], A[2], ... A[15] • Thread 1 to access A[16], A[17], ... A[31] • Thread 2 to access A[32], A[33], ... A[63] … etc. Good case would be • Thread 0 to access A[0], A[16], ... A[31] • Thread 1 to access A[1], A[17], ... A[32] • Thread 2 to access A[2], A[18], ... A[33] … etc. if there are 16 banks. Need to know that detail! Time

Memory coalescing and cache memories Comp cap 2.x onwards have data caches Accessing locations in cache will still be in blocks of consecutive locations and proper data layout that allows memory coalescing will be advantageous Unfortunately need to know the detailed physical arrangements in global memory and cache to gain maximum benefit. Different comp capability devices have different constraints for memory coalescing, see NVIDIA documentation for more information, see http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-3-0

C2050 coit-grid06 and coit-grid7 Compute capability 2.0 To be installed: K20 coit-grid08 compute capability 3.5 Wikipedia “ CUDA” http://en.wikipedia.org/wiki/CUDA Jan 21, 2013

Some notes on NVIDIA New Tesla K20 GPU card Released late 2012. Uses GK110 chip 7.1 billion transistors! “Big-die” GPU “Kepler” architecture 64KB shared memory/L1 cache 48 KB uniform (?) cache Up to 1.5 MB L2 cache K20 card has 5 GB global memory 320-bit GDDR5 225 watt Sources http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/3

Kepler compute architecture • Includes: • SMX (streaming multiprocessor) design that delivers up to 3x more performance per watt compared to the SM in Fermi. It also delivers 1 petaflop of computing in just 10 server racks. • Dynamic Parallelism capability that enables GPU threads to automatically spawn new threads. By adapting to the data without going back to the CPU, it greatly simplifies parallel programming and enables GPU acceleration of a broader set of popular algorithms, like adaptive mesh refinement (AMR), fast multipole method (FMM), and multigrid methods. • Hyper-Q feature that enables multiple CPU cores to simultaneously utilize the CUDA cores on a single Kepler GPU, dramatically increasing GPU utilization, slashing CPU idle times, and advancing programmability. Ideal for cluster applications that use MPI. Source: http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

Questions

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013 GPUMemories

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013 GPUMemories

Presentation Transcript

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 1, 2011 SharedMem.ppt

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 Synchronization.ppt

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt

RENCI@UNC Charlotte

UNC Charlotte

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 4, 2013

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25 , 2011 DeviceRoutines.pptx

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 10, 2011 Atomicsx

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 4, 2013 Streamsx

CUDA Lecture 4 CUDA Programming Basics

Jan. 22, 2013

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 3, 2011 ConstantMemTiming

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013 SharedMem

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 6, 2011 MatrixMult

ITCS 4/5010 Parallel Programming, B. Wilkinson, Jan 21, 2013. CUDADeviceRoutines

ITCS 4/5145 Parallel Computing, UNC-Charlotte, B. Wilkinson, Jan 16, 2013.

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 5, 2011, 3-DBlocks

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 13A, 2013, OpenACC

ITCS 4010 Grid Computing, 2005, UNC-Charlotte, B. Wilkinson.

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 3, 2011

UNC Charlotte

ITCS 4/5145GPU Programming, UNC-Charlotte, B. Wilkinson, April 4, 2013 CUDAProgModel