
CUDA Programming Model Overview Memory Hierarchy and Optimization


Presentation Transcript


  1. CUDA Programming Model Overview: Memory Hierarchy and Optimization
     Yukai Hung  a0934147@gmail.com
     Department of Mathematics, National Taiwan University

  2. Warp Occupancy

  3. Warp Occupancy
  • Warp occupancy
  • - the number of warps running on a multiprocessor concurrently
  • - limited by register and shared memory resource usage
  • Warp occupancy != performance
  • - increasing occupancy does not necessarily increase performance
  • - but low warp occupancy cannot adequately hide memory latency
  • - occupancy is very important for memory-bandwidth-bound applications
  [Figure: a shader multiprocessor running warps 0-3 of block 0 and block 1 concurrently]

  4. Warp Occupancy
  • Warp occupancy and performance
  • - each thread reads, multiplies, and writes one element

  5. Warp Occupancy
  • Registers limitation
  • - there are 8192 registers on each multiprocessor in the GeForce 8 series (G80)
  • - the register file size is a hardware implementation decision, not a software one
  [Figure: the register file partitioned across blocks; a small per-block register usage fits 5 blocks, a larger usage only 2]

  6. Warp Occupancy
  • Registers limitation
  • - there are 8192 registers on each multiprocessor in the GeForce 8 series (G80)
  • - the register file size is a hardware implementation decision, not a software one
  • - registers are dynamically partitioned and assigned across all resident blocks
  • - once assigned, the registers are NOT accessible by other blocks
  • - each thread in a thread block can only access its own registers
  [Figure: the register file partitioned across blocks; a small per-block register usage fits 5 blocks, a larger usage only 2]

  7. Warp Occupancy
  • How many threads can run on each shader multiprocessor?
  • - suppose the block size is 16x16 and each thread uses 10 registers
  • - each thread block requires (16x16)x10 = 2560 registers
  • - 8192 registers is larger than 2560 registers x 3 thread blocks
  • - three thread blocks can run on each shader multiprocessor
  • How about if each thread increases its register usage by one?
  • - each thread block now requires (16x16)x11 = 2816 registers
  • - 8192 registers is smaller than 2816 registers x 3 thread blocks
  • - only two thread blocks can run on each shader multiprocessor
  • This causes a 1/3 reduction of thread-level parallelism (TLP)!!
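
  The arithmetic above can be reproduced with a few lines of host code. This is a minimal sketch, assuming the G80 numbers quoted on this slide (8192 registers per multiprocessor, 16x16 blocks) and ignoring the other limits such as shared memory and the maximum resident block count:

  //reproduces the register-limit arithmetic from the slide above
  #include <stdio.h>

  int main(void)
  {
      const int registersPerSM  = 8192;     //G80 register file per multiprocessor
      const int threadsPerBlock = 16 * 16;  //16x16 thread block

      for (int regsPerThread = 10; regsPerThread <= 11; ++regsPerThread) {
          int regsPerBlock = threadsPerBlock * regsPerThread;
          int blocksPerSM  = registersPerSM / regsPerBlock;   //integer division
          printf("%2d registers/thread -> %4d registers/block -> %d block(s) per SM\n",
                 regsPerThread, regsPerBlock, blocksPerSM);
      }
      return 0;
  }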

  8. Warp Occupancy
  • Dynamic resource partitioning gives more flexible usage
  • - a larger number of threads that each require few registers, or
  • - a smaller number of threads that each require many registers
  • - this allows finer-grained threading than the traditional threading model
  • Programming tradeoff between two parallel models (sketched below)
  • - thread-level parallelism by using more threads
  • - instruction-level parallelism by using more registers per thread
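
  Neither kernel below appears on the slides; they are a minimal sketch of the tradeoff, assuming the input length n is a multiple of the total number of launched threads (times four for the ILP version). The names and the scale-by-two operation are illustrative only.

  //thread-level parallelism: many light threads, one element per thread
  __global__ void scaleTLP(const float* in, float* out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = 2.0f * in[i];
  }

  //instruction-level parallelism: one quarter as many threads, each holding
  //four independent values in registers so loads and arithmetic can overlap
  __global__ void scaleILP(const float* in, float* out, int n)
  {
      int i      = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = gridDim.x * blockDim.x;   //assumes n is a multiple of 4*stride

      float a = in[i];                       //four independent loads in flight
      float b = in[i + stride];
      float c = in[i + 2 * stride];
      float d = in[i + 3 * stride];

      out[i]              = 2.0f * a;
      out[i + stride]     = 2.0f * b;
      out[i + 2 * stride] = 2.0f * c;
      out[i + 3 * stride] = 2.0f * d;
  }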

  9. Warp Occupancy
  • Shared memory limitation
  • - there is 16KB of shared memory on each multiprocessor in the GeForce 8 series (G80)
  [Figure: shared memory partitioned across blocks; a small per-block usage fits 4 blocks, a larger usage only 3]

  10. Warp Occupancy
  • Take the minimum of the two resource limitations
  [Figure: the register limit allows 4 blocks and the shared memory limit allows 2 blocks, so only 2 blocks run per multiprocessor]

  11. Warp Occupancy
  • There are some restrictions on each multiprocessor
  • Example for compute capability 1.3
  • - the number of registers is 16384 (32-bit registers)
  • - the amount of shared memory is 16KB
  • - the maximum number of active blocks is 8
  • - the maximum number of active warps is 32
  • - the maximum number of active threads is 1024
  • These limits can also be queried at run time (see the sketch below)
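
  The per-device limits can be read from the CUDA runtime instead of being hard-coded; a minimal query sketch using cudaGetDeviceProperties (only a subset of the available fields is printed):

  #include <cstdio>
  #include <cuda_runtime.h>

  int main(void)
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);   //query device 0

      printf("compute capability      : %d.%d\n", prop.major, prop.minor);
      printf("registers per block     : %d\n", prop.regsPerBlock);
      printf("shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
      printf("warp size               : %d\n", prop.warpSize);
      printf("max threads per block   : %d\n", prop.maxThreadsPerBlock);
      return 0;
  }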

  12. Warp Occupancy
  • Measuring warp occupancy per multiprocessor
  • - the latest visual profiler can count warp occupancy
  • - requires a device with compute capability higher than 1.0

  13. Global Memory Optimization

  14. Global Memory Optimization
  • Possible global memory bus transaction sizes
  • - 32, 64, or 128 bytes for each transaction
  • - the transaction segment must be linearly aligned
  • - its first address must be a multiple of the segment size
  • Hardware coalescing for each half-warp of threads
  • - memory accesses are handled per half-warp
  • - the hardware tries to use the smallest possible number of transactions
  • - it reduces the transaction size where possible to save memory bandwidth

  15. Global Memory Optimization
  [Figure: threads 0-15 of a half-warp mapped onto consecutive global memory addresses, showing how the accesses fall into 32-byte, 64-byte, and 128-byte segments]

  16. Global Memory Optimization
  • Compute capability 1.0 and 1.1
  • - a half-warp of threads must access a single aligned segment
  • - the half-warp must issue addresses in sequence
  • - the access splits into 16 transactions when these conditions are not satisfied
  • Compute capability 1.2 and 1.3
  • - sequential addresses within a half-warp are no longer required
  • - performance degrades gracefully when a half-warp issues addresses
  •   that fall into multiple memory segments

  17. Global Memory Optimization
  • Coalescing example
  • - effective bandwidth of small kernels that copy data
  • - effect of offset and stride on memory performance
  • Comparison hardware
  • - GeForce GTX 280 versus FX 5600
  • - compute capability 1.3 versus 1.0
  • - memory bandwidth 141GB/s versus 77GB/s

  18. Global Memory Optimization

  __global__ void offsetCopy(float* idata,float* odata,int offset)
  {
      //compute thread global index
      int index=blockIdx.x*blockDim.x+threadIdx.x;

      //copy data from global memory
      odata[index]=idata[index+offset];
  }
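
  The slides do not show the host side of this experiment. The following is a minimal sketch of how the offset copy above could be timed with CUDA events; the array size, block size, offset range, and bandwidth formula are assumptions, not taken from the slides.

  #include <cstdio>
  #include <cuda_runtime.h>

  int main(void)
  {
      const int n         = 1 << 22;   //number of floats copied per launch
      const int threads   = 256;       //threads per block
      const int maxOffset = 32;

      float *d_in, *d_out;
      cudaMalloc(&d_in,  (n + maxOffset) * sizeof(float));  //room for the offset reads
      cudaMalloc(&d_out, n * sizeof(float));

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      for (int offset = 0; offset <= maxOffset; ++offset) {
          cudaEventRecord(start, 0);
          offsetCopy<<<n / threads, threads>>>(d_in, d_out, offset);
          cudaEventRecord(stop, 0);
          cudaEventSynchronize(stop);

          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);
          //one read and one write of n floats per kernel call
          double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
          printf("offset %2d : %6.2f GB/s\n", offset, gbps);
      }

      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      cudaFree(d_in);
      cudaFree(d_out);
      return 0;
  }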

  19. Global Memory Optimization

  __global__ void strideCopy(float* idata,float* odata,int stride)
  {
      //compute thread global index
      int index=blockIdx.x*blockDim.x+threadIdx.x;

      //copy data from global memory
      odata[index]=idata[index*stride];
  }

  20. Global Memory Optimization
  • Strided memory accesses are inherent in many applications
  • - the stride is generally larger than 18 in many applications
  • How to solve the strided memory problem?
  • - strided accesses can be avoided by using shared memory
  •   (see the matrix transpose example later)

  21. Global Memory Optimization
  • Threads 0-15 access 4-byte data at addresses 116-176
  • - thread 0 is the lowest active thread, accessing address 116
  • - use a 128-byte memory segment: 0-127
  [Figure: threads 0-15 covering addresses 116-176, with the 128-byte segment 0-127 highlighted]

  22. Global Memory Optimization
  • Threads 0-15 access 4-byte data at addresses 116-176
  • - thread 0 is the lowest active thread, accessing address 116
  • - reduce to a 64-byte memory segment: 64-127
  [Figure: the same access, with the segment shrunk to the 64-byte range 64-127]

  23. Global Memory Optimization
  • Threads 0-15 access 4-byte data at addresses 116-176
  • - thread 0 is the lowest active thread, accessing address 116
  • - reduce to a 32-byte memory segment: 96-127
  [Figure: the same access, with the segment shrunk to the 32-byte range 96-127]

  24. Global Memory Optimization
  • Threads 0-15 access 4-byte data at addresses 116-176
  • - thread 3 is the lowest active thread accessing the second segment, at address 128
  • - use a 128-byte memory segment: 128-255
  [Figure: the remaining accesses covered by the 128-byte segment 128-255]

  25. Global Memory Optimization
  • Threads 0-15 access 4-byte data at addresses 116-176
  • - thread 3 is the lowest active thread accessing the second segment, at address 128
  • - reduce to a 64-byte memory segment: 128-191
  [Figure: the remaining accesses covered by the 64-byte segment 128-191]

  26. Global Memory Optimization
  • Threads 0-15 access 4-byte data at addresses 116-176
  • - use two transactions to access the required memory
  • - one 32-byte transaction and one 64-byte transaction
  [Figure: the access served by a 32-byte segment (96-127) and a 64-byte segment (128-191)]

  27. Global Memory Optimization
  • What is global memory coalescing?
  • - a half-warp of threads accesses an aligned segment of global memory
  • - the access results in as few transactions as possible, ideally one or two
  • - the exact rules differ between compute capability 1.0/1.1 and 1.2/1.3
  [Figure: a half-warp of threads accessing a contiguous region of global memory]

  28. Global Memory Optimization
  • Compute capability 1.0 and 1.1
  • - coalescing: 1 transaction
  • - out of sequence: 16 transactions
  • - misaligned: 16 transactions
  [Figure: aligned sequential, out-of-sequence, and misaligned half-warp access patterns]

  29. Global Memory Optimization
  • Compute capability 1.2 and 1.3
  • - 1 transaction: 64-byte segment
  • - 2 transactions: 64-byte and 32-byte segments
  • - 1 transaction: 128-byte segment
  [Figure: the corresponding half-warp access patterns]

  30. Global Memory Optimization

  31. Global Memory Optimization
  • Monitoring memory access information
  • - the latest visual profiler can report memory throughput
  • - requires a device with compute capability higher than 1.2

  32. Shared Memory Optimization

  33. Shared Memory Optimization
  • Shared memory usage conditions
  • - inter-thread communication within a block
  • - caching data to reduce global memory accesses
  • - avoiding non-coalesced global memory accesses
  • Shared memory organization
  • - shared memory is organized into 16 banks, each 4 bytes wide
  •   (the resulting bank mapping is sketched below)
  • Shared memory performance
  • - each bank can deliver 4 bytes in one clock cycle
  • - shared memory accesses are issued per half-warp of threads
  • - 35-44 GB/s for each multiprocessor and about 1.1 TB/s overall
  • - almost as fast as registers if there are no bank conflicts
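
  The bank that a given 4-byte word falls into follows directly from this organization. The helper below is not a CUDA API, just the arithmetic implied by 16 banks of 4-byte words:

  //illustrative arithmetic only: successive 32-bit words fall into successive banks
  __device__ int bankOfWord(int wordIndex)
  {
      return wordIndex % 16;   //bank = word index mod number of banks
  }

  //example: s_data[threadIdx.x] sends thread t of a half-warp to bank t % 16,
  //so threads 0-15 touch 16 different banks and no conflict occurs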

  34. Shared Memory Optimization
  • In a parallel machine
  • - many threads access the memory simultaneously
  • - the memory is divided into several banks to achieve high bandwidth
  • - multiple threads accessing the same bank results in a bank conflict
  [Figure: shared memory divided into independent banks]

  35. Shared Memory Optimization
  • Bank conflict
  • - more than one thread accesses the same shared memory bank
  • - the conflicting accesses are serialized when a bank conflict occurs
  [Figure: two access patterns in which threads 0-15 map onto distinct banks 0-15: no bank conflict in either case]

  36. Shared Memory Optimization
  • Bank conflict
  • - more than one thread accesses the same shared memory bank
  • - the conflicting accesses are serialized when a bank conflict occurs
  •   (see the access-stride fragments below)
  [Figure: stride-2 access produces a 2-way bank conflict; stride-8 access produces an 8-way bank conflict]
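
  A small illustrative kernel, not from the slides, whose shared memory reads reproduce the stride patterns in the figure above (kernel and variable names are assumptions; it expects blockDim.x <= 64 so all indices stay inside the array):

  __global__ void strideExamples(float* out)
  {
      __shared__ float s_data[512];
      int tid = threadIdx.x;

      //fill the array so the reads below are well defined
      for (int k = tid; k < 512; k += blockDim.x)
          s_data[k] = (float)k;
      __syncthreads();

      float a = s_data[tid];        //stride 1: 16 distinct banks, no conflict
      float b = s_data[tid * 2];    //stride 2: banks repeat every 8 threads, 2-way conflict
      float c = s_data[tid * 8];    //stride 8: banks repeat every 2 threads, 8-way conflict

      out[blockIdx.x * blockDim.x + tid] = a + b + c;
  }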

  37. Shared Memory Optimization
  • Broadcasting
  • - a half-warp of threads reading the same address is served with a single fetch
  [Figure: all threads reading one word of one bank via broadcast: no bank conflict]
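
  A minimal sketch of the broadcast case (names and the scaling operation are illustrative, not from the slides): every thread reads the same shared word, which the hardware serves with one fetch.

  __global__ void broadcastExample(const float* in, float* out, int n)
  {
      __shared__ float s_scale;

      if (threadIdx.x == 0)
          s_scale = in[0];            //one thread stores the shared value
      __syncthreads();

      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = s_scale * in[i];   //all threads read s_scale: broadcast, no conflict
  }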

  38. Shared Memory Optimization
  • Measuring shared memory bank conflicts
  • - the latest visual profiler can count the number of serialized warps
  • - requires a device with compute capability higher than 1.0

  39. Registers Optimization

  40. Registers
  • Accessing a register costs almost zero extra clock cycles
  • - delays may occur due to register memory bank conflicts
  • - delays may occur due to register read-after-write dependencies
  • Register delays can be hidden
  • - read-after-write delays can be ignored as soon as there are at least
  •   192 active threads on each multiprocessor to hide them
  • - the compiler achieves the best register usage when the number of
  •   threads per thread block is a multiple of 64

  41. Registers
  • Register read-after-write dependency

  x = y + 5;        add.f32 $f3,$f1,$f2
  z = x + 3;        add.f32 $f5,$f3,$f4

  s_data[0]+=3;     ld.shared.f32 $f3,[$r31+0]
                    add.f32 $f3,$f3,$f4

  • Completely hiding the latency
  • - use at least 192 threads for each shader multiprocessor
  • - that is, at least 25% occupancy for 1.0/1.1 and 18.75% for 1.2/1.3
  • - the active threads do not need to belong to the same thread block

  42. Registers
  • Registers cannot be used to declare array structures
  • - the register file has no addressable ordering structure
  • - an array declared in registers may instead be stored in local memory

  __global__ void kernelFunction(int size)
  {
      //located in the registers
      float temp1;
      float temp2[6];

      //located in the local memory
      float temp3[size];

      //for loop on the registers
      for(int loop=0;loop<6;loop++)
          temp2[loop]=loop*temp1;
  }

  43. Registers
  • Registers cannot be used to declare array structures
  • - the register file has no addressable ordering structure
  • - an array declared in registers may instead be stored in local memory

  __global__ void kernelFunction(int size)
  {
      //located in the registers
      float temp1;

      //located in the local memory
      float temp2[6];
      float temp3[size];

      //for loop on the registers
      for(int loop=0;loop<6;loop++)
          temp2[loop]=loop*temp1;
  }

  44. Registers
  • Registers cannot be used to declare array structures
  • - the register file has no addressable ordering structure
  • - an array declared in registers may instead be stored in local memory

  __global__ void kernelFunction(int size)
  {
      //located in the registers
      float temp1;
      float temp2[6];

      //located in the local memory
      float temp3[size];

      //unrolled loop on the registers
      temp2[0]=0*temp1;
      temp2[1]=1*temp1;
      temp2[2]=2*temp1;
      temp2[3]=3*temp1;
      temp2[4]=4*temp1;
      temp2[5]=5*temp1;
  }
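
  The slides unroll the loop by hand. A minimal alternative sketch, not from the slides: nvcc also accepts #pragma unroll, and with a compile-time trip count the indices into temp2 become constants, so temp2 can stay in registers (this remains a compiler decision, not a guarantee).

  __global__ void kernelFunctionUnrolled(float* out)
  {
      //located in the registers
      float temp1 = out[threadIdx.x];
      float temp2[6];

      //ask the compiler to unroll, producing constant indices into temp2
      #pragma unroll
      for(int loop=0;loop<6;loop++)
          temp2[loop]=loop*temp1;

      out[threadIdx.x]=temp2[5];
  }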

  45. Some Examples for Memory Optimization

  46. Matrix Transpose
  • Example: matrix transpose
  • - each thread block works on a tile of the input matrix
  • - the naïve implementation exhibits strided access to global memory
  • - sub-matrix elements are transposed by a half-warp of threads
  [Figure: an input matrix tile and its transposed position in the output matrix]

  47. Matrix Transpose
  • Naïve matrix transpose implementation
  • - loads are coalesced but stores are not (strided by height)

  //perform the naïve matrix transpose
  __global__ void transpose(float* idata,float* odata,int size)
  {
      //compute each thread global index
      int xindex = blockIdx.x*blockDim.x+threadIdx.x;
      int yindex = blockIdx.y*blockDim.y+threadIdx.y;

      //compute load and store element index
      int indexi = xindex+size*yindex;
      int indexo = yindex+size*xindex;

      //perform matrix transpose for each thread
      odata[indexo]=idata[indexi];
  }

  48. Matrix Transpose
  • Use a tiled algorithm to overcome the non-coalesced stores
  • - load the columns of a tile sub-matrix into shared memory
  • - store contiguous data to one of the rows in global memory
  • Requires __syncthreads() to synchronize the thread block
  • - threads read data in shared memory that was stored by other threads

  49. Matrix Transpose
  • Tiled matrix transpose implementation

  //perform the tiled matrix transpose
  __global__ void transpose(float* idata,float* odata,int size)
  {
      //declare the shared memory
      __shared__ float tile[TILED_DIM][TILED_DIM];

      //compute each thread input index
      int xindex = blockIdx.x*blockDim.x+threadIdx.x;
      int yindex = blockIdx.y*blockDim.y+threadIdx.y;
      int indexi = xindex+size*yindex;

      //compute each thread output index
      xindex = blockIdx.y*blockDim.y+threadIdx.x;
      yindex = blockIdx.x*blockDim.x+threadIdx.y;
      int indexo = xindex+size*yindex;

      //load data from global to shared memory
      tile[threadIdx.y][threadIdx.x]=idata[indexi];
      __syncthreads();

      //perform matrix transpose for each thread
      odata[indexo]=tile[threadIdx.x][threadIdx.y];
  }
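
  The launch configuration is not shown on the slides. A possible host-side launch for the tiled kernel above, assuming TILED_DIM = 16 and a square size-by-size matrix with size divisible by TILED_DIM:

  #define TILED_DIM 16

  void launchTranspose(float* d_idata, float* d_odata, int size)
  {
      dim3 block(TILED_DIM, TILED_DIM);              //one thread per tile element
      dim3 grid(size / TILED_DIM, size / TILED_DIM); //one block per tile
      transpose<<<grid, block>>>(d_idata, d_odata, size);
  }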

  50. Matrix Transpose
  • Use a 16-by-16 shared memory tile of floats
  • - the data in each column fall into the same bank
  • - 16-way bank conflict when reading the column values
  •   (a common fix is sketched below)
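
  A minimal sketch of the usual remedy, which is not shown on this slide and is assumed here: pad the shared tile with one extra column so that the 16 elements of a column fall into 16 different banks.

  __global__ void transposeNoBankConflict(float* idata, float* odata, int size)
  {
      //the +1 column of padding shifts each row by one bank
      __shared__ float tile[TILED_DIM][TILED_DIM+1];

      int xindex = blockIdx.x*blockDim.x+threadIdx.x;
      int yindex = blockIdx.y*blockDim.y+threadIdx.y;
      int indexi = xindex+size*yindex;

      xindex = blockIdx.y*blockDim.y+threadIdx.x;
      yindex = blockIdx.x*blockDim.x+threadIdx.y;
      int indexo = xindex+size*yindex;

      tile[threadIdx.y][threadIdx.x]=idata[indexi];
      __syncthreads();

      //the column read is now free of bank conflicts
      odata[indexo]=tile[threadIdx.x][threadIdx.y];
  }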
