ME964 High Performance Computing for Engineering Applications Gauging Kernel Performance Control Flow in CUDA Oct. 9, 2008
Before we get started… • Last Time • Guest Lecturer: Michael Garland, Researcher at NVIDIA • Writing Efficient CUDA Algorithms • Today • Gauging the extent to which you use hardware resources in CUDA • Control Flow in CUDA • Homework related • HW6 available for download (exclusive scan operation) • HW5, 2D matrix convolution, due at 11:59 PM today 2
Exercise: Does Matrix Multiplication Incur Shared Memory Bank Conflicts? • Scenario A. The tile matrix is computed as follows: one half warp computes one row of the tile at a time. [Diagram: shared memory bank indices 0 through 15, shown for each step of the computation; the highlighted rows mark the second step of computing the 5th row of the tile.] • In scenario A, all threads in a half-warp access the same shared memory entry of one tile, leading to a broadcast, and they access elements in neighboring banks of the other tile as they walk through the computation. • Scenario B. The tile matrix is computed as follows: one half warp computes one column of the tile at a time (work this case out yourself). A sketch of the relevant inner loop follows below. 3
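For reference, here is a minimal sketch of the inner-product step that drives this exercise; the names Ms, Ns, TILE_WIDTH, tx, and ty follow the usual tiled-multiply kernel (as in HW5) and are assumptions, not the exact assignment code:

#define TILE_WIDTH 16

// One thread's contribution to the tile product; Ms and Ns live in shared memory.
__device__ float tileDotRow(const float Ms[TILE_WIDTH][TILE_WIDTH],
                            const float Ns[TILE_WIDTH][TILE_WIDTH],
                            int ty, int tx) {
    float Pvalue = 0.0f;
    for (int k = 0; k < TILE_WIDTH; ++k)
        // Ms[ty][k]: all 16 threads of a half-warp read the SAME address -> broadcast.
        // Ns[k][tx]: consecutive tx values hit consecutive banks -> conflict-free.
        Pvalue += Ms[ty][k] * Ns[k][tx];
    return Pvalue;
}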
Final Comments, Memory Access • Given the GPU memory spaces and their latency, a typical programming pattern emerges at the thread level: • Load data from device memory into shared memory (coalesced if possible) • Synchronize with all the other threads of the block to avoid data access hazards • Process the data that you just brought over in shared memory • Synchronize as needed • Write the results back to global memory (coalesced if possible) • NOTE: for CUDA computing, always try hard to make your computation fit this model (see the sketch below) 4
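A minimal sketch of this load–sync–process–write pattern; the kernel name, the doubling operation, and the block size of 256 are illustrative assumptions:

__global__ void processTile(const float* g_in, float* g_out) {
    __shared__ float tile[256];                  // assumes blockDim.x == 256
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_in[i];                 // 1. coalesced load into shared memory
    __syncthreads();                             // 2. avoid read-after-write hazards
    float result = 2.0f * tile[threadIdx.x];     // 3. process the data in shared memory
    __syncthreads();                             // 4. synchronize again as needed (illustrative here)
    g_out[i] = result;                           // 5. coalesced write back to global memory
}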
CUDA Programming Common Sense Advice • Keep this in mind: • Allocating memory on the device or the host is expensive • Moving data back and forth between the host and device is a killer (see the sketch below) • Global memory accesses are going to be slow • If they are not coalesced they are even slower… • Make sure that you keep the SM • Occupied (currently, 24 warps can be managed concurrently) • Busy (avoid data starvation, have it crunch numbers) • If you can, avoid bank conflicts. Not that big of a deal, though. 5
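To make the first two bullets concrete, here is a hedged sketch of the "allocate once, transfer rarely" idea; myKernel, runManySteps, and the launch configuration are illustrative placeholders:

__global__ void myKernel(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;                     // stand-in for real work
}

void runManySteps(float* hA, int n, int nSteps) {
    float* dA;
    cudaMalloc((void**)&dA, n * sizeof(float));                      // expensive: do it once
    cudaMemcpy(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice);   // host<->device traffic is the killer
    for (int step = 0; step < nSteps; ++step)
        myKernel<<<(n + 255) / 256, 256>>>(dA, n);                   // reuse the same device buffer
    cudaMemcpy(hA, dA, n * sizeof(float), cudaMemcpyDeviceToHost);   // copy back only at the end
    cudaFree(dA);
}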
Gauging the level of HW use http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls • In order to gauge how well your code uses the HW, you need to use the CUDA occupancy calculator (google it) 6
Gauging the level of HW use (cont.) • Three things are asked of you: • Number of threads per block (this is trivial to provide) • Number of registers per thread • Number of bytes of shared memory used by each block • You get the last two quantities by adding "--ptxas-options=-v" (equivalently, "-Xptxas -v") to the compile command line: $(CUDA_BIN_PATH)\nvcc.exe -cuda --ptxas-options=-v -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -I"$(VCInstallDir)\include" -I"$(VCInstallDir)\PlatformSDK\include" -o $(ConfigurationName)\matrixmul.gen.c matrixmul.cu • In Visual Studio, right-click the main .cu file, go to Properties, and edit the Custom Build Step by adding "--ptxas-options=-v" 7
Gauging the level of HW use (cont.) • Open the generated file in a text editor and look for, in a pile of stuff that doesn't make any sense, a chunk of text that looks like this:

code {
    name = _Z15MatrixMulKernel6MatrixS_S_
    lmem = 0
    smem = 2112
    reg = 14
    bar = 0
    bincode {
        0x3004c815 0xe43007c0 0x10008025 0x00000003
        0xd0800205 0x00400780 0xa000000d 0x04000780
        0xa0000205 0x04000780 0x10056003 0x00000100
        0x30040601 0xc4100780 0x20000001 0x04004780

• This is telling you that MatrixMulKernel (which is the name I gave my kernel) uses 2112 bytes of shared memory and 14 registers per thread, and that it makes no use of local memory (lmem = 0) 8
Alternatively, in Developer Studio the same numbers show up directly in the build output. [Screenshot: ptxas output in the Visual Studio console.] This is what you are interested in: smem: 2672 bytes, registers: 9. 9
End: Discussion on Memory Spaces (Access and Latency Issues) Begin: Control Flow 10
Objective • Understand the implications of control flow on • Branch divergence overhead • SM execution resource utilization • Learn better ways to write code with control flow • Understand compiler/HW predication • An idea meant to reduce the impact of control flow • There is a cost involved with this process. 11 HK-UIUC
Quick terminology review • Thread: concurrent code, and its associated state, executed on the CUDA device (in parallel with other threads) • The unit of parallelism in CUDA • Number of threads used is controlled by the user • Warp: a group of threads executed physically in parallel in G80 • Number of threads in a warp is not controlled by the user • Block: a group of threads that are executed together and form the unit of resource assignment • Number of blocks used is controlled by the user • Grid: a group of thread blocks that must all complete before the next phase of the program can begin 12 HK-UIUC
How thread blocks are partitioned • Each thread block is partitioned into warps • Thread IDs within a warp are consecutive and increasing • Remember: In multidimensional blocks, the x thread index runs first, followed by the y thread index, and finally followed by the z thread index • Warp 0 starts with Thread ID 0 • Partitioning is always the same • Thus you can use this knowledge in control flow • However, the exact size of warps may change from release to release • While you can rely on ordering among threads, DO NOT rely on any ordering among warps • Remember, the concept of warp is not something you control through CUDA • If there are any dependencies between threads, you must __syncthreads() to get correct results 13 HK-UIUC
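As an illustration of this partitioning, the helpers below (hypothetical names, not part of the CUDA API) compute the linearized thread index and the warp index inside a multidimensional block, assuming the x index runs fastest as described above:

// Linear thread index within the block: x runs first, then y, then z.
__device__ unsigned int linearThreadIdx() {
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// Which warp this thread belongs to; warpSize is a CUDA built-in variable.
__device__ unsigned int warpIdx() {
    return linearThreadIdx() / warpSize;
}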
Control Flow Instructions • The main performance concern with branching is divergence • Threads within a single warp take different paths • Different execution paths are serialized in G80 • The control paths taken by the threads in a warp are traversed one at a time until there are no more • NOTE: Don't forget that divergence can manifest only at the warp level. You cannot discuss this concept in relation to code executed by threads in different warps 14 HK-UIUC
Control Flow Instructions (cont.) • A common case: avoid divergence when the branch condition is a function of the thread ID • Example with divergence: • if (threadIdx.x > 2) { } • This creates two different control paths for threads in a block • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp • Example without divergence: • if (threadIdx.x / WARP_SIZE > 2) { } • Also creates two different control paths for threads in a block • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path • Both cases are contrasted in the sketch below 15 HK-UIUC
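A small kernel contrasting the two branches above (a sketch; the kernel name and data array are illustrative, and WARP_SIZE is assumed to be 32):

#define WARP_SIZE 32

__global__ void branchExamples(float* data) {
    // Divergent: threads 0..2 and threads 3..31 of the first warp take
    // different paths, so the warp serializes both paths.
    if (threadIdx.x > 2)
        data[threadIdx.x] *= 2.0f;

    // Non-divergent: the condition is uniform across each warp, so warps
    // 0..2 all take one path and warps 3 and up all take the other.
    if (threadIdx.x / WARP_SIZE > 2)
        data[threadIdx.x] += 1.0f;
}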
Illustration: Parallel Reduction • Use the "Parallel Reduction" algorithm as a vehicle to discuss the issue of control flow • Given an array of values, "reduce" them in parallel to a single value • Examples • Sum reduction: sum of all values in the array • Max reduction: maximum of all values in the array • Typical parallel implementation: • Recursively halve the number of threads, add two values per thread • Takes log(n) steps for n elements, requires n/2 threads 16 HK-UIUC
A Vector Reduction Example • Assume an in-place reduction using shared memory • We are in the process of summing up a 512-element array • Shared memory is used to hold a partial sum vector • Each iteration brings the partial sum vector closer to the final sum • The final sum will be stored in element 0 17 HK-UIUC
A simple implementation • Assume we have already loaded the array into __shared__ float partialSum[] (a complete kernel built around this loop is sketched below):

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t+stride];
}

18 HK-UIUC
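For context, a minimal sketch of a complete kernel around this loop, assuming one 512-thread block reduces a 512-element array; the names g_input and g_output are illustrative:

__global__ void reduceSum(float* g_input, float* g_output) {
    __shared__ float partialSum[512];
    unsigned int t = threadIdx.x;
    partialSum[t] = g_input[t];                  // coalesced load into shared memory
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        g_output[0] = partialSum[0];             // final sum lives in element 0
}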
The "Bank Conflicts" Aspect [Diagram: in-place reduction over array elements 0–11 across the iterations — iteration 1 forms the pairs 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0...3, 4..7, 8..11; iteration 3 forms 0..7, 8..15.] 19 HK-UIUC
The "Branch Divergence" Aspect [Diagram: the same reduction tree over array elements 0–11, with only the even-numbered threads (Thread 0, 2, 4, 6, 8, 10) active in iteration 1, and ever fewer threads active in iterations 2 and 3.] 20 HK-UIUC
Some Observations • In each iteration, two control flow paths are sequentially traversed for each warp • Threads that perform the addition and threads that do not • Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence 21 HK-UIUC
Some Observations (cont.) • No more than half of the threads will be executing at any time • All odd-index threads are disabled right from the beginning! • On average, less than ¼ of the threads will be activated for all warps over time • After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence • This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one active thread per still-active warp, until all warps retire 22 HK-UIUC
Shortcomings of the implementation • Assume we have already loaded the array into __shared__ float partialSum[]:

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)                    // BAD: divergence due to interleaved branch decisions
        partialSum[t] += partialSum[t+stride];  // BAD: bank conflicts due to the stride
}

23 HK-UIUC
A better implementation • Assume we have already loaded the array into __shared__ float partialSum[]:

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}

24 HK-UIUC
No Divergence until < 16 sub-sums [Diagram: first iteration of the new scheme on a 32-element example — thread 0 computes 0+16, thread 1 computes 1+17, ..., thread 15 computes 15+31; the active threads stay contiguous as the iterations progress.] 25 HK-UIUC
Some Observations About the New Implementation • Only the last 5 iterations will have divergence • Entire warps will be shut down as iterations progress • For a 512-thread block, it takes 4 iterations to shut down all but one warp in the block • Better resource utilization; warps will likely retire early, and thus the block executes faster • Recall, no bank conflicts either 26 HK-UIUC
A Potential Further Refinement but Bad Idea • For the last 6 loops only one warp is active (i.e. tid's 0..31) • Shared memory reads & writes are SIMD-synchronous within a warp • So skip __syncthreads() and unroll the last 6 iterations:

unsigned int tid = threadIdx.x;
for (unsigned int d = n >> 1; d > 32; d >>= 1) {
    __syncthreads();
    if (tid < d)
        shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32) { // unroll last 6 predicated steps
    shared[tid] += shared[tid + 32];
    shared[tid] += shared[tid + 16];
    shared[tid] += shared[tid + 8];
    shared[tid] += shared[tid + 4];
    shared[tid] += shared[tid + 2];
    shared[tid] += shared[tid + 1];
}

27 HK-UIUC
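For the record, even where the warp-synchronous assumption holds, this trick is typically written with the shared-memory pointer declared volatile so the compiler does not cache partial sums in registers between the unrolled statements; a sketch is below. (On modern GPUs the lockstep assumption no longer holds, and one would use __syncwarp() or warp shuffles instead.)

// Last-warp unroll with a volatile pointer; sdata points into shared memory.
__device__ void warpReduce(volatile float* sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}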
A Potential Further Refinement but Bad Idea (cont.) • Concluding remarks on the "further refinement": • This would not work properly if the warp size decreases • It also doesn't look that attractive if the warp size increases • Finally, strictly speaking you need __syncthreads() between each statement! • Having __syncthreads() inside an if-statement is problematic 28 HK-UIUC
Control Flow Instructions • if, switch, for, while – can significantly impact the effective instruction throughput when threads of the same warp diverge • If this happens, the execution is serialized • This increases the number of instructions executed for this warp • When all the different execution paths have completed, the threads converge back to the same execution path • Not only do you execute more instructions, you also need the book-keeping logic associated with this process 29
Predicated Execution Concept • Thread divergence can be avoided in some cases by using the concept of predication:

<p1> LDR r1, r2, 0

• If p1 is TRUE, the assembly instruction above executes normally • If p1 is FALSE, the instruction is treated as a NOP 30 HK-UIUC
Predication Example

Source code:
    if (x == 10)
        c = c + 1;

Predicated assembly:
    LDR r5, X
    p1 <- r5 eq 10
    <p1> LDR r1 <- C
    <p1> ADD r1, r1, 1
    <p1> STR r1 -> C

31 HK-UIUC
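In CUDA C terms, a short guarded statement like the one above is a good predication candidate; here is a hedged sketch (the kernel and array names are illustrative):

__global__ void incrementIfTen(int* c, const int* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // A one-statement branch body like this can be turned into predicated
    // instructions by the compiler, avoiding any divergence cost.
    if (i < n && x[i] == 10)
        c[i] = c[i] + 1;
}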
Predication very helpful for if-else [Diagram: control flow graph A → (B or C) → D; with predication, both B and C are issued along a single straight-line path from A to D.] 32 HK-UIUC
If-else example

Predicated code:
    p1, p2 <- r5 eq 10
    <p1> inst 1 from B
    <p1> inst 2 from B
    ...
    <p2> inst 1 from C
    <p2> inst 2 from C
    ...

This is what gets scheduled:
    p1, p2 <- r5 eq 10
    <p1> inst 1 from B
    <p2> inst 1 from C
    <p1> inst 2 from B
    <p2> inst 2 from C
    ...

The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence. 33 HK-UIUC
Instruction Predication in G80 • Your comparison instructions set condition codes (CC) • Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.) • The compiler tries to predict if a branch condition is likely to produce many divergent warps • If that’s the case, go ahead and predicate if the branch has <7 instructions • If that’s not the case, only predicate if the branch has <4 instructions • Note: it’s pretty bad if you predicate when it was obvious that there would have been no divergence 34 HK-UIUC
Instruction Predication in G80 (cont.) • ALL predicated instructions take execution cycles • Those with false conditions don’t write their output, and do not evaluate addresses or read operands • Saves branch instructions, so can be cheaper than serializing divergent paths • If all this business is confusing, remember this: • Avoid thread divergence • It’s not 100% clear to me, but I believe that there is no cost if a subset of threads belonging to a warp sits there and does nothing while the other warp threads are all running the same instruction 35 HK-UIUC