
ME964 High Performance Computing for Engineering Applications



  1. ME964 High Performance Computing for Engineering Applications. Gauging Kernel Performance; Control Flow in CUDA. Oct. 9, 2008

  2. Before we get started… • Last Time • Guest Lecturer: Michael Garland, Researcher at NVIDIA • Writing Efficient CUDA Algorithms • Today • Gauging the extent to which you use hardware resources in CUDA • Control Flow in CUDA • Homework-related • HW6 available for download (exclusive scan operation) • HW5, 2D matrix convolution, due at 11:59 PM today

  3. Exercise: Does Matrix Multiplication Incur Shared Memory Bank Conflicts? • Scenario A: the tile matrix is computed with one half-warp computing one row of the tile at a time. [Figure: successive half-warp accesses annotated with bank indices 0 1 2 3 4 5 6 7 … 15; the highlighted step is the second step of computing the 5th row of the tile.] • In scenario A, all threads in a half-warp access the same shared memory entry, leading to a broadcast, and as they walk through the computation they access elements in neighboring banks. • Scenario B: the tile matrix is computed with one half-warp computing one column of the tile at a time (work this case out yourself; the inner loop in question is sketched below).
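
For reference, the inner loop the exercise refers to might look like the single-tile sketch below. The names TILE_WIDTH, As, Bs, and Pvalue follow the customary tiled matrix-multiply formulation and are assumptions here, not taken from the slide:

    #define TILE_WIDTH 16

    // Single-tile sketch: launch with one 16x16 block so each thread
    // computes one inner product over the tiles held in shared memory.
    __global__ void tileInnerProduct(const float* M, const float* N, float* P)
    {
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
        int tx = threadIdx.x, ty = threadIdx.y;

        As[ty][tx] = M[ty * TILE_WIDTH + tx];
        Bs[ty][tx] = N[ty * TILE_WIDTH + tx];
        __syncthreads();

        float Pvalue = 0.0f;
        // Scenario A: a half-warp shares ty, so for a fixed k all 16 threads
        // read the single address As[ty][k] (a broadcast), while Bs[k][tx]
        // falls in 16 neighboring banks as tx varies -- no conflicts.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += As[ty][k] * Bs[k][tx];

        P[ty * TILE_WIDTH + tx] = Pvalue;
    }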

  4. Final Comments, Memory Access • Given the GPU memory spaces and their latency, a typical programming pattern emerges at the thread level: • Load data from device memory into shared memory (coalesced if possible) • Synchronize with all the other threads of the block to avoid data access hazards • Process the data that you just brought over in shared memory • Synchronize as needed • Write the results back to global memory (coalesced if possible) • NOTE: for CUDA computing, always try hard to make your computation fit this model (a minimal sketch follows below)
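
The kernel below is a minimal sketch of this five-step pattern. The kernel name, BLOCK_SIZE, and the neighbor-average "process" step are illustration only, not part of the lecture:

    #define BLOCK_SIZE 256

    // Launch with BLOCK_SIZE threads per block.
    __global__ void patternKernel(const float* in, float* out)
    {
        __shared__ float tile[BLOCK_SIZE];
        unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // 1. Load from global memory into shared memory; consecutive threads
        //    read consecutive addresses, so the access is coalesced.
        tile[threadIdx.x] = in[gid];

        // 2. Synchronize so no thread reads an entry before it is written.
        __syncthreads();

        // 3. Process the data now sitting in shared memory.
        float result = 0.5f * (tile[threadIdx.x] +
                               tile[(threadIdx.x + 1) % BLOCK_SIZE]);

        // 4. (Synchronize again here if shared memory were reused.)

        // 5. Write the results back to global memory, again coalesced.
        out[gid] = result;
    }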

  5. CUDA Programming Common Sense Advice • Keep this in mind: • Allocating memory on the device or host is expensive • Moving data back and forth between the host and device is a killer • Global memory accesses are going to be slow • If they are not coalesced they are even slower… • Make sure that you keep the SM • Occupied (currently, 24 warps can be managed concurrently) • Busy (avoid data starvation, have it crunch numbers) • If you can, avoid bank conflicts. Not that big of a deal, though.

  6. Gauging the level of HW use • In order to gauge how well your code uses the HW, you need to use the CUDA Occupancy Calculator (google it): http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

  7. Gauging the level of HW use (cont.) • Three things are asked of you: • Number of threads per block (this is trivial to provide) • Number of registers per thread • Number of bytes of shared memory used by each block • You get the last two quantities by adding "--ptxas-options -v" to the compile command line: $(CUDA_BIN_PATH)\nvcc.exe -cuda --ptxas-options -v -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -I"$(VCInstallDir)\include" -I"$(VCInstallDir)\PlatformSDK\include" -o $(ConfigurationName)\matrixmul.gen.c matrixmul.cu • In Visual Studio, right-click the main .cu file, go to Properties, and edit the Custom Build Step by adding "--ptxas-options -v"

  8. Gauging the level of HW use (cont.) • Open the object file in a text editor to find, in a pile of stuff that doesn't make any sense, a chunk of text that looks like this:

    code {
        name = _Z15MatrixMulKernel6MatrixS_S_
        lmem = 0
        smem = 2112
        reg = 14
        bar = 0
        bincode {
            0x3004c815 0xe43007c0 0x10008025 0x00000003
            0xd0800205 0x00400780 0xa000000d 0x04000780
            0xa0000205 0x04000780 0x10056003 0x00000100
            0x30040601 0xc4100780 0x20000001 0x04004780
            ...

• This is telling you that MatrixMulKernel (the name I gave my kernel) uses 2112 bytes of shared memory and 14 registers per thread, and that it makes no use of local memory (lmem = 0)
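
With --ptxas-options -v on the command line, ptxas also prints these statistics directly in the compiler's build output, which spares you the object-file archaeology. The exact wording varies across CUDA toolkit versions; the lines below only illustrate the general shape and are an assumption, not copied from a real build:

    ptxas info : Compiling entry function '_Z15MatrixMulKernel6MatrixS_S_'
    ptxas info : Used 14 registers, 2112 bytes smem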

  9. Alternatively, in Developer Studio the build output reports what you are interested in, e.g. smem = 2672 bytes, registers = 9. [Figure: screenshot of the Visual Studio build output window.]

  10. End: Discussion on Memory Spaces (Access and Latency Issues). Begin: Control Flow

  11. Objective • Understand the implications of control flow for • Branch divergence overhead • SM execution resource utilization • Learn better ways to write code with control flow • Understand compiler/HW predication • An idea meant to reduce the impact of control flow • There is a cost involved with this process. HK-UIUC

  12. Quick terminology review • Thread: concurrent code, and its associated state, executed on the CUDA device (in parallel with other threads) • The unit of parallelism in CUDA • The number of threads used is controlled by the user • Warp: a group of threads executed physically in parallel in G80 • The number of threads in a warp is not controlled by the user • Block: a group of threads that are executed together and form the unit of resource assignment • The number of blocks used is controlled by the user • Grid: a group of thread blocks that must all complete before the next phase of the program can begin HK-UIUC

  13. How thread blocks are partitioned • Each thread block is partitioned into warps • Thread IDs within a warp are consecutive and increasing • Remember: in multidimensional blocks, the x thread index runs first, followed by the y thread index, and finally followed by the z thread index (see the sketch below) • Warp 0 starts with Thread ID 0 • Partitioning is always the same • Thus you can use this knowledge in control flow • However, the exact size of warps may change from release to release • While you can rely on ordering among threads, DO NOT rely on any ordering among warps • Remember, the concept of warp is not something you control through CUDA • If there are any dependencies between threads, you must use __syncthreads() to get correct results HK-UIUC
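
As a concrete illustration (a sketch, not from the slides; the kernel name and output array are hypothetical), the linearized thread ID that warp partitioning is based on, and the warp a thread lands in, can be computed inside a kernel as follows:

    // Launch with a single block; warpIdOut must hold one entry per thread.
    __global__ void whichWarp(unsigned int* warpIdOut)
    {
        // Linearized thread ID: the x index runs first, then y, then z.
        unsigned int tid = threadIdx.x
                         + threadIdx.y * blockDim.x
                         + threadIdx.z * blockDim.x * blockDim.y;

        // Warp index within the block. warpSize is a CUDA built-in variable
        // (32 on G80); prefer it over a hard-coded constant since the warp
        // size may change from release to release.
        warpIdOut[tid] = tid / warpSize;
    }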

  14. Control Flow Instructions • The main performance concern with branching is divergence • Threads within a single warp take different paths • Different execution paths are serialized in G80 • The control paths taken by the threads in a warp are traversed one at a time until there are no more • NOTE: don't forget that divergence can manifest only at the warp level. You cannot discuss this concept in relation to code executed by threads in different warps HK-UIUC

  15. Control Flow Instructions (cont.) • A common case: avoid divergence when the branch condition is a function of the thread ID • Example with divergence: • if (threadIdx.x > 2) { } • This creates two different control paths for threads in a block • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp • Example without divergence: • if (threadIdx.x / WARP_SIZE > 2) { } • Also creates two different control paths for threads in a block • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path (both cases are sketched in the kernel below) HK-UIUC
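
A minimal sketch putting both cases into one kernel; the kernel name and the arithmetic on data are hypothetical, and you would need at least 128 threads per block for both paths of the second branch to be taken:

    #define WARP_SIZE 32

    __global__ void branchGranularity(float* data)
    {
        unsigned int tid = threadIdx.x;

        // Divergent: within the first warp, threads 0..2 take the else path
        // while threads 3..31 take the if path, so the warp serializes both.
        if (tid > 2)
            data[tid] += 1.0f;
        else
            data[tid] -= 1.0f;

        // Non-divergent: all 32 threads of a warp share the same value of
        // tid / WARP_SIZE, so each warp takes one path as a unit.
        if (tid / WARP_SIZE > 2)
            data[tid] *= 2.0f;
        else
            data[tid] *= 0.5f;
    }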

  16. Illustration: Parallel Reduction • Use the “Parallel Reduction” algorithm as a vehicle to discuss the issue of control flow • Given an array of values, “reduce” them in parallel to a single value • Examples • Sum reduction: sum of all values in the array • Max reduction: maximum of all values in the array • Typical parallel implementation: • Recursively halve the number of threads, add two values per thread • Takes log(n) steps for n elements, requires n/2 threads HK-UIUC

  17. A Vector Reduction Example • Assume an in-place reduction using shared memory • We are in the process of summing up a 512-element array • Shared memory is used to hold a partial sum vector • Each iteration brings the partial sum vector closer to the final sum • The final sum will be stored in element 0 HK-UIUC

  18. A simple implementation • Assume we have already loaded the array into __shared__ float partialSum[]:

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

HK-UIUC
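
Wrapped into a complete kernel, the loop might look as follows. This is a sketch: the kernel name, the 512-element block size, and the load from global memory are assumptions, not part of the slide:

    // Naive in-place sum reduction; launch with 512 threads per block.
    __global__ void reduceNaive(const float* g_in, float* g_out)
    {
        __shared__ float partialSum[512];
        unsigned int t = threadIdx.x;

        // Each thread loads one element into the shared partial sum vector.
        partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
        {
            __syncthreads();                // previous iteration must finish
            if (t % (2 * stride) == 0)      // interleaved pairs -> divergence
                partialSum[t] += partialSum[t + stride];
        }

        if (t == 0)                         // final sum sits in element 0
            g_out[blockIdx.x] = partialSum[0];
    }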

  19. The “Bank Conflicts” Aspect [Figure: tree reduction over array elements 0, 1, …, 11. Iteration 1 forms the pairs 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0..3, 4..7, 8..11; iteration 3 forms 0..7, 8..15.] HK-UIUC

  20. The “Branch Divergence” Aspect [Figure: the same reduction tree, highlighting that in the first iteration only the even-indexed threads (Thread 0, 2, 4, 6, 8, 10) perform additions.] HK-UIUC

  21. Some Observations • In each iteration, two control flow paths are sequentially traversed for each warp • Threads that perform addition and threads that do not • Threads that do not perform addition may still cost extra cycles, depending on the implementation of divergence HK-UIUC

  22. Some Observations (cont.) • No more than half of the threads will be executing at any time • All odd-index threads are disabled right from the beginning! • On average, fewer than ¼ of the threads will be active across all warps over time • After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence • This can go on for up to 4 more iterations (512/32 = 16 = 2^4), during which each still-active warp has only one active thread, until all warps retire (worked out below) HK-UIUC
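
Worked out for the 512-thread block (a back-of-the-envelope check of the numbers above): iteration k activates 512/2^k threads, i.e. 256, 128, 64, 32, 16 threads in iterations 1 through 5, yet those threads are spread out so that every one of the 16 warps still contains at least one active thread. From iteration 6 (stride 32) on, the active threads sit 64 or more elements apart, so whole warps retire: 8, then 4, then 2, then 1 active thread in iterations 6 through 9.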

  23. Shortcomings of the implementation • Assume we have already loaded the array into __shared__ float partialSum[]:

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)        // BAD: divergence due to interleaved branch decisions
            partialSum[t] += partialSum[t+stride];   // BAD: bank conflicts due to the stride
    }

HK-UIUC

  24. A better implementation • Assume we have already loaded the array into __shared__ float partialSum[]:

    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t+stride];
    }

HK-UIUC
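
As a complete kernel, mirroring the naive sketch from slide 18 so the two loops can be compared side by side (again, the kernel name and the global-memory load are assumptions):

    // Reduction with sequential addressing; launch with 512 threads per block.
    __global__ void reduceBetter(const float* g_in, float* g_out)
    {
        __shared__ float partialSum[512];
        unsigned int t = threadIdx.x;
        partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

        for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
        {
            __syncthreads();
            if (t < stride)              // active threads stay contiguous,
                partialSum[t] += partialSum[t + stride];  // so whole warps
        }                                // drop out together

        if (t == 0)
            g_out[blockIdx.x] = partialSum[0];
    }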

  25. No Divergence until < 16 sub-sums [Figure: the first iteration on a 32-element example: thread t adds element t+16 into element t (0+16, 1+17, …, 15+31), so the active threads are contiguous and divergence appears only once fewer than 16 partial sums remain.] HK-UIUC

  26. Some Observations About the New Implementation • Only the last 5 iterations will have divergence • Entire warps will be shut down as iterations progress • For a 512-thread block, it takes 4 iterations to shut down all but one warp in the block • Better resource utilization; warps retire early and thus the block executes faster • Recall, no bank conflicts either HK-UIUC

  27. A Potential Further Refinement, but a Bad Idea • For the last 6 loops only one warp is active (i.e., tids 0..31) • Shared reads & writes are SIMD-synchronous within a warp • So skip __syncthreads() and unroll the last 6 iterations:

    unsigned int tid = threadIdx.x;
    for (unsigned int d = n>>1; d > 32; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
            shared[tid] += shared[tid + d];
    }
    __syncthreads();
    if (tid < 32) {  // unroll last 6 predicated steps
        shared[tid] += shared[tid + 32];
        shared[tid] += shared[tid + 16];
        shared[tid] += shared[tid + 8];
        shared[tid] += shared[tid + 4];
        shared[tid] += shared[tid + 2];
        shared[tid] += shared[tid + 1];
    }

HK-UIUC

  28. A Potential Further Refinement, but a Bad Idea (cont.) • Concluding remarks on the “further refinement”: • This would not work properly if the warp size decreases • It also doesn't look that attractive if the warp size increases • Finally, to be safe you would need __syncthreads() between each of the unrolled statements, and having __syncthreads() inside an if-statement is problematic HK-UIUC

  29. Control Flow Instructions • if, switch, for, while – can significantly impact the effective instruction throughput when threads of the same warp diverge • If this happens, the execution is serialized • This increases the number of instructions executed for this warp • When all the different execution paths have completed, the threads converge back to the same execution path • Not only do you execute more instructions, you also need the bookkeeping logic associated with this process

  30. Predicated Execution Concept • Thread divergence can be avoided in some cases by using the concept of predication: <p1> LDR r1, r2, 0 • If p1 is TRUE, the assembly instruction above executes normally • If p1 is FALSE, the instruction is treated as a NOP HK-UIUC

  31. Predication Example • Source:

    if (x == 10)
        c = c + 1;

• Predicated assembly:

    LDR r5, X
    p1 <- r5 eq 10
    <p1> LDR r1 <- C
    <p1> ADD r1, r1, 1
    <p1> STR r1 -> C

HK-UIUC

  32. Predication is very helpful for if-else [Figure: control-flow graph with block A branching into B and C, both merging into D; with predication the four blocks are issued as one straight-line sequence A, B, C, D.] HK-UIUC

  33. If-else example • As written:

    p1,p2 <- r5 eq 10
    <p1> inst 1 from B
    <p1> inst 2 from B
    ...
    <p2> inst 1 from C
    <p2> inst 2 from C
    ...

• This is what gets scheduled (B and C interleaved):

    p1,p2 <- r5 eq 10
    <p1> inst 1 from B
    <p2> inst 1 from C
    <p1> inst 2 from B
    <p2> inst 2 from C
    ...

• The cost is that extra instructions are issued each time the code is executed. However, there is no branch divergence. HK-UIUC

  34. Instruction Predication in G80 • Comparison instructions set condition codes (CC) • Instructions can be predicated to write results only when the CC meets a criterion (CC != 0, CC >= 0, etc.) • The compiler tries to predict if a branch condition is likely to produce many divergent warps • If that is the case, predicate if the branch has <7 instructions • If that is not the case, only predicate if the branch has <4 instructions • Note: it's pretty bad to predicate when it was obvious there would have been no divergence HK-UIUC

  35. Instruction Predication in G80 (cont.) • ALL predicated instructions take execution cycles • Those with false conditions don't write their output, and do not evaluate addresses or read operands • Saves branch instructions, so predication can be cheaper than serializing divergent paths • If all this business is confusing, remember this: avoid thread divergence • It's not 100% clear to me, but I believe there is no cost if a subset of threads belonging to a warp sits idle while the other threads of the warp all run the same instruction HK-UIUC
