
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Presentation Transcript


  1. Fall 2010 • Jih-Kwon Peir • Computer Information Science Engineering • University of Florida CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

  2. Supplement 4: Handling Control Flow

  3. Objective • To understand the implications of control flow on • Branch divergence overhead • SM execution resource utilization • To learn better ways to write code with control flow (reduce branch divergence) • To understand compiler/HW predication designed to reduce the impact of control flow • There is a cost involved

  4. Quick terminology review • Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads) • The unit of parallelism in CUDA • Warp: a group of threads executed physically in parallel in G80 • Block: a group of threads that are executed together and form the unit of resource assignment • Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

  5. How thread blocks are partitioned • Thread blocks are partitioned into warps • Thread IDs within a warp are consecutive and in increasing order • Warp 0 starts with Thread ID 0 • Partitioning is always the same • Thus you can use this knowledge in control flow • However, the exact size of warps may change from generation to generation • (details will be covered next) • However, DO NOT rely on any ordering between warps (independent) • If there are any dependencies between threads, you must __syncthreads() to get correct results

  6. How thread blocks are partitioned • Example: a 4x4 thread block split into 2 warps of 8 threads each • Warp 0: Threads (0,0), (1,0), (2,0), (3,0), (0,1), (1,1), (2,1), (3,1) • Warp 1: Threads (0,2), (1,2), (2,2), (3,2), (0,3), (1,3), (2,3), (3,3)
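
A small device-side sketch (not from the slides; the helper name is hypothetical) of how a 2-D thread index is linearized to determine warp membership. On real G80-class hardware warpSize is 32; the 8-thread warps in the example above are purely for illustration.

    // Hypothetical device helper: map a 2-D thread index to its warp within the block.
    __device__ void whichWarp(unsigned int *warpId, unsigned int *lane) {
        unsigned int linearId = threadIdx.y * blockDim.x + threadIdx.x;  // row-major linear thread ID
        *warpId = linearId / warpSize;   // warp number within the block (warpSize is a CUDA built-in)
        *lane   = linearId % warpSize;   // position of the thread inside its warp
    }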

  7. Control Flow Instructions • Main performance concern with branching is divergence • Threads within a single warp take different paths • Different execution paths are serialized in G80 • The control paths taken by the threads in a warp are traversed one at a time until there are no more • A common case: avoid divergence when the branch condition is a function of thread ID • Example with divergence: • if (threadIdx.x > 2) { } • This creates two different control paths for threads in a block • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp • Example without divergence: • if (threadIdx.x / WARP_SIZE > 2) { } • Also creates two different control paths for threads in a block • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
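
A minimal kernel sketch (hypothetical kernel and array names, not part of the slides) contrasting the two branch conditions above:

    // Hypothetical kernel contrasting a divergent and a non-divergent branch.
    #define WARP_SIZE 32

    __global__ void branchExamples(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Divergent: within each block's first warp, threads 0-2 and threads 3-31
        // take different paths, so the warp serializes both paths.
        if (threadIdx.x > 2)  data[i] *= 2.0f;
        else                  data[i] += 1.0f;

        // Non-divergent: the condition only changes value at warp boundaries,
        // so all threads of any given warp follow the same path.
        if (threadIdx.x / WARP_SIZE > 2)  data[i] *= 2.0f;
        else                              data[i] += 1.0f;
    }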

  8. Parallel Reduction • Given an array of values, “reduce” them to a single value in parallel • Examples • Sum reduction: sum of all values in the array • Max reduction: maximum of all values in the array • Typical parallel implementation: • Recursively halve the number of threads, add two values per thread • Takes log(n) steps for n elements, requires n/2 threads

  9. A Vector Reduction Example • Sum an array using one thread block • Assume an in-place reduction using shared memory • The original vector is in device global memory • Shared memory is used to hold a partial sum vector • Each iteration brings the partial sum vector closer to the final sum • The final solution will be in element 0

  10. A simple implementation • Assume we have already loaded the array into shared memory, one element per thread

    extern __shared__ float partialSum[];   // one element per thread, sized at kernel launch
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }
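
For context, a minimal sketch (hypothetical kernel and parameter names, not from the slides) of how the array might be loaded into shared memory around the loop above:

    // Hypothetical wrapper kernel: each thread loads one element, then the block reduces in place.
    __global__ void sumReduce(const float *input, float *blockSums) {
        extern __shared__ float partialSum[];                  // blockDim.x floats, sized at launch
        unsigned int t = threadIdx.x;
        partialSum[t] = input[blockIdx.x * blockDim.x + t];    // load from device global memory

        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();                                   // also covers the initial load
            if (t % (2*stride) == 0)
                partialSum[t] += partialSum[t + stride];
        }
        if (t == 0) blockSums[blockIdx.x] = partialSum[0];     // final sum of this block is in element 0
    }
    // Launch sketch: sumReduce<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_sums);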

  11. Vector Reduction with Shared Memory Bank Conflicts • [Diagram: reduction tree over array elements 0-11; iteration 1 forms the pairs 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0..3, 4..7, 8..11; iteration 3 forms 0..7, 8..15]

  12. Vector Reduction with Branch Divergence • [Diagram: the same reduction tree, annotated with the active threads 0, 2, 4, 6, 8, 10 in the first iteration; only the even-indexed threads do useful work]

  13. Some Observations • In each iteration, two control flow paths will be sequentially traversed for each warp • Threads that perform addition and threads that do not • Threads that do not perform addition may still cost extra cycles, depending on the implementation of divergence • No more than half of the threads will be executing at any time • All odd-index threads are disabled right from the beginning! • On average, less than ¼ of the threads will be activated for all warps over time • After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence • This can go on for a while, up to 4 more iterations (512/32 = 16 = 2⁴), where each iteration only has one thread activated, until all warps retire

  14. Shortcomings of the implementation • Assume we have already loaded the array into shared memory • BAD: Divergence due to interleaved branch decisions

    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

  15. A better implementation • Assume we have already loaded the array into shared memory • Compute in adjacent threads

    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t+stride];
    }

  16. No divergence until < 16 sub-sums remain • [Diagram: for a 32-element example, iteration 1 has threads 0-15 compute 0+16, 1+17, ..., 15+31; the active threads stay contiguous, and later iterations keep halving that contiguous range]

  17. Some Observations About the New Implementation • Only the last 5 iterations will have divergence • Entire warps will be shut down as iterations progress • For a 512-thread block, 4 iterations shut down all but one warp in each block • Better resource utilization; will likely retire warps, and thus blocks, faster • Recall: no bank conflicts either
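
A small host-side illustration (not from the slides) of these counts for a 512-thread block; it prints how many threads and warps remain active at each stride of the improved loop:

    /* Illustrative host program: active threads/warps per iteration of the improved reduction. */
    #include <stdio.h>

    int main(void) {
        const int blockSize = 512, warpSize = 32;
        for (int stride = blockSize / 2; stride > 0; stride >>= 1) {
            int activeThreads = stride;                            /* threads with t < stride */
            int activeWarps   = (stride + warpSize - 1) / warpSize;
            printf("stride %3d: %3d active threads, %2d active warp(s)%s\n",
                   stride, activeThreads, activeWarps,
                   stride < warpSize ? "  <- divergence inside the last warp" : "");
        }
        return 0;
    }

Strides 256 down to 32 are the 4 iterations that progressively shut warps down; the 5 strides below 32 are the ones with intra-warp divergence.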

  18. A Potential Further Refinement, but a Bad Idea • For the last 6 loop iterations only one warp is active (i.e. tid 0..31) • Shared memory reads and writes are SIMD-synchronous within a warp • So skip __syncthreads() and unroll the last 6 iterations

    unsigned int tid = threadIdx.x;
    for (unsigned int d = n >> 1; d > 32; d >>= 1) {   // n = elements handled by the block (e.g., blockDim.x)
        __syncthreads();
        if (tid < d)
            shared[tid] += shared[tid + d];
    }
    __syncthreads();
    if (tid < 32) {   // unroll last 6 predicated steps
        shared[tid] += shared[tid + 32];
        shared[tid] += shared[tid + 16];
        shared[tid] += shared[tid + 8];
        shared[tid] += shared[tid + 4];
        shared[tid] += shared[tid + 2];
        shared[tid] += shared[tid + 1];
    }

This would not work properly if the warp size decreases; we would then need a __syncthreads() between each statement, but placing __syncthreads() inside a divergent if statement is problematic.

  19. Predicated Execution Concept • Handling Branch divergence • <p1> LDR r1,r2,0 • If p1 is TRUE, instruction executes normally • If p1 is FALSE, instruction treated as NOP

  20. Predication Example

    Source code:
        if (x == 10)
            c = c + 1;

    Predicated code:
        LDR r5, X
        p1 <- r5 eq 10
        <p1> LDR r1 <- C
        <p1> ADD r1, r1, 1
        <p1> STR r1 -> C

  21. Predication is very helpful for if-else • [Diagram: control-flow graph in which block A branches to either B or C, with both paths rejoining at D; with predication the four blocks A, B, C, D are issued as one straight-line sequence]

  22. If-else example

    Predicated code:
        p1, p2 <- r5 eq 10
        <p1> inst 1 from B
        <p1> inst 2 from B
        <p1> ...
        <p2> inst 1 from C
        <p2> inst 2 from C
        ...

    After scheduling (B and C interleaved):
        p1, p2 <- r5 eq 10
        <p1> inst 1 from B
        <p2> inst 1 from C
        <p1> inst 2 from B
        <p2> inst 2 from C
        ...

The cost is that extra instructions are issued each time the code is executed; however, there is no branch divergence.

  23. Instruction Predication in G80 • Comparison instructions set condition codes (CC) • Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.) • Compiler tries to predict if a branch condition is likely to produce many divergent warps • If guaranteed not to diverge: only predicates if < 4 instructions • If not guaranteed: only predicates if < 7 instructions • May replace branches with instruction predication • ALL predicated instructions take execution cycles • Those with false conditions don’t write their output • Or invoke memory loads and stores • Saves branch instructions, so can be cheaper than serializing divergent paths

  24. For more information on instruction predication • “A Comparison of Full and Partial Predicated Execution Support for ILP Processors,” S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu, Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 138-150. http://www.crhc.uiuc.edu/IMPACT/ftp/conference/isca-95-partial-pred.pdf • Also available in Readings in Computer Architecture, edited by Hill, Jouppi, and Sohi, Morgan Kaufmann, 2000

  25. Performance with Prefetch and Unroll

  26. Graph Algorithm on CUDA

  27. Graph Representation on CUDA • Compact edge list representation: • A long vector of vertices • A long vector of edges, with the edges of vertex i followed by the edges of vertex i+1 • Each entry in the Vertex array points to its starting edge list in the Edge array • Less space required, so larger graphs can be accommodated in GPU memory • Space complexity reduced from O(V²) to O(V+E) • [Diagram: a small example graph; its Vertex array holds offsets (0, 4, 6, 9, 12, ...) into the Edge array of neighbor lists, indexed 0-15]
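
A minimal sketch (hypothetical type and field names, not from the paper) of how this compact edge-list layout might be declared; the extra end sentinel in the vertex array is an assumption made for convenient neighbor iteration:

    // Hypothetical CSR-style container for the compact edge-list representation.
    // vertices[i] .. vertices[i+1]-1 index the edges of vertex i inside edges[].
    struct GraphCSR {
        int numVertices;
        int numEdges;
        int *vertices;   // numVertices + 1 offsets into edges[] (last entry = numEdges)
        int *edges;      // numEdges destination vertex IDs
    };

    // Iterating over the neighbors of vertex v:
    //   for (int e = g.vertices[v]; e < g.vertices[v + 1]; ++e) { int nbr = g.edges[e]; ... }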

  28. Breadth First Search: An Example • Problem: find the smallest number of edges needed to reach every vertex from a given source vertex • Properties: • Proceeds level by level; once a level is visited it is not visited again (like an explosion outward from the source) • The CPU (host) can be used for level synchronization • The GPU is used to exploit intra-level parallelism • Each vertex updates the cost (level) of its neighbors • Synchronization issues: no synchronization is required within a level, as multiple writes of the same level value do not cause a problem

  29. Breadth First Search (BFS) Example • Given G(V,E) and a source vertex s, compute the number of steps needed to reach every other vertex • Each thread computes one vertex • Initially all vertices are inactive except the source vertex • When a vertex is activated, it is visited and it activates its unvisited neighbors • n-1 steps are needed to reach a vertex visited in the nth iteration • Keep iterating until there are no active vertices • Synchronization is needed after each iteration (level) • [Diagram: legend Inactive / Active / Visited; the example graph with vertices S, A, B, C, D, E drawn after each iteration: S is visited in the 1st iteration, A, B, C in the 2nd, and D, E in the 3rd, after which the search is done]

  30. Breadth First Search CUDA Implementation Details • One thread per vertex (for small graph sizes) • Two Boolean arrays, Frontier and Visited, are used • Each thread looks at its entry in the Frontier array • If its vertex is present in the Frontier, it updates the cost (level) of its neighbors • It adds its neighbors to the Frontier if they are not already present in the Visited array • It adds its neighbors to the Visited array • The CPU initiates each kernel execution • Execution stops when the Frontier is empty • [Diagram: example graph with vertices labeled by their level (1, 2, ...) around the source S, alongside the Frontier and Visited flag arrays; execution stops once the Frontier array is empty]
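
A minimal kernel sketch of one BFS level (hypothetical kernel and array names; the actual Harish et al. implementation may differ in details):

    // Hypothetical kernel for one BFS level: one thread per vertex.
    // Newly reached vertices are recorded in nextFrontier; the host clears and swaps
    // the two frontier arrays between launches, so each launch handles one level.
    __global__ void bfsLevel(const int *vertices, const int *edges, int numVertices,
                             const bool *frontier, bool *nextFrontier,
                             bool *visited, int *cost) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v < numVertices && frontier[v]) {
            for (int e = vertices[v]; e < vertices[v + 1]; ++e) {
                int nbr = edges[e];
                if (!visited[nbr]) {
                    cost[nbr] = cost[v] + 1;   // benign race: all writers store the same level
                    visited[nbr] = true;
                    nextFrontier[nbr] = true;  // neighbor becomes active in the next iteration
                }
            }
        }
    }
    // Host loop sketch: clear nextFrontier, launch bfsLevel, swap frontier/nextFrontier,
    // and stop once the new frontier is empty.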

  31. Breadth First Search Results • P. Harish et al., IIIT Hyderabad, HiPC’07 • [Result plots: random graphs with average degree 6 per vertex; random scale-free graphs with 0.1% high-degree vertices; real-world data with average degree 2-3; 100K-vertex random graphs with varying degree per vertex]
