
CUDA Lecture 11 Performance Considerations


Presentation Transcript


  1. CUDA Lecture 11: Performance Considerations Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

  2. Preliminaries • Always measure where your time is going! • Even if you think you know where it is going • Start coarse, go fine-grained as need be • Keep in mind Amdahl’s Law when optimizing any part of your code (see the formula below) • Don’t continue to optimize once a part is only a small fraction of overall execution time Performance Considerations – Slide 2
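For reference (this formula is not on the original slide): if a fraction p of the run time is accelerated by a factor s, Amdahl’s Law bounds the overall speedup at

    S = \frac{1}{(1 - p) + p / s}

For example, if a kernel accounts for 20% of the run time, even an unbounded speedup of that kernel (s → ∞) caps the overall speedup at 1 / 0.8 = 1.25×.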

  3. Outline • Performance Consideration Issues • Memory coalescing • Shared memory bank conflicts • Control-flow divergence • Occupancy • Kernel launch overheads Performance Considerations – Slide 3

  4. Performance Topic A: Memory Coalescing • Off-chip memory is accessed in chunks • Even if you read only a single word • If you don’t use the whole chunk, bandwidth is wasted • Chunks are aligned to multiples of 32/64/128 bytes • Unaligned accesses will cost more • When accessing global memory, peak performance utilization occurs when all threads in a half warp access contiguous memory locations (see the sketch below). Performance Considerations – Slide 4
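As a concrete illustration (hypothetical, not from the slides; kernel and parameter names are assumptions), a kernel in which consecutive threads read consecutive 4-byte words lets a single aligned chunk serve an entire half warp:

    // Hypothetical kernel: thread i reads word i, so a half warp touches one
    // contiguous, aligned span of memory and the access can be coalesced.
    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }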

  5. Memory Layout of a Matrix in C [Figure: a 4×4 matrix M shown as a 2-D grid and as its linearized layout in memory, in the order M0,0 M1,0 M2,0 M3,0, M0,1 M1,1 M2,1 M3,1, M0,2 M1,2 M2,2 M3,2, M0,3 M1,3 M2,3 M3,3.] Performance Considerations – Slide 5

  6. Memory Layout of a Matrix in C [Figure: the access direction in the kernel code for threads T1–T4 over Time Period 1 and Time Period 2, shown against the linearized layout of M.] Performance Considerations – Slide 6

  7. Memory Layout of a Matrix in C [Figure: a second arrangement of the accesses by threads T1–T4 over Time Period 1 and Time Period 2, again shown against the linearized layout of M.] Performance Considerations – Slide 7

  8. Memory Layout of a Matrix in C [Figure: accesses to Md and Nd (matrices of width WIDTH) by Thread 1 and Thread 2; the accesses to Md are labeled “Not coalesced” and the accesses to Nd are labeled “coalesced”.] Performance Considerations – Slide 8

  9. Use Shared Memory to Improve Coalescing [Figure: the original access pattern on Md and Nd versus a tiled access pattern: copy a tile into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.] Performance Considerations – Slide 9
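A minimal sketch of the idea, assuming a 16×16 tile and a matrix width that is a multiple of the tile size (tile size and names are assumptions, not the lecture’s code): the block stages a tile with coalesced row-wise loads, and the computation then reads the scratchpad in whatever order it needs.

    #define TILE 16

    // Hypothetical sketch: stage a tile of Nd in shared memory with coalesced
    // loads, then consume it; shared memory has no coalescing requirement.
    __global__ void stageTile(const float *Nd, float *out, int width)
    {
        __shared__ float tile[TILE][TILE];

        int col = blockIdx.x * TILE + threadIdx.x;
        int row = blockIdx.y * TILE + threadIdx.y;

        tile[threadIdx.y][threadIdx.x] = Nd[row * width + col];  // coalesced global load
        __syncthreads();

        // The real kernel would perform the multiplication here, reading
        // tile[][] in any order without a bandwidth penalty.
        out[row * width + col] = tile[threadIdx.y][threadIdx.x];
    }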

  10. Second Example • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 0 is lowest active, accesses address 116 • 128-byte segment: 0-127 [Figure: address line marked at 32-byte intervals from 0 to 288, with threads t0–t15 at addresses 116–176 and the 128B segment 0–127 highlighted.] Performance Considerations – Slide 10

  11. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 0 is lowest active, accesses address 116 • 128-byte segment: 0-127 (reduce to 64B) [Figure: same address line, with the transaction reduced to a 64B segment.] Performance Considerations – Slide 11

  12. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 0 is lowest active, accesses address 116 • 128-byte segment: 0-127 (reduce to 32B) [Figure: same address line, with the transaction reduced to a 32B segment.] Performance Considerations – Slide 12

  13. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 3 is lowest active, accesses address 128 • 128-byte segment: 128-255 [Figure: same address line, with the 128B segment 128–255 highlighted.] Performance Considerations – Slide 13

  14. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 3 is lowest active, accesses address 128 • 128-byte segment: 128-255 (reduce to 64B) [Figure: same address line, with the transaction reduced to a 64B segment.] Performance Considerations – Slide 14

  15. Consider the stride of your accesses Performance Considerations – Slide 15
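The slide’s code did not survive the transcript; a representative (hypothetical) strided kernel looks like the following, and its effective bandwidth drops as the stride grows because each half warp spans more memory segments:

    // Hypothetical strided copy: for stride > 1, consecutive threads touch
    // non-adjacent words, so more (and larger) memory transactions are issued.
    __global__ void stridedCopy(float *odata, const float *idata, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        odata[i] = idata[i];
    }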

  16. Example: Array of Structures (AoS) Performance Considerations – Slide 16
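The example code is missing from the transcript; a typical array-of-structures layout (struct and kernel names are assumptions) makes each per-field read strided by the structure size:

    // Hypothetical AoS layout: the fields of one record are adjacent in memory,
    // so consecutive threads reading records[i].x are sizeof(Record) bytes apart.
    struct Record { float x, y, z; };

    __global__ void readAoS(const Record *records, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = records[i].x;   // strided access, hard to coalesce
    }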

  17. Example: Structure of Arrays (SoA) Performance Considerations – Slide 17
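Correspondingly, a structure-of-arrays layout (again an assumed sketch, not the slide’s code) keeps each field contiguous, so consecutive threads read consecutive words:

    // Hypothetical SoA layout: each field is its own contiguous array,
    // so reading x[i] across a half warp is a unit-stride, coalesced access.
    struct Records { float *x, *y, *z; };

    __global__ void readSoA(Records r, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = r.x[i];   // unit stride, coalesced
    }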

  18. Example: SoA versus AoS Performance Considerations – Slide 18

  19. Example: SoA versus AoS (cont.) • Structure of arrays is often better than array of structures • Very clear win on regular, stride 1 access patterns • Unpredictable or irregular access patterns are case-by-case Performance Considerations – Slide 19

  20. Performance Topic B: Shared Memory Bank Conflicts • As seen, each SM has 16 KB of shared memory • 16 banks of 32-bit words (Tesla) • CUDA uses shared memory as shared storage visible to all threads in a thread block • Read and write access • Not used explicitly for pixel shader programs • We dislike pixels talking to each other [Figure: SM block diagram showing the I$ L1, multithreaded instruction buffer, register file, C$ L1, shared memory, operand select, MAD and SFU units.] Performance Considerations – Slide 20

  21. Shared Memory • So shared memory is banked • Only matters for threads within a warp • Full performance with some restrictions • Threads can each access a different bank • Or they can all read the same value (broadcast) • Consecutive words are in different banks • If two or more threads access the same bank but different values, you get bank conflicts Performance Considerations – Slide 21

  22. Details: Parallel Memory Architecture • In a parallel machine, many threads access memory • Therefore, memory is divided into banks • Essential to achieve high bandwidth • Each bank can service one address per cycle • A memory can service as many simultaneous accesses as it has banks • Multiple simultaneous accesses to a bank result in a bank conflict • Conflicting accesses are serialized [Figure: shared memory drawn as a column of banks, Bank 0 through Bank 15.] Performance Considerations – Slide 22

  23. Bank Addressing Examples • No bank conflicts: linear addressing, stride == 1 (Thread n → Bank n) • No bank conflicts: random 1:1 permutation (each thread maps to a distinct bank) [Figure: Threads 0–15 mapped onto Banks 0–15 for both cases.] Performance Considerations – Slide 23

  24. Bank Addressing Examples (cont.) • Two-way bank conflicts: linear addressing, stride == 2 (pairs of threads map to the same bank) • Eight-way bank conflicts: linear addressing, stride == 8 (groups of eight threads map to the same bank) [Figure: Threads 0–15 mapped onto Banks 0–15 for strides 2 and 8.] Performance Considerations – Slide 24

  25. How addresses map to banks on G80 • Each bank has a bandwidth of 32 bits per clock cycle • Successive 32-bit words are assigned to successive banks • G80 has 16 banks • So bank = (32-bit word address) % 16 • Same as the size of a half-warp • No bank conflicts between different half-warps, only within a single half-warp Performance Considerations – Slide 25

  26. Shared memory bank conflicts • Shared memory is as fast as registers if there are no bank conflicts • The fast case: • If all threads of a half-warp access different banks, there is no bank conflict • If all threads of a half-warp access the identical address, there is no bank conflict (broadcast) • The slow case: • Bank conflict: multiple threads in the same half-warp access the same bank • Must serialize the accesses • Cost = max # of simultaneous accesses to a single bank Performance Considerations – Slide 26

  27. Trick to Assess Impact On Performance • Change all shared memory reads to the same value • All broadcasts = no conflicts • Will show how much performance could be improved by eliminating bank conflicts • The same doesn’t work for shared memory writes • So, replace shared memory array indices with threadIdx.x • Can also be done to the reads Performance Considerations – Slide 27

  28. Linear Addressing • Given an access with stride s (see the reconstructed snippet below): • This is only bank-conflict-free if s shares no common factors with the number of banks • 16 on G80, so s must be odd Performance Considerations – Slide 28
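The “given” code is not in the transcript; the access pattern being analyzed is presumably of this form (a reconstruction; the array and variable names are assumptions):

    __shared__ float shared[256];
    // Inside the kernel, each thread reads with a stride of s 32-bit words:
    float foo = shared[baseIndex + s * threadIdx.x];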

  29. Linear Addressing Examples [Figure: thread-to-bank mappings for s = 1 and s = 3; in both cases Threads 0–15 map one-to-one onto Banks 0–15, so there are no conflicts.] Performance Considerations – Slide 29

  30. Additional “memories” • texture and __constant__ • Read-only • Data resides in global memory • Different read path: • includes specialized caches Performance Considerations – Slide 30

  31. Constant Memory • Data stored in global memory, read through a constant-cache path • __constant__ qualifier in declarations • Can only be read by GPU kernels • Limited to 64 KB • To be used when all threads in a warp read the same address • Serializes otherwise • Throughput: • 32 bits per warp per clock per multiprocessor Performance Considerations – Slide 31
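A small usage sketch (hypothetical; the coefficient table and kernel are assumptions), where every thread in a warp reads the same constant address so the value is broadcast:

    // Hypothetical constant-memory use: all threads read the same address,
    // so the value is broadcast through the constant cache.
    __constant__ float coeffs[16];

    __global__ void applyCoeffs(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[0] * in[i] + coeffs[1];
    }

    // Host side (assumed): cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));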

  32. Constants • Immediate address constants • Indexed address constants • Constants stored in DRAM, and cached on chip • L1 per SM • A constant value can be broadcast to all threads in a warp • Extremely efficient way of accessing a value that is common for all threads in a block! [Figure: SM block diagram highlighting the constant cache (C$ L1) alongside the register file, shared memory, operand select, MAD and SFU units.] Performance Considerations – Slide 32

  33. Performance Topic C: Control Flow Divergence • Objectives • To understand the implications of control flow on • Branch divergence overhead • SM execution resource utilization • To learn better ways to write code with control flow • To understand compiler/HW predication designed to reduce the impact of control flow • There is a cost involved. Performance Considerations – Slide 33

  34. Quick terminology review • Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads) • The unit of parallelism in CUDA • Warp: a group of threads executed physically in parallel in G80 • Block: a group of threads that are executed together and form the unit of resource assignment • Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect Performance Considerations – Slide 34

  35. How thread blocks are partitioned • Thread blocks are partitioned into warps with instructions issued per 32 threads (warp) • Thread IDs within a warp are consecutive and increasing • Warp 0 starts with Thread ID 0 • Partitioning is always the same • Thus you can use this knowledge in control flow • The exact size of warps may change from generation to generation Performance Considerations – Slide 35

  36. How thread blocks are partitioned (cont.) • However, DO NOT rely on any ordering between warps • If there are any dependencies between threads, you must __syncthreads() to get correct results Performance Considerations – Slide 36

  37. Control Flow Instructions • Main performance concern with branching is divergence • Threads within a single warp take different paths • if-else, ... • Different execution paths within a warp are serialized in G80 • The control paths taken by the threads in a warp are traversed one at a time until there are no more • Different warps can execute different code with no impact on performance Performance Considerations – Slide 37

  38. Control Flow Divergence (cont.) • A common case: avoid diverging within a warp, i.e. when the branch condition is a function of thread ID • Example with divergence (reconstructed below): • This creates two different control paths for threads in a block • Branch granularity < warp size; threads 0 and 1 follow a different path than the rest of the threads in the first warp Performance Considerations – Slide 38
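The example code is missing from the transcript; a snippet consistent with the description (threads 0 and 1 of the first warp taking a different path from the rest) would be:

    // Hypothetical divergent branch: the condition depends directly on threadIdx.x,
    // so threads 0-1 and threads 2-31 of the first warp take different paths.
    if (threadIdx.x > 1) {
        // ... path taken by most threads of the first warp
    } else {
        // ... path taken by threads 0 and 1
    }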

  39. Control Flow Divergence (cont.) • A common case: avoid diverging within a warp, i.e. when branch condition is a function of thread ID • Example without divergence: • Also creates two different control paths for threads in a block • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path Performance Considerations – Slide 39
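Likewise, the non-divergent version is presumably keyed to warp boundaries (a reconstruction, assuming a warp size of 32):

    // Hypothetical warp-aligned branch: every thread within a given warp
    // evaluates the condition identically, so no warp diverges.
    #define WARP_SIZE 32
    if (threadIdx.x / WARP_SIZE > 1) {
        // ... path taken by warps 2, 3, ...
    } else {
        // ... path taken by warps 0 and 1
    }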

  40. Parallel Reduction • Given an array of values, “reduce” them to a single value in parallel • Examples • Sum reduction: sum of all values in the array • Max reduction: maximum of all values in the array • Typical parallel implementation: • Recursively halve the number of threads, add two values per thread • Takes log(n) steps for n elements, requires n/2 threads Performance Considerations – Slide 40

  41. Example: Divergent Iteration Performance Considerations – Slide 41
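The slide’s code is not in the transcript; a representative kernel (names assumed) in which the loop trip count is data dependent, so threads of one warp iterate different numbers of times, might be:

    // Hypothetical: each thread sums a data-dependent range, so iteration counts
    // differ across a warp and the warp runs until its slowest thread finishes.
    __global__ void perThreadSum(const int *indices, const float *data,
                                 float *sums, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float sum = 0.0f;
        for (int j = indices[i]; j < indices[i + 1]; ++j)
            sum += data[j];
        sums[i] = sum;
    }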

  42. A Vector Reduction Example • Assume an in-place reduction using shared memory • The original vector is in device global memory • Shared memory is used to hold a partial sum vector • Each iteration brings the partial sum vector closer to the final sum • The final solution will be in element 0 Performance Considerations – Slide 42

  43. A simple implementation • Assume we have already loaded the array into __shared__ float partialSum[] Performance Considerations – Slide 43
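The loop body itself is missing from the transcript; the interleaved version described here is conventionally written along these lines (a reconstruction):

    // Reconstruction of the interleaved reduction: thread t adds in its
    // neighbor at distance 'stride', and the stride doubles each iteration.
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();                      // finish the previous iteration first
        if (t % (2 * stride) == 0)            // only every (2*stride)-th thread adds
            partialSum[t] += partialSum[t + stride];
    }
    // partialSum[0] now holds the block's sum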

  44. Vector Reduction with Bank Conflicts [Figure: array elements 0–11 combined over successive iterations: iteration 1 forms 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 combines 0…3, 4…7, 8…11; iteration 3 combines 0…7, 8…15.] Performance Considerations – Slide 44

  45. Vector Reduction with Branch Divergence [Figure: the same iteration pattern with the active threads marked: Threads 0, 2, 4, 6, 8, 10 perform the additions 0+1, 2+3, 4+5, 6+7, 8+9, 10+11 in iteration 1, then 0…3, 4…7, 8…11 in iteration 2, then 0…7, 8…15 in iteration 3.] Performance Considerations – Slide 45

  46. Some Observations • In each iteration, two control flow paths will be sequentially traversed for each warp • Threads that perform addition and threads that do not • Threads that do not perform addition may cost extra cycles depending on the implementation of divergence Performance Considerations – Slide 46

  47. Some Observations (cont.) • No more than half of the threads will be executing at any time • All odd-index threads are disabled right from the beginning! • On average, fewer than ¼ of the threads will be active across all warps over time • After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence • This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each remaining warp has only one active thread, until all warps retire Performance Considerations – Slide 47

  48. Shortcomings of the implementation • Assume we have already loaded the array into __shared__ float partialSum[] (the simple implementation from slide 43) • BAD: divergence due to interleaved branch decisions Performance Considerations – Slide 48

  49. A better implementation • Assume we have already loaded the array into __shared__ float partialSum[] Performance Considerations – Slide 49
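Again the loop body is missing; the revised version conventionally starts the stride at half the block size and halves it each iteration, so the active threads stay contiguous and whole warps retire together (a reconstruction):

    // Reconstruction of the improved reduction: the active threads are always
    // the contiguous range 0..stride-1, so divergence appears only once the
    // number of active threads drops below the warp size.
    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
    // partialSum[0] now holds the block's sum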

  50. Less Divergence than original [Figure: elements 0–31; in iteration 1, Thread 0 through Thread 15 each add the element 16 positions away into their own element (0+16, …, 15+31); the stride then halves in iterations 2, 3 and 4.] Performance Considerations – Slide 50
