
CUDA Lecture 11 Performance Considerations


Presentation Transcript


  1. CUDA Lecture 11: Performance Considerations Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

  2. Preliminaries • Always measure where your time is going! • Even if you think you know where it is going • Start coarse, go fine-grained as need be • Keep in mind Amdahl’s Law when optimizing any part of your code (see the formula below) • Don’t continue to optimize once a part is only a small fraction of overall execution time Performance Considerations – Slide 2
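For reference (this formula is not on the original slide): if a fraction p of the run time is accelerated by a factor s, Amdahl’s Law bounds the overall speedup at

    S = \frac{1}{(1 - p) + p / s}

For example, if a kernel accounts for 20% of the run time, even an unbounded speedup of that kernel (s → ∞) caps the overall speedup at 1 / 0.8 = 1.25×.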

  3. Outline • Performance Consideration Issues • Memory coalescing • Shared memory bank conflicts • Control-flow divergence • Occupancy • Kernel launch overheads Performance Considerations – Slide 3

  4. Performance Topic A: Memory Coalescing • Off-chip memory is accessed in chunks • Even if you read only a single word • If you don’t use the whole chunk, bandwidth is wasted • Chunks are aligned to multiples of 32/64/128 bytes • Unaligned accesses will cost more • When accessing global memory, peak performance utilization occurs when all threads in a half warp access contiguous memory locations (see the sketch below). Performance Considerations – Slide 4
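As a concrete illustration (hypothetical, not from the slides; kernel and parameter names are assumptions), a kernel in which consecutive threads read consecutive 4-byte words lets a single aligned chunk serve an entire half warp:

    // Hypothetical kernel: thread i reads word i, so a half warp touches one
    // contiguous, aligned span of memory and the access can be coalesced.
    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }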

  5. Memory Layout of a Matrix in C [Figure: a 4×4 matrix M shown as a 2-D grid and as its linearized layout in memory, in the order M0,0 M1,0 M2,0 M3,0, M0,1 M1,1 M2,1 M3,1, M0,2 M1,2 M2,2 M3,2, M0,3 M1,3 M2,3 M3,3.] Performance Considerations – Slide 5

  6. Memory Layout of a Matrix in C [Figure: the access direction in the kernel code for threads T1–T4 over Time Period 1 and Time Period 2, shown against the linearized layout of M.] Performance Considerations – Slide 6

  7. Memory Layout of a Matrix in C [Figure: a second arrangement of the accesses by threads T1–T4 over Time Period 1 and Time Period 2, again shown against the linearized layout of M.] Performance Considerations – Slide 7

  8. Memory Layout of a Matrix in C [Figure: accesses to Md and Nd (matrices of width WIDTH) by Thread 1 and Thread 2; the accesses to Md are labeled “Not coalesced” and the accesses to Nd are labeled “coalesced”.] Performance Considerations – Slide 8

  9. Use Shared Memory to Improve Coalescing [Figure: the original access pattern on Md and Nd versus a tiled access pattern: copy a tile into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.] Performance Considerations – Slide 9
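A minimal sketch of the idea, assuming a 16×16 tile and a matrix width that is a multiple of the tile size (tile size and names are assumptions, not the lecture’s code): the block stages a tile with coalesced row-wise loads, and the computation then reads the scratchpad in whatever order it needs.

    #define TILE 16

    // Hypothetical sketch: stage a tile of Nd in shared memory with coalesced
    // loads, then consume it; shared memory has no coalescing requirement.
    __global__ void stageTile(const float *Nd, float *out, int width)
    {
        __shared__ float tile[TILE][TILE];

        int col = blockIdx.x * TILE + threadIdx.x;
        int row = blockIdx.y * TILE + threadIdx.y;

        tile[threadIdx.y][threadIdx.x] = Nd[row * width + col];  // coalesced global load
        __syncthreads();

        // The real kernel would perform the multiplication here, reading
        // tile[][] in any order without a bandwidth penalty.
        out[row * width + col] = tile[threadIdx.y][threadIdx.x];
    }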

  10. Second Example • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 0 is lowest active, accesses address 116 • 128-byte segment: 0-127 [Figure: address line marked at 32-byte intervals from 0 to 288, with threads t0–t15 at addresses 116–176 and the 128B segment 0–127 highlighted.] Performance Considerations – Slide 10

  11. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 0 is lowest active, accesses address 116 • 128-byte segment: 0-127 (reduce to 64B) [Figure: same address line, with the transaction reduced to a 64B segment.] Performance Considerations – Slide 11

  12. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 0 is lowest active, accesses address 116 • 128-byte segment: 0-127 (reduce to 32B) [Figure: same address line, with the transaction reduced to a 32B segment.] Performance Considerations – Slide 12

  13. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 3 is lowest active, accesses address 128 • 128-byte segment: 128-255 [Figure: same address line, with the 128B segment 128–255 highlighted.] Performance Considerations – Slide 13

  14. Second Example (cont.) • Threads 0-15 access 4-byte words at addresses 116-176 • Thread 3 is lowest active, accesses address 128 • 128-byte segment: 128-255 (reduce to 64B) [Figure: same address line, with the transaction reduced to a 64B segment.] Performance Considerations – Slide 14

  15. Consider the stride of your accesses Performance Considerations – Slide 15
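The slide’s code did not survive the transcript; a representative (hypothetical) strided kernel looks like the following, and its effective bandwidth drops as the stride grows because each half warp spans more memory segments:

    // Hypothetical strided copy: for stride > 1, consecutive threads touch
    // non-adjacent words, so more (and larger) memory transactions are issued.
    __global__ void stridedCopy(float *odata, const float *idata, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        odata[i] = idata[i];
    }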

  16. Example: Array of Structures (AoS) Performance Considerations – Slide 16
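The example code is missing from the transcript; a typical array-of-structures layout (struct and kernel names are assumptions) makes each per-field read strided by the structure size:

    // Hypothetical AoS layout: the fields of one record are adjacent in memory,
    // so consecutive threads reading records[i].x are sizeof(Record) bytes apart.
    struct Record { float x, y, z; };

    __global__ void readAoS(const Record *records, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = records[i].x;   // strided access, hard to coalesce
    }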

  17. Example: Structure of Arrays (SoA) Performance Considerations – Slide 17
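Correspondingly, a structure-of-arrays layout (again an assumed sketch, not the slide’s code) keeps each field contiguous, so consecutive threads read consecutive words:

    // Hypothetical SoA layout: each field is its own contiguous array,
    // so reading x[i] across a half warp is a unit-stride, coalesced access.
    struct Records { float *x, *y, *z; };

    __global__ void readSoA(Records r, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = r.x[i];   // unit stride, coalesced
    }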

  18. Example: SoA versus AoS Performance Considerations – Slide 18

  19. Example: SoA versus AoS (cont.) • Structure of arrays is often better than array of structures • Very clear win on regular, stride 1 access patterns • Unpredictable or irregular access patterns are case-by-case Performance Considerations – Slide 19

  20. Performance Topic B: Shared Memory Bank Conflicts • As seen, each SM has 16 KB of shared memory • 16 banks of 32-bit words (Tesla) • CUDA uses shared memory as shared storage visible to all threads in a thread block • Read and write access • Not used explicitly for pixel shader programs • We dislike pixels talking to each other [Figure: SM block diagram showing the I$ L1, multithreaded instruction buffer, register file, C$ L1, shared memory, operand select, MAD and SFU units.] Performance Considerations – Slide 20

  21. Shared Memory • So shared memory is banked • Only matters for threads within a warp • Full performance with some restrictions • Threads can each access a different bank • Or they can all read the same value (broadcast) • Consecutive words are in different banks • If two or more threads access the same bank but different values, you get bank conflicts Performance Considerations – Slide 21

  22. Details: Parallel Memory Architecture • In a parallel machine, many threads access memory • Therefore, memory is divided into banks • Essential to achieve high bandwidth • Each bank can service one address per cycle • A memory can service as many simultaneous accesses as it has banks • Multiple simultaneous accesses to a bank result in a bank conflict • Conflicting accesses are serialized [Figure: shared memory drawn as a column of banks, Bank 0 through Bank 15.] Performance Considerations – Slide 22

  23. Bank Addressing Examples • No bank conflicts: linear addressing, stride == 1 (Thread n → Bank n) • No bank conflicts: random 1:1 permutation (each thread maps to a distinct bank) [Figure: Threads 0–15 mapped onto Banks 0–15 for both cases.] Performance Considerations – Slide 23

  24. Bank Addressing Examples (cont.) • Two-way bank conflicts: linear addressing, stride == 2 (pairs of threads map to the same bank) • Eight-way bank conflicts: linear addressing, stride == 8 (groups of eight threads map to the same bank) [Figure: Threads 0–15 mapped onto Banks 0–15 for strides 2 and 8.] Performance Considerations – Slide 24

  25. How addresses map to banks on G80 • Each bank has a bandwidth of 32 bits per clock cycle • Successive 32-bit words are assigned to successive banks • G80 has 16 banks • So bank = (32-bit word address) % 16 • Same as the size of a half-warp • No bank conflicts between different half-warps, only within a single half-warp Performance Considerations – Slide 25

  26. Shared memory bank conflicts • Shared memory is as fast as registers if there are no bank conflicts • The fast case: • If all threads of a half-warp access different banks, there is no bank conflict • If all threads of a half-warp access the identical address, there is no bank conflict (broadcast) • The slow case: • Bank conflict: multiple threads in the same half-warp access the same bank • Must serialize the accesses • Cost = max # of simultaneous accesses to a single bank Performance Considerations – Slide 26

  27. Trick to Assess Impact On Performance • Change all shared memory reads to the same value • All broadcasts = no conflicts • Will show how much performance could be improved by eliminating bank conflicts • The same doesn’t work for shared memory writes • So, replace shared memory array indices with threadIdx.x • Can also be done to the reads Performance Considerations – Slide 27

  28. Linear Addressing • Given an access with stride s (see the reconstructed snippet below): • This is only bank-conflict-free if s shares no common factors with the number of banks • 16 on G80, so s must be odd Performance Considerations – Slide 28
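The “given” code is not in the transcript; the access pattern being analyzed is presumably of this form (a reconstruction; the array and variable names are assumptions):

    __shared__ float shared[256];
    // Inside the kernel, each thread reads with a stride of s 32-bit words:
    float foo = shared[baseIndex + s * threadIdx.x];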

  29. Linear Addressing Examples [Figure: thread-to-bank mappings for s = 1 and s = 3; in both cases Threads 0–15 map one-to-one onto Banks 0–15, so there are no conflicts.] Performance Considerations – Slide 29

  30. Additional “memories” • texture and __constant__ • Read-only • Data resides in global memory • Different read path: • includes specialized caches Performance Considerations – Slide 30

  31. Constant Memory • Data stored in global memory, read through a constant-cache path • __constant__ qualifier in declarations • Can only be read by GPU kernels • Limited to 64 KB • To be used when all threads in a warp read the same address • Serializes otherwise • Throughput: • 32 bits per warp per clock per multiprocessor Performance Considerations – Slide 31
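A small usage sketch (hypothetical; the coefficient table and kernel are assumptions), where every thread in a warp reads the same constant address so the value is broadcast:

    // Hypothetical constant-memory use: all threads read the same address,
    // so the value is broadcast through the constant cache.
    __constant__ float coeffs[16];

    __global__ void applyCoeffs(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[0] * in[i] + coeffs[1];
    }

    // Host side (assumed): cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));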

  32. Constants • Immediate address constants • Indexed address constants • Constants stored in DRAM, and cached on chip • L1 per SM • A constant value can be broadcast to all threads in a warp • Extremely efficient way of accessing a value that is common for all threads in a block! [Figure: SM block diagram highlighting the constant cache (C$ L1) alongside the register file, shared memory, operand select, MAD and SFU units.] Performance Considerations – Slide 32

  33. Performance Topic C: Control Flow Divergence • Objectives • To understand the implications of control flow on • Branch divergence overhead • SM execution resource utilization • To learn better ways to write code with control flow • To understand compiler/HW predication designed to reduce the impact of control flow • There is a cost involved. Performance Considerations – Slide 33

  34. Quick terminology review • Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads) • The unit of parallelism in CUDA • Warp: a group of threads executed physically in parallel in G80 • Block: a group of threads that are executed together and form the unit of resource assignment • Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect Performance Considerations – Slide 34

  35. How thread blocks are partitioned • Thread blocks are partitioned into warps with instructions issued per 32 threads (warp) • Thread IDs within a warp are consecutive and increasing • Warp 0 starts with Thread ID 0 • Partitioning is always the same • Thus you can use this knowledge in control flow • The exact size of warps may change from generation to generation Performance Considerations – Slide 35

  36. How thread blocks are partitioned (cont.) • However, DO NOT rely on any ordering between warps • If there are any dependencies between threads, you must __syncthreads() to get correct results Performance Considerations – Slide 36

  37. Control Flow Instructions • Main performance concern with branching is divergence • Threads within a single warp take different paths • if-else, ... • Different execution paths within a warp are serialized in G80 • The control paths taken by the threads in a warp are traversed one at a time until there are no more • Different warps can execute different code with no impact on performance Performance Considerations – Slide 37

  38. Control Flow Divergence (cont.) • A common case: avoid diverging within a warp, i.e. when the branch condition is a function of thread ID • Example with divergence (reconstructed below): • This creates two different control paths for threads in a block • Branch granularity < warp size; threads 0 and 1 follow a different path than the rest of the threads in the first warp Performance Considerations – Slide 38
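The example code is missing from the transcript; a snippet consistent with the description (threads 0 and 1 of the first warp taking a different path from the rest) would be:

    // Hypothetical divergent branch: the condition depends directly on threadIdx.x,
    // so threads 0-1 and threads 2-31 of the first warp take different paths.
    if (threadIdx.x > 1) {
        // ... path taken by most threads of the first warp
    } else {
        // ... path taken by threads 0 and 1
    }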

  39. Control Flow Divergence (cont.) • A common case: avoid diverging within a warp, i.e. when branch condition is a function of thread ID • Example without divergence: • Also creates two different control paths for threads in a block • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path Performance Considerations – Slide 39
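Likewise, the non-divergent version is presumably keyed to warp boundaries (a reconstruction, assuming a warp size of 32):

    // Hypothetical warp-aligned branch: every thread within a given warp
    // evaluates the condition identically, so no warp diverges.
    #define WARP_SIZE 32
    if (threadIdx.x / WARP_SIZE > 1) {
        // ... path taken by warps 2, 3, ...
    } else {
        // ... path taken by warps 0 and 1
    }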

  40. Parallel Reduction • Given an array of values, “reduce” them to a single value in parallel • Examples • Sum reduction: sum of all values in the array • Max reduction: maximum of all values in the array • Typical parallel implementation: • Recursively halve the number of threads, add two values per thread • Takes log(n) steps for n elements, requires n/2 threads Performance Considerations – Slide 40

  41. Example: Divergent Iteration Performance Considerations – Slide 41
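The slide’s code is not in the transcript; a representative kernel (names assumed) in which the loop trip count is data dependent, so threads of one warp iterate different numbers of times, might be:

    // Hypothetical: each thread sums a data-dependent range, so iteration counts
    // differ across a warp and the warp runs until its slowest thread finishes.
    __global__ void perThreadSum(const int *indices, const float *data,
                                 float *sums, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float sum = 0.0f;
        for (int j = indices[i]; j < indices[i + 1]; ++j)
            sum += data[j];
        sums[i] = sum;
    }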

  42. A Vector Reduction Example • Assume an in-place reduction using shared memory • The original vector is in device global memory • Shared memory is used to hold a partial sum vector • Each iteration brings the partial sum vector closer to the final sum • The final solution will be in element 0 Performance Considerations – Slide 42

  43. A simple implementation • Assume we have already loaded the array into __shared__ float partialSum[] Performance Considerations – Slide 43
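The loop body itself is missing from the transcript; the interleaved version described here is conventionally written along these lines (a reconstruction):

    // Reconstruction of the interleaved reduction: thread t adds in its
    // neighbor at distance 'stride', and the stride doubles each iteration.
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();                      // finish the previous iteration first
        if (t % (2 * stride) == 0)            // only every (2*stride)-th thread adds
            partialSum[t] += partialSum[t + stride];
    }
    // partialSum[0] now holds the block's sum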

  44. Vector Reduction with Bank Conflicts [Figure: array elements 0–11 combined over successive iterations: iteration 1 forms 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 combines 0…3, 4…7, 8…11; iteration 3 combines 0…7, 8…15.] Performance Considerations – Slide 44

  45. Vector Reduction with Branch Divergence [Figure: the same iteration pattern with the active threads marked: Threads 0, 2, 4, 6, 8, 10 perform the additions 0+1, 2+3, 4+5, 6+7, 8+9, 10+11 in iteration 1, then 0…3, 4…7, 8…11 in iteration 2, then 0…7, 8…15 in iteration 3.] Performance Considerations – Slide 45

  46. Some Observations • In each iteration, two control flow paths will be sequentially traversed for each warp • Threads that perform addition and threads that do not • Threads that do not perform addition may cost extra cycles depending on the implementation of divergence Performance Considerations – Slide 46

  47. Some Observations (cont.) • No more than half of the threads will be executing at any time • All odd-index threads are disabled right from the beginning! • On average, fewer than ¼ of the threads will be active across all warps over time • After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence • This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each remaining warp has only one active thread, until all warps retire Performance Considerations – Slide 47

  48. Shortcomings of the implementation • Assume we have already loaded the array into __shared__ float partialSum[] (the simple implementation from slide 43) • BAD: divergence due to interleaved branch decisions Performance Considerations – Slide 48

  49. A better implementation • Assume we have already loaded the array into __shared__ float partialSum[] Performance Considerations – Slide 49
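Again the loop body is missing; the revised version conventionally starts the stride at half the block size and halves it each iteration, so the active threads stay contiguous and whole warps retire together (a reconstruction):

    // Reconstruction of the improved reduction: the active threads are always
    // the contiguous range 0..stride-1, so divergence appears only once the
    // number of active threads drops below the warp size.
    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
    // partialSum[0] now holds the block's sum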

  50. Less Divergence than original [Figure: elements 0–31; in iteration 1, Thread 0 through Thread 15 each add the element 16 positions away into their own element (0+16, …, 15+31); the stride then halves in iterations 2, 3 and 4.] Performance Considerations – Slide 50
