
Intermediate GPGPU Programming in CUDA



  1. Intermediate GPGPU Programming in CUDA Supada Laosooksathit

  2. NVIDIA Hardware Architecture • [Diagram: host memory and the GPU device]

  3. Recall • 5 steps for CUDA Programming • Initialize device • Allocate device memory • Copy data to device memory • Execute kernel • Copy data back from device memory
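A minimal sketch mapping the five steps onto CUDA runtime calls (the kernel body, array size, and names here are illustrative, not from the slides):

#include <cuda_runtime.h>

__global__ void kernel(float *d_data) { /* device work goes here (step 4) */ }

int main(void) {
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float h_data[N] = {0.0f};

    cudaSetDevice(0);                                            // 1. initialize device
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);                         // 2. allocate device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // 3. copy data to device
    kernel<<<N / 256, 256>>>(d_data);                            // 4. execute kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 5. copy data back
    cudaFree(d_data);
    return 0;
}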

  4. Initialize Device Calls • To select the device associated with the host thread • cudaSetDevice(device) • This must be called before any __global__ function launch; otherwise device 0 is selected automatically. • To get the number of devices • cudaGetDeviceCount(&deviceCount) • To retrieve a device's properties • cudaGetDeviceProperties(&deviceProp, device)
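A short sketch of these calls together (printing each device's name and compute capability is illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);            // number of CUDA-capable devices

    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);
        printf("Device %d: %s, compute capability %d.%d\n",
               device, deviceProp.name, deviceProp.major, deviceProp.minor);
    }

    cudaSetDevice(0);   // select device 0 for this host thread; must precede kernel launches
    return 0;
}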

  5. Hello World Example • Allocate host and device memory

  6. Hello World Example • Host code

  7. Hello World Example • Kernel code
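Slides 5-7 showed their code as images. A minimal self-contained reconstruction, assuming a kernel that prints its block and thread IDs via device-side printf (which requires compute capability 2.x or later); the original may instead have staged a message through allocated host and device buffers:

#include <cuda_runtime.h>
#include <stdio.h>

// Kernel code: each thread prints its block and thread ID
__global__ void helloWorld(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Host code
int main(void) {
    cudaSetDevice(0);
    helloWorld<<<2, 4>>>();     // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait so the device-side printf output is flushed
    return 0;
}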

  8. To Try CUDA Programming • SSH to 138.47.102.111 • Set environment variables in .bashrc in your home directory: export PATH=$PATH:/usr/local/cuda/bin export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH • Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK • Compile the following directories • NVIDIA_GPU_Computing_SDK/shared/ • NVIDIA_GPU_Computing_SDK/C/common/ • The sample programs are in NVIDIA_GPU_Computing_SDK/C/src/

  9. Demo • Hello World • Print out block and thread IDs • Vector Add • C = A + B
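A sketch of the vector-add kernel from the demo (names are illustrative; the host side follows the same five-step pattern shown earlier):

// Kernel: one thread per element computes C[i] = A[i] + B[i]
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard threads past the end of the arrays
        C[i] = A[i] + B[i];
}

It would be launched as, for example, vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n); so that every element gets a thread even when n is not a multiple of the block size.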

  10. NVIDIA Hardware Architecture • [Diagram: streaming multiprocessor (SM)]

  11. Specifications of a Device • For more details • deviceQuery in CUDA SDK • Appendix F in Programming Guide 4.0

  12. Demo • deviceQuery • Shows the hardware specifications in detail

  13. Memory Optimizations • Reduce the time of memory transfer between host and device • Use asynchronous memory transfer (CUDA streams) • Use zero copy • Reduce the number of transactions between on-chip and off-chip memory • Memory coalescing • Avoid bank conflicts in shared memory

  14. Reduce Time of Host-Device Memory Transfer • Regular memory transfer (synchronous)

  15. Reduce Time of Host-Device Memory Transfer • CUDA streams • Allow overlapping kernel execution with memory copies

  16. CUDA Streams Example

  17. CUDA Streams Example
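Slides 16-17 showed the streams example as images. A sketch of the usual pattern, assuming page-locked host memory (required for truly asynchronous copies) and a hypothetical per-element kernel: the input is split into chunks, and each chunk's copy-in, kernel, and copy-out are issued to its own stream, so copies in one stream can overlap kernel execution in another on devices that support it:

#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // illustrative per-element work
}

int main(void) {
    const int N = 1 << 20, nStreams = 2, chunk = N / nStreams;
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, N * sizeof(float));   // page-locked host buffer
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream handles one chunk: async copy in, kernel, async copy out
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}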

  18. GPU Timers • CUDA Events • An API that records timestamps on the GPU clock • Accurate for timing kernel executions • CUDA timer calls • Timer libraries implemented in the CUDA SDK

  19. CUDA Events Example
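Slide 19's example was an image. A sketch of the standard event-timing pattern (the empty kernel is a placeholder for whatever is being timed):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel(void) { }   // placeholder kernel to time

int main(void) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);     // timestamp before the kernel (stream 0)
    kernel<<<1, 256>>>();
    cudaEventRecord(stop, 0);      // timestamp after the kernel
    cudaEventSynchronize(stop);    // wait until the stop event has actually been recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}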

  20. Demo • simpleStreams

  21. Reduce Time of Host-Device Memory Transfer • Zero copy • Allows device pointers to access page-locked host memory directly • Page-locked host memory is allocated by cudaHostAlloc()
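A sketch of the zero-copy pattern (the increment kernel is illustrative): the host buffer is allocated as mapped page-locked memory and the kernel accesses it through a device alias, so no explicit cudaMemcpy is needed:

#include <cuda_runtime.h>

__global__ void increment(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;   // reads and writes go over the bus to host memory
}

int main(void) {
    const int N = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must be set before the context is created

    float *h_ptr, *d_ptr;
    cudaHostAlloc((void **)&h_ptr, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);   // device alias of the same buffer

    increment<<<(N + 255) / 256, 256>>>(d_ptr, N);
    cudaDeviceSynchronize();   // results are already visible in h_ptr, no copy back

    cudaFreeHost(h_ptr);
    return 0;
}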

  22. Demo • Zero copy

  23. Reduce Number of On-chip and Off-chip Memory Transactions • Threads in a warp access global memory together • Memory coalescing • The hardware services a whole group of words in a single transaction

  24. Memory Coalescing • Threads in a warp access global memory in a straightforward way (one 4-byte word per thread, in sequence)

  25. Memory Coalescing • Memory addresses are aligned in the same segment but the accesses are not sequential

  26. Memory Coalescing • Memory addresses are not aligned in the same segment
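Two illustrative access patterns (kernel names are hypothetical, and callers must size the arrays to match): the first coalesces a warp's accesses into few transactions, the second spreads them across many segments:

// Coalesced: consecutive threads read consecutive 4-byte words,
// so a warp's 32 accesses fall in one aligned segment
__global__ void coalesced(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` floats apart,
// so the warp's accesses span many segments and need many transactions
__global__ void strided(const float *in, float *out, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}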

  27. Shared Memory • 16 banks for compute capability 1.x, 32 banks for compute capability 2.x • Helps when staging data to achieve memory coalescing • Bank conflicts may occur • Two or more threads in a warp access the same bank • In compute capability 1.x, no general broadcast • In compute capability 2.x, the same data is broadcast to all threads that request it

  28. Bank Conflicts • [Diagram: no bank conflict (threads 0-3 map to distinct banks 0-3) vs. 2-way bank conflict (two threads map to the same bank)]
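A common way to avoid such conflicts, sketched here on a shared-memory matrix transpose (assuming a square matrix whose width is a multiple of the tile size): pad the shared array by one element per row so that threads reading down a column hit different banks:

#define TILE 32   // matches the 32 banks of compute capability 2.x

__global__ void transposeTile(const float *in, float *out, int width) {
    // The +1 padding shifts each row by one bank, so the column
    // reads below hit 32 different banks instead of the same one
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                   // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // column read, conflict-free
}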

  29. Matrix Multiplication Example

  30. Matrix Multiplication Example • Reduce accesses to global memory • A is read only (B.width/BLOCK_SIZE) times from global memory • B is read only (A.height/BLOCK_SIZE) times
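A sketch of the tiled kernel these slides describe, close to the version in the CUDA C Programming Guide (assuming square n x n matrices with n a multiple of BLOCK_SIZE):

#define BLOCK_SIZE 16

// C = A * B. Each block computes one BLOCK_SIZE x BLOCK_SIZE tile of C;
// tiles of A and B are staged through shared memory, so each global
// element is read n/BLOCK_SIZE times instead of n times.
__global__ void matMul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                       // tile fully loaded before use

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with tile before overwriting it
    }
    C[row * n + col] = sum;
}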

  31. Demo • Matrix Multiplication • With and without shared memory • Different block sizes

  32. Control Flow • if, switch, do, for, while • Branch divergence in a warp • Threads in a warp take different execution paths • The different paths are serialized • Increases the number of instructions executed by that warp

  33. Branch Divergence
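Slide 33's illustration was an image. A small sketch of the difference (kernels and names are illustrative): in the first kernel, lanes of the same warp disagree on the condition, so the warp executes both branches serially; in the second, the condition is uniform within each 32-thread warp, so no warp executes both:

// Divergent: even and odd lanes of the same warp take different paths
__global__ void divergent(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) d[i] *= 2.0f;
    else                      d[i] += 1.0f;
}

// Uniform per warp: the condition is constant across each 32-thread warp
__global__ void uniform(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0) d[i] *= 2.0f;
    else                             d[i] += 1.0f;
}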

  34. Summary • 5 steps for CUDA Programming • NVIDIA Hardware Architecture • Memory hierarchy: global memory, shared memory, register file • Specifications of a device: block, warp, thread, SM

  35. Summary • Memory optimization • Reduce the overhead of host-device memory transfer with CUDA streams and zero copy • Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (with shared memory) • Try to avoid bank conflicts in shared memory • Control flow • Try to avoid branch divergence within a warp

  36. References • http://docs.nvidia.com/cuda/cuda-c-programming-guide/ • http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ • http://www.developer.nvidia.com/cuda-toolkit
