



  1. EECE571R -- Harnessing Massively Parallel Processors
  http://www.ece.ubc.ca/~matei/EECE571/
  Lecture 1: Introduction to GPU Programming
  By Samer Al-Kiswany
  Acknowledgement: some slides borrowed from presentations by Kayvon Fatahalian and Mark Harris

  2. Outline: Hardware • Software • Programming Model • Optimizations

  3.–10. GPU Architecture Intuition (a sequence of figure-only slides; the diagrams are not preserved in this transcript)

  11. GPU Architecture (diagram). The host machine connects to a GPU containing N multiprocessors. Each multiprocessor holds an instruction unit, a shared memory, and M processors, each with its own registers. All multiprocessors share the GPU's constant memory, texture memory, and global memory.

  12. GPU Architecture
  • SIMD architecture.
  • Four memories:
    • Device (a.k.a. global): slow (400–600 cycles access latency), large (256 MB – 1 GB).
    • Shared: fast (4 cycles access latency), small (16 KB).
    • Texture: read only.
    • Constant: read only.
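As a rough illustration of how three of these memory spaces appear in CUDA source (the kernel, names, and sizes below are illustrative assumptions, not from the slides; texture memory is bound through a separate host-side API and is omitted):

```cuda
// Constant memory: small, read-only from kernels, cached (illustrative array).
__constant__ float c_coeffs[16];

__global__ void scale(float *data, int n) {
    // Shared memory: fast (~4-cycle latency) but small (16 KB per multiprocessor).
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = data[i];   // 'data' lives in slow, large global memory
    __syncthreads();                          // every thread in the block reaches the barrier
    if (i < n) data[i] = tile[threadIdx.x] * c_coeffs[0];
}
```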

  13. GPU Architecture – Program Flow
  A GPU program runs in five stages: (1) preprocessing, (2) data transfer in (host to GPU), (3) GPU processing, (4) data transfer out (GPU to host), (5) postprocessing. The total runtime is the sum of the five stage times:
  T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
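A minimal host-side sketch of stages 2–4, assuming an illustrative kernel and sizes (none of these names come from the slides):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void kernel(float *d, int n) { /* stage 3: GPU processing */ }

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h = (float *)malloc(bytes);
    // ... stage 1: preprocessing fills h on the host ...

    float *d;
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // stage 2: T_DataHtoG

    kernel<<<N / 256, 256>>>(d, N);                    // stage 3: T_Processing

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // stage 4: T_DataGtoH
    // ... stage 5: postprocessing on the host ...

    cudaFree(d);
    free(h);
    return 0;
}
```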

  14. Outline: Hardware • Software • Programming Model • Optimizations

  15. GPU Programming Model. The programming model is the software's representation of the hardware.

  16. GPU Programming Model (diagram: threads grouped into blocks, blocks forming a grid). Kernel: a function executed over the grid.

  17.–18. GPU Programming Model (figure-only slides; diagrams not preserved in this transcript)
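To make the model concrete, here is a minimal, illustrative kernel (not from the slides): each thread derives a unique global index from its block and thread IDs, and the grid of blocks covers the whole array.

```cuda
__global__ void add(float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the grid
    if (i < n) a[i] += b[i];                        // one element per thread
}

// Launched with enough blocks to cover n elements, 256 threads per block:
//   add<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
```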

  19. GPU Programming Model. In reality, the scheduling granularity is a warp (32 threads), so a warp takes 4 cycles to complete a single instruction.

  20. GPU Programming Model
  • In reality, the scheduling granularity is a warp (32 threads), so a warp takes 4 cycles to complete a single instruction.
  • Threads in a block can share state through shared memory.
  • Threads in a block can synchronize.
  • Global atomic operations are available.
  All three block-level features appear in the sketch below.
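A sketch combining the three features, assuming 256-thread blocks (the reduction itself is illustrative, not from the slides; note that atomicAdd on float requires newer hardware than the generation described above):

```cuda
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float partial[256];             // state shared within the block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // block-wide synchronization

    // Tree reduction within the block (assumes blockDim.x == 256).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(out, partial[0]);  // global atomic combines the blocks
}
```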

  21. Outline: Hardware • Software • Programming Model • Optimizations

  22. Optimizations. Optimizations fall roughly into three categories: memory related, computation related, and data transfer related.

  23. Optimizations – Memory
  • Use shared memory.
  • Use texture (1D, 2D, or 3D) and constant memory.
  • Avoid shared memory bank conflicts.
  • Use coalesced memory access (one approach: padding).

  24. Optimizations – Memory: Shared Memory Complications
  Shared memory is organized into 16 banks of 1 KB each, interleaved at 4-byte granularity (addresses 0, 4, 8, ... fall in banks 0, 1, 2, ...).
  • Complication I: concurrent accesses to the same bank are serialized (a bank conflict), slowing execution. Tip: have different threads access different banks.
  • Complication II: banks are interleaved, so consecutive 4-byte words live in consecutive banks.
  The padding fix mentioned on slide 23 is sketched below.
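A sketch of padding for a 16×16 shared-memory tile (an illustrative transpose, not from the slides; assumes a square matrix whose width is a multiple of 16):

```cuda
#define TILE 16

__global__ void transposeTile(float *out, const float *in, int width) {
    // Without the +1, all 16 threads reading a column would hit the same bank
    // (a 16-way conflict). The extra element shifts each row into the next bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, now conflict-free
}
```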

  25. Optimizations – Memory: Global Memory Coalesced Access (figure slide)

  26. Optimizations – Memory: Global Memory Non-Coalesced Access (figure slide). The two access patterns are contrasted in the sketch below.
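An illustrative contrast of the two patterns (not from the slides): in the coalesced kernel, consecutive threads of a warp touch consecutive addresses, which the hardware merges into few transactions; in the strided kernel they do not.

```cuda
__global__ void coalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];        // thread k reads word k: merged into one transaction
}

__global__ void strided(float *out, const float *in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];  // neighbours read far-apart words: serialized
}
```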

  27. Optimizations. Optimizations fall roughly into three categories: memory related, computation related, and data transfer related.

  28. Optimizations – Computation
  • Use thousands of threads to make best use of the GPU hardware.
  • Use full warps (32 threads): make block sizes a multiple of 32.
  • Reduce code branch divergence.
  • Avoid synchronization.
  • Unroll loops (fewer instructions, more room for compiler optimizations); see the sketch below.
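A sketch of the last point: `#pragma unroll` asks the CUDA compiler to unroll a loop with a known trip count (the four-elements-per-thread kernel is illustrative, not from the slides).

```cuda
__global__ void saxpy4(float *y, const float *x, float a, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;  // four elements per thread

    #pragma unroll
    for (int k = 0; k < 4; ++k) {   // unrolled: no loop counter or branch left at runtime
        int i = base + k;
        if (i < n) y[i] = a * x[i] + y[i];
    }
}
```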

  29. Optimizations Can be roughly categorized into the following categories: Memory Related Computation Related Data Transfer Related

  30. Optimizations – Data Transfer
  • Reduce the amount of data transferred between the host and the GPU.
  • Hide transfer overhead by overlapping transfers with computation (asynchronous transfer), as sketched below.
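A sketch of the overlap using pinned memory and streams (chunk count, sizes, and the kernel are illustrative assumptions): work is split into chunks, and each chunk's upload, kernel, and download are queued in their own stream, so one chunk's transfers overlap another chunk's computation.

```cuda
// Assumes a kernel like: __global__ void kernel(float *d, int n);
const int CHUNKS = 4;
const size_t bytes = CHUNKS * 256 * 1024 * sizeof(float);  // illustrative total size
cudaStream_t streams[CHUNKS];
float *h, *d;
cudaMallocHost((void **)&h, bytes);   // pinned host memory, required for true async copies
cudaMalloc((void **)&d, bytes);
const size_t chunkBytes = bytes / CHUNKS;
const int elems = (int)(chunkBytes / sizeof(float));

for (int k = 0; k < CHUNKS; ++k) {
    cudaStreamCreate(&streams[k]);
    size_t off = (size_t)k * elems;
    cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, streams[k]);
    kernel<<<elems / 256, 256, 0, streams[k]>>>(d + off, elems);
    cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, streams[k]);
}
cudaDeviceSynchronize();              // wait for all streams to drain
```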

  31. Summary
  • GPUs are highly parallel devices.
  • Easy to program for (functionality); hard to optimize for (performance).
  • There are many optimizations, but you often do not need them all (iterate between profiling and optimizing).
  • Optimizations may bring hard tradeoffs (more computation vs. less memory, more computation vs. better memory access, etc.).
