
CS 179: Lecture 4 Lab Review 2



  1. CS 179: Lecture 4 Lab Review 2

  2. Groups of Threads (Hierarchy) (largest to smallest)
  • “Grid”:
    • All of the threads
    • Size: (number of threads per block) * (number of blocks)
  • “Block”:
    • Size: User-specified
    • Should at least be a multiple of 32 (often, higher is better)
    • Upper limit given by hardware (512 in Tesla, 1024 in Fermi)
    • Features: shared memory, synchronization
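To make the sizing rules concrete, here is a minimal launch-configuration sketch; the kernel, array names, and sizes below are illustrative assumptions, not code from the lecture:

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *data, int n) {
        // Global thread index = block offset + position within the block.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)                 // the grid may have more threads than n
            data[idx] *= 2.0f;
    }

    int main() {
        int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        int threadsPerBlock = 512;   // a multiple of 32, within the hardware cap
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
        // Grid size = threadsPerBlock * blocks threads in total.
        scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }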

  3. Groups of Threads
  • “Warp”:
    • Group of 32 threads
    • Execute in lockstep (same instructions)
    • Susceptible to divergence!

  4. Divergence “Two roads diverged in a wood… …and I took both”

  5. Divergence
  • What happens:
    • Executes normally until the if-statement
    • Branches to calculate Branch A (blue threads)
    • Goes back (!) and branches to calculate Branch B (red threads)
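A minimal sketch of a kernel that diverges this way (the kernel name and branch bodies are illustrative assumptions): within one warp, the hardware serializes the two sides, masking off the threads that did not take the current side.

    __global__ void divergentBranch(float *out) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // Even and odd threads of the same warp take different sides,
        // so the warp runs Branch A first, then goes back for Branch B.
        if (idx % 2 == 0)
            out[idx] = idx * 2.0f;   // Branch A (the “blue” threads)
        else
            out[idx] = idx * 0.5f;   // Branch B (the “red” threads)
    }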

  6. “Divergent tree” Assume 512 threads in block… (figure: a reduction tree over the block’s partial sums; each level combines pairs at twice the previous stride)

  7. “Divergent tree” Assumes block size is power of 2…
     // Let our shared memory block be partial_outputs[]...
     synchronize threads before starting...
     set offset to 1
     while ((offset * 2) <= block dimension):
         if (thread index % (offset * 2) is 0):
             add partial_outputs[thread index + offset] to partial_outputs[thread index]
         double the offset
         synchronize threads
     Get thread 0 to atomicAdd() partial_outputs[0] to output
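Rendered as CUDA, the pseudocode might look like the sketch below. The kernel name and the input-loading step are assumptions (the lab accumulates a computed polynomial rather than reading an array); atomicAdd() on floats needs a Fermi-class GPU or newer.

    __global__ void divergentReduce(const float *input, float *output, int n) {
        extern __shared__ float partial_outputs[];
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;

        // One element per thread (0 past the end of the array).
        partial_outputs[tid] = (idx < n) ? input[idx] : 0.0f;
        __syncthreads();

        // Divergent tree: the stride doubles each pass, and the accumulating
        // threads (index % (2 * offset) == 0) stay scattered across every warp.
        for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
            if (tid % (offset * 2) == 0)
                partial_outputs[tid] += partial_outputs[tid + offset];
            __syncthreads();
        }

        // Thread 0 folds this block’s sum into the global result.
        if (tid == 0)
            atomicAdd(output, partial_outputs[0]);
    }

A launch must size the dynamic shared memory to the block, e.g. divergentReduce<<<blocks, 512, 512 * sizeof(float)>>>(d_in, d_out, n).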

  8. “Non-divergent tree” (figure; for example purposes only! Real blocks are way bigger!)

  9. “Non-divergent tree” Assumes block size is power of 2…
     // Let our shared memory block be partial_outputs[]...
     synchronize threads before starting...
     set offset to highest power of 2 that’s less than the block dimension
     while (offset >= 1):
         if (thread index < offset):
             add partial_outputs[thread index + offset] to partial_outputs[thread index]
         halve the offset
         synchronize threads
     Get thread 0 to atomicAdd() partial_outputs[0] to output
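The same assumed harness with the non-divergent indexing. For a power-of-2 block, “highest power of 2 that’s less than the block dimension” is just blockDim.x / 2, and the active threads now form a contiguous prefix:

    __global__ void nonDivergentReduce(const float *input, float *output, int n) {
        extern __shared__ float partial_outputs[];
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;

        partial_outputs[tid] = (idx < n) ? input[idx] : 0.0f;
        __syncthreads();

        // Non-divergent tree: the stride halves each pass, and the active
        // threads are always 0 .. offset-1, so (down to offset = 32) every
        // warp either accumulates as a whole or idles as a whole.
        for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
            if (tid < offset)
                partial_outputs[tid] += partial_outputs[tid + offset];
            __syncthreads();
        }

        if (tid == 0)
            atomicAdd(output, partial_outputs[0]);
    }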

  10. “Divergent tree” – Where is the divergence?
  • Two branches:
    • Accumulate
    • Do nothing
  • If the second branch does nothing, then where is the performance loss?

  11. “Divergent tree” – Analysis
  • First iteration (reduce 512 -> 256):
    • Warp of threads 0-31 (after calculating polynomial):
      • Thread 0: Accumulate
      • Thread 1: Do nothing
      • Thread 2: Accumulate
      • Thread 3: Do nothing
      • …
    • Warp of threads 32-63: (same thing!)
    • …
    • (up to) Warp of threads 480-511
  • Number of executing warps: 512 / 32 = 16

  12. “Divergent tree” – Analysis
  • Second iteration (reduce 256 -> 128):
    • Warp of threads 0-31 (after calculating polynomial):
      • Thread 0: Accumulate
      • Threads 1-3: Do nothing
      • Thread 4: Accumulate
      • Threads 5-7: Do nothing
      • …
    • Warp of threads 32-63: (same thing!)
    • …
    • (up to) Warp of threads 480-511
  • Number of executing warps: 16 (again!)

  13. “Divergent tree” – Analysis
  • The process continues with all 16 warps executing, until the offset grows large enough (the warp size, 32) that the accumulating threads land in separate warps

  14. “Non-divergent tree” – Analysis
  • First iteration (reduce 512 -> 256), part 1:
    • Warp of threads 0-31: Accumulate
    • Warp of threads 32-63: Accumulate
    • …
    • (up to) Warp of threads 224-255
  • Then what?

  15. “Non-divergent tree” – Analysis
  • First iteration (reduce 512 -> 256), part 2:
    • Warp of threads 256-287: Do nothing!
    • …
    • (up to) Warp of threads 480-511
  • Number of executing warps: 256 / 32 = 8 (was 16 previously!)

  16. “Non-divergent tree” – Analysis
  • Second iteration (reduce 256 -> 128):
    • Warps of threads 0-31, …, 96-127: Accumulate
    • Warps of threads 128-159, …, 480-511: Do nothing!
  • Number of executing warps: 128 / 32 = 4 (was 16 previously!)
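As a sanity check on these warp counts, here is a small host-side sketch (a 512-thread block and warp size 32 are assumed) that counts, at each reduction step, how many warps contain at least one accumulating thread under each scheme:

    #include <cstdio>

    int main() {
        const int blockSize = 512, warpSize = 32;
        for (int offset = 1; offset * 2 <= blockSize; offset *= 2) {
            // Divergent tree: accumulating threads are 0, 2*offset, 4*offset, ...
            bool warpActive[512 / 32] = {false};
            for (int t = 0; t < blockSize; t += 2 * offset)
                warpActive[t / warpSize] = true;
            int divergentWarps = 0;
            for (bool w : warpActive)
                divergentWarps += w;

            // Non-divergent tree at the same step: a contiguous prefix of
            // blockSize / (2 * offset) threads does the accumulating.
            int active = blockSize / (2 * offset);
            int nonDivergentWarps = (active + warpSize - 1) / warpSize;

            printf("reduce %3d -> %3d: divergent %2d warps, non-divergent %d warps\n",
                   blockSize / offset, blockSize / (2 * offset),
                   divergentWarps, nonDivergentWarps);
        }
        return 0;
    }

Both schemes perform exactly the same additions; the divergent tree just spreads them across 16 scheduled warps for the first several steps, while the non-divergent tree retires idle warps immediately (8, 4, 2, …).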

  17. What happened?
  • “Implicit divergence”

  18. Why did we do this?
  • Performance improvements
  • Reveals GPU internals!

  19. Final Puzzle
  • What happens when the polynomial order increases?
  • All these threads that we think are competing… are they?

  20. The Real World

  21. In medicine…
  • More sensitive devices -> more data!
  • More intensive algorithms
  • Real-time imaging and analysis
  • Most are parallelizable problems!
  http://www.varian.com

  22. MRI
  • “k-space” – inverse FFT
  • Real-time and high-resolution imaging
  http://oregonstate.edu
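The reconstruction step named here maps directly onto cuFFT. A minimal sketch, assuming a single complex-valued k-space slice of size NX x NY already on the device (the function and variable names are illustrative):

    #include <cufft.h>

    // Inverse 2D FFT: k-space -> image space.
    void reconstructSlice(cufftComplex *d_kspace, cufftComplex *d_image,
                          int NX, int NY) {
        cufftHandle plan;
        cufftPlan2d(&plan, NX, NY, CUFFT_C2C);
        // cuFFT’s inverse transform is unnormalized; scale the result by
        // 1 / (NX * NY) afterward if absolute intensities matter.
        cufftExecC2C(plan, d_kspace, d_image, CUFFT_INVERSE);
        cufftDestroy(plan);
    }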

  23. CT, PET
  • Low-dose techniques
    • Safety!
  • 4D CT imaging
  • X-ray CT vs. PET CT
  • Texture memory!
  http://www.upmccancercenter.com/

  24. Radiation Therapy
  • Goal: give sufficient dose to cancerous cells, minimize dose to healthy cells
  • More accurate algorithms possible!
    • Accuracy = safety!
  • 40 minutes -> 10 seconds
  http://en.wikipedia.org

  25. Notes
  • Office hours:
    • Kevin: Monday 8-10 PM
    • Ben: Tuesday 7-9 PM
    • Connor: Tuesday 8-10 PM
  • Lab 2: Due Wednesday (4/16), 5 PM
