CS 179: Lecture 4 Lab Review 2



Groups of Threads (Hierarchy)

(largest to smallest)

  • “Grid”:
    • All of the threads
    • Size: (number of threads per block) * (number of blocks)
  • “Block”:
    • Size: User-specified
      • Should be a multiple of 32, the warp size (often, larger is better)
      • Upper limit given by hardware (512 in Tesla, 1024 in Fermi)
    • Features:
      • Shared memory
      • Synchronization
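
The size relationships above can be sketched in a few lines of Python. This is only a host-side arithmetic illustration, not CUDA code; the function names are hypothetical, and the "one thread per element" launch policy is an assumption that the slides do not state.

```python
def blocks_needed(n_elements, threads_per_block):
    """Ceiling division: enough blocks so every element gets its own thread
    (assumed launch policy, not taken from the slides)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def grid_size(threads_per_block, n_blocks):
    """Grid size as defined on the slide: threads per block * number of blocks."""
    return threads_per_block * n_blocks


def global_index(block_index, threads_per_block, thread_index):
    """Position of a thread within the whole grid (1D layout assumed)."""
    return block_index * threads_per_block + thread_index


n_blocks = blocks_needed(1000, 512)   # 1000 elements, 512 threads per block
total_threads = grid_size(512, n_blocks)
```

With 1000 elements and 512 threads per block, two blocks are launched and the last 24 threads of the grid have no element to work on, which is why kernels usually bounds-check their global index.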
Groups of Threads
  • “Warp”:
    • Group of 32 threads
    • Execute in lockstep (same instructions)

    • Susceptible to divergence!
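
As a rough illustration of how threads map onto warps, the Python sketch below assumes a 1D block whose threads are grouped into warps by consecutive index. The helper names are hypothetical; only the warp size of 32 comes from the slides.

```python
WARP_SIZE = 32  # threads per warp, as stated on the slide


def warp_of(thread_index):
    """Warp that a thread belongs to within its block (1D block assumed)."""
    return thread_index // WARP_SIZE


def warps_in_block(block_dim):
    """Number of warps a block occupies, rounding up for a partial warp."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


def idle_lanes(block_dim):
    """Warp slots left unused when block_dim is not a multiple of 32."""
    return warps_in_block(block_dim) * WARP_SIZE - block_dim
```

Under this model a block of 100 threads still occupies 4 warps, with 28 slots in the last warp doing nothing, which is one way to see why block sizes are usually multiples of 32.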
Divergence

“Two roads diverged in a wood…

…and I took both”

Divergence
  • What happens:
    • Executes normally until if-statement
    • Branches to calculate Branch A (blue threads)
    • Goes back (!) and branches to calculate Branch B (red threads)
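
The cost of this behavior can be modeled very roughly in Python: a warp needs one pass per distinct branch outcome among its threads. This is a simplified sketch of the behavior described above, not a description of how any particular GPU schedules instructions, and `serialized_passes` is a hypothetical helper.

```python
WARP_SIZE = 32


def serialized_passes(warp_thread_ids, takes_branch_a):
    """Passes a warp needs to get through one if/else.

    Simplified model: one pass for each distinct branch outcome
    that appears among the warp's threads.
    """
    outcomes = {bool(takes_branch_a(tid)) for tid in warp_thread_ids}
    return len(outcomes)


warp0 = range(WARP_SIZE)
mixed = serialized_passes(warp0, lambda tid: tid % 2 == 0)        # threads disagree
uniform = serialized_passes(warp0, lambda tid: tid < WARP_SIZE)   # threads agree
```

In this model the even/odd condition costs two passes for every warp, while a condition that is uniform across the warp costs one.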
“Divergent tree”

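Assume 512 threads in block…

[Figure: threads that accumulate at each level of the tree]
  • Iteration 1: … 506, 508, 510
  • Iteration 2: … 500, 504, 508
  • Iteration 3: … 488, 496, 504
  • Iteration 4: … 464, 480, 496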

“Divergent tree”

Assumes block size is power of 2…

//Let our shared memory block be partial_outputs[]...
synchronize threads before starting...
set offset to 1
while ( (offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
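
To check the logic of the pseudocode above, here is a CPU-side Python simulation. It is a sketch rather than a CUDA kernel: the inner `for` loop plays every thread of the block in turn, finishing that loop stands in for "synchronize threads", and the return value stands in for what thread 0 would atomicAdd() to the output. The function name is hypothetical.

```python
def divergent_tree_sum(values):
    """Simulate the divergent tree reduction over one block's partial outputs."""
    partial_outputs = list(values)          # stands in for the shared-memory array
    block_dim = len(partial_outputs)
    if block_dim == 0 or block_dim & (block_dim - 1) != 0:
        raise ValueError("block size must be a power of 2")

    offset = 1
    while offset * 2 <= block_dim:
        for tid in range(block_dim):        # each pass plays one thread
            if tid % (offset * 2) == 0:
                partial_outputs[tid] += partial_outputs[tid + offset]
        offset *= 2                         # loop end acts as the barrier
    return partial_outputs[0]               # what thread 0 would add to output


block_total = divergent_tree_sum(range(512))
```

Running the threads one after another is safe here because, within an iteration, no active thread reads a slot that another active thread writes.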

“Non-divergent tree”

Assumes block size is power of 2…

//Let our shared memory block be partial_outputs[]...
synchronize threads before starting...
set offset to highest power of 2 that’s less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
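
The same kind of CPU-side Python simulation can be used to check this version. Again this is a sketch, not a CUDA kernel: the inner loop plays each thread, the end of that loop stands in for the barrier, and the function name is hypothetical. For a power-of-2 block, the highest power of 2 below the block dimension is simply half of it.

```python
def non_divergent_tree_sum(values):
    """Simulate the non-divergent tree reduction over one block's partial outputs."""
    partial_outputs = list(values)          # stands in for the shared-memory array
    block_dim = len(partial_outputs)
    if block_dim == 0 or block_dim & (block_dim - 1) != 0:
        raise ValueError("block size must be a power of 2")

    offset = block_dim // 2                 # highest power of 2 below block_dim
    while offset >= 1:
        for tid in range(block_dim):        # each pass plays one thread
            if tid < offset:
                partial_outputs[tid] += partial_outputs[tid + offset]
        offset //= 2                        # loop end acts as the barrier
    return partial_outputs[0]               # what thread 0 would add to output


block_total = non_divergent_tree_sum(range(512))
```

Active threads write only to slots below `offset` and read only from slots at or above it, so the sequential simulation gives the same result a synchronized parallel run would.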

“Divergent tree” – Where is the divergence?
  • Two branches:
    • Accumulate
    • Do nothing
  • If the second branch does nothing, then where is the performance loss?
“Divergent tree” – Analysis
  • First iteration: (Reduce 512 -> 256):
    • Warp of threads 0-31: (After calculating polynomial)
      • Thread 0: Accumulate
      • Thread 1: Do nothing
      • Thread 2: Accumulate
      • Thread 3: Do nothing
    • Warp of threads 32-63:
      • (same thing!)
    • (up to) Warp of threads 480-511
  • Number of executing warps: 512 / 32 = 16
“Divergent tree” – Analysis
  • Second iteration: (Reduce 256 -> 128):
    • Warp of threads 0-31: (After calculating polynomial)
      • Thread 0: Accumulate
      • Threads 1-3: Do nothing
      • Thread 4: Accumulate
      • Threads 5-7: Do nothing
    • Warp of threads 32-63:
      • (same thing!)
    • (up to) Warp of threads 480-511
  • Number of executing warps: 16 (again!)
“Divergent tree” – Analysis
  • (Process continues, until offset is large enough to separate warps)
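
The warp counts in this analysis can be reproduced with a short Python sketch. It counts, for each iteration, how many warps contain at least one accumulating thread, assuming a 1D block of 512 threads and warps of 32 consecutive threads. The function name is hypothetical.

```python
def divergent_active_warps(block_dim=512, warp_size=32):
    """Warps with at least one accumulating thread, per iteration."""
    counts = []
    offset = 1
    while offset * 2 <= block_dim:
        warps = {tid // warp_size
                 for tid in range(block_dim)
                 if tid % (offset * 2) == 0}
        counts.append(len(warps))
        offset *= 2
    return counts


per_iteration = divergent_active_warps()
# per_iteration == [16, 16, 16, 16, 16, 8, 4, 2, 1]
```

All 16 warps stay busy for the first five iterations. The count only starts to drop once the gap between accumulating threads (2 * offset) exceeds the warp size.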
“Non-divergent tree” – Analysis
  • First iteration: (Reduce 512 -> 256): (Part 1)
    • Warp of threads 0-31:
      • Accumulate
    • Warp of threads 32-63:
      • Accumulate
    • (up to) Warp of threads 224-255
    • Then what?
“Non-divergent tree” – Analysis
  • First iteration: (Reduce 512 -> 256): (Part 2)
    • Warp of threads 256-287:
      • Do nothing!
    • (up to) Warp of threads 480-511
  • Number of executing warps: 256 / 32 = 8 (Was 16 previously!)
“Non-divergent tree” – Analysis
  • Second iteration: (Reduce 256 -> 128):
    • Warp of threads 0-31, …, 96-127:
      • Accumulate
    • Warp of threads 128-159, …, 480-511:
      • Do nothing!
  • Number of executing warps: 128 / 32 = 4 (Was 16 previously!)
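
The counts for the non-divergent tree can be reproduced with a short Python sketch that counts, per iteration, the warps containing at least one accumulating thread. It assumes a 1D block of 512 threads and warps of 32 consecutive threads; the function name is hypothetical.

```python
def non_divergent_active_warps(block_dim=512, warp_size=32):
    """Warps with at least one accumulating thread, per iteration."""
    counts = []
    offset = block_dim // 2             # highest power of 2 below block_dim
    while offset >= 1:
        warps = {tid // warp_size
                 for tid in range(block_dim)
                 if tid < offset}
        counts.append(len(warps))
        offset //= 2
    return counts


per_iteration = non_divergent_active_warps()
# per_iteration == [8, 4, 2, 1, 1, 1, 1, 1, 1]
```

The number of busy warps halves from the very first iteration, and only the last few iterations (offset below 32) have accumulating and idle threads inside the same warp.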
What happened?
  • “Implicit divergence”
Why did we do this?
  • Performance improvements
  • Reveals GPU internals!
Final Puzzle
  • What happens when the polynomial order increases?
    • All these threads that we think are competing… are they?
In medicine…
  • More sensitive devices -> more data!
  • More intensive algorithms
  • Real-time imaging and analysis
  • Most are parallelizable problems!

http://www.varian.com

MRI
  • “k-space” – Inverse FFT
  • Real-time and high-resolution imaging

http://oregonstate.edu

CT, PET
  • Low-dose techniques
    • Safety!
  • 4D CT imaging
  • X-ray CT vs. PET CT
    • Texture memory!

http://www.upmccancercenter.com/

Radiation Therapy
  • Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells
    • More accurate algorithms possible!
      • Accuracy = safety!
    • 40 minutes -> 10 seconds

http://en.wikipedia.org

Notes
  • Office hours:
    • Kevin: Monday 8-10 PM
    • Ben: Tuesday 7-9 PM
    • Connor: Tuesday 8-10 PM
  • Lab 2: Due Wednesday (4/16), 5 PM