
CS 179: Lecture 4 Lab Review 2



Groups of Threads (Hierarchy)

(largest to smallest)

  • “Grid”:

    • All of the threads

    • Size: (number of threads per block) * (number of blocks)

  • “Block”:

    • Size: User-specified

      • Should at least be a multiple of 32 (often, higher is better)

      • Upper limit given by hardware (512 in Tesla, 1024 in Fermi)

    • Features:

      • Shared memory

      • Synchronization
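
As a rough illustration of the hierarchy above, a launch might look like the sketch below. The kernel name, the sizes, and the device pointers d_in/d_out are made up for this example; the block size of 512 is simply a multiple of 32 that fits the Tesla-era limit.

// Hypothetical kernel: each thread handles one element of in[]/out[].
__global__ void scaleKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index within the whole grid
    if (i < n)
        out[i] = 2.0f * in[i];
}

// Host-side launch (d_in and d_out are assumed device pointers from cudaMalloc).
int n = 1 << 20;
int threadsPerBlock = 512;                            // a multiple of 32
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
scaleKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, n);
// The grid is all numBlocks * threadsPerBlock threads; each block has threadsPerBlock threads.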



Groups of Threads

  • “Warp”:

    • Group of 32 threads

    • Execute in lockstep (same instructions)

    • Susceptible to divergence!
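
The warp size (and the per-block thread limit mentioned earlier) can be read from the device properties at runtime; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                        // properties of device 0
    printf("Warp size: %d threads\n", prop.warpSize);         // 32 on the GPUs discussed here
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}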



Divergence

“Two roads diverged in a wood…

…and I took both”



Divergence

  • What happens:

    • Executes normally until if-statement

    • Branches to calculate Branch A (blue threads)

    • Goes back (!) and branches to calculate Branch B (red threads)
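
A minimal sketch of that behavior (the kernel and the two branch bodies are invented for illustration): every thread of a warp reaches the if-statement together, the warp executes Branch A with the odd threads masked off, then goes back and executes Branch B with the even threads masked off.

__global__ void divergeKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        data[i] = data[i] * data[i];    // Branch A: even threads of the warp
    } else {
        data[i] = data[i] + 1.0f;       // Branch B: odd threads of the warp
    }
    // Within a warp the two branches are serialized, so the warp pays
    // roughly the cost of Branch A plus the cost of Branch B.
}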



“Divergent tree”

Assume 512 threads in the block…

[Diagram: reduction tree. At each level the accumulating thread indices thin out: even indices (…, 506, 508, 510), then multiples of 4 (…, 500, 504, 508), multiples of 8 (…, 488, 496, 504), multiples of 16 (…, 464, 480, 496), and so on.]



“Divergent tree”

Assumes the block size is a power of 2…

// Let our shared memory block be partial_outputs[]...
synchronize threads before starting
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
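
One way the pseudocode above could be written as actual CUDA, as a sketch rather than the lab solution. It assumes the kernel has already filled a __shared__ float partial_outputs[] array (one entry per thread) and that output is a device pointer to a single float (floating-point atomicAdd needs Fermi-class hardware or newer):

__syncthreads();                              // every thread's partial result is written
for (unsigned int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
    if (threadIdx.x % (offset * 2) == 0) {
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    }
    __syncthreads();                          // finish this level before the next one reads it
}
if (threadIdx.x == 0) {
    atomicAdd(output, partial_outputs[0]);    // one thread publishes the block's sum
}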



“Non-divergent tree”

(Example purposes only! Real blocks are way bigger!)



“Non-divergent tree”

Assumes the block size is a power of 2…

// Let our shared memory block be partial_outputs[]...
set offset to the highest power of 2 that's less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
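
The same kernel fragment with the non-divergent indexing, again a sketch under the same assumptions. Since the block size is a power of 2, "the highest power of 2 less than the block dimension" is just blockDim.x / 2, and an extra __syncthreads() up front ensures partial_outputs[] is fully written before the first level reads it:

__syncthreads();                              // partial_outputs[] fully populated
for (unsigned int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
    if (threadIdx.x < offset) {               // active threads packed at the low indices
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    }
    __syncthreads();
}
if (threadIdx.x == 0) {
    atomicAdd(output, partial_outputs[0]);
}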



“Divergent tree”: Where is the divergence?

  • Two branches:

    • Accumulate

    • Do nothing

  • If the second branch does nothing, then where is the performance loss?



“Divergent tree” – Analysis

  • First iteration: (Reduce 512 -> 256):

    • Warp of threads 0-31: (After calculating polynomial)

      • Thread 0: Accumulate

      • Thread 1: Do nothing

      • Thread 2: Accumulate

      • Thread 3: Do nothing

    • Warp of threads 32-63:

      • (same thing!)

    • (up to) Warp of threads 480-511

  • Number of executing warps: 512 / 32 = 16



“Divergent tree” – Analysis

  • Second iteration: (Reduce 256 -> 128):

    • Warp of threads 0-31: (After calculating polynomial)

      • Thread 0: Accumulate

      • Threads 1-3: Do nothing

      • Thread 4: Accumulate

      • Threads 5-7: Do nothing

    • Warp of threads 32-63:

      • (same thing!)

    • (up to) Warp of threads 480-511

  • Number of executing warps: 16 (again!)



“Divergent tree” – Analysis

  • (The process continues: every warp still contains at least one accumulating thread until offset * 2 exceeds the warp size of 32, so all 16 warps stay busy for the first five iterations; only after that does the number of active warps start to halve)



“Non-divergent tree” – Analysis

  • First iteration: (Reduce 512 -> 256): (Part 1)

    • Warp of threads 0-31:

      • Accumulate

    • Warp of threads 32-63:

      • Accumulate

    • (up to) Warp of threads 224-255

    • Then what?



“Non-divergent tree” – Analysis

  • First iteration: (Reduce 512 -> 256): (Part 2)

    • Warp of threads 256-287:

      • Do nothing!

    • (up to) Warp of threads 480-511

  • Number of executing warps: 256 / 32 = 8 (Was 16 previously!)



“Non-divergent tree” – Analysis

  • Second iteration: (Reduce 256 -> 128):

    • Warp of threads 0-31, …, 96-127:

      • Accumulate

    • Warp of threads 128-159, …, 480-511

      • Do nothing!

  • Number of executing warps: 128 / 32 = 4 (Was 16 previously!)



What happened?

  • “Implicit divergence”



Why did we do this?

  • Performance improvements

  • Reveals GPU internals!



Final Puzzle

  • What happens when the polynomial order increases?

    • All these threads that we think are competing… are they?



The Real World



In medicine…

  • More sensitive devices -> more data!

  • More intensive algorithms

  • Real-time imaging and analysis

  • Most are parallelizable problems!

http://www.varian.com



MRI

  • “k-space” – Inverse FFT

  • Real-time and high-resolution imaging

http://oregonstate.edu



CT, PET

  • Low-dose techniques

    • Safety!

  • 4D CT imaging

  • X-ray CT vs. PET CT

    • Texture memory!

http://www.upmccancercenter.com/



Radiation Therapy

  • Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells

    • More accurate algorithms possible!

      • Accuracy = safety!

    • 40 minutes -> 10 seconds

http://en.wikipedia.org



Notes

  • Office hours:

    • Kevin: Monday 8-10 PM

    • Ben: Tuesday 7-9 PM

    • Connor: Tuesday 8-10 PM

  • Lab 2: Due Wednesday (4/16), 5 PM

