CS 179: Lecture 4 Lab Review 2

### Groups of Threads (Hierarchy)

(largest to smallest)

• “Grid”:

• All of the threads

• Size: (number of threads per block) * (number of blocks)

• “Block”:

• Size: User-specified

• Should be a multiple of 32, the warp size (often, larger is better)

• Upper limit given by hardware (512 threads per block in Tesla, 1024 in Fermi)

• Features:

• Shared memory

• Synchronization

• “Warp”:

• Group of 32 threads within a block

• Execute in lockstep (same instructions)

• Susceptible to divergence!
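The size relationships above can be checked with a little arithmetic. This is a minimal host-side C++ sketch, not device code; the helper names (`grid_size`, `warps_per_block`, `warp_of_thread`) are hypothetical, and it assumes a warp size of 32 and a one-dimensional block.

```cpp
#include <cassert>

// Assumed warp size; matches the "multiple of 32" advice on the slide.
constexpr int kWarpSize = 32;

// Total threads in the grid: (threads per block) * (number of blocks).
int grid_size(int threads_per_block, int num_blocks) {
    assert(threads_per_block > 0 && num_blocks > 0);
    return threads_per_block * num_blocks;
}

// Warps occupied by one block. A partially filled warp still counts as a
// whole warp, which is why block sizes should be multiples of 32.
int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Index of the warp that a thread belongs to (1D block assumed).
int warp_of_thread(int thread_index) {
    return thread_index / kWarpSize;
}
```

For example, a block of 100 threads occupies 4 warps, the last of which has only 4 useful threads.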

“Two roads diverged in a wood…

…and I took both”

• What happens:

• Executes normally until if-statement

• Branches to calculate Branch A (blue threads)

• Goes back (!) and branches to calculate Branch B (red threads)
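The steps above can be modeled roughly on the host. This is a simplified C++ sketch, not how the hardware is actually implemented: it assumes a 32-thread warp and only counts how many branch bodies the warp has to step through for one if/else. The names `passes_for_branch` and `lanes_where_multiple_of` are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp size

// takes_branch_a[lane] is true if that thread of the warp takes Branch A.
// Returns 1 if all threads agree (no divergence), 2 if the warp has to run
// Branch A and then go back and run Branch B.
int passes_for_branch(const std::array<bool, kWarpSize>& takes_branch_a) {
    bool any_a = false;
    bool any_b = false;
    for (bool a : takes_branch_a) {
        if (a) {
            any_a = true;
        } else {
            any_b = true;
        }
    }
    return (any_a ? 1 : 0) + (any_b ? 1 : 0);
}

// Builds a warp in which thread `lane` takes Branch A when lane % modulus == 0.
std::array<bool, kWarpSize> lanes_where_multiple_of(int modulus) {
    assert(modulus > 0);
    std::array<bool, kWarpSize> lanes{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        lanes[lane] = (lane % modulus == 0);
    }
    return lanes;
}
```

With `modulus` 1 every thread takes Branch A and the warp needs one pass; with `modulus` 2 the warp splits and needs two.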

### “Divergent tree”

Assumes block size is power of 2…

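//Let our shared memory block be partial_outputs[]...

set offset to 1

while ( (offset * 2) <= block dimension):

    if (thread index % (offset * 2) is 0):

        add partial_outputs[thread index + offset] to partial_outputs[thread index]

    synchronize all threads in the block

    double the offset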

Example purposes only! Real blocks are way bigger!
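A minimal host-side C++ sketch of the divergent tree, for checking the indexing only. It is a simulation, not a kernel: the inner loop over `tid` stands in for the threads of one block, and finishing that loop plays the role of the block-wide synchronization. The function name `divergent_tree_sum` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Sums partial_outputs using the "divergent tree" access pattern.
// The vector is taken by value so the caller's data is left untouched.
float divergent_tree_sum(std::vector<float> partial_outputs) {
    const int block_dim = static_cast<int>(partial_outputs.size());
    // The slide's assumption: block size is a power of 2.
    assert(block_dim > 0 && (block_dim & (block_dim - 1)) == 0);

    for (int offset = 1; offset * 2 <= block_dim; offset *= 2) {
        // One pass over tid = one step of every thread in the block.
        for (int tid = 0; tid < block_dim; ++tid) {
            if (tid % (offset * 2) == 0) {
                partial_outputs[tid] += partial_outputs[tid + offset];
            }
        }
        // End of the inner loop corresponds to the block-wide sync.
    }
    return partial_outputs[0];  // thread 0 ends up holding the total
}
```

Running the threads one after another is safe here because, within one iteration, the element each active thread reads belongs to an inactive thread.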

### “Non-divergent tree”

Assumes block size is power of 2…

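//Let our shared memory block be partial_outputs[]...

set offset to highest power of 2 that’s less than the block dimension

while (offset >= 1):

    if (thread index < offset):

        add partial_outputs[thread index + offset] to partial_outputs[thread index]

    synchronize all threads in the block

    halve the offset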
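A minimal host-side C++ sketch of the non-divergent tree, under the same assumptions as before: it is a simulation rather than a kernel, the inner loop over `tid` stands in for the threads of one block, and the function name `non_divergent_tree_sum` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Sums partial_outputs using the "non-divergent tree" access pattern.
// The vector is taken by value so the caller's data is left untouched.
float non_divergent_tree_sum(std::vector<float> partial_outputs) {
    const int block_dim = static_cast<int>(partial_outputs.size());
    // The slide's assumption: block size is a power of 2.
    assert(block_dim > 0 && (block_dim & (block_dim - 1)) == 0);

    // For a power-of-2 block, the highest power of 2 below block_dim
    // is block_dim / 2.
    for (int offset = block_dim / 2; offset >= 1; offset /= 2) {
        // Only the first `offset` threads do work; they are contiguous,
        // so whole warps are either all active or all idle (until
        // offset drops below the warp size).
        for (int tid = 0; tid < block_dim; ++tid) {
            if (tid < offset) {
                partial_outputs[tid] += partial_outputs[tid + offset];
            }
        }
        // End of the inner loop corresponds to the block-wide sync.
    }
    return partial_outputs[0];  // thread 0 ends up holding the total
}
```

Both trees compute the same sum; the only difference is which threads are active in each iteration.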

### “Divergent tree” – Where is the divergence?

• Two branches:

• Accumulate

• Do nothing

• If the second branch does nothing, then where is the performance loss?

### “Divergent tree” – Analysis

• First iteration: (Reduce 512 -> 256):

• Warp of threads 0-31 (after calculating polynomial): even-indexed threads accumulate, odd-indexed threads do nothing

• Every following warp: (same thing!)

• (up to) Warp of threads 480-511

• Number of executing warps: 512 / 32 = 16

### “Divergent tree” – Analysis

• Second iteration: (Reduce 256 -> 128):

• Warp of threads 0-31: threads with index divisible by 4 accumulate, all others do nothing

• Every following warp: (same thing!)

• (up to) Warp of threads 480-511

• Number of executing warps: 16 (again!)

### “Divergent tree” – Analysis

• (Process continues, until offset is large enough to separate warps)
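The warp counts in this analysis can be reproduced with a short host-side C++ sketch. It assumes a warp size of 32 and counts a warp as "executing" in an iteration if at least one of its threads takes the accumulate branch; the function name `divergent_tree_warp_counts` is hypothetical.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // assumed warp size

// For each iteration of the divergent tree, returns how many warps contain
// at least one thread that accumulates.
std::vector<int> divergent_tree_warp_counts(int block_dim) {
    assert(block_dim > 0 && (block_dim & (block_dim - 1)) == 0);
    std::vector<int> counts;
    for (int offset = 1; offset * 2 <= block_dim; offset *= 2) {
        int executing = 0;
        for (int warp_start = 0; warp_start < block_dim; warp_start += kWarpSize) {
            bool any_active = false;
            for (int tid = warp_start;
                 tid < warp_start + kWarpSize && tid < block_dim; ++tid) {
                if (tid % (offset * 2) == 0) {
                    any_active = true;
                }
            }
            if (any_active) {
                ++executing;
            }
        }
        counts.push_back(executing);
    }
    return counts;
}
```

For a 512-thread block this gives 16 executing warps for each of the first five iterations (offsets 1 through 16), then 8, 4, 2, 1 once the offset is large enough to separate warps.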

### “Non-divergent tree” – Analysis

• First iteration: (Reduce 512 -> 256): (Part 1)

• Warp of threads 0-31:

• Accumulate

• Warp of threads 32-63:

• Accumulate

• (up to) Warp of threads 224-255

• Then what?

### “Non-divergent tree” – Analysis

• First iteration: (Reduce 512 -> 256): (Part 2)

• Warp of threads 256-287:

• Do nothing!

• (up to) Warp of threads 480-511

• Number of executing warps: 256 / 32 = 8 (Was 16 previously!)

### “Non-divergent tree” – Analysis

• Second iteration: (Reduce 256 -> 128):

• Warp of threads 0-31, …, 96-127:

• Accumulate

• Warp of threads 128-159, …, 480-511:

• Do nothing!

• Number of executing warps: 128 / 32 = 4 (Was 16 previously!)
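The same count can be sketched on the host for the non-divergent tree. As before, this is a C++ simulation that assumes a warp size of 32 and treats a warp as "executing" if at least one of its threads accumulates; the names `non_divergent_tree_warp_counts` and `total_warp_executions` are hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

constexpr int kWarpSize = 32;  // assumed warp size

// For each iteration of the non-divergent tree, returns how many warps
// contain at least one thread that accumulates (thread index < offset).
std::vector<int> non_divergent_tree_warp_counts(int block_dim) {
    assert(block_dim > 0 && (block_dim & (block_dim - 1)) == 0);
    std::vector<int> counts;
    for (int offset = block_dim / 2; offset >= 1; offset /= 2) {
        int executing = 0;
        for (int warp_start = 0; warp_start < block_dim; warp_start += kWarpSize) {
            // Active threads are 0..offset-1, so a warp executes exactly
            // when its first thread is active.
            if (warp_start < offset) {
                ++executing;
            }
        }
        counts.push_back(executing);
    }
    return counts;
}

// Sum of executing warps over all iterations, as a rough cost measure.
int total_warp_executions(const std::vector<int>& counts) {
    return std::accumulate(counts.begin(), counts.end(), 0);
}
```

For a 512-thread block this gives 8, 4, 2, 1 and then a single warp for the remaining five iterations, 20 warp executions in total.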

### What happened?

• “Implicit divergence”

### Why did we do this?

• Performance improvements

• Reveals GPU internals!

### Final Puzzle

• What happens when the polynomial order increases?

• All these threads that we think are competing… are they?

### In medicine…

• More sensitive devices -> more data!

• More intensive algorithms

• Real-time imaging and analysis

• Most are parallelizable problems!

### MRI

• “k-space” – Inverse FFT

• Real-time and high-resolution imaging

### CT, PET

• Low-dose techniques

• Safety!

• 4D CT imaging

• X-ray CT vs. PET CT

• Texture memory!

• Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells

• More accurate algorithms possible!

• Accuracy = safety!

• 40 minutes -> 10 seconds

### Notes

• Office hours:

• Kevin: Monday 8-10 PM

• Ben: Tuesday 7-9 PM

• Connor: Tuesday 8-10 PM

• Lab 2: Due Wednesday (4/16), 5 PM