CS 179: Lecture 4 Lab Review 2


Thread hierarchy (largest to smallest):

- “Grid”:
- All of the threads
- Size: (number of threads per block) * (number of blocks)

- “Block”:
- Size: User-specified
- Should at least be a multiple of 32 (often, higher is better)
- Upper limit given by hardware (512 in Tesla, 1024 in Fermi)

- Features:
- Shared memory
- Synchronization


- “Warp”:
- Group of 32 threads
- Execute in lockstep (same instructions)

- Susceptible to divergence!

“Two roads diverged in a wood…

…and I took both”

- What happens:
- Executes normally until if-statement
- Branches to calculate Branch A (blue threads)
- Goes back (!) and branches to calculate Branch B (red threads)
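The serialization above can be modeled in a few lines (a toy Python cost model of SIMT execution, not real GPU code; `warp_time` and its unit costs are invented for this illustration):

```python
# Toy cost model of warp divergence (illustration only): if any thread in a
# warp takes branch A and any other takes branch B, the warp executes both
# branches back to back, with the non-participating threads masked off.

def warp_time(takes_branch_a, time_a=1, time_b=1):
    """Time (arbitrary units) for one warp to get through an if/else."""
    runs_a = any(takes_branch_a)            # some thread takes branch A
    runs_b = not all(takes_branch_a)        # some thread takes branch B
    return runs_a * time_a + runs_b * time_b

uniform = [True] * 32                       # whole warp agrees: one branch runs
diverged = [t % 2 == 0 for t in range(32)]  # threads alternate: both branches run

print(warp_time(uniform), warp_time(diverged))  # 1 2
```

A warp with no divergence pays for one branch; a diverged warp pays for both.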

Assume 512 threads in block…

(Diagram: the accumulating thread indices at each iteration thin out, e.g. the even indices …, 506, 508, 510; then multiples of 4, …, 500, 504, 508; multiples of 8, …, 488, 496, 504; multiples of 16, …, 464, 480, 496.)

Assumes block size is power of 2…

    // Let our shared memory block be partial_outputs[]...
    synchronize threads before starting
    set offset to 1
    while ((offset * 2) <= block dimension):
        if (thread index % (offset * 2) is 0):
            add partial_outputs[thread index + offset] to partial_outputs[thread index]
        double the offset
        synchronize threads
    Get thread 0 to atomicAdd() partial_outputs[0] to output

Example purposes only! Real blocks are way bigger!
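The pseudocode above can be checked with a quick sequential simulation (a Python sketch standing in for the CUDA kernel; `naive_reduce` and `partial` are names invented here):

```python
# Sequential simulation of the interleaved reduction pseudocode above.
# `partial` stands in for the shared-memory array partial_outputs[].

def naive_reduce(values):
    partial = list(values)
    n = len(partial)                 # block dimension (assumed a power of 2)
    offset = 1
    while offset * 2 <= n:
        # Threads whose index is a multiple of (offset * 2) accumulate.
        for t in range(0, n, offset * 2):
            partial[t] += partial[t + offset]
        offset *= 2
        # (a __syncthreads() barrier would go here on the GPU)
    return partial[0]                # thread 0 would atomicAdd() this to output

print(naive_reduce(range(512)))      # 130816, i.e. sum(0..511)
```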

Assumes block size is power of 2…

    // Let our shared memory block be partial_outputs[]...
    set offset to highest power of 2 that's less than the block dimension
    while (offset >= 1):
        if (thread index < offset):
            add partial_outputs[thread index + offset] to partial_outputs[thread index]
        halve the offset
        synchronize threads
    Get thread 0 to atomicAdd() partial_outputs[0] to output
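This version can be simulated the same way (a hedged Python sketch; `contiguous_reduce` is a name invented here, and for a power-of-2 block the starting offset is simply half the block dimension):

```python
# Sequential simulation of the contiguous ("thread index < offset") reduction.
# `partial` stands in for the shared-memory array partial_outputs[].

def contiguous_reduce(values):
    partial = list(values)
    n = len(partial)                 # block dimension (assumed a power of 2)
    offset = n // 2                  # highest power of 2 below the block dimension
    while offset >= 1:
        # Threads 0 .. offset-1 accumulate; all other threads do nothing.
        for t in range(offset):
            partial[t] += partial[t + offset]
        offset //= 2
        # (a __syncthreads() barrier would go here on the GPU)
    return partial[0]                # thread 0 would atomicAdd() this to output

print(contiguous_reduce(range(512)))  # 130816, same answer as the naive version
```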

- Two branches:
- Accumulate
- Do nothing

- If the second branch does nothing, then where is the performance loss?

- First iteration: (Reduce 512 -> 256):
- Warp of threads 0-31: (After calculating polynomial)
- Thread 0: Accumulate
- Thread 1: Do nothing
- Thread 2: Accumulate
- Thread 3: Do nothing
- …

- Warp of threads 32-63:
- (same thing!)

- …
- (up to) Warp of threads 480-511

- Number of executing warps: 512 / 32 = 16

- Second iteration: (Reduce 256 -> 128):
- Warp of threads 0-31: (After calculating polynomial)
- Thread 0: Accumulate
- Threads 1-3: Do nothing
- Thread 4: Accumulate
- Threads 5-7: Do nothing
- …

- Warp of threads 32-63:
- (same thing!)

- …
- (up to) Warp of threads 480-511

- Number of executing warps: 16 (again!)

- (Process continues until the stride (offset * 2) reaches the warp size, so that each warp is either entirely accumulating or entirely idle)

- First iteration: (Reduce 512 -> 256): (Part 1)
- Warp of threads 0-31:
- Accumulate

- Warp of threads 32-63:
- Accumulate

- …
- (up to) Warp of threads 224-255
- Then what?


- First iteration: (Reduce 512 -> 256): (Part 2)
- Warp of threads 256-287:
- Do nothing!

- …
- (up to) Warp of threads 480-511

- Number of executing warps: 256 / 32 = 8 (Was 16 previously!)

- Second iteration: (Reduce 256 -> 128):
- Warp of threads 0-31, …, 96-127:
- Accumulate

- Warp of threads 128-159, …,
480-511

- Do nothing!

- Number of executing warps: 128 / 32 = 4 (Was 16 previously!)
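The per-iteration warp counts traced out above can be tabulated directly (a hedged sketch for a 512-thread block; `warps_touched` is a helper invented for this example):

```python
# Count, per iteration, how many 32-thread warps contain at least one
# accumulating thread -- those are the warps that must execute the add.
BLOCK, WARP = 512, 32

def warps_touched(active_threads):
    return len({t // WARP for t in active_threads})

# Interleaved scheme: thread t accumulates iff t % (2 * offset) == 0.
interleaved = [warps_touched(range(0, BLOCK, 2 * offset))
               for offset in (2 ** k for k in range(9))]        # offset = 1 .. 256

# Contiguous scheme: threads 0 .. offset-1 accumulate.
contiguous = [warps_touched(range(offset))
              for offset in (2 ** (8 - k) for k in range(9))]   # offset = 256 .. 1

print(interleaved)   # [16, 16, 16, 16, 16, 8, 4, 2, 1]
print(contiguous)    # [8, 4, 2, 1, 1, 1, 1, 1, 1]
```

All 16 warps stay busy for the first five interleaved iterations, while the contiguous scheme retires half the warps every step.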

- “Implicit divergence”

- Performance improvements
- Reveals GPU internals!

- What happens when the polynomial order increases?
- All these threads that we think are competing… are they?

- More sensitive devices -> more data!
- More intensive algorithms
- Real-time imaging and analysis
- Most are parallelizable problems!

http://www.varian.com

- “k-space” – Inverse FFT
- Real-time and high-resolution imaging
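As a toy illustration of the k-space idea (NumPy here stands in for a GPU FFT library such as cuFFT; the 8x8 "image" is made up for this sketch):

```python
import numpy as np

# An MR scanner effectively samples the image's 2D spatial-frequency
# content ("k-space"); the image is recovered with an inverse 2D FFT.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0                       # toy "anatomy": a bright square

kspace = np.fft.fft2(image)                 # what the scanner measures, conceptually
reconstructed = np.fft.ifft2(kspace).real   # inverse FFT recovers the image

print(np.allclose(reconstructed, image))    # True
```

Each frequency sample contributes to every output pixel, which is why the FFT-based reconstruction parallelizes so well on a GPU.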

http://oregonstate.edu

- Low-dose techniques
- Safety!

- 4D CT imaging
- X-ray CT vs. PET CT
- Texture memory!

http://www.upmccancercenter.com/

- Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells
- More accurate algorithms possible!
- Accuracy = safety!

- 40 minutes -> 10 seconds

- More accurate algorithms possible!

http://en.wikipedia.org

- Office hours:
- Kevin: Monday 8-10 PM
- Ben: Tuesday 7-9 PM
- Connor: Tuesday 8-10 PM

- Lab 2: Due Wednesday (4/16), 5 PM