CUDA Lecture 7: CUDA Threads and Atomics

Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Topic 1: Atomics
  • The Problem: how do you do global communication?
    • One option: finish a kernel and start a new one
    • All writes from all threads complete before a kernel finishes
    • But this would mean decomposing kernels into before and after parts

Race Conditions
  • The alternative: have threads write to a predefined memory location
    • Race condition! Updates can be lost
    • What is the value of a in thread 0? In thread 1917? (See the sketch below.)
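
The slide's code is not reproduced in the transcript; a minimal sketch of the kind of kernel it describes (the kernel name and pointer argument are invented here) is:

    __global__ void race(int *x)
    {
        *x = threadIdx.x;   // every thread writes a different value to the same address
        int a = *x;         // what each thread reads back is not defined by the model
    }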

Race Conditions (cont.)
  • Thread 0 could have finished execution before 1917 started
  • Or the other way around
  • Or both are executing at the same time
  • Answer: not defined by the programming model; can be arbitrary

Atomics
  • CUDA provides atomic operations to deal with this problem
    • An atomic operation guarantees that only a single thread has access to a piece of memory while an operation completes
    • The name atomic comes from the fact that it is uninterruptable
    • No dropped data, but ordering is still arbitrary

Atomics (cont.)
  • CUDA provides atomic operations to deal with this problem
    • Requires hardware with compute capability 1.1 and above
    • Different types of atomic instructions
      • Addition/subtraction: atomicAdd, atomicSub
      • Minimum/maximum: atomicMin, atomicMax
      • Conditional increment/decrement: atomicInc, atomicDec
      • Exchange/compare-and-swap: atomicExch, atomicCAS
    • More types in Fermi: atomicAnd, atomicOr, atomicXor

Example: Histogram
  • atomicAdd returns the previous value at a certain address
  • Useful for grabbing variable amounts of data from the list
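
The histogram kernel itself is missing from the transcript; a hedged sketch of the usual pattern (the names, and the assumption that each input value is already a valid bin index, are mine) is:

    __global__ void histogram(const int *input, int *bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[input[i]], 1);   // safe concurrent increment; ordering still arbitrary
    }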

Example: Workqueue
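
The workqueue code is not in the transcript; a hedged sketch of one common pattern, where each thread claims a chunk of work by atomically advancing a global counter (all names here are invented), is:

    __device__ int queue_head;   // hypothetical global counter, zeroed by the host

    __global__ void process_queue(const float *work, float *out, int total_work, int chunk)
    {
        while (true) {
            int start = atomicAdd(&queue_head, chunk);   // old value = start of this thread's chunk
            if (start >= total_work) break;
            int end = min(start + chunk, total_work);
            for (int i = start; i < end; ++i)
                out[i] = 2.0f * work[i];                 // stand-in for the real per-item work
        }
    }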

Compare and Swap
  • If compare equals the old value stored at address then val is stored instead
  • In either case the routine returns the value of old
  • It seems a bizarre routine at first sight, but it can be very useful for atomic locks
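
As a small illustration (not from the slides), atomicCAS can be used to let exactly one thread claim a flag; only the thread that sees the old value 0 wins:

    __global__ void claim_once(int *flag, int *winner)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (atomicCAS(flag, 0, 1) == 0)   // swapping 0 -> 1 succeeds for only one thread
            *winner = tid;                // that thread records its id
    }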

Compare and Swap (cont.)
  • Most general type of atomic
  • Can emulate all others with CAS
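
For example, the CUDA programming guide shows how a double-precision atomicAdd can be built from atomicCAS; a sketch along those lines (not the slide's code) is:

    __device__ double atomicAddDouble(double *address, double val)
    {
        unsigned long long int *addr = (unsigned long long int *)address;
        unsigned long long int old = *addr, assumed;
        do {
            assumed = old;
            old = atomicCAS(addr, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);          // retry if another thread changed the value meanwhile
        return __longlong_as_double(old);  // like the built-in atomics, return the old value
    }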

Atomics
  • Atomics are slower than normal load/store
  • Most of these are associative operations on signed/unsigned integers:
    • quite fast for data in shared memory
    • slower for data in device memory
  • You can have the whole machine queuing on a single location in memory
  • Atomics unavailable on G80!

Example: Global Min/Max (Naïve)
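
The code for this slide is missing from the transcript; a hedged sketch of the naïve approach, with every thread hitting one global location (result is assumed to be initialised to INT_MAX by the host), is:

    __global__ void global_min_naive(const int *data, int *result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicMin(result, data[i]);   // every thread serialises on the same address
    }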

Example: Global Min/Max (Better)
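
Again the slide's code is missing; a hedged sketch of the improved version, which reduces within the block in shared memory and issues only one global atomic per block, is:

    #include <limits.h>

    __global__ void global_min_better(const int *data, int *result, int n)
    {
        __shared__ int block_min;
        if (threadIdx.x == 0) block_min = INT_MAX;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicMin(&block_min, data[i]);   // contention stays within the block
        __syncthreads();

        if (threadIdx.x == 0)
            atomicMin(result, block_min);     // one global atomic per block
    }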

Global Max/Min
  • Single value causes serial bottleneck
  • Create hierarchy of values for more parallelism
  • Performance will still be slow, so use judiciously

Atomics: Summary
  • Can’t use normal load/store for inter-thread communication because of race conditions
  • Use atomic instructions for sparse and/or unpredictable global communication
  • Decompose your data, making only very limited use of a single global sum/max/min/etc., to expose more parallelism

Topic 2: Streaming Multiprocessor Execution and Divergence
  • How a streaming multiprocessor (SM) executes threads
    • Overview of how a streaming multiprocessor works
    • SIMT Execution
    • Divergence

Scheduling Blocks onto SMs
  • Hardware schedules thread blocks onto available SMs
    • No guarantee of ordering among thread blocks
    • Hardware will schedule a new thread block as soon as a previous thread block finishes

[Figure: thread blocks 5, 27, 61, and 2001 being scheduled onto a streaming multiprocessor]

Recall: Warps
  • A warp = 32 threads launched together
    • Usually execute together as well

[Figure: hardware diagram of control units, each paired with a group of ALUs]

Mapping of Thread Blocks
  • Each thread block is mapped to one or more warps
  • The hardware schedules each warp independently

[Figure: Thread Block N (128 threads) mapped to four warps: TB N W1, TB N W2, TB N W3, TB N W4]

Thread Scheduling Example
  • SM implements zero-overhead warp scheduling
    • At any time, only one of the warps is executed by an SM
    • Warps whose next instruction has its inputs ready for consumption are eligible for execution
    • Eligible warps are selected for execution based on a prioritized scheduling policy
    • All threads in a warp execute the same instruction when selected

Control Flow Divergence
  • Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time
  • What happens if you have the following code?
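
The slide's code is not included in the transcript; a minimal example of the kind of branch it has in mind (the kernel is invented here) is:

    __global__ void divergent(float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            out[tid] = tid * 2.0f;   // taken by the even lanes of each warp
        else
            out[tid] = tid * 0.5f;   // taken by the odd lanes
    }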

Control Flow Divergence (cont.)
  • This is called warp divergence. CUDA will generate correct code to handle it, but to understand the performance implications you need to understand what CUDA does with it


Control Flow Divergence (cont.)

[Figure: a warp reaching a divergent branch; Path A and Path B are executed serially within the warp (from Fung et al., MICRO ’07)]

Control Flow Divergence (cont.)
  • Nested branches are handled as well

Control Flow Divergence (cont.)

[Figure: nested branches; Paths A, B, and C are serialized within the warp]

Control Flow Divergence (cont.)
  • You don’t have to worry about divergence for correctness
    • Mostly true, except corner cases (for example intra-warp locks)
  • You might have to think about it for performance
    • Depends on your branch conditions

Control Flow Divergence (cont.)
  • One solution: NVIDIA GPUs have predicated instructions which are carried out only if a logical flag is true.
  • In the previous example, all threads compute the logical predicate and then issue the two predicated instructions
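
As a hedged illustration (not the slide's example), a short branch like the one below can typically be compiled to predicated instructions: every thread issues both assignments, but each commits only where its flag is true:

    __global__ void predicated(float *out, const float *x)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        bool p = (x[tid] < 0.0f);      // logical predicate, computed by all threads
        float z;
        if (p) z = x[tid] - 2.0f;      // carried out only where p is true
        else   z = x[tid] + 2.0f;      // carried out only where p is false
        out[tid] = z;
    }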

Control Flow Divergence (cont.)
  • Performance drops off with the degree of divergence

Control Flow Divergence (cont.)
  • Performance drops off with the degree of divergence
    • In the worst case, you effectively lose a factor of 32 in performance if one thread needs the expensive branch while the rest do nothing
  • Another example: processing a long list of elements where, depending on run-time values, a few require very expensive processing. GPU implementation:
    • first process the list to build two sub-lists of “simple” and “expensive” elements
    • then process the two sub-lists separately (see the sketch below)
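
A hedged sketch of the first pass (the split threshold, counters, and names are all invented here): atomic counters are used to append each element to either the “simple” or the “expensive” sub-list, which can then be processed by separate, non-divergent kernels:

    __device__ int n_simple, n_expensive;   // hypothetical counters, zeroed by the host

    __global__ void split(const float *in, float *simple, float *expensive, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] > 100.0f)                                // stand-in for the "expensive" test
            expensive[atomicAdd(&n_expensive, 1)] = in[i];
        else
            simple[atomicAdd(&n_simple, 1)] = in[i];
    }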

Synchronization
  • We have already introduced __syncthreads(); which forms a barrier: all threads wait until every thread has reached this point
  • When writing conditional code, you must be careful to make sure that all threads do reach the __syncthreads();
  • Otherwise, you can end up in deadlock (see the sketch below)
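
A hedged sketch of the hazard: if the condition is not uniform across the block, some threads reach the barrier and others never do:

    __global__ void bad_sync(float *data)
    {
        if (threadIdx.x < 16) {
            data[threadIdx.x] *= 2.0f;
            __syncthreads();   // only threads 0..15 arrive here: the block can deadlock
        }
        // Safe pattern: hoist __syncthreads() out of the conditional, or make the
        // condition evaluate identically for every thread in the block.
    }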

Synchronization (cont.)
  • Fermi supports some new synchronisation instructions which are similar to __syncthreads() but have extra capabilities:
    • int __syncthreads_count(predicate)
      • counts how many predicates are true
    • int __syncthreads_and(predicate)
      • returns non-zero (true) if all predicates are true
    • int __syncthreads_or(predicate)
      • returns non-zero (true) if any predicate is true

Warp voting
  • There are similar warp voting instructions which operate at the level of a warp:
    • int __all(predicate)
      • returns non-zero (true) if all predicates in warp are true
    • int __any(predicate)
      • returns non-zero (true) if any predicate is true
    • unsigned int __ballot(predicate)
      • sets nth bit based on nth predicate
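
A hedged sketch using both a block-level and a warp-level primitive (the kernel and output arrays are invented): __syncthreads_count() returns the same count to every thread in the block, while __ballot() packs one bit per lane of the warp:

    __global__ void vote_demo(const float *x, int *block_count, unsigned int *warp_mask)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (x[tid] < 0.0f);

        int count = __syncthreads_count(pred);      // how many threads in the block voted true
        if (threadIdx.x == 0)
            block_count[blockIdx.x] = count;

        unsigned int mask = __ballot(pred);         // bit n set if lane n's predicate is true
        if (threadIdx.x % 32 == 0)
            warp_mask[tid / 32] = mask;
    }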

Topic 3: Locks
  • Use very judiciously
  • Always include a max_iter in your spinloop!
  • Decompose your data and your locks

Example: Global atomic lock
  • Problem: when a thread writes data to device memory, the order of completion is not guaranteed, so global writes may not have completed by the time the lock is unlocked

Example: Global atomic lock (cont.)
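
The lock code itself is not in the transcript; a hedged sketch of one common pattern, where thread 0 of each block spins on a global lock word with atomicCAS and issues __threadfence() before releasing (the names and the critical-section body are invented), is:

    __global__ void with_lock(int *lock, float *total, const float *block_vals)
    {
        if (threadIdx.x == 0) {
            // acquire: spin until we swap 0 -> 1 (a real kernel should bound this loop)
            while (atomicCAS(lock, 0, 1) != 0) { }

            *total += block_vals[blockIdx.x];   // critical section: one block at a time

            __threadfence();                    // make the write visible before releasing
            atomicExch(lock, 0);                // release the lock
        }
    }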

Example: Global atomic lock (cont.)
  • __threadfence_block();
    • wait until all global and shared memory writes are visible to all threads in the block
  • __threadfence();
    • wait until all global and shared memory writes are visible to all threads in the block for shared memory, and to all threads on the device for global memory

Summary
  • Lots of esoteric capabilities; don’t worry about most of them
  • It is essential to understand warp divergence; it can have a very big impact on performance
  • __syncthreads(); is vital
  • The rest can be ignored until you have a critical need; then read the documentation carefully and look for examples in the SDK

End Credits
  • Based on original material from
    • Oxford University: Mike Giles
    • Stanford University
      • Jared Hoberock, David Tarjan
  • Revision history: last updated 8/8/2011.
