Synchronization

These notes introduce:

  • Ways to achieve thread synchronization.

  • __syncthreads()

  • cudaThreadSynchronize()

ITCS 4/5010 Parallel Programming, B. Wilkinson, Jan 21, 2013. CUDASynchronization.ppt



Thread Barrier Synchronization

When we divide a computation into parallel parts to be executed concurrently by independent threads, we often need all threads to finish their part before processing the next stage of the computation.

In parallel programming, we call this barrier synchronization

– all threads wait when they reach the barrier until every thread has reached that point, and then they are all released to continue



CUDA synchronization

CUDA provides a barrier synchronization routine for the threads within each block:

__syncthreads()

This routine would be used within a kernel.

Threads wait at this point until all threads in the block have reached it, and then they are all released.

NOTE: this only synchronizes with the other threads in the same block.



Threads only synchronize with other threads in the block

Kernel code

__global__ void mykernel() {
    .
    .
    .
    __syncthreads();
    .
    .
    .
}

[Diagram: Block 0 through Block n-1 each reach their own barrier and then continue – separate barriers, one per block.]
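As a concrete illustration (added here, not from the original slides), the sketch below reverses each block's portion of an array through shared memory; the barrier guarantees every thread has finished writing to shared memory before any thread reads from it. The kernel name, array name, and the assumed block size of 256 are illustrative.

#define BLOCK_SIZE 256

__global__ void reverseWithinBlock(int *d_data) {
    __shared__ int temp[BLOCK_SIZE];                  // visible to all threads in this block
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    temp[tid] = d_data[base + tid];                   // stage 1: every thread loads one element
    __syncthreads();                                  // barrier: all loads must finish first

    d_data[base + tid] = temp[blockDim.x - 1 - tid];  // stage 2: read an element another thread loaded
}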



__syncthreads() constraints

All threads in a block must reach a particular __syncthreads() call, or deadlock occurs.

Multiple __syncthreads() calls can be used in a kernel, but each one is a distinct barrier. Hence we cannot have:

if (condition) {
    ...
    __syncthreads();
}
else {
    ...
    __syncthreads();
}

and expect threads going through different paths to be synchronized at the same barrier.

All threads must go through the if clause, or all must go through the else clause.
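A safe alternative (a sketch added here; the condition is a placeholder) is to keep the divergent work inside the branches and place a single __syncthreads() after them, so every thread in the block reaches the same barrier:

if (condition) {
    ...                // work done by one group of threads
}
else {
    ...                // work done by the other group
}
__syncthreads();       // one barrier that all threads in the block reach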



Global Kernel Barrier

Unfortunately, no global kernel barrier routine is available in CUDA.

Often we want to synchronize all the threads in the computation.

To do that, we have to use workarounds such as returning from the kernel and placing a barrier in the CPU code.

The following could be used in the CPU code:

myKernel<<<B,T>>>( … );

cudaThreadSynchronize();

which waits until all preceding commands in all “streams” have completed. cudaThreadSynchronize() is not needed if there is an existing synchronous CUDA call such as cudaMemcpy().



Achieving global synchronization through multiple kernel launches

Kernel launches are efficiently implemented:

- Minimal hardware overhead

- Little software overhead

So we could do:

for (i = 0; i < n; i++) {
    myKernel<<<B,T>>>( … );
    cudaThreadSynchronize();
}

Recursion is not allowed within a kernel, but it can be used in host code to launch kernels.
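For example, a minimal sketch of host-side recursion (the function name, kernel, and launch parameters here are illustrative, not from the slides):

void launchLevel(int level, int maxLevel, float *dev_A) {
    if (level == maxLevel) return;
    myKernel<<<B,T>>>(dev_A);                 // launch the kernel for this level
    cudaThreadSynchronize();                  // global barrier between levels
    launchLevel(level + 1, maxLevel, dev_A);  // recurse in host code to launch the next level
}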



Code Example

N-body problem

We need to compute the forces on each body in each time interval, then update the positions and velocities of the bodies, and then repeat.

for (t = 0; t < tmax; t++) {                                        // for each time period, force calculation on all bodies

    cudaMemcpy(dev_A, A, arraySize, cudaMemcpyHostToDevice);       // data to GPU

    bodyCal<<<B,T>>>(dev_A);                                        // kernel call

    cudaMemcpy(A, dev_A, arraySize, cudaMemcpyDeviceToHost);       // updated data back to host

}                                                                   // end of time period loop

No explicit synchronization is needed, as the synchronous cudaMemcpy provides it here.



Reasoning behind not having CUDA global synchronization within GPU

Expensive to implement for a large number of GPU processors.

Synchronizing only at the block level allows blocks to be executed in any order on the GPU.

Blocks can then be distributed across whatever resources a particular GPU has – so-called “transparent scalability.”



Other ways to achieve global synchronization (if it cannot be avoided)

  • The CUDA memory fence __threadfence() waits for memory operations to become visible to other threads, but it is probably not usable as a barrier by itself.

  • Write your own code for the kernel that implements global synchronization.

    How? (Using atomics and critical sections – see next; a minimal sketch follows below.)
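One well-known (but fragile) approach is an inter-block barrier built from an atomic counter. The sketch below is an illustration added here, not code from the slides; it is single-use, and it deadlocks unless every block of the grid is resident on the GPU at the same time.

__device__ volatile int g_arrived = 0;        // number of blocks that have reached the barrier

__device__ void globalBarrier(int numBlocks) {
    __syncthreads();                          // wait for all threads in this block
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_arrived, 1);      // register this block's arrival
        while (g_arrived < numBlocks) ;       // spin until every block has arrived
    }
    __syncthreads();                          // release the remaining threads in the block
}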



Asynchronous CUDA routines

  • Control returns to the host before the device has completed the requested task (a brief sketch follows this list):

  • Kernel launches

  • Memory copies between two addresses in the same device memory (device-to-device copies)

  • Host to device memory copy (<= 64KB)

  • Memory copies with Async suffix

  • Memory set function calls

  • From “CUDA C Programming Guide” October 2012, page 29.
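A brief sketch (added here; the variable names follow the earlier N-body example and are assumptions) showing that asynchronous calls need an explicit wait before the host can rely on the results:

cudaMemcpyAsync(dev_A, A, arraySize, cudaMemcpyHostToDevice, 0);  // returns before the copy finishes (A should be page-locked host memory)
myKernel<<<B,T>>>(dev_A);                                         // kernel launch also returns immediately
cudaThreadSynchronize();                                          // host blocks here until all device work has completed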



Questions

