
CUDA Lecture 4: CUDA Programming Basics

Prepared 6/22/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Parallel programming basics

Parallel Programming Basics

  • Things we need to consider:

    • Control

    • Synchronization

    • Communication

  • Parallel programming languages offer different ways of dealing with the above

CUDA Programming Basics – Slide 2


Overview

  • CUDA programming model – basic concepts and data types

  • CUDA application programming interface (API) – basics

  • Simple examples to illustrate basic concepts and functionalities

  • Performance features will be covered later

CUDA Programming Basics – Slide 3


Outline of cuda basics

Outline of CUDA Basics

  • Basic kernels and execution on GPU

  • Basic memory management

  • Coordinating CPU and GPU execution

  • See the programming guide for the full API

CUDA Programming Basics – Slide 4


Cuda c with no shader limitations

CUDA – C with no shader limitations!

  • Integrated host + device application program in C

    • Serial or modestly parallel parts in host C code

    • Highly parallel parts in device SPMD kernel C code

  • Programming model

    • Parallel code (kernel) is launched and executed on a device by many threads

    • Launches are hierarchical

      • Threads are grouped into blocks

      • Blocks are grouped into grids

    • Familiar serial code is written for a thread

      • Each thread is free to execute a unique code path

      • Built-in thread and block ID variables

CUDA Programming Basics – Slide 5


CUDA – C with no shader limitations! (continued)

[Figure: host/device execution timeline – serial host code runs, then a parallel kernel is launched on the device with KernelA<<< nBlk, nTid >>>(args); more serial host code follows, then KernelB<<< nBlk, nTid >>>(args); and so on.]

CUDA Programming Basics – Slide 6


Cuda devices and threads

CUDA Devices and Threads

  • A compute device

    • Is a coprocessor to the CPU or host

    • Has its own DRAM (device memory)‏

    • Runs many threads in parallel

    • Is typically a GPU but can also be another type of parallel processing device

  • Data-parallel portions of an application are expressed as device kernels which run on many threads

CUDA Programming Basics – Slide 7


Cuda devices and threads1

CUDA Devices and Threads

  • Differences between GPU and CPU threads

    • GPU threads are extremely lightweight

      • Very little creation overhead

    • GPU needs 1000s of threads for full efficiency

      • Multi-core CPU needs only a few

CUDA Programming Basics – Slide 8


G80 – Graphics Mode

[Figure: the G80 in graphics mode – the host feeds an input assembler, setup/raster/ZCull, and vertex, geometry and pixel thread issue units; arrays of streaming processors (SP) with texture filter (TF) units and L1 caches connect through L2 caches to frame buffer (FB) partitions, all managed by a thread processor.]

  • The future of GPUs is programmable processing

  • So – build the architecture around the processor

CUDA Programming Basics – Slide 9


G80 CUDA Mode – A Device Example

[Figure: the G80 in CUDA mode – the host feeds an input assembler and a thread execution manager; streaming multiprocessors with parallel data caches and texture units share load/store paths to global memory.]

  • Processors execute computing threads

  • New operating mode/hardware interface for computing

CUDA Programming Basics – Slide 10


High Level View

[Figure: the GPU – streaming multiprocessors with their shared memory (SMEM) and global memory – connects to the CPU and chipset over PCIe.]

CUDA Programming Basics – Slide 11


Blocks of Threads Run on a SM

[Figure: a thread has its own registers and per-thread local memory on a streaming processor; a thread block has per-block shared memory (SMEM) on a streaming multiprocessor.]

CUDA Programming Basics – Slide 12


Whole Grid Runs on GPU

[Figure: many blocks of threads spread across the streaming multiprocessors, all sharing global memory.]

CUDA Programming Basics – Slide 13


Extended c

Extended C
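  • The slide's code listing is not reproduced in this transcript. A minimal sketch of the kinds of extensions it illustrates (declaration qualifiers, built-in variables, intrinsics, runtime API and launch syntax); the kernel and variable names here are illustrative, not from the slide:

    __constant__ float coeff = 0.5f;                    // constant-memory qualifier

    __global__ void scale(float *image, int n)          // kernel qualifier
    {
      __shared__ float region[128];                     // per-block shared memory
      int i = threadIdx.x + blockIdx.x * blockDim.x;    // built-in index variables
      if (i < n) region[threadIdx.x] = image[i];
      __syncthreads();                                  // intrinsic barrier
      if (i < n) image[i] = coeff * region[threadIdx.x];
    }

    // host side: runtime API plus the <<< >>> launch syntax
    //   cudaMalloc((void **)&d_image, n * sizeof(float));
    //   scale<<<nblocks, 128>>>(d_image, n);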

CUDA Programming Basics – Slide 14


Extended C

Mark Murphy, “NVIDIA’s Experience with Open64,”
www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

[Figure: the nvcc compilation flow – integrated source (foo.cu) passes through cudacc (EDG C/C++ front end and Open64 global optimizer); the CPU host code (foo.cpp) is compiled by gcc / cl, while the GPU assembly (foo.s) is turned into G80 SASS (foo.sass) by OCG.]

CUDA Programming Basics – Slide 15


Arrays of Parallel Threads

[Figure: an array of threads with threadID 0 … 7, all running the same kernel body:]

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

  • A CUDA kernel is executed by an array of threads

    • All threads run the same code (SPMD)

    • Each thread has an ID that it uses to compute memory addresses and make control decisions

CUDA Programming Basics – Slide 16


Thread Blocks: Scalable Cooperation

[Figure: the thread array split into Thread Block 0, Thread Block 1, …, Thread Block N - 1; within each block, threadID runs 0 … 7 and every thread executes the same body:]

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

  • Divide monolithic thread array into multiple blocks

    • Threads within a block cooperate via shared memory, atomic operations and barrier synchronization

    • Threads in different blocks cannot cooperate

CUDA Programming Basics – Slide 17


Thread hierarchy

Thread Hierarchy

  • Threads launched for a parallel section are partitioned into thread blocks

    • Grid = all blocks for a given launch

  • Thread block is a group of threads that can

    • Synchronize their execution

    • Communicate via shared memory

CUDA Programming Basics – Slide 18


Blocks must be independent

Blocks Must Be Independent

  • Any possible interleaving of blocks should be valid

    • Presumed to run to completion without preemption

    • Can run in any order

    • Can run concurrently OR sequentially

  • Blocks may coordinate but not synchronize

    • Shared queue pointer: OK

    • Shared lock: BAD … can easily deadlock

  • Independence requirement gives scalability

CUDA Programming Basics – Slide 19


Basics of cuda programming

Basics of CUDA Programming

  • A CUDA program has two pieces

    • Host code on the CPU which interfaces to the GPU

    • Kernel code which runs on the GPU

  • At the host level, there is a choice of 2 APIs (Application Programming Interfaces):

    • Runtime: simpler, more convenient

    • Driver: much more verbose, more flexible, closer to OpenCL

  • We will only use the Runtime API in this course

CUDA Programming Basics – Slide 20


Basics of cuda programming1

Basics of CUDA Programming

  • At the host code level, there are library routines for:

    • memory allocation on graphics card

    • data transfer to/from device memory

      • constants

      • texture arrays (useful for lookup tables)

      • ordinary data

    • error-checking

    • timing

  • There is also a special syntax for launching multiple copies of the kernel process on the GPU.

CUDA Programming Basics – Slide 21


Block ids and thread ids

Block IDs and Thread IDs

  • Each thread uses IDs to decide what data to work on

    • Block ID: 1-D or 2-D

      • Unique within a grid

    • Thread ID: 1-D, 2-D or 3-D

      • Unique within a block

  • Dimensions set at launch

    • Can be unique for each grid

CUDA Programming Basics – Slide 22


Block ids and thread ids1

Block IDs and Thread IDs

  • Built-in variables

    • threadIdx, blockIdx

    • blockDim, gridDim

  • Simplifies memory addressing when processing multidimensional data

    • Image processing

    • Solving PDEs on volumes
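  • A sketch of how these built-ins combine into a unique global index (the kernel and array names are illustrative, not from the slide):

    __global__ void scale_by_two(float *data, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;    // global 1-D thread index
      int stride = gridDim.x * blockDim.x;              // total threads in the grid
      for (; i < n; i += stride)                        // grid-stride loop covers all n
        data[i] *= 2.0f;
    }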

CUDA Programming Basics – Slide 23


Basics of cuda programming2

Basics of CUDA Programming

  • In its simplest form launch of kernel looks like:

    kernel_routine<<<gridDim, blockDim>>>(args);

    where

    • gridDim is the number of copies of the kernel (the “grid” size)

    • blockDim is the number of threads within each copy (the “block” size)

    • args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value

  • The more general form allows gridDim and blockDim to be 2-D or 3-D to simplify application programs

CUDA Programming Basics – Slide 24


Basics of cuda programming3

Basics of CUDA Programming

  • At the lower level, when one copy of the kernel is started on a SM it is executed by a number of threads, each of which knows about:

    • some variables passed as arguments

    • pointers to arrays in device memory (also arguments)

    • global constants in device memory

    • shared memory and private registers/local variables

    • some special variables:

      • gridDim size (or dimensions) of grid of blocks

      • blockIdx index (or 2-D/3-D indices) of block

      • blockDim size (or dimensions) of each block

      • threadIdx index (or 2-D/3-D indices) of thread

CUDA Programming Basics – Slide 25


Basics of cuda programming4

Basics of CUDA Programming

  • Suppose we have 1000 blocks, and each one has 128 threads – how does it get executed?

  • On current Tesla hardware, would probably get 8 blocks running at the same time on each SM, and each block has 4 warps => 32 warps running on each SM

  • Each clock tick, SM warp scheduler decides which warp to execute next, choosing from those not waiting for

    • data coming from device memory (memory latency)

    • completion of earlier instructions (pipeline delay)

  • Programmer doesn’t have to worry about this level of detail, just make sure there are lots of threads / warps

CUDA Programming Basics – Slide 26


Basics of cuda programming5

Basics of CUDA Programming

  • In the simplest case, we have a 1-D grid of blocks, and a 1-D set of threads within each block.

  • If we want to use a 2-D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices

  • To launch the kernel we would use something like

    dim3 nthreads(16,4);

    my_new_kernel<<<nblocks,nthreads>>>(d_x);

    where dim3 is a special CUDA datatype with 3 components .x, .y, .z each initialized to 1.

CUDA Programming Basics – Slide 27


For example

For Example

  • Launch with

    dim3 dimGrid(2, 2);

    dim3 dimBlock(4, 2, 2);

    kernelFunc<<<dimGrid,dimBlock>>>(…);

  • Zoomed in on block with

    blockIdx.x = blockIdx.y = 1,

    blockDim.x = 4,

    blockDim.y = blockDim.z = 2

  • Each thread in block has coordinates (threadIdx.x, threadIdx.y, threadIdx.z)

CUDA Programming Basics – Slide 28


Basics of cuda programming6

Basics of CUDA Programming

  • A similar approach is used for 3-D threads and/or 2-D grids. This can be very useful in 2-D / 3-D finite difference applications.

  • How do 2-D / 3-D threads get divided into warps?

    • 1-D thread ID defined by

      threadIdx.x +

      threadIdx.y * blockDim.x +

      threadIdx.z * blockDim.x * blockDim.y

      and this is then broken up into warps of size 32.

CUDA Programming Basics – Slide 29


Cuda memory model overview

CUDA Memory Model Overview

  • Global memory

    • Main means of communicating R/W data between host and device

    • Contents visible to all threads

    • Long latency access

  • We will focus on global memory for now

    • Constant and texture memory will come later

[Figure: the CUDA memory model – the host exchanges data with device global memory; each block in the grid has its own shared memory, and each thread its own registers.]

CUDA Programming Basics – Slide 30


Memory Model

[Figure: sequential kernels (Kernel 0, Kernel 1, …) all read and write the same per-device global memory.]

CUDA Programming Basics – Slide 31


Cuda api highlights easy and lightweight

CUDA API Highlights: Easy and Lightweight

  • The API is an extension to the ANSI C programming language

    • Low learning curve

  • The hardware is designed to enable lightweight runtime and driver

    • High performance

CUDA Programming Basics – Slide 32


Memory spaces

Memory Spaces

  • CPU and GPU have separate memory spaces

    • Data is moved across the PCIe bus

    • Use functions to allocate/set/copy memory on GPU

      • Very similar to corresponding C functions

  • Pointers are just addresses

    • Can’t tell from the pointer value whether the address is on CPU or GPU

    • Must exercise care when dereferencing

      • Dereferencing CPU pointer on GPU will likely crash and vice-versa

CUDA Programming Basics – Slide 33


Cuda device memory allocation

CUDA Device Memory Allocation

  • cudaMalloc()

    • Allocates object in the device global memory

    • Requires two parameters

      • Address of a pointer to the allocated object

      • Size of allocated object

  • cudaFree()

    • Frees objects from device global memory

      • Pointer to freed object

[Figure: same memory-model diagram as before – host, device global memory, and per-block shared memory and registers.]

CUDA Programming Basics – Slide 34


Cuda device memory allocation1

CUDA Device Memory Allocation

  • Code example

    • Allocate a 64-by-64 single precision float array

    • Attach the allocated storage to Md

      • “d” is often used to indicate a device data structure
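  • The slide's listing is not shown in this transcript; a sketch consistent with its description (WIDTH and Md follow the running example, the exact code is assumed):

    const int WIDTH = 64;
    float *Md;
    int size = WIDTH * WIDTH * sizeof(float);   // bytes in a 64-by-64 float array

    cudaMalloc((void **)&Md, size);             // allocate in device global memory
    // ... use Md in kernels ...
    cudaFree(Md);                               // release the device memory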

CUDA Programming Basics – Slide 35


Cuda host device data transfer

CUDA Host-Device Data Transfer

  • cudaMemcpy()

    • Memory data transfer

    • Requires four parameters

      • Pointer to destination

      • Pointer to source

      • Number of bytes copied

      • Type of transfer

        • Host to host

        • Host to device

        • Device to host

        • Device to device

  • Asynchronous transfer

[Figure: same memory-model diagram as before – the host transfers data to and from the device global memory used by the grid of blocks.]

CUDA Programming Basics – Slide 36


Memory Model

[Figure: cudaMemcpy() moves data between host memory and the memory of device 0 or device 1.]

  • cudaMemcpy()

    • Returns after the copy is complete

    • Blocks CPU thread until all bytes have been copied

    • Doesn’t start copying until previous CUDA calls complete

  • Non-blocking copies are also available

CUDA Programming Basics – Slide 37


Cuda host device data transfer1

CUDA Host-Device Data Transfer

  • Code example

    • Transfer a 64-by-64 single precision float array

    • M is in host memory and Md is in device memory

    • cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice are symbolic constants
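  • The slide's listing is not shown in this transcript; a sketch consistent with its description (M, Md and size follow the running example):

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host M  -> device Md
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device Md -> host M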

CUDA Programming Basics – Slide 38


First simple cuda example

First Simple CUDA Example
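  • The example's code is not reproduced in this transcript. A minimal complete sketch in the same spirit (kernel and variable names are assumptions, not the original listing):

    #include <stdio.h>
    #include <stdlib.h>

    __global__ void my_first_kernel(float *x)
    {
      int tid = threadIdx.x + blockDim.x * blockIdx.x;   // unique global index
      x[tid] = (float) threadIdx.x;
    }

    int main(void)
    {
      const int nblocks = 2, nthreads = 4, n = nblocks * nthreads;
      float *h_x = (float *) malloc(n * sizeof(float));
      float *d_x;
      cudaMalloc((void **)&d_x, n * sizeof(float));

      my_first_kernel<<<nblocks, nthreads>>>(d_x);
      cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

      for (int i = 0; i < n; i++) printf("%d  %f\n", i, h_x[i]);

      cudaFree(d_x);
      free(h_x);
      return 0;
    }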

CUDA Programming Basics – Slide 39


Code executed on gpu

Code Executed on GPU

  • C/C++ with some restrictions

    • Can only access GPU memory

    • No variable number of arguments

    • No static variables

    • No recursion

    • No dynamic polymorphism

  • Must be declared with a qualifier

    • __global__ : launched by CPU, cannot be called from GPU

    • __device__ : called from other GPU functions, cannot be called by the CPU

    • __host__ : can be called by the CPU
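  • A sketch of the three qualifiers in use (function names are illustrative):

    __device__ float square(float x) { return x * x; }       // GPU-only helper

    __global__ void square_all(float *v, int n)               // kernel: launched by the CPU
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) v[i] = square(v[i]);
    }

    __host__ float square_on_cpu(float x) { return x * x; }   // ordinary CPU function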

CUDA Programming Basics – Slide 40


Cuda function declarations

CUDA Function Declarations

  • __global__ defines a kernel function

    • Must return void

  • __device__ and __host__ can be used together

    • Sample use: overloading operators

CUDA Programming Basics – Slide 41


Cuda function declarations1

CUDA Function Declarations

__device__ int reduction_lock = 0;

  • The __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU.

  • The variable can be read and modified by any kernel

  • Its lifetime is the lifetime of the whole application

  • Can also declare arrays of fixed size

  • Can be read and written by host code using the special routines cudaMemcpyToSymbol and cudaMemcpyFromSymbol, or with standard cudaMemcpy in combination with cudaGetSymbolAddress
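  • A sketch of host access to such a variable (the value copied is illustrative):

    int zero = 0;
    cudaMemcpyToSymbol(reduction_lock, &zero, sizeof(int));     // host -> device symbol
    cudaMemcpyFromSymbol(&zero, reduction_lock, sizeof(int));   // device symbol -> host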

CUDA Programming Basics – Slide 42


Cuda function declarations2

CUDA Function Declarations

  • __device__ functions cannot have their address taken

  • For functions executed on the device

    • No recursion

    • No static variable declarations inside the function

    • No variable number of arguments

CUDA Programming Basics – Slide 43


Calling a kernel function thread creation

Calling a Kernel Function – Thread Creation

  • As seen, a kernel function must be called with an execution configuration (see the sketch below)

  • Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
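  • A sketch of such a launch, including the optional dynamic shared-memory size (the values and kernel name are illustrative):

    dim3 dimGrid(100, 50);               // 5000 thread blocks
    dim3 dimBlock(4, 8, 8);              // 256 threads per block
    size_t sharedMemBytes = 64;          // optional dynamic shared memory per block
    KernelFunc<<<dimGrid, dimBlock, sharedMemBytes>>>(…);
    cudaThreadSynchronize();             // block the host until the kernel has finished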

CUDA Programming Basics – Slide 44


Basics of cuda programming7

Basics of CUDA Programming

  • The kernel code looks fairly normal once you get used to two things:

    • code is written from the point of view of a single thread

      • quite different to OpenMP multithreading

      • similar to MPI, where you use the MPI “rank” to identify the MPI process

      • all local variables are private to that thread

    • need to think about where each variable lives

      • any operation involving data in the device memory forces its transfer to/from registers in the GPU

      • no cache on old hardware so a second operation with the same data will force a second transfer

      • often better to copy the value into a local register variable

CUDA Programming Basics – Slide 45


Next cuda example vector addition

Next CUDA Example: Vector Addition
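  • The code itself is not reproduced in the transcript. A sketch matching the description on the next slide (kernel and array names are assumptions):

    __global__ void vecAdd(const float *A, const float *B, float *C)
    {
      int i = threadIdx.x + blockDim.x * blockIdx.x;   // one element per thread
      C[i] = A[i] + B[i];
    }

    // host side, assuming n is a multiple of 256:
    //   vecAdd<<<n / 256, 256>>>(d_A, d_B, d_C);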

CUDA Programming Basics – Slide 46


Next cuda example vector addition1

Next CUDA Example: Vector Addition

  • The __global__ identifier says it’s a kernel function

  • Each thread sets one element of the C[] array

  • Within each block of threads, threadIdx.x ranges from 0 to blockDim.x-1, so each thread has a unique value for i

CUDA Programming Basics – Slide 47


Kernel variations and output

Kernel Variations and Output

CUDA Programming Basics – Slide 48


Next cuda example kernel with 2 d addressing

Next CUDA Example: Kernel with 2-D Addressing
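  • The kernel itself is not reproduced in the transcript. A sketch of 2-D addressing with a 2-D grid of 2-D blocks (names are assumptions):

    __global__ void kernel2d(float *out, int nx, int ny)
    {
      int i = threadIdx.x + blockIdx.x * blockDim.x;    // column index
      int j = threadIdx.y + blockIdx.y * blockDim.y;    // row index
      if (i < nx && j < ny)
        out[j * nx + i] = (float)(j * nx + i);          // row-major 1-D address
    }

    // dim3 dimBlock(16, 16);
    // dim3 dimGrid((nx + 15) / 16, (ny + 15) / 16);
    // kernel2d<<<dimGrid, dimBlock>>>(d_out, nx, ny);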

CUDA Programming Basics – Slide 49


A simple running example matrix multiplication

A Simple Running Example: Matrix Multiplication

  • A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs

    • Leave shared memory usage until later

    • Local, register usage

    • Thread ID usage

    • Memory data transfer API between host and device

    • Assume square matrix for simplicity

CUDA Programming Basics – Slide 50


Programming Model: Square Matrix Multiplication Example

  • P = M × N of size WIDTH-by-WIDTH

  • Without tiling

    • One thread calculates one element of P

    • M and N are loaded WIDTH times from global memory

[Figure: the WIDTH-by-WIDTH matrices M, N and P.]

CUDA Programming Basics – Slide 51


Memory Layout of a Matrix in C

[Figure: a 4 × 4 matrix M stored in row-major order – the linear layout is M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 … M3,3.]

CUDA Programming Basics – Slide 52


Memory Layout of a Matrix in the Textbook

[Figure: the same linear layout with the textbook’s subscript convention – the linear order reads M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 … M3,3.]


Step 1: Matrix Multiplication – A Simple Host Version in C

[Figure: for each result element P(i, j), the host code loops over k across row i of M and down column j of N (all WIDTH-by-WIDTH).]

CUDA Programming Basics – Slide 54


Step 1 matrix multiplication a simple host version in c1

Step 1: Matrix Multiplication – A Simple Host Version in C
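  • The listing is not reproduced in the transcript. A sketch consistent with the slide's description (a double is used as the accumulator):

    void MatrixMulOnHost(float *M, float *N, float *P, int Width)
    {
      for (int i = 0; i < Width; ++i)            // row of P
        for (int j = 0; j < Width; ++j) {        // column of P
          double sum = 0;
          for (int k = 0; k < Width; ++k)
            sum += M[i * Width + k] * N[k * Width + j];
          P[i * Width + j] = (float) sum;
        }
    }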

CUDA Programming Basics – Slide 55


Step 2 input matrix data transfer host side code

Step 2: Input Matrix Data Transfer (Host-Side Code)
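  • The listing is not reproduced in the transcript. A sketch of the host-side allocation and input transfers (names follow the running example; M, N, P and Width are assumed to be defined on the host):

    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and copy M and N to it
    cudaMalloc((void **)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate Pd to hold the result on the device
    cudaMalloc((void **)&Pd, size);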

CUDA Programming Basics – Slide 56


Step 3 output matrix data transfer host side code

Step 3: Output Matrix Data Transfer (Host-Side Code)
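  • A sketch of the host-side output transfer and cleanup, continuing the same example (names are assumed):

    // 2. Kernel invocation (shown in Step 5)

    // 3. Copy the result back and free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);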

CUDA Programming Basics – Slide 57


Step 4 kernel function overview

Step 4: Kernel Function (Overview)

[Figure: each thread uses (threadIdx.x, threadIdx.y) to pick one column of Nd and one row of Md, looping over k to produce one element of Pd (all WIDTH-by-WIDTH).]

CUDA Programming Basics – Slide 58


Step 4 kernel function

Step 4: Kernel Function
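  • The kernel listing is not reproduced in the transcript. A sketch of the single-block version the slides describe (one thread per Pd element, indexed by threadIdx only):

    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
      float Pvalue = 0;                                   // accumulator kept in a register
      for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];     // row of Md
        float Nelement = Nd[k * Width + threadIdx.x];     // column of Nd
        Pvalue += Melement * Nelement;
      }
      Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;     // one result element per thread
    }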

CUDA Programming Basics – Slide 59


Step 5 kernel invocation host side code

Step 5: Kernel Invocation (Host-Side Code)
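  • A sketch of the launch for the single-block version (the whole matrix is handled by one Width-by-Width block; names follow the running example):

    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);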

CUDA Programming Basics – Slide 60


Only one thread block used

Only One Thread Block Used

  • One block of threads computes matrix Pd

    • Each thread computes one element of Pd

  • Each thread

    • Loads a row of matrix Md

    • Loads a column of matrix Nd

    • Performs one multiply and addition for each pair of Md and Nd elements

    • Compute to off-chip memory access ratio close to 1:1 (not very high)

  • Size of matrix limited by the number of threads allowed in a thread block

CUDA Programming Basics – Slide 61


Only One Thread Block Used

[Figure: Grid 1 contains a single Block 1; each thread, e.g. thread (2, 2), computes one element of the WIDTH-by-WIDTH Pd from one row of Md and one column of Nd.]

CUDA Programming Basics – Slide 62


Handling square matrices with arbitrary size

Handling Square Matrices with Arbitrary Size

  • Have each 2-D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix

    • Each has (TILE_WIDTH)² threads

  • Generate a 2-D grid of (WIDTH / TILE_WIDTH)² blocks

  • You still need to put a loop around the kernel call for cases where WIDTH / TILE_WIDTH is greater than the max grid size (64K)

CUDA Programming Basics – Slide 63


Matrix Multiplication Using Multiple Blocks

  • Break Pd up into tiles

  • Each block calculates one tile

    • Each thread calculates one element

    • Block size equals tile size

[Figure: Pd partitioned into TILE_WIDTH-by-TILE_WIDTH tiles; the block indices (bx, by) select a tile and the thread indices (tx, ty) select an element within it.]

CUDA Programming Basics – Slide 64


A Small Example: Multiplication

[Figure: with TILE_WIDTH = 2, the 4 × 4 result Pd is split into four 2 × 2 tiles; Block(0,0), Block(1,0), Block(0,1) and Block(1,1) each compute one tile from the corresponding rows of Md and columns of Nd.]

CUDA Programming Basics – Slide 65


Revised matrix multiplication kernel using multiple blocks

Revised Matrix Multiplication Kernel Using Multiple Blocks
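  • The revised listing is not reproduced in the transcript. A sketch consistent with the tiling scheme just described (TILE_WIDTH is assumed to be a compile-time constant, e.g. #define TILE_WIDTH 16):

    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
      int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;   // global row of the Pd element
      int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;   // global column of the Pd element

      float Pvalue = 0;
      for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
      Pd[Row * Width + Col] = Pvalue;
    }

    // dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);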

CUDA Programming Basics – Slide 66


Cuda thread block

CUDA Thread Block

  • All threads in a block execute the same kernel program (SPMD)

  • Programmer declares block:

    • Block size 1 to 512 concurrent threads

    • Block shape 1-D, 2-D, or 3-D

    • Block dimensions in threads

  • Threads have thread id numbers within block

    • Thread program uses thread id to select work and address shared data

[Figure: a CUDA thread block – threads with IDs 0, 1, 2, 3, …, m all running the same thread program. Courtesy: John Nickolls, NVIDIA]

CUDA Tools and Threads – Slide 67


Cuda thread block1

CUDA Thread Block

  • Threads in the same block share data and synchronize while doing their share of the work

  • Threads in different blocks cannot cooperate

    • Each block can execute in any order relative to other blocks!

[Figure: the same thread-block diagram as the previous slide. Courtesy: John Nickolls, NVIDIA]

CUDA Tools and Threads – Slide 68


Transparent Scalability

  • Hardware is free to assign blocks to any processor at any time

    • A kernel scales across any number of parallel processors

    • Each block can execute in any order relative to other blocks

[Figure: over time, the same kernel grid of blocks 0–7 runs two blocks at a time on a small device and four blocks at a time on a larger device.]

CUDA Tools and Threads – Slide 69


G80 CUDA Mode – A Review

[Figure: the same G80 CUDA-mode diagram as before – host, input assembler, thread execution manager, streaming multiprocessors with parallel data caches and texture units, load/store paths to global memory.]

  • Processors execute computing threads

  • New operating mode/hardware interface for computing

CUDA Tools and Threads – Slide 70


G80 example executing thread blocks

G80 Example: Executing Thread Blocks

  • Threads are assigned to streaming multiprocessors (SMs) in block granularity

    • Up to 8 blocks to each SM as resource allows

    • Each SM in G80 can take up to 768 threads

      • Could be 256 (threads/block) × 3 blocks

      • Or 128 (threads/block) × 6 blocks, etc.

  • Threads run concurrently

    • Each SM maintains thread/block id numbers

    • Each SM manages/schedules thread execution

CUDA Tools and Threads – Slide 71


G80 Example: Executing Thread Blocks

[Figure: blocks are distributed to SM 0 and SM 1; each SM holds the threads t0 t1 t2 … tm of its resident blocks in its MT issue unit, SPs and shared memory – flexible resource allocation.]

CUDA Tools and Threads – Slide 72


G80 example thread scheduling

G80 Example: Thread Scheduling

  • Each block is executed as 32-thread warps

    • An implementation decision, not part of the CUDA programming model

    • Warps are scheduling units in an SM

  • If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?

    • Each block is divided into 256/32 = 8 warps

    • There are 8 × 3 = 24 warps

CUDA Tools and Threads – Slide 73


G80 Example: Thread Scheduling

[Figure: warps of 32 threads (t0 t1 t2 … t31) from Block 1 and Block 2 are scheduled onto a streaming multiprocessor with an instruction L1 cache, instruction fetch/dispatch, shared memory, eight SPs and two SFUs.]

CUDA Tools and Threads – Slide 74


G80 example thread scheduling2

G80 Example: Thread Scheduling

  • Each SM implements zero-overhead warp scheduling

    • At any time, only one of the warps is executed by an SM

    • Warps whose next instruction has its operands ready for consumption are eligible for execution

    • Eligible warps are selected for execution on a prioritized scheduling policy

    • All threads in a warp execute the same instruction when selected

CUDA Tools and Threads – Slide 75


G80 block granularity considerations

G80 Block Granularity Considerations

  • For matrix multiplication using multiple blocks, should I use 8 × 8, 16 × 16 or 32 × 32 blocks?

    • For 8 × 8, we have 64 threads per Block. Since each SM can take up to 768 threads, that would be 12 Blocks. However, since each SM can only take up to 8 Blocks, only 512 threads will go into each SM!

    • For 16 × 16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.

    • For 32 × 32, we have 1024 threads per Block. Not even one can fit into an SM!

CUDA Tools and Threads – Slide 76


Application programming interface

Application Programming Interface

  • The API is an extension to the C programming language

  • It consists of:

    • Language extensions

      • To target portions of the code for execution on the device

    • A runtime library split into:

      • A common component providing built-in vector types and a subset of the C runtime library in both host and device codes

      • A host component to control and access one or more devices from the host

      • A device component providing device-specific functions

CUDA Tools and Threads – Slide 77


Language extensions built in variables

Language Extensions: Built-in Variables

  • dim3 gridDim;

    • Dimensions of the grid in blocks (gridDim.z unused)

  • dim3 blockDim;

    • Dimensions of the block in threads

  • dim3 blockIdx;

    • Block index within the grid

  • dim3 threadIdx;

    • Thread index within the block

CUDA Tools and Threads – Slide 78


Common runtime component mathematical functions

Common Runtime Component: Mathematical Functions

  • pow, sqrt, cbrt, hypot

  • exp, exp2, expm1

  • log, log2, log10, log1p

  • sin, cos, tan, asin, acos, atan, atan2

  • sinh, cosh, tanh, asinh, acosh, atanh

  • ceil, floor, trunc, round

  • Etc.

    • When executed on the host, a given function uses the C runtime implementation if available

    • These functions are only supported for scalar types, not vector types

CUDA Tools and Threads – Slide 79


Common runtime component mathematical functions1

Common Runtime Component: Mathematical Functions

  • Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only intrinsic version (e.g. __sinf(x))

    • __powf

    • __logf, __log2f, __log10f

    • __expf

    • __sinf, __cosf, __tanf

CUDA Tools and Threads – Slide 80


Host runtime component

Host Runtime Component

  • Provides functions to deal with:

    • Device management (including multi-device systems)

    • Memory management

    • Error handling

  • Initializes the first time a runtime function is called

  • A host thread can invoke device code on only one device

    • Multiple host threads required to run on multiple devices

CUDA Tools and Threads – Slide 81


Device runtime component synchronization function

Device Runtime Component: Synchronization Function

  • void __syncthreads();

  • Synchronizes all threads in a block

  • Once all threads have reached this point, execution resumes normally

  • Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory

  • Allowed in conditional constructs only if the conditional is uniform across the entire thread block
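  • A minimal sketch of the usual write / barrier / read pattern (kernel and array names are illustrative; assumes blockDim.x ≤ 256):

    __global__ void reverse_block(const float *in, float *out)
    {
      __shared__ float tile[256];                       // per-block shared memory
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = in[i];
      __syncthreads();                                  // every write visible to every thread
      out[i] = tile[blockDim.x - 1 - threadIdx.x];      // safe to read other threads' data
    }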

CUDA Tools and Threads – Slide 82


Final thoughts

Final Thoughts

  • memory allocation

    cudaMalloc((void **)&xd, nbytes);

  • data copying

    cudaMemcpy(xh, xd, nbytes, cudaMemcpyDeviceToHost);

  • reminder: using d (h) suffixes to distinguish arrays on the device (host) is not mandatory, just helpful labelling

  • kernel routine is declared by __global__ prefix, and is written from point of view of a single thread

CUDA Programming Basics – Slide 83


End credits

End Credits

  • Reading: Chapters 3 and 4, “Programming Massively Parallel Processors” by Kirk and Hwu.

  • Based on original material from

    • The University of Illinois at Urbana-Champaign

      • David Kirk, Wen-mei W. Hwu

    • Oxford University: Mike Giles

    • Stanford University

      • Jared Hoberock, David Tarjan

  • Revision history: last updated 6/22/2011.

CUDA Programming Basics – Slide 84

