Parallel programming with cuda l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 25

Parallel Programming with CUDA PowerPoint PPT Presentation


  • 176 Views
  • Uploaded on
  • Presentation posted in: General

Parallel Programming with CUDA. Matthew Guidry Charles McClendon. Introduction to CUDA. CUDA is a platform for performing massively parallel computations on graphics accelerators CUDA was developed by NVIDIA It was first available with their G8X line of graphics cards

Download Presentation

Parallel Programming with CUDA

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Parallel programming with cuda l.jpg

Parallel Programming with CUDA

Matthew Guidry

Charles McClendon


Introduction to cuda l.jpg

Introduction to CUDA

  • CUDA is a platform for performing massively parallel computations on graphics accelerators

  • CUDA was developed by NVIDIA

  • It was first available with their G8X line of graphics cards

  • Approximately 1 million CUDA capable GPUs are shipped every week

  • CUDA presents a unique opportunity to develop widely-deployed parallel applications


Slide3 l.jpg

CUDA

  • Because of the Power Wall, Latency Wall, etc (free lunch is over), we must find a way to keep our processor intensive programs from slowing down to a crawl

  • With CUDA developments it is possible to do things like simulating Networks of Brain Neurons

  • CUDA brings the possibility of ubiquitous supercomputing to the everyday computer…


Slide4 l.jpg

CUDA

  • CUDA is supported on all of NVIDIA’s G8X and above graphics cards

  • The current CUDA GPU Architecture is branded Tesla

  • 8-series GPUs offer 50-200 GFLOPS


Cuda compilation l.jpg

CUDA Compilation

  • As a programming model, CUDA is a set of extensions to ANSI C

  • CPU code is compiled by the host C compiler and the GPU code (kernel) is compiled by the CUDA compiler. Separate binaries are produced


Cuda stack l.jpg

CUDA Stack


Limitations of cuda l.jpg

Limitations of CUDA

  • Tesla does not fully support IEEE spec for double precision floating point operations

  • Code only supported on NVIDIA hardware

  • No use of recursive functions (can workaround)

  • Bus latency between host CPU and GPU

    (Although double precision will be resolved with Fermi)


Thread hierarchy l.jpg

Thread Hierarchy

Thread – Distributed by the CUDA runtime

(identified by threadIdx)

Warp – A scheduling unit of up to 32 threads

Block – A user defined group of 1 to 512 threads.

(identified by blockIdx)

Grid – A group of one or more blocks. A grid is created for each CUDA kernel function


Cuda memory hierarchy l.jpg

CUDA Memory Hierarchy

  • The CUDA platform has three primary memory types

    Local Memory – per thread memory for automatic variables and register spilling.

    Shared Memory – per block low-latency memory to allow for intra-block data sharing and synchronization. Threads can safely share data through this memory and can perform barrier synchronization through _ _syncthreads()

    Global Memory – device level memory that may be shared between blocks or grids


Moving data l.jpg

Moving Data…

CUDA allows us to copy data from one memory type to another.

This includes dereferencing pointers, even in the host’s memory (main system RAM)

To facilitate this data movement CUDA provides cudaMemcpy()


Optimizing code for cuda l.jpg

Optimizing Code for CUDA

  • Prevent thread starvation by breaking your problem down (128 execution units are available for use, thousands of threads may be in flight)

  • Utilize shared memory and avoid latency problems (communicating with system memory is slow)

  • Keep in mind there is no built-in way to synchronize threads in different blocks

  • Avoid thread divergence in warps by blocking threads with similar control paths


Code example l.jpg

Code Example

Will be explained more in depth later…


Kernel functions l.jpg

Kernel Functions

  • A kernel function is the basic unit of work within a CUDA thread

  • Kernel functions are CUDA extensions to ANSI C that are compiled by the CUDA compiler and the object code generator


Kernel limitations l.jpg

Kernel Limitations

  • There must be no recursion; there’s no call stack

  • There must no static variable declarations

  • Functions must have a non-variable number of arguments


Cuda warp l.jpg

CUDA Warp

  • CUDA utilizes SIMT (Single Instruction Multiple Thread)

  • Warps are groups of 32 threads. Each warp receives a single instruction and “broadcasts” it to all of its threads.

  • CUDA provides “zero-overhead” warp and thread scheduling. Also, the overhead of thread creation is on the order of 1 clock.

  • Because a warp receives a single instruction, it will diverge and converge as each thread branches independently


Cuda hardware l.jpg

CUDA Hardware

  • The primary components of the Tesla architecture are:

    • Streaming Multiprocessor (The 8800 has 16)

    • Scalar Processor

    • Memory hierarchy

    • Interconnection network

    • Host interface


Streaming multiprocessor sm l.jpg

Streaming Multiprocessor (SM)

  • - Each SM has 8 Scalar Processors (SP)

  • IEEE 754 32-bit floating point support (incomplete support)

  • - Each SP is a 1.35 GHz processor (32 GFLOPS peak)

  • - Supports 32 and 64 bit integers

  • - 8,192 dynamically partitioned 32-bit registers

  • - Supports 768 threads in hardware (24 SIMT warps of 32 threads)

  • Thread scheduling done in hardware

  • 16KB of low-latency shared memory

  • 2 Special Function Units (reciprocal square root, trig functions, etc)

Each GPU has 16 SMs…


The gpu l.jpg

The GPU


Scalar processor l.jpg

Scalar Processor

  • Supports 32-bit IEEE floating point instructions:

    FADD, FMAD, FMIN, FMAX, FSET, F2I, I2F

  • Supports 32-bit integer operations

    IADD, IMUL24, IMAD24, IMIN, IMAX, ISET, I2I, SHR, SHL, AND, OR, XOR

  • Fully pipelined


Code example revisited l.jpg

Code Example: Revisited


Myths about cuda l.jpg

Myths About CUDA

  • GPUs are the only processors in a CUDA application

    • The CUDA platform is a co-processor, using the CPU and GPU

  • GPUs have very wide (1000s) SIMD machines

    • No, a CUDA Warp is only 32 threads

  • Branching is not possible on GPUs

    • Incorrect.

  • GPUs are power-inefficient

    • Nope, performance per watt is quite good

  • CUDA is only for C or C++ programmers

    • Not true, there are third party wrappers for Java, Python, and more


Different types of cuda applications l.jpg

Different Types of CUDA Applications


Future developments of cuda l.jpg

Future Developments of CUDA

  • The next generation of CUDA, called “Fermi,” will be the standard on the GeForce 300 series

  • Fermi will have full support IEEE 754 double precision

  • Fermi will natively support more programming languages

  • Also, there is a new project, OpenCL that seeks to provide an abstraction layer over CUDA and similar platforms (AMD’s Stream)


Things to ponder l.jpg

Things to Ponder…

  • Is CUDA better than Cell??

  • How do I utilize 12,000 threads??

  • Is CUDA really relevant anyway, in world where web applications are so popular??


Parallel programming with cuda25 l.jpg

“Parallel Programming with CUDA”

By: Matthew Guidry

Charles McClendon


  • Login