
The Fundamentals of GPU Technology and CUDA Programming

Nicholas Lykins

Kentucky State University

May 7, 2012

Outline



  • Introduction

    • Why pursue GPU accelerated computing?

    • Performance figures

  • Historical background

    • Graphics rendering pipeline

    • History of GPU technology

    • NVIDIA and GPU implementations

    • Alternative GPU processing frameworks

  • CUDA

    • Background and available libraries

    • Terminology

    • Architectural design

    • Syntax

  • Hands-on CUDA sample demonstration

    • Line by line illustration of code execution

    • Animated execution pipeline for sample application

  • Conclusion and future outlook

Thesis Guidelines

  • Initial goal: Demonstrate the potential of GPU technology to better serve the data processing needs of the scientific community.

  • Objectives

    • Deliver an account of the history of GPU technology

    • Provide an overview of NVIDIA’s CUDA framework

    • Demonstrate the motivation for scientists to pursue GPU acceleration and apply it to their own scientific disciplines

High-Performance Computing

  • Multi-Core Processing

  • GPU Acceleration

  • …How are they different?

  • Hardware differences: CPU vs. GPU

Hardware Review

  • CPU (SISD: Single Instruction, Single Data)

    • Control unit, arithmetic and logic unit, internal registers, internal data bus

    • Speed limitations

    • Each instruction operates on one data element at a time

  • GPU (SIMD: Single Instruction, Multiple Data)

    • Many processing cores and onboard memory

    • Parallel execution of each core

    • Each instruction operates on many data elements at once

Performance Trends

  • For large-scale input data, GPU processing time is measurably shorter than comparable CPU processing time.

Performance Trends, Continued

GPU Technology – Pipeline Overview

  • Graphics rendering pipeline

    • Entire process through which an image is generated by a graphics processing device

    • Vertex calculations

    • Color generation

    • Shadows and lighting

  • Shaders

    • Specialized programs executed on the graphics processing hardware, each producing a particular aspect of the resulting image

Traditional Pipelining Process

  • Traditional pipelining process

    • System collects data to be graphically represented

      • Modeling transformations within the world space

      • Vertices are “shaded” according to various properties

        • Lighting, materials, textures

      • Viewing transformation is performed – reorienting the graphical object with respect to the human eye

      • Clipping is performed, discarding geometry that lies outside the view frustum

Traditional Pipelining, Continued

  • The three-dimensional scene is then rendered onto a two-dimensional viewing plane, or screen space

  • Rasterization takes place, in which the continuous geometric representation of objects is translated into a set of discrete fragments for a particular display

    • Color, transparency, and depth

  • Fragments are stored within the frame buffer, where Z-buffering and alpha blending determine how each pixel finally appears on the screen.

Graphics Processing APIs (Application Programming Interfaces)

  • OpenGL

    • OpenGL 1.0 first developed by Silicon Graphics in 1992.

    • First middle layer developed for interpreting between operating system and underlying hardware.

    • Industry-wide standard was implemented for graphics development, with each vendor crafting their hardware architecture with those standards in mind.

    • Cross-platform compatibility

  • DirectX

    • Developed by Microsoft employees Craig Eisler, Alex St. John, and Eric Engstrom in 1995 to give programmers low-level access within Windows’ restricted memory space.

    • Set of related APIs (Direct3D, DirectDraw, DirectSound) that enable multimedia development.

    • Vendor provides device driver that enables compatibility for its own hardware across all Windows systems.

    • Restricted to Windows only.

GeForce 256

  • Announced in August of 1999 and marketed by NVIDIA as the world’s first GPU.

  • Integration of all graphics processing actions onto a single chip.

  • Implemented with a fixed function rendering pipeline

OpenGL 1.x Fixed Pipeline

Programmable Pipeline

  • OpenGL 2.0

    • Programmable shaders

    • Programmers could write unique instructions for accessing hardware functionality

  • Programmability enabled by “shading languages”

    • ARB assembly language

      • Low-level, assembly-based language for directly interfacing with hardware elements

      • Unintuitive and difficult to use effectively

    • GLSL (OpenGL Shading Language)

      • High-level language derived from C

      • High-level code is compiled by the graphics driver into corresponding low-level instructions for the hardware

    • Cg

      • High-level shader language designed by NVIDIA

      • Compiles into assembly-level or GLSL code for use with OpenGL and DirectX

Programmable Pipeline, Continued

G80 Architecture

  • Released in November of 2006, first implemented within the GeForce 8800.

  • First architecture to implement the CUDA framework, and NVIDIA’s first unified graphics rendering pipeline

    • Vertex and fragment shaders integrated as one hardware component

    • Programmability given over individual processing elements on the device

  • Scalability based on targeted consumer market

    • Proportions of processing cores, memory, etc.

G80 Architecture, Continued

G80 Architecture, Continued

  • GeForce 8800 GTX

    • Each “tile” represents a separate multiprocessor

      • Eight streaming cores per multiprocessor, 16 multiprocessors per card

      • Shared L1 cache per pair of tiles

      • Texture handling units attached to each tile

    • Recursive method for handling graphics rendering

      • Output data for one core becomes input data for another

    • Six discrete memory partitions, each 64-bit, totalling to a 384-bit interface.

    • Memory interface width and memory size vary based on the specific G80 device.

Discrete vs. Unified Architecture

Fermi Architecture

  • NVIDIA’s next major GPU architecture after the G80 and its GT200 refresh (released in June of 2008); Fermi was announced in 2009, with the first products shipping in 2010.

  • Most recent architecture until Kepler was released in March of 2012.

  • Rebranding of streaming processor cores as CUDA cores.

  • Overall superior design in terms of performance and computational precision

Fermi Architecture, Continued

  • Core count increased from 240 (in the preceding GT200 generation) to 512.

  • 32 cores per streaming multiprocessor, across 16 streaming multiprocessors

  • Memory interface similar to the G80: six 64-bit memory partitions, totalling a 384-bit memory interface.

  • 64 KB of configurable on-chip memory (shared memory plus L1 cache) per streaming multiprocessor

Fermi Architecture, Continued

  • Unified memory address space spanning thread-local, block-shared, and global memory.

    • Enables a read and write mechanism compatible with C++ via pointer handling.

  • Configurable on-chip memory per streaming multiprocessor: either 48 KB shared memory with 16 KB L1 cache, or 16 KB shared memory with 48 KB L1 cache

  • L2 cache common across all streaming multiprocessors
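
The shared memory and L1 cache split can be requested per kernel from host code. The following is a minimal sketch, assuming a hypothetical kernel named `stage_in_shared` and a hypothetical helper `prefer_shared_for_staging`. The runtime treats the setting as a preference, and devices without a configurable split may ignore it.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel that stages data in shared memory.
// Assumes it is launched with at most 256 threads in a single block.
__global__ void stage_in_shared(float *data)
{
    __shared__ float tile[256];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();                      // every slot is written before any is read
    data[t] = tile[blockDim.x - 1 - t];
}

// Hypothetical helper: asks for the larger shared-memory partition
// (48 KB shared / 16 KB L1 on Fermi) whenever stage_in_shared runs.
void prefer_shared_for_staging()
{
    cudaError_t err = cudaFuncSetCacheConfig(stage_in_shared, cudaFuncCachePreferShared);
    assert(err == cudaSuccess);
    (void)err;  // avoids an unused-variable warning when asserts are compiled out
}
```

Passing `cudaFuncCachePreferL1` instead requests the opposite split, which tends to suit kernels that use little shared memory.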

Fermi Architecture, Continued

  • Added support for PTX (Parallel Thread Execution) 2.0

  • PTX is a low-level virtual instruction set, the GPU equivalent of assembly language

  • Acts as a virtual machine layer between compiled CUDA code and the hardware instructions executed by the GPU.

    • High-level CUDA code is passed to the compiler (nvcc).

    • The compiler translates it into corresponding low-level PTX code.

    • The driver then translates the PTX code into hardware instructions, which are executed by the GPU itself.

AMD/ATI


  • Rival GPU manufacturer – develops its own proprietary line of graphics cards

  • Significant architectural differences with NVIDIA products

    • Evergreen chipset: ATI Radeon HD 5870 compared with NVIDIA’s GTX 480

      • NVIDIA’s GTX 480 – 480 active cores (of the 512 in the full Fermi design), 3 billion transistors

      • Radeon HD 5870 – 20 parallel engines × 16 cores × 5 processing elements = 1,600 work units, 2.15 billion transistors

Parallel Computing Frameworks

  • OpenCL

    • Parallel computing framework similar to CUDA

    • Initially developed by Apple; the standard is now maintained by the Khronos Group

    • Emphasis on portability and cross-platform implementations

    • Flagship parallel computing API of AMD

      • Targets CPUs, GPUs, Apple systems, etc.

      • Adopted by Intel, AMD, NVIDIA, ARM Holdings

  • CTM (Close to Metal)

    • Released in 2006 by AMD as a low level API providing hardware access, similar to NVIDIA’s PTX instruction set.

    • Discontinued in 2008, replaced by OpenCL for principal usage

CUDA


  • Programming framework by NVIDIA for performing GPGPU (General-Purpose GPU) computing

  • Potential for applying parallel processing capabilities of GPU hardware to traditional software applications

  • NVIDIA Libraries

    • Ready-made libraries for implementing complex computational functions

    • cuFFT (NVIDIA CUDA Fast Fourier Transform), cuBLAS (NVIDIA CUDA Basic Linear Algebra Subprograms), and cuSPARSE (NVIDIA CUDA Sparse)

Terminology - What is CUDA?

  • Hardware or software? …Or both.

    • Development framework that links the hardware elements on the GPU with the algorithms responsible for accessing and manipulating those elements

    • Expands on its original definition as a C-compatible compiler with special extensions for recognizing CUDA code

Scalability Model

  • Resource allocation is dynamically handled by the framework.

  • Scalable to different hardware devices without the need to recode an application.


Encapsulation and Abstraction

  • CUDA is designed as a high level API, hiding low level hardware details from the user.

  • Three key abstractions are exposed to the programmer:

    • Hierarchy of thread groups, shared memories, barrier synchronization

  • Computational features implemented as functions. Input data passed as parameters.

  • High-level functionality keeps the learning curve gentle.

  • Allows for applications to be run on any GPU card with a compatible architecture.

    • Backwards compatible for older versions.

Threading Framework

  • Resource allocation handled through threading

    • A thread represents a single work unit or operation

    • Lowest level of resource allocation for CUDA

  • Hierarchical structure

    • Threads, blocks, and grids; from lowest to highest

  • Analogous to multiple layers of nested execution loops

Threading Framework, Continued

  • Visual representation of thread hierarchy

  • Multiple threads embedded in blocks, multiple blocks embedded in grids

  • Intuitive scheme for understanding allocation mechanism
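
The nesting of threads in blocks and blocks in a grid can be sketched with a two-dimensional launch, in which each thread works out its own row and column much as the body of a doubly nested loop would. The kernel name `fill_positions`, the host wrapper `fill_grid`, and the 4 × 4 block shape are assumptions made for illustration, and error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Each thread derives its (row, col) position from its block and thread indices
// and stores the flattened position into the output array.
__global__ void fill_positions(int *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {     // skip threads that fall outside the data
        out[row * width + col] = row * width + col;
    }
}

// Hypothetical host wrapper: covers a width x height array with 4 x 4 thread blocks.
std::vector<int> fill_grid(int width, int height)
{
    assert(width > 0 && height > 0);
    dim3 threadsPerBlock(4, 4);                          // 16 threads per block
    dim3 blocks((width + 3) / 4, (height + 3) / 4);      // enough blocks to cover the array

    const size_t count = static_cast<size_t>(width) * height;
    int *dev_out = nullptr;
    cudaMalloc((void **)&dev_out, count * sizeof(int));
    fill_positions<<<blocks, threadsPerBlock>>>(dev_out, width, height);

    std::vector<int> host_out(count);
    cudaMemcpy(host_out.data(), dev_out, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return host_out;
}
```

Calling `fill_grid(6, 5)` launches a 2 × 2 grid of blocks (64 threads) for 30 elements, so the bounds check leaves the surplus threads idle.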

Threading Framework, Continued

  • Threading syntax

    • Recognized by the framework for handling thread usage within an application.

    • Each variable provides for tracking and monitoring of individual thread activity.

    • Resource assignment for an application is not covered by these syntax elements.

Threading Framework, Continued

  • Built-in variables

    • threadIdx.x/y/z – Index of the current thread within its block, three-dimensional.

    • blockIdx.x/y/z – Index of the current block within the grid, three-dimensional.

    • blockDim.x/y/z – Total number of threads allocated along a single dimension of a block, three-dimensional.

    • gridDim.x/y/z – Block count per dimension of the grid, three-dimensional.

    • tid – Conventional variable name (not a built-in) holding a unique identifier computed for each allocated thread

Threading Framework, Continued

  • Flexibility for managing threads within an application.

    • Example: int tid = threadIdx.x + blockIdx.x * blockDim.x;

    • Current block number, multiplied by the number of threads per block, added to the thread’s index within its block.

    • Every thread evaluates this equation with its own built-in values, so each thread obtains a distinct ID.

    • All thread IDs are computed simultaneously.

      • The equation is evaluated in parallel across all threads, as opposed to one thread at a time.

Sample Thread Allocation

  • blockDim.x = 4

  • blockIdx.x = {0, 1, 2, 3…}

  • threadIdx.x = {0, 1, 2, 3}…{0, 1, 2, 3}…

  • idx/tid = blockDim.x * blockIdx.x + threadIdx.x

  • Problem size of ten operations: three blocks of four threads supply 12 threads, so two threads go to waste.
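
The allocation above can be sketched as a kernel in which each thread computes its own index and the surplus threads are masked out by a bounds check. The kernel name `record_ids` and the host wrapper `run_sample_allocation` are assumptions chosen to match the ten-element example, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Each thread writes its global index into out[], provided that index is in range.
__global__ void record_ids(int *out, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {            // with n = 10, threads 10 and 11 fail this test and do nothing
        out[tid] = tid;
    }
}

// Hypothetical host wrapper: launches enough four-thread blocks to cover n items.
std::vector<int> run_sample_allocation(int n)
{
    assert(n > 0);
    const int threadsPerBlock = 4;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // rounds n / 4 up

    int *dev_out = nullptr;
    cudaMalloc((void **)&dev_out, n * sizeof(int));
    record_ids<<<blocks, threadsPerBlock>>>(dev_out, n);

    std::vector<int> host_out(n);
    cudaMemcpy(host_out.data(), dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return host_out;
}
```

Without the `tid < n` check, the two extra threads would write past the end of the ten-element array.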

Thread Incrementation

  • The indexing scheme so far assigns each thread one element, but does not cover how a thread advances to further elements when the problem is larger than the grid.

  • The increment must be chosen so that a thread never lands on an index already assigned to another thread.

  • Increment by the total number of threads in the grid, not by one or by the block size alone

  • Example:

    • tid += blockDim.x * gridDim.x

  • Thread ID is incremented by the product of threads per block and blocks per grid.
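
The increment rule can be sketched as a loop inside the kernel, so that a small, fixed grid still covers an arbitrarily long array. The names `add_strided` and `add_with_small_grid` and the deliberately tiny launch of two blocks of four threads are assumptions for illustration; error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Adds two vectors. Each thread handles indices tid, tid + stride, tid + 2 * stride, ...
__global__ void add_strided(const int *a, const int *b, int *c, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const int stride = blockDim.x * gridDim.x;   // total number of threads in the grid
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += stride;                           // never collides with another thread's index
    }
}

// Hypothetical host wrapper: uses only 8 threads regardless of the input length.
std::vector<int> add_with_small_grid(const std::vector<int> &a, const std::vector<int> &b)
{
    assert(!a.empty() && a.size() == b.size());
    const int n = static_cast<int>(a.size());
    const size_t bytes = a.size() * sizeof(int);

    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc((void **)&dev_a, bytes);
    cudaMalloc((void **)&dev_b, bytes);
    cudaMalloc((void **)&dev_c, bytes);
    cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);

    add_strided<<<2, 4>>>(dev_a, dev_b, dev_c, n);

    std::vector<int> c(a.size());
    cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return c;
}
```

With 20 elements and 8 threads, thread 0 processes indices 0, 8, and 16, while thread 7 processes 7 and 15.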

Compute Capability

  • Indicates structural limitations of hardware architectures

  • Determines various technical thresholds such as block and thread ceilings, etc.

  • Revision 1.x – Pre-Fermi architectures

  • Revision 2.x – Fermi architecture
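
The compute capability and the thresholds it implies can be read at run time through the CUDA runtime. A minimal sketch, assuming a hypothetical helper named `describe_device`; the fields shown are a small selection of those in `cudaDeviceProp`.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the compute capability and a few limits of one device.
cudaDeviceProp describe_device(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    assert(err == cudaSuccess);
    (void)err;  // avoids an unused-variable warning when asserts are compiled out

    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    std::printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
    std::printf("  max grid size (x):       %d\n", prop.maxGridSize[0]);
    std::printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("  warp size:               %d\n", prop.warpSize);
    return prop;
}
```

Calling `describe_device(0)` reports on the first GPU in the system; a `major` value of 2 corresponds to the Fermi revision listed above.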

Serial vs. Parallel Distinction

  • Host memory vs. device memory

    • Each platform has a separate memory space

    • Host can read and write to host only, device can read and write to device only

  • Synchronization needed between CPU and GPU activity

  • GPU only handles computationally intensive calculations – CPU still executes serial code

Serial vs. Parallel Execution Model

  • Application pipeline

    • Represents CPU and GPU activity

    • Illustrates behavior of application, and invocation of GPU computations

Memory Architecture – Conceptual Overview

  • Three address spaces

    • Local memory

      • Unique to each thread

    • Shared memory

      • Shared among threads within a particular block

    • Global memory

      • Accessible by threads and blocks across a given grid

Memory Architecture – Hardware Level

  • More accurate representation of hardware level interaction between address spaces

  • Two new spaces: constant memory and texture memory

    • Constant memory is read-only and globally accessible.

    • Texture memory is a read-only view of global memory accessed through a dedicated cache, useful in graphics rendering

      • Optimized for two-dimensional access patterns

  • Surface memory

    • Similar functionality to texture memory but different technical elements
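
Several of these address spaces are selected in code through declaration qualifiers. The sketch below is purely illustrative of where each one appears: the names `scale`, `scale_via_shared`, and `scale_on_gpu` are assumptions, the shared-memory staging is not needed for correctness here, and error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

__constant__ float scale;   // constant memory: read-only inside kernels, set from the host

// in and out point to global memory, visible to every thread in the grid.
__global__ void scale_via_shared(const float *in, float *out, int n)
{
    __shared__ float tile[128];                      // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread variable, private to this thread
    if (i < n) {
        tile[threadIdx.x] = in[i];
        // Each thread reads back only the slot it wrote, so no barrier is required here.
        out[i] = tile[threadIdx.x] * scale;
    }
}

// Hypothetical host wrapper: multiplies every element of input by factor on the GPU.
std::vector<float> scale_on_gpu(const std::vector<float> &input, float factor)
{
    assert(!input.empty());
    const int n = static_cast<int>(input.size());
    const size_t bytes = input.size() * sizeof(float);

    float *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc((void **)&dev_in, bytes);
    cudaMalloc((void **)&dev_out, bytes);
    cudaMemcpy(dev_in, input.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &factor, sizeof(float));   // host writes constant memory

    scale_via_shared<<<(n + 127) / 128, 128>>>(dev_in, dev_out, n);

    std::vector<float> output(input.size());
    cudaMemcpy(output.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return output;
}
```

Per-thread variables such as `i` normally live in registers; the compiler places them in local memory only when registers run short.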

Memory Allocation

  • Three basic steps of the allocation process

    • 1. Declare host and device memory allocations

    • 2. Copy input data from host memory to device memory

    • 3. Transfer processed data back to host upon completion

  • Bare memory requirements for successfully executing a GPU application

    • More sophisticated memory functions exist, but are geared towards more complex functionality and better performance

Memory Handling Syntax

  • CUDA-specific keywords for dynamically allocating memory

    • cudaMalloc – Allocates a block of GPU memory and returns a pointer to it. Analogous to malloc in C.

    • cudaMemcpy – Transfers data from CPU memory to GPU memory. Also responsible for the reverse transfer, as selected by its direction argument.

    • cudaFree – Releases a previously allocated block of GPU memory. Analogous to free in C.

  • Basic syntax needed for handling memory allocation

    • Additional features available for more sophisticated applications
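
The three calls fit together as follows. This is a minimal sketch rather than a complete application: the kernel `double_elements` and the wrapper `double_on_gpu` are hypothetical names, and only the allocation result is checked.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Doubles each element in place, one thread per element.
__global__ void double_elements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Hypothetical host wrapper showing allocation, both transfers, and deallocation.
std::vector<float> double_on_gpu(const std::vector<float> &input)
{
    assert(!input.empty());
    const int n = static_cast<int>(input.size());
    const size_t bytes = input.size() * sizeof(float);

    // 1. Allocate device memory (the host memory is the input vector itself).
    float *dev_data = nullptr;
    cudaError_t err = cudaMalloc((void **)&dev_data, bytes);
    assert(err == cudaSuccess);
    (void)err;

    // 2. Copy the input from host memory to device memory.
    cudaMemcpy(dev_data, input.data(), bytes, cudaMemcpyHostToDevice);

    double_elements<<<(n + 255) / 256, 256>>>(dev_data, n);

    // 3. Copy the processed data back to the host, then release the device memory.
    std::vector<float> output(input.size());
    cudaMemcpy(output.data(), dev_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
    return output;
}
```

The final `cudaMemcpy` also serves as a synchronization point: it does not begin until the kernel launched before it has finished.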



  • Kernel – Executes processing instructions for data loaded onto the GPU

    • Executes an operation N times for N threads simultaneously

    • Structured similarly to a normal function, but with its own unique changes

  • Kernel syntax

    • Definition: __global__ void example1(A, B, C)

    • Launch: example1<<<M, N>>>(A, B, C)

      • __global__ – Declaration specifier identifying a function as a GPU kernel.

      • void example1 – Return type and kernel name; a kernel must return void.

      • <<<M, N>>> – Execution configuration, written at the call site. M is the number of blocks to set aside for executing the kernel. N is the number of threads to be allocated per block.

      • (A, B, C) – Argument list to be passed to the kernel
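
In CUDA C the `<<<...>>>` execution configuration is written where the kernel is called, not where it is defined, with the number of blocks first and the threads per block second. The sketch below assumes that `example1` multiplies two arrays element by element and adds a length parameter `n`; the wrapper `launch_example1` is hypothetical and error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Definition: __global__ marks a function that runs on the GPU and is launched from host code.
__global__ void example1(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] * B[i];
    }
}

// Hypothetical host wrapper that chooses the block count from the requested block size.
std::vector<float> launch_example1(const std::vector<float> &a,
                                   const std::vector<float> &b,
                                   int threadsPerBlock)
{
    assert(!a.empty() && a.size() == b.size() && threadsPerBlock > 0);
    const int n = static_cast<int>(a.size());
    const size_t bytes = a.size() * sizeof(float);
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc((void **)&dev_a, bytes);
    cudaMalloc((void **)&dev_b, bytes);
    cudaMalloc((void **)&dev_c, bytes);
    cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Launch: blocks first, threads per block second, then the ordinary argument list.
    example1<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, n);

    std::vector<float> c(a.size());
    cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return c;
}
```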



  • During kernel execution, threads are organized into warps.

    • A warp is a grouping of 32 threads, all executed in parallel with one another.

    • Threads in a warp start at the same program address, but each thread has its own instruction counter and register state.

    • Allows parallel execution, but independent pacing of each thread in terms of completion.

  • Handling of threads in a warp is managed by a warp scheduler.

    • Two warp schedulers available per streaming multiprocessor on Fermi

    • Warp execution optimized if no data dependence between threads.

    • Otherwise, dependent threads remain disabled till required data is received from completed operations
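
For a one-dimensional block, consecutive thread indices are grouped into warps, so a thread's warp and its position within that warp follow from its index. A small sketch, in which the kernel `record_warp_layout`, the struct `WarpLayout`, and the wrapper `warp_layout` are hypothetical names and error checking is omitted:

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Records, for every thread of a single block, which warp it belongs to
// and its position (lane) inside that warp. warpSize is a built-in variable.
__global__ void record_warp_layout(int *warp_of, int *lane_of)
{
    int t = threadIdx.x;
    warp_of[t] = t / warpSize;
    lane_of[t] = t % warpSize;
}

struct WarpLayout {
    std::vector<int> warp;
    std::vector<int> lane;
};

// Hypothetical host wrapper: launches one block and returns the recorded layout.
WarpLayout warp_layout(int threadsInBlock)
{
    assert(threadsInBlock > 0 && threadsInBlock <= 512);
    const size_t bytes = threadsInBlock * sizeof(int);

    int *dev_warp = nullptr, *dev_lane = nullptr;
    cudaMalloc((void **)&dev_warp, bytes);
    cudaMalloc((void **)&dev_lane, bytes);
    record_warp_layout<<<1, threadsInBlock>>>(dev_warp, dev_lane);

    WarpLayout result;
    result.warp.resize(threadsInBlock);
    result.lane.resize(threadsInBlock);
    cudaMemcpy(result.warp.data(), dev_warp, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(result.lane.data(), dev_lane, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_warp);
    cudaFree(dev_lane);
    return result;
}
```

A block of 64 threads therefore occupies two warps, which the scheduler may issue in either order.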

Thread Synchronization

  • Threads of a block are split across warps, and warps execute in no guaranteed order, so data can get “tangled”.

    • A thread may read a shared value before the thread responsible for it has finished writing it.

  • Problem avoided by using __syncthreads()

    • Halts each thread at the barrier until all threads in the block have reached it.

    • Threads that arrive early wait at the barrier; in return, every thread sees the completed shared data, which prevents errors in order-sensitive computations
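
A typical use of `__syncthreads()` is a kernel in which each thread reads a shared-memory slot written by a different thread. The sketch below reverses an array within a single block; the names `reverse_in_block` and `reverse_on_gpu` and the 256-element limit are assumptions for illustration, and error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Reverses n elements using one block. Thread t reads the slot written by thread n - 1 - t,
// which may belong to a different warp, so a barrier is needed between the two phases.
__global__ void reverse_in_block(int *data, int n)
{
    __shared__ int tile[256];
    int t = threadIdx.x;
    if (t < n) {
        tile[t] = data[t];
    }
    __syncthreads();          // all writes to tile[] complete before any thread reads
    if (t < n) {
        data[t] = tile[n - 1 - t];
    }
}

// Hypothetical host wrapper: reverses up to 256 elements on the GPU.
std::vector<int> reverse_on_gpu(const std::vector<int> &input)
{
    assert(!input.empty() && input.size() <= 256);
    const int n = static_cast<int>(input.size());
    const size_t bytes = input.size() * sizeof(int);

    int *dev_data = nullptr;
    cudaMalloc((void **)&dev_data, bytes);
    cudaMemcpy(dev_data, input.data(), bytes, cudaMemcpyHostToDevice);

    reverse_in_block<<<1, n>>>(dev_data, n);

    std::vector<int> output(input.size());
    cudaMemcpy(output.data(), dev_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
    return output;
}
```

The barrier sits outside the `if` statements so that every thread in the block reaches it; placing `__syncthreads()` inside a branch that only some threads take can hang the kernel.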

Sample Execution

  • Animated visualization, indicating the relation between CPU and GPU elements

  • Sample code obtained from: Sanders, Jason and Kandrot, Edward. CUDA By Example: An Introduction to General-Purpose GPU Programming. Boston: Pearson Education, Inc., 2011.

  • Highlights the activities needed to facilitate completion of a GPU-based data processing application.



  • Major topics covered:

    • Performance benefits of GPU accelerated applications.

    • Historical account of GPU technology and graphics processing.

    • Hands-on demonstration of CUDA, including syntax, architecture, and implementation.

Future Outlook

  • Promising future, with positive projected market demand for GPU technology

  • Growing market share for NVIDIA products

    • Gaming applications, scientific computing, and video editing and engineering purposes

  • Release of Kepler architecture – March 2012.

    • Indicates further increase in performance metrics and optimized resource consumption

    • Currently little documentation released in terms of technical specifications

  • The role of GPU technology in the professional market is likely to keep growing as its capabilities continue to rise.



