The Fundamentals of GPU Technology and CUDA Programming

Nicholas Lykins

Kentucky State University

May 7, 2012

Outline


  • Introduction

    • Why pursue GPU accelerated computing?

    • Performance figures

  • Historical background

    • Graphics rendering pipeline

    • History of GPU technology

    • NVIDIA and GPU implementations

    • Alternative GPU processing frameworks

  • CUDA

    • Background and available libraries

    • Terminology

    • Architectural design

    • Syntax

  • Hands-on CUDA sample demonstration

    • Line by line illustration of code execution

    • Animated execution pipeline for sample application

  • Conclusion and future outlook

Thesis Guidelines

  • Initial goal: Demonstrate the potential for GPU technology to better serve the data processing needs of the scientific community.

  • Objectives

    • Deliver an account of the history of GPU technology

    • Provide an overview of NVIDIA’s CUDA framework

    • Demonstrate the motivation for scientists to pursue GPU acceleration and apply it to their own scientific disciplines

High-Performance Computing

  • Multi-Core Processing

  • GPU Acceleration

  • How are they different?

  • Hardware differences: CPU vs. GPU

Hardware Review

  • CPU (Single-Instruction, Single-Data)

    • Control unit, arithmetic and logic unit, internal registers, internal data bus

    • Speed limitations

    • One instruction operates on one data item at a time

  • GPU (Single-Instruction, Multiple-Data)

    • Many processing cores and onboard memory

    • Parallel execution of each core

    • One instruction operates on many data items at once

Performance Trends

  • GPU processing time is measurably faster than comparable CPU processing time when working with large-scale input data.

GPU Technology – Pipeline Overview

  • Graphics rendering pipeline

    • Entire process through which an image is generated by a graphics processing device

    • Vertex calculations

    • Color generation

    • Shadows and lighting

  • Shaders

    • Specialized programs executed on graphics processing hardware, each producing a particular aspect of the resulting image

Traditional Pipelining Process

  • Traditional pipelining process

    • System collects data to be graphically represented

      • Modeling transformations within the world space

      • Vertices are “shaded” according to various properties

        • Lighting, materials, textures

      • Viewing transformation is performed – reorienting the graphical object with respect to the human eye

      • Clipping is performed, eliminating constructed content outside the frustum

Traditional Pipelining, Continued

  • The three-dimensional scene is then rendered onto a two-dimensional viewing plane, or screen space

  • Rasterization takes place, in which the continuous geometric representation of objects is translated into a set of discrete fragments for a particular display

    • Color, transparency, and depth

  • Fragments are stored within the frame buffer, where Z-buffering and alpha blending determine how each pixel finally appears on the screen.

Graphics Processing APIs (Application Programming Interfaces)

  • OpenGL

    • OpenGL 1.0 first developed by Silicon Graphics in 1992.

    • First middle layer developed for interpreting between operating system and underlying hardware.

    • Industry-wide standard was implemented for graphics development, with each vendor crafting their hardware architecture with those standards in mind.

    • Cross-platform compatibility

  • DirectX

    • Developed by Microsoft employees Craig Eisler, Alex St. John, and Eric Engstrom in 1995 to give programmers low-level access to Windows' restricted memory space.

    • Set of related APIs (Direct3D, DirectDraw, DirectSound) that enable multimedia development.

    • Vendor provides device driver that enables compatibility for its own hardware across all Windows systems.

    • Restricted to Windows only.

GeForce 256

  • Announced in August of 1999 and marketed by NVIDIA as the world’s first GPU.

  • Integration of all graphics processing actions onto a single chip.

  • Implemented with a fixed function rendering pipeline

Programmable Pipeline

  • OpenGL 2.0

    • Programmable shaders

    • Programmers could write unique instructions for accessing hardware functionality

  • Programmability enabled by dedicated “shading languages”

    • ARB

      • Low-level assembly based language for directly interfacing with hardware elements

      • Unintuitive and difficult to use effectively

    • GLSL (OpenGL Shading Language)

      • High-level language derived from C

      • Translates high-level code into corresponding low-level instructions to be interpreted as ARB language

    • Cg

      • High-level shader language designed by NVIDIA

      • Compiles into assembly-based and GLSL code for interpretation by OpenGL and DirectX

G80 Architecture

  • Released in November of 2006, first implemented within the GeForce 8800.

  • First architecture to implement the CUDA framework, and NVIDIA’s first unified graphics rendering pipeline

    • Vertex and fragment shaders integrated as one hardware component

    • Programmability given over individual processing elements on the device

  • Scalability based on targeted consumer market

    • Proportions of processing cores, memory, etc.

G80 Architecture, Continued

  • GeForce 8800 GTX

    • Each “tile” represents a separate multiprocessor

      • Eight streaming cores per multiprocessor, 16 multiprocessors per card

      • Shared L1 cache per pair of tiles

      • Texture handling units attached to each tile

    • Recursive method for handling graphics rendering

      • Output data for one core becomes input data for another

    • Six discrete memory partitions, each 64-bit, totaling a 384-bit interface.

    • Memory interface width and memory size vary by specific G80 device.

Fermi Architecture

  • Successor to the G80 and GT200 generations (GT200 was released in June of 2008); Fermi was announced in 2009, with the first GeForce cards shipping in 2010.

  • NVIDIA’s most recent architecture until Kepler was released in March of 2012.

  • Rebranding of streaming processor cores as CUDA cores.

  • Overall superior design in terms of performance and computational precision

Fermi Architecture, Continued

  • Core count increased from 240 in the preceding GT200 generation to 512.

  • 32 cores per streaming multiprocessor, across 16 streaming multiprocessors

  • Memory interface similar to the G80: six 64-bit memory partitions, totaling a 384-bit memory interface.

  • 64 KB of on-chip memory per streaming multiprocessor, divided between shared memory and L1 cache

Fermi Architecture, Continued

  • Unified memory address space covering thread-local, block-shared, and global memory.

    • Enables a read and write mechanism compatible with C++ via pointer handling.

  • Configurable on-chip memory: either 48 KB shared memory with 16 KB L1 cache, or 16 KB shared memory with 48 KB L1 cache

  • L2 cache common across all streaming multiprocessors

Fermi Architecture, Continued

  • Added CUDA compatibility with the implementation of PTX (Parallel Thread Execution) 2.0

  • Low-level intermediate language, comparable to assembly

  • Acts as a virtual instruction set that is translated into hardware instructions executable by the GPU.

    • High-level CUDA code is passed to the compiler.

    • The compiler translates it into the corresponding low-level PTX code.

    • The driver then translates the PTX code into hardware instructions, which are executed by the GPU itself.

AMD-ATI

  • Rival GPU manufacturer – develops its own proprietary line of graphics cards

  • Significant architectural differences with NVIDIA products

    • Evergreen chipset – ATI Radeon HD 5870 - Comparison

      • NVIDIA’s GTX 480 – 480 active cores (of 512 on the chip), 3 billion transistors

      • Radeon HD 5870 – 20 parallel engines × 16 cores × 5 processing elements, totaling 1,600 work units; 2.15 billion transistors

Parallel Computing Frameworks

  • OpenCL

    • Parallel computing framework similar to CUDA

    • Initially introduced by Apple; its standards are now developed by the Khronos Group

    • Emphasis on portability and cross-platform implementations

    • Flagship parallel computing API of AMD

      • CPU/GPU, Apple systems, GPUs, etc.

      • Adopted by Intel, AMD, NVIDIA, ARM Holdings

  • CTM (Close to Metal)

    • Released in 2006 by AMD as a low level API providing hardware access, similar to NVIDIA’s PTX instruction set.

    • Discontinued in 2008, replaced by OpenCL for principal usage

CUDA

  • Programming framework by NVIDIA for performing GPGPU (General-Purpose GPU) computing

  • Potential for applying parallel processing capabilities of GPU hardware to traditional software applications

  • NVIDIA Libraries

    • Ready-made libraries for implementing complex computational functions

    • cuFFT (NVIDIA CUDA Fast Fourier Transform), cuBLAS (NVIDIA CUDA Basic Linear Algebra Subroutines), and cuSPARSE (NVIDIA CUDA Sparse)

Terminology - What is CUDA?

  • Hardware or software? Or both?

    • Development framework that connects hardware elements on the GPU with the algorithms responsible for accessing and manipulating those elements

    • Expands on its original definition as a C-compatible compiler with special extensions for recognizing CUDA code

Scalability Model

  • Resource allocation is dynamically handled by the framework.

  • Scalable to different hardware devices without the need to recode an application.


Encapsulation and Abstraction

  • CUDA is designed as a high level API, hiding low level hardware details from the user.

  • Three key abstractions between the architecture and the programmer:

    • Hierarchy of thread groups, shared memories, and barrier synchronization

  • Computational features implemented as functions. Input data passed as parameters.

  • High level functionality allows for a low learning curve in terms of use.

  • Allows for applications to be run on any GPU card with a compatible architecture.

    • Backwards compatible for older versions.

Threading Framework

  • Resource allocation handled through threading

    • A thread represents a single work unit or operation

    • Lowest level of resource allocation for CUDA

  • Hierarchical structure

    • Threads, blocks, and grids; from lowest to highest

  • Comparable to multiple layers of nested execution loops

Threading Framework, Continued

  • Visual representation of thread hierarchy

  • Multiple threads embedded in blocks, multiple blocks embedded in grids

  • Intuitive scheme for understanding the allocation mechanism

Threading Framework, Continued

  • Threading syntax

    • Recognized by the framework for handling thread usage within an application.

    • Each variable provides for tracking and monitoring of individual thread activity.

    • Resource assignment for an application is not covered by these syntax elements.

Threading Framework, Continued

  • Keywords

    • threadIdx.x/y/z – Index of the current thread within its block, three-dimensional.

    • blockIdx.x/y/z – Index of the current block within the grid, three-dimensional.

    • blockDim.x/y/z – Number of threads along each dimension of a block, three-dimensional.

    • gridDim.x/y/z – Number of blocks along each dimension of the grid, three-dimensional.

    • tid – Conventional variable name (not a built-in keyword) for the identifying index of each individual thread; a unique value for each allocated thread

Threading Framework, Continued

  • Flexibility for managing threads within an application.

    • Example: int tid = threadIdx.x + blockIdx.x * blockDim.x;

    • Current block number, multiplied by the number of threads per block, added to the current thread's index within its block.

    • Each thread evaluates this equation with its own built-in values, yielding its own ID.

    • All thread IDs are computed simultaneously.

      • The equation is evaluated in parallel across all threads, as opposed to one thread at a time.

Sample Thread Allocation

  • blockDim.x = 4

  • blockIdx.x = {0, 1, 2, 3…}

  • threadIdx.x = {0, 1, 2, 3}…{0, 1, 2, 3}…

  • idx/tid = blockDim.x * blockIdx.x + threadIdx.x

  • With a problem size of ten operations and three blocks of four threads (twelve threads in total), two threads go to waste.
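
The allocation on this slide can be sketched as a complete launch. This is a minimal illustration, not code from the presentation: the kernel name `record_ids` and the host helper `run_record_ids` are hypothetical, and the block count of three is inferred from ten elements at four threads per block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread computes its global index and stores it; surplus threads do nothing.
__global__ void record_ids(int *out, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {  // threads with tid 10 and 11 fail this test and write nothing
        out[tid] = tid;
    }
}

// Hypothetical helper: 10 elements at 4 threads per block needs 3 blocks (12 threads).
std::vector<int> run_record_ids() {
    const int n = 10;
    const int threads_per_block = 4;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;

    int *dev = nullptr;
    cudaMalloc((void **)&dev, n * sizeof(int));
    record_ids<<<blocks, threads_per_block>>>(dev, n);

    std::vector<int> host(n, -1);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Without the `tid < n` guard, the two surplus threads would write past the end of the ten-element buffer.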

Thread Incrementation

  • The current scheme assigns each thread one ID, but does not cover advancing a thread to further IDs when the data outnumber the threads.

  • There is a right and a wrong way to increment thread IDs; the increment must not land on an ID already assigned to another thread.

  • Increment by the total number of threads in the grid, not by the thread count of a single block

  • Example:

    • tid += blockDim.x * gridDim.x

  • The thread ID is incremented by the product of threads per block and blocks per grid.
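
A sketch of this increment inside a kernel, assuming a vector addition where the data outnumber the launched threads. The names `add_strided` and `run_add_strided`, and the deliberately small launch of two blocks of four threads, are illustrative choices rather than anything taken from the slides.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles elements tid, tid + stride, tid + 2 * stride, and so on.
__global__ void add_strided(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;  // jump by the total number of threads in the grid
    }
}

// Hypothetical helper: adds two n-element vectors using only 8 threads in total.
std::vector<int> run_add_strided(int n) {
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    const size_t bytes = n * sizeof(int);
    int *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);

    add_strided<<<2, 4>>>(da, db, dc, n);

    cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return c;
}
```

Because every thread advances by the same stride, no two threads ever visit the same index, which is the property the slide asks for.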

Compute Capability

  • Indicates structural limitations of hardware architectures

  • Determines various technical thresholds such as block and thread ceilings, etc.

  • Revision 1.x – Pre-Fermi architectures

  • Revision 2.x – Fermi architecture
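
The compute capability of an installed device can be read at run time through the CUDA runtime. The following is a small sketch; the helper name `print_compute_capability` is hypothetical, and the selection of fields printed is only one possible choice.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints the compute capability and a few related limits for one device.
// Returns false if the runtime cannot query that device.
bool print_compute_capability(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max grid size (x): %d\n", prop.maxGridSize[0]);
    return true;
}
```

Calling `print_compute_capability(0)` reports on the first device; a major revision of 2 would correspond to the Fermi generation described above.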

Serial vs. Parallel Distinction

  • Host memory vs. device memory

    • Each platform has a separate memory space

    • Host can read and write to host only, device can read and write to device only

  • Synchronization needed between CPU and GPU activity

  • GPU only handles computationally intensive calculations – CPU still executes serial code

Serial vs. Parallel Execution Model

  • Application pipeline

    • Represents CPU and GPU activity

    • Illustrates behavior of application, and invocation of GPU computations

Memory Architecture – Conceptual Overview

  • Three address spaces

    • Local memory

      • Unique to each thread

    • Shared memory

      • Shared among threads within a particular block

    • Global memory

      • Accessible by threads and blocks across a given grid

Memory Architecture – Hardware Level

  • More accurate representation of hardware level interaction between address spaces

  • Two new spaces: constant memory and texture memory

    • Constant memory is read-only and globally accessible.

    • Texture memory is a read-only, cached view of global memory, useful in graphics rendering

      • Optimized for two-dimensional access patterns

  • Surface memory

    • Similar functionality to texture memory but different technical elements
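
As a rough illustration of how some of these spaces appear in source code, the sketch below keeps two coefficients in constant memory and the data arrays in global memory. The names `coeff`, `apply_line`, and `run_apply_line` are hypothetical, and the helper assumes a non-empty input.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Constant memory: read-only inside kernels and visible to every thread in the grid.
__constant__ float coeff[2];

__global__ void apply_line(const float *x, float *y, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // per-thread variable
    if (tid < n) {
        y[tid] = coeff[0] * x[tid] + coeff[1];  // x and y point into global memory
    }
}

// Hypothetical helper computing y = slope * x + intercept on the device.
std::vector<float> run_apply_line(const std::vector<float> &x, float slope, float intercept) {
    const int n = static_cast<int>(x.size());
    const size_t bytes = n * sizeof(float);

    // The host fills constant memory before the kernel is launched.
    const float host_coeff[2] = {slope, intercept};
    cudaMemcpyToSymbol(coeff, host_coeff, sizeof(host_coeff));

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    apply_line<<<(n + threads - 1) / threads, threads>>>(dx, dy, n);

    std::vector<float> y(n);
    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```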

Memory Allocation

  • Three basic steps of the allocation process

    • 1. Declare host and device memory allocations

    • 2. Copy input data from host memory to device memory

    • 3. Transfer processed data back to host upon completion

  • Bare minimum memory requirements for successfully executing a GPU application

    • More sophisticated memory functions exist, but are geared towards more complex functionality and better performance

Memory Handling Syntax

  • CUDA runtime functions for dynamically allocating memory

    • cudaMalloc – Allocates a block of GPU memory and returns a pointer to it through an output argument. Analogous to malloc in C.

    • cudaMemcpy – Transfers data from CPU memory to GPU memory, or in the reverse direction, as selected by its direction argument.

    • cudaFree – Releases a block of GPU memory. Analogous to free in C.

  • Basic syntax needed for handling memory allocation

    • Additional features available for more sophisticated applications
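
The allocation steps and the three runtime calls can be put together in one short round trip. This is a sketch under simple assumptions: the kernel `double_values` and the helper `run_double_values` are made-up names, and error handling is limited to the allocation call.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Doubles each element in place; one thread per element.
__global__ void double_values(float *data, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        data[tid] *= 2.0f;
    }
}

// Hypothetical helper; returns an empty vector if the device allocation fails.
std::vector<float> run_double_values(const std::vector<float> &input) {
    const int n = static_cast<int>(input.size());
    const size_t bytes = n * sizeof(float);

    // Step 1: the host buffer already exists; allocate a matching device buffer.
    float *dev = nullptr;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
        return {};
    }

    // Step 2: copy the input data from host memory to device memory.
    cudaMemcpy(dev, input.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    double_values<<<(n + threads - 1) / threads, threads>>>(dev, n);

    // Step 3: copy the processed data back to the host, then release the device buffer.
    std::vector<float> output(n);
    cudaMemcpy(output.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return output;
}
```

The last argument of `cudaMemcpy` selects the direction of the transfer, which is why the same function serves both step 2 and step 3.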

Kernels

  • Kernel – Executes processing instructions for data loaded onto the GPU

    • Executes an operation N times for N threads simultaneously

    • Structured similarly to a normal function, but with its own unique changes

  • Kernel syntax

    • Definition: __global__ void example1(A, B, C)

    • Launch: example1<<<M, N>>>(A, B, C)

      • __global__ - Declaration specifier identifying a function as a GPU kernel.

      • void example1 – Return type and kernel name

      • <<<M, N>>> - M represents the number of blocks to set aside for executing the kernel. N indicates the number of threads to be allocated per block.

      • (A, B, C) – Argument list to be passed to the kernel
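
In CUDA C the `__global__` specifier appears on the kernel definition, while the `<<<...>>>` execution configuration appears where the kernel is called, with the block count first and the threads per block second. The sketch below reuses the slide's names `example1`, `A`, `B`, `C`, `M`, and `N`; the element-wise addition and the helper `run_example1` are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Definition: __global__ marks example1 as a kernel; kernels return void.
__global__ void example1(const int *A, const int *B, int *C) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    C[tid] = A[tid] + B[tid];  // no bounds check: the launch covers exactly M * N elements
}

// Hypothetical helper: launches example1 with M blocks of N threads each.
std::vector<int> run_example1(int M, int N) {
    const int total = M * N;
    const size_t bytes = total * sizeof(int);
    std::vector<int> hA(total), hB(total, 10), hC(total, 0);
    for (int i = 0; i < total; ++i) hA[i] = i;

    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // Launch: M blocks in the grid, N threads in each block.
    example1<<<M, N>>>(dA, dB, dC);

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return hC;
}
```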

Warps

  • During kernel execution, threads are organized into warps.

    • A warp is a grouping of 32 threads, all executed in parallel with one another.

    • Threads in a warp start at the same program address, but each has its own instruction counter and register state.

    • Allows parallel execution, but independent pacing of each thread in terms of completion.

  • Handling of threads in a warp is managed by a warp scheduler.

    • Two warp schedulers available per streaming multiprocessor on Fermi

    • Warp execution is optimized if there is no data dependence between threads.

    • Otherwise, dependent threads remain disabled until the required data is received from completed operations
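
To make the grouping concrete, the following sketch has each thread of a one-dimensional block record which warp it falls into, using the built-in `warpSize` variable. The kernel `record_warp` and the helper `run_record_warp` are hypothetical names, and the helper launches a single block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records the index of the warp it belongs to within its block.
__global__ void record_warp(int *warp_of) {
    int t = threadIdx.x;
    warp_of[t] = t / warpSize;  // warpSize is built in; 32 per the description above
}

// Hypothetical helper: launches one block with the given number of threads.
std::vector<int> run_record_warp(int threads) {
    const size_t bytes = threads * sizeof(int);
    int *dev = nullptr;
    cudaMalloc((void **)&dev, bytes);

    record_warp<<<1, threads>>>(dev);

    std::vector<int> host(threads, -1);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With 64 threads and a warp size of 32, threads 0 through 31 report warp 0 and threads 32 through 63 report warp 1.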

Thread Synchronization

  • Separation of a block's threads into warps can cause data to get “tangled”.

    • Warps execute in no guaranteed order, so one thread may read a shared value before another thread has written it.

  • Problem avoided by using __syncthreads()

    • Acts as a barrier: halts each thread at the call until all threads in its block have reached the same point.

    • Threads that arrive early wait for the rest, which ensures fewer errors in sensitive computations
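
A small sketch of `__syncthreads()` in use: a block of 64 threads (two warps) reverses an array through shared memory, so every thread reads a slot that a different thread wrote. The names `kBlockSize`, `reverse_block`, and `run_reverse_block` are illustrative and not taken from the presentation.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 64;  // two warps of 32 threads

__global__ void reverse_block(int *data) {
    __shared__ int tile[kBlockSize];  // shared among the threads of this block
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();  // wait until every thread in the block has filled its slot
    data[t] = tile[kBlockSize - 1 - t];  // reads a slot written by another thread
}

// Hypothetical helper: reverses the sequence 0..63 on the device.
std::vector<int> run_reverse_block() {
    std::vector<int> host(kBlockSize);
    for (int i = 0; i < kBlockSize; ++i) host[i] = i;

    const size_t bytes = kBlockSize * sizeof(int);
    int *dev = nullptr;
    cudaMalloc((void **)&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    reverse_block<<<1, kBlockSize>>>(dev);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Without the barrier, a thread in the first warp could read `tile[63]` before the second warp has written it.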

Sample Execution

  • Animated visualization, indicating the relation between CPU and GPU elements

  • Sample code obtained from: Sanders, Jason and Kandrot, Edward. CUDA By Example: An Introduction to General-Purpose GPU Programming. Boston: Pearson Education, Inc., 2011.

  • Highlights the activities needed to facilitate completion of a GPU-based data processing application.

  • Code Animation Link

Conclusion

  • Major topics covered:

    • Performance benefits of GPU accelerated applications.

    • Historical account of GPU technology and graphics processing.

    • Hands-on demonstration of CUDA, including syntax, architecture, and implementation.

Future Outlook

  • Promising future, with positive projected market demand for GPU technology

  • Growing market share for NVIDIA products

    • Gaming applications, scientific computing, and video editing and engineering purposes

  • Release of Kepler architecture – March 2012.

    • Indicates further increase in performance metrics and optimized resource consumption

    • Currently little documentation released in terms of technical specifications

  • The role of GPU technology in the professional market is likely to keep growing as its capabilities continue to rise.
