
Getting Started with GPU Computing


Presentation Transcript


  1. Getting Started with GPU Computing Dan Negrut Assistant Professor Simulation-Based Engineering Lab Dept. of Mechanical Engineering University of Wisconsin-Madison San Diego August 30, 2009

  2. Acknowledgement • Colleagues helping to organize the GPU Workshop: • Sara McMains, Krishnan Suresh, Roshan D’Souza • Wen-mei W. Hwu • NVIDIA Corporation • My students • Hammad Mazhar • Toby Heyn 2

  3. Acknowledgements: Financial Support [Dan Negrut] • NSF • NVIDIA Corporation • British Aerospace Engineering (BAE), Land Division • Argonne National Lab 3

  4. Overview • Parallel computing: why, and why now? (15 mins) • GPU Programming: The democratization of parallel computing (60 mins) • NVIDIA’s CUDA, a facilitator of GPU computing • Comments on the execution configuration and execution model • The memory layout • Gauging resource utilization • IDE support • Comments on GPU computing (15 mins) • Sources of information • Beyond CUDA 4

  5. Scientific Computing: A Change of Tide... • A paradigm shift taking place in Scientific Computing • Moving from sequential to parallel data processing • Triggered by changes in the microprocessor industry 5

  6. CPU: Three Walls to Serial Performance • Memory Wall • Instruction Level Parallelism (ILP) Wall • Power Wall • Source: excellent article, “The Many-Core Inflection Point for Mass Market Computer Systems”, by John L. Manferdelli, Microsoft Corporation • http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/ 6

  7. Memory Wall • There is a growing disparity of speed between CPU and memory access outside the CPU chip • S. Cray: “Anyone can build a fast CPU. The trick is to build a fast system” 7

  8. Memory Wall • The processor is often data starved (idle) due to latency and limited communication bandwidth beyond chip boundaries • From 1986 to 2000, CPU speed improved at an annual rate of 55%, while memory access speed improved at only 10% per year • Some fixes: • Strong push for ever-growing caches to improve the average memory reference time to fetch or write data • Hyper-Threading Technology (HTT) 8

  9. The Power Wall • “Power, and not manufacturing, limits traditional general purpose microarchitecture improvements” (F. Pollack, Intel Fellow) • Leakage power dissipation gets worse as gates get smaller, because gate dielectric thicknesses must proportionately decrease • [Chart: power density (W/cm²) vs. process technology, older to newer (μm), for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core Duo; power density heads toward that of a nuclear reactor. Adapted from F. Pollack (MICRO’99)] 9

  10. The Power Wall • Power dissipation in clocked digital devices is proportional to the square of the clock frequency, imposing a natural limit on clock rates • A significant increase in clock speed without heroic (and expensive) cooling is not possible. Chips would simply melt. 10

  11. The Power Wall • Clock speed increased by a factor of 4,000 in less than two decades • The ability of manufacturers to dissipate heat is limited, though… • Looking back at the last five years, clock rates have been pretty much flat • Intel’s 2010 Sandy Bridge microprocessor architecture is expected to go up to 4.0 GHz 11

  12. The Bright Spot: Moore’s Law • 1965 paper: doubling of the number of transistors on integrated circuits every two years • Moore himself wrote only about the density of components (or transistors) at minimum cost • The increase in transistor count serves, to some extent, as a rough measure of computer processing performance 12 http://news.cnet.com/Images-Moores-Law-turns-40/2009-1041_3-5649019.html

  13. Micro2015: Evolving Processor Architecture, Intel® Developer Forum, March 2005 • Intel’s Vision: Evolutionary Configurable Architecture • Evolution from large, scalar cores for high single-thread performance toward scalar plus many core for highly threaded workloads: • Dual core: symmetric multithreading • Multi-core array: CMP with ~10 cores • Many-core array: CMP with 10s-100s of low power cores; scalar cores; capable of TFLOPS+; full System-on-Chip; servers, workstations, embedded… • CMP = “chip multi-processor” • Presentation by Paul Petersen, Sr. Principal Engineer, Intel 13

  14. Putting things in perspective… Slide Source: Berkeley View of Landscape 14

  15. Some numbers would be good… 15

  16. GPU vs. CPU Flop Rate Comparison (single precision rate for GPU) • Seymour Cray: "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?" 16

  17. Key Parameters: GPU, CPU 17

  18. The GPU Hardware 18

  19. 19

  20. GPU: Underlying Hardware • NVIDIA nomenclature used below, reminiscent of the GPU’s mission • The hardware is organized as follows: • One Stream Processor Array (SPA)… • … has a collection of Texture Processor Clusters (TPCs; ten of them on the C1060) … • …and each TPC has three Stream Multiprocessors (SMs) … • …and each SM is made up of eight Stream or Scalar Processors (SPs) 20

  21. NVIDIA TESLA C1060 • 240 Scalar Processors • 4 GB device memory • Memory Bandwidth: 102 GB/s • Clock Rate: 1.3GHz • Approx. $1,250 21
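
The figures above can be confirmed programmatically. A minimal device-query sketch using the standard CUDA runtime API (the values printed depend on the card actually installed):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // query device 0

    printf("Device name           : %s\n", prop.name);
    printf("Stream Multiprocessors: %d\n", prop.multiProcessorCount);      // 30 on a Tesla C1060
    printf("Scalar Processors     : %d\n", 8 * prop.multiProcessorCount);  // 8 SPs per SM on this generation
    printf("Device memory         : %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("Clock rate            : %.2f GHz\n", prop.clockRate / 1.0e6);  // clockRate is reported in kHz
    return 0;
}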

  22. Layout of Typical Hardware Architecture • [Diagram: the CPU (the host) connected to the GPU with its local DRAM (the device)] 22

  23. GPGPU Computing • GPGPU computing: “General Purpose” GPU computing • The GPU can be used for more than just graphics: the computational resources are there, and most of the time they are underutilized • The GPU can be used to accelerate the data parallel parts of an application 23

  24. GPGPU: Pluses and Minuses • Simple architecture optimized for compute intensive tasks • Large data arrays, streaming throughput • Fine-grain SIMD (Single Instruction Multiple Data) parallelism • Low-latency floating point (FP) computation • High precision floating point arithmetic support • 32-bit IEEE 754 floating point • However, the GPU was only programmable through graphics library APIs 24

  25. GPGPU: Pluses and Minuses [Cntd.] • Dealing with the graphics API • Addressing modes • Limited texture size/dimension • Shader capabilities • Limited outputs • Instruction sets • Lack of Integer & bit ops • Communication limited • Between pixels • Only gather (can read data from other pixels), but no scatter (can only write to one pixel) • [Diagram: the per-thread / per-Shader / per-Context programming model: Input Registers → Fragment Program (with Texture, Constants, Temp Registers) → Output Registers → FB Memory] • Summing up: mapping computation problems to the graphics rendering pipeline is tedious… 25

  26. CUDA: Addressing the Minuses in GPGPU • “Compute Unified Device Architecture” • It represents a general purpose programming model • The user kicks off batches of threads on the GPU • Targeted software stack • Scientific computing oriented drivers, language, and tools • Driver for loading computation programs into the GPU • Standalone driver, optimized for computation • Interface designed for compute - graphics-free API • Guaranteed maximum download & readback speeds • Explicit GPU memory management 26
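
To make the last bullet concrete, here is a minimal sketch of explicit GPU memory management: allocate on the device, copy in, launch a kernel, copy out, free. The kernel scaleArray, its arguments, and the array size are illustrative assumptions, not code from the slides:

#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative kernel: scales each array element by a constant.
__global__ void scaleArray(float *d_data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);        // host buffer
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);           // explicit device allocation
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

    scaleArray<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);       // kernel launch

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_data);                              // explicit device deallocation
    free(h_data);
    return 0;
}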

  27. The CUDA Execution Model

  28. GPU Computing – The Basic Idea • The GPU is linked to the CPU by a reasonably fast connection • The idea is to use the GPU as a co-processor • Farm out big parallelizable tasks to the GPU • Keep the CPU busy with the control of the execution and “corner” tasks 28

  29. GPU Computing – The Basic Idea [Cntd.] • You have to copy data onto the GPU and later fetch results back. • For this to pay off, the data transfer should be overshadowed by the number crunching that draws on that data • GPUs also work in asynchronous mode • The data transfer for a future task can happen while the GPU processes the current job 29
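
A sketch of that asynchronous overlap using CUDA streams and cudaMemcpyAsync; the kernel process and the two-chunk setup are assumptions made for illustration. The host buffers are page-locked (cudaMallocHost) so the copies can be truly asynchronous:

#include <cuda_runtime.h>

// Illustrative kernel: squares each element of a chunk.
__global__ void process(float *d_buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_buf[i] = d_buf[i] * d_buf[i];
}

int main(void) {
    const int n = 1 << 20;

    // Page-locked (pinned) host buffers are needed for asynchronous copies.
    float *h_chunk0, *h_chunk1;
    cudaMallocHost((void **)&h_chunk0, n * sizeof(float));
    cudaMallocHost((void **)&h_chunk1, n * sizeof(float));

    float *d_buf0, *d_buf1;
    cudaMalloc((void **)&d_buf0, n * sizeof(float));
    cudaMalloc((void **)&d_buf1, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Copy and kernel for the current chunk are issued in stream s0...
    cudaMemcpyAsync(d_buf0, h_chunk0, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    process<<<(n + 255) / 256, 256, 0, s0>>>(d_buf0, n);

    // ...while the transfer for the next chunk is issued in stream s1 and can
    // overlap with the computation above.
    cudaMemcpyAsync(d_buf1, h_chunk1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    process<<<(n + 255) / 256, 256, 0, s1>>>(d_buf1, n);

    cudaThreadSynchronize();   // wait for both streams (CUDA 2.x-era name for this call)

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_buf0);
    cudaFree(d_buf1);
    cudaFreeHost(h_chunk0);
    cudaFreeHost(h_chunk1);
    return 0;
}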

  30. Some Nomenclature… • The HOST • This is your CPU, executing the “master” thread • The DEVICE • This is the GPU card, connected to the HOST through a PCIe x16 connection • The HOST (the master thread) calls the DEVICE to execute a KERNEL • When calling the KERNEL, the HOST also has to tell the DEVICE how many threads should execute the KERNEL • This is called “defining the execution configuration” 30

  31. Calling a Kernel Function, Details • A kernel function must be called with an execution configuration: __global__ void KernelFoo(...); // declaration dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block KernelFoo<<<DimGrid, DimBlock>>>(...arg list here…); • Any call to a kernel function is asynchronous • By default, execution on the host doesn’t wait for the kernel to finish 31
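
A complete, runnable version of the launch above, illustrating the asynchronous behavior; the body of KernelFoo and the output buffer are assumptions, since the slide only shows the declaration:

#include <stdio.h>
#include <cuda_runtime.h>

// Assumed trivial body for KernelFoo; the slide only declares it.
__global__ void KernelFoo(int *d_out) {
    // One value written per block, just to have something observable.
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
        d_out[blockIdx.y * gridDim.x + blockIdx.x] = blockIdx.x;
}

int main(void) {
    int *d_out;
    cudaMalloc((void **)&d_out, 100 * 50 * sizeof(int));

    dim3 DimGrid(100, 50);    // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);   // 256 threads per block

    KernelFoo<<<DimGrid, DimBlock>>>(d_out);   // returns immediately to the host

    // The launch is asynchronous: the host must synchronize (explicitly, or
    // implicitly through a blocking cudaMemcpy) before relying on the results.
    cudaThreadSynchronize();
    printf("kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_out);
    return 0;
}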

  32. Example • The host call below instructs the GPU to execute the function (kernel) “foo” using 25,600 threads • Two arguments are passed down to each thread executing the kernel “foo” • In this execution configuration, the host instructs the device that it is supposed to run 100 blocks, each having 256 threads in it • The concept of a block is important, since it represents the entity that gets executed on an SM 32
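
The code on the original slide is not reproduced in the transcript. A sketch consistent with the description (100 blocks of 256 threads, two arguments passed to the kernel) might look like this; the kernel name foo matches the slide, but its body and argument names are assumptions:

#include <cuda_runtime.h>

// Two arguments, as described on the slide; names and body are assumptions.
__global__ void foo(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) d_data[i] += 1.0f;                    // illustrative per-thread work
}

int main(void) {
    const int n = 25600;                  // 100 blocks x 256 threads/block
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    foo<<<100, 256>>>(d_data, n);         // execution configuration: 100 blocks, 256 threads each

    cudaThreadSynchronize();
    cudaFree(d_data);
    return 0;
}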

  33. 30,000 Feet Perspective • [Figure: on one side, how your C code looks; on the other, how the code gets executed on the hardware in heterogeneous computing] 33

  34. 34

  35. More on the Execution Model • There is a limitation on the number of blocks in a grid: • The grid of blocks can be organized as a 2D structure: a maximum of 65,535 by 65,535 blocks (that is, no more than 4,294,836,225 blocks for a kernel call) • Threads in each block: • The threads can be organized as a 3D structure (x, y, z) • The total number of threads in each block cannot be larger than 512 35
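
A host-side sketch of staying within these limits for a large 1D problem: cap the block at 512 threads and, if more than 65,535 blocks are needed, fold them into a 2D grid. The kernel and problem size are illustrative:

#include <cuda_runtime.h>

__global__ void initArray(float *d_data, int n) {
    // Recover a global 1D index from a 2D grid of 1D blocks.
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int i = block * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = 0.0f;
}

int main(void) {
    const int n = 40 * 1000 * 1000;        // 40M elements: needs more than 65,535 blocks of 512
    const int threadsPerBlock = 512;       // at most 512 threads per block on this hardware
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    // Fold the 1D block count into a 2D grid so neither dimension exceeds 65,535.
    int gridX = (blocks > 65535) ? 65535 : blocks;
    int gridY = (blocks + gridX - 1) / gridX;
    dim3 grid(gridX, gridY);

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    initArray<<<grid, threadsPerBlock>>>(d_data, n);
    cudaThreadSynchronize();
    cudaFree(d_data);
    return 0;
}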

  36. Kernel Call Overhead • How much time is burnt by the CPU calling the GPU? • The values reported below are averages over 100,000 kernel calls • No arguments in the kernel call: • GT 8800 series, CUDA 1.1: 0.115305 milliseconds • Tesla C1060, CUDA 1.3: 0.088493 milliseconds • Arguments present in the kernel call: • GT 8800 series, CUDA 1.1: 0.146812 milliseconds • Tesla C1060, CUDA 1.3: 0.116648 milliseconds 36
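
The numbers above depend on the hardware and driver; a sketch of one way such an average could be measured with CUDA events (an assumption about methodology, not the authors' actual benchmark code):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void emptyKernel(void) { }    // does nothing; we only pay the launch cost

int main(void) {
    const int nLaunches = 100000;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();              // warm-up launch
    cudaThreadSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < nLaunches; i++)
        emptyKernel<<<1, 1>>>();          // back-to-back launches, no arguments
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);       // elapsed time in milliseconds
    printf("average launch overhead: %f ms\n", totalMs / nLaunches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}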

  37. Languages Supported in CUDA • Note that everything is done in C • Yet minor extensions are needed to flag the fact that a function actually represents a kernel, that there are functions that will only run on the device, etc. • Called “C with extensions” • FORTRAN is supported, an ongoing project with the Portland Group (PGI) • There is support for C++ programming (operator overloading, for instance) 37

  38. CUDA Function Declarations (the “C with extensions” part) • __global__ defines a kernel function • Must return void • For a full list, see the CUDA Reference Manual 38
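
A short sketch of the main qualifiers from that list; only __global__ is mentioned on the slide, while __device__ and __host__ are the other standard CUDA function qualifiers, shown here for context:

#include <cuda_runtime.h>

// __device__ : runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __global__ : a kernel; runs on the device, called from the host, must return void.
__global__ void squareAll(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = square(d_data[i]);
}

// __host__ : runs on the host (this is the default, so the qualifier is usually omitted).
__host__ void launchSquareAll(float *d_data, int n) {
    squareAll<<<(n + 255) / 256, 256>>>(d_data, n);
}

int main(void) {
    const int n = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    launchSquareAll(d_data, n);
    cudaThreadSynchronize();
    cudaFree(d_data);
    return 0;
}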

  39. Block Execution Scheduling Issues

  40. Who’s Executing Here? [The Stream Multiprocessor (SM)] • The SM represents the quantum of scalability on NVIDIA’s architecture • My laptop: 4 SMs • The Tesla C1060: 30 SMs • Stream Multiprocessor (SM) • 8 Scalar Processors (SP) • 2 Special Function Units (SFU) • It’s where a block lands for execution • Multi-threaded instruction dispatch • From 1 up to 1024 (!) threads active • Shared instruction fetch per 32 threads • 16 KB shared memory + 16 KB of registers • DRAM texture and memory access • [Diagram: a Stream Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs] 40

  41. Scheduling on the Hardware • [Diagram: the Host launches Kernel 1 on Grid 1 (Blocks (0,0) through (2,1)) and Kernel 2 on Grid 2; a block, e.g. Block (1,1), contains a 2D arrangement of threads, Thread (0,0) through Thread (4,2)] • Grid is launched on the SPA • Thread Blocks are serially distributed to all the SMs • Potentially >1 Thread Block per SM • Each SM launches Warps of Threads • SM schedules and executes Warps that are ready to run • As Warps and Thread Blocks complete, resources are freed • SPA can launch next Block[s] in line • NOTE: Two levels of scheduling: • For running [desirably] a large number of blocks on a small number of SMs (16/14/etc.) • For running up to 32 warps of threads on the 8 SPs available on each SM 41

  42. SM Executes Blocks • [Diagram: two SMs (SM 0 and SM 1), each with an MT IU, SPs, and Shared Memory, running threads t0 t1 t2 … tm from the Blocks assigned to them; texture fetches (TF) go through the Texture L1 and L2 caches to Memory] • Threads are assigned to SMs in Block granularity • Up to 8 Blocks to each SM (doesn’t mean you’ll have eight though…) • One SM can take up to 1024 threads • This is 32 warps • Could be 256 (threads/block) * 4 blocks • Or 128 (threads/block) * 8 blocks, etc. • Threads run concurrently but time slicing is involved • SM assigns/maintains thread id #s • SM manages/schedules thread execution • There is NO time slicing for block execution 42

  43. Thread Scheduling/Execution • Each Thread Block is divided in 32-thread Warps • This is an implementation decision, not part of the CUDA programming model • Warps are the basic scheduling units in SM • If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM? • Each Block is divided into 256/32 = 8 Warps • There are 8 * 3 = 24 Warps • At any point in time, only *one* of the 24 Warps will be selected for instruction fetch and execution. • [Diagram: Block 1 Warps and Block 2 Warps, each warp being threads t0 t1 t2 … t31, feeding the Streaming Multiprocessor (Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs)] 43 HK-UIUC

  44. SM Warp Scheduling • [Diagram: the SM multithreaded Warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 35, warp 8 instruction 12, warp 3 instruction 36] • SM hardware implements zero-overhead Warp scheduling • Warps whose next instruction has its operands ready for consumption are eligible for execution • Eligible Warps are selected for execution on a prioritized scheduling policy • All threads in a Warp execute the same instruction when selected • 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp on G80 • Side-comment: • Suppose your code has one global memory access every four instructions • Then a minimum of 13 Warps is needed to fully tolerate 200-cycle memory latency (200 cycles / (4 instructions × 4 cycles per instruction) ≈ 12.5, rounded up) 44 HK-UIUC

  45. Review: The CUDA Programming Model • GPU Architecture Paradigm: Single Instruction Multiple Data (SIMD) • What’s the overall software (application) development model? • CUDA integrated CPU + GPU application C program • Serial C code executes on CPU • Parallel Kernel C code executes on GPU thread blocks • [Diagram: CPU Serial Code, then GPU Parallel Kernel KernelA<<< nBlkA, nTidA >>>(args) on Grid 0, then CPU Serial Code again, then GPU Parallel Kernel KernelB<<< nBlkB, nTidB >>>(args) on Grid 1] 45

  46. The CPU perspective of the GPU… • The GPU is viewed as a compute device that: • Is a co-processor to the CPU, or host • Runs many threads in parallel • Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads • When a kernel is invoked, you will have to instruct the GPU how many threads are supposed to run this kernel • You have to indicate the number of blocks of threads • You have to indicate how many threads are in each block 46

  47. Caveats [1] • Flop rates for GPUs are reported for single precision operations • Double precision is supported, but the rule of thumb is that you get about a 4X slowdown relative to single precision • Also, some small deviations from IEEE 754 exist • Combining multiplication and addition into one operation is not compliant 47

  48. Caveats [2] • There is no synchronization between threads that live in different blocks • If all threads need to synchronize, this is accomplished by exiting the kernel and invoking another one • Average overhead for a kernel launch ≈ 90-110 microseconds (small…) • IMPORTANT: Global, constant, and texture memory spaces are persistent across successive kernel calls made by the same application 48
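
A sketch of the pattern described in the second bullet: split the work into two kernels, and rely on the fact that kernels launched to the same (default) stream execute in order and that global memory persists between them. The kernels are illustrative:

#include <cuda_runtime.h>

// Phase 1: each thread writes its partial result to global memory.
__global__ void phase1(float *d_partial, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_partial[i] = (float)i;
}

// Phase 2: every thread may now read any value produced in phase 1,
// regardless of which block produced it.
__global__ void phase2(const float *d_partial, float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = d_partial[i] + d_partial[n - 1 - i];
}

int main(void) {
    const int n = 1 << 20;
    float *d_partial, *d_out;
    cudaMalloc((void **)&d_partial, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    int blocks = (n + 255) / 256;
    phase1<<<blocks, 256>>>(d_partial, n);
    // No explicit host synchronization is needed between the two launches:
    // kernels issued to the same (default) stream run one after the other,
    // which acts as the device-wide synchronization point.
    phase2<<<blocks, 256>>>(d_partial, d_out, n);

    cudaThreadSynchronize();
    cudaFree(d_partial);
    cudaFree(d_out);
    return 0;
}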

  49. CUDA Memory Spaces 49

  50. The Memory Space • The memory space is the union of: • Registers • Shared memory • Device memory, which can be: • Global memory • Constant memory • Texture memory • Remarks: • The constant memory is cached • The texture memory is cached • The global memory is NOT cached • Memory bandwidth to device memory: 102 GB/s 50
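
A small kernel sketch touching the spaces listed above: registers, shared memory, global memory, and constant memory. The kernel and its coefficients are illustrative:

#include <cuda_runtime.h>

__constant__ float c_scale;              // constant memory: read-only on the device, cached

__global__ void smooth(const float *d_in, float *d_out, int n) {   // d_in/d_out live in global memory
    __shared__ float s_buf[256];         // shared memory: one copy per block, visible to its threads

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register
    s_buf[threadIdx.x] = (i < n) ? d_in[i] : 0.0f;
    __syncthreads();                     // threads of the block wait for the shared buffer to fill

    if (i < n) {
        float left  = (threadIdx.x > 0)   ? s_buf[threadIdx.x - 1] : s_buf[threadIdx.x];
        float right = (threadIdx.x < 255) ? s_buf[threadIdx.x + 1] : s_buf[threadIdx.x];
        d_out[i] = c_scale * (left + s_buf[threadIdx.x] + right);
    }
}

int main(void) {
    const int n = 1 << 20;
    float scale = 1.0f / 3.0f;
    cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));   // fill constant memory from the host

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    smooth<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaThreadSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}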
