
CUDA Lecture 5: CUDA at the University of Akron


Presentation Transcript


  1. CUDA Lecture 5: CUDA at the University of Akron. Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

  2. Overview: CUDA Equipment
  • Your own PCs running G80 emulators
  • Better debugging environment
  • Sufficient for the first couple of weeks
  • Your own PCs with a CUDA-enabled GPU
  • NVIDIA boards in department
  • GeForce family of processors for high-performance gaming
  • Tesla C2070 for high-performance computing – no graphics output (?) and more memory

  3. Summary: NVIDIA Technology

  4. Hardware View, Consumer Procs.
  • Basic building block is a “streaming multiprocessor”
  • different chips have different numbers of these SMs.

  5. Hardware View, 2nd Generation
  • Basic building block is a “streaming multiprocessor” with
  • 8 cores, each with 2048 registers
  • up to 128 threads per core
  • 16KB of shared memory
  • 8KB cache for constants held in device memory
  • different chips have different numbers of these SMs.

  6. Hardware View, Fermi
  • each streaming multiprocessor has
  • 32 cores, each with 1024 registers
  • up to 48 threads per core
  • 64KB of shared memory / L1 cache
  • 8KB cache for constants held in device memory
  • there’s also a unified 384KB L2 cache
  • different chips again have different numbers of SMs.

  7. Different Compute Capabilities

  8. Different Compute Capabilities

  9. Common Technical Specifications

  10. Different Technical Specifications

  11. Different Technical Specifications

  12. Different Technical Specifications

  13. Overview: CUDA Components
  • CUDA (Compute Unified Device Architecture) is NVIDIA’s program development environment:
  • based on C with some extensions (see the sketch after this slide)
  • C++ support increasing steadily
  • Fortran support provided by the PGI compiler
  • lots of example code and good documentation – 2-4 week learning curve for those with experience of OpenMP and MPI programming
  • large user community on NVIDIA forums
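  To make “based on C with some extensions” concrete, here is a minimal vector-addition sketch (not from the original slides; the names are illustrative): __global__ marks a function that runs on the device, the built-in blockIdx/blockDim/threadIdx variables identify each thread, and the <<<...>>> syntax is the kernel-launch extension.

      #include <cstdio>
      #include <cuda_runtime.h>

      // __global__ is a CUDA extension: this function runs on the GPU.
      __global__ void vecAdd(const float *a, const float *b, float *c, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in thread index variables
          if (i < n) c[i] = a[i] + b[i];
      }

      int main()
      {
          const int n = 256;
          float ha[n], hb[n], hc[n];
          for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

          float *da, *db, *dc;
          cudaMalloc(&da, n * sizeof(float));              // allocate device memory
          cudaMalloc(&db, n * sizeof(float));
          cudaMalloc(&dc, n * sizeof(float));
          cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
          cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

          vecAdd<<<(n + 127) / 128, 128>>>(da, db, dc, n); // <<<blocks, threads>>> launch

          cudaMemcpy(hc, dc, n * sizeof(float), cudaMemcpyDeviceToHost);
          printf("c[100] = %f (expect 300.0)\n", hc[100]);
          cudaFree(da); cudaFree(db); cudaFree(dc);
          return 0;
      }

  Everything outside the kernel and the launch is ordinary C, which is what keeps the learning curve short for OpenMP/MPI programmers.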

  14. Overview: CUDA Components
  • When installing CUDA on a system, there are 3 components:
  • driver
  • low-level software that controls the graphics card
  • usually installed by sys-admin
  • toolkit
  • nvcc CUDA compiler
  • some profiling and debugging tools
  • various libraries
  • usually installed by sys-admin in /usr/local/cuda

  15. Overview: CUDA Components
  • SDK
  • lots of demonstration examples
  • a convenient Makefile for building applications
  • some error-checking utilities
  • not supported by NVIDIA
  • almost no documentation
  • often installed by user in own directory

  16. Accessing the Tesla Card
  • Remotely access the front end: ssh tesla.cs.uakron.edu
  • ssh sends your commands over an encrypted stream so your passwords, etc., can’t be sniffed over the network

  17. Accessing the Tesla Card
  • The first time you do this:
  • After login, run /root/gpucomputingsdk_3.2.16_linux.run and just take the default answers to get your own personal copy of the SDK.
  • Then: cd ~/NVIDIA_GPU_Computing_SDK/C and make -j12 -k will build all that can be built.

  18. Accessing the Tesla Card
  • The first time you do this:
  • Binaries end up in ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
  • In particular the header file <cutil_inline.h> is in ~/NVIDIA_GPU_Computing_SDK/C/common/inc
  • You can then get a summary of technical specs and compute capabilities by executing ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery (the sketch after this slide reads the same specs through the runtime API)
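  The specs that deviceQuery prints can also be read programmatically; the short sketch below (an illustration, not part of the SDK sample itself) uses the runtime API calls cudaGetDeviceCount and cudaGetDeviceProperties.

      #include <cstdio>
      #include <cuda_runtime.h>

      int main()
      {
          int count = 0;
          cudaGetDeviceCount(&count);             // how many CUDA devices are visible
          for (int d = 0; d < count; d++) {
              cudaDeviceProp p;
              cudaGetDeviceProperties(&p, d);     // fill in the property struct
              printf("Device %d: %s\n", d, p.name);
              printf("  compute capability: %d.%d\n", p.major, p.minor);
              printf("  multiprocessors:    %d\n", p.multiProcessorCount);
              printf("  shared mem/block:   %zu bytes\n", p.sharedMemPerBlock);
              printf("  registers/block:    %d\n", p.regsPerBlock);
          }
          return 0;
      }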

  19. CUDA Makefile
  • Two choices:
  • use nvcc within a standard Makefile
  • use the special Makefile template provided in the SDK
  • The SDK Makefile provides some useful options:
  • make emu=1
  • uses an emulation library for debugging on a CPU
  • make dbg=1
  • activates run-time error checking
  • In general just use a standard Makefile

  20. Sample Tesla Makefile (listing shown on the original slide)

  21. Compiling a CUDA Program
  • Parallel Thread Execution (PTX)
  • Virtual machine and ISA
  • Programming model
  • Execution resources and state

  22. Compilation
  • Any source file containing CUDA extensions must be compiled with NVCC
  • NVCC is a compiler driver
  • Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
  • NVCC outputs (see the annotated sketch after this slide)
  • C code (host CPU code)
  • Must then be compiled with the rest of the application using another tool
  • PTX
  • Object code directly, or PTX source interpreted at runtime
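  As a hedged illustration of this split (the file and kernel names are invented), the comments below mark which parts of a .cu file NVCC turns into PTX and which it hands to the host compiler; nvcc -ptx separate.cu lets you inspect the generated PTX directly.

      // separate.cu
      // Device part: NVCC compiles this function to PTX (and/or object code).
      __global__ void scale(float *x, float f)
      {
          x[threadIdx.x] *= f;
      }

      // Host part: NVCC rewrites the <<< >>> launch into runtime API calls,
      // then passes the remaining C code to g++ (or cl on Windows).
      int main()
      {
          float *d;
          cudaMalloc(&d, 64 * sizeof(float));
          scale<<<1, 64>>>(d, 2.0f);
          cudaFree(d);
          return 0;
      }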

  23. Linking
  • Any executable with CUDA code requires two dynamic libraries
  • The CUDA runtime library (cudart)
  • The CUDA core library (cuda)

  24. Debugging Using the Device Emulation Mode
  • An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  • No need for any device or CUDA driver
  • Each device thread is emulated with a host thread

  25. Debugging Using the Device Emulation Mode
  • Running in device emulation mode, one can
  • Use host native debug support (breakpoints, inspection, etc.)
  • Access any device-specific data from host code and vice versa
  • Call any host function from device code (e.g. printf) and vice versa (see the sketch after this slide)
  • Detect deadlock situations caused by improper usage of __syncthreads
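  For example, a kernel like this minimal sketch can call printf when built with nvcc -deviceemu, because each emulated device thread is an ordinary host thread (the sketch is illustrative; on real hardware, device-side printf requires compute capability 2.0 or later).

      #include <cstdio>

      // Under -deviceemu this is a plain host printf executed by the
      // host thread that emulates each device thread.
      __global__ void hello()
      {
          printf("hello from thread %d\n", threadIdx.x);
      }

      int main()
      {
          hello<<<1, 4>>>();
          cudaThreadSynchronize();   // wait for the kernel before exiting
          return 0;
      }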

  26. Device Emulation Mode Pitfalls
  • Emulated device threads execute sequentially, so simultaneous access of the same memory location by multiple threads could produce different results (a race of this kind is sketched after this slide)
  • Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
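  A minimal sketch of the first pitfall (names invented for illustration): every thread stores its own index to the same word, a write-write race. Sequential emulation always produces the same “winner”, while on the device the surviving value depends on scheduling.

      #include <cstdio>

      // All 256 threads write to the same location: a data race.
      __global__ void race(int *out)
      {
          *out = threadIdx.x;
      }

      int main()
      {
          int *d_out, h_out;
          cudaMalloc(&d_out, sizeof(int));
          race<<<1, 256>>>(d_out);
          cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
          printf("winner: %d\n", h_out);   // repeatable under emulation, not on hardware
          cudaFree(d_out);
          return 0;
      }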

  27. Floating Point
  • Results of floating-point computations will differ slightly because of (see the sketch after this slide)
  • Different compiler outputs, instruction sets
  • Use of extended precision for intermediate results
  • There are various options to force strict single precision on the host
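  A hedged sketch of why results can drift (nothing here is from the original slides): both loops below sum the same squares in the same order in single precision, yet the host compiler may keep intermediates in extended-precision x87 registers while the device may contract each multiply-add into an FMA, so the printed values can differ in the last bits.

      #include <cstdio>
      #include <cuda_runtime.h>

      #define N 1024

      // One device thread sums in the same order as the host loop, so any
      // difference comes from intermediate precision / instruction selection.
      __global__ void sumSquares(const float *x, float *result)
      {
          float s = 0.0f;
          for (int i = 0; i < N; i++) s += x[i] * x[i];    // may compile to an FMA
          *result = s;
      }

      int main()
      {
          float h[N], hostSum = 0.0f;
          for (int i = 0; i < N; i++) h[i] = 1.0f / (i + 1);
          for (int i = 0; i < N; i++) hostSum += h[i] * h[i];  // host may widen intermediates

          float *d_x, *d_r, devSum;
          cudaMalloc(&d_x, N * sizeof(float));
          cudaMalloc(&d_r, sizeof(float));
          cudaMemcpy(d_x, h, N * sizeof(float), cudaMemcpyHostToDevice);
          sumSquares<<<1, 1>>>(d_x, d_r);
          cudaMemcpy(&devSum, d_r, sizeof(float), cudaMemcpyDeviceToHost);

          printf("host %.9f  device %.9f  diff %g\n", hostSum, devSum, hostSum - devSum);
          cudaFree(d_x); cudaFree(d_r);
          return 0;
      }

  Host compiler flags such as gcc’s -ffloat-store (or building for SSE arithmetic) are the kind of option the slide alludes to for forcing strict single precision.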

  28. Nexus
  • New Visual Studio-based GPU integrated development environment
  • http://developer.nvidia.com/object/nexus.html
  • Available in beta (as of October 2009)

  29. End Credits
  • Based on original material from
  • http://en.wikipedia.org/wiki/CUDA, accessed 6/22/2011
  • The University of Akron: Charles Van Tilburg
  • The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  • Oxford University: Mike Giles
  • Stanford University: Jared Hoberock, David Tarjan
  • Revision history: last updated 6/23/2011.
