
Lecture 2: Introduction to Parallel Computing Using CUDA


Presentation Transcript


  1. Lecture 2: Introduction to Parallel Computing Using CUDA • IEEE Boston Continuing Education Program • Ken Domino, Domem Technologies • May 9, 2011

  2. Announcements • Course website updates: • Syllabus – http://domemtech.com/ieee-pp/Syllabus.docx • Lecture 1 – http://domemtech.com/ieee-pp/Lecture1.pptx • Lecture 2 – http://domemtech.com/ieee-pp/Lecture2.pptx • References – http://domemtech.com/ieee-pp/References.docx • The Ocelot April 5 download is not working

  3. PRAM • Parallel Random Access Machine (PRAM). • An idealized SIMD parallel computing model. • Unlimited number of RAMs, called Processing Units (PUs). • The RAMs operate synchronously, all executing the same instructions. • Shared memory is unlimited and accessed in one unit of time. • Shared memory access is one of CREW, CRCW, or EREW (concurrent/exclusive read, concurrent/exclusive write). • Communication between RAMs is only through shared memory.

  4. PRAM pseudo code • Parallel for loop • for Pi, 1 ≤ i ≤ n in parallel do … end • (aka "data-level parallelism")
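As a rough CUDA analogue of this parallel for (a sketch; the kernel, array, and launch sizes are illustrative, not from the slides), each thread plays the role of one processing unit Pi:

      // "for Pi, 1 <= i <= n in parallel do A[i] = A[i] + 1"
      __global__ void parallel_for(int * A, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index, like Pi
          if (i < n)              // guard: there may be more threads than elements
              A[i] = A[i] + 1;    // each PU updates its own element of the shared data
      }
      // launched as: parallel_for<<<(n + 255) / 256, 256>>>(dA, n);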

  5. Synchronization • A simple example from C:

  6. Synchronization • What happens if we have two threads competing for the same resources (char_in/char_out)?

  7. Synchronization • What happens if two threads execute this code serially? No prob!

  8. Synchronization • What happens if two threads execute this code in parallel? We can sometimes get a problem. char_in of T2 overwrites char_in of T1!

  9. Synchronization • Synchronization forces thread serialization, e.g., so concurrent access does not cause problems.

  10. Synchronization • Two types: • Mutual exclusion, using a "mutex" semaphore = a lock • Cooperation, waiting on an object until all other threads are ready, using wait() + notify(), i.e., barrier synchronization
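In CUDA terms (a sketch looking ahead to the CUDA material later in the lecture; the kernel and sizes are illustrative): atomic operations give mutual exclusion on a single word, and __syncthreads() is the in-block barrier.

      __global__ void sync_demo(int * counter, float * data)
      {
          // Mutual exclusion: atomicAdd serializes the competing updates to *counter,
          // so no increment is lost even though many threads run this line at once.
          atomicAdd(counter, 1);

          // Cooperation: write your element, then wait at the barrier until every
          // thread in the block has written, before reading a neighbour's element.
          __shared__ float buf[256];                 // assumes blockDim.x == 256
          buf[threadIdx.x] = data[threadIdx.x];      // assumes data has blockDim.x elements
          __syncthreads();                           // barrier synchronization
          data[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];
      }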

  11. Deadlock • Can arise from the use of mutual exclusion on two or more resources: each thread holds a lock on one resource while waiting for a lock held by another, so neither can proceed.

  12. PRAM Synchronization • "stay idle" – a processor waits until the other processors complete; this is "cooperative" synchronization

  13. CUDA • “Compute Unified Device Architecture” • Developed by NVIDIA, introduced November 2006 • Based on C, extended later to work with C++. • CUDA provides three key abstractions: • a hierarchy of thread groups • shared memories • barrier synchronization http://www.nvidia.com/object/IO_37226.html, http://www.gpgpu.org/oldsite/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf, Nickolls, J., Buck, I., Garland, M. and Skadron, K. Scalable parallel programming with CUDA. Queue, 6 (2). 40-53.

  14. GPU coprocessor to CPU

  15. NVIDIA GPU Architecture • Multiprocessor (MP) = texture/processor cluster (TPC) • Dynamic random-access memory (DRAM), aka "global memory" • Raster operation processor (ROP) • L2 – level-2 memory cache

  16. NVIDIA GPU Architecture • Streaming multiprocessor (SM) • Streaming processor (SP) • Streaming multiprocessor control (SMC) • Texture processing unit (TPU) • Con cache – "constant" memory • Sh. memory – "shared" memory • Multithreaded instruction fetch and issue unit (MTIFI) • 1st generation: G80 – 2006; 3rd generation: Fermi, GTX 570 – 2010

  17. Single-instruction, multiple-thread • “SIMT” • SIMT = SIMD + SPMD (single program, multiple data). • Multiple threads. • Sort of “Single Instruction”—except that each instruction executed is in multiple independent parallel threads. • Instruction set architecture: a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

  18. Single-instruction, multiple-thread • The streaming multiprocessor (SM) is a hardware multithreaded unit. • Threads are executed in groups of 32 parallel threads called warps. • Each thread has its own set of registers. • The individual threads composing a warp are of the same program and start together at the same program address, but they are otherwise free to branch and execute independently.

  19. Single-instruction, multiple-thread • Instruction executed is same for each warp. • If threads of a warp diverge via a data dependent conditional branch, the warp serially executes each branch path taken.

  20. Single-instruction, multiple-thread • Execution within a warp is serialized if there is: • divergence in instructions (i.e., a data-dependent conditional branch), or • write access to the same memory location
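A minimal sketch of a divergent branch (the kernel is illustrative, not from the slides): within each warp the even and odd threads take different paths, so the hardware runs the two paths one after the other and then re-converges.

      __global__ void diverge(float * d)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // assumes d has one element per thread
          if (i % 2 == 0)          // data-dependent branch splits every warp
              d[i] = d[i] * 2.0f;  // path A: runs first, odd threads masked off
          else
              d[i] = d[i] + 1.0f;  // path B: runs next, even threads masked off
      }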

  21. Warp Scheduling • SM hardware implements near-zero-overhead warp scheduling • Warps whose next instruction has its operands ready for consumption can be executed • Eligible warps are selected for execution by priority • All threads in a warp execute the same instruction • 4 clock cycles are needed to dispatch the instruction for all threads (G80)

  22. Cooperative Thread Array (CTA) • An abstraction for synchronizing threads • AKA a thread block; a grid is an array of CTAs • A CTA's threads are mapped onto warps

  23. Cooperative Thread Array (CTA) • Each thread has a unique integer thread ID (TID). • Threads of a CTA share data in global or shared memory • Threads synchronize with the barrier instruction. • CTA thread programs use their TIDs to select work and index shared data arrays.

  24. Cooperative Thread Array (CTA) • The programmer declares a 1D, 2D, or 3D grid shape and its dimensions in threads. • The TID is a 1D, 2D, or 3D index.

  25. Restrictions in grid sizes

  26. Kernel • Every thread in a grid executes the same body of instructions, called a kernel. • In CUDA, it’s just a function.

  27. CUDA Kernels • Kernels are declared with __global__ void • Parameters are the same for all threads.
      __global__ void fun(float * d, int size)
      {
          int idx = threadIdx.x + blockDim.x * blockIdx.x
                  + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
                  + blockDim.x * gridDim.x * threadIdx.y;
          if (idx < 0) return;
          if (idx >= size) return;
          d[idx] = idx * 10.0 / 0.1;
      }

  28. CUDA Kernels • Kernels are called via "chevron syntax": • Func<<< Dg, Db, Ns, S >>>(parameters) • Dg is of type dim3 and specifies the dimension and size of the grid • Db is of type dim3 and specifies the dimension and size of the block • Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block • S is of type cudaStream_t and specifies the associated stream • A kernel has void return type; results must be returned through a pointer parameter (device memory) • Example: • Foo<<<1, 100>>>(1, 2, i);
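A sketch of the full launch form (the grid, block, and shared-memory sizes are illustrative; it reuses fun and the device pointer d from slide 27):

      dim3 Dg(16, 16);                          // grid: 16 x 16 blocks
      dim3 Db(8, 8);                            // block: 8 x 8 threads
      size_t Ns = Db.x * Db.y * sizeof(float);  // dynamic shared memory per block
      cudaStream_t S;
      cudaStreamCreate(&S);                     // stream to launch on
      fun<<<Dg, Db, Ns, S>>>(d, size);          // Ns and S may be omitted; both default to 0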

  29. Memory • CTAs have access to various types of memory • Global, shared, constant, texture, registers • Threads can access host memory, too.

  30. Types of memory

  31. CUDA Memory • Data types (int, long, float, double, etc.) are the same as on the host. • Shared memory is shared between the threads in a block. • Global memory is shared by all threads in all blocks. • Constant memory is shared by all threads in all blocks, but it cannot be changed by them (so it is faster). • Host memory (of the CPU) can be accessed by all threads in all blocks.

  32. Shared Memory • __shared__ declares a variable that: • Resides in the shared memory space of a thread block, • Has the lifetime of the block, • Is accessible only from the threads within the block. • Examples: • extern __shared__ float shared[]; • (or sized on the kernel call—see the sketch below, and more later!)
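A minimal sketch of that dynamically-sized form (the kernel and the block-level reversal it performs are illustrative): the byte size of the extern __shared__ array is passed as the third chevron argument at launch.

      __global__ void reverse(float * d, int n)
      {
          extern __shared__ float shared[];   // sized at launch time, not at compile time
          int t = threadIdx.x;
          shared[t] = d[t];                   // stage this block's data in shared memory
          __syncthreads();                    // wait until every thread has written
          d[t] = shared[n - 1 - t];           // read back in reverse order
      }
      // launched with one block of n threads and n floats of dynamic shared memory:
      // reverse<<<1, n, n * sizeof(float)>>>(d, n);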

  33. Global Memory • __device__ declares a variable that: • Resides in global memory space; • Has the lifetime of an application; • Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()) • Can be allocated through cudaMalloc() • Examples: • extern __device__ int data[100]; • cudaMalloc(&d, 100*sizeof(int));
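A sketch of accessing such a variable from the host through the runtime library (the variable name and sizes are illustrative):

      __device__ int data[100];                            // lives in global memory for the whole run

      // in host code:
      int h_data[100] = { 0 };
      cudaMemcpyToSymbol(data, h_data, sizeof(h_data));    // host -> device, by symbol
      cudaMemcpyFromSymbol(h_data, data, sizeof(h_data));  // device -> host, after kernels run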

  34. Basic host function calls • Global memory allocation via cudaMalloc() • Copying memory between host and GPU via cudaMemcpy() • Kernels are called by chevron syntax

  35. Counting 6's • We have an array of integers, h[], and want to count the number of 6's that appear in it. • h[0..size-1] • How do we do this in CUDA?

  36. Counting 6’s • Divide the array into blocks of blocksize threads. • For each block, sum the number of times 6 appears. • Return the sum for each block.

  37. Counting 6's • Divide the array into blocks of blocksize threads. • For each block, sum the number of times 6 appears. • Return the sum for each block.
      #include <stdio.h>
      __global__ void c6(int * d_in, int * d_out, int size)
      {
          int sum = 0;
          for (int i = 0; i < blockDim.x; i++)
          {
              int val = d_in[i + blockIdx.x * blockDim.x];
              if (val == 6)
                  sum++;
          }
          d_out[blockIdx.x] = sum;
      }

  38. Counting 6's
      int main()
      {
          int size = 300;
          int * h = (int*)malloc(size * sizeof(int));
          for (int i = 0; i < size; ++i)
              h[i] = i % 10;
          int * d_in;
          int * d_out;
          int bsize = 100;
          int blocks = size / bsize;           // assumes size is a multiple of bsize (see note)
          int threads_per_block = bsize;
          int rv1 = cudaMalloc(&d_in, size * sizeof(int));
          int rv2 = cudaMalloc(&d_out, blocks * sizeof(int));
          int rv3 = cudaMemcpy(d_in, h, size * sizeof(int), cudaMemcpyHostToDevice);
          c6<<<blocks, threads_per_block>>>(d_in, d_out, size);
          cudaThreadSynchronize();
          int rv4 = cudaGetLastError();
          int * r = (int*)malloc(blocks * sizeof(int));
          int rv5 = cudaMemcpy(r, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
          int sum = 0;
          for (int i = 0; i < blocks; ++i)
              sum += r[i];
          printf("Result = %d\n", sum);
          return 0;
      }
  • In the main program, call the kernel with the correct block and grid dimensions. • Note: size % blocksize = 0 is assumed. • How would we extend this for arbitrary array size? (See the sketch below.)
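One possible extension (a sketch, not from the slides): round the block count up and guard every access against the array bound, so the last, partial block does not read past the end.

      // Sketch: counting 6's for an arbitrary array size.
      __global__ void c6_any(int * d_in, int * d_out, int size)
      {
          int sum = 0;
          for (int i = 0; i < blockDim.x; i++)
          {
              int idx = i + blockIdx.x * blockDim.x;
              if (idx < size && d_in[idx] == 6)   // guard: skip indices past the end
                  sum++;
          }
          d_out[blockIdx.x] = sum;
      }
      // host side: round the block count up so the tail of the array is covered
      // int blocks = (size + bsize - 1) / bsize;
      // c6_any<<<blocks, bsize>>>(d_in, d_out, size);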

  39. Developing CUDA programs • Install the CUDA SDK (drivers, toolkit, examples) • Windows, Linux, Mac: • Use version 4.0, release candidate 2. (The older 3.2 release does not work with VS2010 easily! You can install both VS2010 and VS2008, but you will have to manage paths.) • http://developer.nvidia.com/cuda-toolkit-40 • Install the toolkit, the tools SDK, and the example code • For the drivers, you must have an NVIDIA GPU card • Recommendation: the SDK examples rely on definitions in a common utility library; avoid using it so your own code does not pick up that dependency.

  40. Developing CUDA programs • Emulation • Do not install the CUDA drivers (the install will fail). • Windows and Mac only • Install VirtualBox. • Create a 40GB virtual drive. • Install Ubuntu from an ISO image on VirtualBox. • Install Ocelot (http://code.google.com/p/gpuocelot/downloads/list) • Install the various dependencies (sudo apt-get install xxxx, for g++, boost, etc.) • Note: there is a problem with the current release of Ocelot—I emailed Gregory.Diamos@gatech.edu to resolve the build issue.

  41. Developing CUDA programs • Windows: • Install VS2010 C++ Express (http://www.microsoft.com/visualstudio/en-us/products/2010-editions/visual-cpp-express) • (Test installation with “Hello World” .cpp example.)

  42. Developing CUDA programs • Windows: • Create an empty C++ console project • Create a hw.cu "hello world" program in the source directory • Project -> Custom Build Rules, check the box for CUDA 4.0 targets • Add hw.cu to your empty project • Note: the ".cu" suffix stands for "CUDA source code". You can put CUDA syntax into .cpp files, but the build environment won't know what to compile them with (cl/g++ vs. nvcc).

  43. Developing CUDA programs • hw.cu:
      #include <stdio.h>
      __global__ void fun(int * mem)
      {
          *mem = 1;
      }
      int main()
      {
          int h = 0;
          int * d;
          cudaMalloc(&d, sizeof(int));
          cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
          fun<<<1,1>>>(d);
          cudaThreadSynchronize();
          int rv = cudaGetLastError();
          cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
          printf("Result = %d\n", h);
          return 0;
      }

  44. Developing CUDA programs • Compile, link, and run • (The version 4.0 installation adjusts all environment variables.)

  45. NVCC • nvcc (the NVIDIA CUDA compiler) is a driver program for the compiler phases • Use the -keep option to see the intermediate files. (You need to add "." to the include directories when compiling; see the example below.)
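For example (hw.cu is the hello-world file from slide 43; -keep and -I are standard nvcc options):

      # nvcc -keep -I. hw.cu

The intermediate files (such as the .cu.cpp, .ptx, and .cubin files discussed on the next slides) are then left in the current directory.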

  46. NVCC • Compiles a ".cu" file into a ".cu.cpp" file • Two types of targets: virtual and real, represented as PTX assembly code and "cubin" binary code, respectively.

  47. PTXAS • Compiles PTX assembly code into machine code, placed in an ELF module.
      # cat hw.sm_10.cubin | od -t x1 | head
      0000000 7f 45 4c 46 01 01 01 33 02 00 00 00 00 00 00 00
      0000020 02 00 be 00 01 00 00 00 00 00 00 00 34 18 00 00
      0000040 34 00 00 00 0a 01 0a 00 34 00 20 00 03 00 28 00
      0000060 16 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
      0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0000120 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
      0000140 03 00 00 00 00 00 00 00 00 00 00 00 a4 03 00 00
      0000160 7f 01 00 00 00 00 00 00 00 00 00 00 04 00 00 00
      0000200 00 00 00 00 0b 00 00 00 03 00 00 00 00 00 00 00
      0000220 00 00 00 00 23 05 00 00 22 00 00 00 00 00 00 00
  • Disassembly of the machine code can be done with cuobjdump or my own utility nvdis (http://forums.nvidia.com/index.php?showtopic=183438)

  48. PTX, the GPU assembly code • PTX = "Parallel Thread Execution" • The target for PTX is an abstract GPU machine. • Contains operations for load, store, register declarations, add, sub, mul, etc.
      .version 1.4
      .target sm_10, map_f64_to_f32
      // compiled with …/be.exe
      // nvopencc 4.0 built on 2011-03-24
      .entry _Z3funPi (
          .param .u32 __cudaparm__Z3funPi_mem)
      {
          .reg .u32 %r<4>;
          .loc 16 4 0
      $LDWbegin__Z3funPi:
          .loc 16 6 0
          mov.s32 %r1, 1;
          ld.param.u32 %r2, [__cudaparm__Z3funPi_mem];
          st.global.s32 [%r2+0], %r1;
          .loc 16 7 0
          exit;
      $LDWend__Z3funPi:
      } // _Z3funPi

  49. CUDA GPU targets • Virtual – the PTX code is embedded in the executable as a string, then compiled at runtime, "just-in-time". • Real – the PTX code is compiled ahead of time into machine code for a specific GPU and embedded in the executable.
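For example (a typical invocation; the sm_20/compute_20 choices are illustrative): requesting both a real and a virtual target embeds Fermi machine code plus PTX that can be just-in-time compiled on later GPUs.

      # nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20 hw.cu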

  50. Next time • For next week, we will go into more detail: • The CUDA runtime API; • Writing efficient CUDA code; • Look at some important examples.
