
ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 4, 2013



Presentation Transcript


1. Zero-Copy Host Memory
These notes introduce “zero-copy” memory. Zero-copy memory requires page-locked memory. These materials come from Chapter 11 of “CUDA by Example” by Jason Sanders and Edward Kandrot. ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 4, 2013

2. Zero-copy memory
• Zero-copy refers to the GPU accessing host memory without explicitly copying the data from host memory to GPU memory, i.e. zero copying.
• Depending upon the hardware structure, the data may get copied anyway!
• Integrated GPUs that are part of the system chipset and share system memory do not copy (example: MacBook Pro).
• Discrete GPU cards with their own device memory do.
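As a quick check (a sketch added here, not from the original slides), the cudaDeviceProp structure listed on slide 11 reports whether a device is integrated and whether it can map host memory:

#include <stdio.h>
#include <cuda.h>

int main(void) {
  cudaDeviceProp prop;
  int myDevice;
  cudaGetDevice(&myDevice);                  // device executing this thread
  cudaGetDeviceProperties(&prop, myDevice);  // fill the property structure
  printf("Device %d (%s): integrated = %d, canMapHostMemory = %d\n",
         myDevice, prop.name, prop.integrated, prop.canMapHostMemory);
  return 0;
}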

3. CUDA routines for zero-copy memory
Zero-copy uses page-locked memory. Allocate with:

cudaHostAlloc(void **ptr, size_t size, unsigned int flags)

Allocates page-locked memory that is accessible to the device. Set flags to:
cudaHostAllocMapped: maps the allocation into the CUDA address space.
Reference: NVIDIA CUDA library http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/

4. Flags continued
cudaHostAllocWriteCombined: allocates the memory as “write-combined,” which can be transferred more quickly across the PCIe bus on some system configurations, but cannot be read efficiently by most CPUs. Use for memory written by the CPU and read by the device via mapped pinned memory.
Combining flags (these are bit values, so combine them with the bitwise | operator, not the logical ||):

cudaHostAllocMapped | cudaHostAllocWriteCombined

Reference: NVIDIA CUDA library http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/
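A minimal allocation sketch using the combined flags (the error check is an addition here, not from the original slides; the buffer size is arbitrary):

#include <stdio.h>
#include <cuda.h>

int main(void) {
  int *a;                            // host pointer
  size_t size = 1024 * sizeof(int);  // arbitrary example size
  cudaError_t err = cudaHostAlloc((void**)&a, size,
                                  cudaHostAllocMapped | cudaHostAllocWriteCombined);
  if (err != cudaSuccess) {
    printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  cudaFreeHost(a);                   // pinned memory needs cudaFreeHost, not free()
  return 0;
}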

5. Device pointer to allocated memory
The device pointer to the memory is obtained by calling cudaHostGetDevicePointer():
“Passes back device pointer corresponding to mapped, pinned host buffer allocated by cudaHostAlloc() or …”
Needed to account for the different address spaces.
Parameters:

cudaHostGetDevicePointer(void **pDevice, void *pHost, unsigned int flags)

pDevice: returned device pointer for the mapped memory
pHost: requested host pointer mapping
flags: flags for extensions (must be 0 for now)
Reference: NVIDIA CUDA library http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/ http://www.clear.rice.edu/comp422/resources/cuda/html/group__CUDART__MEMORY_ga475419a9b21a66036029d5001ea908c.html

6. Code to allocate memory and get pointer for device
Allocate pinned memory on the host:

int *a;      // host pointer
int *dev_a;  // device pointer to host memory
size = … ;   // number of bytes to allocate
cudaHostAlloc( (void**)&a, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
             // the cudaHostAllocWriteCombined flag is optional, if desired

Get the device pointer to it:

cudaHostGetDevicePointer(&dev_a, a, 0);

Now there is no need to copy memory from host to device.
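One caveat not shown on this slide: in the CUDA version these notes reference, the runtime must be told to support mapped memory by calling cudaSetDeviceFlags(cudaDeviceMapHost) before any other CUDA operations in the thread (with later CUDA versions and unified addressing this can be implicit). A sketch:

#include <stdio.h>
#include <cuda.h>

int main(void) {
  int *a, *dev_a;
  size_t size = 1024 * sizeof(int);
  cudaSetDeviceFlags(cudaDeviceMapHost);           // enable mapped pinned memory first
  cudaHostAlloc((void**)&a, size, cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&dev_a, a, 0);  // device-side view of same buffer
  printf("host pointer %p maps to device pointer %p\n", (void*)a, (void*)dev_a);
  cudaFreeHost(a);
  return 0;
}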

7. Using the pointer to host memory
Simply use the returned pointer in the kernel call where one would otherwise have used a device memory pointer:

MyKernel<<<B,T>>>(dev_a, … );

without needing to modify the kernel code at all!

8. Example: Vector addition without host-device transfers

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#define N 32 // size of vectors

__global__ void add(int *a, int *b, int *c) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N)
    c[tid] = a[tid] + b[tid];
}

int main(int argc, char *argv[]) {
  int T = 32, B = 1;            // threads per block and blocks per grid
  int *a, *b, *c;               // host pointers
  int *dev_a, *dev_b, *dev_c;   // device pointers to host memory
  int size = N * sizeof(int);   // number of bytes (declaration missing in the original)
  cudaEvent_t start, stop;      // to measure time
  float elapsed_time_ms;

  cudaHostAlloc( (void**)&a, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
  cudaHostAlloc( (void**)&b, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
  cudaHostAlloc( (void**)&c, size, cudaHostAllocMapped );  // note the flags

  … // load arrays with some numbers

  // Memory copies to the device are not needed now, but device pointers are needed instead:
  cudaHostGetDevicePointer(&dev_a, a, 0);
  cudaHostGetDevicePointer(&dev_b, b, 0);
  cudaHostGetDevicePointer(&dev_c, c, 0);

  … // start time

  add<<<B,T>>>(dev_a, dev_b, dev_c);
  cudaThreadSynchronize();  // copy back not needed, but now need thread synchronization

  … // end time
  … // print results

  printf("Time to calculate results: %f ms.\n", elapsed_time_ms); // print out execution time

  cudaFreeHost(a);  // clean up; the book seems to miss out this special free routine
  cudaFreeHost(b);  // when using cudaHostAlloc
  cudaFreeHost(c);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
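The timing steps elided above (“start time” / “end time”) would normally use the cudaEvent_t variables already declared; a sketch of the standard event-timing pattern (not taken verbatim from the book):

cudaEvent_t start, stop;
float elapsed_time_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);       // start time
// ... work being timed, e.g. the kernel launch above ...
cudaEventRecord(stop, 0);        // end time
cudaEventSynchronize(stop);      // wait until the stop event has occurred
cudaEventElapsedTime(&elapsed_time_ms, start, stop);  // elapsed time in ms
cudaEventDestroy(start);
cudaEventDestroy(stop);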

9. Host memory pointed to from the device
[Diagram: the host (CPU) calls cudaHostAlloc( (void**)&a, … ) to allocate pinned host memory, obtains a device pointer with cudaHostGetDevicePointer(&dev_a, a, 0), and launches MyKernel<<<B,T>>>(dev_a, … ); the kernel __global__ void add(int *a, … ) { … } running on the device (GPU) then accesses host memory directly through dev_a.]

10. Code to determine whether the GPU has the capability for the features being used
Look at the device properties:

cudaDeviceProp prop;
int myDevice;
cudaGetDevice(&myDevice);                  // returns device executing this thread
cudaGetDeviceProperties(&prop, myDevice);  // returns a structure, see next slide
if (prop.property_name != 1)               // various property names, see next slide
  printf("Feature not available\n");

11. Properties

struct cudaDeviceProp {
  char name[256];
  size_t totalGlobalMem;
  size_t sharedMemPerBlock;
  int regsPerBlock;
  int warpSize;
  size_t memPitch;
  int maxThreadsPerBlock;
  int maxThreadsDim[3];
  int maxGridSize[3];
  size_t totalConstMem;
  int major;
  int minor;
  int clockRate;
  size_t textureAlignment;
  int deviceOverlap;
  int multiProcessorCount;
  int kernelExecTimeoutEnabled;
  int integrated;
  int canMapHostMemory;
  int computeMode;
  int concurrentKernels;
};

12. Checking whether page-locked host memory can be mapped into the device address space

…
cudaDeviceProp prop;
int myDevice;
cudaGetDevice(&myDevice);
cudaGetDeviceProperties(&prop, myDevice);
if (prop.canMapHostMemory != 1) {
  printf("Feature not available\n");
  return 0;
}
…

Very likely available, as it only needs compute capability > 1.0.

13. Integrated GPU systems
Zero-copy memory is particularly interesting with integrated GPU systems, where system memory is shared between the CPU and GPU. Increased performance will always result from using zero-copy memory on such systems (according to the course textbook).
Example: my 13” MacBook Pro, 2010:
• CPU: 2.4 GHz Intel Core 2 Duo, with 4 GB DDR3 SDRAM main memory
• GPU: NVIDIA GeForce 320M, with 256 MB DDR3 SDRAM shared between CPU and GPU
• (The 15/17” models use an Intel Graphics Media Accelerator (GMA) on a shared bus.)

14. Using multiple GPUs on one system
Each GPU needs to be controlled by a separate thread:
Thread 1 → GPU 1, Thread 2 → GPU 2
So one needs to write a multi-threaded program using thread APIs/tools such as Pthreads, WinThreads, OpenMP, … (a minimal sketch follows).
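A minimal Pthreads sketch of this pattern (added here for illustration; gpu_routine and the two-GPU assumption are hypothetical; the textbook's own utility routines are on the next slide):

#include <pthread.h>
#include <cuda.h>

void *gpu_routine(void *arg) {   // hypothetical per-GPU worker
  int device = *(int *)arg;
  cudaSetDevice(device);         // bind this CPU thread to one GPU
  // ... per-GPU work: allocations, kernel launches, copies ...
  return NULL;
}

int main(void) {
  pthread_t threads[2];
  int ids[2] = {0, 1};           // assumes at least two CUDA devices
  for (int i = 0; i < 2; i++)
    pthread_create(&threads[i], NULL, gpu_routine, &ids[i]);
  for (int i = 0; i < 2; i++)
    pthread_join(threads[i], NULL);
  return 0;
}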

15. Textbook utility routines for multi-threading
Found in ../common/book.h. Provides Win32 threads for Windows or Pthreads for Linux.
thread = start_thread(funct, ptr) is used to start a new thread. It takes as arguments a function void *funct(void *) and an argument pointer void *ptr, and returns a thread identifier of type CUTThread. To terminate a thread (join it to the main thread): end_thread(thread).

#if _WIN32  // Windows threads
#include <windows.h>
typedef HANDLE CUTThread;
typedef unsigned (WINAPI *CUT_THREADROUTINE)(void *);
#define CUT_THREADPROC unsigned WINAPI
#define CUT_THREADEND return 0
#else       // POSIX threads
#include <pthread.h>
typedef pthread_t CUTThread;
typedef void *(*CUT_THREADROUTINE)(void *);
#define CUT_THREADPROC void
#define CUT_THREADEND
#endif

// Create thread.
CUTThread start_thread(CUT_THREADROUTINE, void *data);
// Wait for thread to finish.
void end_thread(CUTThread thread);
// Destroy thread.
void destroy_thread(CUTThread thread);
// Wait for multiple threads.
void wait_for_threads(const CUTThread *threads, int num);

#if _WIN32
// Create thread
CUTThread start_thread(CUT_THREADROUTINE func, void *data) {
  return CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)func, data, 0, NULL);
}
// Wait for thread to finish
void end_thread(CUTThread thread) {
  WaitForSingleObject(thread, INFINITE);
  CloseHandle(thread);
}
// Destroy thread
void destroy_thread(CUTThread thread) {
  TerminateThread(thread, 0);
  CloseHandle(thread);
}
// Wait for multiple threads
void wait_for_threads(const CUTThread *threads, int num) {
  WaitForMultipleObjects(num, threads, true, INFINITE);
  for (int i = 0; i < num; i++)
    CloseHandle(threads[i]);
}
#else
// Create thread
CUTThread start_thread(CUT_THREADROUTINE func, void *data) {
  pthread_t thread;
  pthread_create(&thread, NULL, func, data);
  return thread;
}
// Wait for thread to finish
void end_thread(CUTThread thread) {
  pthread_join(thread, NULL);
}
// Destroy thread
void destroy_thread(CUTThread thread) {
  pthread_cancel(thread);
}
// Wait for multiple threads
void wait_for_threads(const CUTThread *threads, int num) {
  for (int i = 0; i < num; i++)
    end_thread(threads[i]);
}
#endif
…

16. Pinned memory on multiple GPUs
• Pinned memory is only pinned for the thread that allocated it. Other threads see it as pageable and access it more slowly; these threads also cannot use cudaMemcpyAsync, which requires pinned memory.
• “Portable” pinned memory: memory allowed to move between host threads so that any thread sees it as pinned memory. Use cudaHostAlloc() and include the cudaHostAllocPortable flag.
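A minimal sketch combining the portable and mapped flags (buffer name and size are placeholders):

#include <cuda.h>

int main(void) {
  float *buf;
  size_t size = 1024 * sizeof(float);
  // Portable: pinned for every host thread, so any thread may use it with
  // cudaMemcpyAsync or as mapped zero-copy memory.
  cudaHostAlloc((void**)&buf, size, cudaHostAllocPortable | cudaHostAllocMapped);
  cudaFreeHost(buf);
  return 0;
}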

17. Questions
More information: see Chapter 11 of “CUDA by Example” by Jason Sanders and Edward Kandrot, Addison-Wesley, 2011.
