
CUDA parallel programming technology


Presentation Transcript


  1. CUDA parallel programming technology O.I. Streltsova, D.V. Podgainy, M.I. Zuev Heterogeneous Computation Team HybriLIT Laboratory of Information Technologies, JINR, Dubna, Russia

  2. From 3D graphics to GPGPU. Quake 1 (1996) → NVIDIA GeForce 8 (G80 chips) → Tesla K80 (2014) → Tesla V100 (2017). NVIDIA unveiled CUDA in 2006. GPGPU – general-purpose computing on graphics processing units. https://developer.nvidia.com/cuda-zone

  3. HybriLIT: GPUs
  • NVIDIA Tesla K40 “Atlas”: 2880 CUDA cores; memory 12 GB GDDR5; peak performance 4.29 TFlops single-precision, 1.43 TFlops double-precision.
  • NVIDIA Tesla K80: 2x Kepler GK210, 4992 CUDA cores; memory 24 GB (12 GB per GPU); peak performance (with GPU Boost) 8.74 TFlops single-precision, 2.91 TFlops double-precision.
  • NVIDIA Tesla V100: 5120 CUDA cores; memory 16 GB; peak performance 15.7 TFlops single-precision, 7.8 TFlops double-precision.

  4. CPU / GPU Architecture. Intel Xeon Gold 6154: 18 cores. NVIDIA V100: 5120 cores. https://svi.nl/HuygensGPU

  5. CUDA architecture GV100

  6. Volta GV100 Streaming Multiprocessor. The GV100 SM is partitioned into four processing blocks, each with:
  • 16 FP32 cores;
  • 8 FP64 cores;
  • 16 INT32 cores;
  • two mixed-precision Tensor Cores;
  • …

  7. CUDA programming model

  8. Device Memory Hierarchy. Registers are fast and on-chip; off-chip local memory has high latency. Shared memory: tens of KB per block, on-chip, very fast. Global memory: size up to 16 GB, high latency; random access is very expensive, coalesced access is much more efficient (see the sketch below). Source: CUDA C Programming Guide (February 2014).
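A minimal sketch (not from the slide) contrasting the two global-memory access patterns; the kernel names are illustrative:

    // Coalesced: consecutive threads read consecutive addresses,
    // so each warp is served by a few wide memory transactions.
    __global__ void copy_coalesced(int n, const float *in, float *out){
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads read addresses `stride` apart,
    // scattering each warp over many segments and wasting bandwidth.
    __global__ void copy_strided(int n, int stride, const float *in, float *out){
        int i = (threadIdx.x + blockIdx.x * blockDim.x) * stride;
        if (i < n) out[i] = in[i];
    }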

  9. Hybrid model of code. Serial host code alternates with parallel device code:

    __global__ void Kernel_1(args){ ... }

    int main(){
        // serial code
        Kernel_1 <<< nBlock, nThread, nShMem, nStream >>> (args);
        // serial code
        Kernel_2 <<< nBlock, nThread >>> (args);
        ...
    }

Language extension: the kernel execution configuration <<< ... >>>.

  10. CUDA programming model. CUDA C. Definitions: device = GPU; host = CPU; kernel = function that runs on the device. Parallel parts of an application are executed on the device as kernels; one kernel is executed at a time; many threads execute each kernel.

    kernel <<< dim3 dG, dim3 dB, nShMem, nStream >>> (args)

• dim3 dG – dimension of the grid (number of blocks)
• dim3 dB – dimension of each block (number of threads)
• nShMem – the amount of shared memory to be allocated for each block
• nStream – the stream number

dim3 is an integer vector type that can be used in CUDA code:

    dim3 dG(512);         // 512 x 1 x 1
    dim3 dB(1024, 1024);  // 1024 x 1024 x 1
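A minimal launch sketch (names and sizes illustrative, not from the slides) putting these configuration parameters together; nShMem and nStream are optional and default to 0:

    __global__ void Kernel_1(float *data){ /* ... */ }

    int main(){
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        dim3 dG(512);                          // grid: 512 x 1 x 1 blocks
        dim3 dB(256);                          // block: 256 x 1 x 1 threads
        size_t nShMem = 256 * sizeof(float);   // dynamic shared memory per block
        float *dData;
        cudaMalloc((void**)&dData, 512 * 256 * sizeof(float));
        Kernel_1 <<< dG, dB, nShMem, stream >>> (dData);
        cudaStreamSynchronize(stream);         // wait for the asynchronous launch
        cudaFree(dData);
        cudaStreamDestroy(stream);
        return 0;
    }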

  11. CUDA. Function Type Qualifiers.
• __global__ – executed on the GPU, callable from the CPU (defines a kernel);
• __device__ – executed on the GPU, callable only from the GPU;
• __host__ – executed on the CPU, callable from the CPU (the default).
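A short sketch (function names illustrative) showing the three qualifiers together:

    __device__ float square(float x){          // GPU code, callable from GPU code only
        return x * x;
    }

    __global__ void kernel(int n, float *a){   // kernel: GPU code launched from the host
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) a[i] = square(a[i]);
    }

    __host__ void run(int n, float *dA){       // ordinary CPU function
        kernel <<< (n + 255) / 256, 256 >>> (n, dA);
    }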

  12. Scheme of a program in CUDA C/C++ compared with plain C/C++

  13. Threads and blocks. tid – global index of a thread:

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

  14. Task 1: GPU vector sums.

CUDA code:

    __global__ void kernel(int N, int *a, int *b, int *c){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < N){
            c[tid] = a[tid] + b[tid];
            tid += blockDim.x * gridDim.x;   // grid-stride loop
        }
    }

Serial code:

    void sum(int N, int *a, int *b, int *c){
        int tid = 0;
        while (tid < N){
            c[tid] = a[tid] + b[tid];
            tid += 1;
        }
    }

Compilation:

    $ module add cuda/v9.2
    $ nvcc 04_cublas.cu -o exec_cuda
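The slide shows only the kernel; a minimal host driver for it might look like this (a sketch assuming the kernel above and the usual <cstdlib> include; sizes illustrative):

    int main(){
        const int N = 1 << 20;
        size_t bytes = N * sizeof(int);
        int *a = (int*)malloc(bytes), *b = (int*)malloc(bytes), *c = (int*)malloc(bytes);
        for (int i = 0; i < N; i++){ a[i] = i; b[i] = 2 * i; }

        int *dA, *dB, *dC;                          // device copies of a, b, c
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);
        cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

        kernel <<< 128, 256 >>> (N, dA, dB, dC);    // grid-stride loop covers all N

        cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(a); free(b); free(c);
        return 0;
    }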

  15. Error Handling. Run-time errors: every CUDA call (except kernel launches) returns an error code of type cudaError_t. Getting the last error:

    cudaError_t cudaGetLastError(void)

    //------- Getting the last error -------//
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess){
        printf("CUDA error, CUFFT: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

  16. Error Handling: using macro substitution

    #define CUDA_CALL(x) do { cudaError_t err = (x); \
        if (err != cudaSuccess) { \
            printf("Error \"%s\" at %s:%d\n", cudaGetErrorString(err), \
                   __FILE__, __LINE__); \
            return -1; \
        } \
    } while (0)

which works with any CUDA call that returns an error code. Example:

    CUDA_CALL( cudaMalloc((void**)&dA, n*n*sizeof(float)) );

• GPU kernel calls return before completing, so you cannot immediately use cudaGetLastError to get their errors:
• first use cudaDeviceSynchronize() as a synchronization point;
• then call cudaGetLastError, as in the sketch below.
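A sketch of that kernel error-checking pattern (kernel and arguments illustrative):

    kernel <<< nBlock, nThread >>> (args);
    CUDA_CALL( cudaDeviceSynchronize() );   // synchronization point: kernel has finished
    cudaError_t err = cudaGetLastError();   // now reports kernel launch/run errors
    if (err != cudaSuccess){
        printf("Kernel error: %s\n", cudaGetErrorString(err));
        return -1;
    }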

  17. Timing a CUDA application using events

    //------- Event Management -------//
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    //------- Some operation on CPU and GPU -------//
    cudaEventRecord(stop);
    float time0 = 0.0;
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time0, start, stop);
    printf("GPU compute time_load data (msec): %.5f\n", time0);
    //------- Destroy events -------//
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

  18. Unified Memory: since CUDA 6. Traditional model: separate system (CPU) memory and GPU memory; allocate with cudaMalloc() and move data explicitly with cudaMemcpy(). Unified Memory model: a single managed memory space accessible from both CPU and GPU, allocated with cudaMallocManaged(). More info in the CUDA Toolkit Documentation.

  19. Unified Memory: since CUDA 6. cudaMallocManaged allocates memory that will be automatically managed by the Unified Memory system:

    __host__ cudaError_t cudaMallocManaged
        ( void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal )

• devPtr – pointer to allocated device memory
• size – requested allocation size in bytes
• flags – must be either cudaMemAttachGlobal or cudaMemAttachHost (defaults to cudaMemAttachGlobal)

More info in the CUDA Toolkit Documentation; a usage sketch follows.
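A minimal sketch of the managed-memory workflow (kernel and sizes illustrative):

    #include <cstdio>

    __global__ void scale(int n, float *x){
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(){
        const int N = 1 << 20;
        float *x;
        cudaMallocManaged(&x, N * sizeof(float));   // one pointer, visible to CPU and GPU
        for (int i = 0; i < N; i++) x[i] = 1.0f;    // host writes directly, no cudaMemcpy
        scale <<< (N + 255) / 256, 256 >>> (N, x);
        cudaDeviceSynchronize();                    // wait before the host reads
        printf("x[0] = %f\n", x[0]);
        cudaFree(x);
        return 0;
    }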

  20. Use numerical libraries for GPUs NVIDIA cuRAND NVIDIA cuSPARSE NVIDIA NPP NVIDIA cuBLAS Matrix Algebra on GPU and Multicore GPU Accelerated Linear Algebra Vector Signal Image Processing NVIDIA cuFFT ArrayFire Matrix Computations Sparse Linear Algebra C++ STL Features for CUDA IMSL Library https://developer.nvidia.com/gpu-accelerated-libraries

  21. Task 3: Element-wise multiplication of a vector by a scalar with the cuBLAS library, for the data types float, double, cuComplex, and cuDoubleComplex. Compilation:

    $ nvcc testBlas_CUDA.cu -lcublas -o exec_cuda

  22. cuBLAS library: cublas<t>scal(). The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime.

    cublasStatus_t cublasSscal (cublasHandle_t handle, int n,
                                const float *alpha, float *x, int incx)
    cublasStatus_t cublasDscal (cublasHandle_t handle, int n,
                                const double *alpha, double *x, int incx)
    cublasStatus_t cublasCscal (cublasHandle_t handle, int n,
                                const cuComplex *alpha, cuComplex *x, int incx)
    cublasStatus_t cublasCsscal(cublasHandle_t handle, int n,
                                const float *alpha, cuComplex *x, int incx)
    cublasStatus_t cublasZscal (cublasHandle_t handle, int n,
                                const cuDoubleComplex *alpha, cuDoubleComplex *x, int incx)
    cublasStatus_t cublasZdscal(cublasHandle_t handle, int n,
                                const double *alpha, cuDoubleComplex *x, int incx)

http://docs.nvidia.com/cuda/cublas/
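A usage sketch for cublasSscal (length and scalar illustrative); cuBLAS routines operate on device memory, and by default the scalar is passed by host pointer:

    #include <cublas_v2.h>

    int main(){
        const int n = 1024;
        float alpha = 2.0f;
        float *dX;
        cudaMalloc((void**)&dX, n * sizeof(float));
        // ... fill dX, e.g. cudaMemcpy from a host array ...

        cublasHandle_t handle;
        cublasCreate(&handle);                  // create a cuBLAS context
        cublasSscal(handle, n, &alpha, dX, 1);  // x = alpha * x, stride incx = 1
        cublasDestroy(handle);
        cudaFree(dX);
        return 0;
    }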

  23. The Discrete Fourier Transform (DFT). The DFT maps a complex-valued vector $x_0, \dots, x_{N-1}$ (time domain) into its frequency-domain representation $X_0, \dots, X_{N-1}$. Forward transform: $X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i kn/N}$, $k = 0, \dots, N-1$. Inverse transform: $x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{2\pi i kn/N}$. N.B.: cuFFT transforms are not normalized; a forward transform followed by an inverse one scales the data by $N$.

  24. cuFFT library. Key Features:
• 1D, 2D, 3D transforms of complex and real data types;
• 1D transform sizes up to 128 million elements;
• flexible input and output data layouts, allowing arbitrary strides between individual elements and array dimensions, similar to the FFTW “Advanced Interface”;
• FFT algorithms based on Cooley-Tukey and Bluestein;
• familiar API similar to FFTW;
• streamed asynchronous execution;
• single- and double-precision transforms;
• batch execution for doing multiple transforms;
• in-place and out-of-place transforms;
• thread-safe and callable from multiple host threads.
http://docs.nvidia.com/cuda/cufft/

  25. cuFFT library
• cuFFT uses algorithms based on the well-known Cooley-Tukey and Bluestein algorithms;
• the algorithms are highly optimized for input sizes that can be written in the form $2^a \times 3^b \times 5^c \times 7^d$;
• the cuFFT library allocates space for $8 \times \mathrm{batch} \times n[0] \times \dots \times n[\mathrm{rank}-1]$ cufftComplex or cufftDoubleComplex elements, where batch denotes the number of transforms that will be executed in parallel and rank is the number of dimensions of the input data.

  26. Task 4: 1D transform of complex data for a vector of length NX. Task 5: NY × 1D transforms of complex data for vectors of length NX, stored as A_input[ny*NX + nx], nx = 0, 1, …, NX-1, ny = 0, 1, …, NY-1.

• Typical scheme (implemented in the sketch after the lists below):
• allocate memory on the device;
• create and configure the transform plan (cufftPlan*);
• perform the transform;
• free the memory and the plan.

cuFFT transform types:
‣ R2C: real to complex
‣ C2R: complex to real
‣ C2C: complex to complex
‣ D2Z: double to double complex
‣ Z2D: double complex to double
‣ Z2Z: double complex to double complex

Plan creation:
‣ cufftPlan1d()
‣ cufftPlan2d()
‣ cufftPlan3d()
‣ cufftPlanMany()
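A minimal sketch of that scheme for Task 4 (a single in-place C2C transform; NX illustrative):

    #include <cufft.h>

    int main(){
        const int NX = 1024;
        cufftComplex *dData;
        cudaMalloc((void**)&dData, NX * sizeof(cufftComplex));  // 1. device memory
        // ... copy the input vector to dData ...

        cufftHandle plan;
        cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                   // 2. plan, batch = 1
        cufftExecC2C(plan, dData, dData, CUFFT_FORWARD);        // 3. forward transform
        cudaDeviceSynchronize();

        cufftDestroy(plan);                                     // 4. free plan and memory
        cudaFree(dData);
        return 0;
    }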

  27. cuFFT library: cufftPlanMany(). Compilation:

    $ nvcc testBlas_CUDA.cu -lcublas -lcufft -o exec_cuda
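A sketch of the batched case (Task 5) with cufftPlanMany, assuming the NY rows of length NX are contiguous in memory, so the embedded-layout arguments can be left NULL:

    #include <cufft.h>

    int main(){
        const int NX = 1024, NY = 64;          // NY transforms of length NX
        cufftComplex *dData;
        cudaMalloc((void**)&dData, NY * NX * sizeof(cufftComplex));
        // ... copy A_input (NY rows of NX elements) to dData ...

        cufftHandle plan;
        int n[1] = { NX };
        cufftPlanMany(&plan, 1, n,
                      NULL, 1, NX,             // input: stride 1, distance NX between rows
                      NULL, 1, NX,             // output: same contiguous layout
                      CUFFT_C2C, NY);          // batch of NY transforms
        cufftExecC2C(plan, dData, dData, CUFFT_FORWARD);
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(dData);
        return 0;
    }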

  28. Task 2: Shared memory. cuBLAS Level-1: Basic Linear Algebra Subprograms (BLAS1). Reference routines: cublas<t>asum() – sum of the absolute values of a vector's elements; cublas<t>nrm2() – Euclidean norm of a vector. The task reimplements the same reduction by hand using shared memory.

  29. Task 2: Shared memory (parallel computation of the sum of the moduli of complex numbers). BLOCKS – the number of blocks in the grid; TH_BLOCKS – the number of threads per block; BLOCKS × TH_BLOCKS threads in total cover the N elements. Within each block tid = threadIdx.x runs over 0, 1, 2, …; the global thread index is

    int i = threadIdx.x + blockDim.x * blockIdx.x;

so, for example, threadIdx.x = 2, blockDim.x = 3, blockIdx.x = 1 gives i = 2 + 3*1 = 5. Shared memory available per block: 49152 bytes.

  30. Task 2: Shared memory. With BLOCKS = 64 and TH_BLOCKS = 128, each block gets its own on-chip array

    __shared__ double cache[TH_BLOCKS];

indexed by tid = threadIdx.x; every one of the 64 blocks holds a separate copy of cache[TH_BLOCKS] for its partial sums over the N elements.

  31. kernel_sum_shared()

    __global__ void kernel_sum_shared(int nx, cuDoubleComplex* dev_vrC, double* c){
        volatile __shared__ double cache[TH_BLOCKS];
        int tid = threadIdx.x;
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        double a = 0.0;
        while (i < nx){                          // grid-stride loop:
            a += cuCabs(dev_vrC[i]);             // accumulate abs(dev_vrC[i])
            i += blockDim.x * gridDim.x;
        }
        cache[tid] = a;                          // copy partial sum to shared memory
        __syncthreads();
        int s = blockDim.x / 2;
        while (s != 0){                          // tree reduction in shared memory
            if (tid < s) cache[tid] += cache[tid + s];
            __syncthreads();
            s /= 2;
        }
        if (tid == 0) c[blockIdx.x] = cache[0];  // one partial sum per block
        __syncthreads();
    }
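The kernel leaves one partial sum per block in c[]; a sketch of the host side that finishes the reduction (variable names follow the kernel, BLOCKS and TH_BLOCKS as on slide 30):

    // BLOCKS partial sums are written to dC[0..BLOCKS-1] on the device.
    kernel_sum_shared <<< BLOCKS, TH_BLOCKS >>> (N, dev_vrC, dC);

    double partial[BLOCKS], total = 0.0;
    cudaMemcpy(partial, dC, BLOCKS * sizeof(double), cudaMemcpyDeviceToHost);
    for (int b = 0; b < BLOCKS; b++)
        total += partial[b];    // final sum of the per-block results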

  32. CUDA materials NVIDIA website: http://www.nvidia.com/ NVIDIA CUDA Zone: https://developer.nvidia.com/cuda-zone Jason Sanders, Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. List of Books: https://developer.nvidia.com/cuda-books-archive

  33. Thanks for your attention! Parallel programming technologies and high-performance computing within the HybriLIT platform. November 2018, Czech Republic
