
Parallel GPU Programming with NVIDIA Cuda




  1. Çankaya University Computer Engineering Department Parallel GPU Programming with NVIDIA Cuda Ahmet Artu YILDIRIM January 2010

  2. Overview • Introduction and Comparison between CPU & GPU • The Execution Model • The Memory Model • CUDA API Basics and Sample Kernel Function • Case Study • Other GPU Programming Models

  3. Introduction • The Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational power and very high memory bandwidth • CUDA (Compute Unified Device Architecture) is the parallel computing engine in NVIDIA GPUs, accessible to software developers through standard programming languages such as C and Fortran

  4. Comparison between CPU and GPU [Charts: floating-point operations per second; memory bandwidth]

  5. Comparison between CPU and GPU • The GPU devotes more transistors to data processing • The GPU is especially well suited to computations that are data-parallel in nature

  6. Execution Model • Thread: the smallest execution unit • Block: a collection of threads • Grid: the collection of blocks launched by a kernel, the highest level of the hierarchy (a sketch of how a thread locates itself in this hierarchy follows below)
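A minimal sketch (not on the original slide) of how a thread uses the built-in variables of this hierarchy to compute a unique global index; the kernel name and launch configuration are illustrative:

// Each thread combines its block and thread coordinates into a global index
__global__ void whereAmI(int *out, int n)
{
    // blockIdx  : position of this block within the grid
    // blockDim  : number of threads per block
    // threadIdx : position of this thread within its block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = idx;
}

// Launch as a grid of 4 blocks of 256 threads each:
// whereAmI<<<4, 256>>>(d_out, 4 * 256);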

  7. Memory Model

  8. CUDA API Basics • An extension to the C programming language • CUDA source files are compiled by the nvcc compiler • Function and variable type qualifiers specify execution on the host or the device • Built-in variables give the grid and block dimensions inside a kernel function
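For example, a source file (here assumed to be named kernel.cu; the output name is arbitrary) is typically compiled with:

nvcc kernel.cu -o kernel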

  9. CUDA API Basics • Function type qualifiers • __device__ • Executed on the device, • Callable from the device only. • __global__ • Executed on the device, • Callable from the host only. • __host__ • Executed on the host, • Callable from the host only. (A combined sketch follows below.)
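A minimal sketch (not on the original slide) using all three function qualifiers together; the function names are illustrative:

__device__ float square(float x)            // runs on the device, callable from device code only
{
    return x * x;
}

__global__ void squareAll(float *a, int n)  // kernel: runs on the device, launched from the host
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = square(a[idx]);
}

__host__ void launch(float *d_a, int n)     // runs on the host (the default for plain C functions)
{
    squareAll<<<(n + 255) / 256, 256>>>(d_a, n);
}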

  10. CUDA API Basics • Variable type qualifiers • __device__ • Resides in global memory space, • Has the lifetime of an application, • Is accessible from all the threads within the grid and from the host through the runtime library. • __constant__ (optionally used together with __device__) • Resides in constant memory space, • Has the lifetime of an application, • Is accessible from all the threads within the grid and from the host through the runtime library. • __shared__ (optionally used together with __device__) • Resides in the shared memory space of a thread block, • Has the lifetime of the block, • Is accessible only from the threads within the block. (A sketch placing a variable in each space follows below.)
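A minimal sketch (not on the original slide) placing one variable in each memory space; the names and sizes are illustrative:

__device__ int d_counter;             // global memory: lifetime of the application
__constant__ float coeff;             // constant memory: set from the host, read-only on the device

__global__ void scale(float *a, int n)
{
    __shared__ float tile[256];       // shared memory: one copy per thread block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        tile[threadIdx.x] = a[idx];   // stage the element in fast on-chip memory
        __syncthreads();              // make the tile visible to the whole block
        a[idx] = tile[threadIdx.x] * coeff;
        atomicAdd(&d_counter, 1);     // every thread increments the global counter
    }
}

// On the host, coeff would be set with cudaMemcpyToSymbol before the launch.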

  11. Execution Flow

  12. CUDA API Basics (Sample 1)

#define N 16  // matrix dimension (example value; the slide leaves N undefined)

// Kernel definition: each thread computes one element of C
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // The matrices must live in device memory (allocation omitted on the slide)
    float (*A)[N], (*B)[N], (*C)[N];
    cudaMalloc((void **)&A, N * N * sizeof(float));
    cudaMalloc((void **)&B, N * N * sizeof(float));
    cudaMalloc((void **)&C, N * N * sizeof(float));

    // Kernel invocation: one block of N x N threads
    dim3 dimBlock(N, N);
    MatAdd<<<1, dimBlock>>>(A, B, C);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

  13. CUDA API Basics (Sample 2)

#include <stdlib.h>

// Each thread squares one element of the array
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

int main()  // the slide uses the MSVC-specific _tmain entry point
{
    int N = 1000;                     // array length (example value)
    size_t size = N * sizeof(float);

    // Initialize a_h on the host before the GPU calculation
    float *a_h = (float *)malloc(size);
    for (int i = 0; i < N; i++) a_h[i] = (float)i;

    float *a_d;
    cudaMalloc((void **)&a_d, size);                     // allocate array on device
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);  // copy input to the device

    int block_size = 100;
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);  // round up
    square_array<<<n_blocks, block_size>>>(a_d, N);

    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);  // copy result back

    free(a_h);
    cudaFree(a_d);
    return 0;
}

  14. Case Study: Comparison of a Data-Intensive and a Compute-Intensive GPU Program

  15. Device Configurations • Graphics Adapter: • GeForce 8600M GT • Bus: PCI Express x16 • Stream processors: 32 • Core clock: 475 MHz • Video memory: 256 MB • Memory interface: 128 bit • Memory clock: 702 MHz (1404 MHz data rate) • CPU: • Processor: Intel(R) Core(TM)2 Duo Mobile Processor T9300 • CPU speed: 2.50 GHz • Bus speed: 800 MHz • L2 cache size: 6 MB • Memory: 3.00 GB

  16. CPU & GPU Benchmark [Graph: scaled running-time comparison]

  17. Other General-Purpose GPU Models • Programming models: • OpenCL: open industry standard by the Khronos Group • Microsoft DirectCompute • GPU processing adapters: • AMD FireStream

  18. Questions? Thank You
