
Basic CUDA Programming




1. Basic CUDA Programming
Shin-Kai Chen
skchen@twins.ee.nctu.edu.tw
VLSI Signal Processing Laboratory
Department of Electronics Engineering
National Chiao Tung University

2. What will you learn in this lab?
• Concept of multicore accelerators
• Multithreaded/multicore programming
• Memory optimization

3. Slides
• Mostly from Prof. Wen-Mei Hwu of UIUC
  • http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

  4. CUDA – Hardware? Software?

5. Host-Device Architecture
• CPU (host)
• GPU w/ local DRAM (device)

6. G80 CUDA Mode – A Device Example
[Block diagram: the host feeds an input assembler and a thread execution manager, which dispatch work to an array of texture units, each with a parallel data cache and load/store paths into global memory.]

7. Functional Units in G80
• Streaming multiprocessor (SM)
  • 1 instruction decoder (1 instruction / 4 cycles)
  • 8 streaming processors (SPs)
  • Shared memory
[Diagram: two SMs (SM 0, SM 1), each with a multithreaded instruction unit, SPs, and shared memory, running threads t0 t1 t2 … tm drawn from scheduled blocks.]

  8. Setup CUDA for Windows

9. CUDA Environment Setup
• Get a GPU that supports CUDA
  • http://www.nvidia.com/object/cuda_learn_products.html
• Download CUDA
  • http://www.nvidia.com/object/cuda_get.html
  • CUDA driver
  • CUDA toolkit
  • CUDA SDK (optional)
• Install CUDA
• Test CUDA
  • Device Query

10. Setup CUDA for Visual Studio
• From scratch
  • http://forums.nvidia.com/index.php?showtopic=30273
• CUDA VS Wizard
  • http://sourceforge.net/projects/cudavswizard/
• Modified from an existing project

  11. Lab1: First CUDA Program

  12. CUDA Computing Model

13. Data Manipulation between Host and Device
• cudaError_t cudaMalloc( void** devPtr, size_t count )
  • Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr
• cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
  • Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
  • kind indicates the direction of the transfer:
    • cudaMemcpyHostToHost
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
• cudaError_t cudaFree( void* devPtr )
  • Frees the memory space pointed to by devPtr
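The three APIs above combine into the usual allocate / copy-in / copy-out / free pattern on the host side. A minimal sketch (the array size matches the lab; the variable names d_A and h_A are illustrative, not from the lab code):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define SIZE 8192

int main(void) {
    static int h_A[SIZE];                     /* host input buffer */
    for (int i = 0; i < SIZE; i++) h_A[i] = i;

    int *d_A = NULL;
    /* Allocate SIZE ints of device memory; d_A receives the device pointer */
    cudaError_t err = cudaMalloc((void**)&d_A, SIZE * sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Host -> device copy */
    cudaMemcpy(d_A, h_A, SIZE * sizeof(int), cudaMemcpyHostToDevice);

    /* ... launch kernels here ... */

    /* Device -> host copy, then release the device memory */
    cudaMemcpy(h_A, d_A, SIZE * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return 0;
}
```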

14. Example
• Functionality:
  • Given an integer array A holding 8192 elements
  • For each element in A, calculate A[i] * 256 and leave the result in B[i]

  15. Now, go and finish your first CUDA program !!!

16. Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
• Open the project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
  • main.cu
    • Random input generation, output validation, result reporting
  • device.cu
    • Launches the GPU kernel; GPU kernel code
  • parameter.h
• Fill in the appropriate APIs
  • GPU_kernel() in device.cu

  17. Lab2: Make the Parallel Code Faster

18. Parallel Processing in CUDA
• Parallel code can be partitioned into blocks and threads
  • cuda_kernel<<<nBlk, nTid>>>(…)
• Multiple tasks are initialized, each with a different block id and thread id
• The tasks are dynamically scheduled
  • Tasks within the same block are scheduled on the same streaming multiprocessor
• Each task takes care of a single data partition according to its block id and thread id
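A minimal sketch of this pattern (the kernel body is illustrative, not the lab's GPU_kernel): every one of the nBlk × nTid tasks derives a unique serial number from its block id and thread id, then works only on its own element.

```cuda
__global__ void cuda_kernel(int *data) {
    /* Unique task serial number across the whole grid */
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    /* Each task touches exactly one element of its own partition */
    data[task_sn] = task_sn;
}

/* Host side: 2 blocks of 4 threads -> 8 tasks in total
 *   cuda_kernel<<<2, 4>>>(d_data);
 */
```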

19. Locate Data Partition by Built-in Variables
• Built-in variables
  • gridDim: x, y
  • blockIdx: x, y
  • blockDim: x, y, z
  • threadIdx: x, y, z

20. Data Partition for Previous Example
When processing 64 integer data: cuda_kernel<<<2, 2>>>(…)
  int total_task = gridDim.x * blockDim.x;
  int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
  int length = SIZE / total_task;
  int head = task_sn * length;

  21. Processing Single Data Partition

  22. Parallelize Your Program !!!

23. Partition the kernel into threads
• Increase nTid from 1 to 512
  • Keep nBlk = 1
• Group threads into blocks
  • Adjust nBlk and see if it helps
• Maintain the total number of threads below 512, i.e. nBlk * nTid < 512

  24. Lab3: Resolve Memory Contention

25. Parallel Memory Architecture
• Memory is divided into banks to achieve high bandwidth
• Each bank can service one address per cycle
• Successive 32-bit words are assigned to successive banks

26. Lab2 Review
When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)

27. How about Interleaved Accessing?
When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)

28. Implementation of Interleaved Accessing
cuda_kernel<<<1, 4>>>(…)
• head = task_sn
• stripe = total_task

  29. Improve Your Program !!!

30. Modify the original kernel code in an interleaving manner
• cuda_kernel() in device.cu
• Adjust nBlk and nTid as in Lab2 and examine the effect
  • Maintain the total number of threads below 512, i.e. nBlk * nTid < 512

31. Thank You
• http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
• Final project issue
• Group issue
