
Introduction to CUDA



Presentation Transcript


  1. Introduction to CUDA 2009/04/07 Yun-Yang Ma

  2. Outline • Overview • What is CUDA • Architecture • Programming Model • Memory Model • H.264 Motion Estimation on CUDA • Method • Experimental Results • Conclusion

  3. Overview • In the past few years, the processing capability of the Graphics Processing Unit (GPU) has grown rapidly

  4. Overview • General-purpose Computation on GPUs (GPGPU) • Not only for accelerating the graphics display but also for speeding up non-graphics applications • Linear algebra computation • Scientific simulation

  5. What is CUDA ? • Compute Unified Device Architecture • http://www.nvidia.com.tw/object/cuda_home_tw.html (NVIDIA CUDA Zone) • Single-program multiple-data (SPMD) computing device • Example applications: fast object detection, leukocyte tracking, real-time 3D modeling

  6. What is CUDA ? • Architecture

  7. What is CUDA ? • Programming Model • Two parts of program execution • Host: CPU runs the main program • Device: GPU does the parallelism [Diagram: the host executes the main program, launches a kernel that runs in parallel on the device, and resumes after the end of the kernel, through the end of main]

  8. What is CUDA ? • Thread Batching • CUDA creates many threads on the device; each thread executes the kernel program on different data • Threads in the same thread block can cooperate with each other through the shared memory • The number of threads in a thread block is limited • Thread blocks of the same dimensions can be organized as a grid for thread batching
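As a sketch of thread batching, the host launches one kernel over a 2-D grid of 2-D thread blocks; the kernel name and dimensions below are illustrative only (and device-side printf requires a more recent toolkit than the CUDA 1.1 era of these slides):

```cuda
#include <cstdio>

// Illustrative kernel: every thread in the grid reports its coordinates.
__global__ void kernel1(void)
{
    printf("Block(%d,%d) Thread(%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main(void)
{
    dim3 threadsPerBlock(16, 16);   // 256 threads per thread block
    dim3 grid1(3, 2);               // a 3x2 grid of thread blocks
    kernel1<<<grid1, threadsPerBlock>>>();  // one launch batches all threads
    cudaDeviceSynchronize();        // wait for the kernel to finish
    return 0;
}
```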

  9. What is CUDA ? • Thread Batching [Diagram: the host launches Kernel 1 on Grid 1, a 3x2 array of thread blocks Block(0,0)–Block(2,1), and Kernel 2 on Grid 2, a 3x4 array of thread blocks Block(0,0)–Block(2,3)]

  10. What is CUDA ? • Memory Model • DRAM: global memory and per-thread local memory • On-chip memory: registers and per-block shared memory [Diagram: each thread has its own registers and local memory; the threads of a block share one shared memory; all blocks of the grid access the global memory]

  11. What is CUDA ? • Example : Vector addition Kernel program
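The kernel code from this slide did not survive the transcript; a minimal vector-addition kernel of the same kind, with assumed names, might look like:

```cuda
#include <cuda_runtime.h>

// Sketch of a vector-addition kernel: each thread adds one element pair.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail threads
        c[i] = a[i] + b[i];
}

// Host side: copy inputs to the device, launch the kernel, copy back.
void vectorAdd(const float *hA, const float *hB, float *hC, int n)
{
    float *dA, *dB, *dC;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // round up to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```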

  12. H.264 ME on CUDA • In [1], an efficient block-level parallel algorithm is used for the variable block size motion estimation in H.264/AVC [Diagram: MB modes P_16x16, P_16x8, P_8x16, P_8x8; the 8x8 partitions are further split into 8x8, 8x4, 4x8, 4x4] [1] “H.264/AVC Motion Estimation Implementation on Compute Unified Device Architecture (CUDA)”, IEEE International Conference on Multimedia & Expo (2008)

  13. H.264 ME on CUDA • Two steps for deciding the final coding mode • Step 1: Find the best motion vectors of each MB mode • Step 2: Evaluate the R-D performance and choose the best mode • The H.264 ME algorithm is extremely complex and time-consuming • Fast motion estimation methods (TSS, DS, etc.) contain too many branch instructions to map well onto the GPU • In [1], they focus on full search ME

  14. Method • First stage: Calculate integer-pixel MVs • Compute all SAD values between each 4x4 block and all reference candidates in parallel • Merge the 4x4 SADs to form all block sizes (4x8, 8x4, 8x8, 8x16, 16x8, 16x16) • Find the minimal SAD and determine the integer-pixel MV

  15. Method • Second stage: Calculate fractional-pixel MVs • The reference frame is interpolated using the six-tap filter and the bilinear filter defined in H.264/AVC • Calculate the SADs at the 24 fractional-pixel positions (half-pixel and quarter-pixel) that are adjacent to the best integer MV [Diagram: the 24 numbered half-pixel and quarter-pixel positions surrounding the integer pixel]
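The six-tap filter named above is defined in the H.264/AVC standard with tap weights (1, -5, 20, 20, -5, 1); a device-side sketch of the two interpolation steps, with assumed function names:

```cuda
// Half-pixel sample from six neighbouring integer pixels e..j, per the
// H.264/AVC six-tap filter: (e - 5f + 20g + 20h - 5i + j + 16) >> 5.
__device__ int sixTapHalfPel(int e, int f, int g, int h, int i, int j)
{
    int v = e - 5 * f + 20 * g + 20 * h - 5 * i + j;
    v = (v + 16) >> 5;              // rounding, then normalize by 32
    return min(255, max(0, v));     // clip to the 8-bit pixel range
}

// Quarter-pixel sample: bilinear average of two integer/half-pixel
// neighbours, with rounding.
__device__ int quarterPel(int p, int q)
{
    return (p + q + 1) >> 1;
}
```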

  16. Method • 4x4 Block-Size SAD Calculation • Sequence resolution: 4CIF (704x576) • Search range: 32x32 (leads to 1024 candidates) • Each candidate SAD is computed by a thread • 256 threads are executed in a thread block, so every 256 candidates of one 4x4 block's SAD calculation are assigned to a thread block • Number of thread blocks = (4x4 blocks per frame) x (ME search candidates) / 256 = 704/4 x 576/4 x 32² x 1/256 = 101376
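A hypothetical sketch of the per-thread work described above: one thread computes the SAD of one 4x4 block against one search candidate; the memory layout and names are assumptions, not the paper's code:

```cuda
// Each thread computes the 4x4 SAD at one candidate displacement.
// 1024 candidates / 256 threads per block => 4 thread blocks per 4x4 block.
__global__ void sad4x4(const unsigned char *cur,  // current frame pixels
                       const unsigned char *ref,  // reference frame pixels
                       int stride,                // frame row stride
                       int blockX, int blockY,    // top-left of the 4x4 block
                       int searchW,               // search window width (32)
                       int *sadOut)               // one SAD per candidate
{
    int cand = blockIdx.x * blockDim.x + threadIdx.x;  // candidate index
    int dx = cand % searchW - searchW / 2;             // x displacement
    int dy = cand / searchW - searchW / 2;             // y displacement
    int sad = 0;
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x) {
            int c = cur[(blockY + y) * stride + (blockX + x)];
            int r = ref[(blockY + dy + y) * stride + (blockX + dx + x)];
            sad += abs(c - r);                         // accumulate |diff|
        }
    sadOut[cand] = sad;
}
```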

  17. Method • Block diagram of 4x4 block SAD calculation [Diagram: thread blocks B1 … B101376, each running threads T1 … T256, write 256 SADs apiece to DRAM; four thread blocks together cover the 1024 candidates of one 4x4 block]

  18. Method • Variable Block-Size SAD Generation • Merge the 4x4 SADs obtained in the previous step • Each thread fetches the sixteen 4x4 SADs of one MB at a candidate position and combines them to form the other block sizes • Number of thread blocks = 704/16 x 576/16 x 32² x 1/256 = 6336
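The merge can be sketched as follows: each thread sums the sixteen 4x4 SADs of one macroblock into the larger partitions (the 8x4 and 4x8 sums are omitted for brevity); this is an illustrative reconstruction, not the paper's code:

```cuda
// Combine the sixteen 4x4 SADs of one MB (s[row][col]) into the larger
// block sizes; SAD of a large block = sum of its constituent 4x4 SADs.
__device__ void mergeSads(const int s[4][4],
                          int sad8x8[2][2], int sad16x8[2],
                          int sad8x16[2], int *sad16x16)
{
    for (int by = 0; by < 2; ++by)
        for (int bx = 0; bx < 2; ++bx)          // four 8x8 partitions
            sad8x8[by][bx] = s[2*by][2*bx]   + s[2*by][2*bx+1]
                           + s[2*by+1][2*bx] + s[2*by+1][2*bx+1];
    sad16x8[0] = sad8x8[0][0] + sad8x8[0][1];   // top and bottom halves
    sad16x8[1] = sad8x8[1][0] + sad8x8[1][1];
    sad8x16[0] = sad8x8[0][0] + sad8x8[1][0];   // left and right halves
    sad8x16[1] = sad8x8[0][1] + sad8x8[1][1];
    *sad16x16  = sad16x8[0] + sad16x8[1];       // whole macroblock
}
```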

  19. Method • Block diagram of variable block size SAD calculation [Diagram: thread blocks B1 … B6336, each with threads T1 … T256; every thread reads sixteen 4x4 SADs from DRAM and writes back eight 4x8 SADs, eight 8x4 SADs, four 8x8 SADs, two 8x16 SADs, two 16x8 SADs, and one 16x16 SAD]

  20. Method • Integer Pixel SAD Comparison • All 1024 SADs of one block are compared, and the candidate with the least SAD gives the integer-pixel MV • Each block size (16x16 down to 4x4) has its own kernel for SAD comparison • Seven kernels are implemented and executed sequentially

  21. Method • Block diagram of integer pixel SAD comparison [Diagram: for each block B1 …, threads T1 … T256 load the 1024 SADs from DRAM, four per thread, into shared memory; after n iterations of reduction 256/2^(n-1) SADs remain, with threads T1 … T128/2^(n-1) active, until the integer-pel MV is found]

  22. Method • During the thread reduction process, a problem may occur • Shared memory bank conflict • A sequential addressing with non-divergent branching strategy is adopted

  23. Method • SAD comparison using sequential addressing with non-divergent branching [Diagram: shared memory holds SAD values and indices in positions 1–8; threads 1–4 do the comparison, thread t comparing position t against position t+4, so the active threads stay contiguous and shared memory is addressed sequentially]
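A minimal sketch of this reduction, assuming 256 SADs per block staged in shared memory: the active threads always form a contiguous prefix (no warp divergence) and read consecutive shared-memory addresses (no bank conflicts); names are illustrative:

```cuda
// Find the index of the minimal SAD among 256 candidates per thread block
// using sequential addressing with non-divergent branching.
__global__ void minSad256(const int *sads, int *bestOut)
{
    __shared__ int sVal[256];   // SAD values
    __shared__ int sIdx[256];   // candidate indices travelling with them
    int t = threadIdx.x;
    sVal[t] = sads[blockIdx.x * 256 + t];
    sIdx[t] = t;
    __syncthreads();
    // Halve the active range each step: threads 0..s-1 compare element t
    // with element t+s, keeping the smaller value and its index.
    for (int s = 128; s > 0; s >>= 1) {
        if (t < s && sVal[t + s] < sVal[t]) {
            sVal[t] = sVal[t + s];
            sIdx[t] = sIdx[t + s];
        }
        __syncthreads();
    }
    if (t == 0)
        bestOut[blockIdx.x] = sIdx[0];  // index of the minimal SAD
}
```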

  24. Method • Fractional Pixel MV Refinement • Find the best fractional-pixel motion vector around the integer motion vector of every block [Diagram: the kernel loads the encoding frame and, via the integer-pel MV, the reference frame from DRAM into shared memory; threads T1 … T24 each evaluate one of the 24 half- and quarter-pixel positions, and after n iterations of reduction (24/2^(n-1) SADs, threads T1 … T12/2^(n-1)) the fractional-pel MV is found]

  25. Experimental Results • Environment • AMD Athlon 64 X2 Dual Core 2.1 GHz with 2 GB memory • NVIDIA GeForce 8800 GTX with 768 MB DRAM • CUDA Toolkit and SDK 1.1 • Parameters • ME algorithm: Full Search • Search Range: 32x32

  26. Experimental Results • The average execution time in ms for processing one frame using the proposed algorithm

  27. Experimental Results • The ME performance comparison between CPU only and using GPU

  28. Conclusions • In this paper, they present an efficient block-level parallelized algorithm for variable block size motion estimation using a CUDA GPU • A GPU acting as a coprocessor can effectively accelerate massive data computation
