
Introduction to CUDA



Presentation Transcript


  1. Introduction to CUDA 2009/04/07 Yun-Yang Ma

  2. Outline • Overview • What is CUDA • Architecture • Programming Model • Memory Model • H.264 Motion Estimation on CUDA • Method • Experimental Results • Conclusion

  3. Overview • In the past few years, the processing capability of the Graphics Processing Unit (GPU) has grown rapidly

  4. Overview • General-purpose Computation on GPUs (GPGPU) • Not only for accelerating the graphics display but also for speeding up non-graphics applications • Linear algebra computation • Scientific simulation

  5. What is CUDA ? • Compute Unified Device Architecture • http://www.nvidia.com.tw/object/cuda_home_tw.html (NVIDIA CUDA Zone) • Single-program multiple-data (SPMD) computing device • Example applications: fast object detection, leukocyte tracking, real-time 3D modeling

  6. What is CUDA ? • Architecture

  7. What is CUDA ? • Programming Model • Two parts of program execution • Host: CPU runs the main program • Device: GPU does the parallelism [Diagram: the host executes the main program, launches a kernel that runs in parallel on the device, and resumes after the end of the kernel, through the end of main]

  8. What is CUDA ? • Thread Batching • CUDA creates many threads on the device; each thread executes the kernel program on different data • Threads in the same thread block can cooperate with each other through the shared memory • The number of threads in a thread block is limited • Thread blocks of the same dimensions can be organized as a grid for thread batching
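As a sketch of thread batching, the host launches one kernel over a 2-D grid of 2-D thread blocks; the kernel name and dimensions below are illustrative only (and device-side printf requires a more recent toolkit than the CUDA 1.1 era of these slides):

```cuda
#include <cstdio>

// Illustrative kernel: every thread in the grid reports its coordinates.
__global__ void kernel1(void)
{
    printf("Block(%d,%d) Thread(%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main(void)
{
    dim3 threadsPerBlock(16, 16);   // 256 threads per thread block
    dim3 grid1(3, 2);               // a 3x2 grid of thread blocks
    kernel1<<<grid1, threadsPerBlock>>>();  // one launch batches all threads
    cudaDeviceSynchronize();        // wait for the kernel to finish
    return 0;
}
```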

  9. What is CUDA ? • Thread Batching [Diagram: the host launches Kernel 1 on Grid 1, a 3x2 array of thread blocks Block(0,0)–Block(2,1), and Kernel 2 on Grid 2, a 3x4 array of thread blocks Block(0,0)–Block(2,3)]

  10. What is CUDA ? • Memory Model • DRAM: global memory and per-thread local memory • On-chip memory: registers and per-block shared memory [Diagram: each thread has its own registers and local memory; the threads of a block share one shared memory; all blocks of the grid access the global memory]

  11. What is CUDA ? • Example : Vector addition Kernel program
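The kernel code from this slide did not survive the transcript; a minimal vector-addition kernel of the same kind, with assumed names, might look like:

```cuda
#include <cuda_runtime.h>

// Sketch of a vector-addition kernel: each thread adds one element pair.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail threads
        c[i] = a[i] + b[i];
}

// Host side: copy inputs to the device, launch the kernel, copy back.
void vectorAdd(const float *hA, const float *hB, float *hC, int n)
{
    float *dA, *dB, *dC;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // round up to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```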

  12. H.264 ME on CUDA • In [1], an efficient block-level parallel algorithm is used for the variable block size motion estimation in H.264/AVC [Diagram: MB modes P_16x16, P_16x8, P_8x16, P_8x8; the 8x8 partitions are further split into 8x8, 8x4, 4x8, 4x4] [1] “H.264/AVC Motion Estimation Implementation on Compute Unified Device Architecture (CUDA)”, IEEE International Conference on Multimedia & Expo (2008)

  13. H.264 ME on CUDA • Two steps for deciding the final coding mode • Step 1: Find the best motion vectors of each MB mode • Step 2: Evaluate the R-D performance and choose the best mode • The H.264 ME algorithm is extremely complex and time-consuming • Fast motion estimation methods (TSS, DS, etc.) contain too many branch instructions to map well onto the GPU • In [1], they focus on full search ME

  14. Method • First stage: Calculate integer-pixel MVs • Compute all SAD values between each 4x4 block and all reference candidates in parallel • Merge the 4x4 SADs to form all block sizes (4x8, 8x4, 8x8, 8x16, 16x8, 16x16) • Find the minimal SAD and determine the integer-pixel MV

  15. Method • Second stage: Calculate fractional-pixel MVs • The reference frame is interpolated using the six-tap filter and the bilinear filter defined in H.264/AVC • Calculate the SADs at the 24 fractional-pixel positions (half-pixel and quarter-pixel) that are adjacent to the best integer MV [Diagram: the 24 numbered half-pixel and quarter-pixel positions surrounding the integer pixel]
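The six-tap filter named above is defined in the H.264/AVC standard with tap weights (1, -5, 20, 20, -5, 1); a device-side sketch of the two interpolation steps, with assumed function names:

```cuda
// Half-pixel sample from six neighbouring integer pixels e..j, per the
// H.264/AVC six-tap filter: (e - 5f + 20g + 20h - 5i + j + 16) >> 5.
__device__ int sixTapHalfPel(int e, int f, int g, int h, int i, int j)
{
    int v = e - 5 * f + 20 * g + 20 * h - 5 * i + j;
    v = (v + 16) >> 5;              // rounding, then normalize by 32
    return min(255, max(0, v));     // clip to the 8-bit pixel range
}

// Quarter-pixel sample: bilinear average of two integer/half-pixel
// neighbours, with rounding.
__device__ int quarterPel(int p, int q)
{
    return (p + q + 1) >> 1;
}
```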

  16. Method • 4x4 Block-Size SAD Calculation • Sequence resolution: 4CIF (704x576) • Search range: 32x32 (leads to 1024 candidates) • Each candidate SAD is computed by a thread • 256 threads are executed in a thread block, so every 256 candidates of one 4x4 block's SAD calculation are assigned to a thread block • Number of thread blocks = (4x4 blocks per frame) x (ME search candidates) / 256 = 704/4 x 576/4 x 32² x 1/256 = 101376
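A hypothetical sketch of the per-thread work described above: one thread computes the SAD of one 4x4 block against one search candidate; the memory layout and names are assumptions, not the paper's code:

```cuda
// Each thread computes the 4x4 SAD at one candidate displacement.
// 1024 candidates / 256 threads per block => 4 thread blocks per 4x4 block.
__global__ void sad4x4(const unsigned char *cur,  // current frame pixels
                       const unsigned char *ref,  // reference frame pixels
                       int stride,                // frame row stride
                       int blockX, int blockY,    // top-left of the 4x4 block
                       int searchW,               // search window width (32)
                       int *sadOut)               // one SAD per candidate
{
    int cand = blockIdx.x * blockDim.x + threadIdx.x;  // candidate index
    int dx = cand % searchW - searchW / 2;             // x displacement
    int dy = cand / searchW - searchW / 2;             // y displacement
    int sad = 0;
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x) {
            int c = cur[(blockY + y) * stride + (blockX + x)];
            int r = ref[(blockY + dy + y) * stride + (blockX + dx + x)];
            sad += abs(c - r);                         // accumulate |diff|
        }
    sadOut[cand] = sad;
}
```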

  17. Method • Block diagram of 4x4 block SAD calculation [Diagram: thread blocks B1 … B101376, each running threads T1 … T256, write 256 SADs apiece to DRAM; four thread blocks together cover the 1024 candidates of one 4x4 block]

  18. Method • Variable Block-Size SAD Generation • Merge the 4x4 SADs obtained in the previous step • Each thread fetches the sixteen 4x4 SADs of one MB at a candidate position and combines them to form the other block sizes • Number of thread blocks = 704/16 x 576/16 x 32² x 1/256 = 6336
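The merge can be sketched as follows: each thread sums the sixteen 4x4 SADs of one macroblock into the larger partitions (the 8x4 and 4x8 sums are omitted for brevity); this is an illustrative reconstruction, not the paper's code:

```cuda
// Combine the sixteen 4x4 SADs of one MB (s[row][col]) into the larger
// block sizes; SAD of a large block = sum of its constituent 4x4 SADs.
__device__ void mergeSads(const int s[4][4],
                          int sad8x8[2][2], int sad16x8[2],
                          int sad8x16[2], int *sad16x16)
{
    for (int by = 0; by < 2; ++by)
        for (int bx = 0; bx < 2; ++bx)          // four 8x8 partitions
            sad8x8[by][bx] = s[2*by][2*bx]   + s[2*by][2*bx+1]
                           + s[2*by+1][2*bx] + s[2*by+1][2*bx+1];
    sad16x8[0] = sad8x8[0][0] + sad8x8[0][1];   // top and bottom halves
    sad16x8[1] = sad8x8[1][0] + sad8x8[1][1];
    sad8x16[0] = sad8x8[0][0] + sad8x8[1][0];   // left and right halves
    sad8x16[1] = sad8x8[0][1] + sad8x8[1][1];
    *sad16x16  = sad16x8[0] + sad16x8[1];       // whole macroblock
}
```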

  19. Method • Block diagram of variable block size SAD calculation [Diagram: thread blocks B1 … B6336, each with threads T1 … T256; every thread reads sixteen 4x4 SADs from DRAM and writes back eight 4x8 SADs, eight 8x4 SADs, four 8x8 SADs, two 8x16 SADs, two 16x8 SADs, and one 16x16 SAD]

  20. Method • Integer Pixel SAD Comparison • All 1024 SADs of one block are compared, and the candidate with the least SAD gives the integer-pixel MV • Each block size (16x16 down to 4x4) has its own kernel for SAD comparison • Seven kernels are implemented and executed sequentially

  21. Method • Block diagram of integer pixel SAD comparison [Diagram: for each block B1 …, threads T1 … T256 load the 1024 SADs from DRAM, four per thread, into shared memory; after n iterations of reduction 256/2^(n-1) SADs remain, with threads T1 … T128/2^(n-1) active, until the integer-pel MV is found]

  22. Method • During the thread reduction process, a problem may occur • Shared memory bank conflict • A sequential addressing with non-divergent branching strategy is adopted

  23. Method • SAD comparison using sequential addressing with non-divergent branching [Diagram: shared memory holds SAD values and indices in positions 1–8; threads 1–4 do the comparison, thread t comparing position t against position t+4, so the active threads stay contiguous and shared memory is addressed sequentially]
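A minimal sketch of this reduction, assuming 256 SADs per block staged in shared memory: the active threads always form a contiguous prefix (no warp divergence) and read consecutive shared-memory addresses (no bank conflicts); names are illustrative:

```cuda
// Find the index of the minimal SAD among 256 candidates per thread block
// using sequential addressing with non-divergent branching.
__global__ void minSad256(const int *sads, int *bestOut)
{
    __shared__ int sVal[256];   // SAD values
    __shared__ int sIdx[256];   // candidate indices travelling with them
    int t = threadIdx.x;
    sVal[t] = sads[blockIdx.x * 256 + t];
    sIdx[t] = t;
    __syncthreads();
    // Halve the active range each step: threads 0..s-1 compare element t
    // with element t+s, keeping the smaller value and its index.
    for (int s = 128; s > 0; s >>= 1) {
        if (t < s && sVal[t + s] < sVal[t]) {
            sVal[t] = sVal[t + s];
            sIdx[t] = sIdx[t + s];
        }
        __syncthreads();
    }
    if (t == 0)
        bestOut[blockIdx.x] = sIdx[0];  // index of the minimal SAD
}
```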

  24. Method • Fractional Pixel MV Refinement • Find the best fractional-pixel motion vector around the integer motion vector of every block [Diagram: the kernel loads the encoding frame and, via the integer-pel MV, the reference frame from DRAM into shared memory; threads T1 … T24 each evaluate one of the 24 half- and quarter-pixel positions, and after n iterations of reduction (24/2^(n-1) SADs, threads T1 … T12/2^(n-1)) the fractional-pel MV is found]

  25. Experimental Results • Environment • AMD Athlon 64 X2 Dual Core 2.1 GHz with 2 GB memory • NVIDIA GeForce 8800 GTX with 768 MB DRAM • CUDA Toolkit and SDK 1.1 • Parameters • ME algorithm: Full Search • Search Range: 32x32

  26. Experimental Results • The average execution time in ms for processing one frame using the proposed algorithm

  27. Experimental Results • The ME performance comparison between CPU only and using GPU

  28. Conclusions • In this paper, they present an efficient block-level parallelized algorithm for variable block size motion estimation using a CUDA GPU • A GPU acting as a coprocessor can effectively accelerate massive data computation
