
Gastcollege GPU-team Multimedia Lab


Presentation Transcript


  1. Gastcollege GPU-team Multimedia Lab Charles Hollemeersch Bart Pieters 28/11/2008

  2. Who we are • Charles Hollemeersch • PhD student at Multimedia Lab • Charlesfrederik.Hollemeersch@ugent.be • Bart Pieters • PhD student at Multimedia Lab • Bart.Pieters@ugent.be • Visit our website • http://multimedialab.elis.ugent.be/ and http://multimedialab.elis.ugent.be/GPU

  3. Our Research Topics • Video acceleration • accelerate state-of-the-art video codecs using the GPU • Game technology • texture compression, parallel game actors … • Medical visualization • reconstruction of medical images … • Multi-GPU applications • …

  4. Introducing Multimedia Lab’s ‘Supercomputer’ • Quad GPU PC • four GeForce 280GTX video cards • 3732 gigaflops of GPU processing power

  5. Agenda • 8:30 – 9:45 • Bart – GPGPU • 10:00 – 11:15 • Charles – Game Technology

  6. Bart Pieters Multimedia Lab – UGent, 28/11/2008

  7. Overview • Introduction • GPU • GPGPU • Programming Concepts and Mappings • Direct3D and OpenGL • NVIDIA CUDA • Case Study: Decoding H.264/AVC • motion compensation • results • Conclusions • Q&A

  8. Graphics Processing Unit (GPU) • Programmable chip on graphics cards • Developed in a gaming context • 3-D scenery by means of rasterization • Programmable pipeline since DirectX 8.1 • vertex, geometry, and pixel shaders • high-level language support • Modern GPUs support high-precision • 32-bit floating point • Massive floating-point processing power • 933 gigaflops (NVIDIA GeForce 280GTX) • 141.7 GB/s peak memory bandwidth • fast PCI-Express bus, up to 2GB/sec transfer speed

  9. CPU and GPU Comparison • Today’s GPUs are yesterday’s supercomputers

  10. Why are GPUs so fast? [Diagram: CPU vs. GPU chip layout – the CPU spends most transistors on control logic and cache, the GPU on ALUs; both attach to DRAM] • Parallelism • massively-parallel/many-core architecture • needs a lot of work to be efficient • specialized hardware built for parallel tasks • more transistors mean more performance • Multi-billion dollar gaming industry drives innovation

  11. Computational Model: Stream Processing Model [Diagram: Input Stream → Kernel → Output Stream] • GPU is practically a stream processor • Applications consist of streams and kernels • Each kernel takes relatively long to process (PCIe, memory latency) • latency hidden by throughput
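
As an illustration of the stream processing model (a minimal sketch, not part of the original slides), the hypothetical CUDA kernel below applies the same small operation to every element of an input stream and writes an output stream; the names scaleKernel, d_in, and d_out are placeholders.

    __global__ void scaleKernel(const float* in, float* out, int n)
    {
        // one thread per stream element
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];   // the kernel applies the same operation to every element
    }

    // host side: launch enough threads to cover the whole stream, 256 per block
    // scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);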

  12. Inside a modern GPU [Diagram: block diagram of a modern GPU; the host feeds a data assembler, setup/raster/z-cull, and vertex, geometry, and pixel thread issue units; a thread processor schedules work over clusters of streaming processors (SP) with texture fetch units (TF) and L1 caches; shared L2 caches and frame buffer (FB) partitions complete the chip]

  13. Introducing GPGPU Attractive platform for general-purpose computation • The GPU on commodity video cards has evolved into a processor that is • powerful • flexible • inexpensive • precise

  14. GPGPU • General-Purpose GPU • use the GPU for general-purpose algorithms • No magical GPU compiler • still no x86 processor (Larrabee?) • explicit mappings required using advanced APIs • Programming close to the hardware • trend for higher abstraction, e.g. NVIDIA CUDA • Techniques are suited for future many-core architectures • future CPU/GPU projects, AMD Fusion, Larrabee, … • Dependency issues • hundreds of independent tasks required for efficient use

  15. Stream Processing Model Revisited [Diagram: Input Stream → Kernel → Output Stream] • GPU is practically a stream processor • Applications consist of streams and kernels • Read back is not possible

  16. GPGPU in Practice

  17. GPGPU APIs • Classic way • (mis)use graphics pipeline • render a special ‘scene’ • Direct3D, OpenGL • pixel, geometry, and vertex shaders • New APIs specifically for GPGPU computations • NVIDIA CUDA, ATI CTM, DirectX 11 Compute Shader, OpenCL

  18. Overview • Introduction • GPU • GPGPU • Programming Concepts and Mappings • Direct3D and OpenGL • NVIDIA CUDA • Case Study: Decoding H.264/AVC • motion compensation • results • Conclusions • Q&A

  19. 3-D Pipeline [Diagram: the application on the CPU sends graphics state and vertices (3-D) to the GPU; a programmable Transform stage outputs transformed, lit vertices (2-D); the Rasterizer turns them into fragments (pre-pixels); a programmable Shade stage outputs the final pixels (color, depth) to video memory (textures), which can be reused via render-to-texture] • Deep pipeline

  20. 3-D Pipeline - Transform • Vertex Shader • processing geometry data • input is a vertex • position • texture coordinates • vertex color,... • output is a vertex

    struct Vertex
    {
        float3 position : POSITION;
        float4 color    : COLOR0;
    };

    Vertex wave(Vertex vin)
    {
        Vertex vout;
        vout.position.x = vin.position.x;
        vout.position.y = vin.position.y;
        vout.position.z = (sin(vin.position.x) + sin(IN.wave.x)) * 2.5f;   // wave displacement
        vout.color = float4(1.0f, 1.0f, 1.0f, 1.0f);
        return vout;
    }

  21. 3-D Pipeline - Shading • Pixel (or fragment) Shader • input is interpolated vertex data • position • texture coordinates • normals, … • use texels from a texture • output is a fragment • pixel color • transparency • depth • result is stored in the frame buffer or in a texture • ‘Render to Texture’

    struct PSIn  { float2 tex   : TEXCOORD0; };
    struct PSOut { float4 color : COLOR0; };

    PSOut shade(PSIn pin)
    {
        PSOut pout;
        pout.color = tex2D(sampler, pin.tex);   // fetch the texel at the interpolated coordinates
        return pout;
    }

  22. GPU-CPU Analogies • Explicit mapping onto 3-D concepts is necessary • Rewrite an algorithm and find parallelism • Use the GPU in parallel with the CPU • upload data to the GPU • very fast PCI-Express bus, up to 2GB/sec transfer speed • process the data • meanwhile the CPU is available • download result to the CPU • recent GPU models have high download speed

  23. GPU-CPU Pipelined Design [Diagram: the CPU repeatedly prepares GPU data and then does other work; prepared data goes through an intermediary buffer in system memory; the GPU picks it up, processes it, and visualizes it, so CPU and GPU stages overlap]
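
The slides show this design only as a diagram; below is a hypothetical CUDA sketch (not from the presentation) of one way to overlap CPU preparation and GPU processing using two staging buffers and asynchronous copies. prepareFrame, processKernel, N, and NFRAMES are placeholder names.

    #include <cuda_runtime.h>

    #define N       (1 << 20)   // elements per chunk of work (placeholder)
    #define NFRAMES 100         // number of chunks to process (placeholder)

    __global__ void processKernel(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                          // placeholder GPU work
    }

    void prepareFrame(float* dst, int frame)
    {
        for (int i = 0; i < N; ++i) dst[i] = (float)(i + frame);  // placeholder CPU work
    }

    void pipeline()
    {
        float* h_buf[2];
        float* d_buf[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost((void**)&h_buf[i], N * sizeof(float));  // pinned staging buffer
            cudaMalloc((void**)&d_buf[i], N * sizeof(float));
            cudaStreamCreate(&stream[i]);
        }

        for (int f = 0; f < NFRAMES; ++f) {
            int b = f & 1;                        // ping-pong between the two buffers
            cudaStreamSynchronize(stream[b]);     // wait until this buffer is free again
            prepareFrame(h_buf[b], f);            // CPU prepares the next chunk
            cudaMemcpyAsync(d_buf[b], h_buf[b], N * sizeof(float),
                            cudaMemcpyHostToDevice, stream[b]);
            processKernel<<<N / 256, 256, 0, stream[b]>>>(d_buf[b], N);
            // while the copy and kernel run, the CPU is free to prepare the next chunk
        }
        cudaThreadSynchronize();                  // drain the pipeline
    }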

  24. GPU-CPU Analogies (2) [Diagram: a CPU array corresponds to a GPU texture, addressed by texture coordinates (u, v)]

  25. GPU-CPU Analogies (3) • CPU: … fish[] = createfish(); … for all pixels: bwfish[i][j] = bw(fish[i][j]); … • GPU: render into an array • array write = render to texture
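
As a hypothetical CUDA counterpart to the per-pixel CPU loop above (not in the original slides), one thread writes one element of the output array; bw stands for whatever per-pixel operation is applied.

    __device__ float bw(float v)            // placeholder per-pixel operation
    {
        return v;                           // e.g. a colour-to-grey conversion would go here
    }

    __global__ void bwKernel(const float* fish, float* bwfish, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            bwfish[y * width + x] = bw(fish[y * width + x]);   // one thread = one pixel
    }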

  26. GPU-CPU Analogies (4) • Loop body / kernel / algorithm step = Fragment Program • Example: Motion Compensation

    C++ (CPU):
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Vec2 mv = mvectors[y/4][x/4];
            int ox = Clip(x + mv.x);
            int oy = Clip(y + mv.y);
            output[y][x] = input[oy][ox];
        }
    }

    Microsoft HLSL (GPU):
    PSOut motioncompens(PSIn pin)
    {
        PSOut pout;
        float2 texcoords = pin.texcoords + pin.mv;   // the motion vector shifts the texture lookup
        pout.color = tex2D(sampler, texcoords);
        return pout;
    }

  27. GPU Loop for Each Pixel [Diagram: Vertex Shader → Rasterizer → Pixel Shader; rendering a quad makes the rasterizer generate a fragment for every covered pixel, and the same pixel shader runs for each of them]

    PSOut motioncompens(PSIn pin)
    {
        PSOut pout;
        float2 texcoords = pin.texcoords + pin.mv;
        pout.color = tex2D(sampler, texcoords);
        return pout;
    }

  28. Overview • Introduction • GPU • GPGPU • Programming Concepts and Mappings • Direct3D and OpenGL • NVIDIA CUDA • Case Study: Decoding H.264/AVC • motion compensation • results • Conclusions • Q&A

  29. GPGPU-specific APIs • NVIDIA CUDA • Compute Unified Device Architecture • C-code with annotations compiled to executable code • DirectX 11 Compute Shader • shader execution without rendering • technology preview available in the latest DirectX SDK • OpenCL • Open Computing Language • C-code with annotations • ATI CTM • Close To Metal • GPU assembler • deprecated

  30. NVIDIA CUDA • General-Purpose GPU Computing Platform • GPU is a super-threaded co-processor • acceleration through massive amounts of GPU threads • Supported on NVIDIA G80 and higher • 50–500 EUR price range • No more (mis)use of 3-D API • C-code with annotations for • memory location • host or device functions • thread synchronization • Compilation with the CUDA compiler • split host and device code • linkable object code
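
To make the annotations listed above concrete (an illustrative sketch, not code from the slides): __global__ and __device__ mark where functions run, __constant__ and __shared__ mark where data lives, and __syncthreads() synchronizes the threads of one block. blockSum and scale are placeholder names, and the kernel assumes 256 threads per block.

    __constant__ float scale;                    // constant memory, set from the host, cached

    __device__ float weight(float v)             // device function, callable from kernels only
    {
        return scale * v;
    }

    __global__ void blockSum(const float* in, float* out)   // kernel, launched from the host
    {
        __shared__ float buf[256];               // shared memory, one buffer per thread block
        int tid = threadIdx.x;
        buf[tid] = weight(in[blockIdx.x * blockDim.x + tid]);
        __syncthreads();                         // all threads of the block wait here

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction within the block
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];            // one partial sum per block
    }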

  31. NVIDIA CUDA - Example

    void runGPUTest()
    {
        CUT_DEVICE_INIT();
        ...
        float* d_data = NULL;
        // allocate gpu memory
        cudaMalloc((void**)&d_data, size);

        dim3 dimBlock(8, 8, 1);
        dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

        // run kernel on gpu
        transformKernel<<<dimGrid, dimBlock, 0>>>(d_data);

        // download
        cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
        ...
    }

  32. NVIDIA CUDA – Example (2)

    __global__ void transformKernel(float* g_odata)
    {
        // calculate output pixel coordinates
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

        // fetch the motion vector and copy the referenced pixel
        int2 mv = tex2D(mvtex, x, y);
        int mx = x + mv.x;
        int my = y + mv.y;
        g_odata[y * width + x] = tex2D(reftex, mx, my);
    }
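
The kernel on this slide uses the textures mvtex and reftex and a variable width that are not shown; declarations along the following lines would be needed on the device side (a sketch only, the exact types in the original code are unknown).

    texture<int2,  2, cudaReadModeElementType> mvtex;    // per-pixel motion vectors
    texture<float, 2, cudaReadModeElementType> reftex;   // reference picture samples
    __constant__ int width;                              // picture width in pixels, set by the host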

  33. Programming Model [Diagram: the host (CPU) launches Kernel 1 on the device (GPU) as Grid 1, a 3×2 arrangement of thread blocks; each block, e.g. Block (1, 1), contains a 3×4 array of threads; a second launch, Kernel 2, runs as Grid 2]
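
A short sketch (not from the slides) of how a grid and block layout like the one in the diagram is chosen in host code; myKernel, width, and height are placeholders, and the rounding ensures that pictures whose size is not a multiple of the block size are still fully covered.

    __global__ void myKernel(float* out) { /* placeholder kernel */ }

    void launch(float* d_output, int width, int height)
    {
        dim3 block(16, 16);                               // threads per block
        dim3 grid((width  + block.x - 1) / block.x,       // blocks per grid, rounded up so
                  (height + block.y - 1) / block.y);      // the whole picture is covered
        myKernel<<<grid, block>>>(d_output);              // one launch = one grid of blocks
    }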

  34. Hardware Model • Multiprocessor – MP (16) • Streaming Processor (8 per MP) • handles one thread • Memory • high bandwidth but high latency • uncached • special memory hardware for constants & textures (cached) • Registers • limited amount
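
A small illustration (not from the slides) of the cached constant memory mentioned above: read-only parameters that every thread needs can be kept in __constant__ memory and filled from the host with cudaMemcpyToSymbol. coeff and filterKernel are placeholder names.

    __constant__ float coeff[16];                     // read-only, cached on the device

    __global__ void filterKernel(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeff[i & 15] * in[i];           // every thread reads the cached constants
    }

    // host side, before launching the kernel:
    // float h_coeff[16] = { /* filter weights */ };
    // cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));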

  35. CUDA Threads [Diagram: a thread block as a 3×4 array of threads] • Each Streaming Processor handles one thread • 240 on GeForce 280GTX! • Smart hardware can schedule thousands of threads on 240 processors • Extremely lightweight • not like CPU threads • Threads per Multiprocessor handled in SIMD manner • each thread executes the same instruction at a given clock cycle • lock-step execution

  36. Lock-step Execution [Diagram: threads 1, 2, …, 31, 32 of a warp share one program counter; all step through the same code (x = x * 2; if (x > 10) z = 0; else z = y / 2; ++x;) on their own register values, and threads whose branch is not taken are locked while the others run] • Heavy branching needs to be avoided
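
A hypothetical CUDA fragment (not from the slides) showing the kind of per-thread branching the slide warns about; where possible, a select-style formulation keeps all threads of a warp on the same path.

    __global__ void threshold(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // divergent version: when threads of one warp take different branches,
        // the warp executes both paths one after the other
        if (in[i] > 10.0f)
            out[i] = 0.0f;
        else
            out[i] = in[i] * 0.5f;

        // equivalent select-style version, which the compiler can typically map
        // to a predicated instruction instead of a real branch:
        // out[i] = (in[i] > 10.0f) ? 0.0f : in[i] * 0.5f;
    }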

  37. Overview • Introduction • GPU • GPGPU • Programming Concepts and Mappings • Direct3D and OpenGL • NVIDIA CUDA • Case Study: Decoding H.264/AVC • motion compensation • results • Conclusions • Q&A

  38. Decoding H.264/AVC • Many decoding steps are suitable for parallelization • quantization • transformation • motion compensation • deblocking • color space conversion • Others introduce dependencies • entropy coding • intra prediction

  39. Video Coding Hardware • Specialized on-board 2-D video processing chips • one macroblock at a time • black boxes • limited support for non-Windows systems • limited support for various video codecs • e.g. H.264/AVC profiles • partly programmable • GPU • millions of transistors • accessible via 3-D API or general-purpose GPU API

  40. Decoding an H.264/AVC bitstream • H.264/AVC • recent video coding standard • successor of MPEG-4 Visual • Computationally intensive • multiple reference frames (up to 16) • B-pictures • sub-pixel interpolations • Motion compensation, reconstruction, deblocking, and color space conversion • take up to 80% of total processing time • suitable for execution on the GPU

  41. Pipelined Design for Video Decoding [Diagram: for each picture, the CPU reads the bitstream and performs VLD, IQ, and the inverse transformation; motion vectors (MVs), residual data, and QPs go through an intermediary buffer in system memory; the GPU then performs motion compensation (MC), reconstruction, deblocking, color space conversion (CSC), and visualization, overlapping with the CPU work for the next picture]

  42. Motion Compensation [Diagram: in the input sequence, a block at (x1, y1) in the reference picture predicts a block at (x2, y2) in the current picture; subtracting the prediction from the current block gives the residual data at (x3, y3), and the block offsets are stored as motion vectors]

  43. Motion Compensation: Decoder [Diagram: the decoder uses the motion vectors to fetch the prediction at (x1, y1) from the reference picture and adds the residual data at (x2, y2) to reconstruct the block at (x3, y3)]

  44. Motion Compensation in CUDA [Diagram, repeated on slides 45 and 46 as animation steps: Kernel 1 is launched on the device (GPU) as a grid of thread blocks; each block, e.g. Block (1, 1), holds an array of threads, and each thread writes one element of the output array]


  47. Motion Compensation in Direct3D • Put video picture in textures • Use vertices to represent a macroblock • Let texture coordinates point to the texture • Full-pel motion compensation • manipulate texture coordinates • Multiple pixel shaders fill macroblocks and interpolate [Diagram: a macroblock drawn as a quad whose vertices carry texture coordinates such as (0.50, 0.30), (0.60, 0.30), (0.50, 0.40), (0.60, 0.40) into the reference texture used by the rasterization process]

  48. Interpolation Strategies for Sub-pixel MC [Diagram: a vertex grid spanning the viewable area is processed by vertex shaders for full-pel positions; pixel shader passes (Pixel Shader 1–3) add the half-pel and quarter-pel (Q-pel) interpolation in successive steps]

  49. Experimental Results • GPU algorithm runs faster than the CPU algorithm • CPU is offloaded, free for other tasks

  50. Conclusions • GPU is an attractive platform for general-purpose computation • flexible, powerful, inexpensive • General-purpose APIs • approach the GPU as a super-threaded co-processor • GPGPU requires lots of parallel jobs • e.g. hundreds to thousands • GPGPU allows faster execution while offloading the CPU • e.g. decoding of H.264/AVC bitstreams • GPGPU techniques are suited for future architectures
