Lesson 7: GPU & GPGPU

Presentation Transcript


  1. Lesson 7: GPU & GPGPU

  2. Overview • Traditional Graphics Pipeline • Programmable Graphics Pipeline • Vertex Shader • Fragment (Pixel) Shader • Brief Intro of Cg • GPGPU (General Purpose GPU)

  3. Generation I: 3dfx Voodoo (1996) • One of the first true 3D game cards • Worked by supplementing a standard 2D video card • Did not do vertex transformations: these were done on the CPU • Did do texture mapping and z-buffering • http://accelenation.com/?ac.id.123.2 • [Pipeline diagram: Vertex Transforms → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer; labels: CPU, GPU, PCI]

  4. Generation II: GeForce/Radeon 7500 (1998) • Main innovation: shifting the transformation and lighting calculations to the GPU • Allowed multi-texturing, which made bump maps, light maps, and similar effects possible • Faster AGP bus instead of PCI • http://accelenation.com/?ac.id.123.5 • [Pipeline diagram: Vertex Transforms → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer; labels: GPU, AGP]

  5. Generation III: GeForce3/Radeon 8500 (2001) • For the first time, allowed a limited amount of programmability in the vertex pipeline • Also allowed volume texturing and multi-sampling (for antialiasing) • http://accelenation.com/?ac.id.123.7 • [Pipeline diagram: Vertex Transforms → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer; labels: GPU, AGP, Small vertex shaders]

  6. Generation IV: Radeon 9700/GeForce FX (2002) • The first generation of fully programmable graphics cards • Different versions have different resource limits on fragment/vertex programs • http://accelenation.com/?ac.id.123.8 • [Pipeline diagram: Vertex Transforms → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer; labels: AGP, Programmable Vertex Shader, Programmable Fragment Processor]

  7. Traditional Graphics Pipeline • A simplified graphics pipeline • Note that pipe widths vary • Many caches, FIFOs, and so on not shown • [Diagram: Application → Vertices (3D) → Transform & Light → Xformed, Lit Vertices (2D) → Assemble Primitives → Screenspace triangles (2D) → Rasterize → Fragments (pre-pixels) → Shade → Final Pixels (Color, Depth) → Video Memory (Textures); other labels: Graphics State, CPU, GPU, Render-to-texture]

  8. Pipeline: Transform • Transform & light • Transform from “world space” to “image space” • Compute per-vertex lighting

  9. ModelView Transformation • Vertices are mapped from object space to world space by M, then to eye (camera) space by V • M = model transformation (scene) • V = view transformation (camera) • Each matrix transform is applied to each vertex in the input stream. Think of this as a kernel operator. • In column-vector convention (M applied first, then V): $(X', Y', Z', W')^T = V \, M \, (X, Y, Z, 1)^T$
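
As a rough illustration of the "kernel operator" view on this slide, the Python sketch below applies a model matrix and then a view matrix to every vertex in a stream. The helper names (`mat_mul`, `mat_vec`, `translation`, `transform_vertices`) and the example matrices are hypothetical, and a column-vector convention is assumed.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_vec(m, v):
    """Multiply a 4x4 matrix by a homogeneous column vector (x, y, z, w)."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def translation(tx, ty, tz):
    """Build a 4x4 translation matrix (illustrative helper)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def transform_vertices(model, view, vertices):
    """Apply the same ModelView matrix to every vertex, like a kernel."""
    modelview = mat_mul(view, model)  # model is applied first, then view
    return [mat_vec(modelview, v) for v in vertices]


# Hypothetical scene: object shifted +2 in x, camera pulled back 5 units.
model = translation(2.0, 0.0, 0.0)
view = translation(0.0, 0.0, -5.0)
vertices = [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
transformed = transform_vertices(model, view, vertices)
```

Because no vertex depends on any other, the loop body could run on all vertices in parallel, which is what the vertex stage of the pipeline exploits.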

  10. Lighting • Lighting information is combined with normals and other parameters at each vertex in order to create new colors. • Color(v) = emissive + ambient + diffuse + specular • Each term on the right-hand side is a function of the vertex color, position, normal and material properties.
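
The sum on this slide can be sketched per vertex as follows. The slide does not say how the diffuse and specular terms are computed, so this sketch assumes a Blinn-Phong style model with a single point light; the function `vertex_color`, the `material` dictionary layout and all numeric values are hypothetical.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        return [0.0 for _ in v]  # guard against a zero-length vector
    return [x / length for x in v]


def vertex_color(position, normal, light_pos, eye_pos, material):
    """Color(v) = emissive + ambient + diffuse + specular (assumed Blinn-Phong)."""
    n = normalize(normal)
    l = normalize(sub(light_pos, position))
    n_dot_l = max(dot(n, l), 0.0)
    spec = 0.0
    if n_dot_l > 0.0:
        e = normalize(sub(eye_pos, position))
        h = normalize(add(l, e))  # half vector between light and eye
        spec = max(dot(n, h), 0.0) ** material["shininess"]
    return [min(1.0, material["emissive"][i] + material["ambient"][i]
                + material["diffuse"][i] * n_dot_l
                + material["specular"][i] * spec)
            for i in range(3)]


# Hypothetical material and light setup.
material = {"emissive": [0.0, 0.0, 0.0], "ambient": [0.1, 0.1, 0.1],
            "diffuse": [0.5, 0.0, 0.0], "specular": [0.2, 0.2, 0.2],
            "shininess": 30}
lit = vertex_color([0, 0, 0], [0, 0, 1], [0, 0, 5], [0, 0, 5], material)
unlit = vertex_color([0, 0, 0], [0, 0, 1], [0, 0, -5], [0, 0, 5], material)
```

With the light behind the surface, only the emissive and ambient terms survive, so `unlit` reduces to the ambient color.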

  11. Pipeline: Rasterizer • Rasterizer • Convert geometric representation (vertices) to image representation (fragments) • Fragment = image fragment • Pixel + associated data: color, depth, stencil, etc. • Interpolate per-vertex quantities across pixels
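
One common way to interpolate per-vertex quantities across pixels is with barycentric weights derived from edge functions. The sketch below is a minimal CPU version under that assumption; it is not necessarily how any particular GPU implements rasterization, and the function names and the 4x4 "screen" are hypothetical.

```python
def edge(a, b, p):
    """Signed area term: positive if p is to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_triangle(v0, v1, v2, c0, c1, c2, width, height):
    """Return {(x, y): interpolated color} for pixel centers inside the triangle."""
    fragments = {}
    area = edge(v0, v1, v2)
    if area == 0:
        return fragments  # degenerate triangle covers no pixels
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Each fragment carries the weighted blend of the vertex colors.
                fragments[(x, y)] = tuple(
                    w0 * a + w1 * b + w2 * c for a, b, c in zip(c0, c1, c2))
    return fragments


# Hypothetical triangle with red, green and blue vertices.
fragments = rasterize_triangle((0, 0), (4, 0), (0, 4),
                               (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
                               4, 4)
```

The same weights would be used for any other per-vertex attribute, such as depth or texture coordinates.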

  12. Pipeline: Shade • Fragment processors (multiple in parallel) • Compute a color for each pixel • Optionally read colors from textures (images)

  13. The Modern Graphics Pipeline • Programmable vertex processor! • Programmable pixel processor! • [Diagram: Application → Vertices (3D) → Transform & Light (now the Vertex Processor) → Xformed, Lit Vertices (2D) → Assemble Primitives → Screenspace triangles (2D) → Rasterize → Fragments (pre-pixels) → Shade (now the Fragment Processor) → Final Pixels (Color, Depth) → Video Memory (Textures); other labels: Graphics State, CPU, GPU, Render-to-texture]

  14. The Current Graphics Pipeline • Programmable primitive assembly! • More flexible memory access! • [Diagram: Application → Vertices (3D) → Vertex Processor → Xformed, Lit Vertices (2D) → Assemble Primitives (now the Geometry Processor) → Screenspace triangles (2D) → Rasterize → Fragments (pre-pixels) → Fragment Processor → Final Pixels (Color, Depth) → Video Memory (Textures); other labels: Graphics State, CPU, GPU, Render-to-texture]

  15. NVIDIA GeForce 6800 3D Pipeline • [Block diagram; labels: Vertex, Triangle Setup, Z-Cull, Shader Instruction Dispatch, Fragment, L2 Tex, Fragment Crossbar, Composite, Memory Partition (×4)]

  16. Precision • 32-bit IEEE floating-point throughout pipeline • Framebuffer • Textures • Fragment processor • Vertex processor • Interpolants

  17. Vertex Processor • Fully programmable (SIMD / MIMD) • Processes 4-vectors (RGBA / XYZW) • Capable of scatter but not gather • Can change the location of current vertex • Cannot read info from other vertices • Can only read a small constant memory • Latest GPUs: Vertex Texture Fetch • Random access memory for vertices • Gather (But not from the vertex stream itself)

  18. Vertex processor capabilities • 4-vector FP32 operations • Condition codes + true data-dependent control flow • Conditional branches, subroutine calls, jump table • Useful for avoiding extra work, e.g.: • Don’t do animation, skinning if vertex will be clipped • Do displacement mapping only for vertices near silhouette • Transcendental arithmetic instructions (e.g. COS) • User clip-plane support • Texture reads (up to 4 textures, unlimited lookups)

  19. Vertex processor limitations • No arbitrary memory write • No “vertex kill” • Can put vertex off-screen • Can make degenerate primitives • Only 32-bit texture formats supported

  20. Fragment Processor • Fully programmable (SIMD) • Processes 4-component vectors (RGBA / XYZW) • Random access memory read (textures) • Capable of gather but not scatter • RAM read (texture fetch), but no RAM write • Output address fixed to a specific pixel • Typically more useful than vertex processor • More fragment pipelines than vertex pipelines • Direct output (fragment processor is at end of pipeline)
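
The "gather but not scatter" constraint can be mimicked on the CPU: a kernel runs once per output pixel, may read any texel, but can only return the value for its own location. The sketch below is a hypothetical illustration of that model using a 3x3 box blur; `run_fragment_program` and `blur_kernel` are made-up names, and clamp-to-edge addressing is an assumption.

```python
def run_fragment_program(kernel, texture):
    """Invoke kernel once per output pixel; the output address is fixed."""
    height, width = len(texture), len(texture[0])
    return [[kernel(texture, x, y) for x in range(width)]
            for y in range(height)]


def blur_kernel(texture, x, y):
    """Gather: read the 3x3 neighbourhood (random-access reads are allowed)."""
    height, width = len(texture), len(texture[0])
    total = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            sx = min(max(x + dx, 0), width - 1)   # clamp-to-edge addressing
            sy = min(max(y + dy, 0), height - 1)
            total += texture[sy][sx]
    # The kernel returns one value; it has no way to write to another pixel.
    return total / 9.0


# Hypothetical 3x3 single-channel texture with one bright texel.
texture = [[0.0, 0.0, 0.0],
           [0.0, 9.0, 0.0],
           [0.0, 0.0, 0.0]]
blurred = run_fragment_program(blur_kernel, texture)
```

An algorithm that is naturally written as a scatter (each input writes to a computed address) has to be reformulated so that each output pulls in the inputs it needs.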

  21. Fragment processor: texture mapping • Texture reads are just another instruction • Allows computed texture coordinates, nested to arbitrary depth • This is a big difference between NVIDIA and ATI right now • Allows multiple uses of a single texture unit • Optional LOD control: can specify filter extent • Think of it as a memory-read instruction, with optional user-controlled filtering

  22. Fragment processor capabilities • Dynamic branching • Conditional fragment-kill instruction • Read access to window-space position • Read/write access to fragment Z (but not stencil) • Multiple render targets • Built-in derivative instructions • Partial derivatives w.r.t. screen-space x or y • Useful for anti-aliasing shaders • FP32, FP16, and fixed-point data

  23. Fragment processor limitations • Dynamic branching less efficient than vertex proc. • Especially for non-coherent branching (<~ 30x30 pixels) • Can do a lot with condition codes • No indexed reads from registers • I.e., no indexed arrays • Must use texture reads instead • No arbitrary memory write

  24. GPU vendor differences • Note: this slide will be dated almost instantly • NVIDIA: as described in previous slides • ATI hardware today (Radeon X1900 XT is the current high-end part): • No vertex texture fetch (but good render-to-vertex-array) • Far fewer levels of computed texture coordinates • Better at fine-grained (less coherent) dynamic branching • ATI Xenos (Xbox 360 chip): • Unified shader model: vertex proc == pixel proc • Scatter support: shaders can write to arbitrary memory locations

  25. Cg: C for Graphics • Cg is a high-level GPU programming language • Designed by NVIDIA and Microsoft • Competes with the (quite similar) GL Shading Language, a.k.a. GLslang

  26. Programming in assembly is painful • Assembly:

          …
          FRC R2.y, C11.w;
          ADD R3.x, C11.w, -R2.y;
          MOV H4.y, R2.y;
          ADD H4.x, -H4.y, C4.w;
          MUL R3.xy, R3.xyww, C11.xyww;
          ADD R3.xy, R3.xyww, C11.z;
          TEX H5, R3, TEX2, 2D;
          ADD R3.x, R3.x, C11.x;
          TEX H6, R3, TEX2, 2D;
          …

      • Cg:

          …
          L2weight = timeval - floor(timeval);
          L1weight = 1.0 - L2weight;
          ocoord1 = floor(timeval)/64.0 + 1.0/128.0;
          ocoord2 = ocoord1 + 1.0/64.0;
          L1offset = f2tex2D(tex2, float2(ocoord1, 1.0/128.0));
          L2offset = f2tex2D(tex2, float2(ocoord2, 1.0/128.0));
          …

      • Easier to read and modify • Cross-platform • Combine pieces • etc.
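
To make the arithmetic in the Cg fragment concrete, here is a small Python sketch of its first four statements. It only reproduces the weight and coordinate calculation; the two `f2tex2D` texture fetches are left out, and the function name `lookup_weights_and_coords` is hypothetical. The reading that `1.0/128.0` is a half-texel offset into a 64-texel-wide texture is an assumption based on the constants alone.

```python
import math


def lookup_weights_and_coords(timeval):
    """Mirror the first four Cg statements for a given timeval."""
    l2_weight = timeval - math.floor(timeval)  # fractional part of timeval
    l1_weight = 1.0 - l2_weight
    # Assumed: floor(timeval) selects a texel, 1/128 centers the sample in it.
    ocoord1 = math.floor(timeval) / 64.0 + 1.0 / 128.0
    ocoord2 = ocoord1 + 1.0 / 64.0             # the next texel along
    return l1_weight, l2_weight, ocoord1, ocoord2


l1, l2, c1, c2 = lookup_weights_and_coords(2.25)
```

For `timeval = 2.25` the two weights sum to one, which is consistent with blending the two fetched values linearly.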

  27. Some points in the design space • CPU languages • C – close to the hardware; general purpose • C++, Java, lisp – require memory management • RenderMan – specialized for shading • Real-time shading languages • Stanford shading language • Creative Labs shading language

  28. Design strategy • Start with C (and a bit of C++) • Minimizes number of decisions • Gives you known mistakes instead of unknown ones • Allow subsetting of the language • Add features desired for GPUs • To support GPU programming model • To enable high performance • Tweak to make it fit together well

  29. How are GPUs different from CPUs? • GPU is a stream processor • Multiple programmable processing units • Connected by data flows • [Diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer; other label: Textures]

  30. How are GPUs different from CPUs? • Greater variation in basic capabilities • Most processors don’t yet support branching • Vertex processors don’t support texture mapping • Some processors support additional data types • Compiler can’t hide these differences • Least-common-denominator is too restrictive • Cg exposes differences via language profiles (list of capabilities and data types) • Over time, profiles will converge

  31. How are GPUs different from CPUs? • Optimized for 4-vector arithmetic • Useful for graphics: colors, vectors, texcoords • Easy way to get high performance/cost • C philosophy says: expose these HW data types • Cg has vector data types and operations, e.g. float2, float3, float4 • Makes it obvious how to get high performance • Cg also has matrix data types, e.g. float3x3, float3x4, float4x4

  32. How are GPUs different from CPUs? • No support for pointers • Arrays are first-class data types in Cg • No integer data type • Cg adds “bool” data type for boolean operations • This change isn’t obvious except when declaring vars

  33. Cg basic data types • All profiles: • float • bool • All profiles with texture lookups: • sampler1D, sampler2D, sampler3D, samplerCUBE • NV_fragment_program profile: • half: half-precision float • fixed: fixed point [-2,2)

  34. Cg Example • The following fragment program implements a (very) simple toon shader • Flat 3-tone shading • Highlight • Base color • Shadow • Black silhouettes

  35. Cg Example – part 1

          // In:
          //   eye_space position = TEX7
          //   eye space T = (TEX4.x, TEX5.x, TEX6.x) denormalized
          //   eye space B = (TEX4.y, TEX5.y, TEX6.y) denormalized
          //   eye space N = (TEX4.z, TEX5.z, TEX6.z) denormalized
          fragout frag program main(vf30 In) {
              float m = 30;                               // power
              float3 hiCol = float3( 1.0, 0.1, 0.1 );     // lit color
              float3 lowCol = float3( 0.3, 0.0, 0.0 );    // dark color
              float3 specCol = float3( 1.0, 1.0, 1.0 );   // specular color

              // Get eye-space eye vector.
              float3 e = normalize( -In.TEX7.xyz );

              // Get eye-space normal vector.
              float3 n = normalize(float3(In.TEX4.z, In.TEX5.z, In.TEX6.z));

  36. Cg Example – part 2

              float edgeMask = (dot(e, n) > 0.4) ? 1 : 0;
              float3 lpos = float3(3,3,3);
              float3 l = normalize(lpos - In.TEX7.xyz);
              float3 h = normalize(l + e);
              float specMask = (pow(dot(h, n), m) > 0.5) ? 1 : 0;
              float hiMask = (dot(l, n) > 0.4) ? 1 : 0;
              float3 ocol1 = edgeMask * (lerp(lowCol, hiCol, hiMask) + (specMask * specCol));
              fragout O;
              O.COL = float4(ocol1.x, ocol1.y, ocol1.z, 1);
              return O;
          }
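
For readers without a Cg toolchain, the toon shader's logic can be followed with a CPU reference in Python. This is a hedged port, not the original program: the function `toon_shade` and its tuple-based inputs are hypothetical stand-ins for the `In.TEX7` position and the normal assembled from `TEX4`/`TEX5`/`TEX6`, and the constants are copied from the slides.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def lerp(a, b, t):
    return tuple(x * (1.0 - t) + y * t for x, y in zip(a, b))


def toon_shade(position, normal, light_pos=(3.0, 3.0, 3.0), m=30):
    """Return an RGBA tuple for one fragment; inputs are in eye space."""
    hi_col = (1.0, 0.1, 0.1)     # lit color
    low_col = (0.3, 0.0, 0.0)    # dark color
    spec_col = (1.0, 1.0, 1.0)   # specular color

    e = normalize(tuple(-p for p in position))  # eye vector
    n = normalize(normal)
    edge_mask = 1.0 if dot(e, n) > 0.4 else 0.0  # 0 gives black silhouettes

    l = normalize(tuple(lp - p for lp, p in zip(light_pos, position)))
    h = normalize(tuple(a + b for a, b in zip(l, e)))
    # As on the slide, dot(h, n) is not clamped before the power.
    spec_mask = 1.0 if pow(dot(h, n), m) > 0.5 else 0.0
    hi_mask = 1.0 if dot(l, n) > 0.4 else 0.0

    base = lerp(low_col, hi_col, hi_mask)
    rgb = tuple(edge_mask * (b + spec_mask * s) for b, s in zip(base, spec_col))
    return rgb + (1.0,)


# Hypothetical fragments 5 units in front of the eye.
facing = toon_shade((0.0, 0.0, -5.0), (0.0, 0.0, 1.0))
silhouette = toon_shade((0.0, 0.0, -5.0), (1.0, 0.0, 0.0))
shadowed = toon_shade((0.0, 0.0, -5.0), (-1.0, -1.0, 1.0))
```

Each mask is a hard 0/1 threshold, which is what produces the flat tones: the fragment ends up as the highlight, base, shadow or black silhouette color with no gradient in between.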

  37. GPGPU • The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor • Programmability • Precision • Power • GPGPU: an emerging field seeking to harness GPUs for general-purpose computation

  38. Motivation: Computational Power • GPUs are fast… • 3.0 GHz dual-core Pentium4: 24.6 GFLOPS • NVIDIA GeForce 7800: 165 GFLOPS • 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s • ATI Radeon X850 XT Platinum Edition: 37.8 GB/s • GPUs are getting faster, faster • CPUs: 1.4× annual growth • GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
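
The figures on this slide can be turned into ratios and compounded growth with a few lines of Python. The five-year horizon is an arbitrary assumption chosen only to show how quickly the gap widens, and the helper names are hypothetical.

```python
def ratio(gpu_value, cpu_value):
    """How many times larger the GPU figure is than the CPU figure."""
    return gpu_value / cpu_value


def compound_growth(annual_factor, years):
    """Total growth after compounding an annual factor for a number of years."""
    return annual_factor ** years


flops_ratio = ratio(165.0, 24.6)           # GFLOPS: GPU vs. CPU
bandwidth_ratio = ratio(37.8, 8.5)         # GB/s: GPU vs. CPU
cpu_5y = compound_growth(1.4, 5)           # CPU growth over 5 years
gpu_pixel_5y = compound_growth(1.7, 5)     # GPU pixel rate over 5 years
gpu_vertex_5y = compound_growth(2.3, 5)    # GPU vertex rate over 5 years
```

With the slide's numbers this comes to roughly 6.7× in arithmetic throughput and 4.4× in memory bandwidth, and after five years of compounding about 5.4× for CPUs against roughly 14× to 64× for GPUs.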

  39. Motivation: Computational Power

  40. Motivation: Flexible and Precise • Modern GPUs are deeply programmable • Programmable pixel, vertex, video engines • Solidifying high-level language support • Modern GPUs support high precision • 32 bit floating point throughout the pipeline • High enough for many (not all) applications

  41. Motivation: The Potential of GPGPU • The power and flexibility of GPUs makes them an attractive platform for general-purpose computation • Example applications range from in-game physics simulation to conventional computational science • Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

  42. Problems: Difficult To Use • GPUs designed for & driven by video games • Programming model unusual • Programming idioms tied to computer graphics • Programming environment tightly constrained • Underlying architectures are: • Inherently parallel • Rapidly evolving (even in basic feature set!) • Largely secret • Can’t simply “port” CPU code!

  43. GPGPU • Why use the GPU for general-purpose computing? • How is it programmed?
