
Graphics Hardware

Graphics Hardware. Kurt Akeley, CS248 Lecture 14, 8 November 2007. http://graphics.stanford.edu/courses/cs248-07/



Presentation Transcript


  1. Graphics Hardware Kurt Akeley CS248 Lecture 14 8 November 2007 http://graphics.stanford.edu/courses/cs248-07/

  2. Implementation = abstraction (from lecture 2)
     [Figure: the OpenGL pipeline (Application, Vertex assembly, Vertex operations, Primitive assembly, Primitive operations, Rasterization, Fragment operations, Framebuffer) beside the NVIDIA GeForce 8800 block diagram (Application, Data Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Prim Thread Issue, Frag Thread Issue, Thread Processor, and repeated SP, TF, L1, L2, and FB blocks). Source: NVIDIA]

  3. Correspondence (by color)
     [Figure: the OpenGL pipeline and NVIDIA GeForce 8800 block diagrams, color-coded into three groups: fixed-function assembly processors, the application-programmable parallel processor, and fixed-function framebuffer operations. Rasterization is labeled "(fragment assembly)"; one annotation reads "this was missing".]

  4. Why does graphics hardware exist?
     Special-purpose hardware tends to disappear over time
     • Lisp machines and CAD workstations of the 80s
     • CISC CPUs
     [Images: iAPX432 (circa 1982), www.dvorak.org/blog/; Symbolics Lisp Machines (circa 1984), www.abstractscience.freeserve.co.uk/symbolics/photos/]

  5. Why does graphics hardware exist?
     Graphics acceleration has been around for 40 years. Why do GPUs remain? Confluence of four things:
     • Performance differentiation
       • GPUs are much faster than CPUs at 3-D rendering tasks
     • Work-load sufficiency
       • The accelerated 3-D rendering tasks make up a significant portion of the overall processing (thus Amdahl’s law doesn’t limit the resulting performance increase).
     • Strong market demand
       • Customer demand for 3-D graphics performance is strong
       • Driven by the games market
     • Ubiquity
       • With the help of standardized APIs/architectures (OpenGL and Direct3D) GPUs have achieved ubiquity in the PC market
       • Inertia now works in favor of continued graphics hardware
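The work-load sufficiency point can be made concrete with Amdahl's law. The sketch below is illustrative only: the function name `amdahl_speedup` and the sample numbers in the note are assumptions, not figures from the lecture.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction f of the original run time
 * (0 <= f <= 1) is accelerated by a factor s (s > 0) and the rest is not.
 * The name amdahl_speedup is hypothetical, chosen for this sketch. */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, if rendering were 90% of the work and the GPU made it 10x faster, the overall gain would be roughly 5.3x; if rendering were only 50% of the work, no accelerator could reach 2x. This is why the accelerated tasks must be a significant portion of the overall processing.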

  6. NVIDIA 8800 Ultra

  7. NVIDIA performance trends Yearly Growth is well above 1.5 (Moore’s Law)

  8. SGI performance trends (depth buffered) Yearly Growth well above 1.5 (Moore’s Law)

  9. CPU performance CAGR has been slowing Source: Hennessy and Patterson

  10. The situation could change …
      CPUs are becoming much more parallel
      • CPU performance increase (1.2x to 1.5x per year) is low compared with the GPU increase (1.7x to 2x per year).
      • This could change now with CPU parallelism (many-core)
      The vertex pipeline architecture is getting old
      • Approaches such as ray tracing offer many advantages, but the vertex pipeline is poorly optimized for them
      • The work-load argument is somewhat circular, because the brute-force algorithms employed by GPUs inflate their own performance demands
      GPUs have evolved and will continue to evolve
      • But a revolution is always possible

  11. Outline
      The rest of this lecture is organized around the four ideas that most informed the design of modern GPUs (as enumerated by David Blythe in this lecture’s reading assignment):
      • Parallelism
      • Coherence
      • Latency
      • Programmability
      I’ll continue to use the NVIDIA 8800 as a specific example

  12. Parallelism

  13. Graphics is “embarrassingly parallel”
      Many separate tasks (the types I keep talking about)
      No “horizontal” dependencies, few “vertical” (in-order execution)
      Pipeline stages: Application, Vertex assembly, Vertex operations, Primitive assembly, Primitive operations, Rasterization, Fragment operations, Framebuffer, Display
      Data types shown alongside the pipeline:
          typedef struct { float x,y,z,w; float r,g,b,a; } vertex;
          typedef struct { vertex v0,v1,v2; } triangle;
          typedef struct { short int x,y; float depth; float r,g,b,a; } fragment;
          typedef struct { int depth; unsigned char r,g,b,a; } pixel;
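As a rough illustration of how one stage turns one of these types into the next, here is a minimal sketch of primitive assembly for independent triangles. The function name `assemble_triangles` and the rule "every three consecutive vertices form a triangle" are assumptions made for the example; the slide does not specify an assembly rule.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; float r, g, b, a; } vertex;
typedef struct { vertex v0, v1, v2; } triangle;

/* Hypothetical primitive-assembly stage: groups a flat vertex stream into
 * independent triangles, three consecutive vertices per triangle.
 * Leftover vertices (fewer than three) are ignored.
 * Returns the number of triangles written, at most max_out. */
size_t assemble_triangles(const vertex *verts, size_t nverts,
                          triangle *out, size_t max_out)
{
    assert(verts != NULL && out != NULL);
    size_t ntris = nverts / 3;
    if (ntris > max_out)
        ntris = max_out;
    for (size_t i = 0; i < ntris; ++i) {
        /* Each output triangle depends only on its own three vertices,
         * so there are no "horizontal" dependencies between triangles. */
        out[i].v0 = verts[3 * i];
        out[i].v1 = verts[3 * i + 1];
        out[i].v2 = verts[3 * i + 2];
    }
    return ntris;
}
```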

  14. Data and task parallelism
      Data parallelism
      • Simultaneously doing the same thing to similar data
      • E.g., transforming vertexes
      • Some variance in “same thing” is possible
      Task parallelism
      • Simultaneously doing different things
      • E.g., the tasks (stages) of the vertex pipeline
      [Figure: the pipeline from Application through Display, with its stages labeled “Task Parallelism” and the work within a stage labeled “Data Parallelism”]
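The "transforming vertexes" example of data parallelism can be sketched as a loop whose iterations are fully independent. This is an illustration under assumptions: the name `transform_vertices` and the row-major 4x4 matrix layout are choices made here, not taken from the lecture.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; float r, g, b, a; } vertex;

/* Multiplies each vertex position by a 4x4 matrix stored row-major in m
 * (assumed layout). Colors are left untouched. No iteration reads or writes
 * another iteration's data, so the loop could be split across any number of
 * processors: the same operation applied to similar data. */
void transform_vertices(const float m[16], vertex *verts, size_t n)
{
    assert(m != NULL && (verts != NULL || n == 0));
    for (size_t i = 0; i < n; ++i) {
        const float x = verts[i].x, y = verts[i].y;
        const float z = verts[i].z, w = verts[i].w;
        verts[i].x = m[0]  * x + m[1]  * y + m[2]  * z + m[3]  * w;
        verts[i].y = m[4]  * x + m[5]  * y + m[6]  * z + m[7]  * w;
        verts[i].z = m[8]  * x + m[9]  * y + m[10] * z + m[11] * w;
        verts[i].w = m[12] * x + m[13] * y + m[14] * z + m[15] * w;
    }
}
```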

  15. Trend from pipeline to data parallelism
      [Figure: block diagrams of three geometry subsystems: Clark “Geometry Engine” (1983), SGI 4D/GTX (1988), and SGI RealityEngine (1992). Block labels include: Coordinate Transform; 6-plane Frustum Clipping; Divide by w; Viewport; Coord, normal Transform; Lighting; Clip testing; Clipping state; Divide by w (clipping); Prim. Assy.; Backface cull; Command Processor; Round-robin; Aggregation]

  16. Load balancing
      Easy for data parallelism
      Challenging for task parallelism
      • Static balance is difficult to achieve
      • And even when achieved, it is insufficient
      • Mode changes affect execution time (e.g., complex lighting)
      • Worse, data can affect execution time (e.g., clipping)
      Unified architectures ease pipeline balance
      • Pipeline is virtual, processors assigned as required
      • 8800 is unified
      [Figure: the pipeline from Application through Display]

  17. Unified pipeline architecture
      [Figure: the OpenGL pipeline and NVIDIA GeForce 8800 block diagrams, highlighting the application-programmable parallel processor. Rasterization is labeled "(fragment assembly)"; one annotation reads "this was missing".]

  18. Queueing
      FIFO buffering (first-in, first-out) is provided between task stages
      • Accommodates variation in execution time
      • Provides elasticity to allow unified load balancing to work
      FIFOs can also be unified
      • Share a single large memory with multiple head-tail pairs
      • Allocate as required
      [Figure: Application, Vertex assembly, Vertex operations, and Primitive assembly stages with a FIFO between each pair]
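One way to read "a single large memory with multiple head-tail pairs" is a shared slot pool with a free list, where each logical FIFO is a linked chain of slots. The sketch below follows that reading; the names (`unified_fifo`, `ufifo_push`, `ufifo_pop`) and the sizes are hypothetical, and real hardware may organize the memory quite differently.

```c
#include <assert.h>

#define POOL_SLOTS 8   /* hypothetical size of the shared memory, in entries */
#define NUM_QUEUES 3   /* hypothetical number of logical FIFOs */
#define NIL (-1)

typedef struct {
    int value[POOL_SLOTS];  /* payload stored in each slot */
    int next[POOL_SLOTS];   /* link to the next slot in the same chain */
    int free_head;          /* first unused slot, or NIL */
    int head[NUM_QUEUES];   /* oldest entry of each FIFO, or NIL */
    int tail[NUM_QUEUES];   /* newest entry of each FIFO, or NIL */
} unified_fifo;

void ufifo_init(unified_fifo *u)
{
    for (int i = 0; i < POOL_SLOTS; ++i)
        u->next[i] = (i + 1 < POOL_SLOTS) ? i + 1 : NIL;
    u->free_head = 0;
    for (int q = 0; q < NUM_QUEUES; ++q)
        u->head[q] = u->tail[q] = NIL;
}

/* Appends v to FIFO q. Returns 1 on success, 0 if the shared pool is full. */
int ufifo_push(unified_fifo *u, int q, int v)
{
    assert(q >= 0 && q < NUM_QUEUES);
    int s = u->free_head;
    if (s == NIL)
        return 0;
    u->free_head = u->next[s];      /* allocate a slot as required */
    u->value[s] = v;
    u->next[s] = NIL;
    if (u->tail[q] == NIL)
        u->head[q] = s;
    else
        u->next[u->tail[q]] = s;
    u->tail[q] = s;
    return 1;
}

/* Removes the oldest entry of FIFO q into *v. Returns 0 if q is empty. */
int ufifo_pop(unified_fifo *u, int q, int *v)
{
    assert(q >= 0 && q < NUM_QUEUES);
    int s = u->head[q];
    if (s == NIL)
        return 0;
    *v = u->value[s];
    u->head[q] = u->next[s];
    if (u->head[q] == NIL)
        u->tail[q] = NIL;
    u->next[s] = u->free_head;      /* return the slot to the shared pool */
    u->free_head = s;
    return 1;
}
```

The design point is that a busy stage can borrow capacity that an idle stage is not using, which fixed-size per-stage FIFOs cannot do.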

  19. In-order execution
      • Work elements must be sequence stamped
      • Can use FIFOs as reorder buffers as well
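A minimal sketch of sequence stamping with a reorder buffer follows. It assumes stamps are consecutive unsigned integers starting at 0 and that a fixed window of results may be outstanding; the names (`reorder_buffer`, `rob_complete`, `rob_retire`) and the window size are hypothetical.

```c
#include <assert.h>

#define ROB_SLOTS 8  /* hypothetical window of outstanding work elements */

typedef struct {
    int      valid[ROB_SLOTS];   /* 1 if the slot holds a finished result */
    int      result[ROB_SLOTS];  /* finished results, indexed by stamp */
    unsigned next_seq;           /* stamp of the next element to retire */
} reorder_buffer;

void rob_init(reorder_buffer *r)
{
    for (int i = 0; i < ROB_SLOTS; ++i)
        r->valid[i] = 0;
    r->next_seq = 0;
}

/* Records a result that finished in any order. Returns 0 if the stamp falls
 * outside the window [next_seq, next_seq + ROB_SLOTS) or was already seen. */
int rob_complete(reorder_buffer *r, unsigned seq, int result)
{
    if (seq - r->next_seq >= ROB_SLOTS)  /* unsigned wrap rejects old stamps */
        return 0;
    unsigned slot = seq % ROB_SLOTS;
    if (r->valid[slot])
        return 0;
    r->valid[slot] = 1;
    r->result[slot] = result;
    return 1;
}

/* Retires the next result in stamp order. Returns 0 if it is not ready yet,
 * even when later-stamped results are already waiting. */
int rob_retire(reorder_buffer *r, int *result)
{
    assert(result != 0);
    unsigned slot = r->next_seq % ROB_SLOTS;
    if (!r->valid[slot])
        return 0;
    *result = r->result[slot];
    r->valid[slot] = 0;
    r->next_seq++;
    return 1;
}
```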

  20. Coherence

  21. Two aspects of coherence
      Data locality
      • The data required for computation are “nearby”
      Computational coherence
      • Similar sequences of operations are being performed

  22. Data locality
      Prior to texture mapping:
      • Vertex pipeline was a stream processor
      • Each work element (vertex, primitive, fragment) carried all the state it needed
      • Modal state was local to the pipeline stage
      • Assembly stages operated on adjacent work elements
      • Data locality was inherent in this model
      Post texture mapping:
      • All application-programmable stages have memory access (and use it)
      • So the vertex pipeline is no longer a stream processor
      • Data locality must be fought for …
      [Figure: the pipeline from Application through Display]

  23. Post-texture mapping data locality (simplified)
      Modern memory (DRAM) operates in large blocks
      • Memory is a 2-D array
      • Access is to an entire row
      To make efficient use of memory bandwidth, all the data in a block must be used
      Two things can be done:
      • Aggregate read and write requests
        • Memory controller and cache
        • Complex part of GPU design
      • Organize memory contents coherently (blocking)

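  24. Texture Blocking
      6D Organization
      [Figure: a texture address built from a base plus the fields s1, t1, s2, t2, s3, t3, with the coordinate pairs (s1,t1), (s2,t2), (s3,t3) marked on the texture. Labels: Cache Line Size; Cache Size; 4x4 blocks; 4x4 texels]
      Source: Pat Hanrahan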
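A blocked address computation in the spirit of the 6D organization might look like the sketch below. Several things here are assumptions rather than facts from the slide: that a cache line holds a 4x4 block of texels, that the cache-sized unit is a 4x4 group of such blocks, that the texture width is a multiple of 16 texels, and the function name `blocked_texel_index`. The mapping of these three levels onto the slide's (s1,t1), (s2,t2), (s3,t3) labels is deliberately left unstated.

```c
#include <assert.h>

/* Maps texel coordinates (s, t) to a texel index in a blocked layout:
 *   level A: position within a 4x4-texel block        (assumed cache line)
 *   level B: block within a 4x4-block group            (assumed cache size)
 *   level C: group within the texture, row by row
 * Texels that are close in 2-D end up close in memory, so a fetched
 * block is likely to be fully used. */
unsigned blocked_texel_index(unsigned s, unsigned t, unsigned width_texels)
{
    assert(width_texels > 0 && width_texels % 16 == 0);
    assert(s < width_texels);

    unsigned s_in  = s & 3u,         t_in  = t & 3u;         /* level A */
    unsigned s_blk = (s >> 2) & 3u,  t_blk = (t >> 2) & 3u;  /* level B */
    unsigned s_grp = s >> 4,         t_grp = t >> 4;         /* level C */

    unsigned groups_per_row = width_texels / 16u;
    unsigned group = t_grp * groups_per_row + s_grp;
    unsigned block = t_blk * 4u + s_blk;
    unsigned texel = t_in * 4u + s_in;

    /* 16 blocks per group, 16 texels per block. */
    return (group * 16u + block) * 16u + texel;
}
```

With a 64-texel-wide texture, the 16 texels with s and t both in 0..3 occupy indices 0 through 15, so a single block-sized fetch covers that whole 4x4 neighborhood.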

  25. Computational coherence
      Data parallelism is computationally coherent
      • Simultaneously doing the same thing to similar data
      • Can share a single instruction sequencer with multiple data paths
      SIMD: Single Instruction Multiple Data
      [Figure: an “Instruction fetch and execute” unit shown with two vertex records, each declared as struct { float x,y,z,w; float r,g,b,a;} vertex;]

  26. SIMD processing
      [Figure: the NVIDIA GeForce 8800 block diagram, highlighting one of eight 16-wide SIMD processors; one annotation reads "this was missing".]
      Why not use one 128-wide processor?

  27. SIMD conditional control flow
      The “shader” abstraction operates on each data element independently
      But SIMD implementation shares a single execution unit across multiple data elements
      If data elements in the same SIMD unit branch differently the execution unit must follow both paths (sequentially)
      The solution is predication:
      • Both paths are executed
      • Data paths are enabled only during their selected path
      • Can be nested
      • Performance is obviously lost!
      SIMD width is a compromise:
      • Too wide → too much performance loss due to predication
      • Too narrow → inefficient hardware implementation
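Predication can be imitated in scalar C to show where the cost comes from. In this sketch the 16 lanes are simulated by loops, the per-element branch (negate if negative, otherwise double) is an arbitrary example, and the function name `simd_abs_or_double` is hypothetical. Skipping a path that no lane selects is an assumption of this sketch; the slide only states that both paths are executed when lanes diverge.

```c
#include <assert.h>

#define SIMD_WIDTH 16  /* lanes sharing one instruction sequencer */

/* Per-lane program: out = (in < 0) ? -in : 2 * in.
 * Returns the number of passes the shared sequencer had to make:
 * 1 if every lane took the same path, 2 if the lanes diverged. */
int simd_abs_or_double(const int in[SIMD_WIDTH], int out[SIMD_WIDTH])
{
    int mask[SIMD_WIDTH];
    int any_then = 0, any_else = 0;

    assert(in != 0 && out != 0);

    /* Evaluate the condition in every lane to build the predicate mask. */
    for (int i = 0; i < SIMD_WIDTH; ++i) {
        mask[i] = in[i] < 0;
        any_then |= mask[i];
        any_else |= !mask[i];
    }

    int passes = 0;
    if (any_then) {              /* "then" path: only masked lanes enabled */
        for (int i = 0; i < SIMD_WIDTH; ++i)
            if (mask[i])
                out[i] = -in[i];
        ++passes;
    }
    if (any_else) {              /* "else" path: only unmasked lanes enabled */
        for (int i = 0; i < SIMD_WIDTH; ++i)
            if (!mask[i])
                out[i] = 2 * in[i];
        ++passes;
    }
    return passes;
}
```

A single divergent lane is enough to double the number of passes for all 16 lanes, which is the performance loss that grows as the SIMD width increases.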

  28. Latency

  29. Again two issues
      Overall rendering latency
      • Typically measured in frames
      • Of concern to application programmers
      • Short on modern GPUs (more from Dave Oldcorn on this)
      • But GPUs with longer rendering latencies have been designed
      • Fun to talk about in a graphics architecture course
      Memory access latency
      • Typically measured in clock cycles (and reaching thousands of those)
      • Of direct concern to GPU architects and implementors
      • But useful for application programmers to understand too!

  30. Multi-threading
      Another kind of processor virtualization
      • Unified GPUs share a single execution engine among multiple pipeline (task) stages
        • Equivalent to CPU multi-tasking
      • Multi-threading shares a single execution engine among multiple data-parallel work elements
        • Similar to CPU hyper-threading
      The 8800 Ultra multi-threading mechanism is used to support both multi-tasking and data-parallel multi-threading
      A thread is a data structure:
          struct {
              int pc;        // program counter
              float reg[n];  // live register state
              enum ctxt;     // context information
              …
          } thread;
      More live registers mean more memory usage

  31. Multi-threading
      [Figure: the NVIDIA GeForce 8800 block diagram (Application, Data Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Prim Thread Issue, Frag Thread Issue, Thread Processor); labels include "Programmability", and one annotation reads "this was missing".]

  32. Multi-threading hides latency
      [Figure: threads move between a “Blocked Threads” pool and a “Ready to Run Threads” pool around an “Instruction fetch and execute” unit, shown with vertex records (struct { float x,y,z,w; float r,g,b,a;} vertex;). Transition labels: “Memory reference (or resulting data dependency)” and “Memory data available (dependency resolved)”.]
      Processor stalls if no threads are ready to run. Possible result of large thread context (too many live registers)
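The trade-off between thread count and thread context size can be sketched with back-of-envelope arithmetic. The model below is a simplification assumed for illustration, not the 8800's actual scheduling: every thread alternates between executing for `run_cycles` and waiting `latency_cycles` on memory, and a thread's context is its live registers plus a fixed overhead. Both function names are hypothetical.

```c
#include <assert.h>

/* Smallest number of threads that keeps the processor busy in the assumed
 * model: while one thread waits latency_cycles, the others must supply
 * enough run_cycles bursts to cover the wait. */
unsigned threads_to_hide_latency(unsigned run_cycles, unsigned latency_cycles)
{
    assert(run_cycles > 0);
    unsigned period = run_cycles + latency_cycles;
    return (period + run_cycles - 1) / run_cycles;  /* ceiling division */
}

/* Number of thread contexts that fit in a thread store of store_bytes,
 * when each thread keeps live_regs float registers plus overhead_bytes
 * of other context (program counter and so on). */
unsigned threads_in_store(unsigned store_bytes, unsigned live_regs,
                          unsigned overhead_bytes)
{
    unsigned per_thread = live_regs * (unsigned)sizeof(float) + overhead_bytes;
    assert(per_thread > 0);
    return store_bytes / per_thread;
}
```

If `threads_in_store` falls below `threads_to_hide_latency` for a given shader, this model predicts stalls: doubling the live registers halves the number of resident threads, which is the "too many live registers" case.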

  33. Cache and thread store
      CPU
      • Uses cache to hide memory latency
      • Caches are huge (many MBs)
      GPU
      • Uses cache to aggregate memory requests and maximize effective bandwidth
      • Caches are relatively small
      • Uses multithreading to hide memory latency
      • Thread store is large
      Total memory usage on CPU and GPU chips is becoming similar …

  34. Programmability

  35. Programmability trade-offs
      Fixed-function:
      • Efficient in die area and power dissipation
      • Rigid in functionality
      • Simple
      Programmable:
      • Wasteful of die area and power
      • Flexible and adaptable
      • Able to manage complexity

  36. Programmability is not new
      The Silicon Graphics VGX (1990) supported programmable vertex, primitive, and fragment operations.
      • These operations are complex and require flexibility and adaptability
      • The assembly operations are relatively simple and have few options
      • Texture fetch and filter are also simple and benefit from fixed-function implementation
      What is new is allowing application developers to write vertex, primitive, and fragment shaders
      [Figure: the OpenGL pipeline from Application through Framebuffer]

  37. Questions

  38. Why insist on in-order processing?
      Even Direct3D 10 does
      • Testability (repeatability)
      • Invariance for multi-pass rendering (repeatability)
      • Utility of painter’s algorithm
      • State assignment!

  39. Why can’t fragment shaders access the framebuffer?
      Equivalent to: why do other people’s block diagrams distinguish between fragment operations and framebuffer operations?
      Simple answer: cache consistency
      [Figure: the OpenGL pipeline from Application through Framebuffer]

  40. Why hasn’t tiled rendering caught on?
      It seems very attractive:
      • Small framebuffer (that can be on-die in some cases)
      • Deep framebuffer state (e.g., for transparency sorting)
      • High performance
      Problems:
      • May increase rendering latency
      • Has difficulty with multi-pass algorithms
      • Doesn’t match the OpenGL/Direct3D abstraction

  41. Summary
      Parallelism
      • Graphics is inherently highly data and task parallel
      • Challenges include in-order execution and load balancing
      Coherence
      • Streaming is inherently data and instruction coherent
      • But texture fetch breaks streaming model / data coherence
      • Reference aggregation and memory layout restore data coherence
      Latency
      • Modern GPU implementations have minimal rendering latency
      • Multithreading (not caching) hides (the large) memory latency
      Programmability
      • “Operation” stages are (and have long been) programmable
      • Assembly stages, texture filtering, and ROPs typically are not
      • Application programmability is new

  42. Assignments
      Next lecture: Performance Tuning and Debugging (guest lecturer Dave Oldcorn, AMD)
      Reading assignment for Tuesday’s class:
      • Sections 2.8 (vertex arrays) and 2.9 (buffer objects) of the OpenGL 2.1 specification
      Short office hours today

  43. End
