Status – Week 276

Status – Week 276. Victor Moya. Hardware Pipeline. Command Processor. Vertex Shader. Rasterization. Pixel Shader. Fragment Operations and Tests. Command Processor. Receives commands from the CPU (driver, OpenGL/Direct3D). Fetches data from memory: vertex data (DMA).

Presentation Transcript


  1. Status – Week 276 Victor Moya

  2. Hardware Pipeline • Command Processor. • Vertex Shader. • Rasterization. • Pixel Shader. • Fragment Operations and Tests.

  3. Command Processor • Receives commands from the CPU (driver, OpenGL/Direct3D). • Fetches data from memory: vertex data (DMA). • Updates and stores OpenGL/Direct3D render state.

  4. Vertex Shader • Transforms and lights vertex streams. • Vertex shader program (from GPU memory?). • Vertex shader constants (from GPU memory?). • Inputs: vertex data 16x4D • Outputs: vertex data 14x4D

  5. Rasterization • Includes: • Clipping • Divide by w • Affine transform • Primitive assembly • Culling • Setup • Fragment generation. • Receives vertices and produces fragments. • Uses OpenGL/Direct3D render state. • Input: vertex (15x4D). • Output: fragments (10x4D).
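The rasterizer steps listed above (divide by w, affine viewport transform, culling, setup and fragment generation) can be sketched for a single triangle. This is a minimal illustration under stated assumptions, not a description of any particular chip. It assumes the triangle has already been clipped, counter-clockwise front faces and sampling at pixel centers. All function names are hypothetical.

```python
import math
from typing import Iterator, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]
Vec4 = Tuple[float, float, float, float]


def perspective_divide(clip: Vec4) -> Vec3:
    """Divide by w: clip-space (x, y, z, w) -> normalized device coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


def viewport_transform(ndc: Vec3, width: int, height: int) -> Vec3:
    """Affine map from NDC [-1, 1] to window pixels; depth mapped to [0, 1]."""
    x, y, z = ndc
    return ((x + 1.0) * 0.5 * width, (y + 1.0) * 0.5 * height, (z + 1.0) * 0.5)


def edge(a: Vec2, b: Vec2, px: float, py: float) -> float:
    """Edge function: positive when (px, py) lies to the left of the edge a->b."""
    return (b[0] - a[0]) * (py - a[1]) - (b[1] - a[1]) * (px - a[0])


def is_culled(a: Vec2, b: Vec2, c: Vec2) -> bool:
    """Back-face culling, assuming counter-clockwise triangles are front-facing.
    Degenerate (zero-area) triangles are culled as well."""
    return edge(a, b, c[0], c[1]) <= 0.0


def generate_fragments(a: Vec2, b: Vec2, c: Vec2) -> Iterator[Tuple[int, int]]:
    """Setup + scan: yield pixels whose centers lie inside a counter-clockwise triangle."""
    xs, ys = (a[0], b[0], c[0]), (a[1], b[1], c[1])
    for y in range(math.floor(min(ys)), math.floor(max(ys)) + 1):
        for x in range(math.floor(min(xs)), math.floor(max(xs)) + 1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            if (edge(a, b, px, py) >= 0.0 and edge(b, c, px, py) >= 0.0
                    and edge(c, a, px, py) >= 0.0):
                yield (x, y)


if __name__ == "__main__":
    clip_tri = [(-1.0, -1.0, 0.0, 1.0), (1.0, -1.0, 0.0, 1.0), (-1.0, 1.0, 0.0, 1.0)]
    win = [viewport_transform(perspective_divide(v), 8, 8) for v in clip_tri]
    a, b, c = [(v[0], v[1]) for v in win]
    if not is_culled(a, b, c):
        print(len(list(generate_fragments(a, b, c))))
```

Edge functions are used here because they make the culling test and the inside test the same computation. Real setup units may use different, incremental formulations.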

  6. Pixel Shader • Shades fragments: calculate texture address, read texture, color operations. • Pixel Shader program and constants (from GPU memory?). • Texture read: TMU (texture sample, filter unit, texture cache, GPU memory). • Optional: • Modify depth coordinate (1 Z output). • Render to texture (up to 4 color outputs). • Input: fragment (12x4D). • Output: color (2x4D).
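The texture address calculation and filtering done by the TMU can be illustrated with one bilinear sample. The sketch makes several assumptions: a single-channel texture stored as nested lists, repeat (wrap) addressing, and texel centers at half-integer coordinates. It leaves out the texture cache, mipmapping and anisotropic filtering. The function names are hypothetical.

```python
import math
from typing import List

Texture = List[List[float]]  # texture[row][column]; one channel to keep it short


def fetch_texel(tex: Texture, x: int, y: int) -> float:
    """Read one texel, using repeat (wrap) addressing for out-of-range coordinates."""
    height, width = len(tex), len(tex[0])
    return tex[y % height][x % width]


def sample_bilinear(tex: Texture, u: float, v: float) -> float:
    """Bilinearly filter the texture at normalized coordinates (u, v).

    Texel centers sit at half-integer positions, so (u, v) is converted to
    texel space and shifted by half a texel before finding the 2x2 footprint.
    """
    height, width = len(tex), len(tex[0])
    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0  # fractional blend weights
    t00 = fetch_texel(tex, x0, y0)
    t10 = fetch_texel(tex, x0 + 1, y0)
    t01 = fetch_texel(tex, x0, y0 + 1)
    t11 = fetch_texel(tex, x0 + 1, y0 + 1)
    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy
```

With `tex = [[0.0, 1.0], [2.0, 3.0]]`, sampling at a texel center such as (0.25, 0.25) returns that texel unchanged. Sampling at (0.5, 0.5) blends all four texels equally.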

  7. Fragment Operations and Tests • Includes (OpenGL): • Fog. • Color Sum. • Ownership Test. • Scissor Test. • Alpha Test. • Stencil Test. • Depth Test. • Blend. • Logic Operation. • Accesses framebuffer (GPU memory). Updates framebuffer. • Framebuffer: color, Z and stencil. • OpenGL/Direct3D render state defines operations. • Input: color. • Output: FB updated.
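A few of the per-fragment operations can be sketched as one function that runs the tests in the listed order. This is an assumption-laden model, not a hardware description. It implements only the scissor, alpha, stencil and depth tests plus alpha blending, with fixed comparison functions chosen for illustration. Fog, color sum, the ownership test, stencil updates and logic operations are left out. `Framebuffer`, `RenderState` and `process_fragment` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

RGBA = Tuple[float, ...]  # always (r, g, b, a) here


class Framebuffer:
    """Color, Z and stencil planes, as listed on the slide."""

    def __init__(self, width: int, height: int) -> None:
        self.color = [[(0.0, 0.0, 0.0, 1.0)] * width for _ in range(height)]
        self.depth = [[1.0] * width for _ in range(height)]
        self.stencil = [[0] * width for _ in range(height)]


@dataclass
class RenderState:
    """A small, hypothetical subset of the OpenGL render state."""
    scissor: Tuple[int, int, int, int] = (0, 0, 1 << 16, 1 << 16)  # x, y, width, height
    alpha_ref: float = 0.0   # alpha test passes if alpha > alpha_ref (like GL_GREATER)
    stencil_ref: int = 0     # stencil test passes if stored == ref (like GL_EQUAL); no updates
    blend: bool = True       # SRC_ALPHA, ONE_MINUS_SRC_ALPHA


def process_fragment(fb: Framebuffer, state: RenderState,
                     x: int, y: int, z: float, rgba: RGBA) -> bool:
    """Run one fragment through the tests in OpenGL order; return True if it was written."""
    sx, sy, sw, sh = state.scissor
    if not (sx <= x < sx + sw and sy <= y < sy + sh):
        return False                      # scissor test
    if not rgba[3] > state.alpha_ref:
        return False                      # alpha test
    if fb.stencil[y][x] != state.stencil_ref:
        return False                      # stencil test
    if not z < fb.depth[y][x]:
        return False                      # depth test (like GL_LESS)
    fb.depth[y][x] = z                    # depth write
    if state.blend:
        a = rgba[3]
        dst = fb.color[y][x]
        fb.color[y][x] = tuple(s * a + d * (1.0 - a) for s, d in zip(rgba, dst))
    else:
        fb.color[y][x] = rgba
    return True
```

Rejecting as early as possible mirrors the ordering on the slide: cheap tests that need no framebuffer read (scissor, alpha) come before the stencil and depth tests, which do.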

  8. Vertex Shader • The command processor sends a vertex stream to the vertex shaders. • A vertex buffer stores data read through DMA. • A vertex cache (~10 vertices) can be used to avoid executing the vertex shader twice for the same vertex. • The vertex stream is grouped into primitives and sent to the rasterizer.
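The benefit of the small vertex cache can be estimated by replaying the index stream of an indexed triangle list. The slide gives only the approximate size (~10 vertices). The FIFO replacement policy, the cache keyed by vertex index and the function name below are assumptions.

```python
from collections import deque
from typing import Deque, Iterable


def count_vertex_shader_runs(indices: Iterable[int], cache_size: int = 10) -> int:
    """Count vertex shader executions for an indexed primitive stream.

    A hit in the post-transform cache reuses the already shaded vertex;
    a miss runs the vertex shader and inserts the result, evicting the
    oldest entry (FIFO) when the cache is full.
    """
    cache: Deque[int] = deque()
    runs = 0
    for index in indices:
        if index in cache:
            continue          # cache hit: no shader execution
        runs += 1             # cache miss: shade the vertex
        cache.append(index)
        if len(cache) > cache_size:
            cache.popleft()   # evict the oldest vertex
    return runs
```

For a quad drawn as two triangles with indices `[0, 1, 2, 2, 1, 3]`, this model runs the shader 4 times instead of 6.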

  9. Hardware Pipeline

  10. Vertex Shader Architecture • SIMD architecture. Registers are 128 bits wide, holding four 32-bit fields. • Instruction set: typical arithmetic instructions (vector mul, add), some special instructions (ARL, DST), some complex mathematical instructions (EXP, COS), and support for branching, loops and procedures. • 3 different sources of data: • Input stream (~16 registers). • Constants (~256 registers). • Temporaries (~16 registers). • 2 different destinations: • Output stream (~15 registers). • Temporaries (~16 registers). • Conditional registers (NV30) and boolean constants (R300, DX9) for conditional ‘execution’.

  11. Vertex Shader Inputs and Outputs

  12. Vertex Shader Architecture

  13. Vertex Shader: NV20 • Exposes programmability of a small part of the geometry pipeline. • Vertex load & store, format conversion, primitive assembly, clipping and triangle setup occur completely in parallel, in pipelined fashion. • 4-wide, fine-grained SIMD FP provides the necessary performance, and multiple execution threads maintain efficiency and provide a very simple programming model.

  14. NV20: Introduction • Independent vertices. • IEEE single-precision FP. • 4-component vectors (x, y, z, w). • Input registers can have their components arbitrarily rearranged/replicated (swizzled). • Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
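The swizzle, replicated-scalar and write-mask rules can be modeled on 4-tuples. In this sketch, `swizzle`, `write_masked` and `dp3` are hypothetical helper names, not part of any driver API. DP3 stands in for any operation that produces a scalar.

```python
from typing import Tuple

Vec4 = Tuple[float, ...]  # always four components here
COMPONENT = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(v: Vec4, pattern: str = "xyzw", negate: bool = False) -> Vec4:
    """Rearrange or replicate source components, e.g. 'wzyx' or 'xxxx'."""
    assert len(pattern) == 4, "a full swizzle names four components"
    out = tuple(v[COMPONENT[c]] for c in pattern)
    return tuple(-c for c in out) if negate else out


def write_masked(dst: Vec4, result: Vec4, mask: str = "xyzw") -> Vec4:
    """Only the components named in the write mask are updated."""
    return tuple(result[i] if name in mask else dst[i]
                 for i, name in enumerate("xyzw"))


def dp3(a: Vec4, b: Vec4) -> Vec4:
    """A scalar-producing operation: the result is replicated to all four components."""
    s = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return (s, s, s, s)
```

Because scalars are replicated, a write mask alone selects where a scalar lands. For example, `write_masked(dst, dp3(n, l), "x")` puts N·L in x and leaves y, z and w untouched.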

  15. NV20: Program Model

  16. NV20: Input Attributes • Input Attributes: • 16 quad-float vertex source attribute registers. • Position, normal, two colors, up to 8 texture coordinate sets, skin weights, fog and point size. • Default 0.0 for second and third components, 1.0 for the fourth. • Attributes are persistent. • Only one vertex attribute may be read per program instruction. • Constant memory: • 96 quad floats. • Can only be loaded before vertices are processed. • Only one constant may be read per program instruction. • The program may not write to constants.

  17. NV20: Input Attributes • Integer address register: • Loaded using ARL. • Indexed constant reads with out-of-range reads returning (0,0,0,0). • Read/Write register file: • 12 quad floats. • Three reads and one write per instruction. • Initialized to (0,0,0,0) per vertex. • Any vector read may be sourced as multiple operands and individually swizzled/negated each time.
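Indexed constant reads through the address register can be sketched as below. Two details come from the slides: the 96-entry size and the (0,0,0,0) result for out-of-range reads. The class and function names are hypothetical. The truncation in `arl` follows this deck's later note that ARR rounds "rather than truncates", not a checked hardware specification.

```python
import math
from typing import List, Tuple

Vec4 = Tuple[float, ...]          # always four components here
ZERO: Vec4 = (0.0, 0.0, 0.0, 0.0)
NUM_CONSTANTS = 96                # NV20 constant memory size, per the slides


class ConstantMemory:
    """Constant registers c[0..95]; read-only from the program's point of view."""

    def __init__(self) -> None:
        self._regs: List[Vec4] = [ZERO] * NUM_CONSTANTS

    def load(self, index: int, value: Vec4) -> None:
        """Host-side load, done before vertices are processed."""
        self._regs[index] = value

    def read(self, index: int) -> Vec4:
        """Out-of-range reads return (0, 0, 0, 0) instead of faulting."""
        if 0 <= index < NUM_CONSTANTS:
            return self._regs[index]
        return ZERO


def arl(value: float) -> int:
    """Load the integer address register A0.x from a float (truncation assumed)."""
    return math.trunc(value)


def read_relative(mem: ConstantMemory, a0: int, offset: int) -> Vec4:
    """Relative addressing, c[A0.x + offset], e.g. for indexing a table of matrices."""
    return mem.read(a0 + offset)
```

Returning zeros for bad indices means a program with a stray index produces a harmless value rather than a fault. That fits a pipeline with no exceptions.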

  18. NV20: Output attributes • Standard mapping onto the fixed-function pipeline, with the position given as a homogeneous clip-space point. • Position for clipping. • Vertex color outputs clamped to the range 0.0 to 1.0. • Fog distance, point size. • 8 texture coordinates. • All instruction writes have an optional 4-component write mask. • Outputs are initialized to (0.0, 0.0, 0.0, 1.0).

  19. NV20: Instruction Set. • No branching. • Constant latency: any instruction can be issued every clock, and all instructions execute with the same latency. All operands are immediately available, which limits the size of the registers and memory banks.

  20. NV20: Hardware Implementation • Two blocks: vertex attribute buffer (VAB) and the floating point core.

  21. NV20: VAB • The VAB is responsible for vertex attribute persistence. • 16 input attributes. • When a write to an address is received, the attribute is set to the default (0.0, 0.0, 0.0, 1.0) and the valid data overwrites the written components. • The VAB drains into a number of input buffers (IB) that feed the FP core in round-robin fashion. • Dirty bits are maintained in the VAB so that only changed attributes are updated when the same buffer is again the drain target. • The transfer of a vertex is triggered by a write to address 0 (vertex position). • To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding the default drain sequence.
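The persistence and dirty-bit behavior described above can be modeled in software. This is a hedged sketch. Keeping one dirty bit per attribute per input buffer is our reading of the slide, and the number of input buffers is an assumed parameter. The bubble-avoiding push-out of in-flight writes is not modeled, and the class name is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec4 = Tuple[float, ...]               # always four components here
NUM_ATTRIBS = 16
DEFAULT: Vec4 = (0.0, 0.0, 0.0, 1.0)


class VertexAttributeBuffer:
    """Toy model of attribute persistence, dirty bits and round-robin draining."""

    def __init__(self, num_input_buffers: int = 2) -> None:
        self.attribs: List[Vec4] = [DEFAULT] * NUM_ATTRIBS
        self.input_buffers = [[DEFAULT] * NUM_ATTRIBS for _ in range(num_input_buffers)]
        # dirty[ib][a]: attribute a changed since input buffer ib was last filled
        self.dirty = [[True] * NUM_ATTRIBS for _ in range(num_input_buffers)]
        self.next_ib = 0
        self.transfers = 0             # attributes copied into input buffers

    def write(self, address: int, components: Sequence[float]) -> Optional[int]:
        """Write 1-4 components; unwritten components take the default value.
        A write to address 0 (position) drains the vertex; returns the IB filled."""
        assert 1 <= len(components) <= 4
        value = list(DEFAULT)
        value[:len(components)] = components
        self.attribs[address] = tuple(value)
        for flags in self.dirty:
            flags[address] = True
        return self._drain() if address == 0 else None

    def _drain(self) -> int:
        ib = self.next_ib
        for a in range(NUM_ATTRIBS):
            if self.dirty[ib][a]:      # copy only attributes that changed
                self.input_buffers[ib][a] = self.attribs[a]
                self.dirty[ib][a] = False
                self.transfers += 1
        self.next_ib = (ib + 1) % len(self.input_buffers)  # round robin
        return ib
```

In this model, after each input buffer has been filled once, a vertex that changes only its position costs a single attribute transfer. The color written earlier persists in the buffer without being copied again.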

  22. NV20: VAB

  23. NV20: Floating Point Core • Processes the instruction set. • Multithreaded vector processor operating on quad-float data. • Vertex data read from input buffers and transformed into output buffers (OB). • Same latency for vector and special function units. • Multiple vertex threads are used to hide this latency. • SIMD VU: MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, SGE. • Special FU: RCP, RSQ, LOG, EXP, LIT. • VU is approximately IEEE (no denormalized numbers or exceptions, rounding always toward negative infinity). • 1 instruction per clock and all input/output options have no performance penalty. • All input vectors are available with no latency.
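The semantics of a few vector-unit instructions from the list above can be written as plain functions. This sketch uses Python floats (IEEE double, round-to-nearest). It therefore does not reproduce the core's precision, missing denormals or rounding behavior. The DST formula follows NVIDIA's NV_vertex_program definition as we understand it. The helper names, and the assumption that a matrix is held as four row vectors in constant registers, are illustrative.

```python
from typing import Sequence, Tuple

Vec4 = Tuple[float, ...]   # always four components here


def mad(a: Vec4, b: Vec4, c: Vec4) -> Vec4:
    """Component-wise a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def dp4(a: Vec4, b: Vec4) -> Vec4:
    """Four-component dot product, replicated to all components."""
    s = sum(x * y for x, y in zip(a, b))
    return (s, s, s, s)


def slt(a: Vec4, b: Vec4) -> Vec4:
    """Set on less than: 1.0 where a < b, else 0.0."""
    return tuple(1.0 if x < y else 0.0 for x, y in zip(a, b))


def sge(a: Vec4, b: Vec4) -> Vec4:
    """Set on greater or equal: 1.0 where a >= b, else 0.0."""
    return tuple(1.0 if x >= y else 0.0 for x, y in zip(a, b))


def dst(a: Vec4, b: Vec4) -> Vec4:
    """Distance vector: (1, a.y * b.y, a.z, b.w).
    With a = (_, d*d, d*d, _) and b = (_, 1/d, _, 1/d) this yields
    (1, d, d*d, 1/d), the terms of a constant/linear/quadratic attenuation."""
    return (1.0, a[1] * b[1], a[2], b[3])


def transform(rows: Sequence[Vec4], position: Vec4) -> Vec4:
    """Matrix * vector as four DP4s, one per matrix row."""
    return tuple(dp4(row, position)[0] for row in rows)
```

For example, a translation by 5 in x, given as rows `(1, 0, 0, 5), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)`, maps `(1, 2, 3, 1)` to `(6, 2, 3, 1)` in four DP4 instructions.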

  24. NV20: Floating Point Core

  25. Vertex Shader: R300 • 4 vertex shader units. • 1 scalar unit, 1 vector unit. • Registers: • ALU Registers: • Constants: 256 read-only vectors. • Temporary: 12 read/write vectors. • Input: 16 read-only vectors. • Output: 15 write-only vectors. • Flow Control Registers: • Integer Constant: 16 read-only vectors. • Address: 1 read/write vector. • Loop Counter: 1 scalar. • Boolean Constant: 16 read-only bits.

  26. R300: Instructions • Shaders up to 256 instructions long. • Up to 64K executed instructions per vertex. • ALU instructions: ADD, DP3, DP4, EXP, EXPP, EXPE, FRAC, LOG, LOGP, MAD, MADDX2, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT. • Control Flow instructions: CALL, LOOP, ENDLOOP, JUMP, JNZ, LABEL, REPEAT, ENDREPEAT, RETURN. • Address Instructions: ARL, ARR. • Graphic Instructions: DST, LIT. • Instructions based on DX9 VS2.0.

  27. NV30: Overview • Supports all VS1 instructions and features. • Beyond VS2? • Condition codes. • Branches and subroutines. • Modifiers: absolute value. • User clip support (new output registers CLP0-CLP5). • New instructions. • More registers.

  28. NV30: Overview • Up to 256 instructions per program. • Up to 64K executed instructions per vertex. • 16 temporary registers. • 2 vector address registers. • 256 program parameters (constants).

  29. NV30: Condition Codes • 4-component register: • LT: less than zero. • EQ: equal to zero. • GT: greater than zero. • UN: unordered, for comparisons involving NaN. • Instructions optionally update condition code state: • “C” suffix: DP4C, MOVC. • “CC” pseudo-register for updating condition codes. • Condition codes are used in: • Branches and procedure call/return. • Result masking.
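Condition-code updates and condition tests can be modeled per component. In this sketch the rule set (EQ, LT, LE, GT, GE plus always-true TR and always-false FL) and the names `update_cc`, `eval_condition` and `conditional_write` are illustrative assumptions. Conditional result masking writes only the components whose swizzled test passes.

```python
import math
from typing import Tuple

Vec4 = Tuple[float, ...]   # always four components here
CC = Tuple[str, ...]       # one of "LT", "EQ", "GT", "UN" per component

# Which condition-code values satisfy each test (UN satisfies only TR).
PASSES = {
    "EQ": {"EQ"},
    "LT": {"LT"},
    "LE": {"LT", "EQ"},
    "GT": {"GT"},
    "GE": {"GT", "EQ"},
    "TR": {"LT", "EQ", "GT", "UN"},  # always true
    "FL": set(),                     # always false
}


def update_cc(result: Vec4) -> CC:
    """What a "C"-suffixed instruction (e.g. MOVC) records for its result."""
    codes = []
    for c in result:
        if math.isnan(c):
            codes.append("UN")       # unordered: comparison involving NaN
        elif c < 0.0:
            codes.append("LT")
        elif c == 0.0:
            codes.append("EQ")
        else:
            codes.append("GT")
    return tuple(codes)


def eval_condition(cc: CC, rule: str, swz: str = "xyzw") -> Tuple[bool, ...]:
    """Evaluate a condition such as (LE.xyww): swizzle the codes, then test each."""
    return tuple(cc["xyzw".index(ch)] in PASSES[rule] for ch in swz)


def conditional_write(dst: Vec4, result: Vec4, cc: CC, rule: str,
                      swz: str = "xyzw") -> Vec4:
    """Conditional result masking: write only components whose test passes."""
    mask = eval_condition(cc, rule, swz)
    return tuple(r if m else d for d, r, m in zip(dst, result, mask))
```

The slides do not say how a branch combines the four per-component results. One plausible assumption is that it is taken when any swizzled component passes. Under that assumption, `any(eval_condition(cc, "LE", "xyww"))` would decide `BRA label (LE.xyww)`.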

  30. NV30: Modifiers • Source: • Swizzle • Negate • Absolute • Target: • Masking • Conditional masking

  31. NV30: Branching and subroutines • BRA: • Unconditional. • Conditional: BRA label (LE.xyww) • Computed (indirect): BRA [A1.z] (GT.x) • Call & return for subroutines: • CAL & RET. • Same options as branches. • Four levels of subroutine execution. • No parameter stack.
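The subroutine limits can be illustrated with a tiny interpreter. It is a hypothetical model, not the NV30 instruction encoding. Only unconditional BRA, CAL and RET are handled (conditional and computed forms are omitted). The return stack holds return addresses only, since there is no parameter stack. The slides do not say what happens on overflow or on a RET with nothing to return to; raising an error here is an assumption.

```python
from typing import List

MAX_CALL_DEPTH = 4   # "four levels of subroutine execution"

# Instructions are tuples: ("OP", name), ("BRA", target), ("CAL", target), ("RET",), ("END",)
Instr = tuple


class CallDepthError(RuntimeError):
    """Raised when a CAL would exceed the return-address stack."""


def run(program: List[Instr], max_steps: int = 1000) -> List[str]:
    """Execute unconditional BRA/CAL/RET and return the names of executed ops."""
    trace: List[str] = []
    return_stack: List[int] = []   # return addresses only: no parameter stack
    pc = 0
    for _ in range(max_steps):
        kind = program[pc][0]
        if kind == "END":
            return trace
        if kind == "OP":
            trace.append(program[pc][1])
            pc += 1
        elif kind == "BRA":
            pc = program[pc][1]
        elif kind == "CAL":
            if len(return_stack) >= MAX_CALL_DEPTH:
                raise CallDepthError(f"call depth exceeds {MAX_CALL_DEPTH}")
            return_stack.append(pc + 1)
            pc = program[pc][1]
        elif kind == "RET":
            if not return_stack:
                raise RuntimeError("RET with an empty return stack")
            pc = return_stack.pop()
        else:
            raise ValueError(f"unknown instruction {kind!r}")
    raise RuntimeError("step limit exceeded")
```

A consequence of the fixed depth is that recursion is effectively unavailable. A subroutine that calls itself fails on the fifth nested CAL in this model.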

  32. NV30: Clipping • New output registers: o[CLP0]..o[CLP5]. • GL_CLIP_PLANEn enabled. • Clip coordinate n interpolated across the primitive. • Only the portion of the primitive where the clip coordinate is greater than zero is rasterized. • Hardware performs a fast trivial reject if all clip coordinates of a primitive are negative.
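Clipping against one user clip plane can be sketched with Sutherland-Hodgman polygon clipping, given the per-vertex values written to o[CLPn]. This is a software stand-in chosen for illustration. The slides only say that the clip coordinate is interpolated and that primitives whose clip coordinates are all negative are trivially rejected; they do not say how the hardware clips. Vertices exactly on the plane (distance 0) are kept here, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]   # any attribute vector that is interpolated linearly


def trivially_rejected(clip_dists: Sequence[float]) -> bool:
    """Fast reject: every vertex lies on the negative side of this clip plane."""
    return all(d < 0.0 for d in clip_dists)


def clip_polygon(verts: Sequence[Point], dists: Sequence[float]) -> List[Point]:
    """Sutherland-Hodgman clipping against one user clip plane.

    dists[i] is the clip coordinate written for verts[i]; the part of the
    polygon where the linearly interpolated distance is >= 0 is kept.
    """
    out: List[Point] = []
    n = len(verts)
    for i in range(n):
        a, da = verts[i], dists[i]
        b, db = verts[(i + 1) % n], dists[(i + 1) % n]
        if da >= 0.0:
            out.append(a)                       # a is inside: keep it
        if (da >= 0.0) != (db >= 0.0):          # edge crosses the plane
            t = da / (da - db)                  # where the distance reaches 0
            out.append(tuple(p + t * (q - p) for p, q in zip(a, b)))
    return out
```

For the triangle (0,0), (2,0), (0,2) with clip distance 1 − x, the vertex at (2,0) is cut off. The result is the quadrilateral (0,0), (1,0), (1,1), (0,2).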

  33. NV30: New Instructions • ARL: now supports loading the 4-component A0 and A1 integer registers. • ARR: like ARL except it rounds rather than truncates before storing the integer result in an address register. • BRA, CAL, RET: branching instructions. • COS, SIN: high-precision trigonometric functions. • FLR, FRC: floor and fraction of floating-point values. • EX2, LG2: high-precision exponentiation and logarithm functions. • ARA: adds pairs of components of an address register, useful for looping and other operations. • SEQ, SFL, SGT, SLE, SNE, STR: add six “set on” instructions similar to SLT and SGE. • SSG: “set sign” operation generates a vector holding –1.0 for negative operand components, 0.0 for zero components, and +1.0 for positive components.
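The new comparison-style and rounding instructions are easy to state as component-wise functions. In this sketch the lowercase function names are hypothetical. SFL and STR are read as constant false/true tests, as their names suggest, and Python doubles stand in for the hardware's floating-point format.

```python
import math
from typing import Callable, Tuple

Vec4 = Tuple[float, ...]   # always four components here


def _set_on(pred: Callable[[float, float], bool]) -> Callable[[Vec4, Vec4], Vec4]:
    """Build a "set on" instruction: 1.0 where pred(a, b) holds, else 0.0."""
    def op(a: Vec4, b: Vec4) -> Vec4:
        return tuple(1.0 if pred(x, y) else 0.0 for x, y in zip(a, b))
    return op


seq = _set_on(lambda x, y: x == y)
sne = _set_on(lambda x, y: x != y)
sgt = _set_on(lambda x, y: x > y)
sle = _set_on(lambda x, y: x <= y)
sfl = _set_on(lambda x, y: False)   # "set on false": always 0.0
str_ = _set_on(lambda x, y: True)   # "set on true": always 1.0 (trailing _ avoids Python's str)


def ssg(a: Vec4) -> Vec4:
    """Set sign: -1.0, 0.0 or +1.0 per component."""
    return tuple(-1.0 if x < 0.0 else (1.0 if x > 0.0 else 0.0) for x in a)


def flr(a: Vec4) -> Vec4:
    """Floor of each component."""
    return tuple(float(math.floor(x)) for x in a)


def frc(a: Vec4) -> Vec4:
    """Fractional part x - floor(x), e.g. 0.75 for -1.25."""
    return tuple(x - math.floor(x) for x in a)
```

Because the set-on results are exactly 0.0 or 1.0, they can be multiplied into other values to select between alternatives without branching.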

  34. NV30: Instruction List • Add & multiply instructions: ADD, DP3, DP4, DPH, MAD, MOV, SUB. • Math functions: ABS, COS, EX2, FLR, FRC, LG2, LOG, RCP, RSQ, SIN. • Set on instructions: SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR. • Branching instructions: BRA, CAL, RET. • Address register instructions: ARL, ARA. • Graphics-oriented instructions: DST, LIT, RCC, SSG. • Minimum/maximum instructions: MAX, MIN.

  35. Others • Antialiasing • Anisotropic Filtering (textures). • Line Antialiasing. • Edge Antialiasing • Full Screen Antialiasing (FSAA): • Supersampling. • MultiSampling. • TBDR: Tile Based Deferred Rendering (STMicro PowerVR). • HOS (High Order Surfaces): N-Patches, Bezier, Displacement Mapping, TruForm, Tessellation.
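Supersampling FSAA can be illustrated by its final step. The image is rendered at a higher resolution, and each display pixel becomes the average of its block of samples. The sketch assumes an ordered sample grid, a box filter and a single-channel image, and the function name is hypothetical. Multisampling differs: color is shaded once per pixel while coverage and depth are kept per sample. This sketch does not model that.

```python
from typing import List

Image = List[List[float]]  # image[row][column], one channel


def resolve_supersampled(samples: Image, factor: int) -> Image:
    """Box-filter an image rendered at `factor` times the resolution in each axis."""
    height = len(samples) // factor
    width = len(samples[0]) // factor
    area = factor * factor
    resolved: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            total = sum(samples[y * factor + j][x * factor + i]
                        for j in range(factor) for i in range(factor))
            row.append(total / area)   # average of the factor x factor block
        resolved.append(row)
    return resolved
```

With a factor of 2, each displayed pixel costs four shaded samples. That cost is the main reason multisampling was introduced as a cheaper alternative.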
