
Graphics processors

Norm Rubin, compiler architect (normanr@ati.com)

Size of market
  • Many millions of GPUs are shipped per month
  • The 3D market is entertainment (games)
  • Each new generation of GPU adds enough performance to support a new version of a game.
  • Each time a game is released, players have to replace hardware to run the game.
  • The game industry is larger than Hollywood.
Technology view
[Figure: cpu and gpu compared along three axes]
  • architecture: Proprietary / Commodity
  • interfaces: Mutable / Locked down
  • performance / function: Not enough / ok / Too good

How much headroom
  • Pixar uses 100,000 minutes of compute per minute of image
  • GPUs are real time, so closing a 100,000x gap takes about 20 doublings
  • Most optimistic marketing version of Moore’s law: performance doubles every 6 months
  • So there are 10 years to go.
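
The headroom arithmetic can be checked in a few lines. This is a minimal sketch using only the figures on the slide; the function names are made up for illustration. Note that log2(100,000) is closer to 17 than to 20, so the slide's 20 doublings (a factor of about one million) reads as a rounded-up estimate.

```python
import math

# Figures from the slide. The 6-month doubling period is the slide's
# "most optimistic marketing" assumption, not a measured rate.
GAP = 100_000              # offline minutes of compute per minute of image
MONTHS_PER_DOUBLING = 6

def doublings_needed(gap):
    """Number of performance doublings required to close a speed gap."""
    return math.log2(gap)

def years_to_close(doublings, months_per_doubling):
    """Years needed for the given number of doublings."""
    return doublings * months_per_doubling / 12

exact = doublings_needed(GAP)
print(f"doublings needed for a {GAP}x gap: {exact:.1f}")
print(f"years at that count: {years_to_close(exact, MONTHS_PER_DOUBLING):.1f}")
print(f"years at the slide's 20 doublings: {years_to_close(20, MONTHS_PER_DOUBLING):.1f}")
```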
Application space
  • Problems are embarrassingly parallel
  • Problems are big: the screen is 1000 x 1000 and the program runs per pixel, including some pixels that are behind others, so 10 * 1000 * 1000 calls per frame * 20-60 frames per second
  • The same program runs over and over, so GPUs are SIMD machines
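
As a rough worked example of the problem size, the per-second call count follows directly from the slide's numbers. The function name and the term "overdraw" for the factor of 10 are labels chosen here, not taken from the talk.

```python
def shader_calls_per_second(width, height, overdraw, fps):
    """Pixel program invocations per second for a given screen and frame rate."""
    return width * height * overdraw * fps

# Slide figures: 1000 x 1000 screen, factor of 10 for hidden pixels, 20-60 fps.
low = shader_calls_per_second(1000, 1000, 10, 20)
high = shader_calls_per_second(1000, 1000, 10, 60)
print(f"{low:,} to {high:,} calls per second")
```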
SIMD
  • There are many units executing in parallel
    • These are in lock-step, executing the same instruction on different pixels/vertices at the same time
    • Dynamic flow control can cause inefficiencies in such an architecture, since different pixels/vertices can take different code paths
    • Dynamic branching is not always a performance win
    • For an if…then…else, the hardware needs to execute both sides, turning processors on and off.
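
The cost of divergence can be illustrated with a small lock-step simulation. This is a sketch under stated assumptions, not a model of any particular chip: lanes are list elements, the execution mask is a list of booleans, and a side of the branch is skipped only when no lane needs it. The name `simd_if_else` is hypothetical.

```python
def simd_if_else(values, cond, then_op, else_op):
    """Run an if/else over all lanes in lock-step using an execution mask.

    Returns the per-lane results and the number of passes executed.
    """
    mask = [cond(v) for v in values]
    result = list(values)
    passes = 0
    # "then" side: only lanes whose mask bit is set are switched on.
    if any(mask):
        result = [then_op(v) if m else r for v, m, r in zip(values, mask, result)]
        passes += 1
    # "else" side: the remaining lanes are switched on instead.
    if not all(mask):
        result = [r if m else else_op(v) for v, m, r in zip(values, mask, result)]
        passes += 1
    return result, passes

# Coherent lanes take one pass; divergent lanes pay for both sides.
coherent = simd_if_else([1, 2, 3, 4], lambda v: v > 0, lambda v: v * 2, lambda v: -v)
divergent = simd_if_else([1, -2, 3, -4], lambda v: v > 0, lambda v: v * 2, lambda v: -v)
```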
Application space
  • Many values are coherent: values in neighboring pixels are close.
  • Compute coherent variables at selected points and use interpolation to find the intermediate values
  • Today the programmer specifies which variables are coherent by splitting programs in two.
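
A minimal sketch of the idea: a value is computed at two selected points and the pixels in between get interpolated values. Linear interpolation along a single span is an assumption made for brevity; the function names are illustrative only.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t

def interpolate_span(left, right, n_pixels):
    """Spread a value computed at two endpoints across n_pixels pixels."""
    if n_pixels == 1:
        return [left]
    return [lerp(left, right, i / (n_pixels - 1)) for i in range(n_pixels)]

# A value computed only at the two ends, filled in for five pixels.
span = interpolate_span(0.0, 1.0, 5)
```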
Application space
  • A common subproblem is texture filtering
    • Evaluate some array of memory around a stencil and combine
    • Provide a small fixed set of stencil patterns in hardware
    • You could think of this as slightly smart memory
    • Hardware support for 1D to 3D arrays and several filtering functions
    • Exact stencil patterns and combining operations are proprietary (some look better than others)
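
Bilinear filtering is one widely known example of a stencil-and-combine operation: read a 2x2 neighborhood and blend by the fractional position. The sketch below is a textbook version written for illustration; since the slide notes that the real hardware patterns are proprietary, it should not be read as what any GPU actually does.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a 2D texture (list of rows) at fractional texel coordinates (u, v)."""
    height, width = len(texture), len(texture[0])
    # Clamp the sample point so the 2x2 stencil never reads outside the texture.
    u = min(max(u, 0.0), width - 1)
    v = min(max(v, 0.0), height - 1)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Combine: blend horizontally on both rows, then blend the rows.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0.0, 1.0],
       [2.0, 3.0]]
center = bilinear_sample(tex, 0.5, 0.5)
```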
Application space
  • Little communication between processing elements
  • Approximate spatial derivatives with a 2x2 difference operator
  • This forces all machine designs to work on multiples of four pixels
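
The 2x2 difference operator can be sketched as follows: within a quad of four neighboring pixels, horizontal and vertical derivatives are approximated by subtracting neighboring values. The quad layout and the function name are assumptions for illustration.

```python
def quad_derivatives(quad):
    """Approximate derivatives of a value over a 2x2 pixel quad.

    quad = [[top_left, top_right], [bottom_left, bottom_right]]
    Returns (ddx, ddy), each holding one difference per row or column.
    """
    (tl, tr), (bl, br) = quad
    ddx = (tr - tl, br - bl)   # horizontal differences, one per row
    ddy = (bl - tl, br - tr)   # vertical differences, one per column
    return ddx, ddy

ddx, ddy = quad_derivatives([[1.0, 3.0],
                             [2.0, 4.0]])
```

Because every difference needs a neighbor in the same quad, all four pixels have to be processed together, which is why the quad is the natural unit of work.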
Application space
  • Throughput is important
  • Use threading to cover latency
  • The chips can support hundreds of threads, and can switch from thread to thread every cycle
    • No thread switch overhead
    • Hardware scheduler and thread system
    • Compiler knows about threads and splits resources over threads
  • Caches are very different: they can only cover spatial locality
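
Why hundreds of threads help can be seen with an idealized model: each thread does some cycles of ALU work, then waits on memory, and switching threads is free. The formula and the cycle counts below are hypothetical illustration values, not figures from the talk.

```python
import math

def alu_utilization(threads, work_cycles, latency_cycles):
    """Fraction of cycles the ALU is busy, assuming zero-cost thread switches."""
    return min(1.0, threads * work_cycles / (work_cycles + latency_cycles))

def threads_to_hide(work_cycles, latency_cycles):
    """Smallest thread count that keeps the ALU fully busy in this model."""
    return math.ceil((work_cycles + latency_cycles) / work_cycles)

# Hypothetical numbers: 4 cycles of work per 196-cycle memory fetch.
single = alu_utilization(1, 4, 196)
needed = threads_to_hide(4, 196)
```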
Programming model
  • Performance is much less than users want
  • At least 100,000 times less
  • Most developers write each program at least four times
    • Xbox
    • Playstation
    • ATI top machine
    • Nvidia top machine
  • Programs are in two parts: vertex and pixel shaders.
Programming model 2
  • Programs could be written in a high level language (C like) HLSL/OGL2
  • Or in virtual assembly language (DirectX, …)
    • Almost one dialect per chip
    • The languages are virtual, but the resources are physical.
  • Developers review virtual machine listings for performance
  • Developers ship virtual assembly language.
Programming model 3
  • At game startup, virtual assembly language is JIT compiled to real machine language
    • Drastic change in resource requirements
    • Somewhat hard to debug
    • Hard to identify performance bottlenecks
  • Even though applications could build code on the fly, developers pretest everything: they want the most performance to get the best-looking image. They can only approximate what they really want.
Programmable Pipeline
  • Vertex Data (Model space)
  • Geometry Stage: Fixed Function Transform and Lighting, or Vertex Shader
  • Clipping and Viewport Mapping
  • Rasterizer Stage: Texture Stages, or Pixel Shader
  • Fog, Alpha, Stencil, Depth Testing

Vertex Processing Flow
  • Per-Vertex Data (Triangle Mesh): Position, Normal, Texture Coordinates, etc.
  • Constants: View Matrix, Projection Matrix, Skin/Bone Matrices, Light Positions, etc.
  • Vertex Shader Engine: Vertex Shader Instructions, Temporary Registers
  • Output: Position, “Texture” Coordinates, Color(s)

Vertex Shader

  • Input:
    • Program specifies vertex data
      • Position
      • Normal
      • Vertex color
      • Texture coordinate(s)
    • Data is sent to the graphics card and processed by the vertex shader
  • Output:
    • Vertex shader computes output quantities
      • Position
      • Vertex color: diffuse and specular
      • Texture coordinate(s)
    • Sent to rasterizer via interpolators
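
The input/output contract above can be sketched as a plain function: per-vertex data goes in, and a transformed position, a color, and texture coordinates come out. This is an illustrative stand-in, not real shader code; the parameter names (`mvp`, `light_dir`, `base_color`) and the simple diffuse term are assumptions, and specular color is left out.

```python
def mat_vec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-element vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def dot3(a, b):
    """Dot product of two 3-element vectors."""
    return sum(x * y for x, y in zip(a, b))

def vertex_shader(position, normal, uv, mvp, light_dir, base_color):
    """Run once per vertex: transform the position and light the vertex."""
    clip_pos = mat_vec4(mvp, position + [1.0])      # homogeneous position
    n_dot_l = max(0.0, dot3(normal, light_dir))     # simple diffuse term
    diffuse = [c * n_dot_l for c in base_color]
    # These outputs are what the interpolators would receive.
    return {"position": clip_pos, "diffuse": diffuse, "uv": uv}

identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
out = vertex_shader([1.0, 2.0, 3.0], [0.0, 0.0, 1.0], [0.5, 0.5],
                    identity, [0.0, 0.0, 1.0], [1.0, 0.5, 0.25])
```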
Pixel Processing Flow
  • Interpolated Values: “Texture” Coordinates, Color(s)
  • Constants: Light Colors, Ambient Lighting Colors, etc.
  • Textures
  • Pixel Shader Engine: Pixel Shader Instructions, Temporary Registers
  • Output: Color, Multi-Render Target
Program sizes
  • Most programs are very small
  • 100 virtual instructions would be a large program
  • Basic data type is a four element vector of floats
  • Integer data types are not yet available
  • Dynamic branching is new
  • Small amount of nesting allowed
Polygons
  • Polygon Budget
    • Ruby : 75,000
    • Optico: 50,000
    • Ninja: 25,000
    • Environment: 100,000
    • Props: 50,000
  • Lighting Limits
    • 3 Dynamic lights per shot (1 shadow casting)
    • Lightmaps used for set
  • Animation Limits
    • 35 total blend shapes
    • 5 simultaneous blend shapes
    • 4 weighted bones per vertex
    • Number of on-screen characters limited to 4 at once
Shader Breakdown
  • Depth of Field
  • Hair
  • Skin
Shader Breakdown
  • Glows
  • Motion Blur
  • Reflections
Hardware view
  • X1900
  • Xbox 360
  • Both machines are current
Pixel Shader Processors
  • Quad Pixel Shader Core
  • Texture Address Units 1 to 4
    • 1 texture address instruction per unit per clock cycle
  • Pixel Shader Processor, per clock cycle:
    • 1 vec3 ADD + input modifier
    • 1 scalar ADD + input modifier
    • 1 vec3 ADD/MUL/MADD
    • 1 scalar ADD/MUL/MADD
    • 1 flow control instruction

Vertex Engine
  • Upgraded to support SM3.0
    • Dynamic flow control
    • 1,024 instructions (practically unlimited with flow control)
    • More temporary registers
  • 8 Vertex Shader Processors
    • Each can handle 2 shader instructions per clock
    • 10 billion instructions per second
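
The vertex engine figures can be cross-checked against each other. The slide does not state a clock rate; the sketch below only derives the clock that the other three numbers imply, and the function name is made up for illustration.

```python
def implied_clock_hz(instr_per_second, processors, instr_per_clock):
    """Clock rate implied by a throughput figure and the issue width."""
    return instr_per_second / (processors * instr_per_clock)

# Slide figures: 10 billion instructions/s, 8 processors, 2 instructions per clock.
clock = implied_clock_hz(10e9, 8, 2)
print(f"implied clock: {clock / 1e6:.0f} MHz")
```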
Ring Bus Memory Controller
  • Supports today’s fastest graphics memory devices
    • GDDR3, 48+ GB/sec
    • GDDR4, The future
  • 512-bit Ring Bus
    • Simplifies layout and enables extreme memory clock scaling
  • New Cache Design
    • Fully Associative for more optimal performance
  • Improved Hyper Z
    • Better compression and hidden surface removal
  • Programmable Arbitration Logic
    • Maximizes memory efficiency
    • Can be upgraded via software
Memory Channels - 4x Improvement in Random Access over X850
  • Radeon X1900: 8 x 32-bit channels, 8 banks per DRAM
  • Radeon X850: 4 x 64-bit channels, 4 banks per DRAM
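
One plausible reading of the 4x figure, offered as an assumption since the slide does not say how it was derived: the total bus width is unchanged, but the number of independently addressable banks (channels times banks per DRAM, assuming one DRAM per channel) goes up by four.

```python
def bus_width_bits(channels, bits_per_channel):
    """Total memory bus width across all channels."""
    return channels * bits_per_channel

def independent_banks(channels, banks_per_dram):
    """Banks that can be open at once, assuming one DRAM per channel."""
    return channels * banks_per_dram

x1900 = {"width": bus_width_bits(8, 32), "banks": independent_banks(8, 8)}
x850 = {"width": bus_width_bits(4, 64), "banks": independent_banks(4, 4)}
ratio = x1900["banks"] / x850["banks"]
```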

Cache Design
  • Fully Associative Caches
    • Cache lines can map to any location in external memory
    • Earlier designs used Direct Mapped & N-Way Associative Caches
    • Could only access limited blocks of external memory
  • Texture, Color, Z & Stencil caches are all now fully associative
    • Reduces memory bandwidth requirements
    • Minimizes cache contention stalls
    • Optimized game performance
    • Gains up to 25% clock for clock in fill/bandwidth bound cases

[Figure: graphics memory mapped to cache lines, Direct Mapped Cache versus Fully Associative Cache]
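
The contention difference between the two cache organizations can be shown with a toy simulation. This is a sketch only: addresses are treated as block numbers, the direct mapped index is `address % num_lines`, and the fully associative cache uses LRU replacement, which is an assumption rather than anything the slide states.

```python
from collections import OrderedDict

def direct_mapped_misses(addresses, num_lines):
    """Count misses when each block can live in exactly one cache line."""
    lines = [None] * num_lines
    misses = 0
    for a in addresses:
        idx = a % num_lines
        if lines[idx] != a:
            lines[idx] = a          # evict whatever was in that line
            misses += 1
    return misses

def fully_associative_misses(addresses, num_lines):
    """Count misses when any block can live in any line (LRU eviction)."""
    cache = OrderedDict()
    misses = 0
    for a in addresses:
        if a in cache:
            cache.move_to_end(a)    # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)   # evict least recently used
            cache[a] = True
    return misses

# Blocks 0 and 4 collide in a 4-line direct mapped cache and keep evicting each other.
pattern = [0, 4, 0, 4, 0, 4]
dm = direct_mapped_misses(pattern, 4)
fa = fully_associative_misses(pattern, 4)
```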

Xbox 360
  • 3.2GHz Custom IBM Central Processor
    • Three CPU Cores
    • Two Threads Per Core
    • VMX Unit Per Core
    • 128 VMX Registers Per Thread
    • 1MB L2 Cache (Lockable by Graphics Processor)
  • 500MHz Custom ATI Graphics Processor
    • Unified Shader Core
    • 48 ALUs for Vertex or Pixel Shader processing
    • 16 Filtered & 16 Unfiltered Texture samples per clock
    • 10MB eDRAM Framebuffer
  • 512MB System RAM
    • Unified Memory Architecture (UMA)
    • 128-bit interface
    • 700MHz GDDR3 RAM
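
The memory figures imply a peak bandwidth that can be worked out directly. The sketch assumes the GDDR3 transfers data on both clock edges (double data rate), which the slide does not state explicitly; the function name is illustrative.

```python
def peak_bandwidth_bytes(bus_bits, clock_hz, transfers_per_clock):
    """Peak memory bandwidth in bytes per second."""
    return (bus_bits // 8) * clock_hz * transfers_per_clock

# Slide figures: 128-bit interface, 700MHz GDDR3.
# Assumption: two transfers per clock (double data rate).
bandwidth = peak_bandwidth_bytes(128, 700_000_000, 2)
print(f"{bandwidth / 1e9:.1f} GB/sec")
```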
Architecture
[Block diagram; labeled blocks:]
  • Command Processor, Vertex Grouper, Sequencer
  • Vertex Cache, Primitive Assembly, Scan Converter
  • Shader Interp, Pipe Comm
  • Shader Pipe (x16), three instances
  • Texture Cache, Texture Pipe (four instances)
  • Memory Hub
  • 10MB DRAM with Z/Alpha/Stencil Processors
  • 256 GB/sec (bandwidth label)

Adaptive Shader Array
  • Unified shader architecture
    • One processor type
    • Dynamic load balancing
    • Pixel and vertex processing where and when they’re needed
  • 48 shaders
    • 120 billion operations per second
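
The throughput figure can be related to the 500MHz graphics processor clock quoted on the Xbox slide. The result, five operations per shader per clock, would be consistent with a four-wide vector operation plus a scalar operation each cycle, but that breakdown is an assumption and is not stated in the slides.

```python
def ops_per_shader_per_clock(ops_per_second, shaders, clock_hz):
    """Operations each shader must complete per clock to reach the total."""
    return ops_per_second / (shaders * clock_hz)

# Slide figures: 120 billion operations/s, 48 shaders, 500MHz clock.
per_clock = ops_per_shader_per_clock(120e9, 48, 500e6)
```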
Some interesting problems
  • Coherence (branch prediction?)
  • What are the right instructions?
  • Can you do non-graphics applications?
  • Programming language
  • Threading by compiler
  • Offline compile?
Implications for programming languages
  • GPU: you can convince people to use a new language if you can prove it is faster, even if it means lots of changes
  • Desktop CPU: you have to prove it can meet some other (non-performance/function) need
  • Top-of-the-line GPU prices are going up while top-of-the-line desktop CPU prices are going down, so there is lots of chance to do cool design.
  • Less need to be backward compatible.
More info
  • http://www.ati.com/developer/index.html