GPU and PC System Architecture UC Santa Cruz BSoE – March 2009 John Tynefield / NVIDIA Corporation. My Goals. Survey history and direction of GPU/PC system architecture Demonstrate the process of system level architectural problem solving Motivate some of you to become architects.

GPU and PC System Architecture

UC Santa Cruz BSoE – March 2009

John Tynefield / NVIDIA Corporation

My Goals

Survey history and direction of GPU/PC system architecture

Demonstrate the process of system level architectural problem solving

Motivate some of you to become architects

Disclaimers

I work for NVIDIA

Public Info

All numbers and dates approximate

Rounding is our friend

No bus/processor is 100% efficient, etc, etc

All examples are meant to be illustrative

Not comprehensive

“There were >40 gfx companies in 1995”

About Me

I love games and graphics

I love building things

Structure

Intro to PC and GPU Architecture

A Sampling of Architectures

1996 - Voodoo Graphics / Pentium

2000 - GeForce 256 / P3

2004 - GeForce 6800 / P4

2008 - GeForce GTX280 / Core2

Ideas for the future of the platform

What do architects do?

Impose structure on complex design problems

Make tradeoffs

Validate high risk design bets

Structure verification

Why this is a great time to be an Architect

Radical design mobility

I have contributed to 10 completely new processor designs

7 of which shipped in millions of units.

Steep competition

Not for everybody

Changing the World…no…really!

Heterogeneous many core computing is here to stay and it has changed the nature of computing

Design Tension

Fixed Function vs. Programmable

Scalar vs. Vector

Bandwidth vs. Latency

In Order vs. Out of Order

Limited vs. Unlimited (virtualized) resources

Technology Trends

CPUs get faster

GPUs get faster

Interconnects get faster

Memory gets faster

Memory gets denser

Latency increases

Feature load increases

Physics intrudes more and more

All at different rates

The long time horizon

The Awesome ideas of now take 2+ years to reach market

Awesome depreciates rapidly

Predictable

Silicon Process Roadmap

PC Arch Roadmap

3rd Party Component Roadmap

Your capabilities and resources

Unpredictable

Market Shifts (commodity prices, supply shocks)

3rd Party Strategic Errors (OS/platform/partner slips)

Innovative Competition (N-way struggle for design initiative)

Ultra Simplified PC Anatomy

[Block diagram labels: CPU, Core Logic, System Memory, GPU, GPU Memory]

Ultra Simplified GPU Anatomy

[Block diagram labels: Host Logic; Processor (x3); DRAM MGMT (x3)]

Ultra Simplified GPU Anatomy (2)

[Block diagram labels: Host Logic; Processor (x3); DRAM MGMT (x3); Memory. Pipeline stages: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

GPU Prehistory

1960s – 1970s

Single Purpose BIG IRON

E&S, GE, Lockheed, …

1980s – 1990s

General Purpose BIG IRON

Custom ASICs, Workstations

SGI, Sun, Intergraph, …

1994

Maybe we can fit this on a single consumer add-in card?

Enabling Technologies in 1994

Fast consumer CPUs with floating point

Try 3D rendering in fixed point!

PCI

VGA and VESA

Id Software’s DOOM

Contract fabrication facilities offering 0.6 micron

ASIC design tools

1996 3dfx - Voodoo Graphics

PIO Programming Model

Pure Pipelined Graphics

Partial Triangle Setup – FP32

Fixed Point Integer Texture Mapping and Gouraud Shading

Z Buffer and Full OpenGL Blending

All at 1 PPC (pixel per clock), all the time, with no caches

32-bit PCI - 0.09 GB/s

128-bit EDO 50 MHz DRAM - 0.8 GB/s
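
The two bandwidth figures above follow from bus width times clock rate. A minimal sketch of that arithmetic, assuming one transfer per clock and decimal gigabytes; the function name and parameters are illustrative, not from the slides:

```python
def peak_bandwidth_gbs(bus_bits: int, clock_mhz: float, transfers_per_clock: int = 1) -> float:
    """Theoretical peak bandwidth of a parallel bus in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9

# 128-bit EDO DRAM at 50 MHz, as listed for Voodoo Graphics
edo_gbs = peak_bandwidth_gbs(128, 50)

# 32-bit PCI at 33 MHz: the theoretical peak, before protocol overhead
pci_peak_gbs = peak_bandwidth_gbs(32, 33)

# The slide quotes 0.09 GB/s for PCI, i.e. an achievable rate below the peak
pci_quoted_gbs = 0.09
pci_efficiency = pci_quoted_gbs / pci_peak_gbs

print(f"EDO DRAM peak: {edo_gbs:.2f} GB/s")
print(f"PCI peak:      {pci_peak_gbs:.3f} GB/s")
print(f"PCI quoted / peak: {pci_efficiency:.0%}")
```

The gap between the PCI peak and the quoted figure is in line with the earlier disclaimer that no bus is 100% efficient.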

Voodoo Graphics System Architecture

[Block diagram labels: CPU, Core Logic, System Memory; GPU: TMU, TEX Memory, FBI, FB Memory. Pipeline stages: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

Arch Decision – Triangle Setup

Target 3D Triangle with texture and Gouraud shading

3 * (XYW, RGBA, ST) = 72 bytes/triangle pre-setup

32-bit PCI 33 MHz – 90 MB/s

1.25M triangles/second speed of light (1M is magic)

Observe that post-setup

3 * XY WRGBAST start values + screen space derivatives + Area

76 bytes/triangle – 1.18M triangles/second (still magic)

Setup can be coded on Pentium in ~100 clocks

1M triangles/second on a P100 (marketing happy)

Data-limited setup on chip: >10% die cost

Typical game scenes <<1000 triangles/frame
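
The numbers in this decision can be checked with a few lines of arithmetic. A minimal sketch using only the figures quoted on the slide, except the frame rate, which is an assumption added here for illustration:

```python
PCI_BYTES_PER_SEC = 90e6   # effective PCI rate quoted on the slide
PRE_SETUP_BYTES = 72       # bytes per triangle if the chip does setup
POST_SETUP_BYTES = 76      # bytes per triangle if the CPU does setup
PENTIUM_HZ = 100e6         # P100 clock
SETUP_CLOCKS = 100         # approximate CPU clocks per triangle setup
ASSUMED_FPS = 60           # assumption for illustration, not from the slide

def bus_limited_tris_per_sec(bytes_per_triangle: float) -> float:
    """Triangle rate if the PCI bus is the only bottleneck."""
    return PCI_BYTES_PER_SEC / bytes_per_triangle

pre_setup_rate = bus_limited_tris_per_sec(PRE_SETUP_BYTES)
post_setup_rate = bus_limited_tris_per_sec(POST_SETUP_BYTES)
cpu_setup_rate = PENTIUM_HZ / SETUP_CLOCKS

# Demand from a scene at the slide's upper bound of 1000 triangles/frame
scene_demand = 1000 * ASSUMED_FPS

print(f"Bus limit, setup on chip: {pre_setup_rate / 1e6:.2f} M tris/s")
print(f"Bus limit, setup on CPU:  {post_setup_rate / 1e6:.2f} M tris/s")
print(f"CPU setup limit (P100):   {cpu_setup_rate / 1e6:.2f} M tris/s")
print(f"Scene demand at {ASSUMED_FPS} fps:   {scene_demand / 1e6:.2f} M tris/s")
```

Under these numbers, moving setup to the CPU costs only a few percent of bus-limited throughput, keeps the rate at about 1M triangles/second, and avoids the on-chip setup hardware.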

2000 NVIDIA GeForce 256

Decoupled input queuing

Hardware Transform & Lighting

FP32 FF Transform

FP22 FF Lighting

Complex fixed function pixel shading

4 Pipelines

AGP4X – 1.06 GB/s

256-bit DDR 300 MHz Memory – 19.2 GB/s

GeForce 256 System Architecture

[Block diagram labels: CPU, Core Logic, System Memory, GPU, GPU Memory. Pipeline stages: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

Architecture Detail – Combiners

Logical fixed function extension of OpenGL Machine

Surface Color = Diffuse * Texture + Specular

[Diagram labels: Diffuse Color, Texture, Specular]
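
The fixed-function equation above can be written out per color channel. A minimal sketch, assuming RGB tuples in the range 0.0 to 1.0 and clamping of the result (the clamp is an assumption here, not stated on the slide); the function names are illustrative:

```python
def clamp01(x: float) -> float:
    """Clamp a channel value to the displayable range [0.0, 1.0]."""
    return max(0.0, min(1.0, x))

def surface_color(diffuse, texture, specular):
    """Surface Color = Diffuse * Texture + Specular, applied per channel."""
    return tuple(clamp01(d * t + s) for d, t, s in zip(diffuse, texture, specular))

lit = surface_color(
    diffuse=(1.0, 0.5, 0.25),
    texture=(0.5, 0.5, 0.5),
    specular=(0.25, 0.25, 0.25),
)
print(lit)  # (0.75, 0.5, 0.375)
```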

Multi Texture

If one texture is good, more are better

Diff * (Tex1 + Tex2) + Spec, or Diff * Tex1 * Tex2, or …

[Diagram labels: Diffuse Color, Texture, Texture2, Specular, and the constants 0.0 and 1.0]

Combiners

Cascading Mux / SOP / Mux / SOP pipeline

Very flexible, but harder to program with deeper nesting

Everything is full speed!

[Diagram labels: A MUX, B MUX, C MUX, D MUX; AB Partial, CD Partial; Texture, Fog, Light; Inputs for Next Stage of Pipeline]
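
One way to picture a Mux / SOP (sum of products) stage is as a function that selects four inputs A, B, C, D from a register set and produces A*B + C*D for the next stage. The sketch below is a toy single-channel model under that assumption, not the actual hardware; the register names, the `spare` output register, and the clamp are all illustrative choices:

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def combiner_stage(regs, a, b, c, d, out):
    """One Mux / SOP stage: select four registers, write A*B + C*D to `out`."""
    ab = regs[a] * regs[b]          # AB partial product
    cd = regs[c] * regs[d]          # CD partial product
    result = dict(regs)             # registers passed on to the next stage
    result[out] = clamp01(ab + cd)
    return result

regs = {
    "diffuse": 0.5, "specular": 0.125,
    "tex1": 0.25, "tex2": 0.5,
    "zero": 0.0, "one": 1.0,        # constants, used to switch terms off or pass them through
    "spare": 0.0,
}

# Diff * Tex1 + Spec in a single stage (Spec * 1.0 passes Spec through)
single = combiner_stage(regs, "diffuse", "tex1", "specular", "one", "spare")

# Diff * (Tex1 + Tex2) + Spec needs two cascaded stages
stage1 = combiner_stage(regs, "tex1", "one", "tex2", "one", "spare")
stage2 = combiner_stage(stage1, "diffuse", "spare", "specular", "one", "spare")

# Diff * Tex1 * Tex2, also two stages (0.0 * 0.0 switches the CD term off)
mod1 = combiner_stage(regs, "tex1", "tex2", "zero", "zero", "spare")
mod2 = combiner_stage(mod1, "diffuse", "spare", "zero", "zero", "spare")

print(single["spare"], stage2["spare"], mod2["spare"])  # 0.25 0.5 0.0625
```

The constants 0.0 and 1.0 are what let one fixed A*B + C*D datapath express several different blend equations, and the deeper nesting shows why longer expressions become harder to program by hand.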

Programmable Shading

But the future was obviously Renderman-like shaders

normal surfaceN;

color C = { 1.0, 0.5, 0.0 };

normal lightDirection;

Ci = C * dot ( surfaceN, lightDirection );
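
For readers unfamiliar with shading languages, the shader above can be mirrored in ordinary code. A minimal sketch, assuming unit-length 3-component vectors passed as tuples; like the slide's version, it does not clamp negative dot products:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def shade(surface_n, light_direction, c=(1.0, 0.5, 0.0)):
    """Ci = C * dot(surfaceN, lightDirection), evaluated per color channel."""
    n_dot_l = dot(surface_n, light_direction)
    return tuple(channel * n_dot_l for channel in c)

facing = shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))   # light straight along the normal
angled = shade((0.0, 0.0, 1.0), (0.0, 0.6, 0.8))   # light tilted away from the normal
print(facing)
print(angled)
```

The point of the slide is that this function is arbitrary code supplied by the application, rather than a fixed equation selected through muxes.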

2004 NVIDIA GeForce 6800

Fully general Vertex and Pixel ISA

6 Geometry Processors

16 Pixel Processors

Deep recirculating pipelines to hide latency

FP32 datapath end to end

AGP8X – 2.11 GB/s

256-bit 700 MHz GDDR3 – 44 GB/s
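
The "deep recirculating pipelines to hide latency" item can be reasoned about with Little's law: work in flight equals throughput times latency. A rough sketch follows; the 200-clock fetch latency is a hypothetical value chosen for illustration, not a GeForce 6800 figure:

```python
import math

def items_needed(latency_clocks: int, issue_per_clock: float = 1.0) -> int:
    """Independent work items that must be in flight to keep the pipeline full."""
    return math.ceil(latency_clocks * issue_per_clock)

def utilization(items_in_flight: int, latency_clocks: int, issue_per_clock: float = 1.0) -> float:
    """Fraction of peak issue rate sustained with a given amount of work in flight."""
    return min(1.0, items_in_flight / (latency_clocks * issue_per_clock))

ASSUMED_FETCH_LATENCY = 200  # clocks; hypothetical, not from the slides

needed = items_needed(ASSUMED_FETCH_LATENCY)
starved = utilization(50, ASSUMED_FETCH_LATENCY)
full = utilization(400, ASSUMED_FETCH_LATENCY)

print(f"Pixels in flight needed: {needed}")
print(f"Utilization with 50 in flight:  {starved:.0%}")
print(f"Utilization with 400 in flight: {full:.0%}")
```

Under this model, a pipeline that holds many independent pixels can keep issuing work while earlier memory fetches are still outstanding, which is the bandwidth-over-latency trade listed under Design Tension.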

GeForce 6800 System Architecture

[Block diagram labels: CPU, Core Logic, System Memory, GPU, GPU Memory. Pipeline stages: Physics and AI, Scene Mgmt, Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

Architecture Decision – Tex/Shader Structure

Problem: Build a general programmable pipeline

Optimize for common workloads

TEX – BLEND – FOG

Common Game Shaders (e.g. Doom 3)

Plan A – Uncoupled

Elegant

Small fundamental unit

Many “passes” for common shaders

[Diagram labels: TBF; TEXMTH; TEX, BLND, BLND; Registers, Math, Texture]

Plan B - Coupled

Less Elegant

Larger Fundamental Unit

Single pass for common shaders

Good scaling for longer shaders

Big perf / area win given workloads

Not forward looking

[Diagram labels: Registers, Math, Texture, Math]

2008 - GeForce GTX280

Fully unified programmable architecture

240 instances of the same processor

IEEE FP32 and FP64

Gen2 PCIe – 8 GB/s

512-bit 1100 MHz GDDR3 – 144 GB/s
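
With all four sampled generations listed, the earlier claim that components improve "all at different rates" can be checked against the slides' own (approximate) numbers. A small sketch that compares host-link and GPU-memory bandwidth growth; the table simply restates the figures quoted above:

```python
# (year, GPU, host link GB/s, GPU memory GB/s), as quoted on the slides
GENERATIONS = [
    (1996, "Voodoo Graphics", 0.09, 0.8),
    (2000, "GeForce 256", 1.06, 19.2),
    (2004, "GeForce 6800", 2.11, 44.0),
    (2008, "GeForce GTX280", 8.0, 144.0),
]

def annual_rate(first: float, last: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (last / first) ** (1 / years) - 1

first, last = GENERATIONS[0], GENERATIONS[-1]
years = last[0] - first[0]

host_growth = last[2] / first[2]
mem_growth = last[3] / first[3]
host_rate = annual_rate(first[2], last[2], years)
mem_rate = annual_rate(first[3], last[3], years)

for year, gpu, host, mem in GENERATIONS:
    print(f"{year} {gpu:16s} memory/host bandwidth ratio: {mem / host:5.1f}")

print(f"Host link growth: {host_growth:.0f}x ({host_rate:.0%} per year)")
print(f"GPU memory growth: {mem_growth:.0f}x ({mem_rate:.0%} per year)")
```

By these figures, local GPU memory bandwidth grew roughly twice as much as the host link over the twelve years, which helps explain why work keeps migrating onto the GPU side of the bus.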

GeForce GTX280 System Architecture

[Block diagram labels: CPU, Core Logic, System Memory, GPU, GPU Memory. Pipeline stages: Physics and AI, Scene Mgmt, Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

Architecture Decision – Heterogeneous Computing Support

Build a bigger Chip

Radically improve ability of GPU to share work with the CPU

[Diagram: Thread with Register File and Local Memory; Block with Shared Memory; Grid 0, Grid 1, . . . as sequential grids in time, sharing Global Memory]

Computing Support

Add Efficient Thread Launching

Add General Load / Store Instructions and Datapath

Add Shared Memory

Add computational loads to performance design requirements
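
The thread / block / grid hierarchy and the role of shared memory can be emulated sequentially to show the data flow. This is a toy CPU-side model, not CUDA code: `launch`, the phase functions, and the dictionary standing in for global memory are all hypothetical names, and splitting a kernel into phases stands in for a block-wide barrier.

```python
def launch(grid_dim, block_dim, phases, global_mem):
    """Emulate one grid: every block gets private shared memory; phases act as barriers."""
    for block_idx in range(grid_dim):
        shared = [0] * block_dim                  # visible to this block only
        for phase in phases:                      # all threads finish a phase before the next
            for thread_idx in range(block_dim):
                phase(block_idx, thread_idx, block_dim, shared, global_mem)

def load_phase(block_idx, thread_idx, block_dim, shared, g):
    """Each thread loads one element from global memory into shared memory."""
    i = block_idx * block_dim + thread_idx        # global index of this thread
    shared[thread_idx] = g["data"][i] if i < len(g["data"]) else 0

def reduce_phase(block_idx, thread_idx, block_dim, shared, g):
    """Thread 0 of each block sums the block's shared memory into global memory."""
    if thread_idx == 0:
        g["block_sums"][block_idx] = sum(shared)

# Grid 0: three blocks of four threads produce per-block partial sums
grid0_mem = {"data": list(range(10)), "block_sums": [0, 0, 0]}
launch(3, 4, [load_phase, reduce_phase], grid0_mem)

# Grid 1: launched afterwards, consumes Grid 0's results via global memory
grid1_mem = {"data": grid0_mem["block_sums"], "block_sums": [0]}
launch(1, 4, [load_phase, reduce_phase], grid1_mem)

print(grid0_mem["block_sums"])  # [6, 22, 17]
print(grid1_mem["block_sums"])  # [45]
```

The second launch illustrates sequential grids in time: blocks cannot see each other's shared memory, so results cross block boundaries only through global memory between launches.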

Future Graphics Directions

Higher density

Higher refresh

Higher dynamic range

Ubiquity

Lower Power

Shaving off the last burrs

Global Illumination

Higher quality modeling

Virtualized resources at interactive rates

Future PC Architecture Directions

Highly Integrated – Low Cost

Require a minimum visual feature set

Web/video/run today’s apps

And everyone else

Differentiated PCs

More bandwidth and more parallel horsepower

More mature unified programming models

C on CUDA

DX11

OpenCL

More resource virtualization