
GPU and PC System Architecture UC Santa Cruz BSoE – March 2009



Presentation Transcript


  1. GPU and PC System Architecture UC Santa Cruz BSoE – March 2009 John Tynefield / NVIDIA Corporation

  2. My Goals
     - Survey history and direction of GPU/PC system architecture
     - Demonstrate the process of system level architectural problem solving
     - Motivate some of you to become architects

  3. Disclaimers
     - I work for NVIDIA
     - Public info
     - All numbers and dates approximate
     - Rounding is our friend
     - No bus/processor is 100% efficient, etc., etc.
     - All examples are meant to be illustrative
     - Not comprehensive
     - “there were >40 gfx companies in 1995”

  4. About Me
     - I love games and graphics
     - I love building things

  5. Structure
     - Intro to PC and GPU Architecture
     - A Sampling of Architectures
       - 1996 - Voodoo Graphics / Pentium
       - 2000 - GeForce 256 / P3
       - 2004 - GeForce 6800 / P4
       - 2008 - GeForce GTX280 / Core2
     - Ideas for the future of the platform

  6. What do architects do?
     - Impose structure on complex design problems
     - Make tradeoffs
     - Validate high risk design bets
     - Structure verification

  7. Why this is a great time to be an Architect
     - Radical design mobility
     - I have contributed to 10 completely new processor designs, 7 of which shipped in millions of units
     - Steep competition
     - Not for everybody
     - Changing the World…no…really!
     - Heterogeneous many core computing is here to stay and it has changed the nature of computing

  8. Design Tension
     - Fixed Function vs. Programmable
     - Scalar vs. Vector
     - Bandwidth vs. Latency
     - In Order vs. Out of Order
     - Limited vs. Unlimited (virtualized) resources

  9. Technology Trends
     - CPUs get faster
     - GPUs get faster
     - Interconnects get faster
     - Memory gets faster
     - Memory gets denser
     - Latency increases
     - Feature load increases
     - Physics intrudes more and more
     - All at different rates

  10. The long time horizon
      - The Awesome ideas of now take 2+ years to reach market
      - Awesome depreciates rapidly
      - Predictable
        - Silicon Process Roadmap
        - PC Arch Roadmap
        - 3rd Party Component Roadmap
        - Your capabilities and resources
      - Unpredictable
        - Market Shifts (commodity prices, supply shocks)
        - 3rd Party Strategic Errors (OS/platform/partner slips)
        - Innovative Competition (N-way struggle for design initiative)

  11. Ultra Simplified PC Anatomy
      [Diagram: CPU, CPU Core Logic, System Memory, GPU, GPU Memory]

  12. Ultra Simplified GPU Anatomy
      [Diagram: Host Logic and three Processor / DRAM MGMT pairs]

  13. Ultra Simplified GPU Anatomy (2)
      [Diagram: Host Logic, three Processor / DRAM MGMT pairs, and Memory; stage labels: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

  14. GPU Prehistory
      - 1960s – 1970s
        - Single Purpose BIG IRON
        - E&S, GE, Lockheed, …
      - 1980s – 1990s
        - General Purpose BIG IRON
        - Custom ASICs, Workstations
        - SGI, Sun, Intergraph, …
      - 1994
        - Maybe we can fit this on a single consumer add-in card?

  15. Enabling Technologies in 1994
      - Fast consumer CPUs with floating point
        - Try 3D rendering in fixed point!
      - PCI, VGA and VESA
      - Id Software’s DOOM
      - Contract fabrication facilities offering .6 micron
      - ASIC design tools

  16. 1996 3dfx - Voodoo Graphics
      - PIO Programming Model
      - Pure Pipelined Graphics
      - Partial Triangle Setup – FP32
      - Fixed Point Integer Texture Mapping and Gouraud Shading
      - Z Buffer and Full OpenGL Blending
      - All at 1 PPC (pixel per clock), all the time, with no caches
      - 32-bit PCI - .09 GB/s
      - 128-bit EDO 50 MHz DRAM - .8 GB/s

  17. Voodoo Graphics System Architecture
      [Diagram: CPU, CPU Core Logic, System Memory; GPU with FBI, FB Memory, TMU, TEX Memory; stage labels: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

  18. Arch Decision – Triangle Setup
      - Target: 3D triangle with texture and Gouraud shading
      - 3 * (XYW, RGBA, ST) = 72 bytes/triangle pre setup
      - 32-bit PCI at 33 MHz – 90 MB/s
      - 1.25M triangles/second speed of light (1M is magic)
      - Observe that post setup:
        - 3 * XY
        - W, RGBA, ST start values + screen space derivatives + area
        - 76 bytes/triangle – 1.18M tris (still magic)
      - Setup can be coded on Pentium in ~100 clocks
        - 1M triangles on P100 (mktg happy)
      - Data-limited setup on chip - >10% die cost
      - Typical game scenes <<1000 triangles/frame
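The arithmetic on this slide can be checked with a short sketch. The 90 MB/s bus rate, the 72 and 76 byte triangle sizes, and the ~100 clock setup cost on a 100 MHz Pentium are the slide's approximate figures. The per-vertex byte layout (4-byte X, Y, W, S, T plus RGBA packed into one 4-byte word) is an assumption chosen because it reproduces the 72 byte total, and the function names are illustrative only.

```python
# Triangle-rate budget for the setup decision, using the slide's approximate figures.
PCI_BYTES_PER_SEC = 90e6  # effective 32-bit / 33 MHz PCI throughput quoted on the slide


def vertex_bytes():
    # Assumption: X, Y, W, S and T are 4-byte values and RGBA is one packed
    # 4-byte word; this layout reproduces the slide's 72 bytes per triangle.
    return 3 * 4 + 4 + 2 * 4


def bus_limited_rate(bytes_per_triangle, bus_bytes_per_sec=PCI_BYTES_PER_SEC):
    """Triangles per second if the bus is the only bottleneck."""
    return bus_bytes_per_sec / bytes_per_triangle


def cpu_limited_rate(cpu_hz, clocks_per_triangle):
    """Triangles per second if CPU-side setup is the only bottleneck."""
    return cpu_hz / clocks_per_triangle


pre_setup_bytes = 3 * vertex_bytes()     # raw vertices sent over PCI
post_setup_bytes = 76                    # slide's figure for already set-up triangles

pre_rate = bus_limited_rate(pre_setup_bytes)
post_rate = bus_limited_rate(post_setup_bytes)
cpu_rate = cpu_limited_rate(100e6, 100)  # Pentium 100, ~100 clocks per triangle setup

print(f"pre-setup : {pre_setup_bytes} B/tri, {pre_rate / 1e6:.2f} M tris/s")
print(f"post-setup: {post_setup_bytes} B/tri, {post_rate / 1e6:.2f} M tris/s")
print(f"CPU setup : {cpu_rate / 1e6:.2f} M tris/s")
```

Under these assumptions every path stays at or above the 1M triangles/second mark, which appears to be the slide's argument: sending set-up triangles costs little bus rate, the CPU can keep up, and the die area for on-chip setup can be saved.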

  19. 2000 NVIDIA GeForce 256
      - Decoupled input queuing
      - Hardware Transform & Lighting
        - FP32 FF Transform
        - FP22 FF Lighting
      - Complex fixed function pixel shading
      - 4 Pipelines
      - AGP4X – 1.06 GB/s
      - 256-bit DDR 300 MHz Memory – 19.2 GB/s

  20. GeForce 256 System Architecture
      [Diagram: CPU, CPU Core Logic, System Memory, GPU, GPU Memory; stage labels: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend]

  21. Architecture Detail – Combiners
      - Logical fixed function extension of OpenGL Machine
      - Surface Color = Diffuse * Texture + Specular
      [Diagram: Diffuse Color, Texture, Specular]

  22. Multi Texture
      - If one texture is good, more are better
      - Diff * (Tex1 + Tex2) + Spec, or Diff * Tex1 * Tex2, or …
      [Diagram: Diffuse Color, Texture, Texture2, Specular, constants 0.0 and 1.0]
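The combiner equations on the two slides above can be written out per color channel. This is a minimal sketch rather than a description of the hardware: the clamp to [0, 1] is an assumption (the slides do not state the numeric range), colors are plain RGB tuples, and the function names are made up for illustration.

```python
def clamp01(x):
    # Assumption: results saturate to [0, 1], as is usual for fixed-point color.
    return max(0.0, min(1.0, x))


def modulate_add(diffuse, texture, specular):
    """Surface Color = Diffuse * Texture + Specular, per channel."""
    return tuple(clamp01(d * t + s) for d, t, s in zip(diffuse, texture, specular))


def two_texture_sum(diffuse, tex1, tex2, specular):
    """Diff * (Tex1 + Tex2) + Spec, per channel."""
    return tuple(
        clamp01(d * (t1 + t2) + s)
        for d, t1, t2, s in zip(diffuse, tex1, tex2, specular)
    )


def two_texture_product(diffuse, tex1, tex2):
    """Diff * Tex1 * Tex2, per channel."""
    return tuple(clamp01(d * t1 * t2) for d, t1, t2 in zip(diffuse, tex1, tex2))


diffuse = (0.5, 0.5, 0.5)
texture = (1.0, 0.5, 0.0)
specular = (0.25, 0.25, 0.25)
single = modulate_add(diffuse, texture, specular)
print(single)
```

Each variant is a different fixed wiring of the same multiplies and adds, which is why adding textures quickly multiplies the number of modes a fixed function design has to expose.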

  23. Combiners
      - Cascading Mux / SOP / Mux / SOP pipeline
      - Very flexible, harder to program with deeper nesting
      - Everything is full speed!
      [Diagram: A, B, C, D muxes; AB partial and CD partial; texture, fog, and light inputs; outputs become inputs for the next stage of the pipeline]
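One way to picture the cascading Mux / SOP structure is a stage that selects four inputs from a register set, forms the partial products AB and CD, and writes the results back for the next stage. The sketch below is an illustrative model only, not the actual hardware interface: it works on scalars, and the register names (`spare0`, `one`, `fog_factor`, `fog_color`, `one_minus_fog`, `out`) and the clamp are assumptions.

```python
def clamp01(x):
    return max(0.0, min(1.0, x))


def combiner_stage(registers, select, outputs):
    """One Mux / SOP stage.

    select:  mux settings, mapping 'A', 'B', 'C', 'D' to source register names.
    outputs: mapping from 'AB', 'CD' or 'SUM' to destination register names.
    Returns the register set seen by the next stage.
    """
    a, b, c, d = (registers[select[k]] for k in "ABCD")
    ab = a * b                      # AB partial product
    cd = c * d                      # CD partial product
    results = {"AB": ab, "CD": cd, "SUM": clamp01(ab + cd)}
    next_registers = dict(registers)
    for key, dest in outputs.items():
        next_registers[dest] = results[key]
    return next_registers


def run_pipeline(registers, stages):
    """Cascade the stages: each stage reads what the previous one wrote."""
    for select, outputs in stages:
        registers = combiner_stage(registers, select, outputs)
    return registers


registers = {
    "light": 0.5, "texture": 0.5, "specular": 0.25, "one": 1.0,
    "fog_factor": 0.5, "fog_color": 0.25, "one_minus_fog": 0.5,
}
stages = [
    # Stage 1: light * texture + specular * 1
    ({"A": "light", "B": "texture", "C": "specular", "D": "one"}, {"SUM": "spare0"}),
    # Stage 2: blend toward fog: spare0 * fog_factor + fog_color * (1 - fog_factor)
    ({"A": "spare0", "B": "fog_factor", "C": "fog_color", "D": "one_minus_fog"}, {"SUM": "out"}),
]
final = run_pipeline(registers, stages)
print(final["out"])
```

The flexibility comes entirely from the mux settings, which also shows why deeper nesting gets harder to program: the author of an effect has to schedule every intermediate value into a register by hand.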

  24. Programmable Shading
      - But the future was obviously RenderMan-like shaders

        normal surfaceN;
        color C = { 1.0, 0.5, 0.0 };
        normal lightDirection;
        Ci = C * dot( surfaceN, lightDirection );
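For readers unfamiliar with shading languages, the shader on this slide can be transcribed into ordinary code. This is a sketch under stated assumptions: the vectors are normalized before the dot product and negative results are clamped to zero, neither of which appears in the slide's fragment, and the function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def shade(surface_n, light_direction, color=(1.0, 0.5, 0.0)):
    """Ci = C * dot(surfaceN, lightDirection), evaluated for one surface point."""
    # Assumption: inputs are normalized here and back-facing light contributes nothing.
    n_dot_l = max(0.0, dot(normalize(surface_n), normalize(light_direction)))
    return tuple(c * n_dot_l for c in color)


facing = shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
away = shade((0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
print(facing, away)
```

The contrast with the combiner slides is that the expression is written directly instead of being mapped onto mux settings.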

  25. 2004 NVIDIA GeForce 6800
      - Fully general Vertex and Pixel ISA
      - 6 Geometry Processors
      - 16 Pixel Processors
      - Deep recirculating pipelines to hide latency
      - FP32 datapath end to end
      - AGP8X – 2.11 GB/s
      - 256-bit 700 MHz GDDR3 – 44 GB/s

  26. GeForce 6800 System Architecture
      [Diagram: CPU, CPU Core Logic, System Memory, GPU, GPU Memory; stage labels: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend, Physics and AI, Scene Mgmt]

  27. Architecture Decision – Tex/Shader Structure
      - Problem: Build a general programmable pipeline
      - Optimize for common workloads
        - TEX – BLEND – FOG
        - Common game shaders (e.g. Doom 3)

  28. Plan A – Uncoupled
      - Elegant
      - Small fundamental unit
      - Many “passes” for common shaders
      [Diagram: Registers, Math, Texture; pass labels: TBF, TEXMTH, TEX, BLND, BLND]

  29. Plan B - Coupled
      - Less Elegant
      - Larger Fundamental Unit
      - Single pass for common shaders
      - Good scaling for longer shaders
      - Big perf / area win given workloads
      - Not forward looking
      [Diagram: Registers, Math, Texture, Math]

  30. 2008 - GeForce GTX280
      - Fully unified programmable architecture
      - 240 instances of the same processor
      - IEEE FP32 and FP64
      - Gen2 PCIE – 8 GB/s
      - 512-bit 1100 MHz GDDR3 – 144 GB/s
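The memory bandwidth figures quoted for the four generations follow from bus width, clock, and transfers per clock. The sketch below recomputes them from the slides' own parameters. It assumes EDO DRAM moves one transfer per clock and DDR/GDDR3 move two, and the table layout and function name are illustrative. Small differences from the quoted numbers fall under the talk's "all numbers approximate" disclaimer.

```python
def peak_bandwidth_gb_s(bus_bits, clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s: bytes per transfer * million transfers per second / 1000."""
    return bus_bits / 8 * clock_mhz * transfers_per_clock / 1000


# (name, bus width in bits, clock in MHz, transfers per clock, GB/s quoted on the slide)
GENERATIONS = [
    ("1996 Voodoo Graphics", 128, 50, 1, 0.8),
    ("2000 GeForce 256", 256, 300, 2, 19.2),
    ("2004 GeForce 6800", 256, 700, 2, 44.0),
    ("2008 GeForce GTX280", 512, 1100, 2, 144.0),
]

computed = {}
for name, bits, mhz, transfers, quoted in GENERATIONS:
    computed[name] = peak_bandwidth_gb_s(bits, mhz, transfers)
    print(f"{name:22s} computed {computed[name]:6.1f} GB/s, quoted {quoted:6.1f} GB/s")
```

On the slides' figures, local memory bandwidth grows by more than two orders of magnitude across the four designs, while the host link grows from .09 GB/s to 8 GB/s over the same span.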

  31. GeForce GTX280 System Architecture
      [Diagram: CPU, CPU Core Logic, System Memory, GPU, GPU Memory; stage labels: Geom Proc, Geom Gather, Triangle Proc, Pixel Proc, Z / Blend, Physics and AI, Scene Mgmt]

  32. Architecture Decision – Heterogeneous Computing Support
      - Build a bigger chip
      - Radically improve ability of GPU to share work with the CPU
      [Diagram: Thread with Register File and Local Memory; Block with Shared Memory; Grid 0, Grid 1, … as sequential grids in time; Global Memory]

  33. Computing Support
      - Add Efficient Thread Launching
      - Add General Load / Store Instructions and Datapath
      - Add Shared Memory
      - Add computational loads to performance design requirements
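The thread / block / grid hierarchy from the two slides above can be emulated on a CPU to show which memory each level sees. This is a sequential toy model, not the CUDA API: `launch`, `make_block_sum`, and the dictionary standing in for global memory are hypothetical names, threads run one after another, and the boundary between phases stands in for a block-wide barrier.

```python
def launch(phases, num_blocks, threads_per_block, global_mem):
    """Run one grid. Each block gets fresh shared memory; global memory persists."""
    for block in range(num_blocks):
        shared = [0.0] * threads_per_block            # block shared memory
        for phase in phases:                          # phase boundary acts as a barrier
            for thread in range(threads_per_block):
                phase(block, thread, threads_per_block, shared, global_mem)


def make_block_sum(src, dst):
    """Build a two-phase kernel that sums each block's slice of global_mem[src]."""

    def load(block, thread, block_dim, shared, g):
        i = block * block_dim + thread                # thread-local value (a register)
        shared[thread] = g[src][i] if i < len(g[src]) else 0.0

    def reduce(block, thread, block_dim, shared, g):
        if thread == 0:                               # one thread writes the block result
            g[dst][block] = sum(shared)

    return [load, reduce]


global_mem = {
    "in": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "partial": [0.0, 0.0],
    "out": [0.0],
}

# Grid 0: two blocks of four threads each produce one partial sum per block.
launch(make_block_sum("in", "partial"), 2, 4, global_mem)
# Grid 1 runs after grid 0 and combines the partials through global memory.
launch(make_block_sum("partial", "out"), 1, 4, global_mem)
print(global_mem["partial"], global_mem["out"])
```

The two launches mirror the "sequential grids in time" label: blocks within a grid cooperate only through their own shared memory, while results pass from one grid to the next through global memory.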

  34. Future Graphics Directions
      - Higher density
      - Higher refresh
      - Higher dynamic range
      - Ubiquity
      - Lower Power
      - Shaving off the last burrs
      - Global Illumination
      - Higher quality modeling
      - Virtualized resources at interactive rates

  35. Future PC Architecture Directions
      - Highly Integrated – Low Cost
      - Require a minimum visual feature set
      - Web/video/run today’s apps
      - And everyone else
      - Differentiated PCs
      - More bandwidth and more parallel horsepower
      - More mature unified programming models
        - C on CUDA
        - DX11
        - OpenCL
      - More resource virtualization

  36. Q & A
