A Single (Unified) Shader GPU Microarchitecture for Embedded Systems

Victor Moya, Carlos González, Jordi Roca, Agustín Fernández

Department of Computer Architecture UPC

Roger Espasa

Intel DEG Barcelona

Introduction
  • Graphics, and 3D graphics in particular, have become an important element of current PDAs, mobile phones and other handheld systems
    • OpenGL ES: a simplified OpenGL specification for embedded systems
  • The classic PC GPU architecture is not suited to embedded systems, which require
    • Low power
    • Low area budget
  • We propose a single unified shader GPU architecture for embedded systems
Outline
  • ATTILA PC
  • ATTILA Embedded
  • Triangle Setup in the Shader Unit
  • ATTILA Simulation Framework
  • Results
Attila Classic for PCs
  • Optimized for large resolutions
    • Above 1024x768
  • Optimized for high performance
  • High power requirements
    • No power optimizations
    • 100+ watts on current high-end GPUs
  • Large area budget
    • 300+ million transistors on current high-end GPUs
  • Large dedicated memory bandwidth
    • 40+ GB/s on current high-end GPUs
  • Specialized Shader Units
    • 2 to 8 vertex shader units
    • 1 to 6 fragment shader units
Attila PC
[Figure: Attila PC pipeline diagram. Blocks shown, in order: Vertex Fetch, two Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, two Fragment Shaders, two ROPs, two Memory Controllers. Annotations: "Specialized Shaders"; "Four fragments processed in parallel".]

Embedded Requirements
  • Optimized for small resolutions
    • 320x240 to 640x480
  • Optimized for low power
    • Reduce frequency
    • Power optimizations
    • Improve efficiency
  • Small area budget
    • Remove non-crucial hardware
  • Low available bandwidth
  • Reduced shading power
  • Reduce design complexity
Attila Embedded
  • No Hierarchical Z
  • No Z compression
  • Single unified shader
    • 1 SIMD ALU
    • Multithreaded
      • 16 threads of four vertex/triangle/fragment elements
      • 16 128-bit registers for temporary storage available per thread
    • Texture unit outputs 1 bilinear sample per cycle (a whole fragment quad every 4 cycles)
    • 4 KB Texture Cache
  • ROP
    • One Z value and one color value updated per cycle in the framebuffer (a fragment quad every 4 cycles)
  • Single 64-bit DDR channel
    • Limited by current simulator implementation
    • Treated as equivalent to a small (1 MB) embedded DRAM
  • 32-bit high-latency bus to large system memory for textures
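The memory figures above can be sanity-checked with some quick arithmetic. The sketch below is not from the presentation: the pixel formats (32-bit color, 32-bit depth) and double-buffered color are assumptions, and the helper names are hypothetical. It only shows that, under those assumptions, a 320x240 framebuffer fits in 1 MB while a 640x480 one does not, and that a 64-bit DDR channel moves 16 bytes per memory clock cycle.

```python
EMBEDDED_DRAM_BYTES = 1 * 1024 * 1024  # 1 MB embedded DRAM, from the slide


def framebuffer_bytes(width, height, color_bytes=4, depth_bytes=4, color_buffers=2):
    """Bytes for the color buffers plus one depth buffer.

    The per-pixel formats and the buffer count are assumptions, not slide data.
    """
    return width * height * (color_bytes * color_buffers + depth_bytes)


def ddr_bytes_per_cycle(bus_bits=64):
    """A DDR channel transfers on both clock edges: two bus widths per cycle."""
    return 2 * bus_bits // 8


if __name__ == "__main__":
    for width, height in [(320, 240), (640, 480)]:
        size = framebuffer_bytes(width, height)
        fits = size <= EMBEDDED_DRAM_BYTES
        print(f"{width}x{height}: {size} bytes, fits in 1 MB: {fits}")
    print(f"64-bit DDR channel: {ddr_bytes_per_cycle()} bytes per cycle")
```

With other pixel formats the footprint changes proportionally, so the conclusion about 640x480 depends on the assumed formats.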
Attila Embedded
[Figure: Attila Embedded pipeline diagram. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Rasterization, ROP, Memory Controller, and a Single Unified Shader (Scheduler/Distributor plus Shader) that receives Vertices, Triangles and Fragments. Annotation: "Single fragment per cycle pipeline".]
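One way to picture a single shader serving vertices, triangles and fragments is a scheduler that picks work from three input queues. The sketch below is illustrative only: the class name `UnifiedShaderScheduler`, the queue layout and the priority order (fragments first, then triangles, then vertices, so that work already in the pipeline drains before new work enters) are assumptions, not the policy used by the authors. Only the group size of four elements and the 16 thread slots come from the slides.

```python
from collections import deque

GROUP_SIZE = 4    # elements per thread, from the slide
MAX_THREADS = 16  # thread slots, from the slide


class UnifiedShaderScheduler:
    # Assumed priority: drain the later pipeline stages first.
    PRIORITY = ("fragment", "triangle", "vertex")

    def __init__(self):
        self.queues = {kind: deque() for kind in self.PRIORITY}
        self.active_threads = []

    def submit(self, kind, element):
        """Queue one vertex, triangle or fragment for shading."""
        self.queues[kind].append(element)

    def issue(self):
        """Start one thread from the highest-priority non-empty queue.

        Returns (kind, elements), or None if no slot or no work is available.
        """
        if len(self.active_threads) >= MAX_THREADS:
            return None
        for kind in self.PRIORITY:
            queue = self.queues[kind]
            if queue:
                count = min(GROUP_SIZE, len(queue))
                thread = (kind, [queue.popleft() for _ in range(count)])
                self.active_threads.append(thread)
                return thread
        return None

    def retire(self, thread):
        """Free the slot of a finished thread."""
        self.active_threads.remove(thread)
```

A real design would also have to decide what to do with partially filled groups and how to avoid starving vertex work; this sketch simply issues whatever is available.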

Triangle Setup in the Shader
  • 2D Homogeneous Rasterization
    • Olano & Greer
  • Triangle setup algorithm:
    • Calculate the setup matrix from the triangle vertex matrix
    • Calculate the interpolation equation for fragment Z
    • Cull triangles based on their facing direction (area sign)
  • The algorithm is well suited to a SIMD implementation in the Unified Shader
  • Inputs:
    • Four 3-component vectors holding the triangle vertex positions
  • Outputs:
    • Three 4-component vectors holding the triangle edge and Z interpolation equation coefficients
    • One signed triangle area register for the face culling stage
  • 26-instruction triangle shader program
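The three steps listed above can be sketched in a few lines. This is a minimal illustration in the spirit of 2D homogeneous rasterization, not the authors' 26-instruction shader program: the function names are hypothetical, each vertex is assumed to be given as clip-space `(x, y, w)`, and the adjoint of the vertex matrix is used in place of its inverse for the edge equations so that the determinant can double as the facing (area sign) value.

```python
def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def triangle_setup(v0, v1, v2, z):
    """Set up one triangle from (x, y, w) vertices and per-vertex depths z.

    Returns (edges, z_coeffs, det), or None for a degenerate triangle.
    edges[i] evaluates to det at vertex i and to 0 at the other two vertices.
    z_coeffs evaluates to z[i] at vertex i. The sign of det gives the facing.
    """
    # Columns of the adjoint of the vertex matrix, one edge equation each.
    edges = (cross(v1, v2), cross(v2, v0), cross(v0, v1))
    det = dot(v0, edges[0])  # determinant of the vertex matrix
    if det == 0:
        return None  # zero area: nothing to rasterize
    # Inverse of the vertex matrix applied to the three vertex depths.
    z_coeffs = tuple(
        sum(z[i] * edges[i][k] for i in range(3)) / det for k in range(3)
    )
    return edges, z_coeffs, det


def covers(edges, det, p):
    """True if point p = (x, y, w) is inside the triangle, for either facing."""
    values = [dot(e, p) for e in edges]
    if det < 0:
        values = [-v for v in values]
    return all(v >= 0 for v in values)
```

For the triangle `(0, 0, 1)`, `(4, 0, 1)`, `(0, 4, 1)` the determinant is 16; swapping two vertices flips its sign, which is what a face culling stage would test.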
Triangle Setup in the Shader
  • Benefits
    • Reduced area
      • No specialized hardware required for triangle setup
    • Reduced design complexity
    • Improved efficiency
      • In most cases, graphics workloads in embedded applications may not fully utilize specialized triangle setup hardware
      • Higher utilization of the shader
  • Costs
    • Shader workload increases
    • The rasterization pipeline must be rerouted
ATTILA Simulation Framework
[Figure: simulation framework flow in four stages: Collect, Verify, Simulate, Analyze. Blocks shown: OpenGL Application, GLInterceptor, Trace, GLPlayer, Vendor OpenGL Driver (twice), ATTILA OpenGL Driver, ATI R520/NVidia G70 (twice), ATTILA Simulator, Framebuffer (three times, with two "CHECK!" comparisons), Statistics, Signal Traffic, Signal Visualizer.]

  • GLInterceptor
    • Captures a trace of OpenGL API calls from a real game

  • GLPlayer
    • Reproduces the captured trace
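GLInterceptor and GLPlayer are the authors' tools, and the slides do not describe their trace format. The sketch below only illustrates the general capture-and-replay idea with hypothetical names (`TraceRecorder`, `replay`, a JSON-lines trace) and a stand-in `FakeGL` object in place of a real OpenGL driver. Comparing the state reached by the original run and by the replayed run mirrors the "CHECK!" steps in the framework diagram.

```python
import io
import json


class TraceRecorder:
    """Wraps an API object and logs each call as one JSON line before forwarding it."""

    def __init__(self, target, sink):
        self._target = target
        self._sink = sink

    def __getattr__(self, name):
        # Only reached for names not defined on the recorder itself.
        func = getattr(self._target, name)

        def wrapper(*args):
            record = {"call": name, "args": list(args)}
            self._sink.write(json.dumps(record) + "\n")
            return func(*args)

        return wrapper


def replay(trace_lines, target):
    """Re-issue every recorded call, in order, against another API object."""
    for line in trace_lines:
        record = json.loads(line)
        getattr(target, record["call"])(*record["args"])


class FakeGL:
    """Stand-in for a driver: it just remembers the calls it received."""

    def __init__(self):
        self.log = []

    def clear_color(self, r, g, b, a):
        self.log.append(("clear_color", r, g, b, a))

    def draw_arrays(self, mode, first, count):
        self.log.append(("draw_arrays", mode, first, count))


if __name__ == "__main__":
    original, replayed = FakeGL(), FakeGL()
    sink = io.StringIO()
    app_view = TraceRecorder(original, sink)
    app_view.clear_color(0.0, 0.0, 0.0, 1.0)
    app_view.draw_arrays("TRIANGLES", 0, 3)
    replay(io.StringIO(sink.getvalue()), replayed)
    print("replay matches original:", original.log == replayed.log)
```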

  • OpenGL Library
    • Translates the fixed-function API into shader code
    • 200 API calls supported
    • ARB vertex and fragment extensions
    • Alpha and fog emulated via shader code
  • Driver
    • Low-level interface to the GPU hardware
    • Attila memory management

  • ATTILA Simulator
    • Detailed cycle-by-cycle simulation of all pipeline stages
    • 20 boxes, modeling a 100-deep pipeline
    • Execute@Execute: functionality embedded at each pipeline stage
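A cycle-by-cycle simulator built from boxes can be pictured as a set of objects that are clocked once per cycle and exchange data over wires with a fixed latency. The sketch below is a toy illustration under that assumption; the `Signal`, `Producer`, `Doubler` and `Sink` classes and their interfaces are hypothetical and are not the ATTILA simulator's API. The `Doubler` box does its functional work inside the timing model itself, which is one reading of the "Execute@Execute" bullet.

```python
from collections import deque


class Signal:
    """A wire with fixed latency: data written at cycle c is readable at c + latency."""

    def __init__(self, latency):
        self.latency = latency
        self._in_flight = deque()

    def write(self, cycle, data):
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Doubler:
    """Computes its result in the same box that models its timing."""

    def __init__(self, inp, out):
        self.inp = inp
        self.out = out

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.out.write(cycle, data * 2)


class Sink:
    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.received.append((cycle, data))


def simulate(boxes, cycles):
    """Clock every box once per cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)
```

Because every signal here has a latency of at least one cycle, the order in which boxes are clocked within a cycle does not change the result.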

Spot the differences
[Figure: two rendered images side by side, one labeled "Attila" and one labeled "NVidia GeForce FX 5900XT".]

Benchmark
  • Unreal Tournament 2004
    • NOT AN EMBEDDED BENCHMARK
      • Up to 300K vertices per frame!
    • Fixed-function OpenGL API
      • Vertex and fragment shaders generated by our library
    • 320x240 resolution
    • 140 of 450 frames simulated
    • 100+ frames ~ 1 day of simulation
      • On a Xeon P4 @ 2.0 GHz
Configurations
  • We have evaluated
    • 3 mid-range to low-end PC GPU configurations
    • 2 chipset-integrated GPU and high-end PDA GPU configurations
    • 4 low-end embedded GPU configurations
  • We tried to keep a balance between memory bandwidth and shading computing power
    • From 4 vertex shader units down to none
    • From 2 quad fragment shader units down to a single unified shader unit
    • From four 64-bit DDR memory channels down to one
    • Framebuffer stored in a small (1 MB) GPU memory and textures in system memory
  • Halved the frequency for embedded systems
    • Restricted design rules
    • Reduced power consumption
  • Removed all optional features at the low end
    • Hierarchical Z
    • Z compression
    • Specialized Triangle Setup hardware
Performance
  • Average of 20 frames per second at 320x240 for the lower-end single-shader configurations
Efficiency
  • The limiting factor for the PC and high-end embedded configurations is memory bandwidth
    • Shaders are underutilized for the evaluated benchmark
  • The limiting factor for the low-end configurations is shading processing
    • Memory bandwidth could be further reduced
  • Caches seem oversized for the low-end embedded configurations
Shaded Triangle Setup Performance
  • No overhead on fragment-limited benchmarks
  • 16% lower performance on vertex- and triangle-limited traces
Conclusion
  • Attila Embedded achieves 20 frames per second with a single unified shader architecture at 320x240 resolution, using a year-old PC benchmark
    • 1 MB of fast embedded DRAM provides more than enough bandwidth for framebuffer accesses
      • Texture data is stored in system memory
    • 16% performance reduction when removing the specialized Triangle Setup unit, in the worst tested case
Attila PC
[Figure: Attila PC pipeline diagram with a unified shader pool. Blocks shown: Vertex Fetch, Scheduler/Distributor, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, a Unified Shader Pool of four Shaders, four ROPs, four Memory Controllers.]