Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU

Mauricio Breternitz Jr, Herbert Hum, Sanjeev Kumar

Microprocessor Research Labs,

Intel Corporation

Graphics Applications
  • Computationally intensive graphics applications are becoming increasingly popular
    • Computer-Aided Design
      • From Airplanes to Cars
    • Visualization of massive quantities of Data
    • Visual Simulators, e.g., for Training Pilots
    • Fancier Graphical User Interfaces
    • And, of course, Games
  • And this trend is continuing
    • As high-end applications become more mainstream

Parallel Architecture and Compilation Techniques, 2003

Graphics Pipeline

[Figure: pipeline diagram with the stages 3D Application, OpenGL or DirectX, Scene, Transform, Lighting, Clipping, Rasterization, Texture Mapping, Compositing, Display]

  • Vertex Shaders
    • Operate on every vertex in the scene
    • Effects like
      • Blur
      • Diffuse and specular reflection
  • Pixel Shaders
    • Operate on every pixel
    • Effects like
      • Texturing
      • Fog blending

Vertex and Pixel Shaders
  • Need to operate millions of times a second
    • Small programs
  • Typically run on the graphics cards
  • However, most desktops do not have graphics cards that support programmable shaders
  • This work focuses on running Vertex Shaders on the main CPU
    • Pixel shaders have very high computational and bandwidth requirements
    • Graphics applications are designed to adapt to the available features and performance

Goals
  • Improving the performance of Vertex Shaders on the main CPU
    • Analyze the performance on today’s CPUs
    • Better Compiler Optimizations
    • Additional Architectural Support
  • Identify three architectural and compiler enhancements
    • Significant impact on the performance
      • Roughly a factor of 2

Outline
  • Motivation
  • Baseline Compiler
  • Three Enhancements
  • Performance Evaluation
  • Conclusions

Vertex Shader Programs
  • Small programs (at most 256 instructions)
  • SIMD instructions with xyzw components
  • Mask and swizzle on each instruction
  • No state saved between vertices
  • Read-only memory & temporary registers
  • Program cannot change control flow

[Figure: virtual machine built around a SIMD ALU, with the following storage]
  • Vertex Input: 16 x 4 registers
  • Temporary Registers: 12 x 4
  • Constant Memory: 256 x 4
  • Integer Registers: 84 x 1
  • Vertex Output: 15 x 4 registers

Example program:

dp4 oPos.x, v0, c[0]
dp4 oPos.y, v0, c[1]
dp4 oPos.z, v0, c[2]
dp4 oPos.w, v0, c[3]
mov oD0, c[4].wzyx
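
The example program above can be read as follows: each dp4 writes one component of the output position, and the mov copies a constant with its components reversed. Below is a minimal Python sketch of these semantics, with each register modeled as a 4-element list. The helper names (`swizzle`, `dp4`, `write_masked`, `run_shader`) and the sample values are assumptions made for illustration; they do not come from the presentation.

```python
# Illustrative model of the virtual machine's xyzw registers.
# Helper names and sample values are hypothetical, not from the slides.

COMPONENTS = "xyzw"


def swizzle(reg, pattern="xyzw"):
    """Return a copy of reg with its components rearranged, e.g. 'wzyx'."""
    return [reg[COMPONENTS.index(ch)] for ch in pattern]


def dp4(a, b):
    """Four-component dot product (a single scalar)."""
    return sum(x * y for x, y in zip(a, b))


def write_masked(dest, value, mask="xyzw"):
    """Store only the components named in mask; the others keep their value."""
    for ch in mask:
        i = COMPONENTS.index(ch)
        dest[i] = value[i]


def run_shader(v0, c):
    """Execute the five-instruction example program for one vertex."""
    o_pos = [0.0] * 4
    o_d0 = [0.0] * 4
    for i, comp in enumerate(COMPONENTS):
        # dp4 oPos.<comp>, v0, c[i]: the scalar result is replicated and
        # the write mask selects the single component that is stored.
        write_masked(o_pos, [dp4(v0, c[i])] * 4, comp)
    # mov oD0, c[4].wzyx
    write_masked(o_d0, swizzle(c[4], "wzyx"))
    return o_pos, o_d0


# Identity transform in c[0..3] and a sample color in c[4].
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
constants = identity + [[0.1, 0.2, 0.3, 1.0]]
print(run_shader([2.0, 3.0, 4.0, 1.0], constants))
```

With the identity matrix in c[0] to c[3], the output position equals the input vertex, which makes the effect of the four dp4 instructions easy to check by hand.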

Baseline Optimizing Compiler
  • Implemented a Compiler for Vertex Shaders
    • Input: Vertex Shader Assembly
    • Output: Optimized x86 (with SSE2)
    • Started with the DirectX reference rasterizer (an interpreter)
      • Used it as the front end
    • Use Olive pattern-matching code-generator generator
    • Graph-coloring based register allocator
    • Loop unrolling
    • List-scheduler
  • About 70% faster than a naïve translator
    • Naïve translator: translate into C and feed it to a C compiler

Characteristics of Generated Code
  • Mostly SIMD instructions (x86 with SSE2)
    • 83-99% of the instructions
  • Large basic blocks
    • Use of control-flow is limited
    • Makes it easier to compile efficiently
  • Vertex Shader Assembly to x86 Assembly
    • 10-20 times increase in number of instructions

mul r0.x_z_, v0.xyzz, v1.wwww
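
To see why a single shader instruction such as the mul above grows into several x86 instructions, it helps to split it into the simpler steps a SIMD instruction set without per-instruction swizzle and write mask would need. The Python sketch below models those steps on 4-element lists. The step breakdown and the function names are assumptions for illustration, not the compiler's actual output.

```python
# Sketch of how one shader instruction splits into simpler SIMD-style steps.
# The breakdown is an assumption for illustration, not real compiler output.

IDX = {"x": 0, "y": 1, "z": 2, "w": 3}


def shuffle(reg, pattern):
    """Model of a shuffle: pick source components according to pattern."""
    return [reg[IDX[ch]] for ch in pattern]


def mul(a, b):
    """Model of a packed multiply across all four components."""
    return [x * y for x, y in zip(a, b)]


def blend(old, new, mask):
    """Model of a masked write: take new where mask names a component."""
    return [n if ch in mask else o for ch, o, n in zip("xyzw", old, new)]


def masked_mul(r0, v0, v1):
    """mul r0.x_z_, v0.xyzz, v1.wwww as a sequence of simpler steps."""
    steps = []
    a = shuffle(v0, "xyzz")
    steps.append("shuffle v0 -> xyzz")
    b = shuffle(v1, "wwww")
    steps.append("shuffle v1 -> wwww")
    product = mul(a, b)
    steps.append("packed multiply")
    result = blend(r0, product, "xz")
    steps.append("merge into r0 under mask x_z_")
    return result, steps


result, steps = masked_mul([-1, -1, -1, -1], [1, 2, 3, 4], [5, 6, 7, 10])
print(result)
print(len(steps), "steps")
```

Each modeled step may itself need more than one real instruction (the masked merge in particular), which is in line with the 10-20 times growth in instruction count reported on the slide.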

1. New Instructions
  • Dot products are very common in Shaders
  • A dot product is expensive on x86
    • A sequence of 7 instructions in the simple case
    • 1 multiply, 2 add, 4 shuffle instructions
  • New dot product instructions
    • Compute the dot product of two source operands and store it in each word of the destination operand
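
The contrast between the multi-instruction sequence and the proposed instruction can be sketched in Python on 4-element lists. The slides do not give the exact seven-instruction sequence, so the multiply/shuffle/add pattern below is only one plausible shape of such a reduction; the function names are hypothetical.

```python
# Models a 4-wide SIMD register as a list of four numbers.
# The shuffle patterns are one possible reduction, not necessarily the
# seven-instruction sequence counted in the slides.


def mulps(a, b):
    """Packed multiply."""
    return [x * y for x, y in zip(a, b)]


def addps(a, b):
    """Packed add."""
    return [x + y for x, y in zip(a, b)]


def shuffle(a, order):
    """Rearrange lanes: result[i] = a[order[i]]."""
    return [a[i] for i in order]


def dot_with_shuffles(a, b):
    """Dot product built from multiply, shuffle, and add steps."""
    t = mulps(a, b)                          # (p0, p1, p2, p3)
    t = addps(t, shuffle(t, [1, 0, 3, 2]))   # (p0+p1, p0+p1, p2+p3, p2+p3)
    t = addps(t, shuffle(t, [2, 3, 0, 1]))   # every lane holds the full sum
    return t


def dot_instruction(a, b):
    """Model of the proposed instruction: result stored in each word."""
    s = sum(x * y for x, y in zip(a, b))
    return [s] * 4


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(dot_with_shuffles(a, b))
print(dot_instruction(a, b))
```

Both functions leave the same value in every lane; the difference is that the first needs several dependent steps while the second stands for a single instruction.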

2. Mask Analysis Optimization
  • Traditional optimizers keep track of the liveness information on a per-register basis
    • Shaders: often only part of the SIMD register is live
    • Modify the analysis to track liveness for each word of the SIMD register
  • Analysis Phase
    • Annotate the IR with additional information
    • During live variable analysis, propagate the liveness mask depending on the instructions
  • Optimization Phase
    • Identify dead code
    • Replace some shuffle/mask instructions with moves
      • The moves might get eliminated entirely during register allocation
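
The analysis phase described above can be sketched as a backward pass over straight-line code that keeps a set of live components per register instead of a single live bit. The instruction encoding, the `analyze` function, and the sample program below are assumptions for illustration; the compiler's actual IR is not shown in the slides.

```python
# Per-component liveness over straight-line shader code (a sketch).
# The tuple encoding and the sample program are hypothetical.

IDX = {"x": 0, "y": 1, "z": 2, "w": 3}


def analyze(program, live_out):
    """Backward liveness with one live set of components per register.

    program:  list of (op, dest, write_mask, [(src, swizzle), ...])
    live_out: dict register -> components live at program exit
    Returns the surviving instructions paired with the components of
    their result that are actually needed; dead instructions are dropped.
    """
    live = {reg: set(mask) for reg, mask in live_out.items()}
    kept = []
    for op, dest, mask, srcs in reversed(program):
        needed = set(mask) & live.get(dest, set())
        if not needed:
            continue  # no written component is read later: dead code
        # Components written here are not live before this instruction.
        live[dest] = live.get(dest, set()) - set(mask)
        for src, swz in srcs:
            if op == "dp4":
                used = set(swz)  # a dot product reads all four inputs
            else:
                # Component-wise op: output component c reads swz[c].
                used = {swz[IDX[c]] for c in needed}
            live.setdefault(src, set()).update(used)
        needed_mask = "".join(c for c in "xyzw" if c in needed)
        kept.append(((op, dest, mask, srcs), needed_mask))
    kept.reverse()
    return kept


program = [
    ("mov", "r1", "xyzw", [("v0", "xyzw")]),
    ("mul", "r0", "xyzw", [("r1", "xyzz"), ("v1", "wwww")]),
    ("mov", "r2", "xyzw", [("c0", "xyzw")]),   # r2 is never read
    ("mov", "oD0", "xz", [("r0", "xzzz")]),
]

for (op, dest, mask, _), needed in analyze(program, {"oD0": "xyzw"}):
    print(f"{op} {dest}.{mask}: needed components {needed}")
```

In this sample the write to r2 is removed as dead code, and the first two instructions only need their x and z results, which is the information the optimization phase would use to simplify shuffle and mask instructions.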

3. Number of Registers
  • Spilling registers to memory can degrade performance
  • Investigate the impact of increasing the number of registers from 8 to 16
  • Why not more?
    • Trickier to encode it in the ISA

Experimental Setup
  • 10 Vertex Shaders
    • 8-84 instructions
    • Only 3 of them have loops (Control)
  • Execution time measured on a 2.2 GHz Pentium IV processor
    • Instruction counts otherwise
    • Break down the instructions into categories
  • Measure performance by using the generated code to process an array of vertices
    • Compute average

Evaluation
  • New dot-product instructions: 27.4% on average (estimate)
    • Reduces the number of instructions by 24%
  • Mask optimization: 19.5% on average
  • Both: 42% on average

[Figure: normalized execution time for each vertex shader]
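
The slides do not say how the combined figure is derived. One way to sanity-check it is to assume that each percentage is a reduction in execution time relative to the baseline and that the two optimizations compose multiplicatively. Both assumptions, and the function name, are mine rather than the authors'.

```python
# Sanity check of the combined figure under two stated assumptions:
# (1) each percentage is a reduction in execution time, and
# (2) the two optimizations compose multiplicatively.

dot_product_gain = 0.274   # new dot-product instructions (from the slide)
mask_gain = 0.195          # mask optimization (from the slide)


def combined_reduction(*gains):
    """Fraction of execution time removed when applying all gains in turn."""
    remaining = 1.0
    for gain in gains:
        remaining *= 1.0 - gain
    return 1.0 - remaining


both = combined_reduction(dot_product_gain, mask_gain)
print(f"{both:.1%}")
```

Under these assumptions the result rounds to the 42% reported for both optimizations together, which suggests the two gains are largely independent.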

Evaluation Cont’d
  • More registers reduce the number of instructions by 8% on average
    • Eliminates 35-100% of the spill instructions
  • This understates the potential benefit
    • More registers allow more aggressive optimizations like instruction scheduling

[Figure: normalized instruction count for each vertex shader]

Conclusions & Future Work
  • Implemented an Optimizing Compiler for Vertex Shaders
  • Propose and Evaluate Three Enhancements
    • Compiler: Mask Optimization
    • Architectural: New Instructions & More registers
    • Improve the performance by roughly a factor of 2

  • Shaders are evolving rapidly
    • More like general purpose processors
    • More complex model
