
Streaming Architectures and GPUs

Ian Buck

Bill Dally & Pat Hanrahan

Stanford University

February 11, 2004

To Exploit VLSI Technology We Need:
  • Parallelism
    • To keep 100s of ALUs per chip (thousands/board, millions/system) busy
  • Latency tolerance
    • To cover 500 cycle remote memory access time
  • Locality
    • To match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth
  • Moore’s Law
    • Growth of transistors, not performance

Courtesy of Bill Dally

Arithmetic is cheap, global bandwidth is expensive

Local << global on-chip << off-chip << global system

Arithmetic Intensity

Lots of ops per word transferred

  • Compute-to-Bandwidth ratio
  • High Arithmetic Intensity desirable
    • App limited by ALU performance, not off-chip bandwidth
    • More chip real estate for ALUs, not caches

Courtesy of Pat Hanrahan
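
As a rough illustration using the numbers from the VLSI slide above (an added calculation, not from the slides): matching ~20Tb/s of ALU bandwidth against ~100Gb/s of chip bandwidth requires on the order of 200 ops per word transferred off-chip.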

Brook: Stream Programming Model
  • Enforce Data Parallel computing
  • Encourage Arithmetic Intensity
  • Provide fundamental ops for stream computing
Streams & Kernels
  • Streams
    • Collection of records requiring similar computation
      • Vertex positions, voxels, FEM cells, …
    • Provide data parallelism
  • Kernels
    • Functions applied to each element in stream
      • transforms, PDEs, …
      • No dependencies between stream elements
    • Encourage high Arithmetic Intensity
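
To make this concrete, here is a minimal Brook-style sketch (illustrative, not from the talk; the saxpy kernel and the host arrays x_data, y_data, result_data are hypothetical). A kernel is declared once and applied independently to every element of its input streams:

kernel void saxpy(float a, float x<>, float y<>, out float result<>) {
    // Runs once per stream element; no dependencies between elements.
    result = a * x + y;
}

float x<100>;                      // input streams of 100 float records
float y<100>;
float result<100>;                 // output stream

streamRead(x, x_data);             // copy host data into the streams
streamRead(y, y_data);
saxpy(2.0f, x, y, result);         // kernel runs over all 100 elements
streamWrite(result, result_data);  // copy results back to host memory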
Vectors vs. Streams

Vectors:
  • v: array of floats
  • Instruction sequence: LD v0; LD v1; ADD v0, v1, v2; ST v2
  • Large set of temps

Streams:
  • s: stream of records
  • Instruction sequence: LD s0; LD s1; CALLS f, s0, s1, s2; ST s2
  • Small set of temps

Higher arithmetic intensity: |f|/|s| >> |+|/|v|
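
A short Brook-style kernel (again hypothetical, for illustration) shows why |f| can grow while |s| stays fixed: temporaries live inside the kernel body, so many ops execute per element that crosses the memory interface, whereas the vector ADD performs one op per element loaded:

kernel void f(float a<>, float b<>, out float c<>) {
    float t1 = a * b;   // temporaries stay local to the kernel
    float t2 = t1 + a;
    c = t2 * t1 - b;    // several ops per element loaded/stored
}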

Imagine
  • Imagine
    • Stream processor for image and signal processing
    • 16mm die in 0.18um TI process
    • 21M transistors

[Diagram: SDRAM channels (2GB/s off-chip) feed a Stream Register File (32GB/s), which feeds the ALU clusters (544GB/s of local bandwidth).]
Merrimac Processor
  • 90nm tech (1 V)
  • ASIC technology
  • 1 GHz (37 FO4)
  • 128 GOPs
  • Inter-cluster switch between clusters
  • 127.5 mm2 (small: ~12mm x 10mm)
    • Stanford Imagine is 16mm x 16mm
    • MIT Raw is 18mm x 18mm
  • 25 Watts (P4 = 75 W)
    • ~41W with memories

[Figure: Merrimac floorplan (12.5 mm across): MIPS64 20Kc scalar cores, microcontroller, 16 ALU clusters, cache ($) banks, address generation and reorder logic, memory channels, and the inter-cluster network.]

Streaming Applications
  • Finite volume – StreamFLO (from TFLO)
  • Finite element - StreamFEM
  • Molecular dynamics code (ODEs) - StreamMD
  • Model (elliptic, hyperbolic and parabolic) PDEs
  • PCA Applications: FFT, Matrix Mul, SVD, Sort
StreamFLO
  • StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for the solution of the inviscid flow around an airfoil.
  • The code uses a cell centered finite volume formulation with a multigrid acceleration to solve the 2D Euler equations.
  • The structure of the code is similar to TFLO and the algorithm is found in many compressible flow solvers.
StreamFEM
  • A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.
StreamMD: motivation
  • Application: study the folding of human proteins.
  • Molecular Dynamics: computer simulation of the dynamics of macromolecules.
  • Why this application?
    • Expect high arithmetic intensity.
    • Requires variable-length neighbor lists.
    • Molecular Dynamics can be used in engine simulation to model sprays, e.g. droplet formation and breakup, drag, and droplet deformation.
  • Test case chosen for initial evaluation: box of water molecules.

[Images: DNA molecule; human immunodeficiency virus (HIV).]

Summary of Application Results

[Table: per-application performance results (see notes below).]

1. Simulated on a machine with 64GFLOPS peak performance.

2. The low numbers are a result of many divide and square-root operations.

Streaming on graphics hardware?

Pentium 4 SSE theoretical*: 3GHz * 4 wide * .5 inst / cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0 runs at 20 GFLOPS, equivalent to a 10 GHz P4 by the same formula (10GHz * 4 wide * .5 inst / cycle = 20 GFLOPS), and getting faster: a 3x improvement over NV30 in 6 months.

[Chart: observed GFLOPS over time, Pentium 4 vs. GeForce FX (NV30, NV35).]

*from Intel P4 Optimization Manual

GPU Program Architecture

[Diagram: the shader Program reads Input Registers, Texture, and Constants, uses temporary Registers, and writes Output Registers.]

Example Program

Simple Specular and Diffuse Lighting

!!VP1.0
#
# c[0-3] = modelview projection (composite) matrix
# c[4-7] = modelview inverse transpose
# c[32] = eye-space light direction
# c[33] = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36] = specular color
# c[38].x = specular power
# outputs homogeneous position and color
#
DP4 o[HPOS].x, c[0], v[OPOS];      # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];           # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];           # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;               # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;               # R1.y = H DOT N'
MOV R1.w, c[38].x;                 # R1.w = specular power
LIT R2, R1;                        # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;    # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3;  # + specular
END

Cg/HLSL: High-Level Language for GPUs

Specular Lighting

// Look up the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);

// Multiply the 3x2 matrix formed by lightDir and halfAngle with the
// scaled normal, then look up the intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);

// Look up the color
float4 color = tex2D(colorMap, I.texCoord3.xy);

// Blend/modulate intensity with color
return color * intensity;

GPU: Data Parallel
  • Each fragment shaded independently
    • No dependencies between fragments
      • Temporary registers are zeroed
      • No static variables
      • No Read-Modify-Write textures
    • Multiple “pixel pipes”
  • Data Parallelism
    • Support ALU-heavy architectures
    • Hide Memory Latency

[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]

GPU: Arithmetic Intensity

Lots of ops per word transferred

Graphics pipeline

  • Vertex
    • BW: 1 triangle = 32 bytes;
    • OP: 100-500 f32-ops / triangle
  • Rasterization
    • Create 16-32 fragments per triangle
  • Fragment
    • BW: 1 fragment = 10 bytes
    • OP: 300-1000 i8-ops/fragment

Shader Programs

Courtesy of Pat Hanrahan
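
Illustrative arithmetic from the numbers above (not on the slide): 100-500 f32-ops per 32-byte vertex is roughly 3-16 ops per byte, while 300-1000 i8-ops per 10-byte fragment is 30-100 ops per byte, so arithmetic intensity rises toward the fragment end of the pipeline.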

Streaming Architectures

[Diagram: SDRAM channels feed a Stream Register File, which feeds ALU clusters, as in the Imagine diagram above.]

Streaming Architectures

[Same diagram; the ALU clusters form the Kernel Execution Unit, running e.g.:]

MAD R3, R1, R2;
MAD R5, R2, R3;

Streaming Architectures

[Same diagram: the Kernel Execution Unit (MAD R3, R1, R2; MAD R5, R2, R3;) executes across Parallel Fragment Pipelines, which play the role of the ALU clusters.]

Streaming Architectures

[Same diagram, with the Kernel Execution Unit and Parallel Fragment Pipelines; the remaining question is what serves as the Stream Register File:]

  • Stream Register File:
    • Texture Cache?
    • F-Buffer [Mark et al.]

Conclusions
  • The problem is bandwidth – arithmetic is cheap
  • Stream processing & architectures can provide VLSI-efficient scientific computing
    • Imagine
    • Merrimac
  • GPUs are first generation streaming architectures
    • Apply same stream programming model for general purpose computing on GPUs
