Enhancing GPU for Scientific Computing

PowerPoint Slideshow about 'Enhancing GPU for Scientific Computing' - zanthe


Outline
  • Motivation
  • Related work
  • BLAS Library
  • Execution Model
  • Benchmarks
  • Recommendations
Motivation
  • GPU Computing
    • Vertex and fragment processors: streaming (super)computers
    • enormous performance!
    • ATI 9700, NV30
  • They have become programmable
  • Emerging application areas
    • Numerical simulation [Schroder’03], sorting, genomics, etc.
    • Goal: Scientific Computing
Motivation
  • Most software is built from small, efficient parts
  • Scientific applications are built on top of software library routines
  • Harnessing GPU resources
    • Arithmetic-intensive
    • Data parallel
  • BLAS Library
Related work
  • Using non-programmable GPUs
  • [Erik’01] programmable vertex engine for lighting/morphing
  • [Oskin’02] vector processing using the vertex processor (VP)
  • [Ian’03] stream processing using the fragment processor (FP)
  • Problems:
    • Monolithic, big programs
    • Only one of VP or FP is used
    • CPU in passive mode
    • No cascading loop-backs (parallelism, setup times)
BLAS Library
  • BLAS (Basic Linear Algebra Subprograms)
    • Building blocks for vector and matrix operations
    • Used in the development of highly efficient linear algebra software
      • LINPACK and LAPACK
  • Operations
    • Scalar – Vector
    • Vector – Vector
    • Vector – Matrix
    • Matrix – Matrix
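The four operation classes above can be illustrated with a minimal pure-Python sketch. The function names are borrowed loosely from the usual BLAS routine stems (scal, axpy, gemv, gemm), but the signatures are simplified assumptions for illustration and are not the interface of the library described in this presentation.

```python
# Reference versions of the four operation classes, written for clarity
# rather than speed. Names and signatures are illustrative assumptions.

def scal(alpha, x):
    """Scalar-vector: return alpha * x."""
    return [alpha * xi for xi in x]


def axpy(alpha, x, y):
    """Vector-vector: return alpha * x + y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Matrix-vector: return A * x, with A given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Matrix-matrix: return A * B, both given as lists of rows."""
    columns = list(zip(*b))  # transpose B so each column is easy to walk
    return [[sum(p * q for p, q in zip(row, col)) for col in columns]
            for row in a]
```

Each routine is data parallel: every output element can be computed independently of the others, which is the property the GPU mapping relies on.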
Mapping
  • Operation-to-processor mapping
  • CPU/FP - All ops
  • VP - no memory access
  • Restricted data-flows
    • CPU FP
    • VP CPU

[Figure: mapping of operations to processors. CPU: all operations; VP: non-matrix ops; FP: all operations.]
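The mapping can be written down as a small lookup, sketched here in Python. Which operation classes count as "non-matrix" is an assumption of this sketch (scalar-vector and vector-vector), and the function name is hypothetical.

```python
# Operation classes from the BLAS Library slide.
OP_CLASSES = ("scalar-vector", "vector-vector", "vector-matrix", "matrix-matrix")

# Assumption: "non-matrix ops" means the two classes without a matrix operand.
NON_MATRIX = {"scalar-vector", "vector-vector"}


def candidate_processors(op_class):
    """Return the processors that may run an operation class.

    CPU and FP accept every class; VP is limited to non-matrix
    operations, presumably because it has no memory access.
    """
    if op_class not in OP_CLASSES:
        raise ValueError(f"unknown operation class: {op_class}")
    processors = ["CPU", "FP"]
    if op_class in NON_MATRIX:
        processors.insert(1, "VP")
    return processors
```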
Execution graph: Vector Scalar Add Operation
  • In this example, a vector of length n is segmented into m vectors of length 4 in the CPU function vsAdd.
  • The vertex program vsAdd.cg is loaded onto the vertex processor, and the scalar value is passed as a parameter.
  • Subsequently, the CPU function vsAdd streams the set of m vectors onto the GPU as OpenGL point primitives. The vertex program vsAdd.cg adds the scalar value to all fields of the m vertices.
  • These vertices then proceed to the fragment processor and are written to framebuffer memory.
  • The CPU function vsAdd then reads the color values off the pixel representation of each vertex. These color values contain the result of the vector scalar add.
  • Lastly, the CPU function concatenates the sequence of color values into a vector of length n as the result.

[Figure: execution graph. Input (Vectors, Vectors) → CPU (vAdd) → [Vertex]m as GL_POINTS → Vertex Processor [vAdd.cg] → [Vertex]m → Fragment Processor [None] → PBuffer / Texture Mem (TextureDatam) → [Texture Color values]m → CPU (vAdd) → (Vectors).]
Execution graph: Vector Vector Add Operation
  • In this example, 2 vectors of length s are transformed into texture data in the CPU function vAdd.
  • The fragment program vAdd.cg and the texture data are loaded onto the fragment processor and into GPU memory, respectively.
  • Subsequently, the CPU function vAdd draws a quadrilateral primitive covering s pixels.
  • The vertex processor does nothing and passes the vertices on to the rasterizer, which converts them into a pixel representation.
  • The rasterizer creates the s pixels for fragment processing.
  • For each pixel, the fragment processor looks up the values from both textures and determines the color value of the pixel. These pixels are written to PBuffer memory.
  • The CPU function vAdd then reads the color values off each pixel. These color values contain the result of the vector vector add.
  • The output in the PBuffer is then converted into a texture entry.
  • Lastly, the CPU function reads the texture entry and concatenates the sequence of color values into a vector of length s as the result.

[Figure: execution graph. CPU (vAdd) → [Vertex4]m as GL_QUAD → Vertex Processor [None] → [Vertex4]m → Fragment Processor [vAdd.cg] with Texture Mem (TextureData1m, TextureData2m) → PBuffer (TextureData3m) → [Texture Color values]m → CPU (vAdd) → (Vectors).]
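The same steps can be emulated on the CPU in Python. Again this is only a sketch: `fragment_program` stands in for vAdd.cg, textures and the PBuffer are plain lists, and storing one vector element per texel is an assumption made to match the "s pixels" wording above.

```python
def to_texture(vec):
    """CPU step: turn a vector into texture data, one element per texel."""
    return list(vec)


def rasterize(num_pixels):
    """Stand-in for the rasterizer: yield one pixel index per covered pixel."""
    return range(num_pixels)


def fragment_program(tex_a, tex_b, pixel):
    """Stand-in for vAdd.cg: look up both textures at this pixel and add."""
    return tex_a[pixel] + tex_b[pixel]


def v_add(x, y):
    """Emulate the vector vector add pipeline end to end."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    tex_a, tex_b = to_texture(x), to_texture(y)          # load texture data
    pbuffer = [fragment_program(tex_a, tex_b, p)         # one fragment per pixel
               for p in rasterize(len(x))]
    result_texture = list(pbuffer)                       # PBuffer -> texture entry
    return list(result_texture)                          # CPU reads back the result
```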

Execution graph: 2 Vector Vector Add Operations
  • In this example, we perform 2 separate vector vector add operations.
  • The 1st operation proceeds as described earlier for the vector vector add operation.
  • The output of the 1st operation is used as input for the 2nd operation.
  • Since it is the same operation, we do not load a new vertex or fragment program. However, we do load new texture data.
  • The 2nd operation proceeds as normal.
  • Lastly, the CPU function concatenates the sequence of color values into a vector of length s as the result.

[Figure: execution graph. CPU (vAdd) → [Vertex4]m as GL_QUAD → Vertex Processor [None] → [Vertex4]m → Fragment Processor [vAdd.cg] with Texture Mem (TextureData3m, TextureData4m) and PBuffer → [Texture Color values]m → CPU (vAdd) → (Vectors).]
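Chaining two adds can be sketched with a toy model that counts CPU-to-GPU transfers. `FakeGpu` and `add_three` are hypothetical names, and the sketch assumes the intermediate result stays in texture memory between the two operations, so only the new operand is uploaded.

```python
class FakeGpu:
    """Toy stand-in for GPU texture memory that counts transfers."""

    def __init__(self):
        self.textures = {}
        self.uploads = 0
        self.readbacks = 0
        self._next_handle = 0

    def _store(self, data):
        handle = self._next_handle
        self._next_handle += 1
        self.textures[handle] = data
        return handle

    def upload(self, vec):
        """CPU -> GPU: load a vector as texture data."""
        self.uploads += 1
        return self._store(list(vec))

    def add(self, a, b):
        """Run the already loaded add program; the result stays on the GPU."""
        return self._store([p + q for p, q in
                            zip(self.textures[a], self.textures[b])])

    def read(self, handle):
        """GPU -> CPU: read the color values back."""
        self.readbacks += 1
        return list(self.textures[handle])


def add_three(gpu, x, y, z):
    """Compute (x + y) + z as two chained vector vector adds."""
    tx, ty = gpu.upload(x), gpu.upload(y)
    partial = gpu.add(tx, ty)       # 1st operation
    tz = gpu.upload(z)              # only the new texture data is loaded
    total = gpu.add(partial, tz)    # 2nd operation reuses the 1st output
    return gpu.read(total)
```

Under these assumptions, three input vectors cost three uploads and a single readback, instead of a readback and re-upload of the intermediate result.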

Performance Issues
  • Representation inefficiency
    • Memory
      • Data stored both in CPU and GPU
    • Communication costs
      • Loading data onto GPU
      • Reading data from GPU
  • Execution inefficiencies
    • Computation setup overhead
      • Remodeling CPU data for GPU
    • Problem execution time
      • Rendering
      • Texture lookups
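One way to read the list above is as a simple additive cost model. The symbols below are assumptions introduced for illustration; the slides do not give this formula or any measured values.

```latex
T_{\text{GPU}} = T_{\text{setup}} + T_{\text{upload}} + T_{\text{exec}} + T_{\text{readback}}
```

Here T_setup is the cost of remodeling CPU data for the GPU, T_upload and T_readback are the communication costs, and T_exec covers rendering and texture lookups. Under this reading, offloading an operation only pays off when T_GPU is smaller than the time the CPU would need for the same operation.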
Observations
  • Fixed-point operations are much faster than floating-point (FP16/FP32) operations
  • FP16 and FP32 operations have similar performance
  • VP is slower than FP
    • Operation mappings involving both VP and FP result in inefficient pipeline
Observations
  • Simple operations perform better on CPU
  • It is best to design the whole algorithm as a single VP/FP program
    • Memory cost for storing intermediate results
    • Execution cost ?
    • More textures result in decreased performance
Bug Reports Filed!
  • Incorrect dump of floating-point values after render-to-texture [NVIDIA confirmed]
  • cgSetcolor parameter does not update alpha values [Awaiting reply]
Recommendations (3D Graphics Hackers)
  • Load important data into video memory
  • Make maximum use of the fixed-point pipeline
  • Code optimization is important (instructions, memory)
  • Upgrade your video card drivers (must!)
    • Hacking graphics hardware is a *real* pain!
Recommendations (Cg)
  • Pointers would be meaningful for numerical computing
  • Texture fetch instructions (add. Offsets)
  • Accumulation registers (sum)
  • Preserving state across multiple calls
  • Introduce stack mechanisms
  • Introduce bitwise operators
Recommendations (Hardware)
  • Allow the GPU to read from and write to CPU memory
  • VP and FP as 1st class processors on GPU
    • Similar cores and instruction sets
    • Allow full parallelism
  • Allow the CPU to read/write all registers in the GPU processors
  • Introduce a stack
  • Introduce bitwise operators
Deliverables!
  • A draft subset of the BLAS library
  • Architecture Insights (issues/constraints)
  • NV30 Improvements (Bug reports)
  • Technical Write-up