
Enhancing GPU for Scientific Computing


zanthe

Presentation Transcript


  1. Enhancing GPU for Scientific Computing: Some thoughts

  2. Outline • Motivation • Related work • BLAS Library • Execution Model • Benchmarks • Recommendations

  3. Motivation • GPU Computing • Vector and Fragment Processors: streaming (super)computers • Enormous performance! • ATI 9700, NV30 • They have become programmable • Emerging application areas • Numerical Sim. [Schroder’03], Sorting, Genomics, etc. • Goal: Scientific Computing

  4. Motivation • Most software built from small-efficient parts • Scientific apps built on top of s/w library routines • Harnessing GPU resources • Arithmetic Intensive • Data parallel • BLAS Library

  5. Related work • Using non-programmable GPUs • [Erik’01] prog. vertex engine for lighting/morphing • [Oskin’02] vector processing using VP • [Ian’03] stream processing using FP • Problems: • Monolithic big programs • Only one of VP or FP is used • CPU in passive mode • No cascading loop-backs (parallelism, setup times)

  6. BLAS Library • BLAS (Basic Linear Algebra Subprograms) • Building blocks for vector and matrix operations • development of highly efficient linear algebra software • LINPACK and LAPACK • Operations • Scalar – Vector • Vector – Vector • Vector – Matrix • Matrix – Matrix
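The four operation classes listed on this slide can be sketched in plain Python. This is only a CPU reference sketch, not the GPU library the slides describe; the function names are hypothetical, and the mapping in the comments to the standard BLAS routines SCAL, DOT, GEMV and GEMM is given for orientation only.

```python
def scalar_vector(alpha, x):
    """Scalar-vector: scale every element of x (compare BLAS SCAL)."""
    return [alpha * xi for xi in x]


def vector_vector(x, y):
    """Vector-vector: dot product of x and y (compare BLAS DOT)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


def matrix_vector(A, x):
    """Matrix-vector: one dot product per row of A (compare BLAS GEMV)."""
    return [vector_vector(row, x) for row in A]


def matrix_matrix(A, B):
    """Matrix-matrix: one dot product per (row of A, column of B) pair
    (compare BLAS GEMM)."""
    cols = list(zip(*B))  # columns of B
    return [[vector_vector(row, col) for col in cols] for row in A]
```

Each class is built from the one before it, which is the "small efficient parts" argument made in the motivation slides.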

  7. Mapping • Operation-to-processor mapping • CPU/FP: all operations • VP: non-matrix operations (no memory access) • Restricted data-flows • CPU FP • VP CPU

  8. Execution graph: Vector Scalar Add Operation • In this example, a vector of length n is segmented into m vectors of length 4 in the CPU function vsAdd. • The vertex program vsAdd.cg is loaded onto the vertex processor, and the scalar value is passed to it as a parameter. • The CPU function vsAdd then streams the set of m vectors to the GPU as OpenGL point primitives (GL_POINTS). The vertex program vsAdd.cg adds the scalar value to every field of the m vertices. • The vertices then pass through the fragment processor, which runs no program, and are written to framebuffer memory. • The CPU function vsAdd then reads back the color value of each pixel that represents a vertex. These color values contain the result of the vector-scalar add. • Lastly, the CPU function concatenates the sequence of color values into a result vector of length n. • [Diagram: CPU function → [Vertex]m (GL_POINTS) → vertex processor (program loaded) → [Vertex]m → fragment processor (no program) → PBuffer / texture memory → [Texture color values]m → CPU function]
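The steps on this slide can be emulated on the CPU as a rough sketch. Nothing here touches OpenGL or Cg: the "vertex program" is an ordinary Python function, the framebuffer is a list, and all names (segment, vertex_program, vs_add) are hypothetical. Zero-padding the last 4-component chunk is an assumption; the slide does not say how lengths that are not multiples of 4 are handled.

```python
def segment(vec, width=4):
    """Split vec into chunks of `width` components (one per 'vertex').
    Assumption: the last chunk is zero-padded."""
    padded = list(vec) + [0.0] * (-len(vec) % width)
    return [tuple(padded[i:i + width]) for i in range(0, len(padded), width)]


def vertex_program(vertex, scalar):
    """Stand-in for vsAdd.cg: add the scalar to every field of one vertex."""
    return tuple(component + scalar for component in vertex)


def vs_add(vec, scalar):
    """Emulate the CPU function vsAdd: segment, 'render', read back, concatenate."""
    vertices = segment(vec)                                      # m vectors of length 4
    framebuffer = [vertex_program(v, scalar) for v in vertices]  # one pixel per vertex
    flat = [c for pixel in framebuffer for c in pixel]           # read back color values
    return flat[:len(vec)]                                       # drop padding, length n
```

For example, `vs_add([1, 2, 3, 4, 5], 10)` produces two emulated vertices and returns a vector of the original length 5.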

  9. Execution graph: Vector Vector Add Operation • In this example, 2 vectors of length s are transformed into texture data in the CPU function vAdd. • The fragment program vAdd.cg and the texture data are loaded onto the fragment processor and GPU memory, respectively. • The CPU function vAdd then draws a quadrilateral primitive (GL_QUAD) covering s pixels. • The vertex processor runs no program and passes the vertices on to the rasterizer. • The rasterizer creates the s pixels for fragment processing. • For each pixel, the fragment processor looks up the values from both textures and determines the color value of the pixel. These pixels are written to PBuffer memory. • The CPU function vAdd then reads back the color value of each pixel. These color values contain the result of the vector-vector add. • The output in the PBuffer is then converted into a texture entry. • Lastly, the CPU function reads the texture entry and concatenates the sequence of color values into a result vector of length s. • [Diagram: vAdd (CPU) → [Vertex4]m (GL_QUAD) → vertex processor [None] → [Vertex4]m → fragment processor [vAdd.cg], reading TextureData1m and TextureData2m from texture memory → PBuffer → TextureData3m → [Texture color values]m → vAdd (CPU)]
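The fragment-processor path can be sketched the same way. This is a CPU emulation under stated assumptions, not the actual vAdd.cg: textures are Python lists with one value per pixel (following the slide text, "s pixels"), the rasterizer is a loop over pixel indices, and the names to_texture, fragment_program and v_add are hypothetical.

```python
def to_texture(vec):
    """Stand-in for converting a vector into texture data (one value per texel)."""
    return list(vec)


def fragment_program(pixel, tex_a, tex_b):
    """Stand-in for vAdd.cg: look up both textures at this pixel and add."""
    return tex_a[pixel] + tex_b[pixel]


def v_add(x, y):
    """Emulate the CPU function vAdd for two vectors of length s."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    tex_a, tex_b = to_texture(x), to_texture(y)   # upload both operands
    s = len(tex_a)
    # 'Rasterize' a quad covering s pixels and run the fragment program on each.
    pbuffer = [fragment_program(p, tex_a, tex_b) for p in range(s)]
    result_texture = list(pbuffer)                # PBuffer converted to a texture entry
    return result_texture                         # read back as a vector of length s
```

The point of the sketch is the shape of the computation: the add itself happens once per pixel, and everything else is data movement between the CPU and the emulated GPU memory.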

  10. Execution graph: 2 Vector Vector Add Operations • In this example, we perform 2 separate vector-vector add operations. • The 1st operation proceeds as described earlier for the vector-vector add operation. • The output of the 1st operation is used as input for the 2nd operation. • Since it is the same operation, we do not load a new vertex or fragment program. However, we do load new texture data. • The 2nd operation then proceeds as normal. • Lastly, the CPU function concatenates the sequence of color values into a result vector of length s. • [Diagram: vAdd (CPU) → [Vertex4]m (GL_QUAD) → vertex processor [None] → [Vertex4]m → fragment processor [vAdd.cg], reading TextureData3m and the newly loaded TextureData4m from texture memory → PBuffer → [Texture color values]m → vAdd (CPU)]
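A small sketch of the cascading idea, again as a CPU emulation: the result of the first add stays in emulated GPU memory and feeds the second add, the program is loaded only once, and only the new operand is uploaded. The FakeGPU class, its counters, and the texture slot names are hypothetical and exist only to make the saved work visible.

```python
class FakeGPU:
    """Toy model of GPU state: one loaded program plus named textures."""

    def __init__(self):
        self.program = None       # (name, function) currently loaded
        self.program_loads = 0    # how many times a program was actually loaded
        self.uploads = 0          # how many textures were uploaded from the CPU
        self.textures = {}

    def load_program(self, name, fn):
        # Reloading the same program is skipped, as on the slide.
        if self.program is None or self.program[0] != name:
            self.program = (name, fn)
            self.program_loads += 1

    def upload(self, slot, data):
        self.textures[slot] = list(data)
        self.uploads += 1

    def render(self, in_a, in_b, out):
        fn = self.program[1]
        a, b = self.textures[in_a], self.textures[in_b]
        self.textures[out] = [fn(p, q) for p, q in zip(a, b)]

    def read_back(self, slot):
        return list(self.textures[slot])


def add3(x, y, z):
    """Compute (x + y) + z with two cascaded vector-vector adds."""
    gpu = FakeGPU()
    gpu.load_program("vAdd", lambda p, q: p + q)
    gpu.upload("tex1", x)
    gpu.upload("tex2", y)
    gpu.render("tex1", "tex2", "tex3")             # 1st add, result stays on the GPU
    gpu.load_program("vAdd", lambda p, q: p + q)   # same program: no reload
    gpu.upload("tex4", z)                          # only the new operand is uploaded
    gpu.render("tex3", "tex4", "tex3")             # 2nd add reuses the 1st result
    return gpu.read_back("tex3"), gpu
```

In this toy model, the second operation costs one upload and one render; the intermediate result is never read back to the CPU.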

  11. Performance Issues • Representation inefficiency • Memory • Data stored both in CPU and GPU • Communication costs • Loading data onto GPU • Reading data from GPU • Execution inefficiencies • Computation setup overhead • Remodeling CPU data for GPU • Problem execution time • Rendering • Texture lookups

  12. Observations • Fixed-point operations are much faster than FP16/FP32 operations • FP16/FP32 operations have similar performance • VP is slower than FP • Operation mappings involving both VP and FP result in inefficient pipeline

  13. Observations • Simple operations perform better on CPU • Best to design whole algorithm as single VP/FP program • Memory cost for storing intermediate results • Execution cost? • More textures result in decreased performance

  14. Bug Reports Filed! • Incorrect dump of floating point values after render to texture [NVIDIA confirmed] • cgSetcolor parameter does not update alpha values [Awaiting reply]

  15. Recommendations (3D Graphics Hackers) • Load important data into Video memory • Maximum use of Fixed-point Pipeline • Code optimization important (Instr., Memory) • Upgrade your video card drivers (must!) • Hacking graphics hardware is a *real* pain!

  16. Recommendations (Cg) • Pointers meaningful for numerical computing • Texture fetch instructions (add. offsets) • Accumulation registers (sum) • Preserving state across multiple calls • Introduce stack mechanisms • Introduce bitwise operators

  17. Recommendations (Hardware) • Allow GPU to read/write from CPU memory • VP and FP as 1st class processors on GPU • Similar cores and instruction sets • Allow full parallelism • Allow CPU to read/write all registers in GPU processors • Introduce a stack • Introduce bitwise operators

  18. Deliverables! • A draft subset of the BLAS library • Architecture Insights (issues/constraints) • NV30 Improvements (Bug reports) • Technical Write-up

  19. The End
