
On Using Graphics Hardware for Scientific Computing


Presentation Transcript


  1. On Using Graphics Hardware for Scientific Computing
     Stan Tomov
     June 23, 2006

  2. Outline
  • Motivation
  • Literature review
  • The graphics pipeline
  • Programmable GPUs
  • Some application examples
  • Performance results
  • Conclusion

  3. Motivation
  Table 1. GPU vs CPU in rendering polygons. The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes.

  4. Motivation
  • High flops count (currently 200 GFlops, single precision)
  • Competitive price/performance (less than 1 cent per MFlop)
  • Performance doubling every 6 months
  • Continuously increasing functionality and programmability
  • Realistic games require more complicated physics (picture: from the GPU Gems 2 book)

  5. Literature review
  Using graphics hardware for non-graphics applications (just a few examples):
  • Cellular automata
  • Reaction-diffusion simulation (Mark Harris, University of North Carolina)
  • Matrix multiply (E. Larsen and D. McAllister, University of North Carolina)
  • Lattice Boltzmann (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook)
  • CG and multigrid (J. Bolz et al., Caltech, and N. Goodnight et al., University of Virginia)
  • Convolution (University of Stuttgart)
  • BLAS 1 and 2; FFT; certain eigensolvers; etc.
  • See also the GPGPU homepage: http://www.gpgpu.org/

  6. Literature review
  Typical performance results reported (by the middle of 2003):
  • Significant GPU-over-CPU speedups are reported when the GPU performs low-precision computations (30 to 60 times, depending on the configuration) - integers (8- or 12-bit arithmetic) or 16-bit floating point
  • Vendor advertisements claiming very high performance assume low-precision arithmetic
  • NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second
  • The GPU's 32-bit flops performance is comparable to the CPU's (maybe 2-4 times faster, depending on application and configuration)

  7. The graphics pipeline
  • GeForce 256 (August 1999) - allowed a certain degree of programmability; before it: the fixed-function pipeline
  • GeForce 3 (February 2001) - considered the first fully programmable GPU
  • GeForce 4 - partial 16-bit floating point arithmetic
  • NV30 - 32-bit floating point
  • Cg - high-level programming language

  8. The graphics pipeline
  • GPUs are on their way to becoming programmable stream processors
  • Stream formulation of the graphics pipeline: all data viewed as streams, all computation as kernels
  • Streaming enables:
    - Efficient computation (a high degree of parallelism; deep pipelines)
    - Efficient communication (efficient off-chip communication; intermediate results kept on chip; deep pipelining allows a high degree of latency tolerance)
  (picture: from the GPU Gems 2 book)

  9. Programmable GPUs (in particular NV30)
  • GPU programming model: streaming
    - Naturally addresses parallelism and communication
    - Easy when the problem maps well
  • Supports floating point operations
  • Vertex program
    - Replaces the fixed-function pipeline for vertices
    - Manipulates single-vertex data
    - Executes for every vertex
  • Fragment program
    - Similar to a vertex program, but for pixels
  • Programming in Cg: a high-level language that looks like C; portable; Cg programs compile to assembly code (a minimal sketch follows below)
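
  To make the model concrete, here is a minimal Cg vertex program. It is an illustrative sketch, not code from the talk; the parameter names (mvp, oPosition, oColor) are assumptions.

    // Minimal Cg vertex program: executes once for every vertex,
    // replacing the fixed-function transform-and-lighting stage.
    void main(float4 position : POSITION,
              float4 color    : COLOR,
              uniform float4x4 mvp,           // modelview-projection matrix
              out float4 oPosition : POSITION,
              out float4 oColor    : COLOR)
    {
        oPosition = mul(mvp, position);       // per-vertex transform
        oColor    = color;                    // pass the color through
    }

  The Cg compiler turns such a program into GPU assembly; a fragment program looks much the same but executes once per pixel.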

  10. Block Diagram of GeForce FX
  • AGP 8x graphics bus bandwidth: 2.1 GB/s
  • Local memory bandwidth: 16 GB/s
  • Chip officially clocked at 500 MHz
  • Vertex processor: executes vertex shaders or emulates fixed transformations and lighting (T&L)
  • Pixel processor: executes pixel shaders or emulates fixed shaders; 2 int and 1 float ops, or 2 texture accesses, per clock cycle
  • Texture & color interpolators: interpolate texture coordinates and color values
  • Performance (on processing 4D vectors):
    - Vertex ops/sec: 1.5 Gops
    - Pixel ops/sec: 8 Gops (int) or 4 Gops (float)
  Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.

  11. Block Diagram of GeForce FX: 3 vertex and 8 pixel processors. For comparison, the latest NVIDIA card, the dual-GPU GeForce 7950 GX2, has 32 vertex and 96 pixel processors.
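
  As a sanity check, these unit counts are consistent with the peak rates on the previous slide: 3 vertex processors x 0.5 GHz = 1.5 G vertex ops/s, and 8 pixel processors x 0.5 GHz x 2 int ops = 8 Gops/s (int), or x 1 float op = 4 Gops/s (float).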

  12. Summary of CPU vs GPU
  • General vs specialized hardware
    - CPUs have more complex control hardware
    - GPUs can have hardware acceleration for specific tasks
  • Sequential vs parallel programming models
    - In general, CPUs don't have the GPU's level of data parallelism (though some is available: Intel's SSE and PowerPC's AltiVec instruction sets)
  • Memory latency vs bandwidth optimization

  13. Some application examples
  • Monte Carlo simulations
    - Used in a variety of simulations in physics, finance, chemistry, etc.
    - Based on probability and statistics; they use random numbers
    - A classical example: compute the area of a circle
  • Computation of expected values: <A> = sum over the N possible states of A_i P_i
    - N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, N = k^(1024 x 1024)
  • Random number generation. We used a linear congruential type generator: x_{n+1} = (a x_n + c) mod m (see the sketch below)
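
  A hedged sketch of how these pieces map to fragment programs in the Cg style the talk describes; the texture layout, kernel names, and uniforms are assumptions, not the talk's actual code. Each kernel is one rendering pass over a lattice-sized quad, with one entry point compiled per pass.

    // Kernel 1: one step of a linear congruential generator,
    // x_{n+1} = (a*x_n + c) mod m, advanced for every stream element
    // (one fragment per element). With 32-bit floats only small
    // moduli m are exact - cf. the precision discussion on slide 6.
    float4 lcg_step(float2 tc : TEXCOORD0,
                    uniform samplerRECT state,   // current x_n values
                    uniform float a,
                    uniform float c,
                    uniform float m) : COLOR
    {
        float x = texRECT(state, tc).r;
        return float4(fmod(a * x + c, m), 0.0, 0.0, 1.0);
    }

    // Kernel 2: the classical circle-area example. Each fragment
    // tests one random point in [0,1]^2; averaging the outputs
    // (a separate reduction pass) estimates pi/4.
    float4 in_circle(float2 tc : TEXCOORD0,
                     uniform samplerRECT points) : COLOR
    {
        float2 p = texRECT(points, tc).xy;
        float hit = (dot(p, p) <= 1.0) ? 1.0 : 0.0;
        return float4(hit, hit, hit, 1.0);
    }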

  14. Some application examples
  • Monte Carlo simulations
    - Ising model: a simplified model for magnets; evolve the system into "higher probability" states and compute expected values as an average over only those states (a sketch of the update kernel follows below)
    - Percolation: used in studies of disease spreading, flow in porous media, forest fire propagation, clustering, etc.
  • Lattice Boltzmann method
    - Simulates fluid flow; particles are allowed to move and collide on a lattice
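
  A hedged sketch of a Metropolis-style Ising update as a Cg fragment program; the spin encoding, the rnd texture, and the invT uniform are illustrative assumptions.

    // One Metropolis update of the 2D Ising model (coupling J = 1).
    // 'spins' stores +1/-1 in the red channel; 'rnd' holds uniform
    // random numbers in [0,1). In practice only half the sites (one
    // checkerboard color) are updated per pass so that neighbor
    // reads stay consistent.
    float4 ising_update(float2 tc : TEXCOORD0,
                        uniform samplerRECT spins,
                        uniform samplerRECT rnd,
                        uniform float invT) : COLOR   // 1/(kT)
    {
        float s  = texRECT(spins, tc).r;
        float nb = texRECT(spins, tc + float2( 1.0,  0.0)).r
                 + texRECT(spins, tc + float2(-1.0,  0.0)).r
                 + texRECT(spins, tc + float2( 0.0,  1.0)).r
                 + texRECT(spins, tc + float2( 0.0, -1.0)).r;
        float dE = 2.0 * s * nb;             // energy change if s flips
        // Accept the flip with probability min(1, exp(-dE/kT)),
        // which drives the system toward higher-probability states.
        float flip = (texRECT(rnd, tc).r < exp(-dE * invT)) ? -1.0 : 1.0;
        return float4(s * flip, 0.0, 0.0, 1.0);
    }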

  15. Some performance results
  • saxpy on 512 x 512 (x 4) vectors: about 1 GFlop/s (kernel sketched below)
    - speed limited by GPU memory bandwidth (16 GB/s)
  • sin, cos, exp, log: 20 times faster than on a Pentium 4, 2.8 GHz
    - hardware accelerated, at low accuracy
  • Ising model: about 7 GFlops
    - 44% of the theoretical maximum
    - the fragment program compiles to 109 assembly instructions
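
  For reference, saxpy is essentially a one-line fragment program; this hedged sketch assumes x and y are stored as RGBA float textures, which is where the "(x 4)" comes from - four components per texel.

    // saxpy (y = alpha*x + y) as a Cg fragment program: each fragment
    // reads one 4-vector from each input texture, so a 512 x 512 pass
    // processes 4 x 512 x 512 scalar elements.
    float4 saxpy(float2 tc : TEXCOORD0,
                 uniform samplerRECT x,
                 uniform samplerRECT y,
                 uniform float alpha) : COLOR
    {
        return alpha * texRECT(x, tc) + texRECT(y, tc);
    }

  With only 8 flops per fragment against roughly 48 bytes of texture traffic, the kernel does little arithmetic per byte moved, which is why the slide attributes the 1 GFlop/s figure to the 16 GB/s memory bandwidth rather than to compute.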

  16. Conclusions
  • What to expect from future GPGPUs? Can GPGPUs influence future computer systems? (HPC, and consequently our models of software development: is IBM's Cell processor already an example?)
  • Current trends:
    - CPU: multi-core
    - GPU: a more powerful streaming model (gather, scatter, conditional streams, reductions, etc.) and more CPU functionality
