
On Using Graphics Hardware for Scientific Computing


Presentation Transcript


  1. On Using Graphics Hardware for Scientific Computing
     Stan Tomov
     June 23, 2006

  2. Outline
  • Motivation
  • Literature review
  • The graphics pipeline
  • Programmable GPUs
  • Some application examples
  • Performance results
  • Conclusion

  3. Motivation
  Table 1. GPU vs CPU in rendering polygons. The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes.

  4. Motivation
  • High flops count (currently 200 GFlops, single precision)
  • Competitive price/performance (less than 1 cent per MFlop)
  • Performance doubling every 6 months
  • Continuously increasing functionality and programmability
  • Realistic games require more complicated physics (picture: from the GPU Gems 2 book)

  5. Literature review
  Using graphics hardware for non-graphics applications (just a few examples):
  • Cellular automata
  • Reaction-diffusion simulation (Mark Harris, University of North Carolina)
  • Matrix multiply (E. Larsen and D. McAllister, University of North Carolina)
  • Lattice Boltzmann (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook)
  • CG and multigrid (J. Bolz et al., Caltech, and N. Goodnight et al., University of Virginia)
  • Convolution (University of Stuttgart)
  • BLAS 1 and 2; FFT; certain eigensolvers; etc.
  • See also the GPGPU homepage: http://www.gpgpu.org/

  6. Literature review
  Typical performance results reported (by the middle of 2003):
  • Significant GPU-over-CPU speedups are reported when the GPU performs low-precision computations (30 to 60 times, depending on the configuration) - integers (8- or 12-bit arithmetic) or 16-bit floating point
  • Vendor advertisements claiming very high performance assume low-precision arithmetic
  • NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second
  • The GPU's 32-bit flops performance is comparable to the CPU's (maybe 2-4 times faster, depending on application and configuration)

  7. The graphics pipeline
  • GeForce 256 (August 1999) - allowed a certain degree of programmability; before it: the fixed-function pipeline
  • GeForce 3 (February 2001) - considered the first fully programmable GPU
  • GeForce 4 - partial 16-bit floating point arithmetic
  • NV30 - 32-bit floating point
  • Cg - high-level programming language

  8. The graphics pipeline
  • GPUs are on their way to becoming programmable stream processors
  • Stream formulation of the graphics pipeline: all data viewed as streams, all computation as kernels
  • Streaming enables:
    - Efficient computation (a high degree of parallelism; deep pipelines)
    - Efficient communication (efficient off-chip communication; intermediate results kept on chip; deep pipelining allows a high degree of latency tolerance)
  (picture: from the GPU Gems 2 book)

  9. Programmable GPUs (in particular NV30)
  • GPU programming model: streaming
    - Naturally addresses parallelism and communication
    - Easy when the problem maps well
  • Supports floating point operations
  • Vertex program
    - Replaces the fixed-function pipeline for vertices
    - Manipulates single-vertex data
    - Executes for every vertex
  • Fragment program
    - Similar to a vertex program, but for pixels
  • Programming in Cg: a high-level language that looks like C; portable; Cg programs compile to assembly code (a minimal sketch follows below)
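
  To make the model concrete, here is a minimal Cg vertex program. It is an illustrative sketch, not code from the talk; the parameter names (mvp, oPosition, oColor) are assumptions.

    // Minimal Cg vertex program: executes once for every vertex,
    // replacing the fixed-function transform-and-lighting stage.
    void main(float4 position : POSITION,
              float4 color    : COLOR,
              uniform float4x4 mvp,           // modelview-projection matrix
              out float4 oPosition : POSITION,
              out float4 oColor    : COLOR)
    {
        oPosition = mul(mvp, position);       // per-vertex transform
        oColor    = color;                    // pass the color through
    }

  The Cg compiler turns such a program into GPU assembly; a fragment program looks much the same but executes once per pixel.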

  10. Block Diagram of GeForce FX
  • AGP 8x graphics bus bandwidth: 2.1 GB/s
  • Local memory bandwidth: 16 GB/s
  • Chip officially clocked at 500 MHz
  • Vertex processor: executes vertex shaders or emulates fixed transformations and lighting (T&L)
  • Pixel processor: executes pixel shaders or emulates fixed shaders; 2 int and 1 float ops, or 2 texture accesses, per clock cycle
  • Texture & color interpolators: interpolate texture coordinates and color values
  • Performance (on processing 4D vectors):
    - Vertex ops/sec: 1.5 Gops
    - Pixel ops/sec: 8 Gops (int) or 4 Gops (float)
  Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.

  11. Block Diagram of GeForce FX: 3 vertex and 8 pixel processors. For comparison, the latest NVIDIA card, the dual-GPU GeForce 7950 GX2, has 32 vertex and 96 pixel processors.
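
  As a sanity check, these unit counts are consistent with the peak rates on the previous slide: 3 vertex processors x 0.5 GHz = 1.5 G vertex ops/s, and 8 pixel processors x 0.5 GHz x 2 int ops = 8 Gops/s (int), or x 1 float op = 4 Gops/s (float).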

  12. Summary of CPU vs GPU
  • General vs specialized hardware
    - CPUs have more complex control hardware
    - GPUs can have hardware acceleration for specific tasks
  • Sequential vs parallel programming models
    - In general, CPUs don't have the GPU's level of data parallelism (though some is available: Intel's SSE and PowerPC's AltiVec instruction sets)
  • Memory latency vs bandwidth optimization

  13. Some application examples
  • Monte Carlo simulations
    - Used in a variety of simulations in physics, finance, chemistry, etc.
    - Based on probability and statistics; they use random numbers
    - A classical example: compute the area of a circle
  • Computation of expected values: <A> = sum over the N possible states of A_i P_i
    - N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, N = k^(1024 x 1024)
  • Random number generation. We used a linear congruential type generator: x_{n+1} = (a x_n + c) mod m (see the sketch below)
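
  A hedged sketch of how these pieces map to fragment programs in the Cg style the talk describes; the texture layout, kernel names, and uniforms are assumptions, not the talk's actual code. Each kernel is one rendering pass over a lattice-sized quad, with one entry point compiled per pass.

    // Kernel 1: one step of a linear congruential generator,
    // x_{n+1} = (a*x_n + c) mod m, advanced for every stream element
    // (one fragment per element). With 32-bit floats only small
    // moduli m are exact - cf. the precision discussion on slide 6.
    float4 lcg_step(float2 tc : TEXCOORD0,
                    uniform samplerRECT state,   // current x_n values
                    uniform float a,
                    uniform float c,
                    uniform float m) : COLOR
    {
        float x = texRECT(state, tc).r;
        return float4(fmod(a * x + c, m), 0.0, 0.0, 1.0);
    }

    // Kernel 2: the classical circle-area example. Each fragment
    // tests one random point in [0,1]^2; averaging the outputs
    // (a separate reduction pass) estimates pi/4.
    float4 in_circle(float2 tc : TEXCOORD0,
                     uniform samplerRECT points) : COLOR
    {
        float2 p = texRECT(points, tc).xy;
        float hit = (dot(p, p) <= 1.0) ? 1.0 : 0.0;
        return float4(hit, hit, hit, 1.0);
    }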

  14. Some application examples
  • Monte Carlo simulations
    - Ising model: a simplified model for magnets; evolve the system into "higher probability" states and compute expected values as an average over only those states (a sketch of the update kernel follows below)
    - Percolation: used in studies of disease spreading, flow in porous media, forest fire propagation, clustering, etc.
  • Lattice Boltzmann method
    - Simulates fluid flow; particles are allowed to move and collide on a lattice
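
  A hedged sketch of a Metropolis-style Ising update as a Cg fragment program; the spin encoding, the rnd texture, and the invT uniform are illustrative assumptions.

    // One Metropolis update of the 2D Ising model (coupling J = 1).
    // 'spins' stores +1/-1 in the red channel; 'rnd' holds uniform
    // random numbers in [0,1). In practice only half the sites (one
    // checkerboard color) are updated per pass so that neighbor
    // reads stay consistent.
    float4 ising_update(float2 tc : TEXCOORD0,
                        uniform samplerRECT spins,
                        uniform samplerRECT rnd,
                        uniform float invT) : COLOR   // 1/(kT)
    {
        float s  = texRECT(spins, tc).r;
        float nb = texRECT(spins, tc + float2( 1.0,  0.0)).r
                 + texRECT(spins, tc + float2(-1.0,  0.0)).r
                 + texRECT(spins, tc + float2( 0.0,  1.0)).r
                 + texRECT(spins, tc + float2( 0.0, -1.0)).r;
        float dE = 2.0 * s * nb;             // energy change if s flips
        // Accept the flip with probability min(1, exp(-dE/kT)),
        // which drives the system toward higher-probability states.
        float flip = (texRECT(rnd, tc).r < exp(-dE * invT)) ? -1.0 : 1.0;
        return float4(s * flip, 0.0, 0.0, 1.0);
    }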

  15. Some performance results
  • saxpy on 512 x 512 (x 4) vectors: about 1 GFlop/s (kernel sketched below)
    - speed limited by GPU memory bandwidth (16 GB/s)
  • sin, cos, exp, log: 20 times faster than on a Pentium 4, 2.8 GHz
    - hardware accelerated, at low accuracy
  • Ising model: about 7 GFlops
    - 44% of the theoretical maximum
    - the fragment program compiles to 109 assembly instructions
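
  For reference, saxpy is essentially a one-line fragment program; this hedged sketch assumes x and y are stored as RGBA float textures, which is where the "(x 4)" comes from - four components per texel.

    // saxpy (y = alpha*x + y) as a Cg fragment program: each fragment
    // reads one 4-vector from each input texture, so a 512 x 512 pass
    // processes 4 x 512 x 512 scalar elements.
    float4 saxpy(float2 tc : TEXCOORD0,
                 uniform samplerRECT x,
                 uniform samplerRECT y,
                 uniform float alpha) : COLOR
    {
        return alpha * texRECT(x, tc) + texRECT(y, tc);
    }

  With only 8 flops per fragment against roughly 48 bytes of texture traffic, the kernel does little arithmetic per byte moved, which is why the slide attributes the 1 GFlop/s figure to the 16 GB/s memory bandwidth rather than to compute.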

  16. Conclusions
  • What to expect from future GPGPUs? Can GPGPUs influence future computer systems? (HPC, and consequently our models of software development: is IBM's Cell processor already an example?)
  • Current trends:
    - CPU: multi-core
    - GPU: a more powerful streaming model (gather, scatter, conditional streams, reductions, etc.) and more CPU functionality
