Data Analysis and Visualization

Data Analysis and Visualization. Monte Carlo Simulations Using Programmable Graphics Cards. Stan Tomov (October 31, 2003). Presentation: http://www.ccd.bnl.gov/~tomov/GPU_SUNY_SB.ppt Article: http://www.ccd.bnl.gov/~tomov/IsingArticle.pdf


Presentation Transcript


  1. Data Analysis and Visualization Monte Carlo Simulations Using Programmable Graphics Cards Stan Tomov ( October 31, 2003 ) Presentation : http://www.ccd.bnl.gov/~tomov/GPU_SUNY_SB.ppt Article : http://www.ccd.bnl.gov/~tomov/IsingArticle.pdf

  2. Outline • Motivation • Literature review • The graphics pipeline • Programmable GPUs • Block diagram of nVidia's GeForce FX • Some probability based simulations - Monte Carlo simulations - Ising model - Percolation model • Implementation • Performance results and analysis • Extensions and future work • Conclusions

  3. Motivation The GPUs have: • High flops count (nVidia has listed 200 Gflops theoretical speed for NV30) • Competitive price/performance (0.1 cents per Mflop) • High rate of performance increase over time (doubling every 6 months) Table 1. GPU vs CPU in rendering polygons. The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes. Goal: explore the possibility of extending GPUs' use to non-graphics applications

  4. Literature review Using graphics hardware for non-graphics applications: • Cellular automata • Reaction-diffusion simulation (Mark Harris, University of North Carolina) • Matrix multiply (E. Larsen and D. McAllister, University of North Carolina) • Lattice Boltzmann computation (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook) • CG and multigrid (J. Bolz et al., Caltech, and N. Goodnight et al., University of Virginia) • Convolution (University of Stuttgart) • See also GPGPU’s homepage: http://www.gpgpu.org/ Performance results: • Significant speedups of GPU vs CPU (30 to 60 times, depending on the configuration) are reported if the GPU performs low precision computations: integers (8 or 12 bit arithmetic) or 16-bit floating point • Advertisements about very high performance assume low precision arithmetic: - NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second - currently $200 GPUs are capable of 1.2 trillion op/s • GPU’s 32-bit flops performance is comparable to the CPU’s

  5. The graphics pipeline GeForce 256 (August, 1999) - allowed a certain degree of programmability - before: fixed-function pipeline GeForce 3 (February, 2001) - considered the first fully programmable GPU GeForce 4 - partial 16-bit floating point arithmetic NV30 - 32-bit floating point Cg - high-level programming language

  6. Programmable GPUs (in particular NV30) • Support floating point operations • Vertex program - Replaces fixed-function pipeline for vertices - Manipulates single vertex data - Executes for every vertex • Fragment program - Similar to vertex program but for pixels • Programming in Cg: - High level language - Looks like C - Portable - Compiles Cg programs to assembly code

  7. Block diagram of GeForce FX • AGP 8x graphics bus bandwidth: 2.1 GB/s • Local memory bandwidth: 16 GB/s • Chip officially clocked at 500 MHz • Vertex processor: executes vertex shaders or emulates fixed transformations and lighting (T&L) • Pixel processor: executes pixel shaders or emulates fixed shaders; 2 int & 1 float ops or 2 texture accesses per clock cycle • Texture & color interpolators: interpolate texture coordinates and color values Performance (on processing 4D vectors): • Vertex ops/sec - 1.5 Gops • Pixel ops/sec - 8 Gops (int), or 4 Gops (float) Source: Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.

  8. Monte Carlo simulations • Used in a variety of simulations in physics, finance, chemistry, etc. • Based on probability and statistics, and use random numbers • A classical example: compute the area of a circle • Computation of expected values over the N possible states of a system; N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, N = $k^{1024 \times 1024}$ • Random number generation: we used a linear congruential type generator (1)
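
The circle-area example and the linear congruential generator mentioned on this slide can be sketched together on the CPU. This is a minimal illustration, not the generator from the talk: the constants `A`, `C`, and `M` are the widely published "Numerical Recipes" parameters, chosen here as an assumption, and the function names are hypothetical.

```python
# Assumed constants (the "Numerical Recipes" LCG); the talk does not list its own.
A, C, M = 1664525, 1013904223, 2**32


def lcg(seed):
    """Yield pseudo-random floats in [0, 1) from x_{n+1} = (A*x_n + C) mod M."""
    x = seed % M
    while True:
        x = (A * x + C) % M
        yield x / M


def estimate_circle_area(samples, seed=12345):
    """Estimate the unit circle's area by sampling the square [-1, 1] x [-1, 1]."""
    rng = lcg(seed)
    hits = 0
    for _ in range(samples):
        px = 2.0 * next(rng) - 1.0
        py = 2.0 * next(rng) - 1.0
        if px * px + py * py <= 1.0:
            hits += 1
    # The square has area 4; the hit fraction scales it down to the circle.
    return 4.0 * hits / samples


if __name__ == "__main__":
    print(estimate_circle_area(100_000))
```

The estimate approaches pi as the number of samples grows, with an error that shrinks roughly like one over the square root of the sample count.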

  9. Ising model • Simplified model for magnets (introduced by Wilhelm Lenz in 1920, further studied by his student Ernst Ising) • Modeled on a 2D lattice with a “spin” (corresponding to the orientation of electrons) at every cell, pointing up or down • Uses temperature to couple 2 opposing physical principles - minimization of the system's energy - entropy maximization • Want to compute - expected magnetization: - expected energy: • Evolve the system into “higher probability” states and compute expected values as averages over those states - evolving from state to state, based on a certain probability decision, is related to so-called Markov chains: W. Gilks, S. Richardson, and D. Spiegelhalter (Editors), Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996.

  10. Ising model computational procedure • Choose an absolute temperature of interest T (in Kelvin) • Color the lattice in a checkerboard manner • Start consecutive black and white “sweeps” • Change the spin at a site based on the procedure: 1. Denote the current state as S and the state with the flipped spin as S' 2. Compute $\Delta E = E(S') - E(S)$ 3. If $\Delta E < 0$ accept S'; else generate a random number $r \in [0, 1]$ and accept S' if $r < P(S')/P(S)$, where P(S) is given by the Boltzmann probability distribution function
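
The checkerboard procedure above can be sketched as a plain CPU loop. This is an illustrative sketch under stated assumptions, not the fragment program from the talk: it assumes nearest-neighbor coupling J = 1, no external field, periodic boundaries, an even lattice size, and temperature measured in units where the Boltzmann constant is 1. All function names are hypothetical.

```python
import math
import random


def delta_energy(spins, i, j, coupling=1.0):
    """Energy change from flipping spin (i, j), with periodic boundaries."""
    n = len(spins)
    neighbors = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
                 + spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
    return 2.0 * coupling * spins[i][j] * neighbors


def sweep(spins, temperature, color, rng):
    """Metropolis update of every site of one checkerboard color (0 or 1)."""
    n = len(spins)
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 != color:
                continue
            d_e = delta_energy(spins, i, j)
            # Accept if the energy drops; otherwise accept with the Boltzmann
            # ratio P(S')/P(S) = exp(-dE / T), assuming k_B = 1.
            if d_e < 0 or rng.random() < math.exp(-d_e / temperature):
                spins[i][j] = -spins[i][j]


def magnetization(spins):
    """Average spin over the lattice."""
    n = len(spins)
    return sum(map(sum, spins)) / (n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    lattice = [[rng.choice((-1, 1)) for _ in range(32)] for _ in range(32)]
    for _ in range(100):
        sweep(lattice, 2.0, 0, rng)  # "black" sweep
        sweep(lattice, 2.0, 1, rng)  # "white" sweep
    print(magnetization(lattice))
```

Sites of one color have only neighbors of the other color, so all sites in a sweep can be updated independently, which is what makes the scheme suitable for a per-pixel fragment program.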

  11. Percolation model • First studied by Broadbent and Hammersley in 1957 • Used in studies of disordered media (usually specified by a probability distribution) • Applied in studies of various phenomena such as spread of diseases, flow in porous media, forest fire propagation, clustering, etc. • Of particular interest are: - the threshold of the media model after which there exists a “spanning cluster” - relations between different media models - time to reach a steady state spanning cluster
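
To make the notion of a "spanning cluster" concrete, here is a small sketch of site percolation on a square lattice. It is an assumption-laden illustration rather than the model used in the talk: each cell is open independently with probability `p`, clusters use 4-neighbor connectivity, and "spanning" means connecting the top row to the bottom row. The function names are hypothetical.

```python
import random
from collections import deque


def make_lattice(n, p, rng):
    """Site percolation: each cell is open (True) with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def has_spanning_cluster(lattice):
    """Return True if open cells connect the top row to the bottom row."""
    n = len(lattice)
    seen = [[False] * n for _ in range(n)]
    queue = deque()
    # Start a breadth-first search from every open cell in the top row.
    for j in range(n):
        if lattice[0][j]:
            seen[0][j] = True
            queue.append((0, j))
    while queue:
        i, j = queue.popleft()
        if i == n - 1:
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < n and 0 <= b < n and lattice[a][b] and not seen[a][b]:
                seen[a][b] = True
                queue.append((a, b))
    return False


if __name__ == "__main__":
    rng = random.Random(1)
    for p in (0.4, 0.5, 0.6, 0.7):
        spans = sum(has_spanning_cluster(make_lattice(64, p, rng))
                    for _ in range(20))
        print(p, spans / 20)
```

Repeating the check over many random lattices for a range of `p` values gives an empirical estimate of where the spanning probability rises sharply.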

  12. Implementation Approaches: • Pure OpenGL (simulations using the fixed-function pipeline) • Shaders in assembly • Shaders in Cg Dynamic texturing: • Create a texture T (think of a 2D lattice) • Loop: - Render an image using T (in an off-screen buffer) - Update T from the resulting image
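
The dynamic texturing loop can be mimicked on the CPU to show the data flow. This is only an analogy under stated assumptions: a nested list stands in for the texture T, a Python function stands in for the fragment program, and no OpenGL calls are made. The names `render`, `simulate`, and `average_neighbors` are hypothetical.

```python
def render(texture, fragment_program):
    """Emulate one off-screen render pass: run the program once per texel."""
    height, width = len(texture), len(texture[0])
    return [[fragment_program(texture, x, y) for x in range(width)]
            for y in range(height)]


def simulate(texture, fragment_program, steps):
    """Repeat: render an image using T, then update T from the result."""
    for _ in range(steps):
        image = render(texture, fragment_program)  # reads only the old T
        texture = image                            # T is replaced as a whole
    return texture


def average_neighbors(texture, x, y):
    """Example per-texel program: mean of the 4 neighbors, periodic edges."""
    height, width = len(texture), len(texture[0])
    return 0.25 * (texture[y][(x - 1) % width] + texture[y][(x + 1) % width]
                   + texture[(y - 1) % height][x] + texture[(y + 1) % height][x])


if __name__ == "__main__":
    start = [[0.0] * 4 for _ in range(4)]
    start[1][1] = 16.0
    print(simulate(start, average_neighbors, 1))
```

The key property is that every texel of a pass reads the previous texture and writes to a separate image, so the per-texel computations are independent of one another.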

  13. Implementation Specifics: • We used GL_TEXTURE_RECTANGLE_NV textures - allow dimensions that are not powers of 2 • glTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, …) - to initialize the texture from main memory • glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, …) - to replace (copy) results from the p-buffer to the texture • p-buffer implementation from the Cg toolkit distribution - http://developer.nvidia.com/object/cg_toolkit.html • fp30 profile with options set by cgGLSetOptimalOptions • glFinish to force completion of OpenGL requests before measuring execution time (gettimeofday)

  14. Performance results and analysis • Time in s (approximate) for different vector flops on the GPU, for operations of the form x = c, x += c, x = sin(c) • 48 B per node: speed limited by the GPU’s memory speed (16 GB/s), ≈ 0.39 Gflops/s; 3.5 Gflops/s if overheads (in =) are excluded; > 20 x faster than the CPU, but the operations are of low accuracy • Time in s (approximate), including traffic, for different vector flops on the CPU: 32 B per node, speed limited by the CPU’s memory speed (4.2 GB/s)

  15. Performance results and analysis • GPU and CPU (2.8 GHz) performance on the Ising model on Linux Red Hat 9, nVidia display driver 4363 from April, 2003 • Fragment program compiled to 109 instructions (cgc -profile fp30 ising.cg); performance on a 256x256 lattice: 256x256x109x4/0.0081 ≈ 3.5 Gflops/s • Driver 4191 (December, 2002): problems related to floating point p-buffer • Driver 4363 (April, 2003): used for the results reported • Driver 4496 (July, 2003): observed speedup of ≈ 2 times compared to driver 4363, i.e. ≈ 7 Gflops/s = 44% of theoretical max performance
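
The arithmetic behind these figures can be checked with a few lines. This sketch assumes the factor 4 counts the components of a 4D vector per instruction, and that the "theoretical max" is the 4 Gops (float) on 4D vectors quoted for the pixel processor, i.e. 16 Gflops; both readings are inferences from the slides, not statements by the author. The function name is hypothetical.

```python
def achieved_gflops(width, height, instructions, vector_width, seconds):
    """Floating point operations per second, in Gflops, for one lattice pass."""
    flops = width * height * instructions * vector_width
    return flops / seconds / 1e9


# 256 x 256 lattice, 109 instructions per fragment, 4 components, 0.0081 s.
ising_rate = achieved_gflops(256, 256, 109, 4, 0.0081)

# Assumption: peak = 4 Gops (float) on 4-component vectors = 16 Gflops.
peak_gflops = 4.0 * 4
fraction_of_peak = 7.0 / peak_gflops

if __name__ == "__main__":
    print(round(ising_rate, 2), round(100 * fraction_of_peak, 1))
```

Under these assumptions the first value comes out near 3.5 Gflops and the second near 44%, matching the numbers on the slide.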

  16. Performance results and analysis • NV30 does not support branching in fragment programs - an if / else takes as much time as if all statements were executed - modeling the checkerboard pattern with 2 lattices would increase the speed by a factor of 2 • Performance is comparable to that of visualization related sample shaders from http://www.shaders.org/ • Cg assembly - Performance is the same when using the Cg runtime or the generated assembly code - The generated assembly code is not optimal: we found cases where the code could be optimized and performance increased
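
The factor of 2 attributed to the checkerboard can be illustrated by counting work. This is a simplified cost model, not a measurement: it assumes every fragment costs the same whether or not its site is updated (since both branches are paid for), and that splitting the lattice into one half-size lattice per color adds no overhead. The function names are hypothetical.

```python
def fragments_single_lattice(n):
    """One n x n lattice: each color pass runs a fragment on every site."""
    executed = updated = 0
    for color in (0, 1):
        for i in range(n):
            for j in range(n):
                executed += 1          # full cost is paid at every site
                if (i + j) % 2 == color:
                    updated += 1       # only this color's sites change
    return executed, updated


def fragments_two_lattices(n):
    """Two lattices of n*n/2 sites, one per color: every fragment updates."""
    half = n * n // 2
    executed = updated = 0
    for _color in (0, 1):
        executed += half
        updated += half
    return executed, updated


if __name__ == "__main__":
    print(fragments_single_lattice(8), fragments_two_lattices(8))
```

Both layouts update the same number of sites per black-and-white sweep pair, but the single-lattice layout executes twice as many fragments to do so.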

  17. Extensions and future work • Code optimization (through optimization of Cg generated assembly) • More applications: - QCD? - Fluid flow? • Parallel algorithms (or just as a coprocessor) - domain decomposition type in a cluster environment - Motivation: communication rates between CPU and GPU for lattices of different sizes (in seconds) are not a bottleneck in a cluster with a 1 Gbit network

  18. Conclusions • GPUs have a higher rate of performance increase over time than CPUs - GPU studies are valuable research for the future • In certain applications GPUs are 30 to 60 times faster than CPUs for low precision computations (depending on configuration) • For certain floating point applications GPU’s and CPU’s performance is comparable - the GPU can be used as a coprocessor • GPUs are often constrained in memory, but preliminary results show it is feasible to use GPUs in parallel • Cg is a convenient tool (but cgc could be optimized) • It is feasible to use GPUs for numerical simulations - we demonstrated it by implementing 2 models (with many applications), and - used the implementation in benchmarking NV30
