GPU Data Formatting and Addressing

1. GPU Data Formatting and Addressing
  Aaron Lefohn, University of California, Davis

2. Overview
  - GPU Memory Model
  - GPU-Based Data Structures
  - Performance Considerations

3. GPU Memory Model: GPU Data Storage
  - Vertex data
  - Texture data
  - Frame buffer(s)

4. GPU Memory Model
  - Read-Only
    - Traditional use of GPU memory: CPU writes, GPU reads
  - Read/Write
    - Save frame buffer(s) for later use as a texture or vertex array
    - Save up to sixteen 32-bit floating-point values per pixel via Multiple Render Targets (MRTs)

5. How to Save Render Results
  - Copy framebuffer results to "other GPU memory"
    - Copy-to-texture
    - Copy-to-vertex-array
  - Write directly to "other GPU memory"
    - Render-to-texture
    - Render-to-vertex-array

6. OpenGL GPU Memory Writes
  - Texture
    - Copy frame buffer to texture
    - Render-to-texture: WGL_ARB_render_texture, GL_EXT_render_target, Superbuffers
  - Vertex array
    - Copy frame buffer to vertex array: GL_EXT_pixel_buffer_object, Superbuffers
    - Render-to-vertex-array: Superbuffers

7. Render-To-Texture (1): Copy-To-Texture
  - Good
    - Cross-platform texture writes
    - Flexible output: the 2D render output can be copied into a 1D, 2D, or 3D texture
  - Bad
    - Slow: consumes internal GPU memory bandwidth
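
A minimal copy-to-texture sketch in C, assuming 'texId' names an already-allocated GL_TEXTURE_2D of at least width x height texels and that the GPGPU pass has just been rendered into the current read buffer (function and parameter names are illustrative):

    #include <GL/gl.h>

    /* Copy the result of the last render pass into an existing texture. */
    void copy_framebuffer_to_texture(GLuint texId, int width, int height)
    {
        glBindTexture(GL_TEXTURE_2D, texId);
        /* Copy the lower-left width x height region of the current read
         * buffer into mipmap level 0 of the bound texture at offset (0,0). */
        glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);
    }

The copy stays on the GPU, but as the slide notes it still spends internal memory bandwidth, which is why the render-to-texture paths on the next slides exist.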

8. Render-To-Texture (2): WGL_ARB_render_texture
  - Render-to-texture (RTT) using pbuffers
    http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
  - Good
    - Fast RTT
    - Current state of the art for RTT
  - Bad
    - Only works on Windows
    - Slow OpenGL context switches; many hacks exist to avoid this bottleneck

9. Render-To-Texture (3): GL_EXT_render_target
  - Proposed extension for cross-platform RTT
    http://www.opengl.org/resources/features/GL_EXT_render_target.txt
  - Good
    - Cross-platform, efficient RTT solution
    - Lightweight, simple extension
  - Bad
    - Specification not approved (April 24, 2004)
    - No implementations exist (April 24, 2004)

10. Render-To-Texture (4): Superbuffers
  - Proposed new memory model for GPUs
    http://www.ati.com/developer/gdc/SuperBuffers.pdf
  - Good
    - Unified GPU memory model: render to any GPU memory
    - Cross-platform (OpenGL owns the memory, not the OS)
    - Mix-and-match depth/stencil/color buffers
  - Bad
    - Large, complex extension
    - Specification not approved (April 24, 2004)
    - Only driver support is an alpha version (ATI)

11. Render-To-Texture Summary
  - OpenGL RTT currently works only under Windows
  - Pbuffers: a complex and awkward RTT mechanism, but the current state of the art
  - Cross-platform RTT is coming soon...

12. Render-To-Vertex-Array (1): GL_EXT_pixel_buffer_object
  - Copy framebuffer to vertex buffer object
    http://developer.nvidia.com/object/nvidia_opengl_specs.html
  - Good
    - Only GPU/AGP memory bandwidth
    - Works with current drivers (NVIDIA)
  - Bad
    - No direct render-to-vertex-array (slower than true RTVA)
    - No ATI implementation
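
A hedged C sketch of this copy path, assuming the GL_EXT_pixel_buffer_object and ARB_vertex_buffer_object tokens and entry points are available from the driver; 'vbo', the float RGBA framebuffer format, and the point rendering at the end are illustrative assumptions rather than part of the extension:

    /* Ask glext.h for extension prototypes; on Windows the entry points
     * would instead be fetched with wglGetProcAddress. */
    #define GL_GLEXT_PROTOTYPES 1
    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Copy the framebuffer into a buffer object, then draw it as vertices.
     * 'vbo' must already be allocated with room for width*height RGBA floats. */
    void readback_to_vertex_array(GLuint vbo, int width, int height)
    {
        /* 1. With a pixel-pack buffer bound, glReadPixels writes into the
         *    buffer object, so the data never travels back to the CPU. */
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, vbo);
        glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, 0);
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);

        /* 2. Rebind the same buffer as a vertex array and draw from it. */
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
        glVertexPointer(4, GL_FLOAT, 0, 0);
        glEnableClientState(GL_VERTEX_ARRAY);
        glDrawArrays(GL_POINTS, 0, width * height);
        glDisableClientState(GL_VERTEX_ARRAY);
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
    }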

13. Render-To-Vertex-Array (2): Superbuffers
  - Write to a "memory object" as a render target, then read from the same "memory object" as a vertex array
  - Good
    - Direct render-to-vertex-array (fast)
  - Bad
    - Can render results always be interpreted as vertex data?
    - Large, complex, unapproved extension, ...

14. Render-To-Vertex-Array Summary
  - Current OpenGL support
    - NVIDIA: GL_EXT_pixel_buffer_object
    - ATI: Superbuffers
  - Semantics still under development...

15. Fbuffer: Capturing Fragments
  - Idea: a "rasterization-order FIFO buffer" whose render results are fragment values instead of pixel values
    - Mark and Proudfoot, Graphics Hardware 2001
      http://graphics.stanford.edu/projects/shading/pubs/hwws2001-fbuffer/
  - Uses
    - Designed for multi-pass rendering with transparent geometry
    - New possibilities for GPGPU? Varying number of results per pixel; RTT and RTVA with an fbuffer?

16. Fbuffer: Capturing Fragments
  - Implementations
    - ATI Radeon 9800 and newer ATI GPUs
    - Not yet exposed to the user (ask for it!)
  - Problems
    - The size of the fbuffer is not known before rendering
    - GPUs cannot perform dynamic memory allocation
    - How to handle buffer overflow?

17. Overview
  - GPU Memory Model
  - GPU-Based Data Structures
  - Performance Considerations

18. GPU-Based Data Structures: Building Blocks
  - GPU memory addresses
    - Address generation
    - Address use
    - Pointers
  - Multi-dimensional arrays
  - Sparse representations

19. GPU Memory Addresses: Where Are Addresses Generated?
  - CPU: vertex stream or textures
  - Vertex processor: input stream, ALU ops, or textures
  - Rasterizer: interpolation
  - Fragment processor: input stream, ALU ops, or textures

20. GPU Memory Addresses: Where Are Addresses Used?
  - Vertex textures (PS3.0 GPUs)
  - Fragment textures

21. GPU Memory Addresses: Pointers
  - Store addresses in a texture
  - Dependent texture read
  - Example (see Tim Purcell's ray tracing talk):

      // Fetch the pointer: an address stored in a texture.
      float2 addr = tex2D( addrTex, texCoord );
      // Dependent read: use the fetched address as the coordinate
      // for a second texture lookup.
      float2 data = tex2D( dataTex, addr );

22. GPU-Based Data Structures: Building Blocks
  - GPU memory addresses
    - Address generation
    - Address use
    - Pointers
  - Multi-dimensional arrays
  - Sparse representations

23. Multi-Dimensional Arrays
  - Build data structures in 2D memory
    - Read/write GPU memory is optimized for 2D images
  - But isn't physical memory 1D?
    - The GPU memory hierarchy is optimized to capture 2D locality (rasterization, texture filtering)
    - Igehy, Eldridge, Proudfoot, "Prefetching in a Texture Cache Architecture," Graphics Hardware 1998
  - Conclusion: use the illusion of 2D physical memory

24. GPU Arrays: Large 1D Arrays
  - Current GPUs limit 1D array sizes to 2048 or 4096
  - Pack into 2D memory
  - 1D-to-2D address translation
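
The 1D-to-2D translation is a divide and a modulo followed by normalization to texture coordinates. The arithmetic is shown below as plain C for clarity; a fragment or vertex program would perform the same math, and 'width' and 'height' are the (assumed) dimensions of the 2D texture holding the packed array:

    /* Map a 1D element index to normalized 2D texture coordinates.
     * Addresses texel centers, hence the +0.5. */
    void index_to_texcoord(int index, int width, int height,
                           float *s, float *t)
    {
        int x = index % width;                 /* column */
        int y = index / width;                 /* row (index < width * height) */
        *s = (x + 0.5f) / (float)width;
        *t = (y + 0.5f) / (float)height;
    }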

25. GPU Arrays: 3D Arrays
  - Problem
    - GPUs do not have 3D frame buffers
    - No RTT to a slice of a 3D texture (except with Superbuffers)
  - Solutions
    - Stack of 2D slices
    - Multiple slices per 2D buffer
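
For the second solution (multiple slices in one 2D buffer, the "flat 3D texture" of slide 39), slice z is tiled into a grid inside a single large 2D texture. A plain-C sketch of the address translation a fragment program would perform, with all parameter names illustrative:

    /* Map 3D array coordinates (x, y, z) into a flat 2D texture that
     * tiles the volume's Z slices in a tilesX x tilesY grid
     * (tilesX * tilesY >= depth). */
    void flat3d_texcoord(int x, int y, int z,
                         int sliceW, int sliceH,   /* size of one slice     */
                         int tilesX, int tilesY,   /* tiles per row/column  */
                         float *s, float *t)
    {
        int tileX = z % tilesX;                    /* which tile holds slice z */
        int tileY = z / tilesX;
        int px = tileX * sliceW + x;               /* texel position in the flat texture */
        int py = tileY * sliceH + y;
        *s = (px + 0.5f) / (float)(tilesX * sliceW);
        *t = (py + 0.5f) / (float)(tilesY * sliceH);
    }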

26. GPU Arrays: Problems With 3D Arrays for GPGPU
  - Cannot read a stack of 2D slices as a 3D texture
    - Must know which slices are needed in advance
    - Visualization of the 3D data is difficult
  - Solutions
    - Need render-to-slice-of-3D-texture (Superbuffers)
    - Volume rendering of slice-based 3D data: Course 28, "Real-Time Volume Graphics," SIGGRAPH 2004

27. GPU Arrays: Higher-Dimensional Arrays
  - Pack into 2D buffers
  - N-D to 2D address translation
  - Same problems as 3D arrays if the data does not fit in a single 2D texture
  - Conclusions
    - The fundamental GPU memory primitive is a fixed-size 2D array
    - GPGPU needs a more general memory model

28. GPU-Based Data Structures: Building Blocks
  - GPU memory addresses
    - Address generation
    - Address use
    - Pointers
  - Multi-dimensional arrays
  - Sparse representations

29. Sparse Data Structures
  - Why sparse data structures?
    - Reduce computational workload
    - Reduce memory pressure
  - Examples
    - Sparse matrices: Krueger et al., SIGGRAPH 2003; Bolz et al., SIGGRAPH 2003
    - Implicit surface computations (sparse volumes): Sherbondy et al., IEEE Visualization 2003; Lefohn et al., IEEE Visualization 2003

30. Sparse Computation
  - Option 1: store the complete data set on the GPU
    - Cull unused data
    - Conditional execution tricks (discussed earlier)
  - Option 2: store only the sparse data on the GPU
    - Saves memory
    - Potentially much faster than culling
    - Much more complicated (especially if the data is time-varying)

31. Sparse Data Structures
  - Basic idea: pack "active" data elements into GPU memory
  - For more information
    - Linear algebra section in this course: static structures
    - Level-set case study in this course: dynamic structures

32. Sparse Data Structures: Addressing Sparse Data
  - Neighborhoods are no longer implicitly defined on the grid
  - Use pointer-based data structures to locate neighbors
  - Pre-compute neighbor addresses if possible
    - Use the CPU or the vertex processor
    - Removes the pointer dereference from the fragment program
  - Separate the common addressing case from boundary conditions
    - The common case must be cache coherent
    - See the Harris and Lefohn case studies for the "substream" technique
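
As a generic illustration of pointer-based sparse addressing (a sketch, not the exact scheme of any of the cited papers): a coarse "page table" texture maps a block of the virtual domain to that block's location in the packed data texture, and the fine offset is added within the block. In a fragment program both fetches would be dependent texture reads; here the page-table read is abstracted as a function parameter:

    /* Translate virtual (unpacked) 2D coordinates into an address in the
     * packed data texture via a page-table lookup. */
    typedef struct { float s; float t; } Addr2D;

    Addr2D sparse_address(int vx, int vy,           /* virtual coordinates       */
                          int blockSize,            /* edge length of one block  */
                          Addr2D (*pagetable_fetch)(int bx, int by))
    {
        /* Coarse lookup: where is block (vx/blockSize, vy/blockSize) stored? */
        Addr2D base = pagetable_fetch(vx / blockSize, vy / blockSize);
        Addr2D a;
        a.s = base.s + (float)(vx % blockSize);     /* fine offset inside the block */
        a.t = base.t + (float)(vy % blockSize);
        return a;                                   /* texel address in the packed texture */
    }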

33. Overview
  - GPU Memory Model
  - GPU-Based Data Structures
  - Performance Considerations

34. Memory Performance Issues
  - Pbuffer survival guide
  - Dependent texture costs
  - Computational frequency

35. Pbuffer Survival Guide
  - Pbuffers give us render-to-texture
    - Designed to create an environment map or two
    - Never intended to be used for GPGPU (100s of pbuffers)
  - Problem
    - Each pbuffer has its own OpenGL render context
    - Each pbuffer may have its own depth and/or stencil buffer
    - Changing OpenGL contexts is slow
  - Solution: many optimizations to avoid this bottleneck...

36. Pbuffer Survival Guide: Pack Scalar Data Into RGBA
  - > 4x memory savings
  - 4x reduction in context switches
  - Be careful of the read-modify-write hazard
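
The packing itself is just index arithmetic: element i of a scalar stream lands in channel i mod 4 of texel i / 4, and that texel index is then laid out in 2D as in slide 24. A tiny C sketch of this assumed layout:

    /* Address of a packed scalar: four consecutive scalars share one
     * RGBA texel. */
    void packed_scalar_address(int i, int *texel, int *channel)
    {
        *texel   = i / 4;   /* which RGBA texel holds element i */
        *channel = i % 4;   /* 0 = R, 1 = G, 2 = B, 3 = A       */
    }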

37. Pbuffer Survival Guide: Use Multi-Surface Pbuffers
  - Each RGBA surface is its own render-texture: Front, Back, AuxN (N = 0, 1, 2, ...)
  - Greatly reduces context switches
  - Technically illegal, but "blessed" by ATI; also works on NVIDIA

38. Pbuffer Survival Guide: Using Multi-Surface Pbuffers
  - Allocate a double-buffered pbuffer (and/or one with AUX buffers)
  - Set the render target to the back buffer: glDrawBuffer(GL_BACK)
  - Bind the front buffer as a texture: wglBindTexImageARB(hpbuffer, WGL_FRONT_ARB)
  - Render
  - Switch buffers:
      wglReleaseTexImageARB(hpbuffer, WGL_FRONT_ARB)
      glDrawBuffer(GL_FRONT)
      wglBindTexImageARB(hpbuffer, WGL_BACK_ARB)
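
A C/WGL sketch of one ping-pong step, assuming the pbuffer's context is already current and a texture object is bound for the render-texture. Note that the WGL_ARB_render_texture spec spells the surface tokens WGL_FRONT_LEFT_ARB / WGL_BACK_LEFT_ARB (abbreviated FRONT/BACK on the slide), and the extension function pointers are assumed to have been fetched with wglGetProcAddress:

    #include <windows.h>
    #include <GL/gl.h>
    #include <GL/wglext.h>

    /* One ping-pong pass: write to one surface of 'hPbuffer' while the
     * other surface is bound as the input texture. */
    void pingpong_step(HPBUFFERARB hPbuffer,
                       PFNWGLBINDTEXIMAGEARBPROC    wglBindTexImageARB,
                       PFNWGLRELEASETEXIMAGEARBPROC wglReleaseTexImageARB)
    {
        /* Write to the back surface, read the front as a texture. */
        glDrawBuffer(GL_BACK);
        wglBindTexImageARB(hPbuffer, WGL_FRONT_LEFT_ARB);
        /* ... draw the full-screen quad that runs the fragment program ... */
        wglReleaseTexImageARB(hPbuffer, WGL_FRONT_LEFT_ARB);

        /* Swap roles for the next pass: write front, read back. */
        glDrawBuffer(GL_FRONT);
        wglBindTexImageARB(hPbuffer, WGL_BACK_LEFT_ARB);
        /* ... next pass ... */
        wglReleaseTexImageARB(hPbuffer, WGL_BACK_LEFT_ARB);
    }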

39. Pbuffer Survival Guide
  - Pack 2D domains into one large buffer ("flat 3D textures")
  - Be careful of the read-modify-write hazard

40. Dependent Texture Costs
  - Cache coherency
    - Dependent reads are fast if they hit the cache
    - Even chained dependencies can be the same speed as non-dependent reads
    - Very slow if out of cache; for example, 3 levels of dependent cache misses can be >10x slower
  - More detail in "GPU Computation Strategies and Tricks"

41. Computational Frequency
  - Compute memory addresses at low frequency
    - Compute memory addresses in the vertex program
    - Let rasterizer interpolation create the per-fragment addresses
    - Compute neighbor addresses this way (see the sketch below)
  - Avoid fragment-level address computation whenever possible
    - Consumes fragment instructions
    - Computation is often redundant across neighboring fragments
    - May defeat texture pre-fetch
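
A C sketch of the low-frequency approach for a full-screen GPGPU quad: the center address and its +x/+y neighbor addresses are issued as per-vertex texture coordinates on three texture units, and the rasterizer interpolates them for every fragment, so the fragment program reads the interpolated coordinates directly instead of doing offset math. The texture size w x h and the unit assignments are illustrative; glMultiTexCoord2f is core since OpenGL 1.3 (on Windows the entry point must be fetched at run time):

    #define GL_GLEXT_PROTOTYPES 1
    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Emit one quad corner with its center and neighbor texture coords. */
    static void corner(float s, float t, float x, float y, int w, int h)
    {
        glMultiTexCoord2f(GL_TEXTURE0, s, t);                /* center      */
        glMultiTexCoord2f(GL_TEXTURE1, s + 1.0f / w, t);     /* +x neighbor */
        glMultiTexCoord2f(GL_TEXTURE2, s, t + 1.0f / h);     /* +y neighbor */
        glVertex2f(x, y);
    }

    /* Draw the computation quad covering clip space [-1,1] x [-1,1]. */
    void draw_compute_quad(int w, int h)
    {
        glBegin(GL_QUADS);
        corner(0.0f, 0.0f, -1.0f, -1.0f, w, h);
        corner(1.0f, 0.0f,  1.0f, -1.0f, w, h);
        corner(1.0f, 1.0f,  1.0f,  1.0f, w, h);
        corner(0.0f, 1.0f, -1.0f,  1.0f, w, h);
        glEnd();
    }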

42. Conclusions
  - The GPU memory model is evolving
    - Writable GPU memory forms a loop-back in an otherwise feed-forward streaming pipeline
    - The memory model will continue to evolve as GPUs become more general stream processors
  - GPGPU data structures
    - The basic memory primitive is a limited-size 2D texture
    - Use address translation to fit all array dimensions into 2D
    - Maintain 2D cache locality
  - Render-to-texture
    - Use pbuffers with care, and eagerly adopt their successor

43. Selected References
  - J. Bolz, I. Farmer, E. Grinspun, P. Schröder, "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," SIGGRAPH 2003
  - N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, "A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware," Graphics Hardware 2003
  - M. Harris, W. Baxter, T. Scheuermann, A. Lastra, "Simulation of Cloud Dynamics on Graphics Hardware," Graphics Hardware 2003
  - H. Igehy, M. Eldridge, K. Proudfoot, "Prefetching in a Texture Cache Architecture," Graphics Hardware 1998
  - J. Krueger, R. Westermann, "Linear Algebra Operators for GPU Implementation of Numerical Algorithms," SIGGRAPH 2003
  - A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, "A Streaming Narrow-Band Algorithm: Interactive Computation and Visualization of Level Sets," IEEE Transactions on Visualization and Computer Graphics 2004

44. Selected References
  - A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, "Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware," IEEE Visualization 2003
  - W. Mark, K. Proudfoot, "The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering," Graphics Hardware 2001
  - T. Purcell, C. Donner, M. Cammarano, H. W. Jensen, P. Hanrahan, "Photon Mapping on Programmable Graphics Hardware," Graphics Hardware 2003
  - A. Sherbondy, M. Houston, S. Napel, "Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware," IEEE Visualization 2003

45. OpenGL References
  - GL_EXT_pixel_buffer_object: http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt
  - GL_EXT_render_target: http://www.opengl.org/resources/features/GL_EXT_render_target.txt
  - OpenGL Extension Registry: http://oss.sgi.com/projects/ogl-sample/registry/
  - Superbuffers: http://www.ati.com/developer/gdc/SuperBuffers.pdf
  - WGL_ARB_render_texture: http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt and http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_pbuffer.txt

46. Questions? Acknowledgements
  - Cass Everitt, Craig Kolb, Chris Seitz, and Jeff Juliano at NVIDIA
  - Mark Segal, Rob Mace, and Evan Hart at ATI
  - GPGPU SIGGRAPH 2004 course presenters
  - Joe Kniss and Ross Whitaker
  - Brian Budge
  - John Owens
  - National Science Foundation Graduate Fellowship
  - Pixar Animation Studios
