
Memory-Savvy Distributed Interactive Ray Tracing



Presentation Transcript


  1. Memory-Savvy Distributed Interactive Ray Tracing David E. DeMarle Christiaan Gribble Steven Parker

  2. Impetus for the Paper • data sets are growing • memory access time is a bottleneck • use parallel memory resources efficiently • three techniques for faster access to scene data

  3. System Overview • base system presented at IEEE PVG’03 • cluster port of an interactive ray tracer for shared-memory supercomputers (IEEE VIS’98) • image-parallel work division • fetch scene data from peers and cache it locally

  4. Three Techniques for Memory Efficiency • ODSM → PDSM • central work queue → distributed work sharing • polygonal mesh reorganization

  5. Distributed Shared Memory • data is kept in memory blocks • each node has 1/nth of the blocks • fetch the rest over the network from peers • cache recently fetched blocks [diagram: abstract view of memory as blocks 1–9; node 1 holds resident set {1, 4, 7} plus cached block 2; node 2 holds {2, 5, 8} plus cached block 3; node 3 holds {3, 6, 9} plus cached blocks 2 and 4]

  6. Object Based DSM • each block has a unique handle • application finds handle for each datum • acquire and release for every block access

    // locate data
    handle, offset = ODSM_location(datum);
    block_start_addr = acquire(handle);
    // use data
    datum = *(block_start_addr + offset);
    // relinquish space
    release(handle);
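The slide's pseudocode can be fleshed out into a small, self-contained sketch of the access pattern it describes. This is not the authors' implementation; the block table and the names odsm_owner, odsm_acquire, and odsm_release are illustrative stand-ins that mimic the pattern on the slide: a handle identifies a block, the owner node is derived from the handle, and every access is bracketed by acquire/release.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define NUM_NODES  16

    /* Hypothetical block table: a block is either in this node's resident
     * set, in its cache, or must be fetched from the owning peer. */
    static char *block_table[256];

    /* Owner of a block: each node holds 1/nth of the blocks (illustrative). */
    static int odsm_owner(unsigned handle) { return handle % NUM_NODES; }

    /* Acquire pins a block locally, "fetching" it on a miss. */
    static char *odsm_acquire(unsigned handle)
    {
        if (block_table[handle] == NULL) {
            /* stand-in for fetching the block from odsm_owner(handle) */
            block_table[handle] = calloc(1, BLOCK_SIZE);
        }
        return block_table[handle];
    }

    /* Release unpins the block so the cache may evict it later. */
    static void odsm_release(unsigned handle) { (void)handle; }

    int main(void)
    {
        /* A datum lives at (handle, offset); every access pays the acquire,
         * address-computation, and release overheads noted on slide 7. */
        unsigned handle = 42, offset = 128;

        char *block = odsm_acquire(handle);
        float datum;
        memcpy(&datum, block + offset, sizeof datum);  /* use the data     */
        odsm_release(handle);                          /* relinquish space */

        printf("datum = %f (owner node %d)\n", datum, odsm_owner(handle));
        return 0;
    }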

  7. ODSM Observations • handle = a level of indirection → can address > 4 GB • mapping scene data to blocks is tricky • acquire and release add overhead • address computations add overhead • 7.5 GB Richtmyer-Meshkov time step, 64 CPUs: ~3 fps with view and isovalue changes

  8. Page Based DSM • like ODSM: • each node keeps 1/nth of the scene • fetches from peers • uses caching • difference is how memory is accessed • normal virtual memory addressing • use addresses between the heap and the stack • PDSM installs a segmentation-fault signal handler: on a miss → obtain the page from a peer, then return (see the sketch below)
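As a rough illustration of the mechanism described above (not the authors' code), the sketch below reserves an inaccessible address range with mmap, installs a SIGSEGV handler, and on a fault makes the touched page accessible and fills it. That fill is where a real PDSM would copy in the page fetched from the owning peer; the region size and fill pattern here are arbitrary assumptions.

    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char  *region;        /* shared range between heap and stack */
    static size_t region_size;
    static long   page_size;

    /* On a miss, a real PDSM would fetch the page from its owning peer
     * here; this sketch just maps it in and fills it with a marker. */
    static void miss_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        char *fault = (char *)si->si_addr;
        if (fault < region || fault >= region + region_size)
            _exit(1);                               /* a genuine crash */
        char *page = (char *)((uintptr_t)fault & ~(uintptr_t)(page_size - 1));
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
        memset(page, 0x42, page_size);              /* "page from peer" */
    }

    int main(void)
    {
        page_size   = sysconf(_SC_PAGESIZE);
        region_size = 64 * page_size;

        /* Reserve the range with no access rights: first touches fault. */
        region = mmap(NULL, region_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {0};
        sa.sa_sigaction = miss_handler;
        sa.sa_flags     = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Normal pointer arithmetic; the DSM acts only on the miss. */
        printf("first byte of page 3: %d\n", region[3 * page_size]);
        return 0;
    }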

  9. PDSM Observations • no handles, normal memory access • no acquire/release or address computations • easy to place any type of scene data in shared space • limited to 2^32 bytes of address space • hard to make thread safe • DSM acts only in the exceptional case of a miss • ray tracing acceleration structure → > 90% hit rates

  10. Head-to-Head Comparison • compare replication, PDSM and ODSM • use a small 512^3 volumetric data set • PDSM and ODSM keep only 1/16th locally • change viewpoint and isovalue throughout • first half, large working set • second half, small working set

  11. Head-to-Head Comparison (note: video accelerated ~2x for the presentation)

  12. Head-to-Head Comparison

  13. Head-to-Head Comparison • replicated: 3.74 frames/sec average

  14. Head-to-Head Comparison • ODSM: 32% of the speed of replication

  15. Head-to-Head Comparison • PDSM: 82% of the speed of replication

  16. Three Techniques for Memory Efficiency • ODSM → PDSM • central work queue → distributed work sharing • polygonal mesh reorganization

  17. Load Balancing Options • central work queue • legacy from the original shared-memory implementation • display node keeps the task queue • render nodes get tiles from the queue • now → distributed work sharing • start with the tiles traced last frame → hit rates increase • workers get tiles from each other → communication happens in parallel, better scalability • steal from random peers; the slowest worker gives up work (see the sketch below)
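A minimal, single-process sketch of the stealing heuristic described above (illustrative only; the tile counts, the give-half policy, and names such as steal_tiles are assumptions rather than the paper's implementation): each worker starts the frame with the tiles it traced last frame, and an idle worker asks a random peer, which hands over part of its remaining work.

    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_WORKERS      4
    #define TILES_PER_WORKER 16

    /* Each worker's queue of tile ids, seeded with last frame's assignment
     * so cached scene data is likely to be reused (higher hit rates). */
    typedef struct { int tiles[NUM_WORKERS * TILES_PER_WORKER]; int count; } queue_t;
    static queue_t workers[NUM_WORKERS];

    /* Victim gives away half of its remaining tiles (illustrative policy). */
    static int steal_tiles(int thief, int victim)
    {
        int give = workers[victim].count / 2;
        for (int i = 0; i < give; i++) {
            workers[victim].count--;
            workers[thief].tiles[workers[thief].count++] =
                workers[victim].tiles[workers[victim].count];
        }
        return give;
    }

    int main(void)
    {
        /* Seed: worker w keeps the tiles it traced last frame. */
        for (int w = 0; w < NUM_WORKERS; w++) {
            workers[w].count = TILES_PER_WORKER;
            for (int t = 0; t < TILES_PER_WORKER; t++)
                workers[w].tiles[t] = w * TILES_PER_WORKER + t;
        }

        /* Worker 2 finishes early and steals from a random peer. */
        int thief = 2;
        workers[thief].count = 0;
        int victim;
        do { victim = rand() % NUM_WORKERS; } while (victim == thief);
        int got = steal_tiles(thief, victim);
        printf("worker %d stole %d tiles from worker %d\n", thief, got, victim);
        return 0;
    }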

  18. Central Work Queue vs. Distributed Work Sharing [diagram: on the left, a supervisor node hands tiles 0, 1, 2, 3, … from a central queue to worker nodes 0–3; on the right, worker nodes 0–3 exchange tiles directly with one another]

  19. Central Work Queue vs. Distributed Work Sharing

  20. Central Work Queue vs. Distributed Work Sharing

  21. Central Work Queue vs. Distributed Work Sharing

  22. Central Work Queue vs. Distributed Work Sharing

  23. Comparison • place the bunny, the dragon, and their acceleration structures in PDSM • measure misses and frame rates • vary the local memory size to simulate data much larger than physical memory

  24.–27. [plots: misses (0–1E6) and frames/sec (0–20) vs. MB of memory kept locally, comparing the central queue with distributed sharing]

  28. Three Techniques for Memory Efficiency • ODSM → PDSM • central work queue → distributed work sharing • polygonal mesh reorganization

  29. Mesh “Bricking” • similar to volumetric bricking • increase hit rates by reorganizing scene data for better data locality • place neighboring triangles on the same page [diagram: volume bricking and mesh “bricking”, with neighboring voxels and triangles mapped to the same range of page addresses]

  30.–35. Mesh “Bricking” (animation frames stepping through the same bricking diagram)

  36. Input Mesh

  37. Sorted Mesh

  38. Reorganizing the Mesh • based on a grid acceleration structure • each grid cell contains pointers to the triangles within it • our grid structure is bricked in memory • create the grid acceleration structure • traverse the cells as stored in memory • append copies of the triangles to a new mesh • the new mesh has its triangles sorted in space and in memory (see the sketch below)
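A simplified sketch of that reorganization pass (my own illustration, not the authors' code): it bins triangles by centroid into a uniform grid over a unit-cube scene and uses plain row-major cell order where the real system traverses its bricked grid layout. Traversing the cells in storage order and appending copies of their triangles yields a mesh sorted in space and in memory.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { float v[3][3]; } tri_t;         /* one triangle        */

    #define GRID_N 8                                  /* GRID_N^3 cells      */

    /* Cell index of a triangle's centroid (illustrative; the real grid is
     * bricked in memory rather than row-major, and bounds are not unit). */
    static int cell_of(const tri_t *t)
    {
        float c[3];
        for (int a = 0; a < 3; a++)
            c[a] = (t->v[0][a] + t->v[1][a] + t->v[2][a]) / 3.0f;
        int ix = (int)fminf(c[0] * GRID_N, GRID_N - 1);
        int iy = (int)fminf(c[1] * GRID_N, GRID_N - 1);
        int iz = (int)fminf(c[2] * GRID_N, GRID_N - 1);
        return (iz * GRID_N + iy) * GRID_N + ix;
    }

    /* Append copies of the triangles cell by cell, in the cells' storage
     * order, so spatially neighboring triangles land on nearby pages.    */
    static tri_t *sort_mesh(const tri_t *in, int n)
    {
        int  ncells = GRID_N * GRID_N * GRID_N;
        int *count  = calloc(ncells + 1, sizeof *count);
        for (int i = 0; i < n; i++)
            count[cell_of(&in[i]) + 1]++;
        for (int c = 0; c < ncells; c++)              /* prefix sums = offsets */
            count[c + 1] += count[c];

        tri_t *out = malloc(n * sizeof *out);
        for (int i = 0; i < n; i++)                   /* stable counting sort  */
            out[count[cell_of(&in[i])]++] = in[i];
        free(count);
        return out;                                   /* caller frees          */
    }

    int main(void)
    {
        /* Two nearby triangles separated in the input by a distant one:
         * after sorting, the close pair is adjacent in memory. */
        tri_t tris[3] = {
            {{{0.10f,0.10f,0.1f},{0.20f,0.10f,0.1f},{0.10f,0.20f,0.1f}}},
            {{{0.90f,0.90f,0.9f},{0.80f,0.90f,0.9f},{0.90f,0.80f,0.9f}}},
            {{{0.12f,0.12f,0.1f},{0.22f,0.12f,0.1f},{0.12f,0.22f,0.1f}}},
        };
        tri_t *sorted = sort_mesh(tris, 3);
        for (int i = 0; i < 3; i++)
            printf("tri %d -> cell %d\n", i, cell_of(&sorted[i]));
        free(sorted);
        return 0;
    }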

  39. Comparison • same test as before • compare input and sorted mesh

  40.–43. [plots: misses and frames/sec vs. MB of memory kept locally, input mesh vs. sorted mesh]

  44. [plot: frames/sec vs. MB of memory kept locally, input mesh vs. sorted mesh] note: the grid-based approach duplicates triangles that are split across cells

  45. Summary three techniques for more efficient memory use: • PDSM adds overhead only in the exceptional case of a data miss • reuse tile assignments with parallel load-balancing heuristics • mesh reorganization puts related triangles onto nearby pages

  46. Future Work • need 64-bit architecture for very large data • thread safe PDSM for hybrid parallelism • distributed pixel result gathering • surface based mesh reorganization

  47. Acknowledgments • Funding agencies • NSF 9977218, 9978099 • DOE VIEWS • NIH • Reviewers, for their tips and for seeing past the rough initial presentation of the data • EGPGV Organizers • Thank you!
