Memory-Savvy Distributed Interactive Ray Tracing
David E. DeMarle, Christiaan Gribble, Steven Parker
Impetus for the Paper • data sets are growing • memory access time is a bottleneck • use parallel memory resources efficiently • three techniques for faster access to scene data
System Overview • base system presented at IEEE PVG'03 • a cluster port of an interactive ray tracer for shared-memory supercomputers (IEEE VIS'98) • image-parallel work division • fetch scene data from peers over the network and cache it locally
Three Techniques for Memory Efficiency • ODSM vs. PDSM • central work queue vs. distributed work sharing • polygonal mesh reorganization
Distributed Shared Memory • data is kept in memory blocks • each node holds 1/nth of the blocks • the rest are fetched from peers over the network • recently fetched blocks are cached [figure: abstract view of memory as blocks 1-9; node 1 holds resident set {1, 4, 7} and caches block 2; node 2 holds {2, 5, 8} and caches 3; node 3 holds {3, 6, 9} and caches 2 and 4]
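A minimal sketch of this block lookup, assuming round-robin block ownership and an unbounded local cache; ownerOf, fetchFromPeer, and the Node layout are illustrative, not the paper's API:

    #include <cstdint>
    #include <map>
    #include <vector>

    // Illustrative assumption: blocks are dealt round-robin, so a block's
    // owning node is recoverable from its handle alone.
    int ownerOf(uint64_t handle, int numNodes) {
        return static_cast<int>(handle % numNodes);
    }

    struct Node {
        int rank = 0, numNodes = 1;
        std::map<uint64_t, std::vector<char>> resident; // this node's 1/nth
        std::map<uint64_t, std::vector<char>> cache;    // recently fetched blocks

        std::vector<char> fetchFromPeer(int peer, uint64_t handle); // network code elided

        const char* getBlock(uint64_t handle) {
            if (ownerOf(handle, numNodes) == rank)
                return resident[handle].data();         // this node owns it
            auto it = cache.find(handle);
            if (it != cache.end())
                return it->second.data();               // cache hit
            // miss: fetch from the owning peer and keep a cached copy
            // (cache eviction policy elided)
            cache[handle] = fetchFromPeer(ownerOf(handle, numNodes), handle);
            return cache[handle].data();
        }
    };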
Object Based DSM • each block has a unique handle • the application finds the handle for each datum • acquire and release bracket every block access

    //locate the data: map the datum to its block handle and offset
    ODSM_location(datum_id, &handle, &offset);
    block_start_addr = acquire(handle);
    //use the data
    datum = *(block_start_addr + offset);
    //relinquish the space so the cache may evict the block
    release(handle);
ODSM Observations • handles add a level of indirection, enabling data sets larger than 4 GB • mapping scene data to blocks is tricky • acquire and release add overhead • address computations add overhead [image: 7.5 GB Richtmyer-Meshkov time step; 64 CPUs, ~3 frames/sec with view and isovalue changes]
Page Based DSM • like ODSM: each node keeps 1/nth of the scene, fetches the rest from peers, and caches it • the difference is how memory is accessed • normal virtual memory addressing, using addresses between the heap and the stack • PDSM installs a segmentation fault signal handler: on a miss it obtains the page from a peer, then returns
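A hedged sketch of such a handler, assuming POSIX sigaction/mprotect and a page-aligned reserved region; pageOwner and fetchPage are hypothetical stand-ins for the peer lookup and the network fetch, and a real implementation must also mind async-signal safety:

    #include <csignal>
    #include <cstddef>
    #include <sys/mman.h>
    #include <unistd.h>

    static char*  shared_base;   // DSM region reserved between heap and stack
    static size_t shared_size;

    int  pageOwner(size_t page);                      // hypothetical: which peer owns the page
    void fetchPage(int peer, size_t page, void* dst); // hypothetical: pull its bytes over the network

    static void dsmHandler(int, siginfo_t* info, void*) {
        char* addr = static_cast<char*>(info->si_addr);
        if (addr < shared_base || addr >= shared_base + shared_size)
            _exit(1);                                 // a genuine crash, not a DSM miss
        size_t psz   = static_cast<size_t>(getpagesize());
        size_t page  = static_cast<size_t>(addr - shared_base) / psz;
        char*  start = shared_base + page * psz;      // page-aligned if shared_base is
        mprotect(start, psz, PROT_READ | PROT_WRITE); // make the page accessible...
        fetchPage(pageOwner(page), page, start);      // ...and fill it from the owner
        // returning re-executes the faulting instruction, which now succeeds
    }

    void installDsmHandler() {
        struct sigaction sa = {};
        sa.sa_sigaction = dsmHandler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, nullptr);
    }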
PDSM Observations • no handles, just normal memory access • no acquire/release or address computations • easy to place any type of scene data in shared space • limited to 2^32 bytes • hard to make thread safe • the DSM acts only in the exceptional case of a miss • with the ray tracing acceleration structure, hit rates exceed 90%
Head-to-Head Comparison • compare replication, PDSM and ODSM • use a small 512^3 volumetric data set • PDSM and ODSM keep only 1/16th locally • change viewpoint and isovalue throughout • first half, large working set • second half, small working set
Head-to-Head Comparison (playback accelerated ~2x for presentation) • replicated: 3.74 frames/sec average • ODSM: 32% of the speed of replication • PDSM: 82% of the speed of replication
Three Techniques for Memory Efficiency • ODSM vs. PDSM • central work queue vs. distributed work sharing • polygonal mesh reorganization
Load Balancing Options • central work queue • legacy from the original shared memory implementation • the display node keeps a task queue • render nodes get tiles from the queue • now distributed work sharing • start with the tiles traced last frame, so cache hit rates increase • workers get tiles from each other and communicate in parallel, for better scalability • steal from random peers; the slowest worker gives up work (see the sketch below)
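A rough sketch of such a work-sharing loop under these policies; Tile, traceTile, and requestTileFrom are illustrative placeholders for the renderer's own types and messaging, and the grant-side "slowest worker gives work" decision is elided:

    #include <cstdlib>
    #include <deque>

    struct Tile { int x, y; };                   // illustrative tile descriptor

    void traceTile(const Tile&);                 // the renderer's tracing entry point (elided)
    bool requestTileFrom(int peer, Tile* out);   // peer-to-peer work request (elided)

    void renderFrame(int rank, int numNodes, std::deque<Tile>& myTiles) {
        // re-render last frame's tiles first: the scene data they touch is
        // likely still in the local cache, so hit rates increase
        std::deque<Tile> next;
        for (const Tile& t : myTiles) { traceTile(t); next.push_back(t); }
        // then steal from random peers until requests stop succeeding;
        // a real system would use a proper distributed termination protocol
        int failed = 0;
        while (failed < numNodes) {
            int victim = std::rand() % numNodes;
            if (victim == rank) { ++failed; continue; }
            Tile stolen;
            if (requestTileFrom(victim, &stolen)) {
                traceTile(stolen);
                next.push_back(stolen);          // claim it for the next frame
                failed = 0;
            } else {
                ++failed;
            }
        }
        myTiles.swap(next);
    }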
Central Work Queue vs. Distributed Work Sharing [diagram: with the central queue, a supervisor node holds the tile queue (tile 0, tile 1, tile 2, tile 3, ...) and deals tiles out to worker nodes 0-3; with distributed work sharing, each worker node starts with its own tile list and exchanges tiles directly with its peers]
Comparison • bunny, dragon, and acceleration structures in PDSM • measure misses and frame rates • vary local memory to simulate data much larger than physical memory
[charts: misses (0 to 1E6) and frames/sec (0 to 20) versus MB held locally, central queue vs. distributed sharing]
Three Techniques for Memory Efficiency • ODSM vs. PDSM • central work queue vs. distributed work sharing • polygonal mesh reorganization
Mesh “Bricking” • similar to volumetric bricking • increase hit rates by reorganizing scene data for better data locality • place neighboring triangles on the same page [figure: volume bricking stores each block of neighboring voxels at consecutive addresses instead of row-major order; mesh “bricking” likewise groups spatially neighboring triangles onto the same page; the address arithmetic is sketched below]
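For concreteness, a small sketch of the address arithmetic behind bricking a volume; B is the brick edge length, and this mirrors standard volume bricking rather than necessarily the paper's exact layout:

    #include <cstddef>

    // Map voxel (x, y, z) of an Nx x Ny x Nz volume, stored as B^3 bricks,
    // to its offset in the bricked array; assumes the sides are multiples of B.
    size_t brickedIndex(size_t x, size_t y, size_t z,
                        size_t Nx, size_t Ny, size_t B) {
        size_t bx = x / B, by = y / B, bz = z / B;   // which brick
        size_t lx = x % B, ly = y % B, lz = z % B;   // position inside the brick
        size_t brick = (bz * (Ny / B) + by) * (Nx / B) + bx;
        return brick * B * B * B + (lz * B + ly) * B + lx;
    }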
Reorganizing the Mesh • based on a grid acceleration structure • each grid cell contains pointers to the triangles within it • our grid structure is bricked in memory • create the grid acceleration structure • traverse the cells as stored in memory • append copies of the triangles to a new mesh • the new mesh has triangles sorted in space and in memory (a sketch follows)
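A sketch of that pass, assuming a grid whose cells are already stored in bricked order; Grid, Cell, and Triangle are illustrative types, not the paper's:

    #include <vector>

    struct Triangle { /* vertex data, etc. */ };
    struct Cell { std::vector<int> triIds; };   // indices into the input mesh
    struct Grid { std::vector<Cell> cells; };   // cells stored in bricked memory order

    // Walk the cells in their in-memory (bricked) order and append a copy of
    // each triangle a cell refers to; the output mesh is then sorted in space
    // and in memory at once. A triangle spanning several cells is copied once
    // per cell -- the duplication noted on the results slide.
    std::vector<Triangle> reorganize(const Grid& grid,
                                     const std::vector<Triangle>& input) {
        std::vector<Triangle> sorted;
        for (const Cell& c : grid.cells)
            for (int id : c.triIds)
                sorted.push_back(input[id]);
        return sorted;
    }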
Comparison • same test as before • compare input and sorted mesh
[charts: misses and frames/sec versus MB held locally, input mesh vs. sorted mesh]
note: the grid-based approach duplicates triangles that span cell boundaries [chart: frames/sec versus MB held locally, input mesh vs. sorted mesh]
Summary three techniques for more efficient memory use: • PDSM adds overhead only in the exceptional case of a data miss • reuse tile assignments, with parallel load-balancing heuristics • mesh reorganization puts related triangles onto nearby pages
Future Work • need a 64-bit architecture for very large data • thread-safe PDSM for hybrid parallelism • distributed gathering of pixel results • surface-based mesh reorganization
Acknowledgments • Funding agencies • NSF 9977218, 9978099 • DOE VIEWS • NIH • Reviewers, for tips and for seeing past the rough initial data presentation • EGPGV Organizers • Thank you!