Interactive Distributed Ray Tracing of Highly Complex Models
Ingo Wald, University of Saarbrücken
http://graphics.cs.uni-sb.de/~wald
http://graphics.cs.uni-sb.de/rtrt
Previous Work
• Interactive Rendering of Massive Models (UNC)
  • Framework of algorithms
    • Textured depth meshes (96% reduction in #tris)
    • View-frustum culling & LOD (50% each)
    • Hierarchical occlusion maps (10%)
  • Extensive preprocessing required
    • Entire model: ~3 weeks (estimated)
  • Framerate (Onyx): 5 to 15 fps
  • Needs a shared-memory supercomputer
Previous Work II
• Memory-Coherent Ray Tracing, Pharr (Stanford)
  • Explicit cache management for rays and geometry
  • Extensive reordering and scheduling
  • Too slow for interactive rendering
  • Provides global illumination
• Parallel Ray Tracing, Parker et al. (Utah) & Muuss (ARL)
  • Needs a shared-memory supercomputer
• Interactive Rendering with Coherent Ray Tracing (Saarbrücken, EG 2001)
  • IRT on (cheap) PC systems
  • Avoiding CPU stalls is crucial
Previous Work: Lessons Learned
• Rasterization is possible for massive models … but not 'straightforward' (UNC)
• Interactive ray tracing is possible (Utah, Saarbrücken)
  • Easy to parallelize
  • Cost is only logarithmic in scene size
• Conclusion: parallel, interactive ray tracing should work great for massive models
Parallel IRT
• Parallel interactive ray tracing
  • Supercomputer: more threads
  • PCs: distributed IRT on a cluster of workstations (CoW)
• Distributed CoW: need fast access to scene data
• Simplistic access to scene data (sketched below)
  • mmap + caching, all done automatically by the OS
  • Either: replicate the scene
    • Extremely inflexible
  • Or: access a single copy of the scene over NFS (mmap)
    • Network issues: latencies and bandwidth
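A minimal sketch of this simplistic approach, not the paper's actual code: the whole scene file is mapped read-only and the OS pages data in on demand. The file name `scene.bin` and the flat triangle layout are assumptions made for illustration.

```cpp
// Sketch of the "simplistic" approach: map the whole scene file and let the OS
// page data in on demand. The file name and on-disk layout are made up; a 32-bit
// address space caps this at well under the size of the massive models above.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

struct Triangle { float v[3][3]; };   // illustrative on-disk record

int main() {
    int fd = open("scene.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Every first touch of a page causes a page fault; the rendering thread
    // stalls until the page arrives (far worse when the file lives on NFS).
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    const Triangle* tris = static_cast<const Triangle*>(base);
    size_t numTris = st.st_size / sizeof(Triangle);
    if (numTris > 0)
        std::printf("%zu triangles, first vertex (%f, %f, %f)\n",
                    numTris, tris[0].v[0][0], tris[0].v[0][1], tris[0].v[0][2]);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```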
Simplistic Approach
Caching via OS support won't work:
• The OS can't even address more than 2 GB of data, and massive models are far larger than 2 GB!
  • Also an issue when replicating the scene
• The process stalls due to demand paging, and stalls are very expensive!
  • Dual 1 GHz P-III: a 1 ms stall = 1 million cycles = about 1000 rays!
• The OS stalls the process automatically → reordering is impossible
Distributed Scene Access
• The simplistic approach doesn't work
• Need 'manual' caching and memory management
Caching Scene Data
• Two-level hierarchy of BSP trees
• Caching based on self-contained "voxels"
• Clients need only the top-level BSP (a few KB)
• Straightforward implementation (sketched below)
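A minimal sketch of the two-level scheme under stated assumptions, not the paper's actual data structures: the small top-level BSP, replicated on every client, references self-contained voxels at its leaves; voxels are fetched through a client-side cache on demand. All class and member names are made up, and eviction and the server protocol are omitted.

```cpp
// Illustrative two-level hierarchy: a few-KB top-level BSP whose leaves reference
// self-contained voxels (each carrying its own local BSP and triangles), loaded
// on demand through a fixed-size client cache.
#include <memory>
#include <unordered_map>
#include <vector>
#include <cstdint>

struct Triangle { float v[3][3]; };

struct Voxel {                       // self-contained unit of caching
    std::vector<Triangle> tris;      // geometry inside this voxel
    std::vector<uint8_t>  localBsp;  // serialized low-level BSP (opaque here)
};

struct TopLevelNode {                // top-level BSP node kept on every client
    int   axis;                      // split axis, -1 marks a leaf
    float split;
    int   children[2];               // indices into the node array
    int   voxelId;                   // valid only for leaves
};

class VoxelCache {
public:
    explicit VoxelCache(size_t maxVoxels) : maxVoxels_(maxVoxels) {}

    // Returns the voxel if resident, nullptr otherwise (after requesting it).
    const Voxel* lookup(int voxelId) {
        auto it = cache_.find(voxelId);
        if (it != cache_.end()) return it->second.get();
        requestFromServer(voxelId);  // asynchronous fetch, see "Hiding CPU Stalls"
        return nullptr;
    }

private:
    void requestFromServer(int voxelId) {
        // placeholder: send a fetch request to the model server;
        // eviction when cache_ exceeds maxVoxels_ is omitted
        (void)voxelId;
    }

    size_t maxVoxels_;
    std::unordered_map<int, std::unique_ptr<Voxel>> cache_;
};
```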
Caching Scene Data
• Preprocessing: splitting into voxels (sketched below)
  • Simple spatial sorting (BSP-tree construction)
  • Out-of-core algorithm due to model size
    • File-size limit and address space (2 GB)
  • Simplistic implementation: 2.5 hours
• Model server
  • One machine serves the entire model
  • Single server → potential bottleneck!
  • Could easily be distributed
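A hedged sketch of the out-of-core splitting step: the actual preprocessing builds a BSP, whereas this sketch buckets triangles into a uniform grid purely for brevity; the file naming and the centroid-based assignment are assumptions. The point it illustrates is that triangles are streamed from disk once and appended to per-voxel files, so memory use stays constant regardless of model size.

```cpp
// Illustrative out-of-core splitting: stream triangles, assign each to a voxel
// of a coarse spatial partition, and append it to that voxel's file on disk.
#include <algorithm>
#include <cstdio>
#include <string>

struct Triangle { float v[3][3]; };

// Hypothetical coarse partition: map a triangle's centroid to a grid cell index.
static int voxelIndex(const Triangle& t, const float lo[3], const float hi[3], int res) {
    int idx[3];
    for (int a = 0; a < 3; ++a) {
        float c = (t.v[0][a] + t.v[1][a] + t.v[2][a]) / 3.0f;
        float u = (c - lo[a]) / (hi[a] - lo[a]);
        idx[a] = std::min(res - 1, std::max(0, int(u * res)));
    }
    return (idx[2] * res + idx[1]) * res + idx[0];
}

void splitIntoVoxels(const char* sceneFile, const float lo[3], const float hi[3], int res) {
    std::FILE* in = std::fopen(sceneFile, "rb");
    if (!in) return;

    Triangle t;
    while (std::fread(&t, sizeof(Triangle), 1, in) == 1) {
        int vox = voxelIndex(t, lo, hi, res);
        std::string name = "voxel_" + std::to_string(vox) + ".bin";
        // append-only writes: only one triangle is ever resident in memory
        std::FILE* out = std::fopen(name.c_str(), "ab");
        if (out) {
            std::fwrite(&t, sizeof(Triangle), 1, out);
            std::fclose(out);
        }
    }
    std::fclose(in);
}
```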
Hiding CPU Stalls
• Caching alone does not prevent stalls!
• Avoiding stalls → reordering (sketched below)
  • Suspend rays that would stall on missing data
  • Fetch the missing data asynchronously
  • Immediately continue with other rays
    • Potentially no CPU stall at all!
  • Resume stalled rays after the data is available
• Can only hide 'some' latency → minimize voxel-fetching latencies
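A minimal sketch of the suspend/resume idea; queue layout and names are illustrative, not the paper's implementation. When traversal reaches a voxel that is not in the cache, the ray is parked on that voxel's wait list and the thread moves on to the next ready ray; parked rays are resumed once the asynchronous fetch completes.

```cpp
// Illustrative ray reordering: instead of stalling on a missing voxel, park the
// ray, keep working on other rays, and resume parked rays when data arrives.
#include <deque>
#include <unordered_map>
#include <vector>

struct Ray { float org[3], dir[3]; int pixel; };

class RayScheduler {
public:
    void push(const Ray& r) { ready_.push_back(r); }

    // Called by the tracing loop when a ray needs voxel `voxelId` that is absent.
    void suspend(const Ray& r, int voxelId) {
        waiting_[voxelId].push_back(r);
        // an asynchronous fetch of voxelId is assumed to be issued elsewhere
    }

    // Called when the asynchronous fetch of `voxelId` has completed.
    void onVoxelArrived(int voxelId) {
        auto it = waiting_.find(voxelId);
        if (it == waiting_.end()) return;
        for (const Ray& r : it->second) ready_.push_back(r);  // resume
        waiting_.erase(it);
    }

    // Fetch the next ray that can make progress right now.
    bool next(Ray& out) {
        if (ready_.empty()) return false;
        out = ready_.front();
        ready_.pop_front();
        return true;
    }

private:
    std::deque<Ray> ready_;                              // rays that can proceed now
    std::unordered_map<int, std::vector<Ray>> waiting_;  // rays parked per missing voxel
};
```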
Reducing Latencies
• Reduce network latencies
  • Prefetching? Hard to predict data accesses several ms in advance!
• Latency is dominated by transmission time (100 Mbit/s: 1 MB = 80 ms = 160 million cycles!)
• → Reduce the transmitted data volume
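Making the quoted figure explicit (assuming the cycle count is summed over both CPUs of a dual 1 GHz machine, which the slide does not state):

$$1\,\text{MB} = 8\,\text{Mbit},\qquad \frac{8\,\text{Mbit}}{100\,\text{Mbit/s}} = 80\,\text{ms},\qquad 80\,\text{ms}\times 2\times 10^{9}\,\tfrac{\text{cycles}}{\text{s}} \approx 160\,\text{million cycles}$$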
Reducing Bandwidth
• Compression of voxel data
  • The LZO library provides roughly 3:1 compression (sketched below)
  • Compared to the original transmission time, the decompression cost is negligible!
• Dual-CPU system: sharing of the voxel cache
  • Amortizes bandwidth, storage, and decompression effort over both CPUs
  • → Even better for more CPUs
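A hedged sketch of compressing a serialized voxel with the LZO library, assuming the standard LZO1X API and its documented worst-case buffer sizing; the function and buffer names are made up, and the achieved ratio is data dependent.

```cpp
// Illustrative LZO1X compression of a serialized voxel before sending it over
// the network; the receiving client would call lzo1x_decompress accordingly.
#include <lzo/lzo1x.h>
#include <vector>
#include <cstdio>

// Compress `voxel` into `out`; returns false on failure. The input is not modified.
bool compressVoxel(std::vector<unsigned char>& voxel,
                   std::vector<unsigned char>& out) {
    static bool initialized = (lzo_init() == LZO_E_OK);  // one-time library init
    if (!initialized) return false;

    // LZO's documented worst-case output size for incompressible input
    out.resize(voxel.size() + voxel.size() / 16 + 64 + 3);
    std::vector<unsigned char> wrkmem(LZO1X_1_MEM_COMPRESS);

    lzo_uint outLen = out.size();
    int rc = lzo1x_1_compress(voxel.data(), voxel.size(),
                              out.data(), &outLen, wrkmem.data());
    if (rc != LZO_E_OK) return false;

    out.resize(outLen);
    std::printf("compressed %zu -> %zu bytes\n",
                (size_t)voxel.size(), (size_t)outLen);
    return true;
}
```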
Load Balancing
• Demand-driven distribution of image tiles (32x32 pixels), sketched below
• Buffering of work tiles on the client
  • Avoids communication latency
• Frame-to-frame coherence improves caching
  • Keep rays on the same client
    • Simple: keep tiles on the same client (implemented)
    • Better: assign tiles based on reprojected pixels (future work)
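A minimal sketch of the server-side tile scheduler under the "keep tiles on the same client" heuristic; the class layout is illustrative, and per-frame refilling of the work queue and the network layer are omitted. A tile is preferentially handed to the client that rendered it last frame, so the voxels it needs are likely already in that client's cache.

```cpp
// Illustrative demand-driven tile distribution with frame-to-frame coherence:
// each 32x32 tile remembers which client rendered it last frame and is offered
// to that client first, so cached voxels can be reused.
#include <cstddef>
#include <deque>
#include <vector>

struct Tile { int x, y; };           // top-left corner of a 32x32 pixel tile

class TileScheduler {
public:
    TileScheduler(int width, int height)
        : lastOwner_((width / 32) * (height / 32), -1),
          tilesPerRow_(width / 32) {
        // refilling pending_ at the start of each frame is omitted here
        for (int y = 0; y < height; y += 32)
            for (int x = 0; x < width; x += 32)
                pending_.push_back({x, y});
    }

    // Hand out the next tile to `client`, preferring tiles it rendered last frame.
    bool requestTile(int client, Tile& out) {
        for (std::size_t i = 0; i < pending_.size(); ++i) {
            if (owner(pending_[i]) == client) {
                out = pending_[i];
                pending_.erase(pending_.begin() + (std::ptrdiff_t)i);
                return true;
            }
        }
        if (pending_.empty()) return false;   // frame finished
        out = pending_.front();               // otherwise: any tile (pure demand driven)
        pending_.pop_front();
        lastOwner_[index(out)] = client;      // remember the new owner for next frame
        return true;
    }

private:
    int index(const Tile& t) const { return (t.y / 32) * tilesPerRow_ + t.x / 32; }
    int owner(const Tile& t) const { return lastOwner_[index(t)]; }

    std::deque<Tile> pending_;
    std::vector<int> lastOwner_;
    int tilesPerRow_;
};
```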
Results
• Setup
  • Seven dual Pentium-III 800-866 MHz machines as rendering clients
  • 100 Mbit FastEthernet
  • One display & model server (same machine)
    • Gigabit Ethernet (already necessary for the pixel data)
• Power plant performance
  • 3-6 fps with the pure C implementation
  • 6-12 fps with SSE support
Animation: Framerate vs. Bandwidth
Latency hiding works!
Scalability
Server bottleneck after 12 CPUs → distribute the model server!
Performance: Detail Views
Framerate (640x480): 3.9 - 4.7 fps (seven dual P-III 800-866 MHz machines, NO SSE)
Shadows and Reflections
Framerate: 1.4-2.2 fps (NO SSE)
Conclusions
• IRT works great for highly complex models!
• Distribution issues can be solved
• At least as fast as sophisticated hardware-based techniques
• Less preprocessing
• Cheap
• Simple & easy to extend (shadows, reflections, shading, …)
Future Work
• Smaller cache granularity
• Distributed scene server
• Cache-coherent load balancing
• Dynamic scenes & instances
• Hardware support for ray tracing
Acknowledgments
• Anselmo Lastra, UNC: power plant reference model
• … other complex models are welcome …
Questions?
For further information visit http://graphics.cs.uni-sb.de/rtrt
Detailed View of Power Plant
Framerate: 4.7 fps (seven dual P-III 800-866 MHz machines, NO SSE)
Detail View: Furnace
Framerate: 3.9 fps, NO SSE
Overview
• Reference Model: the power plant (12.5 million tris)
• Previous Work
• Distribution Issues
• Massive Model Issues
• Images & Demo