Beyond CUDA/GPUs and Future Graphics Architectures

Karu Sankaralingam, University of Wisconsin-Madison. Adapted from “Toward A Multicore Architecture for Real-time Raytracing,” MICRO-41, 2008, by Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, and William R. Mark.

Presentation Transcript


  1. Beyond CUDA/GPUs and Future Graphics Architectures. Karu Sankaralingam, University of Wisconsin-Madison. Adapted from “Toward A Multicore Architecture for Real-time Raytracing,” MICRO-41, 2008, by Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, and William R. Mark.

  2. Real-time Graphics Rendering: Today

  3. Real-time Graphics Rendering: Future vs. Today

  4. Real-time Graphics Rendering: What are the problems? How can we get there?

  5. What is wrong with this picture?

  6. GPU/CUDA Z-buffer

  7. “Ptolemaic” Graphic Universe (diagram labels: Z-buffer, Application, Arch) • Architecture and application are both optimized for the Z-buffer • Difficult to render images with realistic effects such as self-reflection, soft shadows, and ambient occlusion • Problems: scene constraints, artist and programmer productivity

  8. Current Graphics Architectures Courtesy: ACM Queue

  9. How did we get here? • Hardware rasterizers and perspective-correct texture mapping (RIVA 128) • Single-pass multitexture (TNT / TNT2) • Register combiners: a generalization of multitexture (GeForce 256) • Per-pixel shading (GeForce 2 GTS) • Programmable hardware pixel shading • Programmable vertex shading • CUDA

  10. “Copernican” Graphic Universe (diagram labels: Algorithm, Application, Ray-tracing, Arch) • Architecture and application revolve around the algorithm • More general-purpose algorithm • Easier to provide realistic effects • Architecture can support other applications

  11. Future Graphics Architectures Courtesy: ACM Queue

  12. Executive Summary: Copernicus System • Co-designed application, architecture, and analysis framework • Path from specialized graphics architecture to more general-purpose architecture • A detailed characterization and analysis framework • Real-time frame rates possible for high-quality dynamic scenes

  13. Outline • Motivation • Copernicus system • Graphics Algorithm: Razor • Architecture • Evaluation and Results • Summary

  14. Ray-tracing (diagram labels: Full scene, Cube, Cylinder) • Simulates the behavior of light rays through a 3D scene • Rays from the eye to the scene (primary rays) • Rays from the hit point to the light (secondary rays) • Acceleration structure (e.g., BSP tree) for efficiency
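The primary/secondary ray distinction on this slide can be illustrated with a minimal sketch. It is not taken from the talk: it assumes a scene made only of spheres, a single point light, and no acceleration structure, and the names `ray_sphere_hit` and `trace` are hypothetical.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ray_sphere_hit(origin, direction, center, radius):
    """Nearest positive distance along a unit-length ray to a sphere, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    b = 2.0 * sum(d * e for d, e in zip(direction, oc))
    c = sum(e * e for e in oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in ((-b - root) / 2.0, (-b + root) / 2.0):
        if t > 1e-6:  # small epsilon skips the ray's own starting surface
            return t
    return None

def trace(eye, pixel_dir, spheres, light):
    """Shoot a primary ray; on a hit, shoot one shadow ray toward the light.

    Returns "miss", "lit", or "shadow".
    """
    direction = normalize(pixel_dir)
    nearest = None
    for center, radius in spheres:  # primary ray: eye to scene
        t = ray_sphere_hit(eye, direction, center, radius)
        if t is not None and (nearest is None or t < nearest):
            nearest = t
    if nearest is None:
        return "miss"
    hit = [e + nearest * d for e, d in zip(eye, direction)]
    to_light = [l - h for l, h in zip(light, hit)]
    dist_to_light = math.sqrt(sum(x * x for x in to_light))
    shadow_dir = normalize(to_light)
    for center, radius in spheres:  # secondary ray: hit point to light
        t = ray_sphere_hit(hit, shadow_dir, center, radius)
        if t is not None and t < dist_to_light:
            return "shadow"
    return "lit"
```

Every ray here tests every sphere, which is the linear cost that an acceleration structure such as a BSP tree or KD-tree is meant to avoid.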

  15. Disadvantages of Raytracing • For dynamic scenes, the acceleration structure must be rebuilt every frame • Traversing the acceleration structure causes irregular data accesses • High-resolution secondary ray tracing is computationally expensive

  16. Razor: A Dynamic Multiresolution Raytracer (diagram labels: Thread 1, Thread 2, Cylinder, Cube) • Packet ray-tracer: traces a beam of rays instead of a single ray • Opportunity for data-level parallelism • Each thread lazily builds its own acceleration structure (KD-tree) • Each thread builds only the portion of the structure it needs
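The lazy-build idea can be sketched as a KD-tree whose nodes are split only when a query first visits them. This is a simplified illustration, not Razor's implementation: it stores 3D points rather than scene geometry, uses an axis-aligned box query as a stand-in for a beam of rays, and runs in a single thread. The names `LazyKDNode` and `count_expanded` are hypothetical.

```python
class LazyKDNode:
    """KD-tree node over 3D points that is split on first visit, not up front."""

    def __init__(self, points, depth=0, leaf_size=2):
        self.points = list(points)
        self.depth = depth
        self.leaf_size = max(leaf_size, 1)
        self.axis = depth % 3
        self.split = None
        self.left = None
        self.right = None
        self.expanded = False

    def _expand(self):
        """Split this node at the median the first time it is visited."""
        if self.expanded:
            return
        self.expanded = True
        if len(self.points) <= self.leaf_size:
            return  # small enough: stays a leaf
        pts = sorted(self.points, key=lambda p: p[self.axis])
        mid = len(pts) // 2
        self.split = pts[mid][self.axis]
        self.left = LazyKDNode(pts[:mid], self.depth + 1, self.leaf_size)
        self.right = LazyKDNode(pts[mid:], self.depth + 1, self.leaf_size)
        self.points = None

    def query(self, lo, hi):
        """Return points inside the box [lo, hi], expanding only visited nodes."""
        self._expand()
        if self.left is None:  # leaf
            return [p for p in self.points
                    if all(l <= x <= h for l, x, h in zip(lo, p, hi))]
        found = []
        if lo[self.axis] <= self.split:  # left child holds coords <= split
            found += self.left.query(lo, hi)
        if hi[self.axis] >= self.split:  # right child holds coords >= split
            found += self.right.query(lo, hi)
        return found

def count_expanded(node):
    """Count the nodes that have actually been visited and built so far."""
    if node is None or not node.expanded:
        return 0
    return 1 + count_expanded(node.left) + count_expanded(node.right)
```

A query that touches a small region leaves most of the tree unbuilt. In a per-thread design like the one the slide describes, each thread would own one such tree, so no locking is needed during expansion, at the cost of possibly building the same subtree in more than one thread.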

  17. Razor: A Dynamic Multiresolution Raytracer • Multi-level resolution to reduce secondary-ray computation • Replicates the KD-tree to reduce synchronization across threads • Hypothesis: duplication across threads will be limited

  18. Razor Implementation • Linux/x86 • Implemented on Intel Clovertown • Parallelized using pthreads • Optimized with SSE instructions • Sustains 1 FPS on this prototype system • Helps develop algorithms • Designed with future hardware in mind

  19. Razor’s Memory Usage (plot: memory footprint vs. # threads)

  20. Parallel Scalability (plot: speedup vs. # threads)

  21. Outline • Motivation • Copernicus system • Graphics Algorithm: Razor • Architecture • Evaluation and Results • Summary

  22. Architecture: Core • In-order core • Private L1 data and instruction caches • Supports SIMD instructions • SMT threads to hide memory latency

  23. Architecture: Tile • Shared L2 cache • Shared Accelerator for specialized instructions

  24. Architecture: Chip

  25. Architecture: Razor Mapping (diagram labels: Assigned to Core, Assigned to Tile)

  26. Outline • Motivation • Copernicus system • Graphics Algorithm: Razor • Architecture • Evaluation and Results • Summary

  27. Benchmark Scenes: Courtyard, Fairyforest, Forest, Juarez, Saloon

  28. Evaluation Methodology • Simulation with Multifacet/GEMS • Simulate SSE instructions • Simulate a full tile • Validated with prototype data • Pin-based and PAPI-based performance counters • Randomly selected regions of scenes • Full chip: simulating the full chip is too slow, so build a customized analytic model

  29. Analytical Model • Core level: pipeline stalls, multiple threads • Tile level: L2 contention • Chip level: main memory contention • Compared with our simulation results
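The slide names the three levels of the model but not its equations, so the following is only a toy sketch of how a core/tile/chip hierarchy might be composed, not the authors' model. It assumes each thread alternates between a fixed number of compute cycles and a fixed number of stall cycles, and it treats L2 and main-memory contention as simple throughput caps. All function names and parameters are hypothetical.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles a core is busy when threads overlap each other's stalls.

    Classic latency-hiding estimate: each thread alone keeps the core busy for
    compute / (compute + stall) of the time; extra threads fill the stall gaps
    until the core saturates at 1.0.
    """
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))

def chip_rays_per_second(tiles, cores_per_tile, threads,
                         compute_cycles, stall_cycles,
                         clock_hz, cycles_per_ray,
                         tile_l2_cap, chip_mem_cap):
    """Compose core, tile, and chip levels; caps are in rays per second."""
    # Core level: pipeline stalls partly hidden by multiple threads.
    core_rate = (core_utilization(threads, compute_cycles, stall_cycles)
                 * clock_hz / cycles_per_ray)
    # Tile level: shared L2 limits what the cores in a tile can sustain together.
    tile_rate = min(cores_per_tile * core_rate, tile_l2_cap)
    # Chip level: main memory limits what all tiles can sustain together.
    return min(tiles * tile_rate, chip_mem_cap)
```

For example, `core_utilization(1, 10, 30)` is 0.25 and `core_utilization(4, 10, 30)` is 1.0 under these assumed cycle counts. A real model would derive stall cycles and contention from measured cache miss rates and queueing behavior rather than fixed constants.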

  30. Single Core Performance, Single Issue (plot: IPC)

  31. Single Core Performance, Dual Issue (plot: IPC)

  32. Single Tile Performance (plot: IPC)

  33. Full Chip Performance (plot: million rays/second vs. # tiles)

  34. So, Are we there yet?

  35. Results • Goal: 100 million rays per second • Achieved: 50 million rays per second, with 16 tiles and 4 DIMMs • Insights: 4-way SMT, single-issue is ideal for this workload; good parallel scalability; Razor’s physically motivated optimizations work • Potential for further architectural optimizations: shared accelerator, wide SIMD bundles
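A short back-of-the-envelope calculation puts these figures in context. Only the 100 million and 50 million rays per second and the 16 tiles come from the slide. The linear-scaling estimate and the 1024x1024, 30 FPS display are assumptions made for illustration.

```python
import math

GOAL_RAYS_PER_S = 100e6      # from the slide
ACHIEVED_RAYS_PER_S = 50e6   # from the slide
TILES = 16                   # from the slide

def rays_per_tile(total_rays_per_s, tiles):
    """Average throughput contributed by each tile."""
    return total_rays_per_s / tiles

def tiles_for_goal(goal_rays_per_s, per_tile_rays_per_s):
    """Tiles needed IF throughput scaled linearly with tile count (an assumption)."""
    return math.ceil(goal_rays_per_s / per_tile_rays_per_s)

def rays_per_pixel(rays_per_s, width, height, fps):
    """Ray budget per pixel per frame at an assumed resolution and frame rate."""
    return rays_per_s / (width * height * fps)

per_tile = rays_per_tile(ACHIEVED_RAYS_PER_S, TILES)
print(per_tile)                                      # 3125000.0
print(tiles_for_goal(GOAL_RAYS_PER_S, per_tile))     # 32
print(rays_per_pixel(GOAL_RAYS_PER_S, 1024, 1024, 30))
```

The 32-tile figure is an upper bound on optimism: main memory contention at the chip level could keep throughput from scaling linearly with tile count.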

  36. Outline • Motivation • Copernicus system • Graphics Algorithm: Razor • Architecture • Evaluation and Results • Summary

  37. Summary • A transformation path to ray-tracing: from the Ptolemaic to the Copernican graphics universe • Unique architecture design point: trades data redundancy and re-computation for reduced synchronization • Evaluation methodology interesting in its own right: prototype, simulation, and analytical framework to design and evaluate future systems • Future work: instruction specialization and shared accelerator design; tradeoffs with SIMD width and area; memory system

  38. Other Questions?