



  1. ECE 8001 Advanced Computer Architecture. Paper: Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. Jiayuan Meng, David Tarjan, Kevin Skadron. Presented by: Prashant Revankar

  2. Contents • Introduction • Impact of Memory Latency Divergence • Methodology • Dynamic Warp Subdivision • Upon Branch Divergence • Upon Memory Divergence • Sensitivity Analysis • Conclusion and Future Work

  3. Introduction • Introduces Dynamic Warp Subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register space. • SIMD (Single Instruction, Multiple Data) outperforms MIMD for data-parallel workloads in throughput and cost (NVIDIA refers to its variant as SIMT). • WPU – Warp Processing Unit. • When a WPU does not have enough warps with all threads ready to execute, some individual threads may still be able to proceed. This happens in two cases: • Branch divergence • Memory latency divergence • DWS exploits such threads to tolerate both forms of divergence.
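As a concrete illustration of the lockstep SIMD execution discussed above (this sketch is not from the paper; the lane count and branch conditions are illustrative), a conditional branch yields two complementary active masks, and the warp must execute both paths back-to-back with inactive lanes masked off:

```python
# Minimal sketch of branch divergence in a SIMD warp: all lanes share
# one PC, so divergent lanes are serialized rather than run concurrently.

WARP_WIDTH = 8

def execute_branch(condition_per_lane):
    """Split the full active mask into taken/not-taken masks for one branch."""
    taken = [bool(c) for c in condition_per_lane]
    not_taken = [not t for t in taken]
    return taken, not_taken

conds = [1, 1, 0, 0, 1, 0, 1, 0]
taken, not_taken = execute_branch(conds)
# The warp runs the taken path, then the not-taken path, masking lanes off.
print(sum(taken), "lanes take the branch;", sum(not_taken), "do not")
```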

  4. Impact of Memory Latency Divergence • Figure 1a – performance scaling when varying the SIMD width from 1 to 16, with four warps sharing a 32 KB L1 D-cache. • Figure 1b – execution time as a function of D-cache associativity. • Figure 1c – adding more warps exacerbates L1 contention. • Figures are from the original paper.

  5. Impact of Memory Latency Divergence contd. • Benchmarks exhibiting frequent memory divergence.

  6. Methodology • Overall Architecture: • A general system that blends CPU and GPU architectures: general-purpose ISAs, SIMD organization, multi-threading, and a cache-coherent memory hierarchy. • The MV5 simulator combines these CPU and GPU aspects to simulate WPUs. • The simulated applications are programmed in an OpenMP-style API implemented in MV5, where data parallelism is expressed in parallel for loops.

  7. Methodology contd. • Benchmarks • These benchmarks are common data-parallel kernels and applications. • They cover scientific computing, image processing, physics simulation, machine learning, and data mining.

  8. Methodology contd. • Architecture Details: • A two-level coherent cache hierarchy. • Each WPU has private I- and D-caches. • Only the LLC communicates with main memory. • WPUs are simulated with up to 64 thread contexts and 16 lanes. • The experiments are limited to four WPUs to keep simulation time reasonable.

  9. Dynamic Warp Subdivision (DWS) • Branch divergence: occurs when threads in the same warp take different paths at a conditional branch. • Memory latency divergence: occurs when threads in a single warp experience different memory-reference latencies, caused by cache misses or accesses to different DRAM banks.
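The cost of memory latency divergence can be sketched as follows (a hypothetical model; the latency values are assumed for illustration, not taken from the paper): in a lockstep warp without subdivision, a load completes only when its slowest lane does, so a single cache miss stalls all lanes that hit.

```python
# Sketch: a warp's load latency is the max across its lanes, so one
# missing lane imposes the full miss latency on every hitting lane.

HIT_LATENCY = 2      # cycles, assumed value for illustration
MISS_LATENCY = 200   # cycles, assumed value for illustration

def warp_load_latency(hit_mask):
    """Lockstep warp: the load's latency is the max across all lanes."""
    return max(HIT_LATENCY if hit else MISS_LATENCY for hit in hit_mask)

# 7 of 8 lanes hit, but the whole warp waits on the single miss.
print(warp_load_latency([True] * 7 + [False]))  # prints 200
```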

  10. Upon Branch Divergence • A warp with SIMD width 8 encounters a conditional branch at the end of a code block. • Paths B and C re-converge at a common point. • The figure illustrates the conventional stack-based mechanism proposed by Fung et al.
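The conventional stack-based mechanism can be sketched roughly as follows (an assumed simplification, not the paper's implementation): on a divergent branch, the re-convergence PC is pushed first with the full mask, then the two divergent paths with their partial masks, so the paths execute one after the other and re-join when execution reaches the re-convergence PC.

```python
# Rough sketch of a re-convergence stack (after Fung et al.): each entry
# is (pc, active_mask); the top of the stack is what executes next.

def push_divergent_branch(stack, pc, mask, taken_mask, target_pc, reconv_pc):
    """Push the re-convergence point, then both divergent paths."""
    fallthrough_mask = [m and not t for m, t in zip(mask, taken_mask)]
    stack.append((reconv_pc, mask))            # where the paths re-join
    stack.append((pc + 1, fallthrough_mask))   # not-taken path
    stack.append((target_pc, taken_mask))      # taken path executes first
    return stack

stack = push_divergent_branch([], pc=10, mask=[True] * 4,
                              taken_mask=[True, False, True, False],
                              target_pc=20, reconv_pc=30)
print([entry[0] for entry in stack])  # prints [30, 11, 20]
```

Note the drawback DWS targets: while the taken path runs, the not-taken lanes sit idle under the stack entry even if they are ready to execute.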

  11. Examples • Subdivision of a warp into warp-splits.

  12. Unrelenting Subdivision • Over-subdivision • PC-based re-convergence • Warp-split table
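The interaction between the warp-split table and PC-based re-convergence can be sketched as below (the table layout and merge rule are assumptions for illustration, not the paper's exact hardware structure): each split is an independent scheduler entry, and two splits of the same warp merge back together when their PCs become equal.

```python
# Sketch: a warp-split table holds (warp_id, pc, active_mask) entries.
# PC-based re-convergence merges splits of the same warp with equal PCs
# by OR-ing their masks, limiting over-subdivision.

def merge_splits(split_table):
    """Merge warp-splits of the same warp whose PCs match."""
    merged = {}
    for warp_id, pc, mask in split_table:
        key = (warp_id, pc)
        if key in merged:  # re-converge: OR the active masks together
            merged[key] = [a or b for a, b in zip(merged[key], mask)]
        else:
            merged[key] = list(mask)
    return [(w, pc, m) for (w, pc), m in merged.items()]

table = [(0, 40, [True, False, True, False]),   # split A of warp 0
         (0, 40, [False, True, False, True]),   # split B, same PC: merge
         (1, 12, [True, True, True, True])]     # warp 1, unrelated
print(len(merge_splits(table)))  # prints 2
```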

  13. Results (Branch Divergence) • Comparison across the benchmarks. • Overall, PC-based re-convergence outperforms stack-based re-convergence, generating an average speedup of 1.13X.

  14. Upon Memory Divergence • Improving Performance • Re-converge

  15. Upon Memory Divergence • Preventing over-subdivision • BL – Branch-Limited re-convergence

  16. Results (Memory Divergence) • DWS upon branch divergence alone (DWS.BranchOnly) reaches a speedup of 1.13X. • DWS upon memory divergence alone using ReviveSplit (DWS.ReviveSplit.MemOnly) achieves a speedup of 1.20X.

  17. An example of dynamic warp subdivision upon memory divergence • While divergent branches can still be handled using the re-convergence stack (b, f), the warp-split table can be used to create warp-splits from the hit mask, which marks the threads that hit in the cache (c–e).
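The hit-mask subdivision described above can be sketched as follows (function and variable names are illustrative, not from the paper): the hit mask of a divergent load splits the warp into a runnable split that runs ahead and a suspended split that waits on memory.

```python
# Sketch of DWS upon memory divergence: split a warp by its hit mask so
# lanes that hit the cache can proceed while missing lanes wait.

def subdivide_on_miss(warp_mask, hit_mask):
    """Return (run-ahead split, memory-waiting split) from a hit mask."""
    run_ahead = [m and h for m, h in zip(warp_mask, hit_mask)]
    wait_mem = [m and not h for m, h in zip(warp_mask, hit_mask)]
    return run_ahead, wait_mem

run_ahead, wait_mem = subdivide_on_miss(
    [True] * 8,
    [True, True, False, True, False, True, True, True])
print(sum(run_ahead), "lanes run ahead;", sum(wait_mem), "wait on misses")
```

The run-ahead split hides the miss latency and can prefetch data for later instructions, which is where the memory-level parallelism comes from.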

  18. Sensitivity Analysis • Cache misses and memory divergence • Speedup vs. D-cache associativity • With larger D-cache associativity, both the cache miss rate and memory divergence tend to decrease, so there is less latency to hide and less memory-level parallelism to exploit. • Miss latency • The D-cache miss latency depends largely on the memory hierarchy.

  19. Conclusion and Future Work • The paper proposes dynamic warp subdivision to create warp-splits when the SIMD processor does not have sufficient warps to hide latency and leverage MLP. • DWS generates an average speedup of 1.7X. • The authors study the sensitivity of DWS to various architectural parameters and show that its performance is robust: DWS does not degrade performance on any of the benchmarks tested.

  20. Thank you!
