Performance in GPU Architectures: Potentials and Distances

Presentation Transcript


  1. Performance in GPU Architectures: Potentials and Distances. Amirali Baniasadi, ECE, University of Victoria. Ahmad Lashgar, ECE, University of Tehran. WDDD-9, June 5, 2011.

  2. This Work
     Goal: Investigating GPU performance for general-purpose workloads
     How: Studying the isolated impact of
     • Memory divergence
     • Branch divergence
     • Context-keeping resources
     Key finding: Memory has the biggest impact. Any solution to branch divergence needs to take memory into account.
     A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.

  3. Outline
     • Background
     • Performance Impacting Parameters
     • Machine Models
     • Performance Potentials
     • Performance Distances
     • Sensitivity Analysis
     • Conclusion

  4. GPU Architecture
     [Figure: GPU architecture. TPC1 to TPC10, each containing SM1 to SM3, connect through an interconnection network to memory controllers (MCtrl1, MCtrl2, ..., MCtrl6) and their DRAM. Each SM holds a thread pool (TID, CTAID, and program counter per thread), a register file, shared memory, PE1 to PE32, and L1 data, constant, and texture caches.]
     • Number of concurrent CTAs per SM is limited by the size of 3 shared resources:
       • Thread Pool
       • Register File
       • Shared Memory
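The resource limit on slide 4 can be sketched as a minimum over the three per-SM resources. This is a minimal illustration, not the simulator's actual logic; the class names, field names, and all sizes below are hypothetical placeholders rather than values from the presentation.

```python
from dataclasses import dataclass


@dataclass
class SMLimits:
    """Per-SM capacities (hypothetical field names)."""
    max_threads: int        # thread pool entries
    registers: int          # register file entries
    shared_mem_bytes: int   # shared memory capacity


@dataclass
class CTADemand:
    """What one CTA needs (hypothetical field names)."""
    threads: int
    regs_per_thread: int
    shared_mem_bytes: int


def max_concurrent_ctas(sm: SMLimits, cta: CTADemand) -> int:
    """Return how many CTAs fit on one SM: the tightest of the three limits."""
    limits = [
        sm.max_threads // cta.threads,                        # thread pool
        sm.registers // (cta.threads * cta.regs_per_thread),  # register file
    ]
    if cta.shared_mem_bytes > 0:                              # shared memory
        limits.append(sm.shared_mem_bytes // cta.shared_mem_bytes)
    return min(limits)


# Placeholder numbers chosen only to make the arithmetic easy to follow.
example_sm = SMLimits(max_threads=1024, registers=16384, shared_mem_bytes=16384)
example_cta = CTADemand(threads=256, regs_per_thread=20, shared_mem_bytes=4096)
example_ctas = max_concurrent_ctas(example_sm, example_cta)
```

In this made-up configuration the register file is the binding resource: the thread pool and shared memory would each allow four CTAs, but the registers allow only three.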

  5. Branch Divergence
         A: // Pre-Divergence
         if(CONDITION) {
             B: // NT path
         } else {
             C: // T path
         }
         D: // reconvergence point
     • An SM is a SIMD processor
     • A group of threads (a warp) executes the same instruction across the SIMD lanes.
     • A branch instruction can diverge a warp into two groups:
       • Threads with taken outcome
       • Threads with not-taken outcome
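As a rough illustration of the split described on slide 5, the sketch below turns per-thread branch outcomes into two lane masks and reports whether the warp diverged. The function name and the 8-lane example warp are assumptions made for readability, not details from the presentation.

```python
def split_warp(taken_outcomes):
    """Split a warp by branch outcome.

    taken_outcomes: one boolean per lane, True if that thread takes the branch.
    Returns (taken_mask, not_taken_mask, diverged).
    """
    taken_mask = [bool(t) for t in taken_outcomes]
    not_taken_mask = [not t for t in taken_mask]
    # The warp diverges only if both groups are non-empty.
    diverged = any(taken_mask) and any(not_taken_mask)
    return taken_mask, not_taken_mask, diverged


# Hypothetical 8-lane warp: five threads take the branch, three do not.
mixed = [True, True, False, True, False, False, True, True]
taken_mask, not_taken_mask, diverged = split_warp(mixed)

# A warp whose threads all agree does not diverge.
_, _, uniform_diverged = split_warp([True] * 8)
```

When `diverged` is true, a SIMD machine has to run the two groups separately, which is where the loss of lane utilization comes from.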

  6. Control-flow mechanism
     • Control-flow solutions address branch divergence.
     • Previous solutions:
       • Postdominator Reconvergence (PDOM)
         • Masking and serializing the diverging paths, then reconverging all paths
       • Dynamic Warp Formation (DWF)
         • Regrouping the threads in diverging paths into new warps

  7. PDOM
     [Figure: SIMD utilization over time under PDOM; TOS marks the top of the stack.]
     • Dynamic regrouping of diverged threads on the same path increases utilization

  8. DWF
     [Figure: SIMD utilization over time under DWF, showing the warp pool and a merge possibility.]
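The difference between the two mechanisms on slides 6 to 8 can be sketched with a deliberately simplified model: one instruction per path, PDOM issuing each diverged warp once per path with inactive lanes masked, and DWF packing same-path threads from all warps into full warps. Real DWF is constrained by register-file lane conflicts and scheduling, which this sketch ignores; all function names and the 4-lane warps are assumptions for illustration.

```python
import math


def pdom_issue_slots(warps):
    """Issue slots under masking and serialization: one per non-empty path per warp."""
    slots = 0
    for outcomes in warps:
        slots += int(any(outcomes))       # taken path, if any thread takes it
        slots += int(not all(outcomes))   # not-taken path, if any thread skips it
    return slots


def dwf_issue_slots(warps, warp_size):
    """Issue slots if same-path threads are regrouped into new warps (ideal packing)."""
    taken = sum(sum(outcomes) for outcomes in warps)
    not_taken = sum(len(outcomes) for outcomes in warps) - taken
    return math.ceil(taken / warp_size) + math.ceil(not_taken / warp_size)


def simd_utilization(warps, warp_size, slots):
    """Fraction of lane-slots doing useful work."""
    active_threads = sum(len(outcomes) for outcomes in warps)
    return active_threads / (slots * warp_size)


# Hypothetical workload: four 4-lane warps, each split evenly by the branch.
T, F = True, False
example_warps = [[T, T, F, F], [T, F, T, F], [F, F, T, T], [T, T, F, F]]

pdom_slots = pdom_issue_slots(example_warps)               # 8
dwf_slots = dwf_issue_slots(example_warps, warp_size=4)    # 4
pdom_util = simd_utilization(example_warps, 4, pdom_slots)
dwf_util = simd_utilization(example_warps, 4, dwf_slots)
```

Under these assumptions regrouping halves the issue slots for this workload. The model says nothing about memory behaviour, which the presentation identifies as the factor that decides whether regrouping actually pays off.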

  9. Performance impacting parameters
     • Memory Divergence
       • Increase of memory pressure with un-coalesced memory accesses
     • Branch Divergence
       • Decrease of SIMD efficiency when threads within a warp take different branch outcomes
     • Workload Parallelism
       • CTA-limiting resources bound memory latency hiding capability
       • Concurrent CTAs share 3 CTA-limiting resources:
         • Shared Memory
         • Register File
         • Thread Pool
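To make the memory-divergence point on slide 9 concrete, the sketch below counts how many memory segments a warp touches in one access. The 128-byte segment size, the 4-byte element size, and the function name are assumptions chosen for illustration; the presentation does not state them.

```python
def memory_transactions(addresses, segment_bytes=128):
    """Count distinct aligned segments touched by one warp-wide access.

    Each distinct segment is treated as one memory transaction
    (assumed model, not the simulator's exact coalescing rule).
    """
    return len({address // segment_bytes for address in addresses})


WARP_SIZE = 32

# Coalesced: consecutive threads read consecutive 4-byte words.
coalesced = [4 * tid for tid in range(WARP_SIZE)]

# Un-coalesced: each thread reads a word 128 bytes away from its neighbour.
strided = [128 * tid for tid in range(WARP_SIZE)]

coalesced_count = memory_transactions(coalesced)
strided_count = memory_transactions(strided)
```

With these assumptions the strided pattern issues 32 transactions where the coalesced pattern issues one, which is the kind of added memory pressure the slide refers to.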

  10. Machine Models
      • Isolates the impact of each parameter
      • Model names follow the pattern XX-YY-ZZ:
        • XX (resources): LR = Limited Resources, UR = Unlimited Resources
        • YY (control flow): DC = DWF Control-flow, PC = PDOM Control-flow, IC = Ideal Control-flow (MIMD)
        • ZZ (memory): M = Real Memory, IM = Ideal Memory

  11. Machine Models continued…
      • Limited per-SM resources, real memory: LR-DC-M, LR-PC-M, LR-IC-M
      • Limited per-SM resources, ideal memory: LR-DC-IM, LR-PC-IM, LR-IC-IM
      • Unlimited per-SM resources, real memory: UR-DC-M, UR-PC-M, UR-IC-M
      • Unlimited per-SM resources, ideal memory: UR-DC-IM, UR-PC-IM, UR-IC-IM
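The twelve names on slides 10 and 11 are the Cartesian product of the three parameter choices. A short sketch of that enumeration follows; the dictionary and function names are made up for this example.

```python
from itertools import product

# Abbreviations as given on the Machine Models slides.
RESOURCES = {"LR": "limited resources", "UR": "unlimited resources"}
CONTROL = {
    "DC": "DWF control-flow",
    "PC": "PDOM control-flow",
    "IC": "ideal control-flow (MIMD)",
}
MEMORY = {"M": "real memory", "IM": "ideal memory"}


def machine_models():
    """Return every resource/control/memory combination as a model name."""
    return ["-".join(parts) for parts in product(RESOURCES, CONTROL, MEMORY)]


def describe(model):
    """Expand a model name such as 'LR-PC-M' into plain words."""
    resources, control, memory = model.split("-")
    return f"{RESOURCES[resources]}, {CONTROL[control]}, {MEMORY[memory]}"


all_models = machine_models()
```

`LR-PC-M` and `LR-DC-M` are the fully realistic machines for the two control-flow mechanisms, while `UR-IC-IM` idealizes all three parameters.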

  12. Methodology
      • GPGPU-sim v2.1.1b
      • 13 benchmarks from the RODINIA benchmark suite and CUDA SDK 2.3

  13. Performance Potentials
      • The speedup that can be reached if the impacting parameter is idealized
      • 3 Potentials (per control-flow mechanism):
        • Memory Potential: speedup due to ideal memory
        • Control Potential: speedup due to a divergence-free architecture
        • Resource Potential: speedup due to infinite CTA-limiting resources per SM
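One plausible reading of these definitions, expressed with the machine-model names from slides 10 and 11, is that a potential compares a realistic baseline (for example `LR-PC-M`) with the same machine after a single parameter is idealized. The slides do not spell out the formula, so the pairing and the "ratio minus one" form below are assumptions, and the IPC values are invented placeholders, not measurements from the study.

```python
def idealize(model, parameter):
    """Return the model name with exactly one parameter switched to its ideal form."""
    resources, control, memory = model.split("-")
    if parameter == "memory":
        memory = "IM"
    elif parameter == "control":
        control = "IC"
    elif parameter == "resource":
        resources = "UR"
    else:
        raise ValueError(f"unknown parameter: {parameter}")
    return "-".join((resources, control, memory))


def potential(ipc, baseline, parameter):
    """Assumed definition: relative speedup of the idealized machine over the baseline."""
    return ipc[idealize(baseline, parameter)] / ipc[baseline] - 1.0


# Placeholder IPC values for illustration only.
example_ipc = {
    "LR-PC-M": 100.0,   # realistic PDOM machine
    "LR-PC-IM": 150.0,  # same machine with ideal memory
    "LR-IC-M": 90.0,    # same machine with ideal control flow
    "UR-PC-M": 110.0,   # same machine with unlimited resources
}

memory_potential = potential(example_ipc, "LR-PC-M", "memory")
control_potential = potential(example_ipc, "LR-PC-M", "control")
resource_potential = potential(example_ipc, "LR-PC-M", "resource")
```

Under this definition a potential can be negative when idealizing one parameter makes another bottleneck worse, as the placeholder control-flow entry illustrates.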

  14. Performance Potentials continued…

  15. Memory Potentials: DWF 61%, PDOM 59%

  16. Resource Potentials: DWF 8.6%, PDOM 9.4%

  17. Control Potentials: PDOM -7%, DWF 2%

  18. Performance Distances
      • How far an otherwise ideal GPU falls from the ideal because of a single non-ideal parameter.
      • 3 Distances:
        • Memory Distance: distance from ideal GPU due to real memory
        • Resource Distance: distance from ideal GPU due to limited resources
        • Control Distance: distance from ideal GPU due to branch divergence
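Read against the machine-model names from slides 10 and 11, a distance appears to compare the fully ideal machine `UR-IC-IM` with a machine that is ideal in everything except one parameter. The slides give no formula, so the choice of models and the "one minus ratio" form below are assumptions, and the IPC values are invented placeholders rather than results.

```python
IDEAL_MODEL = "UR-IC-IM"

# Assumed pairing: each entry makes exactly one parameter realistic.
ONE_REAL_PARAMETER = {
    "memory": "UR-IC-M",
    "resource": "LR-IC-IM",
    "control (PDOM)": "UR-PC-IM",
    "control (DWF)": "UR-DC-IM",
}


def distance(ipc, parameter):
    """Assumed definition: fraction of ideal performance lost to one real parameter."""
    return 1.0 - ipc[ONE_REAL_PARAMETER[parameter]] / ipc[IDEAL_MODEL]


# Placeholder IPC values for illustration only.
example_ipc = {
    "UR-IC-IM": 200.0,
    "UR-IC-M": 100.0,
    "LR-IC-IM": 190.0,
    "UR-PC-IM": 180.0,
    "UR-DC-IM": 160.0,
}

example_distances = {name: distance(example_ipc, name) for name in ONE_REAL_PARAMETER}
```

This reading matches the way the results are reported: memory and resource distances are single numbers because control flow is ideal in both, while control distance has one value per mechanism.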

  19. Performance Distances continued…

  20. Memory Distance: 40%

  21. Resource Distance: 2%

  22. Control Distances: PDOM 8%, DWF 15%

  23. Sensitivity Analysis
      • Validating the findings under aggressive configurations:
        • Aggressive-Memory
          • 2x L1 caches
          • 2x number of memory controllers
        • Aggressive-Resource
          • 2x CTA-limiting resources
      • Limited to performance potentials

  24. Aggressive-memory: Memory Potentials
      • DWF memory potential: 28%
      • PDOM memory potential: 28%

  25. Aggressive-memory continued…: Control Potentials
      • DWF control potential: -0.4%
      • PDOM control potential: -8%

  26. Aggressive-memory continued…: Resource Potentials
      • PDOM resource potential: 8%
      • DWF resource potential: ~0%

  27. Aggressive-resource: Memory Potentials
      • DWF memory potential: 52%
      • PDOM memory potential: 51%

  28. Aggressive-resource continued…: Control Potentials
      • PDOM control potential: -8%
      • DWF control potential: 2%

  29. Aggressive-resource continued…: Resource Potentials
      • PDOM resource potential: 4%
      • DWF resource potential: 3%

  30. Conclusion

  31. Conclusion
      • Performance in GPUs
        • Potentials: Improvement by idealizing
          • Memory: 59% and 61% for PDOM and DWF
          • Control: -7% and 2% for PDOM and DWF
          • Resource: 9.4% and 8.6% for PDOM and DWF
        • Distances: Distance from ideal system due to a non-ideal factor
          • Memory: 40%
          • Control: 8% and 15% for PDOM and DWF
          • Resource: 2%
      • Findings:
        • Memory has the biggest impact among the 3 factors
        • Improving the control-flow mechanism has to consider memory pressure
        • Same trend under aggressive memory and context-keeping resources

  32. Thank you. Questions?

  33. Why 32 PEs per SM
      • GPGPU-sim v2.1.1b coalesces memory accesses separately over SIMD-width slices of a warp, similar to pre-Fermi GPUs:
        • Example: Warp Size = 32, PEs per SM = 8
        • 4 independent coalescing domains in a warp
      • We used 32 PEs per SM at ¼ clock rate to model coalescing similar to Fermi GPUs.
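The arithmetic behind slide 33 can be sketched as follows: a warp is cut into slices as wide as the SIMD unit, and each slice coalesces on its own. The 128-byte segment, the 4-byte elements, and the function names are assumptions for illustration, not details of GPGPU-sim's implementation.

```python
def coalescing_domains(warp_size, pes_per_sm):
    """Number of SIMD-width slices of a warp that coalesce independently."""
    if warp_size % pes_per_sm != 0:
        raise ValueError("warp size must be a multiple of the SIMD width")
    return warp_size // pes_per_sm


def transactions_per_warp(addresses, pes_per_sm, segment_bytes=128):
    """Count transactions when each SIMD-width slice is coalesced separately."""
    total = 0
    for start in range(0, len(addresses), pes_per_sm):
        slice_addresses = addresses[start:start + pes_per_sm]
        total += len({address // segment_bytes for address in slice_addresses})
    return total


# A fully contiguous warp-wide read of 4-byte words (hypothetical access pattern).
contiguous = [4 * tid for tid in range(32)]

domains_8_pes = coalescing_domains(32, 8)               # 4, as on the slide
domains_32_pes = coalescing_domains(32, 32)             # 1
transactions_8_pes = transactions_per_warp(contiguous, pes_per_sm=8)
transactions_32_pes = transactions_per_warp(contiguous, pes_per_sm=32)
```

With 8 PEs even a perfectly contiguous access costs one transaction per slice in this model, whereas 32 PEs let the whole warp coalesce into a single transaction.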
