
GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Presentation Transcript


  1. GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
  Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal
  Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

  2. GPU Programming Gets Popular
  • Many domains are using GPUs for high performance
    • GPU-accelerated Molecular Dynamics
    • GPU-accelerated Seismic Imaging
  • Available in both high-end and low-end systems
    • the #1 supercomputer in the world uses GPUs [TOP500, Nov 2012]
    • commodity desktops/laptops equipped with GPUs

  3. Writing Efficient GPU Programs is Challenging
  • Need careful management of
    • a large number of threads
  [Figure: threads grouped into thread blocks]

  4. Writing Efficient GPU Programs is Challenging
  • Need careful management of
    • a large number of threads
    • a multi-layer memory hierarchy
  [Figure: Kepler GK110 memory hierarchy — threads and thread blocks above shared memory, the L1 cache, and the read-only data cache, backed by the L2 cache and DRAM (device memory)]

  5. Writing Efficient GPU Programs is Challenging
  • Need careful management of
    • a large number of threads
    • a multi-layer memory hierarchy
  [Figure: Kepler GK110 memory hierarchy — the on-chip shared memory, L1 cache, and read-only data cache are fast but small; the DRAM (device memory) behind the L2 cache is large but slow]

  6. Writing Efficient GPU Programs is Challenging
  • Which data in shared memory are infrequently accessed?
  • Which data in device memory are frequently accessed?
  [Figure: the two questions annotated on the Kepler GK110 memory hierarchy]

  7. Writing Efficient GPU Programs is Challenging
  • Existing tools can't help much:
    • inapplicable to GPUs
    • coarse-grained
    • prohibitive runtime overhead
    • cannot handle irregular/indirect accesses
  [Figure: the two questions annotated on the Kepler GK110 memory hierarchy]

  8. Outline
  • Motivation
  • GMProf
    • Naïve Profiling Approach
    • Optimizations
    • Enhanced Algorithm
  • Evaluation
  • Conclusions

  9. GMProf-basic: The Naïve Profiling Approach
  • Shared Memory Profiling
    • integer counters count the accesses to shared memory
    • one counter for each shared memory element
    • counters are updated atomically, to avoid race conditions among threads
  • Device Memory Profiling
    • integer counters count the accesses to device memory
    • one counter for each element of the user's device memory arrays, since device memory is too large (e.g., 6GB) to be monitored as a whole
    • counters are updated atomically
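  Concretely, the instrumentation GMProf-basic implies might look like the minimal sketch below; the kernel and the names SHM_SIZE and shm_count are illustrative, not GMProf's actual generated code:

    #define SHM_SIZE 256

    __global__ void copy_kernel(const float *in, float *out) {
        __shared__ float s[SHM_SIZE];
        // GMProf-basic: one integer counter per shared memory element
        __shared__ unsigned int shm_count[SHM_SIZE];

        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;
        shm_count[tid] = 0;
        __syncthreads();

        s[tid] = in[gid];
        atomicAdd(&shm_count[tid], 1u);   // count the write atomically
        __syncthreads();

        out[gid] = s[tid];
        atomicAdd(&shm_count[tid], 1u);   // count the read atomically
        // In the full scheme the counters would be flushed to device
        // memory at kernel exit so the host can inspect them.
    }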

  10. Outline
  • Motivation
  • GMProf
    • Naïve Profiling Approach
    • Optimizations
    • Enhanced Algorithm
  • Evaluation
  • Conclusions

  11. GMProf-SA: Static Analysis Optimization
  • Observation I: Many memory accesses can be determined statically

    extern __shared__ int s[];
    ...
    s[threadIdx.x] = 3;

  12. GMProf-SA: Static Analysis Optimization
  • Observation I: Many memory accesses can be determined statically

    extern __shared__ int s[];
    ...
    s[threadIdx.x] = 3;

    → Don't need to count the access at runtime

  13. GMProf-SA: Static Analysis Optimization
  • Observation I: Many memory accesses can be determined statically

    extern __shared__ int s[];
    ...
    s[threadIdx.x] = 3;

    → Don't need to count the access at runtime

  • How about this?

    extern __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }

  14. GMProf-SA: Static Analysis Optimization
  • Observation II: Some accesses are loop-invariant
    • e.g., the address of s[input[c]] does not depend on the outer loop iterator r

    extern __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }

  15. GMProf-SA: Static Analysis Optimization
  • Observation II: Some accesses are loop-invariant
    • e.g., the address of s[input[c]] does not depend on the outer loop iterator r
    → Don't need to profile in every r iteration

    extern __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }

  16. GMProf-SA: Static Analysis Optimization
  • Observation II: Some accesses are loop-invariant
    • e.g., the address of s[input[c]] does not depend on the outer loop iterator r
    → Don't need to profile in every r iteration
  • Observation III: Some accesses are tid-invariant
    • e.g., the address of s[input[c]] does not depend on threadIdx

    extern __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }

  17. GMProf-SA: Static Analysis Optimization
  • Observation II: Some accesses are loop-invariant
    • e.g., the address of s[input[c]] does not depend on the outer loop iterator r
    → Don't need to profile in every r iteration
  • Observation III: Some accesses are tid-invariant
    • e.g., the address of s[input[c]] does not depend on threadIdx
    → Don't need to update the counter in every thread

    extern __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }
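  A hand-written illustration of the transformation these observations enable is sketched below; it assumes input[] does not change during the loop nest, and the counter array shm_count and the loop bounds are hypothetical:

    __global__ void loop_kernel(const int *input, int n_r, int n_c) {
        extern __shared__ float s[];             // contents loaded earlier
        __shared__ unsigned int shm_count[256];  // assumed zeroed earlier
        float temp;

        for (int r = 0; r < n_r; r++) {
            for (int c = 0; c < n_c; c++) {
                temp = s[input[c]];
                // No counter update here: the address s[input[c]] is
                // loop-invariant w.r.t. r (Obs. II) and tid-invariant
                // (Obs. III).
            }
        }
        // Hoisted update, once per block: every one of the blockDim.x
        // threads touched s[input[c]] in each of the n_r iterations.
        if (threadIdx.x == 0) {
            for (int c = 0; c < n_c; c++)
                shm_count[input[c]] += n_r * blockDim.x;
        }
        (void)temp;   // the sketch discards the loaded value
    }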

  18. GMProf-NA: Non-Atomic Operation Optimization
  • Atomic operations cost a lot
    • they serialize all concurrent threads that update a shared counter

    atomicAdd(&counter, 1);

  [Figure: concurrent threads become serialized threads at the atomic update]

  • Use non-atomic operations to update the counters
    • this does not hurt the overall accuracy, thanks to the other optimizations
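  The change itself is a one-line substitution; a sketch, with idx standing for whatever element index is being profiled and shm_count as above:

    // Before: all threads hitting the same counter serialize.
    atomicAdd(&shm_count[idx], 1u);

    // After (GMProf-NA): a plain read-modify-write.  Concurrent updates
    // may occasionally be lost, but GMProf only needs the magnitude of
    // the count, and the static analysis has already removed most of
    // the heavily contended updates.
    shm_count[idx]++;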

  19. GMProf-SM: Shared Memory Counters Optimization
  • Make full use of shared memory
    • store the counters themselves in shared memory when possible
    • reduce the counter size, e.g., from 32-bit integers to 8-bit
  [Figure: memory hierarchy — the counters go into the fast but small shared memory rather than into device memory]

  20. GMProf-TH: Threshold Optimization
  • A precise count may not be necessary
    • e.g., it is enough to know that A is accessed 10 times while B is accessed > 100 times
  • Stop counting once a counter reaches a certain threshold
    • a tradeoff between accuracy and overhead
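  GMProf-SM and GMProf-TH compose naturally. Below is a sketch with a hypothetical helper profile_access and an illustrative threshold of 250, which also keeps the 8-bit counters from wrapping around:

    #define N_ELEMS   256
    #define THRESHOLD 250   // illustrative; "250+" is as informative
                            // as an exact large count

    __device__ __forceinline__ void profile_access(unsigned char *cnt,
                                                   int idx) {
        // GMProf-TH: saturate instead of counting precisely.
        // Non-atomic update, per GMProf-NA.
        if (cnt[idx] < THRESHOLD)
            cnt[idx]++;
    }

    __global__ void kernel(const float *in, float *out) {
        __shared__ float s[N_ELEMS];
        // GMProf-SM: 8-bit counters in shared memory -- a quarter of
        // the space of 32-bit counters, and no device memory traffic.
        __shared__ unsigned char shm_count[N_ELEMS];

        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;
        shm_count[tid] = 0;
        __syncthreads();

        s[tid] = in[gid];
        profile_access(shm_count, tid);
        __syncthreads();

        out[gid] = s[tid];
        profile_access(shm_count, tid);
    }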

  21. Outline
  • Motivation
  • GMProf
    • Naïve Profiling Approach
    • Optimizations
    • Enhanced Algorithm
  • Evaluation
  • Conclusions

  22. GMProf-Enhanced: Live Range Analysis
  • The number of accesses to a shared memory location may be misleading
  [Figure: shm_buf in shared memory holds data0, data1, and data2 in turn, each staged from input_array in device memory and written back to output_array in device memory]
  • Need to count the accesses to (and reuse of) the DATA, not the address

  23. GMProf-Enhanced: Live Range Analysis
  • Track each piece of data over its live range in shared memory
  • Use a logical clock to mark the boundaries of each live range
  • Keep separate counters per live range, indexed by the logical clock

    ...
    shm_buffer = input_array[0];   // load data0 from DM to ShM: live range of data0 begins
    ...
    output_array[0] = shm_buffer;  // store data0 from ShM to DM: live range of data0 ends
    ...
    shm_buffer = input_array[1];   // load data1 from DM to ShM: live range of data1 begins
    ...
    output_array[1] = shm_buffer;  // store data1 from ShM to DM: live range of data1 ends
    ...
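  One plausible shape of the instrumented streaming loop is sketched below; the chunked loop structure and the per_range_count layout (one row of 8-bit counters per live range) are assumptions for illustration, not the paper's implementation:

    __global__ void stream_kernel(const float *input_array,
                                  float *output_array,
                                  unsigned char *per_range_count,
                                  int n_chunks) {
        __shared__ float shm_buffer[256];
        __shared__ unsigned char cnt[256];
        int tid = threadIdx.x;

        for (int chunk = 0; chunk < n_chunks; chunk++) {
            // A load from device memory begins a new live range:
            // advance the logical clock (here simply `chunk`) and
            // start from fresh counters.
            cnt[tid] = 0;
            shm_buffer[tid] = input_array[chunk * blockDim.x + tid];
            __syncthreads();

            // ... computation on shm_buffer; each shared memory access
            // bumps cnt (a single access shown for illustration) ...
            if (cnt[tid] < 250) cnt[tid]++;
            __syncthreads();

            // The store back to device memory ends the live range:
            // record this range's counts under its own clock value, so
            // the reuse of data0, data1, ... is not conflated.
            per_range_count[chunk * blockDim.x + tid] = cnt[tid];
            output_array[chunk * blockDim.x + tid] = shm_buffer[tid];
            __syncthreads();
        }
    }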

  24. Outline
  • Motivation
  • GMProf
    • Naïve Profiling Approach
    • Optimizations
    • Enhanced Algorithm
  • Evaluation
  • Conclusions

  25. Methodology
  • Platform
    • GPU: NVIDIA Tesla C1060
      • 240 cores (30 SMs × 8 cores each), 1.296 GHz
      • 16KB shared memory per SM
      • 4GB device memory
    • CPU: 2 × AMD Opteron @ 2.6 GHz, 8GB main memory
    • Linux kernel 2.6.32, CUDA Toolkit 3.0
  • Six applications
    • Co-clustering, EM clustering, Binomial Options, Jacobi, Sparse Matrix-Vector Multiplication, and DXTC

  26. Runtime Overhead for Profiling Shared Memory Use
  [Bar chart: the naïve and partially-optimized schemes incur overheads of 182x, 144x, 648x, 181x, 648x, 113x, and 90x, while the fully optimized GMProf stays around 2.6x]

  27. Runtime Overhead for Profiling Device Memory Use
  [Bar chart: overheads of 83x, 197x, and 48.5x for the less-optimized schemes, versus 1.6x for the fully optimized GMProf]

  28. Case Study I: Put the Most Frequently Used Data into Shared Memory
  • bo_v1: a naïve implementation of Binomial Options in which all data arrays are stored in device memory
  • A1 ~ A4: four data arrays
  • (N): the average number of accesses to the elements of the corresponding data array

  29. Case Study I: Put the Most Frequently Used Data into Shared Memory
  • bo_v2: an improved version that places the most frequently used arrays (identified by GMProf) into shared memory
  • bo_v2 outperforms bo_v1 by a factor of 39.63
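  The slides do not show the bo_v1/bo_v2 sources; the before/after pair below is a made-up miniature that only illustrates the kind of rewrite GMProf's report points to, staging a heavily reused device memory array into shared memory (assumes n <= 256 for brevity):

    // Before (bo_v1 style): the heavily reused array A1 is read from
    // device memory on every access.
    __global__ void v1(const float *A1, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += A1[k];                 // device memory each time
        if (i < n) out[i] = acc * i;
    }

    // After (bo_v2 style): the block stages A1 into shared memory once,
    // and all further reuse is served on-chip.
    __global__ void v2(const float *A1, float *out, int n) {
        __shared__ float sA1[256];        // assumes n <= 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (int k = threadIdx.x; k < n; k += blockDim.x)
            sA1[k] = A1[k];               // one cooperative load
        __syncthreads();
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += sA1[k];                // shared memory each time
        if (i < n) out[i] = acc * i;
    }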

  30. Case Study II: Identify the True Reuse of Data
  • jcb_v1: a Jacobi implementation in which the shared memory is accessed frequently, but with little reuse of the data
  • jcb_v2: outperforms jcb_v1 by a factor of 2.59

  31. Outline
  • Motivation
  • GMProf
    • Naïve Profiling Approach
    • Optimizations
  • Evaluation
  • Conclusions

  32. Conclusions
  • GMProf
    • a statically-assisted dynamic profiling approach
    • architecture-based optimizations
    • live range analysis to capture the real usage of data
    • low-overhead & fine-grained
    • may be applied to profile other events

  Thanks!
