CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications - PowerPoint PPT Presentation

yamin
cuda np realizing nested thread level parallelism in gpgpu applications n.
Skip this Video
Loading SlideShow in 5 Seconds..
CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications PowerPoint Presentation
Download Presentation
CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

play fullscreen
1 / 33
Download Presentation
CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications
132 Views
Download Presentation

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. www.nec-labs.com CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications Yi Yang, NEC Labs Huiyang Zhou, NCSU PPoPP'2014

  2. Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014

  3. Background • Many-core architecture • Overcome the limitation of Instruction level parallelism (ILP). • Achieve high performance at lower energy • Thread level parallelism (TLP) has been the key to utilize many-core architectures • CPUs support 10+ threads • Intel Many Integrated Core (MIC) supports 200+ threads • GPGPUs support 10K+ threads • TLP is used to • Occupy a large number of cores. • Hide the off-chip memory latency. PPoPP'2014

  4. GPGPU architecture • Same-instruction multiple-data (SIMD) • A warp of threads (32 threads) executes same instruction on different data • A thread can read registers from another thread in the same warp using shfl instruction (Latest NVIDIA Kepler GPUs) • Memory coalescing • A warp of threads accesses data in a single cache line to maximize memory bandwidth • A thread block contains multiple warps • Threads in the same thread block can communicate using shared memory (software-managed on-chip cache) • Threads in the same thread block run in a SM PPoPP'2014

  5. Parallel programs to enable TLP • Parallel programming languages • OpenMP • CUDA and OpenCL • OpenACC • In order to write a correct parallel program, developers have to • Identify parallel code sections or parallel loops • Modify the code sections or loops using a specific language • In order to achieve high performance • Understand the hardware platform • None of these steps is easy PPoPP'2014

  6. How to write a parallel (CUDA) program void tmv_single_thread (float *a, float*b, float* c, int w, int h){ for (int k=0; k<w; k++) { float sum = 0; for (inti=0; i<h; i++) sum += a[i*w+k]*b[i]; c[tx] = sum; } } Transposed-matrix-vector multiplication (TMV) • Two loops in the single thread program: • Which one do you prefer to parallelize? ----------------------------------- __global__ void tmv_kernel int k = threadIdx.x+blockIdx.x*blockDim.x; ----------------------------------- Why not parallelize inner loop? PPoPP'2014

  7. Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014

  8. Why not parallelize the inner loop • Developers prefer to parallelize outer loops. • How to handle reduction or scan variables? (sum+=) • How to utilize the GPGPU features when parallelizing the nested loop? __global__ void tmv(float *a, float*b, float* c, int w, int h){ float sum = 0; inttx = threadIdx.x+blockIdx.x*blockDim.x; float sum = 0; for (inti=0; i<h; i++) sum += a[i*w+tx]*b[i]; c[tx] = sum; } Kernel code of Transposed-matrix-vector multiplication (TMV) So we can find nested parallelism in many parallel programs PPoPP'2014

  9. Impact of nested parallelism • The overall thread level parallelism is not utilized • If we parallelize the nested parallelism, we can get more TLP to make use of under-utilized resources • 10K threads per GPU = 100 threads from outer loop X 100 threads from inner loop • The workload/resource of each thread is heavy • If we parallelize the nested parallelism, we can reduce the workload/resource per thread • With limited resources, we can have more threads PPoPP'2014

  10. NVIDIA dynamic parallelism • NVIDIA dynamic parallelism: launch child kernels in a GPU thread • Memory-copy microbenchmark • We launch each child kernel using a parent thread • Each thread of child kernel copies an element • The overall data to be copied (number of child kernel * number of thread per child kernel): 64m floats • For same overall workload, increasing the number of child kernels reduces the performance. • E.G. 4K child kernel launches (16k threads per child kernel: 34GB/s • 142 GB/S without dynamic parallelism • Up to 63 GB/S with enabled dynamic parallelism PPoPP'2014

  11. Limitation of NVIDIA dynamic parallelism • Child kernel launch overhead • Communication between child kernel and parent kernel • Significant overhead as it has to go through global memory • Complicate the code development • Not good for the applications with small loop counts PPoPP'2014

  12. Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014

  13. Our solution: CUDA-NP • Developers add an OpenMP-like pragma to the parallel loop • Our compiler framework generates the optimized code leveraging nested parallelism __global__ void tmv(float *a, float*b, float* c, int w, int h){ float sum = 0; inttx = threadIdx.x+blockIdx.x*blockDim.x; #pragma np parallel for reduction(+:sum) for (inti=0; i<h; i++) sum += a[i*w+tx]*b[i]; c[tx] = sum; } Kernel code of transposed-matrix-vector multiplication (TMV) PPoPP'2014

  14. Execution diagram Sequential section • Assume each thread block of baseline has 8 threads • Optimized kernel has 8*4 threads per thread block • 4 slave threads are used to process the parallel section. Parallel section Loop section Sequential section Master threads a) Execution time of baseline Sequential section Sequential section Slave threads b) Execution time of the optimized kernel PPoPP'2014

  15. Example after CUDA-NP • Introduce threads in Y dimension as slave threads • Process parallel section using multiple slave threads • Apply reduction after parallel section • Master thread is used for executing non-parallel section __global__ void tmv_np(float *a, float*b, float* c, int w, int h){ float sum = 0; inttx = threadIdx.x+blockIdx.x*blockDim.x; intslave_id = threadIdx.y; for (inti= slave_id; i<h; i+=slave_size) sum += a[i*w+tx]*b[i]; sum = reduction(sum); if (salve_id==0) c[tx] = sum; } Kernel code of transposed-matrix-vector multiplica-tion (TMV) PPoPP'2014

  16. Slave threads organization 0 1 2 3 4 5 6 7 master thread id 8 9 10 11 12 13 14 15 slave thread id • Inter-warp nested parallelism • For a master thread, we allocate salve threads in different warps. • Master thread id 0: slave thread ids 0, 8, 16, 24. 16 17 18 19 20 21 22 23 slave thread id Sequential section 24 25 26 27 28 29 30 31 slave thread id Parallel section Sequential section Slave threads Inter-warp NP (warp size is 8) PPoPP'2014

  17. Slave threads organization 0 4 8 12 16 20 24 28 master thread id 1 5 9 13 17 21 25 29 slave thread id • Intra-warp nested parallelism • For a master thread, we allocate salve threads in same warp. • Master thread id 0: slave thread ids 0, 1, 2, 3. 2 6 10 14 18 22 26 30 slave thread id Sequential section 3 7 11 15 19 23 37 31 slave thread id Parallel section Sequential section Slave threads Intra-warp NP PPoPP'2014

  18. Variables across parallel sections • Scalar variables • Inputs/Live-Ins to Parallel Sections • Outputs/Live-Outs from Parallel Sections • Array variables • Inputs/Live-Ins to Parallel Sections • Outputs/Live-Outs from Parallel Sections PPoPP'2014

  19. Scalar variables • Inputs/Live-Ins to parallel sections • A scalar variable of master thread has to be broadcasted to its slave threads. • Intra-warp NP on Kepler • __shfl can be used to broadcast a scalar variable to its slave threads • Intra-warp NP on legacy hardware or Inter-warp NP • Shared memory • Scalar Outputs/Live-Outs from Parallel Sections • Reduction and scan variables • Intra-warp NP on Kepler • __shfl can be used • Intra-warp NP on legacy hardware or Inter-warp NP • Shared memory implementation PPoPP'2014

  20. Array structures across parallel sections • Global memory or shared memory • Visible for all slave threads • Local array (local memory or registers) • Replace local array with global memory • Replace local array with shared memory • Partition local array into small local array per slave thread PPoPP'2014

  21. Inter-Warp NP vs. Intra-Warp NP Only advantage of Intra-warp NP PPoPP'2014

  22. Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014

  23. Experimental Results • NVIDIA GTX 680 GPU • CUDA SDK 5.0 • Benchmarks • NVIDIA SDK: MarchingCubes (MC) • GPGPUSim: Libor (LIB). • Rodinia: Lud(LU), Leukocyte (LE), Streamcluster (SS), Computational Fluid Dynamics (CFD), BucketSort (BK), and Nearest Neighbor (NN) • TMV and MV PPoPP'2014

  24. Best speedup over baseline • CUDA-NP can achieve from 1.36x to 6.69x speedups • On average CUDA-NP can achieve 2.18x speedup among the ten benchmarks PPoPP'2014

  25. Intra-warp NP vs inter-warp NP • Most benchmarks prefer inter-warp NP • LU has controldivergence in the baseline • NN prefer intra-warp NP due to un-coalesced memory accesses in the baseline PPoPP'2014

  26. Number of slave threads • More TLP is not always useful • Most benchmarks prefer 4 or 8 slave threads to achieve the best performance PPoPP'2014

  27. Performance comparison for TMV • CUBLAS 5.0 is a highly optimized library by NVIDIA • For 1K input, CUDA-NP version delivers 4.9x speedup over CUBLAS • CUDA-NP doesn’t hurt performance for large input sizes PPoPP'2014

  28. Benefit of shfl instruction • __shfl instruction is very useful for MC and LU to save shared memory usage • MC and LU use shared memory intensively PPoPP'2014

  29. Conclusions • Many benchmarks have nested parallelism with small loop counts • We propose CUDA-NP as a compiler framework to support directive-based nested parallelism • Our compiler explores both intra-warp NP and inter-warp NP, and handles live variables across code sections • 2.18x speedup on average PPoPP'2014

  30. Thanks PPoPP'2014

  31. Local array replacement PPoPP'2014

  32. Comparison • NVIDIA dynamic parallelism • NN, TMV, LE, LIB, and CFD, are 28.92, 7.61, 13.45, 125.67 and 52.29 times slower than baselines, respectively. • MC, LU, MV, SS, and BK are using shared memory • Require to copy data from shared memory to global memory to utilize the NVIDIA dynamic parallelism PPoPP'2014

  33. Experimental Methodology • NVIDIA K20c • Benchmarks PPoPP'2014