1 / 40

KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism

KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism. Izzat El Hajj (Illinois), Juan Gómez-Luna (Córdoba), Cheng Li (Illinois), Li-Wen Chang (Illinois), Dejan Milojicic (HPE), Wen- mei Hwu (Illinois). Host. Device.

mackm
Download Presentation

KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. KLAP: Kernel Launch Aggregation and Promotionfor Optimizing Dynamic Parallelism Izzat El Hajj (Illinois), Juan Gómez-Luna (Córdoba), Cheng Li (Illinois), Li-Wen Chang (Illinois), DejanMilojicic (HPE), Wen-meiHwu (Illinois)

  2. Host Device Previously, kernels could only be launched from the host (painful to program!)

  3. Kernel Warp Dynamic Parallelism Block Thread Kernel Launch Kernels threads can launch new kernels on the device without host communication

  4. Host Device Easier to write programs with dynamically discovered parallelism

  5. Problems: Many kernels incur too much launch overhead Fine-grain kernels underutilize the GPU resources

  6. Proposed Solution: Kernel Launch Aggregation

  7. Kernel Warp Block Thread Kernel Launch Each thread can launch a separate kernel

  8. Kernel Warp Block Thread Kernel Launch Each thread can launch a separate kernel

  9. Kernel Warp Block Thread Kernel Launch Warp-Granularity Kernel Launch Aggregation

  10. Kernel Warp Block Thread Kernel Launch Warp-Granularity Kernel Launch Aggregation

  11. Kernel Warp Block Thread Kernel Launch Block-Granularity Kernel Launch Aggregation

  12. Kernel Warp Block Thread Kernel Launch Block-Granularity Kernel Launch Aggregation

  13. Kernel Warp Block Thread Kernel Launch --- launch child from host after parent terminates --- Kernel-Granularity Kernel Launch Aggregation

  14. Original Kernel Warp-GranularityKernel Launch Aggregation --- launch child from host after parent terminates --- Block-GranularityKernel Launch Aggregation Kernel-GranularityKernel Launch Aggregation

  15. kernel<<<gD,bD>>>(arg1,arg2,arg3) Original Kernel Call allocate arrays for args, gD, and bD store argsin arg arrays store gDin gD array, and bDin bD array new gD = sum of gD array across warp/block new bD = max of bD array across warp/block if(threadIdx == launcher thread in warp/block) { kernel_agg<<<new gD,newbD>>> (arg arrays, gD array, bD array) } = sum( ) = max( ) Transformed Kernel Call (block-granularity aggregation example)

  16. __global__ void kernel(params) { kernel body } Original Kernel __global__ void kernel_agg(param arrays, gD array, bD array) { calculate index of parent thread load params from param arrays load actual gridDim/blockDim from gD/bD arrays calculate actual blockIdx if(threadIdx < actual blockDim) { kernel body (with kernel launches transformed and with using actual gridDim/blockDim/blockIdx) } } # # = sum( ) = max( ) 0 0 1 0 1 1 3 0 Transformed Kernel (block-granularity aggregation example)

  17. Performance of Kernel Launch Aggregation 34.6 Kepler Maxwell Increasing aggregation granularity improves performance (geomean speedup of 6.58xfor K-aggregation on Kepler)

  18. Profiling of Kernel Launch Aggregation Performance improvement comes from reduced launch overhead and better resource utilization

  19. Prologue Problems: Launch overhead on critical path Limited depth of call-stack Launch Epilogue Launch Overhead Prologue Launch Epilogue Launch Overhead Prologue Launch … Epilogue

  20. Prologue Proposed Solution: Kernel Launch Promotion Launch Epilogue Launch Overhead Prologue Launch Epilogue Launch Overhead Prologue Launch … Epilogue

  21. Prologue Prologue Prologue Prologue Prologue Prologue Launch Launch Launch Launch Launch Launch Epilogue Epilogue Epilogue Epilogue Epilogue Epilogue Launch Overhead Launch Overhead Launch Overhead Launch Overhead … …

  22. Launch Launch Overhead Prologue Launch Launch Overhead Prologue Prologue Prologue Epilogue Launch … Prologue Launch Launch Launch Epilogue Epilogue Epilogue Launch Overhead Launch Overhead Epilogue Prologue Epilogue …

  23. Launch Launch Overhead Prologue Launch Launch Overhead Prologue Prologue Prologue Epilogue Prologue Launch … Launch Launch Launch Epilogue Epilogue Epilogue Launch Overhead Launch Overhead Epilogue Prologue Epilogue Promotion (P) removes the launch overhead from the critical path …

  24. Launch Launch Overhead Prologue Prologue Prologue Prologue Launch Launch Overhead Release Launch Launch Launch Epilogue Acquire Epilogue Epilogue Epilogue Launch Overhead Launch Overhead Prologue Launch … Release Epilogue Acquire Prologue Release … Epilogue Promotion (P) removes the launch overhead from the critical path …

  25. Launch Launch Overhead Prologue Launch Launch Overhead Release Epilogue Acquire Prologue Launch … Release Epilogue Acquire Prologue Release … Epilogue Promotion (P) also enables two other optimizations: aggregation and overlap

  26. Launch Launch Launch Overhead Launch Overhead Prologue Prologue Launch Launch Launch Overhead Launch Overhead Release Release Epilogue Epilogue Acquire Acquire Prologue Prologue Launch Launch … … Release Release Epilogue Epilogue Acquire Acquire Prologue Prologue Release Release … … Epilogue Epilogue Promotion (P) also enables two other optimizations: aggregation and overlap

  27. Launch Launch Overhead Prologue Launch Launch Launch Overhead Prologue Launch Overhead Release Epilogue Acquire Prologue Launch Launch … Launch Overhead Release Epilogue Acquire Prologue Release Launch … Epilogue Acquire Prologue Release Epilogue Acquire Prologue Release … Epilogue Release … Epilogue Promotion (P) also enables two other optimizations: aggregation and overlap

  28. Launch Launch Overhead Prologue … Launch Launch Overhead Prologue Release Epilogue Acquire Prologue Launch Launch Overhead Release Epilogue Acquire Prologue Release Launch … Epilogue Acquire Prologue Release Epilogue Acquire Prologue Release … Epilogue Release … Epilogue Promotion with Aggregation (PA) reduces the number of launches, improving utilization, and increasing effective depth of the call stack Promotion (P) also enables two other optimizations: aggregation and overlap

  29. Launch Launch Overhead Prologue Launch Launch Overhead Release Epilogue Acquire Prologue Launch … Release Epilogue Acquire Prologue Release … Epilogue Promotion (P) also enables two other optimizations: aggregation and overlap

  30. Launch Launch Launch Overhead Launch Overhead Prologue Prologue Launch Launch Launch Overhead Launch Overhead Release Release Epilogue Epilogue Acquire Acquire Prologue Prologue Launch Launch … … Release Release Epilogue Epilogue Acquire Acquire Prologue Prologue Release Release … … Epilogue Epilogue Promotion (P) also enables two other optimizations: aggregation and overlap

  31. Launch Prologue (indep.) Prologue (indep.) Prologue (indep.) Launch Overhead Prologue Prologue (dep.) Prologue (dep.) Prologue (dep.) Launch Launch Launch Overhead Prologue Launch Overhead Release Epilogue Acquire Prologue Launch Launch … Launch Overhead Release Epilogue Acquire Prologue Release Launch … Epilogue Acquire Prologue Release Epilogue Acquire Prologue Release … Epilogue Release … Epilogue Promotion (P) also enables two other optimizations: aggregation and overlap

  32. Launch Prologue (indep.) Launch Overhead Prologue (dep.) Launch Launch Launch Overhead Prologue Launch Overhead Release Epilogue Acquire Prologue (indep.) Launch Launch … Launch Overhead Release Prologue (dep.) Epilogue Acquire Prologue Release Launch … Epilogue Acquire Prologue (indep.) Release Prologue (dep.) Epilogue Acquire Prologue Release … Epilogue Release … Epilogue Promotion (P) also enables two other optimizations: aggregation and overlap

  33. Launch Prologue (indep.) Launch Overhead Prologue Prologue (dep.) Launch Launch Launch Overhead Prologue Launch Overhead Prologue (indep.) Release Epilogue Launch Launch … Acquire Prologue (indep.) Launch Overhead Release Prologue (dep.) Epilogue Acquire Prologue Release Launch … Epilogue Acquire Release Prologue (dep.) Epilogue Acquire Prologue Release … Epilogue Release … Epilogue Promotion with Overlap (PO) removes independent code between parent and child from the critical path Promotion (P) also enables two other optimizations: aggregation and overlap

  34. Launch Prologue (indep.) Launch Overhead Prologue Prologue (dep.) Launch Launch Launch Overhead Prologue Launch Overhead Prologue (indep.) Release Epilogue Acquire Launch Launch Prologue (dep.) … Prologue (indep.) Launch Overhead Release Epilogue Acquire Release Acquire Epilogue Prologue Prologue (dep.) Launch … Release … Epilogue Release Epilogue Acquire Prologue Release … Epilogue Promotion with Overlap (PO) removes independent code between parent and child from the critical path Promotion (P) also enables two other optimizations: aggregation and overlap

  35. Launch Launch Overhead Prologue (indep.) … Launch Launch Overhead Prologue Prologue (dep.) Release Prologue (indep.) Acquire Release Release … Launch Epilogue Prologue (indep.) Launch Overhead Acquire Release Prologue (dep.) Epilogue Acquire Prologue Launch Release … Epilogue Acquire Prologue (dep.) Release Release … Epilogue Acquire Epilogue Prologue Release … Epilogue Promotion with Aggregation and Overlap (PAO) achieves benefits of both PA and PO Promotion (P) also enables two other optimizations: aggregation and overlap

  36. L L L LO LO LO Pi Pi P … … L LO P Pd Pd R L P P P Pi A Pi LO R R R R … L E E E A Pi LO A A P R Pd Pd L … E A Pi L L L P E E E LO LO L … R R A E A E R Pd Pd E A P R R R … … E A E E P R … E R … E Original Kernel Promotion (P) Promotion with Aggregation (PA) Promotion with Overlap (PO) Promotion with Aggregation and Overlap (PAO) …

  37. Performance of Kernel Launch Promotion Kepler Maxwell

  38. Profiling of Kernel Launch Promotion Achieved Occupancy Instructions per Second Aggregation improves occupancy due to fewer coarser-grain kernels Overlap improves instructions per second due to more work parallel work available

  39. Summary • CUDA Dynamic Parallelism suffers when many fine-grain kernels incur high launch overhead and underutilize resources • Kernel launch aggregation reduces number of launches and improves utilization • For patterns with long dependence chains, kernel launch promotion enables aggregation and overlap to reduce launches, improve utilization, extract more parallelism, and extend call stack depth

  40. Thank you!KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Paralellism IzzatEl Hajj (Illinois), Juan Gómez-Luna (Córdoba), Cheng Li (Illinois), Li-Wen Chang (Illinois),DejanMilojicic (HPE), Wen-meiHwu (Illinois) Contact: elhajj2@Illinois.edu

More Related