Dynamic Task Parallelism with a GPU Work-Stealing Runtime System

Max Grossman, Rice University
Advisor: Dr. Vivek Sarkar

Background

Parallel computing is a necessity.
The GPU is a promising example of heterogeneous hardware:
• Hundreds of cores
• High memory bandwidth
NVIDIA's CUDA makes general-purpose programming on GPUs possible, but not easy.
A lot of expert tuning is still necessary for:
• Data transfer management
• Multi-GPU applications
• Dynamic task parallelism

Approach

Manage task execution on the GPU device using a hybrid work-stealing/work-sharing runtime:
• The host system places work for the GPU in a work-sharing deque
• The device consumes work from the host and balances load across SMs by maintaining one work-stealing deque per SM, while dynamically creating new tasks on the device
Hide device memory allocation and communication from the user:
• Provide a simpler runtime API for device memory management
• Use asynchronous memcpys
Manage multiple CUDA devices for the user, distributing tasks evenly across devices.
Provide the domain expert with an abstraction of CUDA:
• CnC programming model
[Diagram: CnC graph with a <tag> collection, a (cpu_step), [in_item] and [out_item] collections, and several {gpu_step} nodes]

Results

Each benchmark is tested with multiple GPUs.
Diagnostic data allows us to measure how effectively tasks are being balanced between SMs.
Device: NVIDIA Tesla C2050
• 14 multiprocessors, 2.8 GB global memory, 1.15 GHz
• CUDA 3.2
Host: 4 quad-core AMD CPUs
• 2.5 GHz
Crypt
• Demonstrates low runtime overhead
Nqueens
• Static subtree assignment shows 2x as many tasks in one subtree as another
Quicksort
• Poor pivot selection results in an unbalanced computation tree
Dijkstra Shortest Path Algorithm
• Any new tasks must be spread across all other SMs

Conclusion

A work-stealing runtime that supports dynamic task parallelism on hardware intended for data parallelism.
Showed the effectiveness of device work-stealing queues in distributing work between device SMs.
Our runtime incurs little overhead, and applications already optimized for CUDA demonstrate similar performance.
Provides a simpler runtime API to the device, though work remains to be done integrating the runtime with a true programming model.

Future Work

Investigate more advanced memory management systems (reuse device memory, use constant memory for read-only data, etc.).
Copying data back from the device took a disproportionate amount of time in the runtime compared to the hand-coded versions. We will investigate the cause.
Optimize device code by trying to limit cache-volatile operations and memory fences.
Consider using warps of threads rather than blocks of threads as the execution unit.
Use worker blocks of different dimensions and allocate tasks to optimal worker blocks based on the resources required by each kernel.
Fully integrate this runtime with the CnC-HC programming model being developed at Rice University.

Acknowledgements

Habanero Multicore Software Group. http://habanero.rice.edu
CnC-CUDA: Declarative Programming for GPUs. Max Grossman, Alina Simion, Zoran Budimlić, Vivek Sarkar. 2010 Workshop on Languages and Compilers for Parallel Computing (LCPC), October 2010.
Dynamic Task Parallelism with a GPU Work-Stealing Runtime System. Sanjay Chatterjee, Max Grossman, Vivek Sarkar. In Progress for LCPC 2011.
GPU-Quicksort: A practical Quicksort algorithm for graphics processors. Daniel Cederman and Philippas Tsigas. J. Exp. Algorithmics. December 2009.
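
The per-SM work-stealing deques named in the Approach can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the poster's runtime: it is plain sequential C++ rather than CUDA device code, each "SM" is simulated as a worker visited round-robin, and the names Task, Worker, steal and run_tree are hypothetical. Each worker pops its own newest task from the back of its deque, and an idle worker steals the oldest task from the front of another worker's deque. Tasks form a binary tree seeded on worker 0, which stands in for dynamically created device tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical task: a node in a binary computation tree.
// A task with depth > 0 spawns two children of depth - 1.
struct Task { int depth; };

// One deque per simulated SM. The owner works at the back (LIFO);
// thieves take from the front (FIFO), i.e. the oldest task.
struct Worker {
    std::deque<Task> deque;
    std::size_t executed = 0;  // tasks run by this worker
    std::size_t steals = 0;    // successful steals by this worker
};

// Scan the other workers round-robin, starting after the thief,
// and take the oldest task from the first non-empty deque.
inline bool steal(std::vector<Worker>& workers, std::size_t thief, Task& out) {
    const std::size_t n = workers.size();
    for (std::size_t k = 1; k < n; ++k) {
        Worker& victim = workers[(thief + k) % n];
        if (!victim.deque.empty()) {
            out = victim.deque.front();
            victim.deque.pop_front();
            ++workers[thief].steals;
            return true;
        }
    }
    return false;
}

// Execute a full binary tree of the given depth, seeded on worker 0.
// Workers take turns; the loop ends after a round in which nobody
// found a task, which implies every deque is empty.
inline std::vector<Worker> run_tree(std::size_t num_workers, int depth) {
    assert(num_workers > 0 && depth >= 0);
    std::vector<Worker> workers(num_workers);
    workers[0].deque.push_back(Task{depth});
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t w = 0; w < num_workers; ++w) {
            Task t{0};
            bool have = false;
            if (!workers[w].deque.empty()) {
                t = workers[w].deque.back();   // own newest task
                workers[w].deque.pop_back();
                have = true;
            } else {
                have = steal(workers, w, t);   // idle: try to steal
            }
            if (!have) continue;
            progress = true;
            ++workers[w].executed;
            if (t.depth > 0) {                 // dynamic task creation
                workers[w].deque.push_back(Task{t.depth - 1});
                workers[w].deque.push_back(Task{t.depth - 1});
            }
        }
    }
    return workers;
}
```

Calling run_tree(4, 10) executes all 2047 tasks of a depth-10 tree, and every worker ends up running some of them even though all work starts on worker 0. A real device implementation would additionally need atomic operations and memory fences on the deque ends, which this sequential model deliberately leaves out.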

