
Task Based Execution of GPU Applications with Dynamic Data Dependencies


Presentation Transcript


  1. Task Based Execution of GPU Applications with Dynamic Data Dependencies. Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta

  2. GP-GPU Computing • GPUs enable high-throughput execution of data- and compute-intensive computations • Data is partitioned into a grid of “Thread Blocks” (TBs) • Thousands of TBs in a grid can be executed in any order • No HW support for efficient inter-TB communication • High scalability & throughput for independent data • Challenging & inefficient for inter-TB-dependent data (a minimal launch sketch follows)
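
For readers new to the CUDA execution model the slide assumes, a minimal sketch of the grid-of-TBs decomposition: each thread block independently processes a slice of the data, and the hardware is free to run the blocks in any order. The kernel and sizes here are illustrative.

```
#include <cuda_runtime.h>

// Each thread block (TB) handles one 256-element slice of the array;
// no TB depends on any other, so the hardware may schedule them freely.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)
        data[i] *= factor;                          // independent per-element work
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);    // a grid of independent TBs
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```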

  3. The Problem • Data-dependent & irregular applications: simulations (n-body, heat), graph algorithms (BFS, SSSP) • Inter-TB synchronization: only through global memory (sketched below) • Irregular task graphs: static partitioning fails • Heterogeneous execution: unbalanced distribution [figure: data dependency graph]
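
To make the inter-TB synchronization point concrete, a hedged sketch of the only mechanism available: flags in global memory plus fences. Spinning like this deadlocks unless all TBs are simultaneously resident on the SMs, which is exactly why the slide calls it challenging. Names are illustrative.

```
#include <cuda_runtime.h>

__device__ volatile int flag = 0;   // signal cell in global memory
__device__ int shared_value;

__global__ void producer_consumer() {
    if (blockIdx.x == 0) {
        if (threadIdx.x == 0) {
            shared_value = 42;       // produce data
            __threadfence();         // make the write visible device-wide
            flag = 1;                // then publish the signal
        }
    } else {
        if (threadIdx.x == 0)
            while (flag == 0) ;      // busy-wait on global memory
        __syncthreads();             // rest of the TB waits on the spinner
        // ... consume shared_value ...
    }
}
```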

  4. The Solution • “Task-based execution” • Transition from SIMD -> MIMD

  5. Challenges • Breaking applications into tasks • Task-to-SM assignment • Dependency tracking • Inter-SM communication • Load balancing

  6. Proposed Task-Based Execution Framework • Persistent worker TBs (one per SM) • Distributed task queues (one per SM) • In-GPU dependency tracking & scheduling • Load balancing via different queue-insertion policies (a per-SM worker sketch follows)
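
A hedged sketch of the persistent-worker idea, assuming one worker TB pinned to each SM and pulling from that SM's queue; the %smid special register identifies the SM a TB landed on. Task, TaskQueue, and the launch shape are illustrative assumptions, not the authors' actual API.

```
#include <cstdio>
#include <cuda_runtime.h>

// Which SM is this TB running on? (reads the PTX %smid special register)
__device__ unsigned smid() {
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

struct Task;                 // application-defined task record
struct TaskQueue {           // one instance per SM, held in global memory
    Task **slots;            // queues store pointers to tasks
    int head, tail;
};

__global__ void worker_kernel(TaskQueue *queues, volatile int *done) {
    TaskQueue *myq = &queues[smid()];   // persistent TB serves its own SM's queue
    while (!*done) {
        (void)myq;  // in the real framework: dequeue from myq, execute, resolve
    }
}

int main() {
    // Launch as many worker TBs as SMs so that all of them stay resident;
    // the block scheduler normally spreads them one per SM.
    int num_sms;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    printf("would launch %d persistent worker TBs\n", num_sms);
    // worker_kernel<<<num_sms, 256>>>(d_queues, d_done);
    return 0;
}
```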

  7. Overview • The worker cycle: (1) grab a ready task, (2) queue it, (3) retrieve & execute, (4) output, (5) resolve dependencies, (6) grab new [diagram; a runnable sketch of this loop follows]
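
A minimal, runnable sketch of this six-step cycle, collapsed to a single global ready queue and one active thread per worker TB for brevity (the framework itself uses distributed per-SM queues and whole TBs). The Task layout, names, and toy workload are assumptions.

```
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_TASKS 64

struct Task {
    int value;          // toy payload
    int num_succ;       // number of dependent (child) tasks
    int succ[2];        // indices of dependents
    int dep_count;      // unresolved-dependency counter
};

__device__ Task g_tasks[MAX_TASKS];
__device__ int  g_ready[MAX_TASKS];   // ready queue of task indices (-1 = empty)
__device__ int  g_head, g_tail;       // consumer / producer cursors
__device__ int  g_done;               // number of finished tasks

__global__ void worker_loop(int total) {
    if (threadIdx.x != 0) return;                       // one worker thread per TB
    while (atomicAdd(&g_done, 0) < total) {
        int slot = g_head;                              // (1) grab a ready task
        if (slot >= atomicAdd(&g_tail, 0)) continue;    //     queue empty, retry
        if (atomicCAS(&g_head, slot, slot + 1) != slot) continue;
        int tid;
        while ((tid = atomicAdd(&g_ready[slot], 0)) < 0) ;  // wait for publish
        Task *t = &g_tasks[tid];                        // (3) retrieve & execute
        t->value *= 2;                                  //     (toy work)
        // (4) a real application writes task output here
        for (int i = 0; i < t->num_succ; ++i) {         // (5) resolve dependencies
            int s = t->succ[i];
            if (atomicSub(&g_tasks[s].dep_count, 1) == 1) {
                int pos = atomicAdd(&g_tail, 1);        // (2)/(6) enqueue the task
                atomicExch(&g_ready[pos], s);           //     that just became ready
            }
        }
        atomicAdd(&g_done, 1);
    }
}

int main() {
    Task h[3] = {{1, 1, {1}, 0}, {1, 1, {2}, 1}, {1, 0, {}, 1}};  // chain 0->1->2
    int ready[MAX_TASKS];
    for (int i = 0; i < MAX_TASKS; ++i) ready[i] = -1;
    ready[0] = 0;                         // only task 0 starts with no dependencies
    int head = 0, tail = 1, done = 0;
    cudaMemcpyToSymbol(g_tasks, h, sizeof(h));
    cudaMemcpyToSymbol(g_ready, ready, sizeof(ready));
    cudaMemcpyToSymbol(g_head, &head, sizeof(int));
    cudaMemcpyToSymbol(g_tail, &tail, sizeof(int));
    cudaMemcpyToSymbol(g_done, &done, sizeof(int));
    worker_loop<<<4, 32>>>(3);            // four concurrent worker TBs
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(h, g_tasks, sizeof(h));
    printf("values: %d %d %d\n", h[0].value, h[1].value, h[2].value);  // 2 2 2
    return 0;
}
```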

  8. Concurrent Worker & Scheduler [diagram: worker and scheduler running concurrently inside a TB; a warp-specialization sketch follows]
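
A hedged sketch of one way a worker and a scheduler can run concurrently inside a single persistent TB: one warp does the scheduling and queue bookkeeping while the remaining warps execute task work. The paper's actual division of labor may differ; this illustrates only the warp-specialization pattern.

```
__global__ void worker_tb() {
    __shared__ int current_task;           // slot the scheduler publishes into
    int warp_id = threadIdx.x / warpSize;
    for (int step = 0; step < 1 /* until all tasks are done */; ++step) {
        if (warp_id == 0 && threadIdx.x == 0)
            current_task = 0;              // scheduler: pick the next ready task
        __syncthreads();                   // hand off to the worker warps
        if (warp_id != 0) {
            // worker warps: execute current_task's work items in parallel
        }
        __syncthreads();                   // scheduler resumes its bookkeeping
    }
}
```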

  9. Queue Access & Dependency Tracking • IQS and OQS • Efficient signaling mechanism via global memory • Queues store pointers to tasks • Parallel task-pointer retrieval • Parallel dependency check (both sketched below)
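
A device-side sketch of the pointer queues and the parallel dependency check: after a task finishes, the TB's threads decrement its successors' dependency counters in parallel (one successor per thread), and any task whose counter reaches zero is published to a queue through global memory. All names here are illustrative.

```
struct Task {
    int num_succ;
    Task **succ;        // pointers to dependent tasks
    int dep_count;      // unresolved-dependency counter
};

// Queues store pointers to tasks; publishing is a global-memory signal.
__device__ void push(Task **queue, int *tail, Task *t) {
    int pos = atomicAdd(tail, 1);   // reserve a slot
    queue[pos] = t;                 // then publish the task pointer
    __threadfence();                // make it visible to other SMs
}

// Parallel dependency check: thread i handles successor i (strided).
__device__ void resolve_parallel(Task *finished, Task **queue, int *tail) {
    for (int i = threadIdx.x; i < finished->num_succ; i += blockDim.x) {
        Task *s = finished->succ[i];
        if (atomicSub(&s->dep_count, 1) == 1)   // last dependency just resolved
            push(queue, tail, s);               // the task is now ready
    }
    __syncthreads();
}
```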

  10. Queue Insertion Policy • Round robin: better load balancing, but poor cache locality • Tail submit [J. Hoogerbrugge et al.]: the first child task is always processed on the same SM as its parent, increasing locality (both policies are sketched below)
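
The two policies can be written as a queue-selection step taken when a newly ready task is inserted; g_rr_counter and the function names below are illustrative assumptions.

```
__device__ int g_rr_counter;   // shared round-robin cursor

// Round robin: spread newly ready tasks evenly across all per-SM queues.
__device__ int pick_queue_round_robin(int num_queues) {
    return atomicAdd(&g_rr_counter, 1) % num_queues;
}

// Tail submit: the first child goes back onto the parent's own queue so it
// runs on the same SM while the parent's data is still cached; other
// children fall back to round robin for balance.
__device__ int pick_queue_tail_submit(int my_queue, int child_rank,
                                      int num_queues) {
    if (child_rank == 0)
        return my_queue;
    return pick_queue_round_robin(num_queues);
}
```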

  11. API • user_task is called by worker_kernel • Application-specific data is added under WorkerContext and Task (an API-shape sketch follows)
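
A hedged sketch of the API shape the slide describes: the framework's worker_kernel drives the task loop and calls back into a user-supplied user_task, and the application hangs its own data off WorkerContext and Task. Any field beyond those named on the slide is an assumption.

```
struct Task {
    int id;
    int dep_count;
    float *tile;            // application-specific: e.g. this task's data tile
};

struct WorkerContext {
    int sm_id;              // which SM this persistent worker TB serves
    float *global_grid;     // application-specific: e.g. shared simulation state
};

// Supplied by the application; invoked by the framework for each ready task.
__device__ void user_task(WorkerContext *ctx, Task *task) {
    // application code: process task->tile against ctx->global_grid ...
}

__global__ void worker_kernel(WorkerContext *ctxs) {
    WorkerContext *ctx = &ctxs[blockIdx.x];
    // framework loop (elided): for each ready Task *t on this SM's queue,
    //     user_task(ctx, t);
    // then resolve t's dependents.
    (void)ctx;
}
```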

  12. Experimental Results • Platform: NVIDIA Tesla C2050, 14 SMs, 3 GB memory • Applications: Heat 2D (simulation of heat dissipation over a 2D surface), BFS (breadth-first search) • Comparison: central queue vs. distributed queues

  13. Applications • Heat 2D: regular dependencies, wavefront parallelism • Each tile is a task; both intra-tile and inter-tile parallelism are exploited (see the sketch below)
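
An illustrative host-side sketch of how the Heat 2D task graph can be set up: each tile update is one task, and a tile depends on its own tile and its neighbors from the previous iteration, which produces the wavefront. The construction is an assumption consistent with the slide, not the authors' code.

```
#include <cstdio>

#define TILES 4   // a TILES x TILES grid of tile-update tasks per iteration

int main() {
    int dep_count[TILES][TILES];
    for (int r = 0; r < TILES; ++r)
        for (int c = 0; c < TILES; ++c) {
            int deps = 1;                  // own tile from the previous iteration
            if (r > 0)         ++deps;     // north neighbor
            if (r < TILES - 1) ++deps;     // south neighbor
            if (c > 0)         ++deps;     // west neighbor
            if (c < TILES - 1) ++deps;     // east neighbor
            dep_count[r][c] = deps;        // counter the runtime decrements
            printf("tile(%d,%d) waits on %d tasks\n", r, c, deps);
        }
    return 0;
}
```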

  14. Applications • BFS: irregular dependencies • The unreached neighbors of a node form a task (sketched below)
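
A device-side sketch of the BFS formulation: a task expands one frontier node, the TB's threads scan its adjacency list in parallel, and the neighbors visited for the first time form the follow-up tasks. The CSR graph layout and all names are assumptions.

```
struct BfsTask {
    int node;              // frontier node whose neighbors this task expands
};

__device__ void bfs_user_task(const int *row_ptr, const int *col_idx,
                              int *visited, int *level,
                              BfsTask *t, int cur_level) {
    int u = t->node;
    // threads of the TB scan u's adjacency list cooperatively
    for (int e = row_ptr[u] + threadIdx.x; e < row_ptr[u + 1]; e += blockDim.x) {
        int v = col_idx[e];
        if (atomicCAS(&visited[v], 0, 1) == 0) {   // first visit wins
            level[v] = cur_level + 1;
            // v's unreached neighbors will form a new task: enqueue a
            // BfsTask{v} via the framework's queue-insertion policy
        }
    }
}
```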

  15. Runtime [results chart]

  16. Scalability [results chart]

  17. Future Work • S/W support for: • Better task representation • More task-insertion policies • Automated task-graph partitioning for higher SM utilization

  18. Future Work • H/W support for: • Fast inter-TB synchronization • TB-to-SM affinity • “Sleep” support for TBs

  19. Conclusion • Transition from SIMD -> MIMD • Task-based execution model • Per-SM task assignment • In-GPU dependency tracking • Locality-aware queue management • Room for improvement with added HW and SW support
