
Dynamic Task Parallelism with a GPU Work-Stealing Runtime System






Presentation Transcript


  1. Dynamic Task Parallelism with a GPU Work-Stealing Runtime System Max Grossman Advisor: Dr. Vivek Sarkar Rice University

  2. Background • The GPU is a promising example of heterogeneous hardware • Hundreds of simultaneous threads • High memory bandwidth • NVIDIA’s CUDA makes general-purpose programming on GPUs possible, but not easy for the average programmer

  3. [Figure: block diagrams contrasting a single CPU core (large control logic and cache, few ALUs, DRAM) with a GPU (many streaming multiprocessors, each with many small ALUs, DRAM).] CPUs and GPUs have fundamentally different design philosophies: a single complex CPU core vs. many simple GPU processors. • Figure source: David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.

  4. Motivation & Approach • CUDA programming model launches a batch of data-parallel threads • Can we do better with dynamic task parallelism? • Our approach • Manage task execution across multiple streaming multiprocessors (SMs) in a GPU device by introducing a hybrid work-stealing/work-sharing runtime system • Manage multiple CUDA devices for the user • Hide device memory allocation and communication from user

  5. Load Balance Results • NQueens(14) • Worst-case load imbalance is 9.8x for static subtree assignment vs. 1.17x for dynamic work-stealing

  6. Performance Results: NQueens

  7. Performance Results: Crypt

  8. Conclusions • A GPU work-stealing runtime that supports dynamic task parallelism on hardware intended for data parallelism • Showed the effectiveness of work-stealing queues in dynamically distributing work between SMs • Future work: • Fully integrate this runtime with the CnC-HC data flow coordination language being developed at Rice University
