
Task Management for Irregular-Parallel Workloads on the GPU

Presentation Transcript


  1. Task Management for Irregular-Parallel Workloads on the GPU Stanley Tzeng, Anjul Patney, and John D. Owens University of California, Davis

  2. Introduction – Parallelism in Graphics Hardware • Regular workloads (e.g. the vertex and fragment processors): good matching of input size to output size. • Irregular workloads (e.g. the geometry processor, recursion): hard to estimate the output given the input. [Diagram: in both cases, unprocessed work units flow from an input queue through a process stage into an output queue of processed work units.]

  3. Motivation – Programmable Pipelines • Increased programmability allows different programmable pipelines to be built on the GPU. • We want to explore how pipelines can be efficiently mapped onto the GPU. • What if your pipeline has irregular stages? • How should data between pipeline stages be stored? • What about load balancing across all parallel units? • What if your pipeline is geared more towards task parallelism than data parallelism? Our paper addresses these issues!

  4. In Other Words… • Imagine that these pipeline stages were actually bricks. • Then we are providing the mortar between the bricks. [Diagram: the pipeline stages are the bricks; our system is the mortar between them.]

  5. Related Work • Alternative pipelines on the GPU: Renderants [Zhou et al. 2009], Freepipe [Liu et al. 2009], OptiX [NVIDIA 2010]. • Distributed queuing on the GPU: GPU Dynamic Load Balancing [Cederman et al. 2008], multi-CPU work. • Reyes on the GPU: Subdivision [Patney et al. 2008], DiagSplit [Fisher et al. 2009], Micropolygon Rasterization [Fatahalian et al. 2009].

  6. Ingredients for Mortar Questions that we need to address: • What is the proper granularity for tasks? Warp Size Work Granularity. • How many threads to launch? Persistent Threads. • How to avoid global synchronizations? Uberkernels. • How to distribute tasks evenly? Task Donation.

  7. Warp Size Work Granularity • Problem: we want to emulate task-level parallelism on the GPU without a loss in efficiency. • Solution: we choose a block size of 32 threads per block. • Removes messy synchronization barriers. • We can view each block as a MIMD thread; we call these blocks processors. [Diagram: many 32-thread processors mapped onto the GPU's physical processors.]
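
As a concrete illustration of this launch configuration (the kernel name, queue layout, and processor count below are assumptions for the sketch, not the paper's code), each warp-sized block is launched once and then persists:

    // Hypothetical launch sketch: one 32-thread block acts as one "processor".
    // NUM_PROCESSORS is an assumed constant chosen to fill the GPU (roughly a
    // small multiple of the SM count), since the blocks persist for the whole run.
    #include <cuda_runtime.h>

    const int PROCESSOR_WIDTH = 32;    // one warp per block: no __syncthreads() needed
    const int NUM_PROCESSORS  = 120;   // assumption: enough blocks to occupy the device

    __global__ void uberkernel(int *queue, int *work_count)
    {
        // Body sketched on the following slides: fetch work, branch on its
        // stage, write results, repeat until the queues are empty.
    }

    int main()
    {
        int *queue, *work_count;
        cudaMalloc(&queue, 1 << 20);                // work-unit storage (placeholder size)
        cudaMalloc(&work_count, sizeof(int));
        cudaMemset(work_count, 0, sizeof(int));
        uberkernel<<<NUM_PROCESSORS, PROCESSOR_WIDTH>>>(queue, work_count);
        cudaDeviceSynchronize();                    // returns once every persistent block retires
        cudaFree(queue);
        cudaFree(work_count);
        return 0;
    }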

  8. Uberkernel Processor Utilization • Problem: we want to eliminate global kernel barriers for better processor utilization. • Uberkernels pack multiple execution routes into one kernel. [Diagram: instead of data flowing through separate Pipeline Stage 1 and Pipeline Stage 2 kernels, it flows through Stage 1 and Stage 2 branches inside a single uberkernel.]
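
A minimal sketch of the uberkernel idea, assuming a hypothetical work-unit tag and two illustrative stage routines (none of these names come from the paper):

    // Each work unit carries a tag saying which pipeline stage it needs next,
    // so one kernel launch covers both stages with no global barrier between them.
    struct WorkUnit {
        int stage;      // 0 = pipeline stage 1, 1 = pipeline stage 2
        int payload;    // placeholder for the actual work data
    };

    __device__ void run_stage1(WorkUnit &w);   // may emit new stage-2 work
    __device__ void run_stage2(WorkUnit &w);   // writes final output

    __device__ void process(WorkUnit &w)
    {
        switch (w.stage) {                     // the uberkernel branch decision
        case 0: run_stage1(w); break;
        case 1: run_stage2(w); break;
        }
    }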

  9. Persistent Thread Scheduler Emulation • Problem: if the input is irregular, how many threads do we launch? • Launch enough to fill the GPU, and keep them alive so they keep fetching work. • Life of a thread: Spawn, Fetch Data, Process Data, Write Output, Death. • Life of a persistent thread: the same, except it loops back to fetch more data instead of dying. • How do we know when to stop? When there is no more work left.
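
A sketch of the persistent-thread main loop (the counter name is an assumption; WorkUnit and process() are the slide-8 sketch): each block keeps claiming work units until the global count runs out, then retires. Write-back of newly generated work is handled by the queuing schemes on the next slides.

    // Persistent loop: thread 0 of the block claims a work unit, the warp
    // processes it, and the block only exits once the work counter is exhausted.
    __global__ void persistent_uberkernel(WorkUnit *queue, int *work_count)
    {
        __shared__ int idx;
        while (true) {
            if (threadIdx.x == 0)
                idx = atomicSub(work_count, 1) - 1;   // fetch: claim one unit
            __syncwarp();                             // block is a single warp
            if (idx < 0)
                break;                                // no more work left: retire
            process(queue[idx]);                      // uberkernel dispatch (slide 8)
            __syncwarp();                             // keep idx stable until all threads finish
        }
    }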

  10. Memory Management System • Problem: We need to ensure that our processors are constantly working and not idle. • Solution: Design a software memory management system. • How each processor fetches work is based on our queuing strategy. • We look at 4 strategies: • Block Queues • Distributed Queues • Task Stealing • Task Donation

  11. A Word About Locks • To obtain exclusive access to a queue, each queue has a lock. • Our current implementation uses spin locks, which are very slow on GPUs, so we want to use as few locks as possible. while (atomicCAS(lock, 0, 1) == 1);
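
Filled out as a minimal sketch (the helper names are ours, not the paper's): acquiring the lock spins on atomicCAS, and releasing it needs a fence so the queue updates made inside the critical section are visible before the lock is freed.

    // Spin lock on a 32-bit word in global memory (0 = free, 1 = held).
    // Only one thread per processor (e.g. threadIdx.x == 0) should call these.
    __device__ void lock_acquire(int *lock)
    {
        while (atomicCAS(lock, 0, 1) == 1)
            ;                          // spin until we swap 0 -> 1
    }

    __device__ void lock_release(int *lock)
    {
        __threadfence();               // make queue writes visible first
        atomicExch(lock, 0);           // release
    }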

  12. Block Queuing • One deque shared by all processors: work is read from one end and written back to the other. • Writing back to the queue is serial: each processor obtains the lock before writing. • Until the last element, fetching work from the queue requires no lock. • Excellent load balancing, but horrible lock contention.
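
A hedged sketch of the block-queue layout (structure and helper names are ours, and the nearly-empty race is glossed over): fetches use an atomic counter, and only the write-back end takes the slide-11 lock.

    // One queue shared by every processor: reads come off the front via an
    // atomic counter, writes go onto the back under the single shared lock.
    struct WorkUnit { int stage; int payload; };     // as after slide 8
    __device__ void lock_acquire(int *lock);         // spin-lock helpers from slide 11
    __device__ void lock_release(int *lock);

    struct BlockQueue {
        WorkUnit *items;
        int       head;     // next element to read (atomically claimed)
        int       tail;     // next free write slot (lock-protected)
        int       lock;     // spin lock from slide 11
    };

    __device__ bool bq_fetch(BlockQueue *q, WorkUnit *out)
    {
        int i = atomicAdd(&q->head, 1);    // no lock needed until the queue drains
        if (i >= q->tail) return false;    // simplified: real code must re-check near empty
        *out = q->items[i];
        return true;
    }

    __device__ void bq_write(BlockQueue *q, WorkUnit w)
    {
        lock_acquire(&q->lock);            // every writer serializes here: the contention
        q->items[q->tail++] = w;
        lock_release(&q->lock);
    }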

  13. Distributed Queuing • Each processor has its own deque (called a bin) that only it reads from and writes to. • Eliminates locks, but a processor that has finished all its work can only idle.
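
The bin layout might look like this sketch (names and capacity are assumptions): each block owns bins[blockIdx.x], and since nothing else touches it in this scheme, no atomics or locks are needed.

    // One bin per processor; only thread 0 of the owning block manipulates it.
    #define BIN_CAPACITY 1024                       // assumed per-bin size
    struct WorkUnit { int stage; int payload; };    // as after slide 8

    struct Bin {
        WorkUnit items[BIN_CAPACITY];
        int      head;      // front end (used by thieves on the next slides)
        int      tail;      // back end, owner-only
        int      lock;      // unused here; needed once stealing/donation appear
    };

    __device__ bool bin_fetch(Bin *bins, WorkUnit *out)
    {
        Bin *b = &bins[blockIdx.x];                 // my own bin, no lock required
        if (b->head == b->tail) return false;       // empty: this processor can only idle
        *out = b->items[--b->tail];                 // take from the private back end
        return true;
    }

    __device__ bool bin_write(Bin *bins, WorkUnit w)
    {
        Bin *b = &bins[blockIdx.x];
        if (b->tail == BIN_CAPACITY) return false;  // full: the motivation for donation
        b->items[b->tail++] = w;
        return true;
    }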

  14. Task Stealing • Same distributed queuing scheme, but now a processor that has finished all its work can steal work from a neighboring processor's bin. • Very low idle time, but requires big bin sizes.
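
Stealing can be sketched on top of the Bin structure above (the victim-selection policy is an assumption, and the per-bin lock from slide 11 is now required because two processors can race on one bin):

    // When my bin is empty, scan the other bins and take one unit from the
    // front end of the first non-empty victim. Reuses Bin, WorkUnit, and the
    // lock helpers from the earlier sketches.
    __device__ bool bin_steal(Bin *bins, int num_bins, WorkUnit *out)
    {
        for (int i = 1; i < num_bins; ++i) {
            Bin *victim = &bins[(blockIdx.x + i) % num_bins];
            lock_acquire(&victim->lock);
            bool got = (victim->head < victim->tail);
            if (got)
                *out = victim->items[victim->head++];   // steal from the opposite end
            lock_release(&victim->lock);
            if (got) return true;
        }
        return false;   // nothing left anywhere: time to retire
    }

In this sketch, once stealing exists the owner's own fetch and write must also take the lock; and since a processor still writes new work only into its own bin, each bin has to be sized for the worst case, which matches the "big bin sizes" drawback above.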

  15. Task Donation • When a processor's bin is full, it can donate work to another processor's bin. • Smaller memory usage, but more complicated.
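
Donation is the write-side counterpart of stealing; a sketch under the same assumptions (the target-selection policy is ours):

    // When my own bin is full, place the new work unit in someone else's bin,
    // again under that bin's lock. Reuses Bin, WorkUnit, BIN_CAPACITY, and the
    // lock helpers from the earlier sketches.
    __device__ bool bin_donate(Bin *bins, int num_bins, WorkUnit w)
    {
        for (int i = 1; i < num_bins; ++i) {
            Bin *target = &bins[(blockIdx.x + i) % num_bins];
            lock_acquire(&target->lock);
            bool placed = (target->tail < BIN_CAPACITY);
            if (placed)
                target->items[target->tail++] = w;      // donate into their bin
            lock_release(&target->lock);
            if (placed) return true;
        }
        return false;   // every bin full: caller must fall back (e.g. spill to global memory)
    }

Because overflow work can spill into other processors' bins, each bin can stay smaller (the "smaller memory usage" above), at the price of extra locking and bookkeeping ("more complicated").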

  16. Evaluating the Queues • Main measure to compare: • How many iterations the processor is idle due to lock contention or waiting for other processors to finish. • We use a synthetic work generator to precisely control the conditions.

  17. Average Idle Iterations Per Processor [Chart: average idle iterations per processor for the block queue, distributed queue, task stealing, and task donation, split into idle time from lock contention and idle time spent waiting; values range from roughly 200 up to 25000 iterations, with annotations noting that several schemes show about the same performance.]

  18. How it All Fits Together [Diagram: a thread is spawned, then fetches data from global memory, processes it, writes output back to global memory, and finally dies.]

  19. Our Version [Same diagram with our ingredients plugged in: the task-donation memory system backs Fetch Data and Write Output, uberkernels handle the processing, and persistent threads keep the Spawn-to-Death loop running.]

  20. Our Version • We spawn 32 threads per thread group / block in our grid. These are our processors.

  21. Our Version • Each processor grabs work to process.

  22. Our Version • The uberkernel decides how to process the current work unit.

  23. Our Version • Once the work is processed, thread execution returns to fetching more work.

  24. Our Version • When there is no work left in the queue, the threads retire.

  25. Application: Reyes

  26. Pipeline Overview • Subdivision / Tessellation: start with smooth surfaces, obtain micropolygons. [Pipeline diagram: Scene → Subdivision / Tessellation → Shading → Rasterization / Sampling → Composition and Filtering → Image.]

  27. Pipeline Overview • Shading: shade the micropolygons. [Same pipeline diagram.]

  28. Pipeline Overview • Rasterization / Sampling: map micropolygons to screen space. [Same pipeline diagram.]

  29. Pipeline Overview • Composition and Filtering: reconstruct pixels from the obtained samples. [Same pipeline diagram.]

  30. Pipeline Overview • Subdivision / Tessellation: irregular input and output. • Shading: regular input and output. • Rasterization / Sampling: regular input, irregular output. • Composition and Filtering: irregular input, regular output. [Same pipeline diagram, from Scene to Image.]

  31. Split and Dice • We combine the patch split and dice stages into one kernel. • Bins are loaded with the initial patches. • One processor works on one patch at a time and can write split patches back into the bins. • The output is a buffer of micropolygons.

  32. Split and Dice • 32 threads on 16 CPs: 16 threads each work in u and v. • Calculate the u and v thresholds, then take the uberkernel branch decision: • Branch 1 splits the patch again. • Branch 2 dices the patch into micropolygons. [Diagram: a patch from Bin 0 is either split (back into Bin 0) or diced (into the µpoly buffer).]

  33. Sampling • Stamp out samples for each micropolygon; 1 processor per micropolygon patch. • Since only the output is irregular, use a block queue: processors only read from the micropolygon queue and stamp out samples. • Write out to a sample buffer in global memory, reserving space with an atomic-add counter. • Q: Why are you using a block queue? Didn't you just show block queues have high contention when processors write back into them? A: True. However, we're not writing into this queue! We're just reading from it, so there is no write-back idle time.
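
The write-out step can be sketched as follows (buffer, counter, and sample layout are assumptions, not the paper's data structures): each processor atomically reserves a contiguous range in the sample buffer, so the irregular number of samples per micropolygon still packs densely into global memory.

    // Reserve space for `count` samples in one atomic operation, then write
    // them out; the counter turns an irregular output size into dense packing.
    struct Sample { float x, y, depth; };

    __device__ void write_samples(Sample *sample_buffer, int *sample_count,
                                  const Sample *local, int count)
    {
        int base = atomicAdd(sample_count, count);   // claim [base, base + count)
        for (int i = 0; i < count; ++i)
            sample_buffer[base + i] = local[i];
    }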

  34. Smooth Surfaces, High Detail • 16 samples per pixel • >15 frames per second on a GeForce GTX 280

  35. What’s Next • What other (and better) abstractions are there for programmable pipelines? • How will future GPU designs affect software schedulers? • For Reyes: what is the right model for real-time micropolygon shading on the GPU?

  36. Acknowledgments • Matt Pharr, Aaron Lefohn, and Mike Houston • Jonathan Ragan-Kelley • Shubho Sengupta • Bay Raitt and Headus Inc. for models • National Science Foundation • SciDAC • NVIDIA Graduate Fellowship

  37. Thank You

  38. Micropolygons • 1x1 pixel quads • Resolution-independent! • Defined in world space • Allow detailed shading • Similar to fragments • Unit of Reyes Rendering

  39. Not in final presentation Q: Why are you using a block queue? Didn’t you just show block queues have high contention when processors write back into it? A: True. However, we’re not writing into this queue! We’re just reading from it, so there is no writeback idle time.

  40. Not in Final Presentation

  41. Not in Final Presentation

  42. The Reyes Architecture • Introduced in the 80s • High-quality rendering, used in movies • Designed for parallel hardware • Why is this interesting? • Well-studied • Well-behaved

  43. Defining characteristic: High Detail Image courtesy: Pixar Animation Studios

  44. Results [Charts comparing task stealing and task donation.]
