
Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework

Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Laboratories America)


Presentation Transcript


  1. Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Laboratories America)

  2. Two Interesting Trends • GPU, a “big player” in High Performance Computing: excellent price-performance and performance-per-watt ratios; heterogeneous architectures (AMD Fusion APU, Intel Sandy Bridge, NVIDIA Project Denver); 3 of the top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame) • Emergence of the cloud with its “pay-as-you-go” model: cluster instances and high-speed interconnects for HPC users; Amazon and Nimbix GPU instances • A big first step, but still at an initial stage

  3. Motivation • Sharing is the basis of the cloud, and GPUs are no exception • Multiple virtual machines may share a physical node • Modern GPUs are more expensive than multi-core CPUs (Fermi cards with 6 GB memory cost about $4,000), so better resource utilization matters • Modern GPUs expose a high degree of parallelism, but applications may not utilize their full potential

  4. Related Work • Enabling GPU visibility from virtual machines: vCUDA (Shi et al.), GViM (Gupta et al.), gVirtuS (Giunta et al.), rCUDA (Duato et al.) • How to share GPUs from virtual machines? CUDA Compute Capability 2.0+ supports task parallelism, but only from a single process context

  5. Contributions • A framework for transparent GPU sharing in the cloud: no source code changes required, hence feasible in the cloud • Sharing through consolidation: a solution to the conceptual consolidation problem, a new method for computing consolidation affinity scores, two new molding methods, and an overall runtime consolidation algorithm • Extensive evaluation with 8 benchmarks on 2 GPUs: at high contention, 50% improved throughput, with small framework overheads

  6. Outline • Background • Understanding Consolidation on GPU • Framework Design • Consolidation Decision Making Layer • Experimental Results • Conclusions

  7. Outline • Background • Understanding Consolidation on GPU • Framework Design • Consolidation Decision Making Layer • Experimental Results • Conclusions

  8. BACKGROUND • GPU Architecture • CUDA Mapping and Scheduling

  9. Background [Figure: GPU with multiple streaming multiprocessors (SMs), each with its own shared memory, plus device memory] • Resource requirements < max available → interleaved execution • Resource requirements > max available → serialized execution
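A minimal CUDA sketch of this check (an illustration, not from the slides): it asks the runtime for two placeholder kernels' per-block demands and compares their sum against the limits reported for the device, assuming 256-thread blocks for both.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels standing in for two co-scheduled workloads.
__global__ void kernelA() {}
__global__ void kernelB() {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes a, b;
    cudaFuncGetAttributes(&a, kernelA);
    cudaFuncGetAttributes(&b, kernelB);

    // Assumed launch parameters for the two kernels.
    int threadsA = 256, threadsB = 256;

    // One resident block of each kernel must fit within the shared
    // memory, register, and thread limits reported by the runtime.
    bool fits =
        (a.sharedSizeBytes + b.sharedSizeBytes <= prop.sharedMemPerBlock) &&
        (a.numRegs * threadsA + b.numRegs * threadsB <= prop.regsPerBlock) &&
        (threadsA + threadsB <= prop.maxThreadsPerMultiProcessor);

    printf(fits ? "interleaved execution is possible\n"
                : "the kernels will be serialized\n");
    return 0;
}
```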

  10. Outline • Background • Understanding Consolidation on GPU • Framework Design • Consolidation Decision Making Layer • Experimental Results • Conclusions

  11. UNDERSTANDING CONSOLIDATION ON GPU • Demonstrate the potential of consolidation • Relation between utilization and performance • Preliminary experiments with consolidation

  12. GPU Utilization vs. Performance [Figure: scalability of applications] • Sub-linear scalability → good improvement • Linear scalability → no significant improvement

  13. Consolidation with Space and Time Sharing [Figure: two applications sharing SMs and their shared memory] • Space sharing helps when a single application cannot utilize all SMs effectively • Time sharing gives better performance at a large number of blocks
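As an illustration of what consolidation enables, the following sketch (assumed, not the authors' code) launches two placeholder kernels from a single process context in separate CUDA streams; sharing one context is what lets a Fermi-class GPU space- and time-share them when resources permit, and it is exactly what the framework provides for kernels coming from different VMs.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for two consolidated applications.
__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void kernelB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Separate streams within one context allow the hardware to
    // interleave the two kernels when resources permit.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<7, 256, 0, s1>>>(x, n);   // each kernel uses a modest number of blocks,
    kernelB<<<7, 256, 0, s2>>>(y, n);   // leaving SMs available for the other

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```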

  14. Outline • Background • Understanding Consolidation on GPU • Framework Design • Consolidation Decision Making Layer • Experimental Results • Conclusions

  15. FRAMEWORK DESIGN • Challenges • gVirtuS Current Design • Consolidation Framework & its Components

  16. Design Challenges • Enabling GPU sharing: need a virtual process context • Deciding when and what to consolidate: need policies and algorithms • Overheads: need a light-weight design

  17. gVirtuS Current Design [Figure: guest side with VM1 and VM2 running CUDA App1 and CUDA App2 against a frontend library; host side with the gVirtuS backend, CUDA runtime, CUDA driver, and GPUs GPU1 … GPUn] • A guest-host communication channel crosses Linux / the VMM • The backend forks one process per application • No communication between backend processes
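The split-driver idea can be sketched roughly as below; this is a hypothetical illustration, not gVirtuS's actual API, and RpcMessage and frontend_cudaMalloc are invented names. A guest-side stub serializes each CUDA call and a host-side backend replays it on the real runtime; here the in-process call to backend_execute stands in for the guest-host communication channel.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

struct RpcMessage {              // assumed wire format, for illustration only
    char routine[32];            // e.g. "cudaMalloc"
    size_t size;                 // input argument
    void* devPtr;                // output argument
    cudaError_t result;
};

// Host side: the backend owns the real CUDA context and executes the call.
void backend_execute(RpcMessage& m) {
    if (std::strcmp(m.routine, "cudaMalloc") == 0)
        m.result = cudaMalloc(&m.devPtr, m.size);
}

// Guest side: the frontend library exports the usual CUDA entry points,
// so the application requires no source code changes.
cudaError_t frontend_cudaMalloc(void** ptr, size_t size) {
    RpcMessage m{};
    std::strcpy(m.routine, "cudaMalloc");
    m.size = size;
    backend_execute(m);          // stands in for send/receive over the channel
    *ptr = m.devPtr;
    return m.result;
}

int main() {
    void* p = nullptr;
    cudaError_t rc = frontend_cudaMalloc(&p, 1 << 20);
    printf("cudaMalloc via stub returned %d, ptr = %p\n", rc, p);
    return 0;
}
```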

  18. Runtime Consolidation Framework (host side) • Workloads arrive from the frontends at the backend server, which queues them to the dispatcher • The consolidation decision maker, guided by policies and heuristics, queues workloads from the ready queue to the virtual contexts • Each virtual context thread runs a workload consolidator that issues its queue of kernels to one GPU
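A structural sketch of these components, with assumed names (Workload, VirtualContext, Dispatcher are not the authors' identifiers): the dispatcher drains the ready queue, asks a decision function where each workload belongs, and then one consolidator thread per virtual context issues its queue on its GPU.

```cpp
#include <functional>
#include <queue>
#include <thread>
#include <vector>

struct Workload {
    int blocks, threads;                 // requested execution configuration
    std::function<void()> launch;        // the intercepted kernel launch
};

struct VirtualContext {
    int gpuId;
    std::queue<Workload> workQueue;      // filled by the decision maker
    void consolidate() {                 // workload consolidator thread body
        while (!workQueue.empty()) {
            workQueue.front().launch();  // issue kernels from one shared context
            workQueue.pop();
        }
    }
};

struct Dispatcher {
    std::queue<Workload> readyQueue;      // workloads arriving from the frontends
    std::vector<VirtualContext> contexts; // one virtual context per GPU
    // decide(w) stands in for the policies/heuristics of the decision maker
    void run(const std::function<int(const Workload&)>& decide) {
        while (!readyQueue.empty()) {
            Workload w = readyQueue.front();
            readyQueue.pop();
            contexts[decide(w)].workQueue.push(w);
        }
        std::vector<std::thread> threads;
        for (auto& vc : contexts)
            threads.emplace_back([&vc] { vc.consolidate(); });
        for (auto& t : threads) t.join();
    }
};
```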

  19. Outline • Background • Understanding Consolidation on GPU • Framework Design • Consolidation Decision Making Layer • Experimental Results • Conclusions

  20. CONSOLIDATION DECISION MAKING LAYER • GPU Sharing Mechanisms & Resource Contention • Two Molding Policies • Consolidation Runtime Scheduling Algorithm

  21. Sharing Mechanisms & Resource Contention • Sharing mechanisms: consolidation by space sharing and consolidation by time sharing • Resource contention: a large number of threads within a block, and pressure on shared memory • These contention measures form the basis of the affinity score
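One plausible way to turn these measures into an affinity score is sketched below; the formula is an assumption for illustration rather than the paper's, using the Fermi limits of 1,536 threads and 48 KB of shared memory per SM.

```cpp
#include <algorithm>

struct KernelCfg {
    int blocks;             // grid size
    int threadsPerBlock;    // block size
    int shmemPerBlock;      // bytes of shared memory per block
};

// Affinity of two kernels for sharing a GPU: high when their combined
// per-SM thread count and shared-memory demand stay within the limits,
// zero when pairing them would force serialization.
double affinity(const KernelCfg& a, const KernelCfg& b,
                int maxThreadsPerSM = 1536, int shmemPerSM = 48 * 1024) {
    // Assume one resident block of each kernel per SM for the estimate.
    double threadLoad =
        double(a.threadsPerBlock + b.threadsPerBlock) / maxThreadsPerSM;
    double shmemLoad =
        double(a.shmemPerBlock + b.shmemPerBlock) / shmemPerSM;
    double pressure = std::max(threadLoad, shmemLoad);   // dominant contention
    return pressure <= 1.0 ? 1.0 - pressure : 0.0;
}
```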

  22. Molding Kernel Configuration • Perform molding dynamically, leveraging gVirtuS to intercept the kernel launch, which makes the configuration flexible to modify • Mold the configuration to reduce contention • Potential increase in application latency; however, it may still improve global throughput

  23. Two Molding Policies (configurations written as blocks × threads per block, e.g., 14×256, 14×512, 7×256, 14×128) • Time sharing with reduced threads (halve the threads per block, e.g., 256 → 128): may resolve shared memory contention • Forced space sharing (halve the number of blocks, e.g., 14 → 7): may reduce register pressure in the SM
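A minimal sketch of the two molding policies (illustrative only; it assumes the kernel was written so that it still computes correctly under a smaller configuration, e.g., each thread loops over more elements):

```cpp
#include <cstdio>

struct LaunchCfg { int blocks; int threadsPerBlock; };

// Forced space sharing: halve the block count so the kernel's blocks
// occupy roughly half of the SMs, leaving the rest to its co-runner
// and easing per-SM register pressure.
LaunchCfg moldForcedSpaceSharing(LaunchCfg c) {
    c.blocks = (c.blocks + 1) / 2;
    return c;
}

// Time sharing with reduced threads: halve the threads per block,
// which can relieve shared-memory contention when the kernel's
// shared-memory usage scales with its block size.
LaunchCfg moldReducedThreads(LaunchCfg c) {
    c.threadsPerBlock = (c.threadsPerBlock + 1) / 2;
    return c;
}

int main() {
    LaunchCfg original{14, 256};
    LaunchCfg spaceShared = moldForcedSpaceSharing(original);  // 7 x 256
    LaunchCfg timeShared  = moldReducedThreads(original);      // 14 x 128
    printf("forced space sharing: %d x %d, reduced threads: %d x %d\n",
           spaceShared.blocks, spaceShared.threadsPerBlock,
           timeShared.blocks, timeShared.threadsPerBlock);
    return 0;
}
```

Molding trades some per-kernel latency for lower contention, which is why the slide notes that global throughput can still improve.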

  24. Consolidation Scheduling Algorithm • Overall greedy-based scheduling algorithm: schedule N kernels on 2 GPUs • Input: the 3-tuple execution configuration list of all kernels • Data structure: a work queue for each virtual context • Building blocks: generate pair-wise affinity, generate affinity for a list, and get affinity by molding

  25. Consolidation Scheduling Algorithm • Create work queues for the virtual contexts • Generate pair-wise affinity over the configuration list, find the pair with minimum affinity, and split that pair into different queues • For each remaining kernel, with each work queue: (a1, a2) = generate affinity for the list; (a3, a4) = get affinity by molding • Find max(a1, a2, a3, a4) and push the kernel into the corresponding queue • Dispatch the queues to the virtual contexts
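The greedy placement can be sketched as follows; the listAffinity heuristic and the example configurations are assumptions for illustration, and the molding-based affinities (a3, a4) are omitted for brevity.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct KernelCfg { int blocks, threads, shmem; };   // 3-tuple execution configuration

// Assumed affinity of a candidate kernel with an existing queue: penalize
// the combined per-SM thread and shared-memory pressure (Fermi limits).
double listAffinity(const std::vector<KernelCfg>& queue, const KernelCfg& k) {
    int threads = k.threads, shmem = k.shmem;
    for (const auto& c : queue) { threads += c.threads; shmem += c.shmem; }
    double pressure = std::max(threads / 1536.0, shmem / (48.0 * 1024));
    return 1.0 - pressure;                           // higher = better pairing
}

int main() {
    std::vector<KernelCfg> kernels = {
        {14, 512, 16384}, {14, 256, 8192}, {7, 512, 32768}, {14, 128, 4096}};

    // Seed the two queues with the least-affine pair, so the worst
    // co-runners land on different GPUs.
    size_t pi = 0, pj = 1;
    double worst = 2.0;
    for (size_t i = 0; i < kernels.size(); ++i)
        for (size_t j = i + 1; j < kernels.size(); ++j) {
            double a = listAffinity({kernels[i]}, kernels[j]);
            if (a < worst) { worst = a; pi = i; pj = j; }
        }

    std::vector<KernelCfg> queues[2] = {{kernels[pi]}, {kernels[pj]}};
    for (size_t i = 0; i < kernels.size(); ++i) {
        if (i == pi || i == pj) continue;
        double a1 = listAffinity(queues[0], kernels[i]);
        double a2 = listAffinity(queues[1], kernels[i]);
        queues[a1 >= a2 ? 0 : 1].push_back(kernels[i]);  // greedy placement
    }

    for (int g = 0; g < 2; ++g)
        printf("virtual context %d receives %zu kernels\n", g, queues[g].size());
    return 0;
}
```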

  26. Outline • Background • Understanding Consolidation on GPU • Framework Design • Consolidation Decision Making Layer • Experimental Results • Conclusions

  27. EXPERIMENTAL RESULTS • Setup, Metric & Baselines • Benchmarks • Results

  28. Setup, Metric & Baselines • Setup: a machine with two quad-core Intel Xeon E5520 CPUs and two NVIDIA Tesla C2050 GPU cards (14 streaming multiprocessors, each containing 32 cores; 3 GB device memory; 48 KB shared memory per SM), virtualized with gVirtuS 2.0 • Evaluation metric: global throughput benefit obtained after consolidation of kernels • Baselines: serialized execution based on CUDA runtime scheduling, and blind round-robin consolidation (unaware of execution configurations)

  29. Benchmarks & Goals [Table: benchmarks and their characteristics]

  30. Benefits of Space and Time Sharing Mechanisms • No resource contention • Consolidation through the blind round-robin algorithm • Compared against serialized execution of kernels [Figures: space sharing and time sharing results]

  31. Drawbacks of Blind Scheduling • In the presence of resource contention (a large number of threads, shared memory contention), consolidation yields no benefit

  32. Effect of Molding [Figures] • Shared memory contention: addressed by time sharing with reduced threads • Contention from a large number of threads: addressed by forced space sharing

  33. Effect of Affinity Scores • Kernel Configurations • 2 kernels with 7*512 • 2 kernels with 14*256 • No affinity – Unbalanced Threads per SM • With affinity – Better Thread Balancing per SM

  34. Benefits at a High-Contention Scenario • 8 kernels on 2 GPUs, 6 out of 8 kernels molded • 31.5% improvement over blind scheduling • 50% improvement over serialized execution

  35. Framework Overheads • With consolidation: compared with manually consolidated execution, overhead is always less than 4% • Without consolidation: compared to plain gVirtuS execution, overhead is always less than 1%

  36. Outline • Background • Understanding Consolidation on GPU • Framework Design • Consolidation Decision Making Layer • Experimental Results • Conclusions

  37. Conclusions • A framework for transparent sharing of GPUs • Consolidation as the mechanism for sharing GPUs, with no source-code-level changes • New affinity and molding methods • A runtime consolidation scheduling algorithm • Significant throughput benefits at high contention • Small framework overheads

  38. Thank You for your attention! Questions? Authors' contact information: • raviv@cse.ohio-state.edu • becchim@missouri.edu • agrawal@cse.ohio-state.edu • chak@nec-labs.com

  39. Impact of Large Number of Threads

  40. Per-Application Slowdown / Choice of Molding [Figures: application slowdown; choice of molding type]
