
ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters


Presentation Transcript


  1. ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters. Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar

  2. Context • GPUs are used in supercomputers • Some of the top500 supercomputers use GPUs • Tianhe-1A • 14,336 Xeon X5670 processors • 7,168 Nvidia Tesla M2050 GPUs • Stampede • about 6,000 nodes: • Xeon E5-2680 8C, Intel Xeon Phi • GPUs are used in cloud computing • Need for resource managers and scheduling schemes for heterogeneous clusters including many-core GPUs

  3. Categories of Scheduling Objectives • Traditional schedulers for supercomputers aim to improve system-wide metrics: throughput & latency • A market-based service world is emerging: focus on the provider's profit and the user's satisfaction • Cloud: pay-as-you-go model • Amazon: different classes of users (On-Demand, Free, Spot, …) • Recent resource managers for supercomputers (e.g. MOAB) have the notion of a service-level agreement (SLA)

  4. Motivation: State of the Art • Our goal: reconsider market-based scheduling for heterogeneous clusters including GPUs • Open-source batch schedulers are starting to support GPUs • TORQUE, SLURM • Users guide the mapping of jobs to heterogeneous nodes • Simple scheduling schemes (goals: throughput & latency) • Recent proposals describe runtime systems & virtualization frameworks for clusters with GPUs • [gViM HPCVirt '09] [vCUDA IPDPS '09] [rCUDA HPCS '10] [gVirtuS Euro-Par '10] [our HPDC'11, CCGRID'12, HPDC'12] • Simple scheduling schemes (goals: throughput & latency) • Proposals on market-based scheduling policies focus on homogeneous CPU clusters • [Irwin HPDC'04] [Sherwani Soft. Pract. Exp. '04]

  5. Considerations • The community is looking into code portability between CPU and GPU • OpenCL • PGI CUDA-x86 • MCUDA (CUDA-C), Ocelot, SWAN (CUDA-OpenCL), OpenMPC → Opportunity to flexibly schedule a job on CPU/GPU • In cloud environments, oversubscription is commonly used to reduce infrastructure costs → Use resource sharing to improve performance by maximizing hardware utilization

  6. Problem Formulation • Given a CPU-GPU cluster • Schedule a set of jobs on the cluster • To maximize the provider's profit / aggregate user satisfaction • Exploit the portability offered by OpenCL • Flexibly map each job onto either the CPU or the GPU • Maximize resource utilization • Allow sharing of a multi-core CPU or GPU • Assumptions/Limitations: • 1 multi-core CPU and 1 GPU per node • Single-node, single-GPU jobs • Only space-sharing, limited to two jobs per resource

  7. Market-Based Scheduling Formulation: Value Function • For each job, a Linear-Decay Value Function [Irwin HPDC'04] • Max Value → importance/priority of the job • Decay → urgency of the job • Delay due to: queuing, execution on a non-optimal resource, resource sharing • Yield = maxValue - decay * delay • [Figure: yield/value vs. execution time T, starting at Max Value and decreasing at the decay rate]
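To make the value function concrete, here is a minimal sketch of the linear-decay yield computation. The Job fields and the example numbers are illustrative assumptions, not taken from the ValuePack implementation.

```python
# Minimal sketch of the linear-decay value function (illustrative names only).
from dataclasses import dataclass

@dataclass
class Job:
    max_value: float        # importance/priority of the job
    decay: float            # decay rate (urgency): value lost per unit of delay
    optimal_walltime: float # walltime on the job's optimal resource

def yield_value(job: Job, delay: float) -> float:
    """Yield = maxValue - decay * delay.

    'delay' accumulates queuing time plus slowdown from running on a
    non-optimal resource and from resource sharing.
    """
    return job.max_value - job.decay * delay

# Example: a job worth 100 that loses 2 units of value per second of delay.
job = Job(max_value=100.0, decay=2.0, optimal_walltime=50.0)
print(yield_value(job, delay=10.0))  # 80.0
```

Note that once the accumulated delay exceeds maxValue / decay the yield goes negative, which is the condition the sharing heuristic on slide 14 watches for.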

  8. Overall Scheduling Approach: Scheduling Flow • Jobs arrive in batches • Phase 1 (Mapping): each job is enqueued on its optimal resource (CPU queue or GPU queue); this phase is oblivious of other jobs and is based on the optimal walltime • Phase 2 (Sorting): inter-job scheduling considerations; jobs in each queue are sorted to improve yield • Phase 3 (Re-mapping): different schemes decide when to remap and what to remap • Jobs then execute on the CPU or the GPU
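A hedged skeleton of this three-phase flow for one batch of jobs is sketched below. The dictionary fields, the shorter-job-first stand-in for the reward-based sort, and the pluggable remap callback are assumptions for illustration, not ValuePack's actual interfaces.

```python
# Illustrative skeleton of the three-phase scheduling flow for one batch of jobs.
def schedule_batch(jobs, cpu_queue, gpu_queue, remap):
    # Phase 1: mapping -- enqueue each job on its optimal resource,
    # based only on its user-supplied walltimes (oblivious of other jobs).
    for job in jobs:
        (cpu_queue if job["cpu_walltime"] <= job["gpu_walltime"] else gpu_queue).append(job)

    # Phase 2: sorting -- reorder each pending queue to improve aggregate yield;
    # here we simply favour shorter (lower-risk) jobs as a stand-in for the
    # reward-based ordering described on the next slides.
    cpu_queue.sort(key=lambda j: j["cpu_walltime"])
    gpu_queue.sort(key=lambda j: j["gpu_walltime"])

    # Phase 3: re-mapping -- a pluggable policy decides when and what to move
    # between the two queues (uncoordinated or coordinated schemes).
    remap(cpu_queue, gpu_queue)

# Example with a no-op re-mapping policy:
cpu_q, gpu_q = [], []
jobs = [{"cpu_walltime": 40, "gpu_walltime": 10},
        {"cpu_walltime": 5, "gpu_walltime": 20}]
schedule_batch(jobs, cpu_q, gpu_q, remap=lambda c, g: None)
```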

  9. Phase 1: Mapping • Users provide walltimes on CPU and GPU • The walltimes are used as indicators of the optimal/non-optimal resource • Each job is mapped onto its optimal resource • NOTE: in our experiments we assumed maxValue = optimal walltime

  10. Phase 2: Sorting • Sort jobs based on Reward [Irwin HPDC'04] • Present Value = f(maxValue_i, discount_rate) • Value after discounting the risk of running a job • The shorter the job, the lower the risk • Opportunity Cost • Degradation in value due to the selection of one among several alternatives
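The sketch below shows one way the reward-based ordering could look. The geometric discounting form and the precomputed opportunity_cost field are assumptions chosen for illustration; the actual formulation follows [Irwin HPDC'04] and may differ from this.

```python
# Hedged sketch of reward-based sorting (illustrative formulas, not ValuePack's exact ones).
def present_value(max_value, runtime, discount_rate):
    # Shorter jobs are discounted less: there is less risk of losing their value.
    return max_value / ((1.0 + discount_rate) ** runtime)

def reward(job, discount_rate=0.01):
    # Reward = present value minus the opportunity cost of occupying the
    # resource instead of running the best alternative job on it.
    pv = present_value(job["max_value"], job["walltime"], discount_rate)
    return pv - job.get("opportunity_cost", 0.0)

def sort_queue(queue, discount_rate=0.01):
    # Jobs with higher reward run first.
    queue.sort(key=lambda j: reward(j, discount_rate), reverse=True)
```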

  11. Phase 3: Remapping • When to remap: • Uncoordinated schemes: a queue is empty and its resource is idle • Coordinated scheme: the CPU and GPU queues are imbalanced • What to remap: • Which job will have the best reward on the non-optimal resource? • Which job will suffer the least reward penalty?

  12. Phase 3: Uncoordinated Schemes • Last Optimal Reward (LOR) • Remap the job with the least reward on its optimal resource • Idea: least reward → least risk in moving • First Non-Optimal Reward (FNOR) • Compute the reward each job could produce on the non-optimal resource • Remap the job with the highest reward on the non-optimal resource • Idea: consider the non-optimal penalty • Last Non-Optimal Reward Penalty (LNORP) • Remap the job with the least reward degradation: RewardDegradation_i = OptimalReward_i - NonOptimalReward_i
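As a minimal sketch, the three selection rules might look like the following. The job fields (optimal_reward, non_optimal_reward) are assumed to have been precomputed and are illustrative names, not ValuePack's.

```python
# Illustrative selection rules for the three uncoordinated re-mapping schemes.
def pick_lor(queue):
    # Last Optimal Reward: move the job with the least reward on its
    # optimal resource (least value at risk).
    return min(queue, key=lambda j: j["optimal_reward"])

def pick_fnor(queue):
    # First Non-Optimal Reward: move the job that would still produce the
    # highest reward on the non-optimal resource.
    return max(queue, key=lambda j: j["non_optimal_reward"])

def pick_lnorp(queue):
    # Last Non-Optimal Reward Penalty: move the job with the smallest
    # degradation = optimal_reward - non_optimal_reward.
    return min(queue, key=lambda j: j["optimal_reward"] - j["non_optimal_reward"])
```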

  13. Phase 3: Coordinated Scheme • Coordinated Least Penalty (CORLP) • When to remap: imbalance between the queues • The imbalance is affected by the decay rates and execution times of the queued jobs • Total Queuing-Delay Decay-Rate Product (TQDP) • Remap if |TQDP_CPU - TQDP_GPU| > threshold • What to remap: the job with the least penalty degradation
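A minimal sketch of the imbalance test follows. How TQDP aggregates per-job queuing delay is an assumption here (each job's queuing delay is approximated by the walltimes of the jobs ahead of it in the queue), so treat it as illustrative rather than the exact ValuePack formula.

```python
# Hedged sketch of the CORLP trigger: remap when the queues' total
# queuing-delay x decay-rate products diverge by more than a threshold.
def tqdp(queue, walltime_key):
    total, ahead = 0.0, 0.0
    for job in queue:
        total += ahead * job["decay"]   # estimated queuing delay x decay rate
        ahead += job[walltime_key]      # work queued ahead of the next job
    return total

def should_remap(cpu_queue, gpu_queue, threshold):
    imbalance = abs(tqdp(cpu_queue, "cpu_walltime") - tqdp(gpu_queue, "gpu_walltime"))
    return imbalance > threshold
```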

  14. Resource Sharing Heuristic • Limitation: two jobs can space-share a CPU/GPU • Factors affecting sharing: (-) slowdown incurred by jobs using half of a resource; (+) more resources available for other jobs • Jobs are categorized as low, medium, or high scaling (based on models/profiling) • When to enable sharing: a large fraction of jobs in the pending queues has negative yield • What jobs share a resource: Scalability-DecayRate factor • Jobs are grouped based on scalability • Within each group, jobs are ordered by decay rate (urgency) • Pick the top K fraction of jobs; 'K' is tunable (low scalability, low decay first)
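Here is a hedged sketch of the candidate-selection step. The scaling_class and decay field names, and the exact ordering (low-scalability groups first, least-urgent jobs first within a group), are assumptions based on the slide's description rather than ValuePack's code.

```python
# Hedged sketch of the Scalability-DecayRate sharing heuristic:
# prefer jobs that lose little from half a resource and are not urgent.
def pick_sharing_candidates(pending_jobs, k_fraction):
    # Group jobs by how well they scale when given only half a resource.
    groups = {"low": [], "medium": [], "high": []}
    for job in pending_jobs:
        groups[job["scaling_class"]].append(job)

    ordered = []
    for cls in ("low", "medium", "high"):                         # low-scaling jobs first
        ordered += sorted(groups[cls], key=lambda j: j["decay"])  # least urgent first

    k = int(len(ordered) * k_fraction)   # 'K' is a tunable fraction
    return ordered[:k]
```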

  15-19. Overall System Prototype (diagram, built up incrementally over slides 15-19)
  • Master Node (centralized decision making): Submission Queue → Pending Queues (CPU, GPU) → Cluster-Level Scheduler with Scheduling Schemes & Policies → Execution Queues (CPU, GPU) → Finished Queues (CPU, GPU); TCP Communicator to the compute nodes
  • Compute Nodes (execution & sharing mechanisms), each with a multi-core CPU and a GPU: Node-Level Runtime, TCP Communicator, CPU Execution Processes (OS-based scheduling & sharing), GPU Execution Processes (GPU Consolidation Framework)
  • Assumption: shared file system

  20-21. GPU Sharing Framework: GPU-related Node-Level Runtime (diagram, built up over slides 20-21)
  • Front-End: GPU execution processes (CUDA app1 … CUDA appN), each intercepted by a CUDA Interception Library; CUDA calls are forwarded over a Front End – Back End communication channel
  • Back-End (GPU Consolidation Framework): a Back-End Server with a Virtual Context and a Workload Consolidator manipulates kernel configurations to allow GPU space sharing and issues work to the CUDA Runtime/Driver over multiple CUDA streams (stream1 … streamN) on the GPU
  • Simplified version of our HPDC'11 runtime

  22. Experimental Setup • 16-node cluster • CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory • GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory • 256-job workload • 10 benchmark programs • 3 configurations: small, large, very large datasets • Various application domains: scientific computations, financial analysis, data mining, machine learning • Baselines • TORQUE (always optimal resource) • Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]

  23. Comparison with Torque-Based Metrics: Throughput & Latency • [Charts: completion time (10-20% better than baselines) and average latency (~20% better)] • Baselines suffer from idle resources • By privileging shorter jobs, our schemes reduce queuing delays

  24. Results with the Average Yield Metric: Effect of Job Mix • [Charts: yield for Skewed-GPU, Skewed-CPU, and Uniform job mixes; up to 8.8x and up to 2.3x better than baselines] • Better on skewed job mixes: more idle time under the baseline schemes, hence more room for dynamic mapping

  25. Results with the Average Yield Metric: Effect of the Value Function • [Charts: up to 6.9x and up to 3.8x better yield than baselines] • Shows the adaptability of our schemes to different value functions

  26. Results with the Average Yield Metric: Effect of System Load • [Chart: yield vs. system load; up to 8.2x better than baselines] • As load increases, the yield of the baselines decreases linearly • The proposed schemes initially increase yield and then sustain it

  27. Yield Improvements from Sharing: Effect of Sharing • [Chart: yield vs. fraction of jobs allowed to share; up to 23x improvement] • Careful space sharing can help performance by freeing resources • Excessive sharing can be detrimental to performance

  28. Summary / Conclusion • Value-based scheduling on CPU-GPU clusters • Goal: improve the aggregate yield • Coordinated and uncoordinated scheduling schemes for dynamic mapping • Automatic space sharing of resources based on heuristics • Prototype framework for evaluating the proposed schemes • Improvements over the state of the art, both in completion time & latency and in average yield
