
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Wei Jiang (The Ohio State University), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Research Laboratories)




Presentation Transcript


  1. Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Wei Jiang (The Ohio State University), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Research Laboratories)

  2. Rise of Heterogeneous Architectures
  • In today's high performance computing, multi-core CPUs and many-core GPUs are mainstream
  • Many-core GPUs offer excellent "price-performance" & "performance-per-watt"
  • Flavors of heterogeneous computing: multi-core CPUs + (GPUs/MICs) connected over PCI-E; integrated CPU-GPUs like AMD Fusion, Intel Sandy Bridge
  • Such heterogeneous platforms exist in 3 of the top 5 supercomputers, in large clusters in academia and industry, and at many cloud providers: Amazon, Nimbix, SoftLayer, ...

  3. Motivation
  • Revisit scheduling problems for CPU-GPU clusters: exploit the portability offered by models like OpenCL, map jobs to resources automatically, and support desirable advanced scheduling considerations
  • Supercomputers and cloud environments are typically "shared": accelerate a set of applications, as opposed to a single application
  • Software stack to program CPU-GPU architectures: a combination of (Pthreads/OpenMP, ...) + (CUDA/Stream); now OpenCL is becoming more popular
  • OpenCL, a device-agnostic platform, offers great flexibility with portable solutions: write a kernel once, execute on any device
  • Today's schedulers (like TORQUE) for heterogeneous clusters DO NOT exploit the portability offered by OpenCL: they rely on user-guided mapping of jobs to resources and do not consider desirable scheduling possibilities (using CPU+GPU)

  4. Outline • Problem Formulation • Challenges and Solution Approach • Scheduling of Single-Node, Single-Resource Jobs • Scheduling of Multi-node, Multi-Resource Jobs • Experimental Results • Conclusions

  5. Outline • Problem Formulation • Challenges and Solution Approach • Scheduling of Single-Node, Single-Resource Jobs • Scheduling of Multi-node, Multi-Resource Jobs • Experimental Results • Conclusions

  6. Problem Formulations
  Problem Goal:
  • Accelerate a set of applications on a CPU-GPU cluster
  • Each node has two resources: a multi-core CPU and a GPU
  • Map applications to resources to maximize overall system throughput and minimize application latency
  Scheduling Formulations:
  1) Single-Node, Single-Resource Allocation & Scheduling
  2) Multi-Node, Multi-Resource Allocation & Scheduling

  7. Scheduling Formulations
  Single-Node, Single-Resource Allocation & Scheduling:
  • Allocates a multi-core CPU or a GPU from a node in the cluster
  • Benchmarks like Rodinia (UVa) & Parboil (UIUC) contain 1-node apps
  • Limited mechanisms to exploit CPU+GPU simultaneously
  • Exploits the portability offered by the OpenCL programming model
  Multi-Node, Multi-Resource Allocation & Scheduling:
  • In addition, allows CPU+GPU allocation; desirable in the future to allow flexibility in accelerating applications
  • In addition, allows multiple-node allocation per job; MATE-CG [IPDPS’12], a framework for the Map-Reduce class of apps, allows such implementations

  8. Outline • Problem Formulation • Challenges and Solution Approach • Scheduling of Single-Node, Single-Resource Jobs • Scheduling of Multi-node, Multi-Resource Jobs • Experimental Results • Conclusions

  9. Challenges and Solution Approach
  Decision-Making Challenges:
  • Allocate/map to CPU-only, GPU-only, or CPU+GPU?
  • Wait for the optimal resource (involves queuing delay) or assign to a non-optimal resource (involves a penalty)?
  • Always allocating CPU+GPU may hurt global throughput; other possibilities like CPU-only or GPU-only should be considered
  • Always allocate the requested number of nodes? Doing so may increase wait time; allocating fewer nodes can be considered
  Solution Approach:
  • Take different levels of user input (relative speedups, execution times, ...)
  • Design scheduling schemes for each scheduling formulation

  10. Outline • Problem Formulation • Challenges and Solution Approach • Scheduling of Single-Node, Single-Resource Jobs • Scheduling of Multi-node, Multi-Resource Jobs • Experimental Results • Conclusions

  11. Scheduling Schemes for the First Formulation
  Two Input Categories & Three Schemes (categories are based on the amount of input expected from the user):
  • Category 1: relative multi-core (MP) and GPU (GP) performance as input
  — Scheme 1: Relative Speedup based w/ Aggressive Option (RSA)
  — Scheme 2: Relative Speedup based w/ Conservative Option (RSC)
  • Category 2: additionally, the sequential CPU execution time (SQ)
  — Scheme 3: Adaptive Shortest Job First (ASJF)

  12. Relative-Speedup Aggressive (RSA) or Conservative (RSC)
  Takes the multi-core and GPU speedups as input.
  • Input: N jobs, MP[n], GP[n]
  • Create CPU and GPU job queues (CJQ, GJQ); enqueue each job in the queue of its optimal resource, based on GP − MP
  • Sort CJQ and GJQ in descending order
  • R = GetNextResourceAvailable()
  • If R is a GPU and GJQ is non-empty, assign the top of GJQ to R (and symmetrically for a CPU)
  • If GJQ is empty: the Aggressive option assigns the bottom job of CJQ to R, minimizing the non-optimal penalty; the Conservative option waits for a CPU
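The RSA/RSC flow above can be sketched in code. This is a minimal simulation, not the authors' implementation: the `next_resource`, `assign`, and `wait` callbacks and the queue representation are assumptions made for illustration.

```python
from collections import deque

def schedule_rs(jobs, mp, gp, aggressive, next_resource, assign, wait):
    """Relative-Speedup scheduling: RSA if aggressive, RSC otherwise.

    mp[j] / gp[j] are job j's relative speedups on the multi-core CPU and
    the GPU. next_resource() yields 'cpu' or 'gpu' as resources free up,
    assign(job, r) dispatches a job, and wait(r) blocks until the optimal
    resource frees up (all three are assumed callbacks).
    """
    # Map each job to the queue of its optimal resource, sorted in
    # descending order of how strongly it prefers that resource.
    cjq = deque(sorted((j for j in jobs if mp[j] >= gp[j]),
                       key=lambda j: mp[j] - gp[j], reverse=True))
    gjq = deque(sorted((j for j in jobs if gp[j] > mp[j]),
                       key=lambda j: gp[j] - mp[j], reverse=True))

    while cjq or gjq:
        r = next_resource()
        mine, other = (gjq, cjq) if r == 'gpu' else (cjq, gjq)
        if mine:
            assign(mine.popleft(), r)   # optimal assignment
        elif aggressive and other:
            # Aggressive: the bottom job of the other queue prefers its
            # own resource least, so it has the least non-optimal penalty.
            assign(other.pop(), r)
        else:
            wait(r)                     # Conservative: leave resource idle
```

In a real scheduler `wait` would block and `next_resource` would come from the resource manager; here they are stubs so the policy logic stays visible.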

  13. Adaptive Shortest Job First (ASJF)
  • Input: N jobs, MP[n], GP[n], SQ[n]
  • Create CJQ and GJQ; enqueue each job in the queue of its optimal resource (GP − MP)
  • Sort CJQ and GJQ in ascending order of SQ, minimizing latency for short jobs
  • R = GetNextResourceAvailable()
  • If R is a GPU and GJQ is non-empty, assign the top of GJQ to R
  • If GJQ is empty, switch automatically between the aggressive and conservative options: compute T1 = GetMinWaitTimeForNextCPU() and T2k = GetJobWithMinPenOnGPU(CJQ); if T1 > T2k, assign CJQ job k to R, else wait for a CPU to become free or for GPU jobs
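ASJF admits a similar sketch, again with assumed callbacks. In particular, estimating the GPU penalty of a CPU-optimal job as `sq[j]/gp[j] - sq[j]/mp[j]` is one plausible reading of GetJobWithMinPenOnGPU, not something the slides spell out.

```python
def schedule_asjf(jobs, mp, gp, sq, next_resource, assign,
                  min_cpu_wait, wait):
    """Adaptive Shortest Job First (sketch).

    mp/gp: relative multi-core and GPU speedups; sq[j]: sequential CPU
    time of job j. min_cpu_wait() estimates the wait until the next CPU
    frees up (assumed API); assign/wait are as in the RSA/RSC sketch.
    """
    # Queues of jobs mapped to their optimal resource, shortest job first.
    cjq = sorted((j for j in jobs if mp[j] >= gp[j]), key=lambda j: sq[j])
    gjq = sorted((j for j in jobs if gp[j] > mp[j]), key=lambda j: sq[j])

    def gpu_penalty(j):
        # Extra time incurred by running CPU-optimal job j on the GPU.
        return sq[j] / gp[j] - sq[j] / mp[j]

    while cjq or gjq:
        r = next_resource()
        if r == 'cpu':
            if cjq:
                assign(cjq.pop(0), r)
            else:
                wait(r)
        elif gjq:
            assign(gjq.pop(0), r)
        else:
            # Automatic aggressive/conservative switch: run the
            # least-penalty CPU job on the idle GPU only if waiting for
            # a CPU would cost more than the penalty (T1 > T2k).
            t1 = min_cpu_wait()
            k = min(range(len(cjq)), key=lambda i: gpu_penalty(cjq[i]))
            if t1 > gpu_penalty(cjq[k]):
                assign(cjq.pop(k), r)
            else:
                wait(r)
```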

  14. Outline • Problem Formulation • Challenges and Solution Approach • Scheduling of Single-Node, Single-Resource Jobs • Scheduling of Multi-node, Multi-Resource Jobs • Experimental Results • Conclusions

  15. Scheduling Scheme for the Second Formulation
  Solution Approach:
  • Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
  • Mold the number of nodes requested by a job: consider allocating 1/2 or 1/4 of the requested nodes
  Inputs from the User:
  • Execution times of the CPU-only, GPU-only, and CPU+GPU versions
  • Execution times of jobs with n, n/2, and n/4 nodes
  • Such application information can also be obtained from profiles

  16. Flexible Moldable Scheduling Scheme (FMS)
  • Input: N jobs with their execution times
  • Group jobs with the number of requested nodes as the index (minimizes resource fragmentation)
  • Sort each group by the execution time of the CPU+GPU version (helps co-locate a CPU job and a GPU job on the same node; gives a global view for co-location)
  • Pick a pair of jobs to schedule in sorted order
  • For each job, find the fastest completion option among T(i,n,C), T(i,n,G), T(i,n,CG)
  • If one job chooses C and the other G, co-locate the jobs on the same set of nodes
  • If both jobs choose the same resource, i.e. (C,C), (G,G), or (CG,CG): if 2N nodes are available, schedule the pair in parallel on 2N nodes; otherwise schedule the first job on N nodes and consider molding the number of nodes for the next job
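The FMS steps above might be sketched as follows. The `times` profile table, the mold-to-half rule, and the returned (job, resource, nodes) layout are all assumptions for illustration, not the paper's implementation.

```python
def schedule_fms(jobs, times, nodes_req, free_nodes):
    """Flexible Moldable Scheduling (sketch).

    times[(j, n, r)]: execution time of job j on n nodes with resource
    r in {'C', 'G', 'CG'} (assumed profile data); nodes_req[j]: requested
    node count; free_nodes(): currently idle nodes (assumed callback).
    """
    # Group jobs by requested node count to minimize fragmentation.
    groups = {}
    for j in jobs:
        groups.setdefault(nodes_req[j], []).append(j)

    schedule = []
    for n, grp in sorted(groups.items()):
        # Sort each group by the CPU+GPU execution time.
        grp.sort(key=lambda j: times[(j, n, 'CG')])
        it = iter(grp)
        for a, b in zip(it, it):  # pick pairs of jobs in sorted order
            def best(j):
                # Fastest option among T(j,n,C), T(j,n,G), T(j,n,CG).
                return min('C', 'G', 'CG', key=lambda r: times[(j, n, r)])
            ra, rb = best(a), best(b)
            if {ra, rb} == {'C', 'G'}:
                # One CPU job and one GPU job: co-locate on the same
                # n nodes so both resources of each node are busy.
                schedule.append(((a, ra), (b, rb), n))
            elif free_nodes() >= 2 * n:
                # Same resource choice: run side by side on 2n nodes.
                schedule.append(((a, ra), (b, rb), 2 * n))
            else:
                # Not enough nodes: run the first job now and mold the
                # second to half the nodes instead of making it wait.
                schedule.append(((a, ra), None, n))
                schedule.append(((b, rb), None, max(1, n // 2)))
        # (An odd leftover job in a group is omitted for brevity.)
    return schedule
```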

  17. Outline • Problem Formulation • Challenges and Solution Approach • Scheduling of Single-Node, Single-Resource Jobs • Scheduling of Multi-node, Multi-Resource Jobs • Experimental Results • Conclusions

  18. Cluster Hardware Setup
  • Cluster of 16 CPU-GPU nodes
  • Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
  • Each GPU is an NVIDIA Tesla C2050 (1.15 GHz)
  • CPU main memory: 48 GB; GPU device memory: 3 GB
  • Machines are connected through InfiniBand

  19. Benchmarks
  Single-Node Jobs:
  • We use 10 benchmarks: scientific, financial, data mining, and image processing applications
  • Each benchmark runs with 3 different execution configurations
  • Overall, a pool of 30 jobs
  Multi-Node Jobs:
  • We use 3 applications: Gridding Kernel, Expectation-Maximization, PageRank
  • Applications run with 2 different datasets and on 3 different node counts
  • Overall, a pool of 18 jobs

  20. Baselines & Metrics
  Baselines for Single-Node Jobs:
  • Blind Round Robin (BRR)
  • Manual Optimal (exhaustive search; an upper bound)
  Baselines for Multi-Node Jobs:
  • TORQUE, a widely used resource manager for heterogeneous clusters
  • Minimum Completion Time (MCT) [Maheswaran et al., HCW’99]
  Metrics:
  • Completion Time (Comp. Time)
  • Application latency: Non-optimal Assignment (Ave. NOA Lat.) and Queuing Delay (Ave. QD Lat.)
  • Maximum Idle Time (Max. Idle Time)

  21. Single-Node Job Results
  Uniform CPU-GPU Job Mix (24 jobs on 2 nodes, 4 different metrics):
  • Proposed schemes are 108% better than BRR and within 12% of Manual Optimal
  • Tradeoff between the non-optimal penalty and the wait time for a resource
  • BRR has the highest latency; RSA incurs non-optimal penalties; RSC incurs high queuing delay
  • ASJF is as good as Manual Optimal
  CPU-biased Job Mix:
  • BRR has very high idle times; RSC's can be very high too
  • RSA has the best utilization among the proposed schemes

  22. Multi-Node Job Results
  Varying Job Execution Lengths (32 jobs on 16 nodes; Short Job (SJ), Long Job (LJ)):
  • FMS is 42% better than the best of TORQUE and MCT
  • Each type of molding gives a reasonable improvement
  • Our schemes utilize the resources better → higher throughput
  • Intelligent in deciding whether to wait for a resource or mold the job for a smaller one
  Varying Resource Request Size (Small Request (SJ), Large Request (LJ)):
  • FMS is 32% better than the best of TORQUE and MCT
  • The benefit from ResType molding is greater than from NumNodes molding

  23. Outline • Problem Formulation • Challenges and Solution Approach • Scheduling of Single-Node, Single-Resource Jobs • Scheduling of Multi-node, Multi-Resource Jobs • Experimental Results • Conclusions

  24. Conclusions
  • Revisited scheduling problems on CPU-GPU clusters with the goal of improving aggregate throughput: the single-node, single-resource and the multi-node, multi-resource scheduling problems
  • Developed novel scheduling schemes that exploit the portability offered by OpenCL and automatically map jobs to heterogeneous resources: RSA, RSC, and ASJF for single-node jobs; Flexible Moldable Scheduling (FMS) for multi-node jobs
  • Significant improvement over the state of the art

  25. Thank You! Questions? raviv@cse.ohio-state.edu becchim@missouri.edu jiangwei@cse.ohio-state.edu agrawal@cse.ohio-state.edu chak@nec-labs.com

  26. Benchmarks – Large Dataset

  27. Benchmarks – Small Dataset

  28. Benchmarks – Large No. of Iterations
