Decentralized Resource Management for Multi-core Desktop Grids Jaehwan Lee, Pete Keleher, Alan Sussman Department of Computer Science University of Maryland
Multi-core is not enough • Multi-core CPUs are the current trend in desktop computing • Not easy to exploit multi-core in a single machine for high-throughput computing • “Multicore Is Bad News for Supercomputers”, S. Moore, IEEE Spectrum, 2008 • No decentralized solution exists for multi-core grids
Challenges in Multi-core P2P Grids • Features of structured P2P grids • For effective matchmaking, a structured P2P platform based on a Distributed Hash Table (DHT) is needed • A structured DHT is sensitive to frequent dynamic updates of node status • How to represent a multi-core node in a P2P structure? • If a distinct logical peer represents each core: • Cannot support multi-threaded jobs • Cannot accommodate jobs requiring large shared resources • If a logical peer represents a whole multi-core machine: • Contention for shared resources among the cores • Can waste some cores due to misled matchmaking • Needs to advertise dynamic status for residual resources • Contention for shared resources • No simple model for a P2P grid
Our Contributions • Decentralized Resource Management Schemes for Multi-core Grids • Two logical nodes for a physical machine • Dual-CAN & Balloon Model for p2p structure • New Matchmaking & Load Balancing Scheme • Simple Analytic Model for a Multi-core Node • Contention for shared resources
Outline • Background • Decentralized Resource Management for Multi-core Grids • Simulation Model • Experimental Results • Related work • Conclusion & Future Work
P2P Desktop Grid System • [Figure: a job J with requirements CPU 2.0GHz, Mem 500MB, Disk 1GB is submitted to a P2P network of desktop nodes, each contributing CPU, memory, and disk] • Decentralized matchmaking and load balancing? • Content-Addressable Network (CAN)
Overall System Architecture • P2P grids • [Figure: a client submits Job J; an injection node assigns a GUID to Job J and routes it through the peer-to-peer network (DHT - CAN) to its owner node; the owner initiates matchmaking using heartbeats, finds a run node, and inserts Job J into that node's FIFO job queue]
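To make the lifecycle above concrete, here is a minimal sketch (not the authors' implementation; class and function names are illustrative assumptions) of a job moving from client submission to its owner node:

```python
# Illustrative walk-through of the job lifecycle: a client submits Job J, an
# injection node assigns it a GUID, and the job is routed to the owner node
# whose CAN zone contains the point given by the job's resource requirements
# (see the matchmaking slide below). Names are assumptions for illustration.
import uuid
from dataclasses import dataclass

@dataclass
class Job:
    cpu_ghz: float
    mem_gb: float
    guid: str = ""

def inject(job: Job) -> Job:
    """Injection node: assign a GUID so the job can be tracked in the system."""
    job.guid = uuid.uuid4().hex
    return job

def owner_point(job: Job) -> tuple:
    """The CAN point whose zone determines the job's owner node."""
    return (job.cpu_ghz, job.mem_gb)

job = inject(Job(cpu_ghz=2.0, mem_gb=0.5))
print(job.guid[:8], owner_point(job))
# The owner then initiates matchmaking (driven by heartbeats), pushes the job
# to a run node, and the run node holds it in its FIFO queue until a core frees.
```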
Matchmaking Mechanism in CAN • [Figure: the CAN space is partitioned into zones A-I along the CPU and Memory dimensions; the client inserts Job J at the owner zone determined by its requirements (CJ, MJ); the job is pushed toward zones with more resources until a run node is found, where it enters the FIFO queue; heartbeat messages carry load information] • A run node must satisfy CPU >= CJ && Memory >= MJ
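A minimal sketch of that matchmaking condition (not the authors' code; class and field names are illustrative):

```python
# Illustrative sketch of the CAN matchmaking condition: a candidate run node
# qualifies only if it meets or exceeds the job's requirement in every
# resource dimension. Class and field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Requirements:
    cpu_ghz: float   # C_J
    mem_gb: float    # M_J

@dataclass
class Node:
    cpu_ghz: float
    mem_gb: float

def satisfies(node: Node, job: Requirements) -> bool:
    # Mirrors the slide's condition: CPU >= C_J && Memory >= M_J
    return node.cpu_ghz >= job.cpu_ghz and node.mem_gb >= job.mem_gb

# Example: a 2.4 GHz / 4 GB node can host a job needing 2.0 GHz / 0.5 GB.
print(satisfies(Node(2.4, 4.0), Requirements(2.0, 0.5)))  # True
```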
Outline • Background • Decentralized Resource Management for Multi-core Grids • Simulation Model • Experimental Results • Related work • Conclusion & Future Work
Two logical nodes • Max-node: the maximum values for each resource • A static point in the CAN (like the single-core case) • Residue-node: the currently available resources • Dynamic usage status for the node • Always a free node • If a node is free (has no job in the queue) or totally busy (all cores are running jobs), the Residue-node does not exist, so there are far fewer Residue-nodes than Max-nodes • [Figure: a quad-core node (CPU 2GHz, Mem 4GB, 4 cores) accepts jobs one by one; its Max-node stays fixed at (2GHz, 4GB, 4 cores) while its Residue-node shrinks from (2GHz, 4GB, 4 free cores) to (2GHz, 3.3GB, 3), (2GHz, 2.1GB, 2), and (2GHz, 0.6GB, 1)]
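A minimal sketch (illustrative names, not the authors' code) of how one physical machine could expose these two logical nodes:

```python
# Illustrative sketch of the two logical views of one physical machine:
# the static Max-node and the dynamic Residue-node. Names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    cpu_ghz: float
    mem_gb: float

@dataclass
class Machine:
    cpu_ghz: float          # per-core clock speed
    mem_gb: float           # total shared memory
    cores: int
    running: List[Job] = field(default_factory=list)

    def max_node(self) -> tuple:
        """Static advertisement: full capability, never changes."""
        return (self.cpu_ghz, self.mem_gb, self.cores)

    def residue_node(self) -> Optional[tuple]:
        """Dynamic advertisement: what is still free right now.
        Returns None when the machine is completely free or completely busy,
        matching the slide: in those cases no Residue-node exists."""
        used_cores = len(self.running)
        if used_cores == 0 or used_cores >= self.cores:
            return None
        free_mem = self.mem_gb - sum(j.mem_gb for j in self.running)
        return (self.cpu_ghz, free_mem, self.cores - used_cores)

m = Machine(cpu_ghz=2.0, mem_gb=4.0, cores=4)
m.running.append(Job(1.2, 0.7))
print(m.max_node())       # (2.0, 4.0, 4) -- static
print(m.residue_node())   # (2.0, ~3.3, 3) -- shrinks as jobs arrive
```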
Dual-CAN • Primary CAN: composed of Max-nodes • The same as the original CAN in the single-core case (static) • Secondary CAN: composed of Residue-nodes • Fewer nodes in the Secondary CAN (dynamic) • Example: a single-core node A (1.5GHz, 2GB) and a dual-core node B (2GHz, 3GB) • [Figure: A and B appear as Max-nodes in the Primary CAN; after B starts Job 1 (CPU 2GHz, Mem 2GB) its queue length becomes 1, and its free Residue-node B' (2GHz, 1GB) joins the Secondary CAN]
Balloon Model • Balloons: light-weight structures for Residue-nodes • Keep only the coordinates (currently available resources) and load information • Attached to a zone in the (Primary) CAN • No CAN join & leave or exchange of updates is necessary • Example: a single-core node A (1.5GHz, 2GB) and a dual-core node B (2GHz, 3GB) • [Figure: after B starts Job 1 (CPU 2GHz, Mem 2GB) its queue length becomes 1, and a free Residue-node balloon B' (2GHz, 1GB) is attached to a zone of the Primary CAN instead of joining a second CAN]
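The following sketch contrasts the two ways of publishing a Residue-node; the data structures and names are assumptions for illustration, not the authors' code:

```python
# Illustrative contrast of the two ways to publish a Residue-node.
# Dual-CAN: the residue joins a *separate* Secondary CAN as a full peer
# (CAN join/leave plus neighbor updates each time the residue changes).
# Balloon model: the residue is just a record attached to an existing
# Primary-CAN zone, so no join/leave or update exchange is needed.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Balloon:
    owner_id: str
    cpu_ghz: float     # currently available CPU
    mem_gb: float      # currently available memory
    queue_len: int     # load information carried with the balloon

@dataclass
class PrimaryZone:
    node_id: str                                            # Max-node owning this zone
    balloons: List[Balloon] = field(default_factory=list)   # Balloon model only

zone = PrimaryZone(node_id="A")
zone.balloons.append(Balloon(owner_id="B", cpu_ghz=2.0, mem_gb=1.0, queue_len=1))
print(zone.balloons[0])
```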
Computing Aggregated Load • [Figure: nodes A-E partition the CAN along the CPU and Memory dimensions; each node aggregates load information from its neighbors along a dimension] • Aggregated load information: number of nodes, balloons & cores, and the sum of used cores
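A minimal sketch of merging such load summaries along one dimension; the slide only states what is aggregated, so the record layout and the merge step below are assumptions for illustration:

```python
# Illustrative sketch of aggregating load information along one CAN dimension:
# counts of nodes, balloons, and cores, plus the number of used cores.
from dataclasses import dataclass

@dataclass
class LoadSummary:
    nodes: int = 0
    balloons: int = 0
    cores: int = 0
    used_cores: int = 0

    def merge(self, other: "LoadSummary") -> "LoadSummary":
        return LoadSummary(self.nodes + other.nodes,
                           self.balloons + other.balloons,
                           self.cores + other.cores,
                           self.used_cores + other.used_cores)

    def avg_core_utilization(self) -> float:
        return self.used_cores / self.cores if self.cores else 0.0

# Each node reports its own summary plus what it has heard from neighbors
# further along the dimension, so a pushing decision can compare directions.
local = LoadSummary(nodes=1, balloons=1, cores=4, used_cores=3)
from_neighbors = LoadSummary(nodes=3, balloons=0, cores=10, used_cores=2)
aggregate = local.merge(from_neighbors)
print(aggregate.avg_core_utilization())  # ~0.36
```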
Decision Algorithms - Pushing • Target node (Where?) • Smaller aggregated average core utilization and larger number of available cores • Stopping criteria (When?) • Found a free node • Probabilistic stopping • Criteria for the best run node (Which?) • Among the free nodes: the node with the fastest CPU • If there are no free nodes: the fastest Balloon or node in the Secondary CAN • Using a score function: prefer lower core utilization and a faster CPU
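The following is a minimal sketch of that three-part decision (where / when / which). The probabilistic stopping rule and the score function below are simplified stand-ins under stated assumptions; they are not the paper's exact formulas.

```python
# Illustrative sketch of the pushing decision, following the bullets above:
# Where to push, When to stop, Which node to pick.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    cpu_ghz: float
    core_utilization: float  # fraction of cores in use (0.0 = free node)

def choose_direction(summaries: dict) -> str:
    """Where: prefer the dimension whose aggregate has lower average core
    utilization and more available cores."""
    return min(summaries,
               key=lambda d: (summaries[d]["avg_util"], -summaries[d]["free_cores"]))

def should_stop(found_free_node: bool, hops: int, stop_base: float = 0.2) -> bool:
    """When: stop at a free node, otherwise stop probabilistically
    (probability grows with the number of hops in this toy version)."""
    return found_free_node or random.random() < min(1.0, stop_base * hops)

def best_run_node(candidates: List[Candidate]) -> Candidate:
    """Which: fastest CPU among free nodes; otherwise a toy score preferring
    lower core utilization and a faster CPU."""
    free = [c for c in candidates if c.core_utilization == 0.0]
    if free:
        return max(free, key=lambda c: c.cpu_ghz)
    return max(candidates, key=lambda c: c.cpu_ghz * (1.0 - c.core_utilization))

dims = {"cpu": {"avg_util": 0.4, "free_cores": 6},
        "mem": {"avg_util": 0.2, "free_cores": 3}}
print(choose_direction(dims))            # 'mem' (lower average utilization)
print(should_stop(False, hops=3))        # random; more likely True as hops grow
cands = [Candidate(2.0, 0.5), Candidate(1.6, 0.0), Candidate(2.4, 0.75)]
print(best_run_node(cands))              # the free 1.6 GHz node wins over busier ones
```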
Pushing: Dual-CAN • [Figure: Job J (CPU CJ, Mem MJ) is inserted by the client at owner node O in the Primary CAN; each hop aggregates load information and pushes the job toward less-loaded zones; when no suitable node is found in the Primary CAN, the job is pushed into the Secondary CAN of Residue-nodes, where it stops at run node S]
Pushing: Balloon Model • [Figure: Job J (CPU CJ, Mem MJ) starts at owner node O in the CAN; each hop aggregates load information and pushes the job toward less-loaded zones; the push stops at run node S, a Balloon attached to node D's zone]
Outline • Background • Decentralized Resource Management for Multi-core Grids • Simulation Model • Experimental Results • Related work • Conclusion & Future Work
Contention for shared resources (the worst case) • Contention for shared resources (memory, I/O) can degrade overall performance on a multi-core CPU • If the jobs are extremely memory intensive, performance can drop drastically • What is STREAM? • A benchmark that measures memory bandwidth • Generates extremely memory-intensive jobs (copy, scale, add and triad) • Experiments • (1) Run one memory-intensive job (STREAM) on a dual-core CPU (leaving the other core idle) • (2) Run two memory-intensive jobs on a dual-core CPU simultaneously • Compare the running times of (1) & (2) • On average, the running time of (2) is longer than that of (1) by a factor of 2.09
Effect of contention with general scientific computing jobs • Alam et al.'s experiment • Ran several scientific computing benchmarks (NAS, AMBER, LAMMPS, POP) on a dual-core machine • Compared running one task on a dual-core machine with running two tasks on it simultaneously • The running time for two tasks is higher by 3.8% to 27% (average: 10.97%) • SPEC CPU2006 experiment • The same experiment with the SPEC CPU2006 benchmark suite on a dual-core machine • The running-time increase is 6% (with the gcc compiler) and 10% (with the icc compiler) on average
Our Simulation Model • Assumption: a job requiring more memory is likely to be more memory-intensive • Worst case: running time can increase by a factor of n (n: the number of cores) • General case: running time can increase by p% (p = 10, from the previous experiments) • [Figure: the model's running-time ratio α as a function of the demand ratio Ci/Ri, ranging between the general-case increase of p% and the worst-case factor of n given by the contention penalty Ω] • α: running-time increase • n: the number of cores • p: contention penalty from the experimental results • Ri: amount of resource i in the node • Ci: sum of job requirements for resource i • Ω: contention penalty
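A minimal sketch of such a contention model, assuming a simple linear interpolation between the general-case penalty p% and the worst-case factor n as the demanded share of the resource grows; the interpolation shape is an assumption, and the slide's exact formula is not reproduced here:

```python
# Toy contention model for simulation, in the spirit of the slide above.
# ASSUMPTION: linear interpolation between the general-case slowdown (1 + p/100)
# and the worst case (factor of n) as the demanded share of the resource grows.

def running_time_ratio(n_cores: int, p_percent: float,
                       demanded: float, available: float) -> float:
    """Return alpha, the factor by which running time increases.

    n_cores   -- number of cores sharing the resource (worst-case slowdown)
    p_percent -- measured contention penalty for typical jobs (e.g. 10)
    demanded  -- sum of co-running jobs' requirements for the resource (C_i)
    available -- amount of the resource in the node (R_i)
    """
    ratio = min(1.0, demanded / available)          # C_i / R_i, capped at 1
    light = 1.0 + p_percent / 100.0                 # general case
    worst = float(n_cores)                          # extremely memory-intensive case
    return light + (worst - light) * ratio

# Two typical jobs using half the memory on a dual-core node: modest slowdown.
print(running_time_ratio(n_cores=2, p_percent=10, demanded=2.0, available=4.0))  # 1.55
# Memory fully demanded on a dual-core node: approaches the STREAM-like worst case.
print(running_time_ratio(n_cores=2, p_percent=10, demanded=4.0, available=4.0))  # 2.0
```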
Outline • Background • Decentralized Resource Management for Multi-core Grids • Simulation Model • Experimental Results • Related work • Conclusion & Future Work
Experimental Setup • Event-driven simulations • A set of nodes and events • 1000 initial nodes and 5000 job submissions • Job arrivals follow a Poisson distribution • A node has 1, 2, 4 or 8 cores • Job running time follows a uniform distribution (30 to 90 minutes) • Node capability (job requirements) • CPU, memory, disk and the number of cores • Steady-state experiments
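A sketch of how such a workload could be generated for an event-driven simulation; only the distributions named on the slide are taken from it, while the mean inter-arrival time and the resource ranges are assumptions for illustration:

```python
# Illustrative workload generator: Poisson job arrivals, uniform running times
# of 30-90 minutes, nodes with 1, 2, 4 or 8 cores. Other parameters are assumed.
import random

random.seed(0)

def make_nodes(count: int = 1000):
    return [{"cores": random.choice([1, 2, 4, 8]),
             "cpu_ghz": random.uniform(1.0, 3.0),
             "mem_gb": random.choice([1, 2, 4, 8])}
            for _ in range(count)]

def make_jobs(count: int = 5000, mean_interarrival_min: float = 1.0):
    t = 0.0
    jobs = []
    for _ in range(count):
        t += random.expovariate(1.0 / mean_interarrival_min)  # Poisson arrivals
        jobs.append({"submit_time": t,
                     "run_time": random.uniform(30.0, 90.0),   # 30-90 minutes
                     "cpu_ghz": random.uniform(0.5, 2.5),
                     "mem_gb": random.uniform(0.1, 2.0)})
    return jobs

nodes, jobs = make_nodes(), make_jobs()
print(len(nodes), len(jobs), round(jobs[-1]["submit_time"], 1))
```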
Experimental Setup • Performance metrics: matchmaking cost, wait time, running time, and job turn-around time • [Figure: timeline of a job from injection into the system, arrival at the owner, arrival at the run node, start of execution, to finish of execution; matchmaking cost spans injection to arrival at the run node, wait time spans arrival at the run node to the start of execution, and turn-around time spans the whole interval] • Matchmaking frameworks • CAN (Dual-CAN & Balloon Model) • Multiple Peers (MP) and Centralized Matchmaker (CENT)
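Computing the metrics from a job's event timestamps, following the timeline above (field names are assumptions for illustration):

```python
# Illustrative computation of the performance metrics from a job's timeline.
from dataclasses import dataclass

@dataclass
class JobEvents:
    injected: float          # injected into the system
    arrived_owner: float     # arrives at the owner node
    arrived_run_node: float  # arrives at the run node
    started: float           # starts execution
    finished: float          # finishes execution

def metrics(e: JobEvents) -> dict:
    return {
        "matchmaking_cost": e.arrived_run_node - e.injected,
        "wait_time": e.started - e.arrived_run_node,
        "running_time": e.finished - e.started,
        "turnaround_time": e.finished - e.injected,
    }

print(metrics(JobEvents(0.0, 0.4, 1.0, 5.0, 65.0)))
```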
Comparison Models • Centralized Matchmaker (CENT) • Online and global scheduling mechanism • Not feasible in a complete implementation of a P2P system • Multiple Peers (MP) • An individual peer for each core, with the shared resources divided equally among the peers • Condor's current strategy • [Figure: a job J with requirements CPU 2.0GHz, Mem 500MB, Disk 1GB submitted to a centralized matchmaker]
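A small illustration (with made-up numbers) of why MP's static partitioning can reject jobs that the whole machine could run, which is what the completeness results below measure:

```python
# Illustrative contrast between Multiple Peers (MP) and whole-machine
# matchmaking. Under MP, each core is advertised as an independent peer with
# an equal share of the shared resources, so a job needing more than one share
# cannot be placed even though the machine as a whole could run it.
machine = {"cores": 4, "mem_gb": 4.0}
job = {"mem_gb": 1.5}  # single-threaded job needing 1.5 GB

mp_peer_mem = machine["mem_gb"] / machine["cores"]   # 1.0 GB per logical peer
print(job["mem_gb"] <= mp_peer_mem)                  # False: MP rejects the job
print(job["mem_gb"] <= machine["mem_gb"])            # True: the machine could run it
```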
Result (1) – Completeness • Single-threaded jobs • Dual-CAN & Balloon model can run all jobs • MP: 80% completeness • For a fair comparison • Submit only jobs that can run on MP to Balloon or Dual-CAN (Balloon-L, Dual-CAN-L) • Balloon-L, Dual-CAN-L, and MP show similar performance • MP cannot meet completeness
Cost (1) - Overheads • Cost: number of messages, volume of messages • MP's cost is higher than that of the two proposed schemes • Cost is proportional to the number of peers • The Balloon model is cheaper than Dual-CAN
Result (2) – Load Balance • Multi-threaded jobs • Load balancing performance • Dual-CAN > CENT > Balloon model • Why is CENT worse? • CENT is based on a greedy algorithm (over-provisioning)
Cost (2) - Overhead • Vanilla: the cost without the additional costs incurred by Balloon or Dual-CAN • Costs: Dual-CAN > Balloon model == Vanilla
Evaluation Summary • Performance • Completeness: Dual-CAN and Balloon achieve it (MP cannot) • Load balance: Dual-CAN >= Balloon == CENT (competitive load balance) • Overheads • MP >> Dual-CAN >= Balloon (due to the number of peers) • Dual-CAN >= Balloon == Vanilla (low overhead)
Related Work • Time-to-Live (TTL) based mechanisms • Caromel et al. (Parallel Computing, 2007), Mastroianni et al. (EGC, 2005) • Lack of completeness • Encoding resource information using a DHT • Cheema et al. (Grid, 2005), CompuP2P (TPDS, 2006) • Lack of load balance and parsimony • Grids for multi-core desktops • Condor: static partitioning that handles a multi-core node as a set of independent entities
Conclusion and Future Work • New decentralized resource management for multi-core P2P grids • Two logical nodes capture static & dynamic features • Dual-CAN and Balloon Model • Simple analytic model for multi-core simulation that accounts for resource contention • Evaluation via simulation • Completeness (better than Multiple Peers) • Load balance (competitive with the Centralized Matchmaker) • Low overhead • Future work • Real experiments (in cooperation with the Astronomy Dept.) • Resource management for heterogeneous multi-processors
Decentralized Resource Management for Multi-core Desktop Grids Jaehwan Lee, Pete Keleher, Alan Sussman Department of Computer Science University of Maryland
Decision Functions • Target node (Where?): choose the target dimension by minimizing an objective function over the aggregated load information along each dimension d • Stopping criteria (When?): found a free node OR stop with a probability, derived from a stopping factor, of ending the push from node N • Criteria for the best run node (Which?): among the free nodes OR via a score function for a candidate run node C • [The formulas for the objective function, stopping probability, and score function appear as equations on the original slide]