
Efficient Hierarchical Self-Scheduling for MPI Applications Executing in Computational Grids

Aline Nascimento, Alexandre Sena, Cristina Boeres and Vinod Rebello, Instituto de Computação, Universidade Federal Fluminense, Niterói (RJ), Brazil. http://easygrid.ic.uff.br


Presentation Transcript


  1. Efficient Hierarchical Self-Scheduling for MPI Applications Executing in Computational Grids Aline Nascimento, Alexandre Sena, Cristina Boeres and Vinod Rebello Instituto de Computação Universidade Federal Fluminense, Niterói (RJ), Brazil http://easygrid.ic.uff.br e-mail: {depaula,asena,boeres,vinod}@ic.uff.br

  2. Talk Outline • Introduction • Concepts and Related Work • The EasyGrid Application Management System • Hybrid Scheduling • Experimental Analysis • Conclusions and Future Work

  3. Introduction • Grid computing has become increasingly widespread around the world • Growth in popularity means a larger number of applications will compete for limited resources • Efficient utilisation of the grid infrastructure will be essential to achieve good performance • Grid infrastructure is typically • Composed of diverse heterogeneous computational resources interconnected by networks of varying capacities • Shared, executing both grid and local user applications • Dynamic, resources may enter and leave without prior notice • All of which makes it hard to develop efficient grid management systems

  4. Introduction • Much research is being invested in the development of specialised middleware responsible for • Resource discovery and controlling access • The efficient and successful execution of applications on the available resources • Three implementation philosophies can be defined • Resource management systems (RMS) • User management systems (UMS) • Application management systems (AMS)

  5. Concepts • Resource Management System • Adopts a system-centric viewpoint • Centralised in order to manage a number of applications simultaneously • Aims to maximise system utilisation, considering just the resource requirements of the applications and not their internal characteristics • Scheduling middleware is installed on a single central server, monitoring middleware on the grid nodes • User Management System • Adopts an application-centric viewpoint • One per application, collectively decentralised • Utilises the grid in accordance with resource availability and the application’s topology (e.g. Bag of Tasks) • Installed on every machine from which an application can be launched

  6. Concepts • Application Management System • Also adopts an application-centric viewpoint • One per application • Utilises the grid in accordance with both the available resources and the application’s characteristics • Is embedded into the application, offering better portability • Works in conjunction with a simplified RMS • Transforms applications into system-aware versions • Each application can be made aware of its computational requirements and adjust itself according to the grid environment

  7. Concepts • The EasyGrid AMS is • Hierarchically distributed within an application • Decentralised amongst various applications • Application specific • Designed for MPI applications • Automatically embedded into the application • EasyGrid simplifies the process of grid-enabling existing MPI applications • The grid resources need only offer core middleware services and an MPI communication library

  8. Objectives • This paper focuses specifically on the problem of scheduling processes within the EasyGrid AMS • The AMS scheduling policies are application specific • This paper highlights the scheduling features through the execution of bag of tasks applications • The main goals are • To show the viability of the proposed scheduling strategy in the context of an Application Management System • To quantify the quality of the results obtainable

  9. The EasyGrid AMS Middleware • The EasyGrid framework is an AMS middleware for MPI implementations with dynamic process creation [Figure: the MPI code wrapper placed around each application MPI process, providing application monitoring, dynamic scheduling, process creation and management, fault tolerance and inter-process communication, with monitoring data and control commands exchanged between the AMS management and application processes]

  10. The EasyGrid AMS Architecture • A three-level hierarchical management system [Figure: the computational grid with a single Global Manager (GM), one Site Manager (SM) per site (Sites 1-4) and a Host Manager (HM) on each resource within a site]

  11. The EasyGrid Portal • The EasyGrid Scheduling Portal is responsible for • Choosing the appropriate scheduling and fault tolerance policies for the application • Determining an initial process allocation • Compiling the system aware application • Managing the user’s grid proxy • Creating the MPI environment (including transferring files) • Providing fault tolerance for the GM process Acts like a simplified resource management system

  12. The EasyGrid AMS • The GM creates one SM per site • Each SM spawns an HM on each remaining resource at its respective site • The application processes are created dynamically according to the local policy of each HM • Processes are created with unique communicators • This gives rise to the hierarchical AMS structure • Communication can only take place between parent and child processes • HMs, SMs and the GM route messages between application processes
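The hierarchy described above is built through MPI-2 dynamic process creation. The fragment below is only an illustrative sketch of that mechanism, not EasyGrid code: the executable name site_manager and the spawn count are assumptions. It shows how MPI_Comm_spawn returns an intercommunicator linking a parent exclusively to the group it spawned, which is why traffic between processes in different branches must be routed through the GM, SMs and HMs.

```c
/* Illustrative sketch of hierarchical spawning with MPI-2 dynamic process
 * management; not the EasyGrid implementation.  A manager spawns a group of
 * children and receives an intercommunicator that links only this parent
 * with that group. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm children;             /* intercommunicator to the spawned group */
    int errcodes[4];
    char cmd[] = "site_manager";   /* hypothetical child executable */

    MPI_Init(&argc, &argv);

    /* e.g. the GM spawning one SM per site (4 sites assumed here);
     * each SM would repeat the pattern to spawn the HMs of its site. */
    MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, errcodes);

    /* Messages can only flow over this parent-child intercommunicator;
     * the destination rank refers to the remote (child) group. */
    int token = 1;
    MPI_Send(&token, 1, MPI_INT, 0, 0, children);

    MPI_Comm_free(&children);
    MPI_Finalize();
    return 0;
}
```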

  13. Process Scheduling • Scheduling is made complex by the dynamic and heterogeneous characteristics of grids • Predicting the processing power and communication bandwidth available to an application is difficult • Static schedulers • Estimates assumed a priori may be quite different at runtime • More sophisticated heuristics can be employed at compile time • Dynamic schedulers • Access to accurate runtime system information • Decisions need to be made quickly to minimise runtime intrusion • Hybrid schedulers combine the advantages of both by integrating static and dynamic schedulers

  14. Static Scheduling • The scheduling heuristics require models that capture the relevant characteristics of both the • application (typically represented by a DAG) and • the target system • To define the parameters of the architectural model • The Portal executes an MPI modelling program with the user’s credentials to determine the realistic performance currently available to this user’s application • At start-up, application processes are allocated to the resources in accordance with the static schedule

  15. Dynamic Scheduling • Modifying the initial allocation at run time is essential, given the difficulty of • Extracting precise information with regard to the application’s characteristics • Predicting the performance available from shared grid resources • The current version only reschedules processes which have yet to be created • The dynamic schedulers are associated with each of the management processes distributed across the three levels of the AMS hierarchy

  16. AMS Hierarchical Schedulers [Figure: the scheduler hierarchy across Sites 1-3: a Global Dynamic Scheduler (GDS) associated with the GM, a Site Dynamic Scheduler (SDS) associated with each SM, and a Host Dynamic Scheduler (HDS) associated with each HM]

  17. Dynamic Scheduling • The appropriate scheduling policies depend on • The class of the application and the user’s objectives • Different policies may be used in different layers of the hierarchy, and even within the same layer • The dynamic schedulers collectively • Estimate the remaining execution time on each resource • Verify whether the allocation needs to be adjusted • If necessary, activate the rescheduling mechanism • The activation of the rescheduling mechanism characterises a scheduling event

  18. Host Dynamic Scheduler • The HDS determines both the order and the instant at which an application process should be created on the host • Possible scheduling policies to determine the process sequence include • The order specified by the static scheduler • Data flow: select any ready task • A second policy is necessary to indicate when the selected process may execute • The optimal number of application processes that should execute concurrently depends on their I/O characteristics • It is often also influenced by local usage restrictions
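The two HDS decisions above can be pictured as one small control loop. The sketch below is a schematic reconstruction under stated assumptions, not the EasyGrid implementation: the task structure, the MAX_CONCURRENT limit and the static-order selection function are hypothetical, and real process creation would go through MPI dynamic spawning rather than a printf.

```c
/* Schematic sketch of an HDS creation policy (hypothetical helpers, not the
 * EasyGrid code).  Two decisions from slide 18: (1) which ready process to
 * create next, (2) when the host may create it. */
#include <stdio.h>

#define MAX_CONCURRENT 2          /* assumed per-host limit (I/O dependent) */
#define NUM_TASKS      8

typedef struct { int id; int ready; int started; } task_t;

/* Policy (a): follow the order given by the static scheduler. */
static int next_by_static_order(task_t *t, int n)
{
    for (int i = 0; i < n; i++)
        if (t[i].ready && !t[i].started)
            return i;             /* tasks stored in static-schedule order */
    return -1;
}

int main(void)
{
    task_t tasks[NUM_TASKS];
    int running = 0, finished = 0;

    for (int i = 0; i < NUM_TASKS; i++)
        tasks[i] = (task_t){ .id = i, .ready = 1, .started = 0 };  /* BoT: all ready */

    while (finished < NUM_TASKS) {
        /* Policy (b): only create a new process while the host is below the
         * concurrency limit chosen for this application's I/O profile. */
        while (running < MAX_CONCURRENT) {
            int i = next_by_static_order(tasks, NUM_TASKS);
            if (i < 0) break;
            tasks[i].started = 1;
            running++;
            printf("HDS: creating application process %d\n", tasks[i].id);
            /* In the real AMS this would spawn the task as an MPI process. */
        }
        /* Simulate a completion event reported by the monitor. */
        running--;
        finished++;
    }
    return 0;
}
```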

  19. Host Dynamic Scheduler - HDS • When an application process terminates on a resource, the monitor makes available to the HDS the process’s • wall clock time • CPU execution time, together with the heterogeneity factor • From these the HDS derives the percentage resource utilisation, the resource’s current computational power (cp) and its estimated remaining time (ert) • Both cp and ert are added to the monitoring message and sent to the Site Manager [Figure: each HDS reports its cp and ert upwards to the SDS]
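The slide names these quantities but not their formulas, so the block below is only one plausible formulation, assumed for illustration: utilisation from the CPU/wall-clock ratio, cp weighted by the heterogeneity factor, and ert from the work still queued on the host.

```latex
% Assumed forms only; the paper's exact definitions are not reproduced here.
\[
  u_h = \frac{t^{\mathrm{cpu}}_h}{t^{\mathrm{wall}}_h}, \qquad
  cp_h = u_h \cdot f_h, \qquad
  ert_h = \frac{\sum_{j \in Q_h} w_j}{cp_h}
\]
% u_h : percentage utilisation of host h by the application
% f_h : heterogeneity (relative speed) factor of host h
% Q_h : application processes still to run on h, with work estimates w_j
```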

  20. Site Dynamic Scheduler - SDS • On receiving the cp and the ert from each resource in the site, the SDS calculates • The ideal makespan for the remaining processes in the site • The site imbalance index • If the site imbalance index is above a predefined threshold, a scheduling event at the SDS is triggered • The SDS requests a percentage of the remaining workload from the most overloaded host • That host’s HDS receives the request and decides which tasks to release • The HDS sends the tasks to be rescheduled to the SDS • The SDS distributes these tasks amongst the under-loaded hosts
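A compact sketch of the SDS decision just described, with hypothetical names, simple assumed definitions of the ideal makespan and the imbalance index, and an invented threshold and migration fraction; the paper's exact formulas may differ.

```c
/* Schematic sketch of the SDS scheduling-event test on slide 20
 * (assumed metrics, not the EasyGrid formulas). */
#include <stdio.h>

#define HOSTS      4
#define THRESHOLD  0.20   /* assumed imbalance threshold */
#define MIGRATE    0.50   /* assumed fraction of the excess to request */

int main(void)
{
    /* cp and ert as reported by each HDS in its monitoring messages. */
    double cp[HOSTS]  = { 1.0, 0.9, 1.1, 1.0 };
    double ert[HOSTS] = { 120.0, 40.0, 35.0, 45.0 };

    /* Ideal makespan: total remaining work divided by total power. */
    double work = 0.0, power = 0.0;
    for (int h = 0; h < HOSTS; h++) { work += ert[h] * cp[h]; power += cp[h]; }
    double ideal = work / power;

    /* Imbalance index: relative excess of the slowest host over the ideal. */
    int worst = 0;
    for (int h = 1; h < HOSTS; h++) if (ert[h] > ert[worst]) worst = h;
    double imbalance = (ert[worst] - ideal) / ideal;

    if (imbalance > THRESHOLD) {
        /* Scheduling event: ask the most overloaded host to release a
         * percentage of its remaining workload, then redistribute it
         * amongst the under-loaded hosts. */
        double request = MIGRATE * (ert[worst] - ideal) * cp[worst];
        printf("SDS: imbalance %.2f > %.2f, requesting %.1f work units from host %d\n",
               imbalance, THRESHOLD, request, worst);
    } else {
        printf("SDS: site balanced (imbalance %.2f)\n", imbalance);
    }
    return 0;
}
```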

  21. Site Dynamic Scheduler • When not executing a scheduling event, the SDS calculates • The average estimated remaining time (ert) of the site • The sum of the computational power (cp) of the site’s resources • Both cp and ert are added to the monitoring message and sent to the Global Manager [Figure: each SDS reports its site’s cp and ert upwards to the GDS]

  22. Global Dynamic Scheduler - GDS • On receiving the cp and the ert from each site, the GDS calculates • The ideal makespan for the whole application • The system imbalance index • When the system imbalance index indicates a need to reschedule, the GDS requests from the most overloaded site a percentage of its remaining workload • That site’s SDS receives the request and forwards it to its HDSs • The SDS waits for each HDS to answer and sends the resulting list of tasks to the GDS • The GDS distributes the tasks between the under-loaded sites • Each under-loaded site reschedules the received tasks amongst its hosts
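At the global level the same two quantities are computed from the site aggregates. The definitions below are assumed analogues of the site-level ones, not formulas taken from the paper:

```latex
% Assumed analogues of the site-level quantities (not verbatim from the paper).
\[
  M_{\mathrm{ideal}} = \frac{\sum_{s} ert_s \cdot cp_s}{\sum_{s} cp_s}, \qquad
  I_{\mathrm{sys}} = \frac{\max_{s} ert_s - M_{\mathrm{ideal}}}{M_{\mathrm{ideal}}}
\]
% A scheduling event at the GDS is triggered when I_sys exceeds its threshold.
```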

  23. Experimental Analysis • These scheduling policies were designed for BoT applications such as parameter sweeps (PSwp) • A PSwp can be represented by a simple fork-join DAG • The scheduling policies for this class of DAG can be seen as load balancing strategies • Semi-controlled three-site grid environment • Pentium IV 2.6 GHz processors with 512 MB of RAM • Running Linux Fedora Core 2, Globus Toolkit 2.4 and LAM/MPI 7.0.6

  24. Experimental Analysis [Figure: the three-site testbed used in the experiments: nodes sn00-sn27 and the sinergia front-end, interconnected by 1 Gb/s links within sites and 100 Mb/s links between sites, together with the placement of PSwp1 and PSwp2 over the sites]

  25. Experimental Analysis • Same HDS policies for PSwp1 and PSwp2 • The overhead due to the AMS is very low, not exceeding 2.5% • The cost for rescheduling is also small, less than 1.2% • The standard deviation is less than the granularity of a single task

  26. Experimental Analysis • Different HDS policies for PSwp1 and PSwp2 • PSwp1 processes execute only when resources are idle • The interference produced by PSwp1 is less than 0.8% • The cost of rescheduling does not exceed 1.4%

  27. Experimental Analysis • EasyGrid AMS with different scheduling policies • Round robin used by MPI without dynamic scheduling • Near optimal static scheduling without dynamic scheduling • Round robin used by MPI + dynamic scheduling • Near optimal static scheduling + dynamic scheduling

  28. Conclusion • In an attempt to improve application performance, hierarchical hybrid scheduling is employed • The low cost of the hierarchical scheduling methodology leads to the efficient execution of BoT MPI applications • Different scheduling policies may be used at different levels of the scheduling hierarchy • Having one AMS per application permits application-specific scheduling policies to be used • System awareness permits various applications to coordinate their scheduling efforts in a decentralised manner to obtain good system utilisation

  29. Efficient Hierarchical Self-Scheduling for MPI Applications Executing in Computational Grids Thanks e-mail: {depaula,asena,boeres,vinod}@ic.uff.br

  30. Calculation of the Optimal Value • Alone, or together with PSwp2 given priority on the shared resources: Number of Machines = 7 + 11 = 18; Number of Tasks = 1000; tasks/machines = 1000/18 ≈ 56 → Optimal value = 56 * Duration of the Task (PSwp1’s value for the priority case is computed on the next slide) • Together (PSwp1 and PSwp2 with the same HDS policies): Number of Machines = 7 + 11 (shared) = 7 + 5.5 = 12.5; Number of Tasks = 1000; tasks/machines = 1000/12.5 = 80 → Optimal value = 80 * Duration of the Task

  31. Calculation of the Optimal Value • Together (PSwp1 uses only idle resources): when PSwp2 finishes, PSwp1 • Has already executed 56 tasks on each idle resource (7 machines): 56 tasks * 7 machines = 392 tasks executed • Leaving 1000 - 392 = 608 tasks, not yet executed, for the shared resources • Executes the remaining 608 tasks on 18 machines (Sites 1 and 2): tasks/machines = 608/18 ≈ 34 • Optimal value = (56 + 34) * Duration of the Task

  32. Calculation of the Expected Value • The expected value when the applications execute together is based on the actual value measured when the application executed alone • Example (same HDS policies): actual value = 57.20 (executing alone on 18 machines) • The expected value when executing on 7 machines + 11 shared machines follows by a rule of three: 57.20 corresponds to 18 machines, x corresponds to 12.5 machines (7 + 5.5) • Expected value = (57.20 * 18)/12.5 = 82.36
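In other words, the expected value is just the stand-alone execution time scaled by the ratio of effective machine counts:

```latex
\[
  T_{\mathrm{expected}} = T_{\mathrm{alone}} \cdot \frac{N_{\mathrm{alone}}}{N_{\mathrm{together}}}
                        = 57.20 \cdot \frac{18}{12.5} \approx 82.4
\]
```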

  33. Calculation of the Expected Value • The expected value when the applications execute together is based on the actual value measured when the application executed alone • Example (different HDS policies): actual value = 57.67 (executing alone on 18 machines) • At this point PSwp2 is considered to have finished its execution • So PSwp1 has already executed 56 tasks on each idle resource (7 machines): 392 tasks executed • Leaving 1000 - 392 = 608 tasks, not yet executed, for the shared resources • PSwp1 executes the 608 tasks on 18 machines (Sites 1 and 2): tasks/machines = 608/18 ≈ 34 • Expected value = 57.67 + 34 = 91.67

  34. Related Work • According to Buyya et al., the scheduling subsystem may be • Centralised • A single scheduler is responsible for grid-wide decisions • Not scalable • Hierarchical • Scheduling is divided into several levels, permitting different scheduling policies at each level • Failure of the top-level scheduler results in failure of the entire system • Decentralised • Various decentralised schedulers communicate with each other, offering portability and fault tolerance • Good individual scheduling may not lead to good overall performance

  35. Related Work • OurGrid • Employs two different schedulers • A UMS job manager, responsible for allocating application jobs on grid resources • An RMS site manager in charge of enforcing particular site-specific policies • No information about the application is used • GridFlow • Is an RMS which focuses on service-level scheduling and workflow • Management takes place at three levels: global, local and resource

  36. Related Work • GrADS • Is an RMS that utilises Autopilot to monitor adherence to a performance contract between the application’s demands and the resources’ capabilities • If the contract is violated, the rescheduler takes corrective action either by suspending execution, computing a new schedule, migrating processes and then restarting, or by process swapping • AppLeS • Is an AMS in which individual scheduling agents embedded into the application perform adaptive scheduling • Given the user’s goals, these centralised agents use application characteristics and system information to select viable resources
