
Scheduling for Network Enabled Server Systems


  1. Scheduling for Network Enabled Server Systems Frédéric Desprez LIP ENS Lyon, INRIA Rhône-Alpes, GRAAL project http://graal.ens-lyon.fr

  2. Introduction • Grid: one vision for large-scale computing over heterogeneous platforms, coined by Ian Foster in the mid-90s • Transparency and simplicity are the holy grail (maybe even before performance)! • Scheduling tunability to take into account the characteristics of specific application classes • Several applications ready (and not only number-crunching ones!) • Many incarnations of the Grid (metacomputing, cluster computing, global computing, peer-to-peer systems, Web Services, …) • Hundreds of research projects around the world • Significant technology base • Do not forget good ol' time research on scheduling and distributed systems! • Thanks to Yves for reminding us of this important fact over the last 14 years! • Most scheduling problems are very difficult to solve, even in their simplest form

  3. Stages of Grid scheduling [Architecture diagram: an interface to the scheduler governed by price, naming, performance, and security policies; a Resource Information Service; a resource repository and resource monitor fed by a sensor manager, agents, and sensors running over the distributed resources] • Resource discovery • Authorization filtering, application requirement definition, minimal requirement filtering • System selection • Dynamic information gathering, system selection • Job execution • Advance reservation (optional), job submission, preparation tasks, monitoring progress, job completion, cleaning tasks. Credit: Jennifer Schopf, Shifeng Zhang

  4. Some examples of middleware architectures • Many environments provide efficient resource scheduling at various levels • GridSolve, AppLeS, Condor, APST, (V)GrADS, WebComG, Nimrod-G, Ninf, NetSolve, XtremWeb, Cactus, Legion, United Devices, Platform, JiPang, SGE, OAR, Globus, Avaki, Entropia, BOINC, Javelin, PBS Pro, PUNCH, European DataGrid, Bond, 2K, MAUI, DIET, …

  5. RPC and Grid-Computing: GridRPC • One simple idea • One simple (and efficient) paradigm for grid computing: offering (or leasing) computational power and/or storage capacity through the Internet • One simple solution: implementing the RPC programming model over the Grid • Using resources accessible through the network • Mixed parallelism model (data-parallel model at server level and task parallelism between servers) • Features needed • Load-balancing (resource localization and performance evaluation, scheduling), • IDL, • Data and replica management, • Security, • Fault-tolerance, • Interoperability with other systems, • … • Design of a standard interface • within the GGF (GridRPC WG and now SAGA WG) • Existing implementations: NetSolve, Ninf, DIET, XtremWeb, OmniRPC, …
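
To make the GridRPC programming model concrete, here is a minimal client sketch following the GGF GridRPC API (grpc_initialize, grpc_function_handle_default, grpc_call, grpc_finalize). The header name, the configuration file, the "dgemm" service name and the exact argument list of grpc_call are assumptions: they depend on the implementation (NetSolve, Ninf, DIET, …) and on the service's IDL.

    /* Minimal GridRPC client sketch; names marked "assumed" are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "grpc.h"              /* assumed header name of the client library */

    int main(void) {
      grpc_function_handle_t handle;
      int n = 2000;
      double *A = calloc((size_t)n * n, sizeof(double));
      double *B = calloc((size_t)n * n, sizeof(double));
      double *C = calloc((size_t)n * n, sizeof(double));

      if (grpc_initialize("client.cfg") != GRPC_NO_ERROR)  /* contact the agent(s) */
        return EXIT_FAILURE;

      /* Let the middleware choose the best server for the "dgemm" service. */
      grpc_function_handle_default(&handle, "dgemm");

      /* Synchronous remote call, conceptually C <- A x B on the chosen server;
       * the actual argument list is defined by the service's IDL (assumed here). */
      if (grpc_call(&handle, n, A, B, C) != GRPC_NO_ERROR)
        fprintf(stderr, "remote dgemm call failed\n");

      grpc_function_handle_destruct(&handle);
      grpc_finalize();
      free(A); free(B); free(C);
      return 0;
    }

An asynchronous variant (grpc_call_async / grpc_wait) lets a client overlap several requests, which is how task parallelism between servers is obtained on top of the data-parallel servers.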

  6. RPC and Grid Computing: GridRPC [Diagram: the client sends a request for Op(C, A, B) to the agent(s); the agent answers with the selected server (here S2, among S1–S4); the client then sends A and B to S2, which runs the operation and returns the answer C.]

  7. DIET Architecture [Diagram: a client contacts a hierarchy of Master Agents (MA), possibly interconnected through JXTA, and Local Agents (LA); server front ends sit at the leaves of the hierarchy; the FAST library provides application modeling and system availabilities, relying on LDAP and NWS.] Release 2.1 available on the web

  8. Requests Management [Diagram: the request travels down the agent hierarchy through FindServer(); each server runs estimate() { predExecTime(…); }; on the way back up each agent runs Aggregate() { min(…); }; the best server (here bestServer = S3) is returned to the client, which then calls runService(…) on it.]
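
The descent/aggregation logic of this slide can be sketched as a simple tree traversal. Everything below (types and function names) is illustrative, not the DIET API: servers return an estimated execution time, and agents keep the minimum on the way back up.

    /* Illustrative sketch of slide 8: find_server() walks down the agent tree,
     * servers (leaves) answer with an estimate, agents aggregate with min(). */
    #include <float.h>
    #include <stddef.h>

    typedef struct node {
      int           is_server;    /* leaf = server (SeD), internal node = agent */
      int           server_id;    /* meaningful when is_server                  */
      struct node **children;     /* meaningful when !is_server                 */
      size_t        n_children;
    } node_t;

    typedef struct { int server_id; double est_time; } answer_t;

    /* Server-side estimation, e.g. a predicted execution time (predExecTime). */
    double estimate(const node_t *server) {
      (void)server;
      return 1.0;                 /* placeholder for a real prediction */
    }

    /* FindServer(): depth-first descent, min() aggregation on the way back up. */
    answer_t find_server(const node_t *n) {
      if (n->is_server)
        return (answer_t){ n->server_id, estimate(n) };
      answer_t best = { -1, DBL_MAX };
      for (size_t i = 0; i < n->n_children; i++) {
        answer_t a = find_server(n->children[i]);
        if (a.est_time < best.est_time)      /* Aggregate() { min(...); } */
          best = a;
      }
      return best;                           /* e.g. bestServer = S3 */
    }

The client then calls runService() on the returned server; in the real system the estimate comes from FAST or from a plug-in scheduler (described later).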

  9. Research Topics • Scheduling • Distributed scheduling • Software platform deployment with or without dynamic connections between components • Plug-in schedulers • Data management • Scheduling of computation requests and links with data management • Replication, data prefetching • Workflow scheduling • Performance evaluation • Application modeling • Dynamic information about the platform (network, clusters) • Applications • Bioinformatics, geology, physics, chemical engineering, sparse solver evaluation, applied maths, …

  10. Platform Deployment

  11. Platform Deployment • Problem: mapping a middleware across many remote resources • Deployment phases • Resource discovery, planning, resource selection, remote file installation, pre-configuration, launch, post-configuration • And all this as automatically as possible • Our objective: find an optimal deployment of agents and servers onto a set of (dedicated) resources • An optimal deployment is a deployment that provides the maximum throughput ρ • ρ is the throughput of the platform, measured in number of requests completed per second

  12. Deployment Management [Diagram: distributed deployment of DIET; an XML description (resources, machines, storage, DIET hierarchy) drives GoDIET from the DIET administration side; traces are collected by LogService and a trace subset is visualized with VizDIET.]

  13. Optimal Deployment Objective: find an optimal deployment of agents and servers for a set of resources V • Maximum throughput ρ of completed requests per second • maximizing the steady-state throughput (req/s) • Scheduling request throughput ρ_sched • Service request throughput ρ_service Lemma 1: The completed request throughput ρ of a deployment is given by the minimum of the scheduling request throughput ρ_sched and the service request throughput ρ_service: ρ = min(ρ_sched, ρ_service) Lemma 2: The scheduling throughput ρ_sched is limited by the throughput of the agent with the highest degree Lemma 3: The service request throughput ρ_service increases as the number of servers included in a deployment increases. P.K. Chouhan, H. Dail, E. Caron, F. Vivien, Automatic Middleware Deployment Planning on Clusters, International Journal of High Performance Computing Applications (IJHPCA), to appear.

  14. Complete Spanning D-ary Tree • A complete d-ary tree is a tree in which every level, except possibly the deepest, is completely filled. All internal nodes except possibly one have a degree (number of children) equal to d; the remaining internal node lies on the level just above the deepest one and may have any degree from 1 to d. • A spanning tree is a connected, acyclic subgraph containing all the vertices of a graph. • A complete spanning d-ary tree (CSD tree) is a tree that is both a complete d-ary tree and a spanning tree.

  15. Optimal Deployment - Theorems • Theorem: the optimal throughput ρ of any deployment with maximum degree dMax is obtained with a CSD tree. • By Lemma 1, ρ = min(ρ_sched, ρ_service) • By Lemma 2, ρ_sched is limited by the agent with maximum degree • By Lemma 3, ρ_service increases with card(S) • Theorem: the complete spanning d-ary tree whose degree maximizes the minimum of the scheduling request and service request throughputs is an optimal deployment • Test all possible degrees • Select the degree maximizing min(ρ_sched, ρ_service)
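
The "test all possible degrees" step can be written down directly. The sketch below assumes a node count n = |V|, a maximum degree dMax, and two placeholder throughput functions rho_sched(d) and rho_service(s); the real functions come from the platform model of the cited paper, so the numbers here are purely illustrative.

    /* Illustrative degree selection for a CSD tree (slide 15). The rho_* models
     * are placeholders; only the min/max structure reflects Lemmas 1-3. */
    #include <stdio.h>

    /* Throughput (req/s) an agent of degree d can schedule (assumed model). */
    double rho_sched(int d)   { return 100.0 / d; }
    /* Throughput (req/s) delivered by s servers (assumed model). */
    double rho_service(int s) { return 2.0 * s; }

    /* Leaves (servers) of a complete spanning d-ary tree built on n nodes:
     * the n - 1 children are spread over ceil((n-1)/d) internal nodes (agents). */
    int csd_leaves(int n, int d) { return n - (n - 1 + d - 1) / d; }

    /* Try every degree up to d_max and keep the one maximizing
     * min(rho_sched, rho_service), per Lemma 1. */
    int best_degree(int n, int d_max, double *best_rho) {
      int best_d = 1; double best = -1.0;
      for (int d = 1; d <= d_max; d++) {
        double rs  = rho_sched(d);
        double rv  = rho_service(csd_leaves(n, d));
        double rho = rs < rv ? rs : rv;
        if (rho > best) { best = rho; best_d = d; }
      }
      if (best_rho) *best_rho = best;
      return best_d;
    }

    int main(void) {
      double rho;
      int d = best_degree(55, 10, &rho);   /* e.g. the 55-node Lyon cluster */
      printf("best degree = %d, predicted throughput = %.1f req/s\n", d, rho);
      return 0;
    }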

  16. Simulation and experimental design • Model parametrization • Experimental setup • Software: GoDIET is used for the deployment • Job type: DGEMM, a simple matrix multiplication (BLAS package) • Workload: steady-state load with 1–200 client scripts (each script launches requests serially) • Resources: dual AMD Opteron 246 processors @ 2 GHz, each node with a 1024 KB cache, 2 GB of main memory and a 1 Gb/s Ethernet • Lyon cluster - 55 nodes • Sophia cluster - 140 nodes

  17. Throughput validation - DGEMM 1000, bandwidth 190 Mb/s

  18. Throughput validation - DGEMM 310, 45 nodes

  19. Summary • Determine how many nodes should be used and design the hierarchical organization • Proved that the optimal deployment is a CSD tree • Algorithm to construct the optimal tree • Deployment prediction is easy, fast and scalable • Experiments validate the model • Currently being addressed • Heuristics for heterogeneous platforms • Automatic generation of deployment scripts • Automatic redeployment

  20. Plugin Schedulers

  21. FAST: Fast Agent's System Timer [Diagram: the client application uses the FAST API; FAST combines static data acquisition (LDAP, BDB) and dynamic data acquisition (NWS, …) on top of low-level software.] • Performance evaluation of the platform makes it possible to find an efficient server (redistribution and computation costs) without testing every configuration → a performance database for the scheduler • Based on NWS (Network Weather Service) • Static data • Computer: memory amount, CPU speed, batch system • Network: bandwidths, latencies, topology, protocols • Dynamic data • Computer: status (up or down), load, memory, batch queue status • Network: bandwidths, latencies • Computation: feasibility, execution time on a given architecture

  22. Plugin Schedulers • "First" version of DIET performance management • Each SeD answers a profile (COMP_TIME, COMM_TIME, TOTAL_TIME, AVAILABLE_MEMORY) for each request • Profile is filled by FAST • Local Agents sort the results by execution time and send them back up to the Master Agent • Limitations • Limited availability of FAST/NWS • Hard to install and configure • Priority given to FAST-enabled servers • Extensions hard to handle • No support for non-standard, application- and platform-specific performance measures • Firewall problems with some performance evaluation tools • No use of integrated performance estimators (e.g., Ganglia)

  23. DIET Scheduling • SeD level • Performance estimation function • Estimation Metric Vector (estVector_t) - dynamic collection of performance estimation values • Performance measures available through DIET • FAST-NWS performance metrics • Time elapsed since the last execution • CoRI (Collector of Resource Information) • Developer-defined values • Standard estimation tags for accessing the fields of an estVector_t • EST_FREEMEM • EST_TCOMP • EST_TIMESINCELASTSOLVE • EST_FREECPU • Aggregation Methods • Define how SeD responses are sorted: associated with the service and defined at SeD level • Tunable comparison/aggregation routines for scheduling • Priority Scheduler • Performs pairwise server estimation comparisons, returning a sorted list of server responses • Can minimize or maximize based on the SeD estimations, taking into account the order in which those performance estimations were specified at the SeD level (see the sketch below)
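
A plug-in scheduler boils down to two ingredients: a SeD-side estimation function that fills the estimation vector, and an agent-side aggregation/comparison rule. The sketch below only shows the shape of the first ingredient; estVector_t and the EST_* tags are named on the slide, but the helper functions (est_set, the collectors, the prediction) and the enum values are hypothetical stand-ins, not the real DIET signatures.

    /* Hedged sketch of a SeD-level performance estimation function.
     * The helpers below are HYPOTHETICAL; consult the DIET manual for the
     * actual API used to fill an estVector_t. */
    #include <time.h>

    typedef struct est_vector *estVector_t;                 /* opaque, per slide */

    enum { EST_FREECPU, EST_FREEMEM, EST_TCOMP, EST_TIMESINCELASTSOLVE };

    /* Hypothetical setter and collectors (stub bodies so the sketch compiles). */
    void   est_set(estVector_t ev, int tag, double v) { (void)ev; (void)tag; (void)v; }
    double collect_free_cpu(void)     { return 0.8;   }     /* e.g. via CoRI */
    double collect_free_mem_mb(void)  { return 512.0; }     /* e.g. via CoRI */
    double predicted_comp_time(int n) { return 1e-9 * (double)n * n * n; } /* e.g. via FAST */

    static time_t last_solve;            /* updated each time the SeD solves */

    /* Called for each incoming request before the SeD answers its agent. */
    void my_perf_metric(int problem_size, estVector_t ev) {
      est_set(ev, EST_FREECPU,  collect_free_cpu());
      est_set(ev, EST_FREEMEM,  collect_free_mem_mb());
      est_set(ev, EST_TCOMP,    predicted_comp_time(problem_size));
      est_set(ev, EST_TIMESINCELASTSOLVE, difftime(time(NULL), last_solve));
    }

On the agent side, a priority scheduler then sorts the answers by one of these fields (minimizing EST_TCOMP, maximizing EST_FREECPU, and so on), in the order specified at the SeD level.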

  24. DIET Scheduling • Collector of Resource Information (CoRI) • Interface to gather performance information • Functional requirements • Set of basic metrics • One single access interface • Non-functional requirements • Extensibility • Accuracy and latency • Non-intrusiveness • Currently 2 modules available • CoRI-Easy • FAST • Extension possibilities: Ganglia, Nagios, R-GMA, Hawkeye, INCA, MDS, … [Diagram: a CoRI Manager federates the CoRI-Easy collector, the FAST collector (on top of the FAST software), and possibly other collectors such as Ganglia.]

  25. CoRI-Easy • Here "easy" means basic (/proc information) • Provided information:
  start printing CoRI values..
  CPU 0 cache : 1024 Kb
  number of processors : 1
  CPU 0 BogoMips : 5554.17
  cpu average load : 0.56
  free cpu : 0.2
  disk speed in reading : 9.66665 Mbyte/s
  disk speed in writing : 3.38776 Mbyte/s
  total disk size : 7875.51 Mb
  available disk size : 373.727 Mb
  total memory : 1011.86 Mb
  available memory : 22.5195 Mb
  end printing CoRI values
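
As a rough idea of how a CoRI-Easy style collector can obtain some of these values from /proc on Linux, here is a minimal sketch covering only the load average and memory figures (the real collector also reports BogoMips, cache size, disk speed, and so on); error handling is deliberately minimal.

    /* Minimal /proc-based collector sketch (Linux only), in the spirit of
     * CoRI-Easy: read the 1-minute load average and the memory figures. */
    #include <stdio.h>

    int main(void) {
      double load1 = 0.0;
      long mem_total_kb = 0, mem_free_kb = 0;
      char line[256];

      FILE *f = fopen("/proc/loadavg", "r");
      if (f) { fscanf(f, "%lf", &load1); fclose(f); }  /* 1-minute load average */

      f = fopen("/proc/meminfo", "r");
      if (f) {
        while (fgets(line, sizeof line, f)) {
          sscanf(line, "MemTotal: %ld kB", &mem_total_kb);
          sscanf(line, "MemFree: %ld kB", &mem_free_kb);
        }
        fclose(f);
      }

      printf("cpu average load : %.2f\n", load1);
      printf("total memory : %.2f Mb\n", mem_total_kb / 1024.0);
      printf("available memory : %.2f Mb\n", mem_free_kb / 1024.0);
      return 0;
    }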

  26. CPU Scheduler • Request interleave time = 1 minute • Round-robin scheduler

  27. CPU Scheduler • Request interleave time = 1 minute • CPU scheduler

  28. CPU Scheduler • Request interleave time = 5 seconds • Round-robin scheduler

  29. CPU Scheduler • Request interleave time = 5 seconds • CPU scheduler

  30. Large Scale Deployment over Grid’5000

  31. Goals and Protocol of the Experiment Grid'5000 • Validation of the DIET architecture at large scale over different administrative domains • Protocol • DIET deployment over as many processors as possible • Large number of clients • Comparison of the DIET execution times with average local execution times • 1 MA, 8 LA, 540 SeDs • 2 requests/SeD • 1120 clients on 140 machines • DGEMM requests (2000x2000 matrices) • Simple round-robin scheduling using time_since_last_solve
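
The "round-robin via time_since_last_solve" trick amounts to always picking the candidate SeD that has been idle the longest, which cycles through the servers over successive requests. A minimal illustration follows; the types and selection code are illustrative only, since in DIET this is expressed through the estimation vector and an aggregation method.

    /* Illustrative selection rule: choose the SeD with the largest
     * time_since_last_solve, i.e. the least recently used server. */
    #include <stddef.h>

    typedef struct { int id; double time_since_last_solve; } sed_answer_t;

    size_t pick_round_robin(const sed_answer_t *answers, size_t n) {
      size_t best = 0;
      for (size_t i = 1; i < n; i++)
        if (answers[i].time_since_last_solve > answers[best].time_since_last_solve)
          best = i;
      return best;   /* over many requests this cycles through the servers */
    }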

  32. Grid’5000 Results

  33. Conclusion and Future Work

  34. We did not talk about … • Distributed scheduling • P2P deployment of agents, distributed service discovery, application-dependent scheduling, batch schedulers, workflow scheduling, joint data and request scheduling • Adding services • Registering new applications • Performance evaluation • Routine/application cost, data (re)distribution, computation of the optimal number of processors used on the servers • Fault tolerance • Agents, servers, checkpointing • Security! • Authentication, communications, firewalls, … • Applications • Simulation (physics, chemical engineering, …), robotics, bioinformatics, geology, applied maths, sparse matrix expert site, …

  35. GridRPC and DIET • GridRPC • Interesting approach for several applications • Simple, flexible, and efficient • Many interesting research issues (scheduling, data management, resource discovery and reservation, deployment, fault-tolerance, …) • DIET • Scalable, open-source, and multi-application platform • Concentration on several issues like resource discovery, scheduling (distributed scheduling, plugin schedulers, workflow management), deployment (GoDIET), performance evaluation (FAST and Freddy), monitoring (LogService and VizDIET), data management and replication (DTM and JuxMem) • Large scale validation on the Grid’5000 platform http://graal.ens-lyon.fr/DIET http://www.grid5000.org/

  36. General Conclusions • Still room for fundamental research on algorithms (many 'little scheduling problems' to come)! • Still (some) problems left • Finding accurate models • Large-scale validation of algorithms (simulators, real grids?) • Many efficient algorithms available in the literature (but what is their actual cost?) • Adaptivity is mandatory! • Take a look at applications outside the (small) numerical simulation market! • Do we need application-specific schedulers (and application-specific middleware platforms)? • Need for implementation and validation of algorithms from the scheduling literature in real-life middleware infrastructures • Use simple middleware platforms (like APST from UCSD)!

  37. Questions ?
