High Performance Cluster Computing Architectures and Systems

  1. High Performance Cluster Computing: Architectures and Systems Hai Jin Internet and Cluster Computing Center

  2. Scheduling Parallel Jobs on Clusters • Introduction • Background • Rigid Jobs with Process Migration • Malleable Jobs with Dynamic Parallelism • Communication-Based Coscheduling • Batch Scheduling • Summary

  3. Introduction (I) • Clusters are increasingly being used for HPC applications • High cost of MPPs • Wide availability of networked workstations and PCs • The challenge is how to add the HPC workload • To the original general-purpose workload on the cluster • Without degrading the service of the original workload

  4. Introduction (II) • The issues in supporting HPC applications • The acquisition of resources • how to distinguish between workstations that are in active use and those with spare resources available • The requirement to give priority to workstation owners • must not cause noticeable degradation of their work • The requirement to support different styles of parallel programs • each style places different constraints on the scheduling of its processes • The possible use of admission control and scheduling policies • to regulate the additional HPC workload • These issues are interdependent

  5. Background (I) • Cluster Usage Modes • NOW (Network Of Workstations) • Based on tapping the idle cycles of existing resources • Each machine has an owner • When the owner is inactive, the resources become available for general use • Examples: Berkeley NOW, Condor, and MOSIX • PMMPP (Poor Man’s MPP) • A dedicated cluster acquired for running HPC applications • Fewer constraints on the interplay between the regular workload and the HPC workload • Examples: the Beowulf project, the RWC PC cluster, and ParPar • The focus here is on scheduling in a NOW environment

  6. Background (II) • Job Types and Requirements • Job structure and interaction types place various requirements on the scheduling system • Three most common types • Rigid jobs with tight coupling • typical of MPP environments • a fixed number of processes • communicate and synchronize at a high rate • best served by a dedicated partition of the machine for each job • if partitions are not dedicated and time slicing is used, gang scheduling is needed

  7. Background (III) • Job Types and Requirements • Three most common types (continued) • Rigid jobs with balanced processes and loose interactions • do not require that the processes execute simultaneously • do require that the processes progress at about the same rate • Jobs structured as a workpile of independent tasks • executed by a number of worker processes that take tasks from the workpile and execute them, possibly creating new tasks in the process • a very flexible model that allows the number of workers to change at runtime • leads to malleable jobs that are very well suited to a NOW environment (sketched below)
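
The workpile model is easy to make concrete. Below is a minimal, self-contained C sketch using POSIX threads on a single machine; the task array, worker count, and the squaring "computation" are invented for illustration (a real NOW system spreads the workers across workstations rather than threads):

```c
/* Workpile sketch: workers repeatedly grab the next task from a
 * shared pile until it is empty. Compile with: cc -pthread wp.c */
#include <pthread.h>
#include <stdio.h>

#define NTASKS   16
#define NWORKERS 4

static int tasks[NTASKS];          /* the workpile: task inputs      */
static int results[NTASKS];        /* one result slot per task       */
static int next_task = 0;          /* index of the next untaken task */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int t = (next_task < NTASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&lock);
        if (t < 0)                         /* pile empty: worker exits */
            break;
        results[t] = tasks[t] * tasks[t];  /* stand-in "computation"   */
        printf("worker %ld did task %d\n", id, t);
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NWORKERS];
    for (int i = 0; i < NTASKS; i++)
        tasks[i] = i;
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);
    return 0;
}
```

Because workers interact only through the shared pile, they can join or leave at any time; this independence is what makes such jobs malleable.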

  8. Rigid Jobs with Process Migration • Process Migration • The subsystem responsible for the HPC applications does not have full control over the system • Process migration involves remapping processes to processors during execution • Reasons for migration • the need to relinquish a workstation and return it to its owner • the desire to achieve a balanced load on all workstations • Metrics • migration overhead, and how completely the process detaches from its source node • Algorithmic aspects • which process to migrate • where to migrate it • These decisions depend on the data available about the workload on each node • raising issues of load measurement and information dissemination

  9. Case Study: PVM with Migration (I) • PVM is a software package for writing and executing parallel applications on a LAN • communication / synchronization operations • configuration control • dynamic spawning of processes • To create a virtual parallel machine • a user spawns PVM daemon processes on a set of workstations • the daemons establish communication links among themselves • this creates the infrastructure of the parallel machine • PVM distributes its processes in a round-robin manner among the workstations being used • this may create unbalanced loads and lead to unacceptable degradation in service
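
For concreteness, this is roughly what spawning looks like with the standard PVM 3 library (a minimal sketch; the executable name "worker" and the task count of 4 are placeholders):

```c
/* Sketch: a PVM master asks the daemons to spawn worker processes.
 * With PvmTaskDefault, PVM itself chooses the hosts (round-robin),
 * which is precisely the placement that can unbalance the load. */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int tids[4];
    int mytid = pvm_mytid();   /* enroll this process in the virtual machine */
    int n = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);
    printf("master t%x spawned %d of 4 workers\n", mytid, n);
    pvm_exit();                /* leave the virtual machine */
    return 0;
}
```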

  10. [Figure] Communication in PVM is mediated by a daemon (pvmd) on each node: local processes talk to their pvmd, and the pvmds communicate with each other.

  11. Case Study: PVM with Migration (II) • Several experimental versions of PVM exist • Migratable PVM and Dynamic PVM • include migration in order to move processes to more suitable locations • PVM has also been coupled with MOSIX and Condor to achieve similar benefits

  12. Case Study: PVM with Migration (III) • Migration decisions • made by a global scheduler • based on information regarding load and owner activity • Four steps • the global scheduler notifies the responsible PVM daemon that one of its processes should be migrated • the daemon notifies all the other PVM daemons about the pending migration • the process state is transferred to the new location, where a new process is created • the new process connects to the local PVM daemon at the new location and notifies all the other PVM daemons • Migration in this system is asynchronous • it can happen at any time • it affects only the migrating process and any processes that try to communicate with it

  13. [Figure] Migratable PVM. On a "migrate VP1 to host2" command from the global scheduler (GS), messages sent to VP1 are flushed and further sends to it are blocked; the process state (code, data, heap, and stack) is transferred over UDP to a skeleton process VP1’ on Host2; VP1’ is then restarted and sends to it are unblocked.

  14. Case Study: MOSIX (I) • Multicomputer Operating System for unIX • Supports adaptive resource sharing in a scalable computing cluster through dynamic process migration • Based on a Unix kernel augmented with • a process migration mechanism • a scalable facility for distributing load information • All processes enjoy about the same level of service • both sequential jobs and the components of parallel jobs • Maintains a balanced load on all the workstations in the cluster

  15. [Figure] The MOSIX infrastructure: a Unix kernel (BSD or Linux) augmented with Preemptive Process Migration (PPM) and Adaptive Resource Sharing Algorithms (ARSA).

  16. Case Study: MOSIX (II) • Load information is distributed using a randomized algorithm • Each workstation maintains a load vector with data about its own load and the loads of other machines • At certain intervals (e.g., once every minute), it sends this information to another, randomly selected machine • With high probability, it will also be selected by some other machine and thus receive such a load vector • If some other machine turns out to have a significantly different load, a migration operation is initiated (sketched below)
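
The exchange can be sketched in C as below. This is an illustration of the idea, not MOSIX source; the cluster size, threshold, naive merge, and stub functions are all assumptions made for the example:

```c
/* Randomized load-vector dissemination (illustrative sketch): each
 * interval a node refreshes its own entry, sends its vector to one
 * random peer, and migrates if it learns of a much lighter node. */
#include <stdio.h>
#include <stdlib.h>

#define NNODES    32                  /* cluster size: assumed          */
#define THRESHOLD 2.0                 /* load gap triggering migration  */

static double load_vec[NNODES];       /* slot i = last known load of i  */
static int    self = 0;               /* this node's id                 */

/* Hypothetical stand-ins for the real mechanisms: */
static double local_load(void)           { return rand() % 8; }
static void   send_vector(int peer)      { printf("send vector to %d\n", peer); }
static void   migrate_process_to(int to) { printf("migrate a process to %d\n", to); }

static void dissemination_step(void)
{
    load_vec[self] = local_load();    /* refresh our own entry          */
    int peer = rand() % NNODES;       /* pick a random partner          */
    if (peer != self)
        send_vector(peer);            /* ship the whole vector          */
}

static void on_vector_received(const double *vec)
{
    for (int i = 0; i < NNODES; i++)  /* naive merge: adopt the peer's  */
        if (i != self)                /* view of all other nodes        */
            load_vec[i] = vec[i];
    for (int i = 0; i < NNODES; i++)  /* significantly lighter node?    */
        if (load_vec[self] - load_vec[i] > THRESHOLD) {
            migrate_process_to(i);
            break;
        }
}

int main(void)
{
    dissemination_step();             /* one round, for illustration    */
    double peer_view[NNODES] = {0};
    on_vector_received(peer_view);
    return 0;
}
```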

  17. Case Study: MOSIX (III) • Migration is based on the home-node concept • each process has a home node: its own workstation • When a process is migrated, it is split into two parts • the body and the deputy • The body contains • all the user-level context • the site-independent kernel context • it is migrated to another node • The deputy contains • the site-dependent kernel context • it is left on the home node • A communication link is established between the two parts, so that the process can access its local environment via the deputy and other processes can access it • Running PVM over MOSIX leads to improvements in performance

  18. [Figure] Process migration in MOSIX divides the process into a migratable body (moved from the overloaded to the underloaded workstation) and a site-dependent deputy, which remains in the home node; site-dependent system calls are forwarded to the deputy, and return values and signals flow back to the body.

  19. Malleable Jobs with Dynamic Parallelism • Parallel jobs should adjust to the available resources • workstations are reclaimed by their owners at unpredictable times • This view emphasizes the dynamics of workstation clusters • parallel jobs should adjust to such varying resources

  20. Identifying Idle Workstations • Use only idle workstations • Using idle workstations to run parallel jobs requires • the ability to identify idle workstations • e.g., by monitoring keyboard and mouse activity (a minimal check is sketched below) • the ability to retreat from a workstation • when a workstation has to be given back, the worker on it is killed and its tasks are reassigned to other workers
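
One traditional Unix approximation of "no keyboard or mouse activity" is the trick used by w(1): check how long ago the user's terminal device was last read. A minimal sketch, assuming a tty path and a 15-minute idleness threshold (a real system would also consult load averages and X11 input):

```c
/* Idle-workstation check (sketch): a tty whose access time is old
 * enough suggests the owner has stopped typing. */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

#define IDLE_SECS (15 * 60)   /* assumed "workstation is idle" bound */

static int tty_is_idle(const char *tty_path)
{
    struct stat st;
    if (stat(tty_path, &st) != 0)
        return -1;                              /* no such tty        */
    time_t idle = time(NULL) - st.st_atime;     /* secs since input   */
    return idle >= IDLE_SECS;
}

int main(void)
{
    printf("idle: %d\n", tty_is_idle("/dev/tty1"));  /* placeholder path */
    return 0;
}
```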

  21. Case Study: Condor and WoDi (1) • Condor is a system for running batch jobs in the background on a LAN, using idle workstations • when a workstation is reclaimed by its owner • the batch process is suspended • and later restarted from a checkpoint on another node • Condor is the basis for the LoadLeveler product used on IBM workstations

  22. Case Study: Condor and WoDi (2) • Condor was augmented with CARMI • CARMI (Condor Application Resource Management Interface) • allows jobs • to request additional resources • to be notified if resources are taken away

  23. Case Study: Condor and WoDi (3) • WoDi (Work Distributor) • supports simple programming of master-worker applications • the master process sends work requests (tasks) to the WoDi server, which distributes them to the workers

  24. [Figure] Master-worker applications use the WoDi server to coordinate task execution: the application master sends tasks to the WoDi server and collects results; application workers take tasks from the server and return results to it; resource requests flow through CARMI servers to the Condor scheduler, which handles allocation and spawns workers.

  25. Case Study: Piranha and Linda (1) • Linda • A parallel programming language • A coordination language that can be added to Fortran or C • Based on an associative tuple space that acts as a distributed data repository • Parallel computations are created by injecting unevaluated tuples into the tuple space • The tuple space can also be viewed as a workpile of independent tuples that need to be evaluated

  26. Case Study: Piranha and Linda (2) • Piranha • a system for executing Linda applications on a NOW • programs that run under Piranha must include three special user-defined functions • feeder, piranha, and retreat • The piranha function • executed automatically on idle workstations, it transforms work tuples (w) into result tuples (r) • The feeder function • generates the work tuples (w) • The retreat function • called when a workstation is reclaimed by its owner

  27. [Figure] Piranha programs include a feeder function that generates work tuples (w) and piranha functions that are executed automatically on idle workstations and transform the work tuples into result tuples (r). If a workstation is reclaimed, the retreat function is called to return the unfinished work to the tuple space.

  28. Communication-Based Coscheduling • If the processes of a parallel application communicate and synchronize frequently, they should execute simultaneously on different processors • this saves the overhead of frequent context switches • and reduces the need for buffering during communication • Combined with time slicing • as provided by gang scheduling • Gang scheduling implies that the participating processes are known in advance • the alternative is to identify them during execution • then only a subset of the processes may be scheduled together, leading to coscheduling rather than gang scheduling

  29. Demand-Based Coscheduling (1) • The decision about which processes should be scheduled together is based on actual observations of their communication patterns • Requires the cooperation of the communication subsystem • it monitors the destination of incoming messages • and raises the priority of the destination process • so the sender may end up coscheduled with the destination process • Problem • raising the priority of any process that receives a message is unfair when multiple parallel jobs co-exist • the solution: epoch numbers

  30. [Figure] Epoch numbers allow a parallel job to take over the whole cluster in the face of another active job: a spontaneous switch on one node raises its epoch from 1 to 2, and the new epoch then spreads to the other nodes (processes P1-P6) along with the job's messages.

  31. Demand-Based Coscheduling (2) • The epoch number on each node is incremented whenever a spontaneous context switch is made • i.e., one that is not the result of an incoming message • The epoch number is appended to all outgoing messages • When a node receives a message, it compares its local epoch number with the one in the incoming message • it switches to the destination process only if the incoming epoch number is greater (see the sketch below)
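
A small C sketch of this rule follows. The message structure and switch_to() stand-in are invented, and adopting the incoming epoch on a switch is one plausible reading; the slides only specify the comparison itself:

```c
/* Epoch rule for demand-based coscheduling (sketch): spontaneous
 * switches bump the local epoch, outgoing messages carry it, and an
 * incoming message preempts only if its epoch is newer. */
#include <stdio.h>

struct msg { int epoch; int dest_pid; /* ... payload ... */ };

static int local_epoch = 0;

static void switch_to(int pid) { printf("schedule pid %d\n", pid); }

void on_spontaneous_switch(void)
{
    local_epoch++;                 /* switch not caused by a message   */
}

int epoch_for_outgoing(void)
{
    return local_epoch;            /* stamped on all outgoing messages */
}

void on_message_arrival(const struct msg *m)
{
    if (m->epoch > local_epoch) {  /* sender's epoch is newer: yield   */
        local_epoch = m->epoch;    /* (assumed) adopt the newer epoch  */
        switch_to(m->dest_pid);    /* coschedule with the sender       */
    }
    /* otherwise keep running the current process, so a job cannot
     * grab the CPU merely by sending messages */
}
```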

  32. Implicit Coscheduling (1) • Explicit control may be unnecessary • Relies on standard Unix facilities (sockets) • Unix processes that perform I/O (including communication) get a higher priority • so processes participating in a communication phase will get high priority on their nodes • without any explicit measures being taken • Suits jobs that alternate between long phases of computation and phases of intensive communication

  33. Implicit Coscheduling (2) • Must make sure that a communicating process is not de-scheduled while it is waiting for a reply from another process • Uses two-phase blocking (spin blocking) • a waiting process initially busy-waits (spins) for some time, waiting for the anticipated response • if the response does not arrive within the pre-specified time, the process blocks and relinquishes its processor in favor of another ready process (sketched below) • Implicit coscheduling keeps processes in step only when they are communicating • during computation phases they do not need to be coscheduled
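
Two-phase blocking is easy to sketch; the spin budget and the reply_ready()/block_until_reply() primitives below are hypothetical stand-ins for the real communication layer:

```c
/* Two-phase (spin-then-block) waiting, as used by implicit
 * coscheduling: spin for about the expected round-trip time, then
 * block and release the processor. */
#include <stdbool.h>

#define SPIN_LIMIT 100000L            /* assumed spin budget (iterations) */

static bool reply_ready(void)       { return false; }  /* stub: poll transport */
static void block_until_reply(void) { /* stub: sleep in the kernel */ }

void wait_for_reply(void)
{
    /* Phase 1: busy-wait. If the peer is currently coscheduled, the
     * reply usually arrives within the spin, and we never block.    */
    for (long i = 0; i < SPIN_LIMIT; i++)
        if (reply_ready())
            return;

    /* Phase 2: the peer is probably not running right now, so give
     * up the CPU in favor of another ready process.                 */
    block_until_reply();
}

int main(void)
{
    wait_for_reply();
    return 0;
}
```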

  34. Batch Scheduling • There is a qualitative difference • between the work done by workstation owners and the parallel jobs that try to use the spare cycles • Workstation owners • do interactive work • require immediate response • Parallel jobs • are compute-intensive • run for long periods • so it is natural to queue them until suitable resources become available

  35. Admission Controls (1) • An HPC application places a heavy load on the system • Consideration for interactive users implies that such HPC applications be curbed if they hog the system • One option is to refuse to admit them into the system in the first place

  36. Admission Controls (2) • A more general solution: a batch scheduling system (e.g., DQS, PBS) • Defines a set of queues to which batch jobs are submitted • each queue contains jobs characterized by attributes • such as expected run time and memory requirements • The batch scheduler chooses jobs for execution • based on their attributes and the available resources • other jobs remain queued so as not to overload the system • A policy choice: use only idle workstations vs. use all workstations, with preference for those that are lightly loaded (a selection loop is sketched below)
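
The core of such a scheduler is a selection loop that matches job attributes against free resources. Everything in the sketch below (job fields, first-fit policy, resource units) is invented for illustration; real systems such as DQS and PBS add priorities, per-queue limits, and many more attributes:

```c
/* Batch-scheduler selection loop (sketch): start a queued job only
 * if the free resources cover its declared requirements. */
#include <stdio.h>

struct job {
    const char *name;
    int cpus_needed;
    int mem_mb_needed;
};

static void schedule(struct job *q, int n, int free_cpus, int free_mem_mb)
{
    for (int i = 0; i < n; i++) {
        if (q[i].cpus_needed <= free_cpus &&
            q[i].mem_mb_needed <= free_mem_mb) {
            printf("start %s\n", q[i].name);        /* dispatch the job      */
            free_cpus   -= q[i].cpus_needed;        /* reserve its resources */
            free_mem_mb -= q[i].mem_mb_needed;
        } else {
            printf("keep %s queued\n", q[i].name);  /* avoid overload        */
        }
    }
}

int main(void)
{
    struct job q[] = { {"simulate", 8, 4096}, {"render", 2, 1024} };
    schedule(q, 2, 8, 8192);   /* only "simulate" fits; "render" waits */
    return 0;
}
```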

  37. Case Study: Utopia/LSF (1) • Utopia is an environment for load sharing on large-scale heterogeneous clusters • a mechanism for collecting load information • a mechanism for transparent remote execution • a library for using them from applications • Collection of load information • done by a set of daemons, one on each node • the LIM (Load Information Manager) on the node with the lowest host ID acts as master • it collects load vectors and distributes them to all slave nodes • a load vector includes the recent CPU queue length, memory usage, and the number of users • the slaves can use this information to make placement decisions for new processes (see the sketch below) • a single centralized master does not scale to large systems
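
As a sketch, a load vector carrying the indices the slide lists, plus a trivial placement rule, might look as follows in C (the field names and shortest-queue policy are assumptions, not Utopia's actual code):

```c
/* LIM-style load vector and a simple placement decision (sketch). */
#include <stdio.h>

struct load_vec {
    double run_queue_len;   /* recent CPU queue length */
    long   mem_used_kb;     /* memory usage            */
    int    num_users;       /* logged-in users         */
};

/* Pick the node with the shortest CPU run queue. */
static int pick_node(const struct load_vec *v, int nnodes)
{
    int best = 0;
    for (int i = 1; i < nnodes; i++)
        if (v[i].run_queue_len < v[best].run_queue_len)
            best = i;
    return best;
}

int main(void)
{
    struct load_vec v[3] = { {1.5, 4096, 2}, {0.2, 2048, 1}, {3.0, 8192, 5} };
    printf("place new process on node %d\n", pick_node(v, 3));
    return 0;
}
```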

  38. [Figure] Utopia uses a two-level design to spread load information in large systems: nodes (n), each running a LIM (L), are grouped under a small set of strong servers (s1, s2, s3).

  39. Case Study: Utopia/LSF (2) • Support for load sharing across clusters is provided by communication among the master LIMs of the different clusters • This makes it possible to create virtual clusters that group together powerful servers that are physically dispersed across the system • Utopia’s batch scheduling system • queueing and allocation decisions are made by a master batch daemon, which is co-located with the master LIM • the actual execution and control of the batch processes • is done by slave batch daemons on the various nodes

  40. Summary (1) • It is most important • to balance the loads on the different machines • so that all processes get equal service • not to interfere with workstation owners • by using only idle workstations • to provide parallel programs with a suitable environment • i.e., simultaneous execution of interacting processes • not to flood the system with low-priority compute-intensive jobs • admission controls and batch scheduling are necessary

  41. Summary (2) • There is room for improvement • by considering how to combine multiple assumptions and merge the approaches used in different systems • The following combination is possible • a tunable parameter that selects whether workstations are shared in general or used only when idle • migration, so that jobs can evacuate workstations that become overloaded, are reclaimed by their owner, or become too slow relative to the other nodes running the parallel job • communication-based coscheduling for jobs that seem to need it, preferably without requiring the user to specify it • batch queueing with a checkpoint facility, so that heavy jobs run only when they do not degrade performance for others
