
Single System Abstractions for Clusters of Workstations


Presentation Transcript


  1. Single System Abstractions for Clusters of Workstations
  Bienvenido Vélez

  2. What is a cluster?
  A collection of loosely connected, self-contained computers cooperating to provide the abstraction of a single one.
  Possible system abstractions (System Abstraction | Characterized by):
  • Massively parallel processor | fine-grain parallelism
  • Multi-programmed system | coarse-grain concurrency
  • Independent nodes | fast interconnects
  Transparency is a goal.

  3. Question
  Compare three approaches to providing the abstraction of a single system for clusters of workstations, using the following criteria:
  • Transparency
  • Availability
  • Scalability

  4. Contributions
  • Improvements to the Microsoft Cluster Service
    • better availability and scalability
  • Adaptive replication
    • automatically adapting replication levels to maintain availability as the cluster grows

  5. Outline
  • Comparison of approaches
    • transparent remote execution (GLUnix)
    • preemptive load balancing (MOSIX)
    • highly available servers (Microsoft Cluster Service)
  • Contributions
    • improvements to the MS Cluster Service
    • adaptive replication
  • Conclusions

  6. GLUnix: Transparent Remote Execution, startup (glurun)
  [Diagram: the user types "glurun make" on the home node; the local node daemon sends Execute(make, env) to the master daemon on the master node; the master selects a remote node, whose node daemon forks and execs make; stdin, signals, and stdout/stderr are forwarded between the home node and the remote node]
  • Dynamic load balancing
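To make the startup path concrete, here is a minimal in-process sketch (not GLUnix code): the node names, the least-loaded selection rule, and the message shapes are assumptions for illustration, and real GLUnix also forwards stdin and signals, which the sketch omits.

```python
# A minimal, in-process model of the GLUnix startup path shown above.
# The real system uses daemons and sockets; here the three roles are
# plain functions so the flow is easy to follow.
import os
import subprocess

NODE_LOAD = {"node1": 0.7, "node2": 0.2, "node3": 0.9}   # hypothetical loads

def master_daemon_execute(cmd, env):
    """Master daemon: pick the least-loaded remote node, forward the request."""
    remote = min(NODE_LOAD, key=NODE_LOAD.get)            # dynamic load balancing
    return node_daemon_execute(remote, cmd, env)

def node_daemon_execute(node, cmd, env):
    """Remote node daemon: fork/exec the program, capture stdout/stderr."""
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return node, proc.returncode, proc.stdout, proc.stderr

def glurun(cmd):
    """Home node: 'glurun make' ships the command and its environment out."""
    node, rc, out, err = master_daemon_execute(cmd, dict(os.environ))
    print(f"[ran on {node}, exit {rc}]")
    print(out, end="")
    return rc

if __name__ == "__main__":
    glurun(["echo", "hello from the cluster"])
```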

  7. GLUnix: Virtues and Limitations
  • Transparency
    • home node transparency limited by user-level implementation
    • interactive jobs supported
    • special commands for running cluster jobs
  • Availability
    • detects and masks node failures
    • master process is a single point of failure
  • Scalability
    • master process is a performance bottleneck

  8. MOSIX: Preemptive Load Balancing
  [Diagram: processes spread across cluster nodes 1 through 5]
  • probabilistic diffusion of load information
  • redirects system calls to the home node

  9. MOSIX: Preemptive Load Balancing
  [Load-balancing loop: exchange local load with a random node, delay, then consider migrating a process to the node with minimal cost]
  • keeps load information from a fixed number of nodes
  • load = average size of the ready queue
  • cost = f(cpu time) + f(communication) + f(migration time)
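A small sketch of the exchange-and-consider round and the cost estimate described above; the window size, the cost weights, and the helper names are illustrative assumptions rather than MOSIX internals.

```python
# Sketch of the MOSIX-style probabilistic load exchange and cost estimate.
# Each node remembers load values for only a small, fixed-size window of
# other nodes; load is the average ready-queue length.
import random
from collections import OrderedDict

WINDOW = 4                                   # remember this many nodes' loads

def exchange_load(my_id, known, cluster_loads):
    """Swap load values with one randomly chosen node per round."""
    peer = random.choice([n for n in cluster_loads if n != my_id])
    known[peer] = cluster_loads[peer]        # learn the peer's current load
    while len(known) > WINDOW:               # bounded, gossip-style window
        known.popitem(last=False)            # forget the oldest entry
    return peer

def migration_cost(cpu_time_s, comm_bytes, image_bytes):
    """cost = f(cpu time) + f(communication) + f(migration time); made-up weights."""
    return 0.5 * cpu_time_s + 1e-6 * comm_bytes + 1e-7 * image_bytes

# One round on node "n1": learn a random peer's load, pick a candidate target.
loads = {"n1": 3.0, "n2": 1.0, "n3": 5.0}    # average ready-queue sizes
known = OrderedDict()
exchange_load("n1", known, loads)
candidate = min(known, key=known.get)
print(candidate, migration_cost(cpu_time_s=2.0, comm_bytes=10_000, image_bytes=8_000_000))
```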

  10. MOSIX: Virtues and Limitations
  • Transparency
    • limited home node transparency
  • Availability
    • masks node failures
    • no process restart
    • preemptive load balancing limits portability and performance
  • Scalability
    • flooding and swinging possible
    • low communication overhead

  11. Microsoft Cluster Service (MSCS): Highly Available Server Processes
  [Diagram: clients connect to Web and SQL servers running on MSCS nodes that exchange status information]
  • replicated, consistent node/server status database
  • migrates servers from failed nodes

  12. Microsoft Cluster Service: Hardware Configuration
  [Diagram: Web and SQL server nodes connected by Ethernet share a SCSI bus to disks holding HTML, RDB, status, and quorum data; the shared SCSI bus is a bottleneck and the shared disks are single points of failure]

  13. MSCS: Virtues and Limitations
  Transparency
  • server migration is transparent to clients
  Availability
  • servers are migrated from failed nodes
  • shared disks are single points of failure
  Scalability
  • manual static configuration
  • manual static load balancing
  • shared disk bus is a performance bottleneck

  14. Summary of Approaches
  System | Transparency | Availability | Scalability
  GLUnix | home node, limited | single point of failure; masks failures; no fail-over | load balancing; bottleneck
  MOSIX | home node, transparent | masks failures; no fail-over | load balancing
  MSCS | clients | server fail-over; single point of failure | bottleneck

  15. Transaction-Based Replication
  [Diagram: a logical write[x] on an object is expanded by replication into writes on its copies, { write[x1], …, write[xn] }, at nodes 1 … n, executed as transactions]
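A minimal sketch of the figure's idea: one logical write[x] expands into writes on every copy, applied as a single all-or-nothing unit. The in-memory replicas and the simplistic commit/abort rule are assumptions for illustration, not the actual replication protocol.

```python
# Sketch: write[x] -> { write[x1], ..., write[xn] } inside one transaction.
# Replicas are plain dicts; "commit" means every copy applied the write,
# otherwise the tentative writes are undone.
class ReplicatedStore:
    def __init__(self, n):
        self.replicas = [dict() for _ in range(n)]        # one copy per node

    def write(self, key, value):
        undo = [(copy, copy.get(key)) for copy in self.replicas]  # for abort
        try:
            for copy in self.replicas:                    # write every copy
                copy[key] = value
        except Exception:                                 # e.g. a copy unreachable
            for copy, old in undo:                        # abort: restore old values
                if old is None:
                    copy.pop(key, None)
                else:
                    copy[key] = old
            return False
        return True                                       # committed

store = ReplicatedStore(n=3)
store.write("x", 42)
print([r["x"] for r in store.replicas])                   # -> [42, 42, 42]
```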

  16. Re-designing MSCS
  • Idea: new core resource group fixed on every node
    • special disk resource
    • distributed transaction processing resource
    • transactional replicated file storage resource
  • Implement consensus with transactions (El-Abbadi-Toueg algorithm)
    • changes to the configuration DB
    • cluster membership service
  • Improvements
    • eliminates complex global update and regroup protocols
    • switchover not required for application data
    • provides a new, generally useful service: transactional replicated object storage

  17. Re-designed MSCS with Transactional Replicated Object Storage
  [Architecture diagram: on each node, the Cluster Service (node manager, resource manager) talks over RPC to a Resource Monitor hosting resource DLLs, and to a Transaction Service and a Replicated Storage Service; nodes communicate over the network]

  18. Adaptive Replication: Problem
  What should a replication service do when nodes are added to the cluster?
  Goal: maintain availability.
  Hypothesis (replication vs. migration):
  • must alternate migration with replication
  • replication (R) should happen significantly less often than migration (M)

  19. Replication increases the number of copies of objects
  [Diagram: with 2 nodes, objects x and y each have 2 copies; after 2 nodes are added, replication places copies of x and y on the new nodes as well, so each object has 4 copies on 4 nodes]

  20. Migration redistributes objects across all nodes
  [Diagram: with 2 nodes, objects x and y each have 2 copies; after 2 nodes are added, migration spreads the existing copies across the 4 nodes, keeping 2 copies per object]

  21. Simplifying Assumptions
  • the system keeps the same number of copies k of each object
  • the system has n nodes
  • initially n = k
  • n increases k nodes at a time
  • ignore partitions in computing availability

  22. Conjecture
  Highest availability is obtained when the objects are partitioned into q = n / k groups living on disjoint sets of nodes.
  Example: k = 3, n = 6, q = 2
  [Diagram: group x' has its 3 copies on one set of k = 3 nodes and group x" has its 3 copies on the other 3 nodes, giving q = 2 disjoint groups]
  Let's call this optimal migration.

  23. Adaptive Replication Necessary
  Let each node have availability p, so a node is down with probability 1 - p. A group of k copies is unavailable only if all k copies are down, so the availability of the system is
  A(k, n) = (1 - (1 - p)^k)^q  ≈  1 - q (1 - p)^k
  Since optimal migration always increases q, migration decreases availability (albeit slowly).
  Adaptive replication may be necessary to maintain availability.
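To see the effect numerically, the sketch below evaluates this model under the stated assumptions (independent node failures, k copies per object, q = n/k disjoint groups); the node availability p = 0.99 and the growth pattern are arbitrary example values. Availability drifts down as migration alone increases q, and a replication step that raises k restores it.

```python
# Availability model from the slide: A(k, n) = (1 - (1 - p)**k) ** q,
# with q = n // k disjoint groups and p the availability of one node.
def availability(p, k, n):
    q = n // k
    return (1 - (1 - p) ** k) ** q

p, k = 0.99, 3

print("migration only (k stays 3):")
for n in range(3, 31, 3):                       # cluster grows k nodes at a time
    print(f"  n={n:2d}  A={availability(p, k, n):.8f}")   # slowly decreasing

print("after one replication step (k raised to 4):")
print(f"  n=30  A={availability(p, 4, 30):.8f}")          # availability restored
```

Read as a policy, this is why replication can happen much less often than migration: the decline per migration step is tiny, so a replication step is only needed once the estimate crosses whatever availability target the service must maintain.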

  24. Adaptive Replication: Further Work
  • determine when it matters in real situations
  • relax assumptions
  • formalize arguments

  25. “Home Node” Single System Image

  26. Talk focuses on the coarse-grain layer
  System | LCM layers supported | Mechanisms used
  Berkeley NOW | NET, CGP, FGP | Active Messages, transparent remote execution, message-passing API
  MOSIX | NET, CGP | preemptive load balancing, kernel-to-kernel RPC
  MSCS | CGP | node regroup, resource failover/switchover
  ParaStation | NET, FGP | user-level protocol stack with semaphores

  27. GLUnix Characteristics
  • provides special user commands for managing cluster jobs
  • both batch and interactive jobs can be executed remotely
  • supports dynamic load balancing

  28. MOSIX: preemptive load balancing (decision flow)
  [Flowchart:
  load balance: if no less-loaded node exists, return; otherwise select the candidate process p with maximal impact on local load; if p cannot migrate, return; otherwise signal p to consider migration.
  consider: select the target node N that minimizes the cost C[N] of running p there; if N is acceptable, migrate to N; return.]
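The flowchart translated into a pair of functions, as a sketch: the candidate-selection rule and the placeholder cost C[N] are simplifications chosen for illustration, not MOSIX's actual heuristics.

```python
# Sketch of the decision flow above. `nodes` maps node id -> load
# (average ready-queue length); `procs` holds per-process statistics.
def load_balance(my_node, nodes, procs):
    """Run on a loaded node: pick a process and ask it to consider migrating."""
    others = {n: load for n, load in nodes.items() if n != my_node}
    if not others or min(others.values()) >= nodes[my_node]:
        return None                                  # no less-loaded node exists
    movable = [p for p in procs if p["can_migrate"]]
    if not movable:
        return None
    p = max(movable, key=lambda p: p["load_impact"]) # maximal impact on local load
    return consider(p, my_node, nodes)               # "signal p to consider migration"

def consider(p, my_node, nodes):
    """Run on behalf of process p: choose the cheapest target node, if any."""
    def cost(n):                                     # placeholder C[N]
        return nodes[n] + (0 if n == my_node else p["migration_cost"])
    target = min(nodes, key=cost)
    return target if target != my_node else None     # migrate only if it pays off

nodes = {"a": 4.0, "b": 1.0, "c": 2.5}
procs = [{"pid": 1, "can_migrate": True, "load_impact": 1.5, "migration_cost": 0.5}]
print(load_balance("a", nodes, procs))               # -> "b"
```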

  29. xFS: distributed log-based file system
  [Diagram: dirty data blocks are accumulated into a log segment, split into data stripes 1-3 plus a parity stripe, and written across a stripe group; client writes are always sequential]
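A small sketch of the striping step in the figure: a log segment is split into data stripes and a parity stripe is computed by XOR, so any single lost stripe can be rebuilt. The three-way split and the rebuild helper are illustrative assumptions, not xFS parameters or code.

```python
# Split a log segment (accumulated dirty blocks) into data stripes and
# compute a parity stripe by XOR, as sketched in the figure.
def make_stripe_group(segment: bytes, nstripes: int = 3):
    size = -(-len(segment) // nstripes)                    # ceiling division
    stripes = [segment[i * size:(i + 1) * size].ljust(size, b"\0")
               for i in range(nstripes)]
    parity = bytearray(size)
    for s in stripes:                                      # parity = XOR of stripes
        parity = bytearray(p ^ b for p, b in zip(parity, s))
    return stripes, bytes(parity)

def rebuild(stripes, parity, lost: int):
    """Recover one lost data stripe from the surviving stripes plus parity."""
    acc = bytearray(parity)
    for i, s in enumerate(stripes):
        if i != lost:
            acc = bytearray(a ^ b for a, b in zip(acc, s))
    return bytes(acc)

segment = b"dirty blocks accumulated into one large sequential write"
stripes, parity = make_stripe_group(segment)
assert rebuild(stripes, parity, lost=1) == stripes[1]      # stripe recovered
```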

  30. xFS: Virtues and Limitations
  • exploits the aggregate bandwidth of all disks
  • no need to buy expensive RAIDs
  • no single point of failure
  • reliability: relies on accumulating dirty blocks to generate large sequential writes
  • adaptive replication potentially more difficult

  31. Microsoft Cluster Service (MSCS): Goal
  [Diagram: an off-the-shelf server application is wrapped to become a cluster-aware, highly available server application]

  32. MSCS Abstractions
  • Node
  • Resource
    • e.g. disks, IP addresses, servers
  • Resource dependency
    • e.g. a DBMS depends on the disk holding its data
  • Resource group
    • e.g. a server and its IP address
  • Quorum resource
    • logs configuration data
    • breaks ties during membership changes
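To make the abstractions concrete, a small data-model sketch follows; the class and field names are mine, not the MSCS API.

```python
# Sketch of the MSCS abstractions as plain data classes. Field names are
# illustrative; MSCS exposes these concepts through its own interfaces.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    name: str                               # e.g. "disk0", "10.0.0.5", "sql-server"
    kind: str                               # disk, IP address, server, ...
    depends_on: List["Resource"] = field(default_factory=list)
    is_quorum: bool = False                 # quorum resource: logs config data,
                                            # breaks ties during membership changes

@dataclass
class ResourceGroup:                        # the unit of fail-over
    name: str
    resources: List[Resource]

@dataclass
class Node:
    name: str
    groups: List[ResourceGroup] = field(default_factory=list)

data_disk = Resource("disk0", "disk")
sql = Resource("sql-server", "server", depends_on=[data_disk])   # DBMS needs its disk
ip = Resource("10.0.0.5", "IP address")
group = ResourceGroup("sql-group", [data_disk, ip, sql])         # server + its address
node = Node("node1", [group])
```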

  33. MSCS: General Characteristics
  • global state of all nodes and resources is consistently replicated across all nodes (write-all using an atomic multicast protocol)
  • node and resource failures are detected
  • resources of failed nodes are migrated to surviving nodes
  • failed resources are restarted

  34. MSCS System Architecture
  [Diagram: on each node, the Cluster Service (node manager, resource manager) communicates over RPC with a Resource Monitor that hosts resource DLLs controlling the resources; nodes communicate over the network]

  35. MSCS: virtually synchronous regroup operation
  Activate
  • determine the nodes in its connected component
  • determine whether its component is the primary
  • elect a new tie-breaker
  • if this node is the new tie-breaker, broadcast its component as the new membership
  Closing / Pruning
  • if not in the new membership, halt
  Cleanup 1
  • install the new membership received from the new tie-breaker
  • acknowledge "ready to commit"
  Cleanup 2
  • if this node owns the quorum disk, log the membership change

  36. MSCS: Primary Component Determination Rule
  A node is in the primary component if one of the following holds:
  • the node is connected to a majority of the previous membership
  • the node is connected to half (>= 2) of the previous members, and one of those is the tie-breaker
  • the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
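The three conditions translate almost directly into a predicate; the sketch below assumes the node already knows which previous members it can still reach (including itself), plus the previous tie-breaker and quorum owner.

```python
# The primary-component rule above as a predicate, as a sketch.
def in_primary_component(node, prev_members, reachable,
                         prev_tie_breaker, prev_quorum_owner):
    connected = set(reachable) & set(prev_members)
    # 1. connected to a majority of the previous membership
    if len(connected) > len(prev_members) / 2:
        return True
    # 2. connected to half (>= 2) of the previous members, one being the tie-breaker
    if (len(connected) * 2 == len(prev_members) and len(connected) >= 2
            and prev_tie_breaker in connected):
        return True
    # 3. isolated, previous membership had two nodes, and this node owned quorum
    if (connected == {node} and len(prev_members) == 2
            and node == prev_quorum_owner):
        return True
    return False

# A two-node cluster partitions: only the quorum owner keeps the primary role.
print(in_primary_component("A", ["A", "B"], {"A"}, "B", "A"))   # True
print(in_primary_component("B", ["A", "B"], {"B"}, "B", "A"))   # False
```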

  37. MSCS switchover
  [Diagram: on node failure, the disks on the shared SCSI bus are switched over to a surviving node; every disk is a single point of failure]
  Alternative: replication
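For contrast with replication, a minimal sketch of switchover itself: on node failure, the failed node's resource groups are brought online on a survivor. The least-loaded-survivor rule and the names are illustrative assumptions.

```python
# Sketch: reassign a failed node's resource groups to a surviving node.
def switchover(failed, cluster):
    groups = cluster.pop(failed, [])                   # groups stranded by the failure
    if not cluster:
        raise RuntimeError("no surviving nodes")
    target = min(cluster, key=lambda n: len(cluster[n]))   # least-loaded survivor
    cluster[target].extend(groups)                     # disks on the shared bus move too
    return target

cluster = {"node1": ["web-group"], "node2": ["sql-group"]}
print(switchover("node1", cluster), cluster)
# -> node2 {'node2': ['sql-group', 'web-group']}
```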

  38. Summary of Approaches
  System | Transparency | Availability | Performance
  Berkeley NOW | home node, limited | single point of failure; no fail-over | load balancing; bottleneck
  MOSIX | home node, transparent | masks failures; no fail-over; tolerates partitions | load balancing; low message overhead
  MSCS | server | single point of failure; low MTTR; tolerates partitions | bottleneck

  39. Comparing Approaches: Design Goals
  System | LCM layers supported | Mechanisms used
  Berkeley NOW | NET, CGP, FGP | Active Messages, transparent remote execution, message-passing API
  MOSIX | NET, CGP | preemptive load balancing, kernel-to-kernel RPC
  MSCS | CGP | cluster membership services, resource fail-over
  ParaStation | NET, FGP | user-level protocol stack, network interface hardware

  40. Comparing Approaches: Global Information Management
  System | Approach
  Berkeley NOW | centralized
  MOSIX | distributed: probabilistic
  MSCS | replicated: consistent

  41. Comparing Approaches: Fault Tolerance
  System | Failure detection | Recovery action
  Berkeley NOW | detected by master daemon (timeouts) | failed nodes removed from the central configuration DB
  MOSIX | detected by individual nodes (timeouts) | failed nodes removed from local configuration DBs
  MSCS | detected by individual nodes (heartbeats) | failed nodes removed from the replicated configuration DB; resources restarted/migrated
  System | Single points of failure | Possible solution
  Berkeley NOW | master process | process pairs
  MOSIX | none | N.A.
  MSCS | quorum resource, shared disks | virtual-partitions replication algorithm

  42. Comparing Approaches: Load Balancing
  System | Approach | Description
  MSCS | manual, static | sys admin manually assigns processes to nodes; processes statically assigned to processors
  Berkeley NOW | dynamic | uses dynamic load information to assign processes to processors
  MOSIX | preemptive | migrates processes in the middle of their execution

  43. Comparing Approaches: Process Migration
  System | Approach | Description
  Berkeley NOW | none | processes run to completion once assigned to a processor
  MSCS | cooperative (shutdown/restart) | processes brought offline at the source and online at the destination
  MOSIX | transparent | processes migrated at any point during execution

  44. Example: k = 3, n = 3
  [Diagram: object group x has copies on all three nodes]
  Each letter (e.g. x above) represents a group of objects with copies on the same subset of nodes.

  45. Redundancy techniques
  [Taxonomy diagram: redundancy divides into error-correcting codes (e.g. RAID, xFS) and replication; replication techniques include fail-over/failback and switch-over (e.g. MSCS), primary copy (e.g. HARP), voting (quorum consensus), and voting with views (virtual partitions)]
