
Single System Abstractions for Clusters of Workstations


Presentation Transcript


  1. Single System Abstractions for Clusters of Workstations
  Bienvenido Vélez

  2. What is a cluster?
  A collection of loosely connected, self-contained computers cooperating to provide the abstraction of a single one.
  Possible system abstractions (System Abstraction | Characterized by):
  • Massively parallel processor | fine-grain parallelism
  • Multi-programmed system | coarse-grain concurrency
  • Independent nodes | fast interconnects
  Transparency is a goal.

  3. Question
  Compare three approaches to providing the abstraction of a single system for clusters of workstations, using the following criteria:
  • Transparency
  • Availability
  • Scalability

  4. Contributions
  • Improvements to the Microsoft Cluster Service
    • better availability and scalability
  • Adaptive replication
    • automatically adapting replication levels to maintain availability as the cluster grows

  5. Outline
  • Comparison of approaches
    • transparent remote execution (GLUnix)
    • preemptive load balancing (MOSIX)
    • highly available servers (Microsoft Cluster Service)
  • Contributions
    • improvements to the MS Cluster Service
    • adaptive replication
  • Conclusions

  6. GLUnix: Transparent Remote Execution, startup (glurun)
  [Diagram: the user types "glurun make" on the home node; the local node daemon sends Execute(make, env) to the master daemon on the master node; the master selects a remote node, whose node daemon forks and execs make; stdin, signals, and stdout/stderr are forwarded between the home node and the remote node]
  • Dynamic load balancing
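To make the startup path concrete, here is a minimal in-process sketch (not GLUnix code): the node names, the least-loaded selection rule, and the message shapes are assumptions for illustration, and real GLUnix also forwards stdin and signals, which the sketch omits.

```python
# A minimal, in-process model of the GLUnix startup path shown above.
# The real system uses daemons and sockets; here the three roles are
# plain functions so the flow is easy to follow.
import os
import subprocess

NODE_LOAD = {"node1": 0.7, "node2": 0.2, "node3": 0.9}   # hypothetical loads

def master_daemon_execute(cmd, env):
    """Master daemon: pick the least-loaded remote node, forward the request."""
    remote = min(NODE_LOAD, key=NODE_LOAD.get)            # dynamic load balancing
    return node_daemon_execute(remote, cmd, env)

def node_daemon_execute(node, cmd, env):
    """Remote node daemon: fork/exec the program, capture stdout/stderr."""
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return node, proc.returncode, proc.stdout, proc.stderr

def glurun(cmd):
    """Home node: 'glurun make' ships the command and its environment out."""
    node, rc, out, err = master_daemon_execute(cmd, dict(os.environ))
    print(f"[ran on {node}, exit {rc}]")
    print(out, end="")
    return rc

if __name__ == "__main__":
    glurun(["echo", "hello from the cluster"])
```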

  7. GLUnix: Virtues and Limitations
  • Transparency
    • home node transparency limited by user-level implementation
    • interactive jobs supported
    • special commands for running cluster jobs
  • Availability
    • detects and masks node failures
    • master process is a single point of failure
  • Scalability
    • master process is a performance bottleneck

  8. MOSIX: Preemptive Load Balancing
  [Diagram: processes spread across cluster nodes 1 through 5]
  • probabilistic diffusion of load information
  • redirects system calls to the home node

  9. MOSIX: Preemptive Load Balancing
  [Load-balancing loop: exchange local load with a random node, delay, then consider migrating a process to the node with minimal cost]
  • keeps load information from a fixed number of nodes
  • load = average size of the ready queue
  • cost = f(cpu time) + f(communication) + f(migration time)
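A small sketch of the exchange-and-consider round and the cost estimate described above; the window size, the cost weights, and the helper names are illustrative assumptions rather than MOSIX internals.

```python
# Sketch of the MOSIX-style probabilistic load exchange and cost estimate.
# Each node remembers load values for only a small, fixed-size window of
# other nodes; load is the average ready-queue length.
import random
from collections import OrderedDict

WINDOW = 4                                   # remember this many nodes' loads

def exchange_load(my_id, known, cluster_loads):
    """Swap load values with one randomly chosen node per round."""
    peer = random.choice([n for n in cluster_loads if n != my_id])
    known[peer] = cluster_loads[peer]        # learn the peer's current load
    while len(known) > WINDOW:               # bounded, gossip-style window
        known.popitem(last=False)            # forget the oldest entry
    return peer

def migration_cost(cpu_time_s, comm_bytes, image_bytes):
    """cost = f(cpu time) + f(communication) + f(migration time); made-up weights."""
    return 0.5 * cpu_time_s + 1e-6 * comm_bytes + 1e-7 * image_bytes

# One round on node "n1": learn a random peer's load, pick a candidate target.
loads = {"n1": 3.0, "n2": 1.0, "n3": 5.0}    # average ready-queue sizes
known = OrderedDict()
exchange_load("n1", known, loads)
candidate = min(known, key=known.get)
print(candidate, migration_cost(cpu_time_s=2.0, comm_bytes=10_000, image_bytes=8_000_000))
```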

  10. MOSIX: Virtues and Limitations
  • Transparency
    • limited home node transparency
  • Availability
    • masks node failures
    • no process restart
    • preemptive load balancing limits portability and performance
  • Scalability
    • flooding and swinging possible
    • low communication overhead

  11. Microsoft Cluster Service (MSCS): Highly Available Server Processes
  [Diagram: clients connect to Web and SQL servers running on MSCS nodes that exchange status information]
  • replicated, consistent node/server status database
  • migrates servers from failed nodes

  12. Microsoft Cluster Service: Hardware Configuration
  [Diagram: Web and SQL server nodes connected by Ethernet share a SCSI bus to disks holding HTML, RDB, status, and quorum data; the shared SCSI bus is a bottleneck and the shared disks are single points of failure]

  13. MSCS: Virtues and Limitations
  Transparency
  • server migration is transparent to clients
  Availability
  • servers are migrated from failed nodes
  • shared disks are single points of failure
  Scalability
  • manual static configuration
  • manual static load balancing
  • shared disk bus is a performance bottleneck

  14. Summary of Approaches
  System | Transparency | Availability | Scalability
  GLUnix | home node, limited | single point of failure; masks failures; no fail-over | load balancing; bottleneck
  MOSIX | home node, transparent | masks failures; no fail-over | load balancing
  MSCS | clients | server fail-over; single point of failure | bottleneck

  15. Transaction-Based Replication
  [Diagram: a logical write[x] on an object is expanded by replication into writes on its copies, { write[x1], …, write[xn] }, at nodes 1 … n, executed as transactions]
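A minimal sketch of the figure's idea: one logical write[x] expands into writes on every copy, applied as a single all-or-nothing unit. The in-memory replicas and the simplistic commit/abort rule are assumptions for illustration, not the actual replication protocol.

```python
# Sketch: write[x] -> { write[x1], ..., write[xn] } inside one transaction.
# Replicas are plain dicts; "commit" means every copy applied the write,
# otherwise the tentative writes are undone.
class ReplicatedStore:
    def __init__(self, n):
        self.replicas = [dict() for _ in range(n)]        # one copy per node

    def write(self, key, value):
        undo = [(copy, copy.get(key)) for copy in self.replicas]  # for abort
        try:
            for copy in self.replicas:                    # write every copy
                copy[key] = value
        except Exception:                                 # e.g. a copy unreachable
            for copy, old in undo:                        # abort: restore old values
                if old is None:
                    copy.pop(key, None)
                else:
                    copy[key] = old
            return False
        return True                                       # committed

store = ReplicatedStore(n=3)
store.write("x", 42)
print([r["x"] for r in store.replicas])                   # -> [42, 42, 42]
```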

  16. Re-designing MSCS
  • Idea: new core resource group fixed on every node
    • special disk resource
    • distributed transaction processing resource
    • transactional replicated file storage resource
  • Implement consensus with transactions (El-Abbadi-Toueg algorithm)
    • changes to the configuration DB
    • cluster membership service
  • Improvements
    • eliminates complex global update and regroup protocols
    • switchover not required for application data
    • provides a new, generally useful service: transactional replicated object storage

  17. Re-designed MSCS with Transactional Replicated Object Storage
  [Architecture diagram: on each node, the Cluster Service (node manager, resource manager) talks over RPC to a Resource Monitor hosting resource DLLs, and to a Transaction Service and a Replicated Storage Service; nodes communicate over the network]

  18. Adaptive Replication: Problem
  What should a replication service do when nodes are added to the cluster?
  Goal: maintain availability.
  Hypothesis (replication vs. migration):
  • must alternate migration with replication
  • replication (R) should happen significantly less often than migration (M)

  19. Replication increases the number of copies of objects
  [Diagram: with 2 nodes, objects x and y each have 2 copies; after 2 nodes are added, replication places copies of x and y on the new nodes as well, so each object has 4 copies on 4 nodes]

  20. Migration redistributes objects across all nodes
  [Diagram: with 2 nodes, objects x and y each have 2 copies; after 2 nodes are added, migration spreads the existing copies across the 4 nodes, keeping 2 copies per object]

  21. Simplifying Assumptions
  • the system keeps the same number of copies k of each object
  • the system has n nodes
  • initially n = k
  • n increases k nodes at a time
  • ignore partitions in computing availability

  22. Conjecture
  Highest availability is obtained when the objects are partitioned into q = n / k groups living on disjoint sets of nodes.
  Example: k = 3, n = 6, q = 2
  [Diagram: group x' has its 3 copies on one set of k = 3 nodes and group x" has its 3 copies on the other 3 nodes, giving q = 2 disjoint groups]
  Let's call this optimal migration.

  23. Adaptive Replication Necessary
  Let each node have availability p, so a node is down with probability 1 - p. A group of k copies is unavailable only if all k copies are down, so the availability of the system is
  A(k, n) = (1 - (1 - p)^k)^q  ≈  1 - q (1 - p)^k
  Since optimal migration always increases q, migration decreases availability (albeit slowly).
  Adaptive replication may be necessary to maintain availability.
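To see the effect numerically, the sketch below evaluates this model under the stated assumptions (independent node failures, k copies per object, q = n/k disjoint groups); the node availability p = 0.99 and the growth pattern are arbitrary example values. Availability drifts down as migration alone increases q, and a replication step that raises k restores it.

```python
# Availability model from the slide: A(k, n) = (1 - (1 - p)**k) ** q,
# with q = n // k disjoint groups and p the availability of one node.
def availability(p, k, n):
    q = n // k
    return (1 - (1 - p) ** k) ** q

p, k = 0.99, 3

print("migration only (k stays 3):")
for n in range(3, 31, 3):                       # cluster grows k nodes at a time
    print(f"  n={n:2d}  A={availability(p, k, n):.8f}")   # slowly decreasing

print("after one replication step (k raised to 4):")
print(f"  n=30  A={availability(p, 4, 30):.8f}")          # availability restored
```

Read as a policy, this is why replication can happen much less often than migration: the decline per migration step is tiny, so a replication step is only needed once the estimate crosses whatever availability target the service must maintain.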

  24. Adaptive Replication: Further Work
  • determine when it matters in real situations
  • relax assumptions
  • formalize arguments

  25. “Home Node” Single System Image

  26. Talk focuses on the coarse-grain layer
  System | LCM layers supported | Mechanisms used
  Berkeley NOW | NET, CGP, FGP | Active Messages, transparent remote execution, message-passing API
  MOSIX | NET, CGP | preemptive load balancing, kernel-to-kernel RPC
  MSCS | CGP | node regroup, resource failover/switchover
  ParaStation | NET, FGP | user-level protocol stack with semaphores

  27. GLUnix Characteristics
  • provides special user commands for managing cluster jobs
  • both batch and interactive jobs can be executed remotely
  • supports dynamic load balancing

  28. MOSIX: preemptive load balancing (decision flow)
  [Flowchart:
  load balance: if no less-loaded node exists, return; otherwise select the candidate process p with maximal impact on local load; if p cannot migrate, return; otherwise signal p to consider migration.
  consider: select the target node N that minimizes the cost C[N] of running p there; if N is acceptable, migrate to N; return.]
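The flowchart translated into a pair of functions, as a sketch: the candidate-selection rule and the placeholder cost C[N] are simplifications chosen for illustration, not MOSIX's actual heuristics.

```python
# Sketch of the decision flow above. `nodes` maps node id -> load
# (average ready-queue length); `procs` holds per-process statistics.
def load_balance(my_node, nodes, procs):
    """Run on a loaded node: pick a process and ask it to consider migrating."""
    others = {n: load for n, load in nodes.items() if n != my_node}
    if not others or min(others.values()) >= nodes[my_node]:
        return None                                  # no less-loaded node exists
    movable = [p for p in procs if p["can_migrate"]]
    if not movable:
        return None
    p = max(movable, key=lambda p: p["load_impact"]) # maximal impact on local load
    return consider(p, my_node, nodes)               # "signal p to consider migration"

def consider(p, my_node, nodes):
    """Run on behalf of process p: choose the cheapest target node, if any."""
    def cost(n):                                     # placeholder C[N]
        return nodes[n] + (0 if n == my_node else p["migration_cost"])
    target = min(nodes, key=cost)
    return target if target != my_node else None     # migrate only if it pays off

nodes = {"a": 4.0, "b": 1.0, "c": 2.5}
procs = [{"pid": 1, "can_migrate": True, "load_impact": 1.5, "migration_cost": 0.5}]
print(load_balance("a", nodes, procs))               # -> "b"
```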

  29. xFS: distributed log-based file system
  [Diagram: dirty data blocks are accumulated into a log segment, split into data stripes 1-3 plus a parity stripe, and written across a stripe group; client writes are always sequential]
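A small sketch of the striping step in the figure: a log segment is split into data stripes and a parity stripe is computed by XOR, so any single lost stripe can be rebuilt. The three-way split and the rebuild helper are illustrative assumptions, not xFS parameters or code.

```python
# Split a log segment (accumulated dirty blocks) into data stripes and
# compute a parity stripe by XOR, as sketched in the figure.
def make_stripe_group(segment: bytes, nstripes: int = 3):
    size = -(-len(segment) // nstripes)                    # ceiling division
    stripes = [segment[i * size:(i + 1) * size].ljust(size, b"\0")
               for i in range(nstripes)]
    parity = bytearray(size)
    for s in stripes:                                      # parity = XOR of stripes
        parity = bytearray(p ^ b for p, b in zip(parity, s))
    return stripes, bytes(parity)

def rebuild(stripes, parity, lost: int):
    """Recover one lost data stripe from the surviving stripes plus parity."""
    acc = bytearray(parity)
    for i, s in enumerate(stripes):
        if i != lost:
            acc = bytearray(a ^ b for a, b in zip(acc, s))
    return bytes(acc)

segment = b"dirty blocks accumulated into one large sequential write"
stripes, parity = make_stripe_group(segment)
assert rebuild(stripes, parity, lost=1) == stripes[1]      # stripe recovered
```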

  30. xFS: Virtues and Limitations
  • exploits the aggregate bandwidth of all disks
  • no need to buy expensive RAIDs
  • no single point of failure
  • reliability: relies on accumulating dirty blocks to generate large sequential writes
  • adaptive replication potentially more difficult

  31. Microsoft Cluster Service (MSCS): Goal
  [Diagram: an off-the-shelf server application is wrapped to become a cluster-aware, highly available server application]

  32. MSCS Abstractions
  • Node
  • Resource
    • e.g. disks, IP addresses, servers
  • Resource dependency
    • e.g. a DBMS depends on the disk holding its data
  • Resource group
    • e.g. a server and its IP address
  • Quorum resource
    • logs configuration data
    • breaks ties during membership changes
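To make the abstractions concrete, a small data-model sketch follows; the class and field names are mine, not the MSCS API.

```python
# Sketch of the MSCS abstractions as plain data classes. Field names are
# illustrative; MSCS exposes these concepts through its own interfaces.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    name: str                               # e.g. "disk0", "10.0.0.5", "sql-server"
    kind: str                               # disk, IP address, server, ...
    depends_on: List["Resource"] = field(default_factory=list)
    is_quorum: bool = False                 # quorum resource: logs config data,
                                            # breaks ties during membership changes

@dataclass
class ResourceGroup:                        # the unit of fail-over
    name: str
    resources: List[Resource]

@dataclass
class Node:
    name: str
    groups: List[ResourceGroup] = field(default_factory=list)

data_disk = Resource("disk0", "disk")
sql = Resource("sql-server", "server", depends_on=[data_disk])   # DBMS needs its disk
ip = Resource("10.0.0.5", "IP address")
group = ResourceGroup("sql-group", [data_disk, ip, sql])         # server + its address
node = Node("node1", [group])
```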

  33. MSCS: General Characteristics
  • global state of all nodes and resources is consistently replicated across all nodes (write-all using an atomic multicast protocol)
  • node and resource failures are detected
  • resources of failed nodes are migrated to surviving nodes
  • failed resources are restarted

  34. MSCS System Architecture
  [Diagram: on each node, the Cluster Service (node manager, resource manager) communicates over RPC with a Resource Monitor that hosts resource DLLs controlling the resources; nodes communicate over the network]

  35. MSCS: virtually synchronous regroup operation
  Activate
  • determine the nodes in its connected component
  • determine whether its component is the primary
  • elect a new tie-breaker
  • if this node is the new tie-breaker, broadcast its component as the new membership
  Closing / Pruning
  • if not in the new membership, halt
  Cleanup 1
  • install the new membership received from the new tie-breaker
  • acknowledge "ready to commit"
  Cleanup 2
  • if this node owns the quorum disk, log the membership change

  36. MSCS: Primary Component Determination Rule
  A node is in the primary component if one of the following holds:
  • the node is connected to a majority of the previous membership
  • the node is connected to half (>= 2) of the previous members, and one of those is the tie-breaker
  • the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
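The three conditions translate almost directly into a predicate; the sketch below assumes the node already knows which previous members it can still reach (including itself), plus the previous tie-breaker and quorum owner.

```python
# The primary-component rule above as a predicate, as a sketch.
def in_primary_component(node, prev_members, reachable,
                         prev_tie_breaker, prev_quorum_owner):
    connected = set(reachable) & set(prev_members)
    # 1. connected to a majority of the previous membership
    if len(connected) > len(prev_members) / 2:
        return True
    # 2. connected to half (>= 2) of the previous members, one being the tie-breaker
    if (len(connected) * 2 == len(prev_members) and len(connected) >= 2
            and prev_tie_breaker in connected):
        return True
    # 3. isolated, previous membership had two nodes, and this node owned quorum
    if (connected == {node} and len(prev_members) == 2
            and node == prev_quorum_owner):
        return True
    return False

# A two-node cluster partitions: only the quorum owner keeps the primary role.
print(in_primary_component("A", ["A", "B"], {"A"}, "B", "A"))   # True
print(in_primary_component("B", ["A", "B"], {"B"}, "B", "A"))   # False
```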

  37. MSCS switchover
  [Diagram: on node failure, the disks on the shared SCSI bus are switched over to a surviving node; every disk is a single point of failure]
  Alternative: replication
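For contrast with replication, a minimal sketch of switchover itself: on node failure, the failed node's resource groups are brought online on a survivor. The least-loaded-survivor rule and the names are illustrative assumptions.

```python
# Sketch: reassign a failed node's resource groups to a surviving node.
def switchover(failed, cluster):
    groups = cluster.pop(failed, [])                   # groups stranded by the failure
    if not cluster:
        raise RuntimeError("no surviving nodes")
    target = min(cluster, key=lambda n: len(cluster[n]))   # least-loaded survivor
    cluster[target].extend(groups)                     # disks on the shared bus move too
    return target

cluster = {"node1": ["web-group"], "node2": ["sql-group"]}
print(switchover("node1", cluster), cluster)
# -> node2 {'node2': ['sql-group', 'web-group']}
```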

  38. Summary of Approaches
  System | Transparency | Availability | Performance
  Berkeley NOW | home node, limited | single point of failure; no fail-over | load balancing; bottleneck
  MOSIX | home node, transparent | masks failures; no fail-over; tolerates partitions | load balancing; low message overhead
  MSCS | server | single point of failure; low MTTR; tolerates partitions | bottleneck

  39. Comparing Approaches: Design Goals
  System | LCM layers supported | Mechanisms used
  Berkeley NOW | NET, CGP, FGP | Active Messages, transparent remote execution, message-passing API
  MOSIX | NET, CGP | preemptive load balancing, kernel-to-kernel RPC
  MSCS | CGP | cluster membership services, resource fail-over
  ParaStation | NET, FGP | user-level protocol stack, network interface hardware

  40. Comparing Approaches: Global Information Management
  System | Approach
  Berkeley NOW | centralized
  MOSIX | distributed: probabilistic
  MSCS | replicated: consistent

  41. Comparing Approaches: Fault Tolerance
  System | Failure detection | Recovery action
  Berkeley NOW | detected by master daemon (timeouts) | failed nodes removed from the central configuration DB
  MOSIX | detected by individual nodes (timeouts) | failed nodes removed from local configuration DBs
  MSCS | detected by individual nodes (heartbeats) | failed nodes removed from the replicated configuration DB; resources restarted/migrated
  System | Single points of failure | Possible solution
  Berkeley NOW | master process | process pairs
  MOSIX | none | N.A.
  MSCS | quorum resource, shared disks | virtual-partitions replication algorithm

  42. Comparing Approaches: Load Balancing
  System | Approach | Description
  MSCS | manual, static | sys admin manually assigns processes to nodes; processes statically assigned to processors
  Berkeley NOW | dynamic | uses dynamic load information to assign processes to processors
  MOSIX | preemptive | migrates processes in the middle of their execution

  43. Comparing Approaches: Process Migration
  System | Approach | Description
  Berkeley NOW | none | processes run to completion once assigned to a processor
  MSCS | cooperative (shutdown/restart) | processes brought offline at the source and online at the destination
  MOSIX | transparent | processes migrated at any point during execution

  44. Example: k = 3, n = 3
  [Diagram: object group x has copies on all three nodes]
  Each letter (e.g. x above) represents a group of objects with copies on the same subset of nodes.

  45. Redundancy techniques
  [Taxonomy diagram: redundancy divides into error-correcting codes (e.g. RAID, xFS) and replication; replication techniques include fail-over/failback and switch-over (e.g. MSCS), primary copy (e.g. HARP), voting (quorum consensus), and voting with views (virtual partitions)]
