FT NT: A Tutorial on Microsoft Cluster Server™(formerly “Wolfpack”)

Joe Barrera

Jim Gray

Microsoft Research

{joebar, gray} @ microsoft.com

http://research.microsoft.com/barc

Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A
DEPENDABILITY: The 3 ITIES
  • RELIABILITY / INTEGRITY: Does the right thing (also large MTTF).
  • AVAILABILITY: Does it now (also small MTTR). System Availability = MTTF / (MTTF + MTTR). If 90% of terminals are up & 99% of the DB is up, then ~89% of transactions are serviced on time (worked example below).
  • Holistic vs. Reductionist view
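
A worked version of the example above (assuming terminal and database failures are independent):

  Availability = MTTF / (MTTF + MTTR)
  System availability = 0.90 × 0.99 = 0.891, so ~89% of transactions are serviced on time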

[Diagram: the "ilities" - Security, Integrity, Reliability, Availability]

Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe).

1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
To get a 10-year MTTF, must attack all of these areas:

  Cause of outage                   Share of outages   MTTF by cause
  Vendor (hardware and software)    42%                5 Months
  Application software              25%                9 Months
  Communications lines              12%                1.5 Years
  Environment                       11.2%              2 Years
  Operations                        9.3%               2 Years
  Overall (all causes)                                 10 Weeks
Case Studies - Tandem Trends
  • MTTF improved
  • Shift from Hardware & Maintenance (from 50% down to 10%) to Software (62%) & Operations (15%)
  • NOTE: Systematic under-reporting of Environment

[Chart: reported outage causes over time, including Operations errors and Application Software]
Summary of FT Studies
  • Current Situation: ~4-year MTTF => Fault Tolerance Works.
  • Hardware is GREAT (maintenance and MTTF).
  • Software masks most hardware faults.
  • Many hidden software outages in operations:
    • New Software.
    • Utilities.
  • Must make all software ONLINE.
  • Software seems to define a 30-year MTTF ceiling.
  • Reasonable Goal: 100-year MTTF (class 4 today => class 6 tomorrow).
Fault Tolerance vs Disaster Tolerance
  • Fault-Tolerance: mask local faults
    • RAID disks
    • Uninterruptible Power Supplies
    • Cluster Failover
  • Disaster Tolerance: masks site failures
    • Protects against fire, flood, sabotage, ...
    • Redundant system and service at remote site.
The Microsoft “Vision”: Plug & Play Dependability
  • Transactions for reliability
  • Clusters: for availability
  • Security
  • All built into the OS


Cluster Goals
  • Manageability
    • Manage nodes as a single system
    • Perform server maintenance without affecting users
    • Mask faults, so repair is non-disruptive
  • Availability
    • Restart failed applications & servers
      • Un-availability ~ MTTR / MTBF, so quick repair matters.
    • Detect/warn administrators of failures
  • Scalability
    • Add nodes for incremental
      • processing
      • storage
      • bandwidth
Fault Model
  • Failures are independent, so single-fault tolerance is a big win
  • Hardware fails fast (blue-screen)
  • Software fails-fast (or goes to sleep)
  • Software often repaired by reboot:
    • Heisenbugs
  • Operations tasks: major source of outage
    • Utility operations
    • Software upgrades
Cluster: Servers Combined to Improve Availability & Scalability

[Diagram: client PCs and printers connected to two servers (A and B) linked by an interconnect, each with its own disk array]
  • Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image).
  • Node: A server in a cluster. May be an SMP server.
  • Interconnect: Communications link used for intra-cluster status info such as “heartbeats”. Can be Ethernet.
Microsoft Cluster Server™
  • 2-node availability Summer 97 (20,000 Beta Testers now)
    • Commoditize fault-tolerance (high availability)
    • Commodity hardware (no special hardware)
    • Easy to set up and manage
    • Lots of applications work out of the box.
  • 16-node scalability later (next year?)
Failover Example

[Diagram: a browser client connects to a two-node cluster (Server 1, Server 2); the web site and database can run on either server, with the web site files and database files on shared storage]

MS Press Failover Demo
  • Client/Server
  • Software failure
  • Admin shutdown
  • Server failure

Resource state legend: Pending, Partial, Failed, Offline

Demo Configuration

[Diagram: a two-node Windows NT Server cluster ("Alice" and "Betty") with local disks and shared disks in a SCSI disk cabinet, plus a client and an administrator station]
  • Server "Alice": SMP Pentium® Pro processors, Windows NT Server with Wolfpack, Microsoft Internet Information Server, Microsoft SQL Server
  • Server "Betty": SMP Pentium® Pro processors, Windows NT Server with Wolfpack, Microsoft Internet Information Server, Microsoft SQL Server
  • Interconnect: standard Ethernet
  • Storage: local disks on each server plus shared disks (SCSI disk cabinet)
  • Client: Windows NT Workstation, Internet Explorer, MS Press OLTP app
  • Administrator: Windows NT Workstation, Cluster Admin, SQL Enterprise Mgr

Demo Administration

[Diagram: the Windows NT Server cluster - Alice (runs SQL Trace, runs Globe) and Betty (runs SQL Trace) - with local disks, shared disks in the SCSI disk cabinet, and a client]

  • Cluster Admin Console
  • Windows GUI
  • Shows cluster resource status
  • Replicates status to all servers
  • Define apps & related resources
  • Define resource dependencies
  • Orchestrates recovery order
  • SQL Enterprise Mgr
  • Windows GUI
  • Shows server status
  • Manages many servers
  • Start, stop, manage DBs

Generic Stateless ApplicationRotating Globe
  • Mplay32 is a generic app.
  • Registered with MSCS
  • MSCS restarts it on failure
  • Move/restart ~ 2 seconds
  • Fail-over if
    • 4 failures (= process exits)
    • in 3 minutes
    • settable default
Demo: Moving or Failing Over an Application

[Diagram: when Alice fails or the operator requests a move, the AVI application moves from Alice to Betty; both nodes have local disks and share disks in the SCSI disk cabinet]

Generic Stateful ApplicationNotePad
  • Notepad saves state on shared disk
  • Failure before save => lost changes
  • Failover or move (disk & state move)
Demo Step 1: Alice Delivering Service

[Diagram: HTTP requests arrive at the cluster IP address on Alice and flow through IIS -> ODBC -> SQL (SQL activity on Alice, none on Betty); both nodes have local disks and share the SCSI disk cabinet]

Demo Step 2: Request Move to Betty

[Diagram: the group (IP address, IIS, ODBC, SQL) is moved from Alice to Betty; SQL activity shifts to Betty; HTTP requests follow the IP address, and the shared disks in the SCSI disk cabinet move with the group]

Demo Step 3: Betty Delivering Service

[Diagram: Betty now runs IIS -> ODBC -> SQL and owns the IP address and shared disks (SQL activity on Betty, none on Alice)]

Demo Step 4: Power Fail Betty, Alice Takeover

[Diagram: Betty loses power; the IP address, IIS, ODBC, and SQL resources fail over to Alice, which takes over the shared disks]

Demo Step 5: Alice Delivering Service

[Diagram: Alice again serves HTTP requests through the cluster IP address, IIS, ODBC, and SQL (SQL activity on Alice); Betty is down]

Demo Step 6: Reboot Betty, Now It Can Take Over

[Diagram: Betty reboots and rejoins the cluster as a standby; Alice continues delivering service (SQL activity on Alice, none on Betty)]

Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A
Cluster and NT Abstractions

[Diagram: the cluster abstractions (Resource, Group, Cluster) parallel the NT abstractions (Service, Node, Domain)]

Basic NT Abstractions

  • Service: program or device managed by a node
    • e.g., file service, print service, database server
    • can depend on other services (startup ordering)
    • can be started, stopped, paused, failed
  • Node: a single (tightly-coupled) NT system
    • hosts services; belongs to a domain
    • services on node always remain co-located
    • unit of service co-location; involved in naming services
  • Domain: a collection of nodes
    • cooperation for authentication, administration, naming
Cluster Abstractions

  • Resource: program or device managed by a cluster
    • e.g., file service, print service, database server
    • can depend on other resources (startup ordering)
    • can be online, offline, paused, failed
  • Resource Group: a collection of related resources
    • hosts resources; belongs to a cluster
    • unit of co-location; involved in naming resources
  • Cluster: a collection of nodes, resources, and groups
    • cooperation for authentication, administration, naming
Resources

Resources have...

  • Type: what it does (file, DB, print, web…)
  • An operational state (online/offline/failed)
  • Current and possible nodes
  • Containing Resource Group
  • Dependencies on other resources
  • Restart parameters (in case of resource failure)
Resource Types
  • Built-in types:
    • Generic Application
    • Generic Service
    • Internet Information Server (IIS) Virtual Root
    • Network Name
    • TCP/IP Address
    • Physical Disk
    • FT Disk (Software RAID)
    • Print Spooler
    • File Share
  • Added by others:
    • Microsoft SQL Server, Message Queues, Exchange Mail Server, Oracle, SAP R/3
    • Your application? (use the developer kit wizard)
Resource States
  • Resource states:
    • Offline: exists, not offering service
    • Online: offering service
    • Failed: not able to offer service
  • Resource failure may cause:
    • local restart
    • other resources to go offline
    • the resource group to move
    • (all subject to group and resource parameters)
  • Resource failure detected by:
    • Polling failure
    • Node failure

[State diagram: Offline, Online Pending, Online, Offline Pending, and Failed states; transitions are driven by "Go Online!" / "Go Off-line!" commands and "I'm Online!" / "I'm Off-line!" / "I'm here!" reports from the resource]

Resource Dependencies

[Diagram: an example dependency tree built from File Share, IIS Virtual Root, Network Name, and IP Address resources, each implemented by a Resource DLL]
  • Similar to NT Service Dependencies
  • Orderly startup & shutdown
    • A resource is brought online after any resources it depends on are online.
    • A resource is taken offline before any resources it depends on.
  • Interdependent resources
    • Form dependency trees
    • move among nodes together
    • failover together
    • as per resource group
NT Registry
  • Stores all configuration information
    • Software
    • Hardware
  • Hierarchical (name, value) map
  • Has an open, documented interface
  • Is secure
  • Is visible across the net (RPC interface)
  • Typical Entry:

\Software\Microsoft\MSSQLServer\MSSQLServer\

DefaultLogin = “GUEST”

DefaultDomain = “REDMOND”
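
For illustration, such an entry can be read through the documented Win32 registry interface; a minimal sketch (assuming the key lives under HKEY_LOCAL_MACHINE, as SQL Server's settings do):

  // Minimal sketch: read the DefaultLogin value from the SQL Server key shown above.
  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      HKEY hKey;
      char login[64];
      DWORD size = sizeof(login), type = 0;

      // Open the key under HKEY_LOCAL_MACHINE with read access.
      if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                        "Software\\Microsoft\\MSSQLServer\\MSSQLServer",
                        0, KEY_READ, &hKey) != ERROR_SUCCESS)
          return 1;

      // Query the (name, value) pair; REG_SZ string values arrive as bytes.
      if (RegQueryValueExA(hKey, "DefaultLogin", NULL, &type,
                           (LPBYTE)login, &size) == ERROR_SUCCESS && type == REG_SZ)
          printf("DefaultLogin = %s\n", login);

      RegCloseKey(hKey);
      return 0;
  }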

Cluster Registry
  • Separate from local NT Registry
  • Replicated at each node
    • Algorithms explained later
  • Maintains configuration information:
    • Cluster members
    • Cluster resources
    • Resource and group parameters (e.g. restart)
  • Stable storage
  • Refreshed from “master” copy when node joins cluster
Other Resource Properties
  • Name
  • Restart policy (restart N times, failover, ...)
  • Startup parameters
  • Private configuration info (resource type specific)
    • Per-node as well, if necessary
  • Poll intervals (LooksAlive, IsAlive, Timeout)
  • These properties are all kept in the Cluster Registry
Resource Groups

  • Every resource belongs to a resource group.
  • Resource groups move (failover) as a unit
  • Dependencies NEVER cross groups. (Dependency trees contained within groups.)
  • Group may contain forest of dependency trees

[Example diagram: a "Payroll Group" containing a Web Server, SQL Server, an IP Address, and drives E: and F:]

Group Properties
  • CurrentState: Online, Partially Online, Offline
  • Members: resources that belong to group
    • members determine which nodes can host group.
  • Preferred Owners: ordered list of host nodes
  • FailoverThreshold: How many faults cause failover
  • FailoverPeriod: Time window for failover threshold
  • FailbackWindowStart: When can failback happen?
  • FailbackWindowEnd: When can failback happen?
  • Everything (except CurrentState) is stored in registry

Failover and Failback
  • Failover parameters
    • timeout on LooksAlive, IsAlive
    • # local restarts in failure window; after this, go offline.
  • Failback to preferred node
    • (during failback window)
  • Do resource failures affect group?

[Diagram: nodes \\Alice and \\Betty each run the Cluster Service; a group containing a network name and IP address fails over from one node to the other and later fails back]

Cluster Concepts: Clusters

[Diagram: a cluster contains several resource groups, each containing resources]

Cluster Properties
  • Defined Members: nodes that can join the cluster
  • Active Members: nodes currently joined to cluster
  • Resource Groups: groups in a cluster
  • Quorum Resource:
    • Stores copy of cluster registry.
    • Used to form quorum.
  • Network: which network is used for communication
  • All properties kept in Cluster Registry
Cluster API Functions(operations on nodes & groups)
  • Find and communicate with Cluster
  • Query/Set Cluster properties
  • Enumerate Cluster objects
    • Nodes
    • Groups
    • Resources and Resource Types
  • Cluster Event Notifications
    • Node state and property changes
    • Group state and property changes
    • Resource state and property changes
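
For example, a management tool can find the cluster and enumerate its nodes and groups with these calls; a minimal sketch (assuming the SDK cluster header and import library, clusapi.h / clusapi.lib):

  // Minimal sketch: open the local cluster and list its nodes and groups.
  #include <windows.h>
  #include <clusapi.h>
  #include <stdio.h>

  static void ListObjects(HCLUSTER hCluster, DWORD dwType, const wchar_t *label)
  {
      HCLUSENUM hEnum = ClusterOpenEnum(hCluster, dwType);
      if (hEnum == NULL)
          return;

      WCHAR name[256];
      DWORD objType, cch, index = 0;
      for (;;)
      {
          cch = sizeof(name) / sizeof(name[0]);
          // ClusterEnum returns ERROR_NO_MORE_ITEMS when the enumeration is exhausted.
          if (ClusterEnum(hEnum, index++, &objType, name, &cch) != ERROR_SUCCESS)
              break;
          wprintf(L"%ls: %ls\n", label, name);
      }
      ClusterCloseEnum(hEnum);
  }

  int wmain(void)
  {
      HCLUSTER hCluster = OpenCluster(NULL);   // NULL = the cluster this node belongs to
      if (hCluster == NULL)
          return 1;

      ListObjects(hCluster, CLUSTER_ENUM_NODE,  L"Node");
      ListObjects(hCluster, CLUSTER_ENUM_GROUP, L"Group");

      CloseCluster(hCluster);
      return 0;
  }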
Demo
  • Server startup and shutdown
  • Installing applications
  • Changing status
  • Failing over
  • Transferring ownership of groups or resources
  • Deleting Groups and Resources
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A
Architecture
  • Top tier provides the cluster abstractions: Failover Manager, Resource Monitor, Cluster Registry
  • Middle tier provides distributed operations: Global Update, Quorum, Membership
  • Bottom tier is NT and drivers: Windows NT Server, Cluster Disk Driver, Cluster Net Drivers

Membership and Regroup
  • Membership: used for orderly addition to and removal from { active nodes }
  • Regroup: used for failure detection (via heartbeat messages) and forceful eviction from { active nodes }


Membership
  • Defined cluster: all nodes
  • Active cluster:
    • Subset of the defined cluster
    • Includes the Quorum Resource
    • Stable (no regroup in progress)


Quorum Resource
  • Usually (but not necessarily) a SCSI disk
  • Requirements:
    • Arbitrates for a resource by supporting the challenge/defense protocol
    • Capable of storing cluster registry and logs
  • Configuration Change Logs
    • Tracks changes to the configuration database when any defined member is missing (not active)
    • Prevents configuration partitions in time
Challenge/Defense Protocol
  • SCSI-2 has reserve/release verbs
    • Semaphore on disk controller
  • Owner gets lease on semaphore
  • Renews lease once every 3 seconds
  • To preempt ownership:
    • Challenger clears semaphore (SCSI bus reset)
    • Waits 10 seconds
      • 3 seconds for renewal + 2 seconds bus settle time
      • x2 to give owner two chances to renew
    • If still clear, then former owner loses lease
    • Challenger issues reserve to acquire semaphore
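
A conceptual sketch of the challenger's side of this protocol (the scsi_* helpers are hypothetical stand-ins for the real SCSI-2 reserve and bus-reset operations, stubbed so the sketch is self-contained; the timings are the 3-second renewal and 10-second wait described above):

  // Conceptual sketch of the challenger's side of quorum arbitration.
  #include <windows.h>
  #include <stdio.h>

  static bool scsi_bus_reset(void)   { return true;  }  // stub: clear any reservation on the bus
  static bool scsi_is_reserved(void) { return false; }  // stub: did the defender re-reserve?
  static bool scsi_try_reserve(void) { return true;  }  // stub: acquire the reservation ourselves

  // Returns true if this node now owns the quorum disk.
  static bool ChallengeQuorumDisk(void)
  {
      // 1. Clear the defender's reservation (SCSI bus reset).
      if (!scsi_bus_reset())
          return false;

      // 2. Wait 10 seconds: 3 s renewal period + 2 s bus settle time,
      //    doubled so a live defender gets two chances to re-reserve.
      Sleep(10 * 1000);

      // 3. If a reservation reappeared, the defender successfully defended its lease.
      if (scsi_is_reserved())
          return false;

      // 4. Otherwise the former owner has lost the lease; issue our own reserve.
      return scsi_try_reserve();
  }

  int main(void)
  {
      printf("challenge %s\n", ChallengeQuorumDisk() ? "succeeded" : "failed");
      return 0;
  }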
Challenge/Defense Protocol: Successful Defense

[Timeline: the defender re-reserves the quorum disk every 3 seconds; the challenger issues a bus reset and waits, detects that a reservation is back in place, and abandons the challenge]
Challenge/Defense Protocol: Successful Challenge

[Timeline: the (failed) defender stops renewing its reservation; the challenger issues a bus reset, waits, detects no reservation, and issues its own reserve to take ownership of the quorum disk]

Regroup
  • Invariant: all members agree on { members }
  • Regroup re-computes { members }
  • Each node sends a heartbeat message to a peer (default is one per second)
  • Regroup runs after two lost heartbeat messages:
    • suspicion that the sender is dead
    • failure detection in bounded time
  • Uses a 5-round protocol to agree
  • Checks communication among nodes
  • A suspected-missing node may survive
  • Upper levels (global update, etc.) are informed of the regroup event


Membership State Machine

[State diagram: states include Initialize, Sleeping, Member Search, Quorum Disk Search, Joining, Forming, Online Member, and Regroup; transitions are labeled Start Cluster, Search Fails, Join Succeeds, Synchronize Succeeds, Acquire (reserve) Quorum Disk Found, Search or Reserve Fails, Lost Heartbeat, Minority or no Quorum, and Non-Minority and Quorum]

Joining a Cluster
  • When a node starts up, it mounts and configures only local, non-cluster devices
  • Starts Cluster Service which
    • looks in local (stale) registry for members
    • Asks each member in turn to sponsor new node’s membership. (Stop when sponsor found.)
  • Sponsor (any active member)
    • Sponsor authenticates applicant
    • Broadcasts applicant to cluster members
    • Sponsor sends updated registry to applicant
    • Applicant becomes a cluster member
Forming a Cluster(when Joining fails)
  • Use registry to find quorum resource
  • Attach to (arbitrate for) quorum resource
  • Update cluster registry from quorum resource
    • e.g. if we were down when it was in use
  • Form new one-node cluster
  • Bring other cluster resources online
  • Let others join your cluster
Leaving A Cluster (Gracefully)
  • Pause:
    • Move all groups off this member.
    • Change to paused state (remains a cluster member)
  • Offline:
    • Move all groups off this member.
    • Sends ClusterExit message to all cluster members
      • Prevents regroup
      • Prevents stalls during departure transitions
    • Close Cluster connections (now not an active cluster member)
    • Cluster service stops on node
  • Evict: remove node from defined member list
Leaving a Cluster (Node Failure)
  • Node (or communication) failure triggers Regroup
  • If after regroup:
    • Minority group OR no quorum device:
      • group does NOT survive
    • Non-minority group AND quorum device:
      • group DOES survive
  • Non-Minority rule:
    • Number of new members >= 1/2 old active cluster
    • Prevents minority from seizing quorum device at the expense of a larger potentially surviving cluster
  • Quorum guarantees correctness
    • Prevents “split-brain”
      • e.g. with newly forming cluster containing a single node
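
The survival test reduces to a simple check (an illustrative sketch of the rule above, not the actual MSCS code):

  // Post-regroup survival test: the surviving set must be a non-minority of the
  // old active membership AND must own the quorum device; otherwise it shuts
  // down rather than risk split-brain.
  static bool ClusterSurvives(int oldActiveMembers, int newMembers, bool ownsQuorumDevice)
  {
      bool nonMinority = (2 * newMembers >= oldActiveMembers);   // new >= 1/2 old
      return nonMinority && ownsQuorumDevice;
  }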
Global Update
  • Propagates updates to all nodes in the cluster
  • Used to maintain the replicated cluster registry
  • Updates are atomic and totally ordered
  • Tolerates all benign failures
  • Depends on membership:
    • all are up
    • all can communicate
  • R. Carr, Tandem Systems Review, V1.2, 1985, sketches the regroup and global update protocols.


Global Update Algorithm

[Diagram: a sender node S sends the update ("X=100!") to the locker node L and receives an ack]
  • Cluster has locker node that regulates updates.
    • Oldest active node in cluster
  • Send Update to locker node
  • Update other (active) nodes
    • in seniority order (e.g. locker first)
    • this includes the updating node
  • Failure of all updated nodes:
    • Update never happened
    • Updated nodes will roll back on recovery
  • Survival of any updated nodes:
    • New locker is oldest and so has update if any do.
    • New locker restarts update
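
A conceptual sketch of that ordering (the node list and SendUpdate() are illustrative placeholders, not the real MSCS internals):

  // Sketch of the global update send path: locker first, then the other active
  // nodes (including the updating node itself) in seniority order.
  #include <cstdio>
  #include <string>
  #include <vector>

  struct Node { int id; };

  // Stub for the RPC that applies one update on one node; returns false on failure.
  static bool SendUpdate(const Node &target, const std::string &update)
  {
      std::printf("node %d applies: %s\n", target.id, update.c_str());
      return true;
  }

  // activeBySeniority is ordered oldest-first, so activeBySeniority[0] is the locker.
  static bool GlobalUpdate(const std::vector<Node> &activeBySeniority, const std::string &update)
  {
      if (activeBySeniority.empty())
          return false;

      // 1. Tell the locker (oldest active node) first; it serializes concurrent updates.
      if (!SendUpdate(activeBySeniority[0], update))
          return false;                      // no node applied it: the update never happened

      // 2. Then every other active node, in seniority order. A node that fails mid-update
      //    is removed by regroup; if any updated node survives, the new locker (the
      //    oldest survivor) has the update and re-drives it to the rest.
      for (size_t i = 1; i < activeBySeniority.size(); ++i)
          SendUpdate(activeBySeniority[i], update);

      return true;
  }

  int main()
  {
      std::vector<Node> members = { {1}, {2}, {3} };   // node 1 is oldest, hence the locker
      GlobalUpdate(members, "X=100");
      return 0;
  }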


Cluster Registry
  • Separate from the local NT Registry
  • Maintains cluster configuration
    • members, resources, restart parameters, etc.
  • Stable storage
  • Replicated at each member
    • Global Update protocol
  • NT Registry keeps a local copy


Cluster Registry Bootstrapping


  • Membership uses Cluster Registry for list of nodes
    • …Circular dependency
  • Solution:
    • Membership uses stale local cluster registry
    • Refresh after joining or forming cluster
    • Master is either
      • quorum device, or
      • active members


Resource Monitor
  • Polls resources:
    • IsAlive and LooksAlive
  • Detects failures:
    • polling failure
    • failure event from the resource
  • Higher levels tell it:
    • Online, Offline
    • Restart


Failover Manager
  • Assigns groups to nodes based on:
    • Failover parameters
    • Possible nodes for each resource in the group
    • Preferred nodes for the resource group


Failover (Resource Goes Offline)

[Flowchart: the Resource Manager detects a resource error and notifies the Failover Manager. If the resource's retry limit has not been exceeded, it attempts to restart the resource locally. Otherwise the Failover Manager checks the Failover Window and Failover Threshold; if the failover conditions are not within those constraints, the group is left in a partially Online state (and waits for the Failback Window). If they are, and another owner can be found (Arbitration), the resource (and its dependants) is switched Offline and the Failover Manager on the new system is notified to bring it Online; if no owner can be found, the group is left partially Online.]

Pushing a Group (Resource Failure)

[Flowchart: the Resource Monitor notifies the Resource Manager of the resource failure. The Resource Manager enumerates all objects in the Dependency Tree of the failed resource and takes each depending resource Offline. If no resource in the tree has "Affect the Group" = True, the group is left in a partially Online state. Otherwise the Resource Manager notifies the Failover Manager that the Dependency Tree is Offline and needs to fail over; the Failover Manager performs Arbitration to locate a new owner for the group, and the Failover Manager on the new owner node brings the resources Online.]

Pulling a Group (Node Failure)

[Flowchart: the Cluster Service notifies the Failover Manager of the node failure. The Resource Manager notifies the Failover Manager that the node is Offline and the groups it owned need to fail over. The Failover Manager determines which groups were owned by the failed node and performs Arbitration to locate a new owner for each. The Failover Manager on the new owner(s) brings the resources Online in dependency order.]

Failback to Preferred Owner Node
  • Group may have a Preferred Owner
  • Preferred Owner comes back online
  • Will only occur during the Failback Window (time slot, e.g. at night)

[Flowchart: the preferred owner comes back Online. If the current time is within the Failback Window, the Resource Manager notifies the Failover Manager that the group is Offline and needs to fail over to the Preferred Owner, and takes each resource on the current owner Offline. The Failover Manager performs Arbitration to locate the Preferred Owner of the group, and the Failover Manager on the Preferred Owner brings the resources Online.]

Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A
Process Structure

[Diagram: within a node, the Cluster Service makes private calls to one or more Resource Monitor processes; each Resource Monitor loads Resource DLLs, which make private calls to the resources they control]

  • Cluster Service
    • Failover Manager
    • Cluster Registry
    • Global Update
    • Quorum
    • Membership
  • Resource Monitor
    • Resource Monitor
    • Resource DLLs
  • Resources
    • Services
    • Applications


Resource Control

  • Commands
    • CreateResource()
    • OnlineResource()
    • OfflineResource()
    • TerminateResource()
    • CloseResource()
    • ShutdownProcess()
  • And resource events

[Diagram: the Cluster Service issues these commands to the Resource Monitor over private calls; the Resource Monitor passes them to the Resource DLL, which controls the resource]


Resource DLLs
  • Calls to Resource DLL
    • Open: get handle
    • Online: start offering service
    • Offline: stop offering service
      • as a standby or
      • pair-is offline
    • LooksAlive: Quick check
    • IsAlive: Thorough check
    • Terminate: Forceful Offline
    • Close: release handle
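
A simplified skeleton of these entry points for a hypothetical "MyApp" resource type (the signatures are illustrative placeholders; real MSCS Resource DLLs use the RESAPI-defined prototypes and report status through callbacks from the Resource Monitor):

  // Simplified skeleton of the per-resource entry points a Resource DLL provides.
  #include <windows.h>

  struct MyAppResource { bool online; };                 // per-resource private state

  void *MyAppOpen(const wchar_t * /*resourceName*/)      // Open: allocate and return a handle
  {
      return new MyAppResource{ false };
  }

  bool MyAppOnline(void *resId)                          // Online: start offering service
  {
      static_cast<MyAppResource *>(resId)->online = true;   // e.g. launch the app here
      return true;
  }

  void MyAppOffline(void *resId)                         // Offline: stop offering service cleanly
  {
      static_cast<MyAppResource *>(resId)->online = false;
  }

  bool MyAppLooksAlive(void *resId)                      // LooksAlive: quick, cheap check
  {
      return static_cast<MyAppResource *>(resId)->online;
  }

  bool MyAppIsAlive(void *resId)                         // IsAlive: thorough check (e.g. probe the app)
  {
      return static_cast<MyAppResource *>(resId)->online;
  }

  void MyAppTerminate(void *resId)                       // Terminate: forceful Offline
  {
      static_cast<MyAppResource *>(resId)->online = false;
  }

  void MyAppClose(void *resId)                           // Close: release the handle
  {
      delete static_cast<MyAppResource *>(resId);
  }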

[Diagram: the Resource Monitor invokes the Resource DLL through these standard calls; the DLL uses private calls to control the actual resource]

Cluster Communications
  • Most communication via DCOM /RPC
  • UDP used for membership heartbeat messages
  • Standard (e.g. Ethernet) interconnects

[Diagram: management apps talk to the Cluster Service via DCOM/RPC; the Cluster Services on the two nodes exchange DCOM/RPC for administration and UDP heartbeats; each Cluster Service drives its Resource Monitors via DCOM/RPC]

Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A
Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC++ Wizard
  • Cluster API
Virtual Servers

[Diagram: a virtual server \\a with IP address 1.2.3.4 can be hosted by either physical node]

  • Problem:
    • Client and Server Applications do not want node name to change when server app moves to another node.
  • A Virtual Server simulates an NT Node
    • Resource Group (name, disks, databases,…)
    • NetName and IP address (node \\a keeps its name and IP address as it moves)
    • Virtual Registry (registry “moves” (is replicated))
    • Virtual Service Control
    • Virtual RPC service
  • Challenges:
    • Limit app to virtual server’s devices and services.
    • Client reconnect on failover (easy if connectionless, e.g. web clients)
Virtual Servers (before failover)
  • Nodes \\Y and \\Z support virtual servers \\A and \\B
  • Things that need to fail over transparently
    • Client connection
    • Server dependencies
    • Service names
    • Binding to local resources
    • Binding to local servers

[Diagram: node \\Y hosts virtual server \\A ("SAP on A": SAP, SQL, drive S:); node \\Z hosts virtual server \\B ("SAP on B": SAP, SQL, drive T:)]

Virtual Servers (just after failover)

[Diagram: after the failover, node \\Z hosts both virtual servers \\A ("SAP on A": SAP, SQL, drive S:) and \\B ("SAP on B": SAP, SQL, drive T:)]

  • \\Y resources and groups (i.e. Virtual Server \\A) moved to \\Z
  • A resources bind to each other and to local resources (e.g., local file system)
    • Registry
    • Physical resource
    • Security domain
    • Time
  • Transactions used to make DB state consistent.
  • To “work”, local resources on \\Y and \\Z have to be similar
    • E.g. time must remain monotonic after failover


Address Failover andClient Reconnection
  • Name and Address rebind to new node
    • Details later
  • Clients reconnect
    • Failure not transparent
    • Must log on again
    • Client context lost (encourages connectionless)
    • Applications could maintain context


Mapping Local References to Group-Relative References
  • Send client requests to correct server
    • \\A\SAP refers to \\.\SQL
    • \\B\SAP refers to \\.\SQL
  • Must remap references:
    • \\A\SAP to \\.\SQL$A
    • \\B\SAP to \\.\SQL$B
  • Also handles namespace collision
  • Done via
    • modifying server apps, or
    • DLLs to transparently rename


Naming and Binding and Failover
  • Services rely on the NT node name and/or IP address to advertise Shares, Printers, and Services.
    • Applications register names to advertise services
    • Example: \\Alice\SQL (i.e. <node><service>)
    • Example: 128.2.2.2:80 (=http://www.foo.com/)
  • Binding
    • Clients bind to an address (e.g. name->IP address)
  • Thus the node name and IP address must failover along with the services (preserve client bindings)
Client to Cluster Communications (IP address mobility based on MAC rebinding)
  • Cluster Clients:
    • Must use IP (TCP, UDP, NBT, ...)
    • Must reconnect or retry after failure
  • Cluster Servers:
    • All cluster nodes must be on the same LAN segment
    • The IP address rebinds to the failover node's MAC address
    • Transparent to client or server
    • Low-level ARP (Address Resolution Protocol) rebinds the IP address to the new MAC address

[Diagram: a WAN client reaches the cluster through a router on the local network; the router maps Alice (200.110.120.4), Virtual Alice (200.110.120.5), Betty (200.110.120.6), and Virtual Betty (200.110.120.7) to node MAC addresses, so a virtual IP address moves by re-binding it to the other node's MAC address]

Time
  • Time must increase monotonically
    • Otherwise applications get confused
    • e.g. make/nmake/build
  • Time is maintained within failover resolution
    • Not hard, since failover on order of seconds
  • Time is a resource, so one node owns the time resource
  • Other nodes periodically correct drift from owner’s time
Application Local NT Registry Checkpointing
  • Resources can request that local NT registry sub-trees be replicated
  • Changes written out to quorum device
    • Uses registry change notification interface
  • Changes read and applied on fail-over
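
A minimal sketch of the registry change-notification interface used for this (the watched sub-tree name is hypothetical, and the real checkpointing is done by the cluster software rather than the application itself):

  // Watch an application sub-tree and wake up whenever a value changes -
  // the point at which a checkpoint could be written to the quorum device.
  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      HKEY hKey;
      if (RegOpenKeyExA(HKEY_LOCAL_MACHINE, "Software\\MyClusterApp",   // hypothetical sub-tree
                        0, KEY_NOTIFY, &hKey) != ERROR_SUCCESS)
          return 1;

      HANDLE hEvent = CreateEventA(NULL, FALSE, FALSE, NULL);
      for (;;)
      {
          // Ask for a signal on the next name or value change anywhere in the sub-tree.
          RegNotifyChangeKeyValue(hKey, TRUE,
                                  REG_NOTIFY_CHANGE_NAME | REG_NOTIFY_CHANGE_LAST_SET,
                                  hEvent, TRUE);
          WaitForSingleObject(hEvent, INFINITE);
          printf("sub-tree changed: write checkpoint\n");   // replicate to the quorum device here
      }
  }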

[Diagram: while virtual server \\A runs on node \\X, each registry update is also written to the quorum device; after failover of \\A to node \\B, the logged changes are applied to \\B's local registry]

Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC++ Wizard
  • Cluster API

Generic Resource DLLs
  • Generic Application DLL
    • Simplest: just starts, stops application, and makes sure process is alive
  • Generic Service DLL
    • Translates DLL calls into equivalent NT Server calls
      • Online => Service Start
      • Offline => Service Stop
      • Looks/IsAlive => Service Status
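
A sketch of that translation using the NT Service Control Manager API ("MyService" is a hypothetical service name; the real Generic Service DLL takes the name from the resource's configuration):

  // Sketch: map resource DLL calls onto Service Control Manager operations.
  #include <windows.h>

  // Helper: open the SCM and the target service with the requested access.
  static bool OpenTarget(DWORD access, SC_HANDLE *scm, SC_HANDLE *svc)
  {
      *scm = OpenSCManagerA(NULL, NULL, SC_MANAGER_CONNECT);
      *svc = *scm ? OpenServiceA(*scm, "MyService", access) : NULL;
      return *svc != NULL;
  }

  static void CloseTarget(SC_HANDLE scm, SC_HANDLE svc)
  {
      if (svc) CloseServiceHandle(svc);
      if (scm) CloseServiceHandle(scm);
  }

  bool GenericServiceOnline(void)                    // Online => Service Start
  {
      SC_HANDLE scm, svc;
      bool ok = OpenTarget(SERVICE_START, &scm, &svc) && StartServiceA(svc, 0, NULL);
      CloseTarget(scm, svc);
      return ok;
  }

  bool GenericServiceOffline(void)                   // Offline => Service Stop
  {
      SERVICE_STATUS status;
      SC_HANDLE scm, svc;
      bool ok = OpenTarget(SERVICE_STOP, &scm, &svc)
             && ControlService(svc, SERVICE_CONTROL_STOP, &status);
      CloseTarget(scm, svc);
      return ok;
  }

  bool GenericServiceIsAlive(void)                   // Looks/IsAlive => Service Status
  {
      SERVICE_STATUS status;
      SC_HANDLE scm, svc;
      bool ok = OpenTarget(SERVICE_QUERY_STATUS, &scm, &svc)
             && QueryServiceStatus(svc, &status)
             && status.dwCurrentState == SERVICE_RUNNING;
      CloseTarget(scm, svc);
      return ok;
  }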
Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC++ Wizard
  • Cluster API
Resource DLL VC++ Wizard
  • Asks for resource type name
  • Asks for optional service to control
  • Asks for other parameters (and associated types)
  • Generates DLL source code
  • Source can be modified as necessary
    • E.g. additional checks for Looks/IsAlive
Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC++ Wizard
  • Cluster API
Cluster API
  • Allows resources to:
    • Examine dependencies
    • Manage per-resource data
    • Change parameters (e.g. failover)
    • Listen for cluster events
    • etc.
  • Specs & API became public Sept 1996
  • On all MSDN Level 3
  • On web site:
    • http://www.microsoft.com/clustering.htm
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A
Research Topics?
  • Even easier to manage
  • Transparent failover
  • Instant failover
  • Geographic distribution (disaster tolerance)
  • Server pools (load-balanced pool of processes)
  • Process pair (active/backup process)
  • 10,000 nodes?
  • Better algorithms
  • Shared memory or shared disk among nodes
    • a truly bad idea?
References

Microsoft NT site: http://www.microsoft.com/ntserver/

BARC site (e.g. these slides): http://research.microsoft.com/~joebar/wolfpack

Custer, H., Inside Windows NT, Microsoft Press, ISBN 155615481.

Carr, R., "Tandem Global Update Protocol," Tandem Systems Review, V1.2, 1985. Sketches the regroup and global update protocols.

Kronenberg, N., Levy, H., Strecker, W., "VAXclusters: A Closely Coupled Distributed System," ACM TOCS, V4.2, 1986. A (the) shared-disk cluster.

Pfister, G. F., In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Prentice Hall, 1995, ISBN 0134376250. Argues for shared nothing.

Gray, J., Reuter, A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages and transaction techniques.