
OpenVMS Cluster Internals & Data Structures, and Using OpenVMS Clusters for Disaster Tolerance

Keith Parris, Systems/Software Engineer, HP

Pre-Conference Seminars: Hosted by Encompass, your HP Enterprise Technology user group
  • For more information on Encompass visit www.encompassUS.org

Part 1: OpenVMS Cluster Internals & Data Structures

Keith Parris, Systems/Software Engineer, HP

Summary of OpenVMS Cluster Features
  • Quorum Scheme to protect against partitioned clusters (a.k.a. Split-Brain Syndrome)
  • Distributed Lock Manager to coordinate access to shared resources by multiple users on multiple nodes
  • Cluster-wide File System for simultaneous access to the file system by multiple nodes
  • Common security and management environment
  • Cluster from the outside appears to be a single system
  • The user environment appears the same regardless of which node the user is logged into
Overview of Topics
  • System Communication Architecture (SCA) and its guarantees
    • Interconnects
  • Connection Manager
    • Rule of Total Connectivity
    • Quorum Scheme
  • Distributed Lock Manager
  • Cluster State Transitions
  • Cluster Server Process
Overview of Topics
  • MSCP/TMSCP Servers
  • Support for redundant hardware
  • Version 8.3 Changes
OpenVMS Cluster Overview
  • An OpenVMS Cluster is a set of distributed systems which cooperate
  • Cooperation requires coordination, which requires communication
Foundation for Shared Access

[Slide diagram, layered top to bottom: Users; Applications; Nodes; Distributed Lock Manager; Connection Manager; Rule of Total Connectivity and Quorum Scheme; Shared resources (files, disks, tapes).]

System Communications Architecture (SCA)
  • SCA governs the communications between nodes in an OpenVMS cluster
System Communications Services (SCS)
  • System Communications Services (SCS) is the name for the OpenVMS code which implements SCA
    • The terms SCA and SCS are often used interchangeably
  • SCS provides the foundation for communication between OpenVMS nodes on a cluster interconnect
Cluster Interconnects
  • SCA has been implemented on various types of hardware:
    • Computer Interconnect (CI)
    • Digital Storage Systems Interconnect (DSSI)
    • Fiber Distributed Data Interface (FDDI)
    • Ethernet (10 megabit, Fast, Gigabit)
    • Asynchronous Transfer Mode (ATM) LAN
    • Memory Channel
    • Galaxy Shared Memory Cluster Interconnect (SMCI)
Interconnects (Storage vs. Cluster)
  • Originally, CI was the one and only Cluster Interconnect for OpenVMS Clusters
    • CI allowed connection of both OpenVMS nodes and Mass Storage Control Protocol (MSCP) storage controllers
  • LANs allowed connections to OpenVMS nodes and LAN-based Storage Servers
  • SCSI and Fibre Channel allow only connections to storage – no communications to other OpenVMS nodes
    • So now we must differentiate between Cluster Interconnects and Storage Interconnects
Interconnects within an OpenVMS Cluster
  • Storage-only Interconnects
    • Small Computer Systems Interface (SCSI)
    • Fibre Channel (FC)
  • Cluster & Storage (combination) Interconnects
    • CI
    • DSSI
    • LAN
  • Cluster-only Interconnects (No Storage)
    • Memory Channel
    • Galaxy Shared Memory Cluster Interconnect (SMCI)
    • ATM LAN
System Communications Architecture (SCA)
  • Each node must have a unique:
    • SCS Node Name
    • SCS System ID
  • Flow control is credit-based
System Communications Architecture (SCA)
  • Layers:
    • SYSAPs
    • SCS
    • Ports
    • Interconnects
LANs as a Cluster Interconnect
  • SCA is implemented in hardware by CI and DSSI port hardware
  • SCA over LANs is provided by Port Emulator software (PEDRIVER)
    • SCA over LANs is referred to as NISCA
      • NI is for Network Interconnect (an early name for Ethernet within DEC, in contrast with CI, the Computer Interconnect)
  • SCA over LANs is presently the center of gravity in OpenVMS Cluster interconnects (with storage on SANs)
SCS with Bridges and Routers
  • If compared with the 7-layer OSI network reference model, SCA has no Routing (what OSI calls Network) layer
  • OpenVMS nodes cannot route SCS traffic on each other’s behalf
  • SCS protocol can be bridged transparently in an extended LAN, but not routed
SCS on LANs
  • Because multiple independent clusters might be present on the same LAN, each cluster is identified by a unique Cluster Group Number, which is specified when the cluster is first formed.
  • As a further precaution, a Cluster Password is also specified. This helps protect against the case where two clusters inadvertently use the same Cluster Group Number. If packets with the wrong Cluster Password are received, errors are logged.
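
As an illustration of these two checks, here is a minimal sketch of the kind of validation a receiving node could apply to an incoming NISCA packet. The structure, field names, and values are assumptions made for the example; they are not the actual PEDRIVER data layout.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative packet header; a real NISCA header differs in layout. */
    struct nisca_hdr {
        unsigned int  cluster_group;        /* Cluster Group Number            */
        unsigned char cluster_password[8];  /* stored scrambled in reality     */
    };

    static unsigned int  my_group = 4242;            /* assumed local settings */
    static unsigned char my_password[8] = "secret1";

    /* Accept only packets from our own cluster; log password mismatches. */
    int accept_packet(const struct nisca_hdr *h)
    {
        if (h->cluster_group != my_group)
            return 0;                          /* some other cluster: ignore   */
        if (memcmp(h->cluster_password, my_password, sizeof my_password) != 0) {
            printf("NISCA: wrong cluster password for group %u - error logged\n",
                   h->cluster_group);
            return 0;
        }
        return 1;
    }
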
Interconnect Preference by SCS
  • When choosing an interconnect to a node, SCS chooses one “best” interconnect type, and sends all its traffic down that one type
      • “Best” is defined as working properly and having the most bandwidth
  • If the “best” interconnect type fails, it will fail over to another
  • OpenVMS Clusters can use multiple LAN paths in parallel
      • A set of paths is dynamically selected for use at any given point in time, based on maximizing bandwidth while avoiding paths that have high latency or that tend to lose packets
Interconnect Preference by SCS
  • SCS tends to select paths in this priority order:
    • Galaxy Shared Memory Cluster Interconnect (SMCI)
    • Gigabit Ethernet
    • Memory Channel
    • CI
    • Fast Ethernet or FDDI
    • DSSI
    • 10-megabit Ethernet
  • OpenVMS also allows the default priorities to be overridden with the SCACP utility
LAN Packet Size Optimization
  • OpenVMS Clusters dynamically probe and adapt to the maximum packet size based on what actually gets through at a given point in time
  • Allows taking advantage of larger LAN packet sizes:
      • Gigabit Ethernet Jumbo Frames
      • FDDI
SCS Flow Control
  • SCS flow control is credit-based
  • Connections start out with a certain number of credits
    • Credits are used as messages are sent
    • A message cannot be sent unless a credit is available
    • Credits are returned as messages are acknowledged
  • This prevents one system from over-running another system’s resources
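
A minimal sketch of this credit accounting, using made-up names and an arbitrary initial credit count rather than the actual SCS implementation:

    #include <stdbool.h>

    /* Illustrative per-connection send-credit accounting. */
    struct scs_connection {
        int send_credits;    /* messages we may still send to the peer */
    };

    /* A connection starts out with a certain number of credits. */
    void connection_open(struct scs_connection *c, int initial_credits)
    {
        c->send_credits = initial_credits;
    }

    /* A message may be sent only if a credit is available; sending uses one. */
    bool try_send_message(struct scs_connection *c)
    {
        if (c->send_credits == 0)
            return false;      /* caller must wait for credits to be returned */
        c->send_credits--;
        /* ... queue the sequenced message for transmission ... */
        return true;
    }

    /* Credits come back as the peer acknowledges our messages. */
    void on_acknowledgement(struct scs_connection *c, int credits_returned)
    {
        c->send_credits += credits_returned;
    }
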
SCS
  • SCS provides “reliable” port-to-port communications
  • SCS multiplexes messages and data transfers between nodes over Virtual Circuits
  • SYSAPs communicate via Connections over Virtual Circuits
Virtual Circuits
  • Formed between ports on a Cluster Interconnect of some flavor
  • Can pass data in 3 ways:
    • Datagrams
    • Sequenced Messages
    • Block Data Transfers
Connections over a Virtual Circuit

[Slide diagram: Node A and Node B are joined by a Virtual Circuit. Over it run Connections between SYSAP pairs: VMS$VAXcluster to VMS$VAXcluster, Disk Class Driver to MSCP Disk Server, and Tape Class Driver to MSCP Tape Server.]

Datagrams
  • “Fire and forget” data transmission method
  • No guarantee of delivery
      • But high probability of successful delivery
  • Delivery might be out-of-order
  • Duplicates possible
  • Maximum size typically 576 bytes
      • SYSGEN parameter SCSMAXDG (max. 985)
  • Example uses for Datagrams:
      • Polling for new nodes
      • Virtual Circuit formation
      • Logging asynchronous errors
Sequenced Messages
  • Guaranteed delivery (no lost messages)
  • Guaranteed ordering (first-in, first-out delivery; same order as sent)
  • Guarantee of no duplicates
  • Maximum size presently 216 bytes
      • SYSGEN parameter SCSMAXMSG (max. 985)
  • Example uses for Sequenced Messages:
      • Lock requests
      • MSCP I/O requests, and MSCP End Messages returning I/O status
Block Data Transfers
  • Used to move larger amounts of bulk data (too large for a sequenced message)
  • Data is mapped into “Named Buffers”
      • which specify location and size of memory area
  • Data movement can be initiated in either direction:
      • Send Data
      • Request Data
  • Example uses for Block Data Transfers:
      • Disk or tape data transfers
      • OPCOM messages
      • Lock/Resource Tree remastering
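
The properties of the three services described above can be summarized in code form. The constants simply restate the sizes mentioned on these slides; this is an illustrative summary, not a real SCS header.

    /* Illustrative summary of the three SCS data-transfer services. */
    enum scs_service {
        SCS_DATAGRAM,        /* no delivery/order guarantee, duplicates possible */
        SCS_SEQUENCED_MSG,   /* guaranteed, in-order, no duplicates              */
        SCS_BLOCK_TRANSFER   /* bulk data moved via pre-mapped Named Buffers     */
    };

    struct scs_service_traits {
        enum scs_service kind;
        int guaranteed_delivery;
        int guaranteed_order;
        int typical_max_bytes;   /* 0 = limited by the Named Buffer size */
    };

    static const struct scs_service_traits scs_traits[] = {
        { SCS_DATAGRAM,       0, 0, 576 },   /* SCSMAXDG, up to 985  */
        { SCS_SEQUENCED_MSG,  1, 1, 216 },   /* SCSMAXMSG, up to 985 */
        { SCS_BLOCK_TRANSFER, 1, 1, 0   },
    };
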
System Applications (SYSAPs)
  • Despite the name, these are pieces of the operating system, not user applications
  • Work in pairs (one on the node at each end of the connection) for specific purposes
  • Communicate using a Connection formed over a Virtual Circuit between nodes
  • Although unrelated to OpenVMS user processes, each is given a “Process Name”
OpenVMS Cluster Data Structuresfor Cluster Communications Using SCS
  • SCS Data Structures
    • SB (System Block)
    • PB (Path Block)
    • PDT (Port Descriptor Table)
    • CDT (Connection Descriptor Table)
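
A rough, illustrative sketch of how these four structures relate to one another (the field names and layout here are simplified assumptions, not the actual OpenVMS definitions):

    struct PDT;   /* Port Descriptor Table: one per local SCS port (e.g. PEA0)   */
    struct SB;    /* System Block: one per remote SCS node (OpenVMS or HSx)      */
    struct PB;    /* Path Block: one per virtual circuit to a given node         */
    struct CDT;   /* Connection Descriptor Table entry: one per SYSAP connection */

    struct PB {
        struct SB  *sb;          /* node this path leads to       */
        struct PDT *local_port;  /* local port the circuit uses   */
        int         load_class;  /* used to pick the "best" path  */
        int         num_connections;
    };

    struct SB {
        char       node_name[8]; /* SCS node name                 */
        unsigned   system_id;    /* SCS system ID                 */
        struct PB *paths;        /* one or more virtual circuits  */
        int        num_paths;
    };

    struct CDT {
        char       local_sysap[16];   /* e.g. "VMS$VAXcluster"    */
        char       remote_sysap[16];
        struct SB *remote_node;       /* node at the far end      */
    };
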
SCS Data Structures:SB (System Block)
  • Data displayed by

SHOW CLUSTER/CONTINUOUS

ADD SYSTEMS/ALL

  • One SB per SCS node
    • OpenVMS node or Controller node which uses SCS
SCS Data Structures:SB (System Block)
  • Data of interest:
    • SCS Node Name and SCS System ID
    • CPU hardware type/model, and HW revision level
    • Software type and version
    • Software incarnation
    • Number of paths to this node (virtual circuits)
    • Maximum sizes for datagrams and sequenced messages sent via this path
SHOW CLUSTER/CONTINUOUS display of System Block (SB)

View of Cluster from system ID 14733 node: ALPHA1          12-SEP-2006 23:07:02

+--------+--------+------------------------------+------------+---------+
|                               SYSTEMS                                 |
+--------+--------+------------------------------+------------+---------+
| NODE   | SYS_ID | HW_TYPE                      | SOFTWARE   | NUM_CIR |
+--------+--------+------------------------------+------------+---------+
| ALPHA1 | 14733  | AlphaServer ES45 Model 2     | VMS V7.3-2 |    1    |
| MARVEL | 14369  | hp AlphaServer GS1280 7/1150 | VMS V8.2   |    1    |
| ONEU   | 14488  | AlphaServer DS10L 617 MHz    | VMS V7.3-2 |    1    |
| MARVL2 | 14487  | AlphaServer GS320 6/731      | VMS V8.3   |    1    |
| MARVL3 | 14486  | AlphaServer GS320 6/731      | VMS V7.3-2 |    1    |
| IA64   | 14362  | HP rx4640 (1.30GHz/3.0MB)    | VMS V8.2-1 |    1    |
| MARVL4 | 14368  | hp AlphaServer GS1280 7/1150 | VMS V8.2   |    1    |
| WLDFIR | 14485  | AlphaServer GS320 6/731      | VMS V8.2   |    1    |
| LTLMRV | 14370  | hp AlphaServer ES47 7/1000   | VMS V8.2   |    1    |
| WILDFR | 14484  | AlphaServer GS320 6/731      | VMS V8.3   |    1    |
| AS4000 | 14423  | AlphaServer 4X00 5/466 4MB   | VMS V7.3-2 |    1    |
+--------+--------+------------------------------+------------+---------+

SHOW CLUSTER/CONTINUOUS display of System Block (SB)

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 21:28:53

+--------+----------+-------------------------+----------+---------+
|                             SYSTEMS                              |
+--------+----------+-------------------------+----------+---------+
| NODE   | SYS_ID   | HW_TYPE                 | SOFTWARE | NUM_CIR |
+--------+----------+-------------------------+----------+---------+
| VX6360 | 29831    | VAX 6000-360            | VMS V7.3 |    2    |
| HSJ501 | ******** | HSJ5                    | HSJ V52J |    1    |
| HSJ401 | ******** | HSJ4                    | HSJ YF06 |    1    |
| HSJ402 | ******** | HSJ4                    | HSJ V34J |    1    |
| HSC951 | 65025    | HS95                    | HSC V851 |    1    |
| HSJ403 | ******** | HSJ4                    | HSJ V32J |    1    |
| HSJ404 | ******** | HSJ4                    | HSJ YF06 |    1    |
| HSJ503 | ******** | HSJ5                    | HSJ V57J |    1    |
| HSJ504 | ******** | HSJ5                    | HSJ V57J |    1    |
| VX6440 | 30637    | VAX 6000-440            | VMS V7.3 |    2    |
| WILDFR | 29740    | AlphaServer GS160 6/731 | VMS V7.3 |    2    |
| WLDFIR | 29768    | AlphaServer GS160 6/731 | VMS V7.3 |    2    |
| VX7620 | 30159    | VAX 7000-620            | VMS V7.3 |    2    |
| VX6640 | 29057    | VAX 6000-640            | VMS V7.3 |    2    |
| MV3900 | 29056    | MicroVAX 3900 Series    | VMS V7.3 |    1    |
+--------+----------+-------------------------+----------+---------+

SCS Data Structures:PB (Path Block)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD CIRCUITS/ALL

  • One PB per virtual circuit to a given node
SCS Data Structures:PB (Path Block)
  • Data of interest:
    • Virtual circuit status
    • Number of connections over this path
    • Local port device name
    • Remote port ID
    • Hardware port type, and hardware revision level
    • Bit-mask of functions the remote port can perform
    • Load class value (so SCS can select the optimal path)
SHOW CLUSTER/CONTINUOUS display of Path Block (PB)

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 21:51:57

+--------+-------------------------------------------------------------------+
| SYSTEMS|                             CIRCUITS                              |
+--------+-------+-------+---------+---------+---------+----------+----------+
| NODE   | LPORT | RPORT | RP_TYPE | NUM_CON | CIR_STA | RP_REVIS | RP_FUNCT |
+--------+-------+-------+---------+---------+---------+----------+----------+
| VX6360 | PAA0  | 13    | CIBCA-B |    0    | OPEN    | 40074002 | FFFF0F00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| VX6440 | PAA0  | 15    | CIXCD   |    6    | OPEN    | 46       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| WILDFR | PAA0  | 14    | CIPCA   |    5    | OPEN    | 20       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| WLDFIR | PAA0  | 11    | CIPCA   |    4    | OPEN    | 20       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| VX7620 | PAA0  | 3     | CIXCD   |    5    | OPEN    | 47       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| VX6640 | PAA0  | 8     | CIXCD   |    5    | OPEN    | 47       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| MV3900 | PEA0  |       | LAN     |    5    | OPEN    | 105      | 83FF0180 |
+--------+-------+-------+---------+---------+---------+----------+----------+

SHOW CLUSTER/CONTINUOUS display of Path Block (PB)

View of Cluster from system ID 14733 node: ALPHA1          12-SEP-2006 23:59:20

+--------+--------------------------------------------------------------+
| SYSTEMS|                           CIRCUITS                           |
+--------+------+-------+---------+-------+---------+----------+--------+
| NODE   | LPOR | RP_TY | CIR_STA | LD_CL | NUM_CON | RP_FUNCT | RP_OWN |
+--------+------+-------+---------+-------+---------+----------+--------+
| ALPHA1 | PEA0 | LAN   | OPEN    |   0   |    0    | 83FF0180 |   0    |
| MARVEL | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
| ONEU   | PEA0 | LAN   | OPEN    |  100  |    4    | 83FF0180 |   0    |
| MARVL2 | PEA0 | LAN   | OPEN    |  100  |    3    | 83FF0180 |   0    |
| MARVL3 | PEA0 | LAN   | OPEN    |  100  |    3    | 83FF0180 |   0    |
| IA64   | PEA0 | LAN   | OPEN    |  100  |    3    | 83FF0180 |   0    |
| MARVL4 | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
| WLDFIR | PEA0 | LAN   | OPEN    |  200  |    4    | 83FF0180 |   0    |
| LTLMRV | PEA0 | LAN   | OPEN    |  100  |    4    | 83FF0180 |   0    |
| WLDFIR | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
| AS4000 | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
+--------+------+-------+---------+-------+---------+----------+--------+

SCS Data Structures:PDT (Port Descriptor Table)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD LOCAL_PORTS/ALL

  • One PDT per local SCS port device
    • LANs all fit under one port: PEA0
SCS Data Structures:PDT (Port Descriptor Table)
  • Data of interest:
    • Port device name
    • Type of port
      • i.e. VAX CI, NPORT CI, DSSI, NI, MC, Galaxy SMCI
    • Local port ID, and maximum ID for this interconnect
    • Count of open connections using this port
    • Pointers to routines to perform various port functions
    • Load Class value
    • Counters of various types of traffic
    • Maximum data transfer size
SHOW CLUSTER/CONTINUOUS display of Port Descriptor Table (PDT) entries

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 22:17:28

+------------+------+--------+------+-----+-----+--------+----------+
|                            LOCAL_PORTS                            |
+------------+------+--------+------+-----+-----+--------+----------+
| LSYSTEM_ID | NAME | LP_STA | PORT | DGS | MSG | LP_TYP | MAX_PORT |
+------------+------+--------+------+-----+-----+--------+----------+
| 29831      | PAA0 | ONLINE |  13  | 15+ | 15+ | CIBCA  |    15    |
| 29831      | PEA0 | ONLINE |   0  | 15+ |  6  | LAN    |   223    |
+------------+------+--------+------+-----+-----+--------+----------+

SCS Data Structures:CDT (Connection Descriptor Table)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD CONNECTIONS/ALL

ADD COUNTERS/ALL

SCS Data Structures:CDT (Connection Descriptor Table)
  • Data of interest:
    • Local and remote SYSAP names
    • Local and remote Connection IDs
    • Connection status
    • Pointer to System Block (SB) of the node at the other end of the connection
    • Counters of various types of traffic
SHOW CLUSTER/CONTINUOUS display of Connection Descriptor Table (CDT) entries

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 22:28:42

+--------+-------------------------------------+--------------------------+
| SYSTEMS|             CONNECTIONS             |         COUNTERS         |
+--------+------------------+------------------+---------+---------+------+
| NODE   | LOC_PROC_NAME    | REM_PROC_NAME    | MSGS_SE | MSGS_RC | CR_W |
+--------+------------------+------------------+---------+---------+------+
| VX6360 | SCS$DIRECTORY    | ?                |         |         |      |
|        | VMS$SDA          | ?                |         |         |      |
|        | VMS$VAXcluster   | ?                |         |         |      |
|        | SCA$TRANSPORT    | ?                |         |         |      |
|        | MSCP$DISK        | ?                |         |         |      |
|        | MSCP$TAPE        | ?                |         |         |      |
| VX6440 | SCA$TRANSPORT    | SCA$TRANSPORT    |   89219 |  193862 |    0 |
|        | VMS$DISK_CL_DRVR | MSCP$DISK        |   95239 |   95243 |    0 |
|        | MSCP$DISK        | VMS$DISK_CL_DRVR |   95251 |   95250 |    0 |
|        | MSCP$TAPE        | VMS$TAPE_CL_DRVR |    1246 |    1246 |    0 |
|        | VMS$TAPE_CL_DRVR | MSCP$TAPE        |    1242 |    1247 |    0 |
|        | VMS$VAXcluster   | VMS$VAXcluster   | 2792737 | 2316397 |  211 |
| WILDFR | MSCP$TAPE        | VMS$TAPE_CL_DRVR |    1994 |    1994 |    0 |
|        | MSCP$DISK        | VMS$DISK_CL_DRVR |  152618 |  152617 |    0 |
|        | VMS$DISK_CL_DRVR | MSCP$DISK        |  985269 |  985274 |    0 |
|        | VMS$TAPE_CL_DRVR | MSCP$TAPE        |    1990 |    1990 |    0 |
|        | VMS$VAXcluster   | VMS$VAXcluster   |  227378 |  439169 |    0 |
+--------+------------------+------------------+---------+---------+------+

Connection Manager
  • The Connection Manager is code within OpenVMS that coordinates cluster membership across events such as:
    • Forming a cluster initially
    • Allowing a node to join the cluster
    • Cleaning up after a node which has failed or left the cluster
  • all the while protecting against uncoordinated access to shared resources such as disks
Rule of Total Connectivity
  • Every system must be able to talk “directly” with every other system in the cluster
    • Without having to go through another system
    • Transparent LAN bridges are considered a “direct” connection
  • If the Rule of Total Connectivity gets broken, the cluster will need to be adjusted so all members can once again talk with one another and meet the Rule.
Quorum Scheme
  • The Connection Manager enforces the Quorum Scheme to ensure that all access to shared resources is coordinated
    • Basic idea: A majority of the potential cluster systems must be present in the cluster before any access to shared resources (i.e. disks) is allowed
  • Idea comes from familiar parliamentary procedures
    • As in human parliamentary procedure, requiring a quorum before doing business prevents two or more subsets of members from meeting simultaneously and doing conflicting business
Quorum Scheme
  • Systems (and sometimes disks) are assigned values for the number of votes they have in determining a majority
  • The total number of votes possible is called the “Expected Votes” – the number of votes to be expected when all cluster members are present
  • “Quorum” is defined to be a simple majority (just over half) of the total possible (the Expected) votes
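
Concretely, OpenVMS computes quorum from Expected Votes with integer arithmetic as (EXPECTED_VOTES + 2) / 2. A small sketch of the calculation (the function name is made up for the example):

    #include <stdio.h>

    /* Quorum is a simple majority of Expected Votes:
       QUORUM = (EXPECTED_VOTES + 2) / 2, using integer division.  */
    static int quorum_for(int expected_votes)
    {
        return (expected_votes + 2) / 2;
    }

    int main(void)
    {
        /* For example, 10 expected votes gives a quorum of 6. */
        for (int ev = 1; ev <= 10; ev++)
            printf("EXPECTED_VOTES=%2d  ->  QUORUM=%d\n", ev, quorum_for(ev));
        return 0;
    }
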
Quorum Schemes
  • In the event of a communications failure,
    • Systems in the minority voluntarily suspend processing, while
    • Systems in the majority can continue to process transactions
Quorum Scheme
  • If a cluster member is not part of a cluster with quorum, OpenVMS keeps it from doing any harm by:
    • Putting all disks into Mount Verify state, thus stalling all disk I/O operations
    • Requiring that all processes can only be scheduled to run on a CPU with the QUORUM capability bit set
    • Clearing the QUORUM capability bit on all CPUs in the system, thus preventing any process from being scheduled to run on a CPU and doing any work
Quorum Schemes
  • To handle cases where there are an even number of votes
    • For example, with only 2 systems,
    • Or half of the votes in the cluster are at each of 2 geographically-separated sites
  • provision may be made for
    • a tie-breaking vote, or
    • human intervention
Quorum Schemes:Tie-breaking vote
  • This can be provided by a disk:
    • Quorum Disk
  • Or an extra system with a vote:
    • Additional cluster member node, called a “Quorum Node”
Quorum Scheme
  • A “quorum disk” can be assigned votes
    • OpenVMS periodically writes cluster membership info into the QUORUM.DAT file on the quorum disk and later reads it to re-check it; if all is well, OpenVMS can treat the quorum disk as a virtual voting member of the cluster
    • Two cluster subsets will “see” each other through their modifications of the QUORUM.DAT file as they poll its contents, and refuse to count the quorum disk’s votes
Quorum Loss
  • If too many systems leave the cluster, there may no longer be a quorum of votes left
  • It is possible to manually force the cluster to recalculate quorum and continue processing if needed
Quorum Recovery Methods
  • Software interrupt at IPL 12 from console
    • IPC> Q
  • Availability Manager or DECamds:
    • System Fix; Adjust Quorum
  • DTCS or BRS integrated tool, using same RMDRIVER (data provider client) interface as Availability Manager and DECamds
Quorum Scheme
  • If two non-cooperating subsets of cluster nodes both achieve quorum and access shared resources, this is known as a “partitioned cluster” (a.k.a. “Split Brain Syndrome”)
  • When a partitioned cluster occurs, the disk structure on shared disks is quickly corrupted
Quorum Scheme
  • Avoid a partitioned cluster by:
    • Proper setting of EXPECTED_VOTES parameter to the total of all possible votes
      • Note: OpenVMS ratchets up the dynamic cluster-wide value of Expected Votes as votes are added
      • When temporarily lowering EXPECTED_VOTES at boot time, also set WRITESYSPARAMS to zero to avoid writing a low value of EXPECTED_VOTES to the Current SYSGEN parameters
    • Make manual quorum adjustments with care:
      • Availability Manager software (or DECamds)
      • System SHUTDOWN with REMOVE_NODE option
Connection Manager and Transient Failures
  • Some communications failures are temporary and transient
    • Especially in a LAN environment
  • To avoid the disruption of unnecessarily removing a node from the cluster, the Connection Manager waits for a time after detecting a communications failure, in case the problem goes away by itself
    • This time is called the Reconnection Interval
      • SYSGEN parameter RECNXINTERVAL
        • RECNXINTERVAL is dynamic and may thus be temporarily raised if needed for something like a scheduled LAN outage
Connection Manager and Communications or Node Failures
  • If the Reconnection Interval passes without connectivity being restored, or if the node has “gone away”, the cluster cannot continue without a reconfiguration
  • This reconfiguration is called a State Transition, and one or more nodes will be removed from the cluster
Optimal Sub-cluster Selection
  • Connection manager compares potential node subsets that could make up surviving portion of the cluster
  • Pick sub-cluster with the most votes
  • If votes are tied, pick sub-cluster with the most nodes
  • If nodes are tied, arbitrarily pick a winner
    • based on comparing SCSSYSTEMID values of the set of nodes with the most-recent cluster software revision
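
These rules can be expressed as a simple comparison function. The sketch below restates the priority order above; the SCSSYSTEMID tie-break is shown only schematically, not as the exact computation OpenVMS performs.

    /* One candidate surviving subset of cluster nodes. */
    struct subcluster {
        int      votes;        /* total votes held by the nodes in this subset */
        int      node_count;   /* number of nodes in this subset               */
        unsigned tiebreak_id;  /* value derived from SCSSYSTEMIDs (schematic)  */
    };

    /* Most votes wins; if votes are tied, most nodes wins; if still tied,
       an arbitrary but deterministic tie-break decides.                       */
    static const struct subcluster *pick_survivor(const struct subcluster *a,
                                                  const struct subcluster *b)
    {
        if (a->votes != b->votes)
            return (a->votes > b->votes) ? a : b;
        if (a->node_count != b->node_count)
            return (a->node_count > b->node_count) ? a : b;
        return (a->tiebreak_id >= b->tiebreak_id) ? a : b;
    }
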
LAN reliability
  • VOTES:
    • Most configurations with satellite nodes give votes to disk/boot servers and set VOTES=0 on all satellite nodes
      • If the sole LAN adapter on a disk/boot server fails, and it has a vote, ALL satellites will leave the cluster
    • Advice: give at least as many votes to node(s) on the LAN as any single server has, or configure redundant LAN adapters
LAN redundancy and Votes

[Slide diagram: nodes on a LAN divided into Subset A and Subset B; three satellite nodes have 0 votes each and two server nodes have 1 vote each. Which subset of nodes is selected as the optimal sub-cluster?]

LAN redundancy and Votes

[Slide diagram: the same configuration (three satellites with 0 votes, two servers with 1 vote each). One possible solution: redundant LAN adapters on servers.]

LAN redundancy and Votes

[Slide diagram: the same configuration, now with 1 vote on each satellite and 2 votes on each server. Another possible solution: enough votes on the LAN to outweigh any single server node.]

Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Slide diagram: a two-site cluster with shadowsets spanning the sites. The two nodes at one site have 1 vote each; the two nodes at the other site have 0 votes. Which subset of nodes is selected as the optimal sub-cluster?]

Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Slide diagram: after the failure, the nodes at the site with 1 vote each continue; the nodes at the 0-vote site CLUEXIT. Without nodes to MSCP-serve its disks, that site's shadowset members go away.]

Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Slide diagram: one possible solution: nodes at one site have 2 votes each and nodes at the other site have 1 vote each, with shadowsets spanning the sites.]

OpenVMS Cluster Data Structuresfor the Connection Manager
  • Connection Manager Data Structures
    • CLUB (Cluster Block)
    • CSB (Cluster System Block)
    • CSV (Cluster System Vector)
Connection Manager Data Structures: CLUB (Cluster Block)
  • Only one CLUB on a given node
  • Pointed to by CLU$GL_CLUB
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD CLUSTER/ALL

Connection Manager Data Structures: CLUB (Cluster Block)
  • Data of interest:
    • Number of cluster members
    • Total number of votes
    • Current cluster-wide effective values of EXPECTED_VOTES and QUORUM
    • Various flags, like whether the cluster has quorum or not
    • Pointer to the Cluster System Block (CSB) for the local system
    • Data used during state transitions, like the identity of State Transition Coordinator Node, and what phase the transition is in
    • Time the cluster was formed, and time of last state transition
SHOW CLUSTER/CONTINUOUS display of Cluster Block (CLUB)

View of Cluster from system ID 14733 node: ALPHA1          13-SEP-2006 00:59:49

+--------+--------+----------+--------+-------------------+-------------------+
|                                   CLUSTER                                   |
+--------+--------+----------+--------+-------------------+-------------------+
| CL_EXP | CL_QUO | CL_VOTES | CL_MEM | FORMED            | LAST_TRANSITION   |
+--------+--------+----------+--------+-------------------+-------------------+
|   10   |   6    |    10    |   11   | 17-MAY-2006 14:48 | 11-SEP-2006 15:18 |
+--------+--------+----------+--------+-------------------+-------------------+

Connection Manager Data Structures:CSB (Cluster System Block)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD MEMBERS/ALL

  • One CSB for each OpenVMS node this node is aware of
Connection Manager Data Structures: CSB (Cluster System Block)
  • Data of interest:
    • Cluster System ID (CSID) of this node
    • Status of this node (e.g. cluster member or not)
    • State of our connection to this node
      • e.g. Newly discovered node, normal connection, communications lost but still within RECNXINTERVAL wait period, lost connection too long so node is no longer in cluster
    • This node’s private values for VOTES, EXPECTED_VOTES, QDSKVOTES, LOCKDIRWT
    • Cluster software version, and bit-mask of capabilities this node has (to allow support of cluster rolling upgrades despite adding new features)
SHOW CLUSTER/CONTINUOUS display of Cluster System Block (CSB)

View of Cluster from system ID 14733 node: ALPHA1          13-SEP-2006 01:09:00

+-------+--------+-------+-----+--------+---------------+---------+----------+
|                                 MEMBERS                                    |
+-------+--------+-------+-----+--------+---------------+---------+----------+
| CSID  | DIR_WT | VOTES | EXP | QUORUM | RECNXINTERVAL | SW_VERS | BITMAPS  |
+-------+--------+-------+-----+--------+---------------+---------+----------+
| 10061 |   1    |   1   |  1  |   1    |      60       |    6    | MINICOPY |
| 1005E |   1    |   1   |  6  |   4    |      60       |    6    | MINICOPY |
| 10032 |   0    |   0   |  5  |   3    |      60       |    6    | MINICOPY |
| 1005D |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10057 |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10064 |   1    |   1   | 10  |   6    |      60       |    6    | MINICOPY |
| 1005F |   1    |   1   |  6  |   4    |      60       |    6    | MINICOPY |
| 10058 |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10060 |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 1005A |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10043 |   1    |   1   |  5  |   3    |      60       |    6    | MINICOPY |
+-------+--------+-------+-----+--------+---------------+---------+----------+

Connection Manager Data Structures: CSV (Cluster System Vector)
  • The CSV is effectively a list of all systems
  • The Cluster System ID (CSID) acts as an index into the CSV
  • Incrementing sequence numbers within CSIDs allows re-use of CSV slots without confusion
    • This is similar to how the sequence number is incremented before a File ID is re-used
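
A sketch of the idea: the CSID combines a slot index with a sequence number, so a stale CSID from a re-used slot is recognizable. The bit split and sizes below are assumptions chosen for illustration; the real CSID layout is internal to OpenVMS.

    /* Illustrative CSID handling: an index into the CSV plus a sequence
       number that is bumped each time the slot is re-used.               */
    #define CSV_SLOTS 256

    struct csv_slot {
        unsigned short sequence;   /* current sequence number for this slot */
        int            in_use;
        /* ... pointer to the member's CSB, etc. ... */
    };

    static struct csv_slot csv[CSV_SLOTS];

    static unsigned make_csid(unsigned short index)
    {
        return ((unsigned)csv[index].sequence << 16) | index;  /* assumed split */
    }

    /* A stale CSID (from a departed member whose slot was re-used) is
       rejected because its sequence no longer matches the slot's.        */
    static int csid_is_current(unsigned csid)
    {
        unsigned short index    = csid & 0xFFFF;
        unsigned short sequence = (unsigned short)(csid >> 16);
        if (index >= CSV_SLOTS)
            return 0;
        return csv[index].in_use && csv[index].sequence == sequence;
    }
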
Distributed Lock Manager
  • The Lock Manager provides mechanisms for coordinating access to physical resources, both for exclusive access and for various degrees of sharing
Distributed Lock Manager
  • Examples of physical resources:
    • Tape drives
    • Disks
    • Files
    • Records within a file
  • as well as internal operating system cache buffers and so forth
Distributed Lock Manager
  • Physical resources are mapped to symbolic resource names, and locks are taken out and released on these symbolic resources to control access to the real resources
Distributed Lock Manager
  • System services $ENQ and $DEQ allow new lock requests, conversion of existing locks to different modes (or degrees of sharing), and release of locks, while $GETLKI allows the lookup of lock information
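
As a brief illustration of these services, the sketch below takes out an Exclusive-mode lock on an application-defined resource name and then releases it. The resource name is made up and error handling is abbreviated; this is a minimal user-mode sketch, not production code.

    #include <stdio.h>
    #include <starlet.h>      /* sys$enqw, sys$deq          */
    #include <lckdef.h>       /* LCK$K_EXMODE and friends   */
    #include <descrip.h>      /* $DESCRIPTOR                */

    /* Lock Status Block: completion status plus the lock ID assigned by $ENQ. */
    struct lksb {
        unsigned short status;
        unsigned short reserved;
        unsigned int   lock_id;
    };

    int main(void)
    {
        struct lksb lksb;
        /* Made-up, application-private resource name. */
        $DESCRIPTOR(resnam, "MYAPP_EXAMPLE_RESOURCE");

        /* Queue an Exclusive-mode lock request and wait until it is granted. */
        int status = sys$enqw(0, LCK$K_EXMODE, (void *)&lksb, 0, &resnam,
                              0, 0, 0, 0, 0, 0, 0);
        if (status & 1)                 /* odd status = service succeeded */
            status = lksb.status;       /* final status is in the LKSB    */
        if (!(status & 1)) {
            printf("$ENQW failed, status %d\n", status);
            return status;
        }

        /* ... access the resource protected by this lock ... */

        /* Release the lock. */
        status = sys$deq(lksb.lock_id, 0, 0, 0);
        return status;
    }
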
OpenVMS ClusterDistributed Lock Manager
  • Physical resources are protected by locks on symbolic resource names
  • Resources are arranged in trees:
    • e.g. File -> Data bucket -> Record
  • Different resources (disk, file, etc.) are coordinated with separate resource trees, to minimize contention
Symbolic lock resource names
  • Symbolic resource names
    • Common prefixes:
      • SYS$ for OpenVMS executive
      • F11B$ for XQP, file system
      • RMS$ for Record Management Services
    • See the book OpenVMS Internals and Data Structures by Ruth Goldenberg, et al
      • Appendix H in Alpha V1.5 version
      • Appendix A in Alpha V7.0 version
Resource names
  • Example: Device Lock
    • Resource name format is
      • “SYS$” {Device Name in ASCII text}

3A3333324147442431245F24535953 SYS$_$1$DGA233:

SYS$  (SYS$ facility)

_$1$DGA233:  (Device name)

Resource names
  • Example: RMS lock tree for an RMS indexed file:
    • Resource name format is
      • “RMS$” {File ID} {Flags byte} {Lock Volume Name}
    • Identify filespec using File ID
    • Flags byte indicates shared or private disk mount
    • Pick up disk volume name
      • This is the label as of the time the disk was mounted
  • Sub-locks are used for buckets and records within the file
Decoding an RMS File Root Resource Name

RMS$t......FDDI_COMMON ...

000000  204E4F4D4D4F435F49444446 02 000000011C74 24534D52

  24534D52                   RMS$          (RMS Facility; RMS File Root Resource)
  00 00 0001 1C74            RVN FilX Sequence_Number File_Number
                                           = File ID [7284,1,0]
  02                         Flags byte    (Disk is mounted /SYSTEM)
  204E4F4D4D4F435F49444446   FDDI_COMMON   (Disk label)

$ dump/header/id=7284/block=count=0 -
disk$FDDI_COMMON:[000000]indexf.sys

Dump of file _DSA100:[SYSEXE]SYSUAF.DAT;1 ...  (File)

RMS Data Bucket Contents

[Slide diagram: an RMS data bucket containing a series of data records.]

RMS Indexed FileBucket and Record Locks
  • Sub-locks of RMS File Lock
    • Have to look at Parent lock to identify file
  • Bucket lock:
    • 4 bytes: VBN of first block of the bucket
  • Record lock:
    • 8 bytes: Record File Address (RFA) of record
Distributed Lock Manager:Locking Modes
  • Different types of locks allow different levels of sharing:
    • EX: Exclusive access: No other simultaneous access allowed
    • PW: Protected Write: Allows writing, with other read-only users allowed
    • PR: Protected Read: Allows reading, with other read-only users; no write access allowed
    • CW: Concurrent Write: Allows writing while others write
    • CR: Concurrent Read: Allows reading while others write
    • NL: Null (future interest in the resource)
  • Locks can be requested, released, and converted between modes
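
The degree of sharing can be summarized with the standard lock-mode compatibility matrix, sketched below in code form (1 means the two modes can be granted on the same resource at the same time); it restates the mode descriptions above.

    /* Lock modes, in increasing order of restrictiveness. */
    enum lock_mode { NL, CR, CW, PR, PW, EX };

    /* compat[held][requested] == 1 if the requested mode can be granted
       while another lock is held in mode 'held' on the same resource.    */
    static const int compat[6][6] = {
        /*              NL  CR  CW  PR  PW  EX */
        /* NL held */ {  1,  1,  1,  1,  1,  1 },
        /* CR held */ {  1,  1,  1,  1,  1,  0 },
        /* CW held */ {  1,  1,  1,  0,  0,  0 },
        /* PR held */ {  1,  1,  0,  1,  0,  0 },
        /* PW held */ {  1,  1,  0,  0,  0,  0 },
        /* EX held */ {  1,  0,  0,  0,  0,  0 },
    };
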
Distributed Lock Manager:Lock Master nodes
  • OpenVMS assigns a single node at a time to keep track of all the resources in a given resource tree, and any locks taken out on those resources
    • This node is called the Lock Master node for that tree
    • Different trees often have different Lock Master nodes
    • OpenVMS dynamically moves Lock Mastership duties to the node with the most locking activity on that tree
Distributed Locks

[Slide diagram: Node A is the Lock Master for resource X. Nodes A, B, and C each hold a lock on resource X; Node A also holds copies of Node B's and Node C's locks on resource X.]

Directory Lookups
  • This is how OpenVMS finds out which node is the lock master
  • Only needed for 1st lock request on a particular resource tree on a given node
    • Resource Block (RSB) remembers master node CSID
  • Basic conceptual algorithm: Hash resource name and index into the lock directory weight vector, which has been created based on LOCKDIRWT values for each node
Lock Directory Weight Vector

Big node: PAPABR Lock Directory Weight = 2

Middle-sized node: MAMABR Lock Directory Weight = 1

Satellite node: BABYBR Lock Directory Weight = 0

Resulting lock directory weight vector: [ PAPABR, PAPABR, MAMABR ]
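
A sketch of how a directory node could be chosen from such a vector. The hash function and the vector-building code here are placeholders for illustration; OpenVMS's actual algorithm differs in detail.

    #include <string.h>

    #define MAX_DIR_ENTRIES 64

    struct node { const char *name; int lockdirwt; };

    static const char *dir_vector[MAX_DIR_ENTRIES];
    static int dir_entries;

    /* Each node appears in the vector LOCKDIRWT times. */
    static void build_directory_vector(const struct node *nodes, int n)
    {
        dir_entries = 0;
        for (int i = 0; i < n; i++)
            for (int w = 0; w < nodes[i].lockdirwt; w++)
                dir_vector[dir_entries++] = nodes[i].name;
    }

    /* Hash the resource name and index into the vector to pick the
       directory node (placeholder hash, not the one OpenVMS uses).   */
    static const char *directory_node_for(const char *resnam, size_t len)
    {
        unsigned hash = 0;
        for (size_t i = 0; i < len; i++)
            hash = hash * 31 + (unsigned char)resnam[i];
        if (dir_entries == 0)
            return NULL;
        return dir_vector[hash % dir_entries];
    }

    /* With PAPABR=2, MAMABR=1, BABYBR=0 the vector is [PAPABR, PAPABR, MAMABR],
       so PAPABR handles roughly two-thirds of directory lookups and BABYBR none. */
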

Lock Directory Lookup

[Slide diagram: a user lock request on the local node is sent to the Directory node, which points to the Lock Master node, where the user request is satisfied.]

Performance Hints
  • Avoid $DEQing a lock which will be re-acquired later
    • Instead, convert to Null lock, and convert back up later
    • Or, create locks as sub-locks of a parent lock that remains held
  • This avoids the need for directory lookups
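
A sketch of the first hint, using the same system services as the earlier example: rather than calling $DEQ, the lock is converted down to NL mode and held, then converted back up when needed (LCK$M_CONVERT identifies the lock by the lock ID already in the lock status block). Error handling is omitted.

    #include <starlet.h>
    #include <lckdef.h>

    struct lksb {
        unsigned short status;
        unsigned short reserved;
        unsigned int   lock_id;    /* must already hold the ID of the lock */
    };

    /* Instead of $DEQ followed later by a fresh $ENQ (which may need a
       directory lookup), convert the held lock down to NL and keep it... */
    static int drop_to_null(struct lksb *lksb)
    {
        return sys$enqw(0, LCK$K_NLMODE, (void *)lksb, LCK$M_CONVERT,
                        0, 0, 0, 0, 0, 0, 0, 0);
    }

    /* ...and convert it back up (here to Protected Write) when needed again. */
    static int reacquire_pw(struct lksb *lksb)
    {
        return sys$enqw(0, LCK$K_PWMODE, (void *)lksb, LCK$M_CONVERT,
                        0, 0, 0, 0, 0, 0, 0, 0);
    }
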
Performance Hints
  • For lock trees which come and go often:
    • A separate program could take out a Null lock on the root resource on each node at boot time, and just hold it “forever”
    • This avoids lock directory lookup operations for that tree
Lock Request Latencies
  • Latency depends on several things:
    • Directory lookup needed or not
      • Local or remote directory node
    • $ENQ or $DEQ operation (acquiring or releasing a lock)
    • Local (same node) or remote lock master node
      • And if remote, the speed of interconnect used
Lock Request Latency: Local vs. Remote
  • Local requests are fastest
  • Remote requests are significantly slower:
    • Code path ~20 times longer
    • Interconnect also contributes latency
    • Total latency up to 2 orders of magnitude (100x) higher than local requests
Lock Request Latency Measurement
  • I used the LOCKTIME program pair written by Roy G. Davis, author of the book VAXcluster Principles, to measure latency of lock requests locally and across different interconnects
  • LOCKTIME algorithm:
    • Take out 5000 locks on remote node, making it lock master for each of 5000 lock trees (requires ENQLM > 5000)
    • Do 5000 $ENQs, lock converts, and $DEQs, and calculate average latencies for each
    • Lock conversion request latency is roughly equivalent to round-trip time between nodes over a given interconnect
  • See LOCKTIME.COM from OpenVMS Freeware V6, [KP_LOCKTOOLS] directory
Lock Request Latency: Local

[Slide diagram: client process on the same node as the Lock Master: 2-4 microseconds.]

Lock Request Latency: Remote

[Slide diagram: client process on a client node reaching the Lock Master node across a Gigabit Ethernet switch: 200 microseconds.]

Lock Mastership
  • Lock mastership node may change for various reasons:
    • Lock master node goes down -- new master must be elected
    • OpenVMS may move lock mastership to a “better” node for performance reasons
      • LOCKDIRWT imbalance found, or
      • Activity-based Dynamic Lock Remastering
      • Lock Master node no longer has interest
Lock Mastership
  • Lock master selection criteria:
    • Interest
      • Only move resource tree to a node which is holding at least some locks on that resource tree
    • Lock Directory Weight (LOCKDIRWT)
      • Move lock tree to a node with interest and a higher LOCKDIRWT
    • Activity Level
      • Move lock tree to a node with interest and a higher average activity level
How to measure locking activity
  • OpenVMS keeps counters of lock activity for each resource tree
    • but not for each of the sub-resources
  • So you can see the lock rate for an RMS indexed file, for example
    • but not for individual buckets or records within that file
  • SDA extension LCK in OpenVMS V7.2-2 and above can show lock rates, and even trace all lock requests if needed. This displays data on a per-node basis.
  • Cluster-wide summary is available using LOCK_ACTV*.COM from OpenVMS Freeware V6, [KP_LOCKTOOLS] directory
Lock Remastering
  • Circumstances under which remastering occurs, and does not:
    • LOCKDIRWT values
      • OpenVMS tends to remaster to node with higher LOCKDIRWT values, never to node with lower LOCKDIRWT
    • Shifting initiated based on activity counters in root RSB
      • PE1 parameter being non-zero can prevent movement or place threshold on lock tree size
    • Shift if existing lock master loses interest
Lock Remastering
  • OpenVMS rules for dynamic remastering decision based on activity levels:
      • assuming equal LOCKDIRWT values
    • 1) Must meet general threshold of at least 80 lock requests so far (LCK$GL_SYS_THRSH)
    • 2) New potential master node must have at least 10 more requests per second than current master (LCK$GL_ACT_THRSH)
Lock Remastering
  • OpenVMS rules for dynamic remastering:
    • 3) Estimated cost to move (based on size of lock tree) must be less than estimated savings (based on lock rate)
      • except if new master meets criteria (2) for 3 consecutive 8-second intervals, cost is ignored
    • 4) No more than 5 remastering operations can be going on at once on a node (LCK$GL_RM_QUOTA)
Lock Remastering
  • OpenVMS rules for dynamic remastering:
    • 5) If PE1 on the current master has a negative value, remastering trees off the node is disabled
    • 6) If PE1 has a positive, non-zero value on the current master, the tree must be smaller than PE1 in size or it will not be remastered
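
Pulling the rules from these three slides together, the decision can be sketched roughly as follows. The structure and the cost/savings inputs are simplifications for illustration; only the named thresholds come from the slides above.

    /* Rough sketch of the dynamic remastering decision for one resource tree,
       assuming equal LOCKDIRWT on both nodes.                                 */

    #define SYS_THRESH 80   /* LCK$GL_SYS_THRSH: minimum lock requests so far   */
    #define ACT_THRESH 10   /* LCK$GL_ACT_THRSH: extra requests/second required */

    struct tree_stats {
        long requests_total;      /* lock requests seen so far on this tree      */
        long requests_per_sec;    /* recent activity rate on this tree           */
        int  sustained_intervals; /* consecutive 8-second intervals over thresh. */
    };

    static int should_remaster(const struct tree_stats *current_master,
                               const struct tree_stats *candidate,
                               long pe1_on_master,       /* PE1 on current master */
                               long tree_size,
                               long est_move_cost, long est_savings)
    {
        if (pe1_on_master < 0)                                /* rule 5 */
            return 0;
        if (pe1_on_master > 0 && tree_size >= pe1_on_master)  /* rule 6 */
            return 0;
        if (candidate->requests_total < SYS_THRESH)           /* rule 1 */
            return 0;
        if (candidate->requests_per_sec <
            current_master->requests_per_sec + ACT_THRESH)    /* rule 2 */
            return 0;
        if (est_move_cost >= est_savings &&
            candidate->sustained_intervals < 3)               /* rule 3 */
            return 0;
        /* rule 4, the limit of 5 concurrent remaster operations per node
           (LCK$GL_RM_QUOTA), is not modeled here */
        return 1;
    }
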
Lock Remastering
  • Implications of dynamic remastering rules:
    • LOCKDIRWT must be equal for lock activity levels to control choice of lock master node
    • PE1 can be used to control movement of lock trees OFF of a node, but not ONTO a node
    • RSB stores lock activity counts, so even high activity counts can be lost if the last lock is DEQueued on a given node and thus the RSB gets deallocated
Lock Remastering
  • Implications of dynamic remastering rules:
    • With two or more nodes of equal horsepower running the same application, lock mastership “thrashing” can occur:
      • 10 more lock requests per second is not much of a difference when you may be doing 100s or 1,000s of lock requests per second
      • Whichever new node becomes lock master may then see its own lock rate slow somewhat due to the remote lock request workload
How to Detect Lock Mastership Thrashing
  • Detection of remastering activity
    • MONITOR RLOCK in 7.3 and above (not 7.2-2)
    • SDA> SHOW LOCK/SUMMARY in 7.2 and above
    • Change of mastership node for a given resource
    • Check message counters under SDA:
      • SDA> EXAMINE PMS$GL_RM_RBLD_SENT
      • SDA> EXAMINE PMS$GL_RM_RBLD_RCVD
        • Counts which increase suddenly by a large amount indicate remastering of large tree(s)
          • SENT: Off of this node
          • RCVD: Onto this node
        • See example procedures WATCH_RBLD.COM and RBLD.COM
SDA> SHOW LOCK/SUMMARY

$ analyze/system

OpenVMS (TM) Alpha system analyzer

SDA> show lock/summary
...
Lock Manager Performance Counters:
----------------------------------
...
Lock Remaster Counters:
    Tree moved to this node                    0
    Tree moved to another node                 0
    Tree moved due to higher Activity          0
    Tree moved due to higher LOCKDIRWT         0
    Tree moved due to Single Node Locks        0
    No Quota for Operation                     0
    Proposed New Manager Declined              0
    Operations completed                       0
    Remaster Messages Sent                     0
    Remaster Messages Received                 0
    Remaster Rebuild Messages Sent             0
    Remaster Rebuild Messages Received         0

MONITOR RLOCK

                          OpenVMS Monitor Utility
                   DYNAMIC LOCK REMASTERING STATISTICS
                              on node KEITH
                          6-MAY-2002 18:42:35.07

                                       CUR      AVE      MIN      MAX
    Lock Tree Outbound Rate           0.00     0.00     0.00     0.00
         (Higher Activity)            0.00     0.00     0.00     0.00
         (Higher LCKDIRWT)            0.00     0.00     0.00     0.00
         (Sole Interest)              0.00     0.00     0.00     0.00
    Remaster Msg Send Rate            0.00     0.00     0.00     0.00
    Lock Tree Inbound Rate            0.00     0.00     0.00     0.00
    Remaster Msg Receive Rate         0.00     0.00     0.00     0.00

How to Prevent Lock Mastership Thrashing
  • Unbalanced node power
  • Unequal workloads
  • Unequal values of LOCKDIRWT
  • Non-zero values of PE1
Distributed Lock Manager: Resiliency
  • The Lock Master node for a given resource tree knows about all locks on resources in that tree from all nodes
  • Each node also keeps track of its own locks

Therefore:

  • If the Lock Master node for a given resource tree fails,
    • OpenVMS moves Lock Mastership duties to another node, and each node with locks tells the new Lock Master node about all their existing locks
  • If any other node fails, the Lock Master node frees (releases) any locks the now-departed node may have held
Distributed Lock Manager: Scalability
  • For the first lock request on a given resource tree from a given node, OpenVMS hashes the resource name to pick a node (called the Directory node) to ask about locks
  • If the Directory Node does not happen to be the Lock Master node for that tree, it replies telling which node IS the Lock Master node for that tree
  • As long as a node holds ANY locks on a resource tree, it remembers which node is the Lock Master for the tree, so it can avoid doing directory lookups for that tree
Distributed Lock Manager: Scalability
  • Thus, regardless of the number of nodes in a cluster, it takes AT MOST two off-node requests to resolve a lock request:
    • More often, only one request is needed (because we already know which node is the Lock Master), and
    • The majority of the time, NO off-node requests are needed, because OpenVMS tends to move Lock Master duties to the node with the most locking activity on the tree
  • Because Lock Directory and Lock Mastership duties are distributed among the nodes, this helps keep any single node from becoming a bottleneck in locking operations
Deadlock searches
  • The OpenVMS Distributed Lock Manager automatically detects lock deadlock conditions, and generates an error to one of the programs causing the deadlock
Deadlock searches
  • Deadlock searches can consume significant elapsed time and interrupt-state CPU time
  • MONITOR, T4, DECps, etc. can identify when deadlock searches are occurring
  • DEADLOCK_WAIT parameter controls how long we wait before starting a deadlock search
Dedicated-CPU Lock Manager
  • With 7.2-2 and above, you can choose to dedicate a CPU to do lock management work. This may help reduce MP_Synch time.
  • Using this can be helpful when:
    • You have more than 5 CPUs in the system, and
    • You’re already wasting more than a CPU’s worth in MP_Synch time contending for the LCKMGR spinlock
      • See SDA> SPL extension and SYS$EXAMPLES:SPL.COM
  • LCKMGR_MODE parameter:
    • 0 = Disabled
    • >1 = Enable if at least this many CPUs are running
  • LCKMGR_CPUID parameter specifies which CPU to dedicate to LCKMGR_SERVER process
OpenVMS Cluster Data Structuresfor the Distributed Lock Manager
  • Distributed Lock Manager
    • RSB (Resource Block)
    • LKB (Lock Block)
  • Data displayed by

$ ANALYZE/SYSTEM

SDA> SHOW RESOURCE

SDA> SHOW LOCK

Distributed Lock Manager Data Structures:RSB (Resource Block)
  • Data of interest:
    • Resource name and length
    • UIC Group (0 means System)
    • Access Mode (Kernel, Executive, Supervisor, User)
    • Resource Master (aka Lock Master) node’s CSID (0=local mastership)
    • Count of locks on this resource
    • Parent resource block, and depth within tree (0 for a Root RSB)
    • Count and queue of sub-resources, if any
    • Hash value for lookups via resource hash table
    • Activity counters used for dynamic lock remastering
Distributed Lock Manager Data Structures:LKB (Lock Block)
  • Data of interest:
    • Pointer to associated resource block (RSB)
    • Lock mode (NL, CR, CW, PR, PW, or EX) and access mode (Kernel, Executive, Supervisor, User)
    • Process ID
    • State (e.g. granted, waiting, in conversion queue)
    • Parent lock, if any
    • Number of sub-locks
    • Lock ID of this lock
    • CSID of lock master node, and remote Lock ID there
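
An illustrative sketch of how these two structures relate; the field names and types are simplified assumptions, not the actual OpenVMS definitions.

    /* Simplified, illustrative views of the lock manager structures. */
    struct RSB {                       /* Resource Block                        */
        char          resnam[31];      /* resource name                         */
        unsigned char resnam_len;
        unsigned      master_csid;     /* 0 = mastered locally                  */
        int           lock_count;      /* locks on this resource                */
        struct RSB   *parent;          /* NULL for a root RSB                   */
        int           depth;           /* 0 for a root RSB                      */
        long          activity;        /* counters used for dynamic remastering */
    };

    struct LKB {                       /* Lock Block                            */
        struct RSB   *rsb;             /* resource this lock is on              */
        int           granted_mode;    /* NL, CR, CW, PR, PW or EX              */
        unsigned      pid;             /* owning process                        */
        int           state;           /* granted / waiting / converting        */
        struct LKB   *parent_lock;     /* NULL if none                          */
        unsigned      lock_id;         /* this lock's ID                        */
        unsigned      master_csid;     /* lock master node                      */
        unsigned      remote_lock_id;  /* corresponding ID on the master        */
    };
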
Progression of Lock Manager Improvements

VAX/VMS version 4.0:

  • Introduced VAXclusters and Distributed Lock Manager
Progression of Lock Manager Improvements

VAX/VMS version 5.0:

  • Problem recognized:
    • In every cluster state transition, a Full Rebuild operation of the cluster-wide distributed lock database was required -- basically throwing away all Master copies of locks, selecting new resource masters, and then re-acquiring all locks, rebuilding the lock directory information in the process. This could take multiple minutes to complete.
  • Solution:
    • Partial and Merge rebuilds introduced
Progression of Lock Manager Improvements

VAX/VMS version 5.2:

  • Problem recognized:
    • Resource mastership remains stuck on one node.
      • Even if that node lost all interest in a resource tree, as long as there was even 1 lock anywhere in the cluster, that node remained the resource master.
      • As nodes in a cluster rebooted one by one, the remaining nodes tended to accumulate resource mastership duties over time, resulting in unbalanced loads
  • Solution:
    • Resource tree remastering is introduced. Resource trees can now be moved between nodes. LOCKDIRWT parameter allows bias or “weighting” toward or away from specific nodes.
Progression of Lock Manager Improvements

VAX/VMS version 5.4:

  • Problem recognized:
    • Remastering of resource trees after a node leaves the cluster adds to the impact of state transitions
  • Solution:
    • Resource trees are proactively remastered away from a node before it completes shutdown, minimizing the cleanup work afterward
Progression of Lock Manager Improvements

VAX/VMS version 5.4:

  • Problem recognized:
    • Lock rebuild operation is single-threaded, stretching out lock rebuild times and increasing the adverse impact of a cluster state transition
  • Solution:
    • Lock rebuild operation is made to be multi-threaded
Progression of Lock Manager Improvements

OpenVMS version 5.5:

  • Problem recognized:
    • Lock-request latency is lowest on the node which is the resource master, but VMS had no way of moving a resource tree to a busier node based on relative lock-request activity levels
  • Solution:
    • Activity-based resource tree remastering is introduced
Progression of Lock Manager Improvements

OpenVMS version 5.5-2:

  • Problem recognized:
    • Original algorithms were a bit too eager to remaster resource trees
  • Solution:
    • Scans for remastering activity are now done once every 8 seconds instead of once per second
Progression of Lock Manager Improvements

OpenVMS Alpha version 7.2:

  • Problem recognized:
    • System (S0/S1) address space is now often in short supply. Lock Manager data structures sometimes take up half of system address space on customers’ systems.
  • Solution:
    • Data structures for lock manager are moved into S2 (64-bit) address space.
Progression of Lock Manager Improvements

OpenVMS version 7.3 (and 7.2-2):

  • Problem recognized:
    • Remastering of large resource trees can take many seconds, even minutes. It can take tens of thousands of SCS sequenced messages to complete a remastering operation. Locking is blocked during a remastering operation, so that adversely impacts customer response times.
  • Solution:
    • Resource tree remastering is modified so it can use block data transfers instead of sequenced messages. This results in a reduction in resource tree remastering times by a factor of 10 to 20.
Progression of Lock Manager Improvements

OpenVMS version 8.3:

  • Problem recognized:
    • In clusters with multiple nodes of the same horsepower running the same applications, lock tree mastership can “thrash” around among the nodes.
      • One major cause is a hard-coded constant of only 10 lock requests per second needed to trigger a remastering operation.
  • Solution:
    • Algorithms for activity-based remastering are improved to add hysteresis and allow control with a new SYSGEN parameter LOCKRMWT.
Progression of Lock Manager Improvements

OpenVMS version 8.3:

  • Problem recognized:
    • LOCKDIRWT parameter got overloaded as of version 5.2 with two meanings:
      • 1) Lock Directory Weighting – relative amount of lock directory work this node should perform
      • 2) Lock Remastering Weighting – relative priority of this node in mastering shared resource trees
  • Solution:
    • LOCKRMWT parameter introduced to cover (2). LOCKDIRWT returns to its original meaning (1).
Progression of Lock Manager Improvements

OpenVMS version 8.3:

  • Problem recognized:
    • Some applications (e.g. Rdb) encounter deadlocks on a regular basis as a part of normal operation. The minimum value of 1 second for the DEADLOCK_WAIT parameter means it can take too long to detect these deadlocks.
  • Solution:
    • The deadlock waiting period can be set on a per-process basis to as little as one-hundredth of a second if needed.
Cluster State Transition Topics
  • Purpose of cluster state transitions
  • The Distributed Lock Manager’s role in State Transitions
  • Things which trigger state transitions, and how to avoid them
  • Failures and failure detection mechanisms
  • Reconnection Interval
Cluster State Transition Topics
  • Effects of a state transition
  • Sequence of events in a state transition
  • State-transition-related issues
  • Minimizing impact of state transitions
Distributed Lock Manager and Cluster State Transitions
  • The Distributed Lock Manager keeps information about locks and resources distributed around the cluster
  • Lock and Resource data may be affected by a node joining or leaving the cluster:
    • A node leaving may have had locks which need to be freed
    • Lock Master duties may need to be reassigned
    • Lock Directory Lookup distribution may also be affected when a node joins or leaves
Lock Rebuild Types
  • When a node joins or leaves the cluster, the distributed resource/lock database needs to be rebuilt.
  • Lock requests may need to be temporarily stalled while the data is redistributed
  • There are 4 types of lock rebuild operations:
    • Merge
    • Partial
    • Directory
    • Full
lock rebuild types160
Lock Rebuild Types
  • Merge:
    • Enter already-held lock tree into directory on directory node
      • Lock requests not stalled
  • Partial:
    • Release locks held by departed node, and remaster any resource trees it was resource master for
lock rebuild types161
Lock Rebuild Types
  • Directory:
    • Do what a Partial rebuild does (clean up after departed node), and also rebuild the directory information
  • Full:
    • Throw away all master copies of locks, select new resource masters, and then re-acquire locks, rebuilding directory information in the process
state transition triggers
State Transition Triggers
  • A state transition is needed when:
    • A new member is added to the cluster
    • An existing member is removed from the cluster
    • An adjustment related to votes is needed:
      • shutdown with REMOVE_NODE option
      • quorum disk validation or invalidation
      • quorum is adjusted manually via:
        • IPC> Q routine
        • Availability Manager / DECamds Quorum Fix via RMDRIVER
        • $SET CLUSTER/EXPECTED_VOTES command in DCL
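For illustration, the DCL route is a single command; a minimal sketch (the value 3 is purely illustrative and must reflect the votes you actually expect to be present):

$ SET CLUSTER/EXPECTED_VOTES=3

OpenVMS then recalculates quorum from the new EXPECTED_VOTES value, which is what triggers the vote-related state transition described above.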
new member detection on ci or dssi
New-member detection on CI or DSSI

[Timing diagram] Example with PANUMPOLL = 2, PAMAXPORT = 3, and PAPOLLINTERVAL = 5:

  • At time t=0, the port polls node IDs 0 and 1, sending a Request ID to each and receiving an ID response in return.
  • After PAPOLLINTERVAL (at t=5), it polls node IDs 2 and 3 in the same way.
  • After another PAPOLLINTERVAL (at t=10), polling wraps around to node ID 0 again.

virtual circuit formation on ci or dssi
Virtual Circuit Formation on CI or DSSI

[Diagram] Message exchange between an existing cluster member and a new node:

  • Each side sends a Request ID to the other and receives an ID response.
  • The virtual circuit is then opened with a Start, Start Acknowledge, Acknowledge handshake.

new member detection on ethernet or fddi
New-member detection on Ethernet or FDDI

[Diagram] Message exchange between the local node and a remote node:

  • The nodes discover each other via multicast Hello or Solicit-Service messages.
  • Channel-Control handshake: Channel-Control Start, Verify, Verify Acknowledge.
  • SCS handshake: Start, Start Acknowledge, Acknowledge.

failures
Failures
  • Types of failures:
    • Node failure
    • Adapter failure
    • Interconnect failure
  • Failure examples:
    • Power supply failure in network equipment
    • Operating system crash
    • Bridge reboot for firmware update, and spanning-tree reconfiguration which results
failure and repair recovery within reconnection interval
Failure and Repair/Recovery within Reconnection Interval

[Timeline] A failure occurs, and some time later it is detected and the virtual circuit is broken; this starts the RECNXINTERVAL timer. The problem is fixed and the repaired state is detected before RECNXINTERVAL expires, so the virtual circuit is simply re-opened and no node is removed from the cluster.

hard failure
Hard Failure

[Timeline] A failure occurs and is then detected, breaking the virtual circuit and starting the RECNXINTERVAL timer. RECNXINTERVAL expires without the problem being fixed, so a state transition occurs and the node is removed from the cluster.

late recovery
Late Recovery

[Timeline] A failure occurs and is detected, breaking the virtual circuit. RECNXINTERVAL expires, a state transition occurs, and the node is removed from the cluster. The problem is fixed only after that: when the fix is detected and the node learns it has been removed from the cluster, it does a CLUEXIT bugcheck.

failure detection mechanisms
Failure Detection Mechanisms
  • Mechanisms to detect a node or communications failure
    • Last-Gasp Datagram
    • Periodic checking
      • Multicast Hello packets on LANs
      • Polling on CI and DSSI
      • TIMVCFAIL check
timvcfail mechanism
TIMVCFAIL Mechanism

[Timeline, normal case] At time t=0 the local node sends a request to the remote node and receives a response. At about t=2/3 of TIMVCFAIL it sends another request and again receives a response, so the virtual circuit remains open.

timvcfail mechanism173
TIMVCFAIL Mechanism

[Timeline, failure case] At time t=0 the local node sends a request and receives a response (1). The remote node then fails at some point during the following period. The request sent at t=2/3 of TIMVCFAIL goes unanswered (2), and at t=TIMVCFAIL the virtual circuit is declared broken; the failure is thus detected no more than TIMVCFAIL after the last successful exchange.

sequence of events during a state transition
Sequence of Events During a State Transition
  • Determine new cluster configuration
    • If quorum is lost:
      • QUORUM capability bit removed from all CPUs
        • no process can be scheduled to run
      • Disks all put into mount verification
    • If quorum is not lost, continue…
  • Rebuild lock database
    • Stall lock requests
    • I/O synchronization
    • Do rebuild work
    • Resume lock handling
measuring state transition effects
Measuring State Transition Effects
  • Determine the type of the last lock rebuild:

$ ANALYZE/SYSTEM

SDA> READ SYS$LOADABLE_IMAGES:SCSDEF

SDA> EVALUATE @(@CLU$GL_CLUB + CLUB$B_NEWRBLD_REQ) & FF

Hex = 00000002 Decimal = 2 ACP$V_SWAPPRV

  • Rebuild type values:
    • Merge (locking not disabled)
    • Partial
    • Directory
    • Full
measuring state transition effects176
Measuring State Transition Effects
  • Determine the duration of the last lock request stall period:

SDA> DEFINE TOFF = @(@CLU$GL_CLUB+CLUB$L_TOFF)

SDA> DEFINE TON = @(@CLU$GL_CLUB+CLUB$L_TON)

SDA> EVALUATE TON-TOFF

Hex = 00000000.00000044 Decimal = 68 ADP$L_PROBE_CMD

This value is in units of 10-millisecond clock ticks (hundredths of a second), so in this example the lock request stall period lasted 0.68 seconds. (As in the earlier example, the symbol name SDA prints next to the value is just a coincidental match and can be ignored.)

minimizing impact of state transitions
Minimizing Impact of State Transitions
  • Configurations issues:
    • Few (e.g. exactly 3) nodes
    • Quorum node; no quorum disk
    • Isolate cluster interconnects from other traffic
  • Operational issues:
    • Avoid disk rebuilds on reboot (see the example after this list)
    • Let Last-Gasp work
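As a sketch of the disk-rebuild point above (the device name and volume label are hypothetical), volumes can be mounted with /NOREBUILD in startup procedures and rebuilt later at a quiet time; the ACP_REBLDSYSD parameter can similarly be set to 0 to defer the system-disk rebuild at boot:

$ MOUNT/SYSTEM/NOREBUILD $1$DGA100: DATA01    ! in startup command procedures
$ SET VOLUME/REBUILD $1$DGA100:               ! later, at a convenient time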
system parameters which affect state transition impact
System Parameters which affect State Transition Impact
  • Failure detection time:
    • TIMVCFAIL can be set as low as 100 (1 second) if needed
      • If cluster interconnect is the LAN, the minimum value for TIMVCFAIL is a value one second longer than the PEDRIVER Listen Timeout. The PEDRIVER Listen Timeout value may be viewed using SDA:

$ ANALYZE/SYSTEM

SDA> SHOW PORTS            ! This is done first to define the symbols

SDA> SHOW PORT/ADDR=PE_PDT

and look for output like:

"Listen Timeout       8     Hello Interval     150"

  • Reconnection interval after failure detection:
    • RECNXINTERVAL can be set as low as 1 second if needed
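A minimal SYSGEN sketch of checking and lowering these values on a running system (the values are illustrative, assume the 8-second PEDRIVER Listen Timeout shown above, and assume both parameters are dynamic on your version; otherwise WRITE CURRENT and a reboot would be needed):

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW TIMVCFAIL
SYSGEN> SET TIMVCFAIL 900        ! hundredths of a second: 9 seconds = Listen Timeout + 1
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SET RECNXINTERVAL 10     ! seconds
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT

For permanent changes, the usual practice is to put the values in MODPARAMS.DAT and run AUTOGEN.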
cluster server process csp
Cluster Server process (CSP)
  • Runs on each node
  • Assists with cluster-wide operations that require process context on a remote node, such as:
    • Mounting and dismounting volumes:
      • $MOUNT/CLUSTER and $DISMOUNT/CLUSTER
    • $BRKTHRU system service and $REPLY command
    • $SET TIME/CLUSTER
    • Distributed OPCOM communications
    • Interface between SYSMAN and SMISERVER on remote nodes
    • Cluster-Wide Process Services (CWPS)
      • Allow startup, monitoring, and control of processes on remote nodes
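For example, two of the operations above are ordinary DCL commands whose cluster-wide behavior is carried out by the CSP on each remote node (the device name and label here are hypothetical):

$ MOUNT/CLUSTER/SYSTEM $1$DGA200: APPDATA    ! CSP performs the mount on every other member
$ SET TIME/CLUSTER                           ! CSP sets the time on every cluster member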
mscp tmscp servers
MSCP/TMSCP Servers
  • Implement Mass Storage Control Protocol
  • Provide access to disks [MSCP] and tape drives [TMSCP] for systems which do not have a direct connection to the device, through a node which does have direct access and has [T]MSCP Server software loaded
  • OpenVMS includes both an MSCP server for disks and a TMSCP server for tapes
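Serving is typically enabled with SYSGEN parameters; a sketch of possible MODPARAMS.DAT entries follows (treat the serve-all value as a placeholder, since the appropriate settings depend on your configuration and OpenVMS version):

MSCP_LOAD = 1        ! load the MSCP disk server at boot
MSCP_SERVE_ALL = 1   ! which disks to serve; check the documentation for value meanings
TMSCP_LOAD = 1       ! load the TMSCP tape server at boot

These would then be applied with AUTOGEN and a reboot.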
mscp tmscp servers181
MSCP/TMSCP Servers
  • MSCP disk and tape controllers (e.g. HSJ80, HSD30) also include a [T]MSCP Server, and talk the SCS protocol, but do not have a Connection Manager, so are not cluster members
openvms support for redundant hardware
OpenVMS Support for Redundant Hardware
  • OpenVMS is very good about supporting multiple (redundant) pieces of hardware
    • Nodes
    • LAN adapters
    • Storage adapters
    • Disks
      • Volume Shadowing provides RAID-1 (mirroring); see the mount example after this list
  • OpenVMS is very good at automatic:
    • Failure detection
    • Fail-over
    • Fail-back
    • Load balancing or load distribution
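As one illustration of the shadowing item above, a two-member host-based shadow set might be mounted like this (the virtual unit, device names, and label are hypothetical):

$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA101:, $1$DGA201:) DATA01

OpenVMS then handles failure of a member, copy/merge operations when a member is returned, and distribution of reads across the members automatically.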
direct vs mscp served paths185
Direct vs. MSCP-Served Paths

[Diagram: two cluster nodes, two Fibre Channel switches, and a shadowset, illustrating direct vs. MSCP-served paths to storage]

openvms 8 3 changes improved remastering algorithm
OpenVMS 8.3 Changes: Improved Remastering Algorithm
  • New SYSGEN parameter LOCKRMWT (Lock Remaster Weight)
    • Value can be from 0 to 10
      • 0 means node never wants to be resource master
      • 10 means node always wants to be resource master
      • Values between 1 and 9 indicate a relative bias toward being resource master
    • Default value is 5
    • Difference between values on different nodes influences remastering activity (more on this shortly)
    • Parameter is dynamic
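Since the parameter is dynamic, it can be examined and adjusted on a running node; a minimal SYSGEN sketch (the value 7 is purely illustrative):

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW LOCKRMWT
SYSGEN> SET LOCKRMWT 7
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT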
openvms 8 3 changes improved remastering algorithm190
OpenVMS 8.3 Changes: Improved Remastering Algorithm
  • New SYSGEN parameter LOCKRMWT (Lock Remaster Weight)
    • If a node with LOCKRMWT of 0-9 receives a lock request from a node with a LOCKRMWT of 10, the tree will be marked for remastering to the node with a value of 10
    • If a node with LOCKRMWT of 0 receives a lock request from a node with any higher value, the tree will be marked for remastering
    • If a node with LOCKRMWT of 1-9 receives a lock request from another node with LOCKRMWT of 1-9, it computes the difference
      • Difference will be in the range of +8 to -8
      • A calculation based on this difference produces a bias expressed in relative percentages of lock activity (rather than fixed request rates)
openvms 8 3 changes improved remastering algorithm192
OpenVMS 8.3 Changes: Improved Remastering Algorithm
  • Examples:
    • If a node with LOCKRMWT of 9 receives a lock request from a node with a value of 1 (Local-minus-Remote difference of +8), the resource tree will be marked for remastering only if the node with a value of 1 is generating at least 3 times as many (200% more) lock requests
    • If a node with LOCKRMWT of 1 receives a lock request from another node with LOCKRMWT of 9 (Local-minus-Remote difference of -8), the resource tree will be marked for remastering if the node with LOCKRMWT of 9 is generating at least half (50%) as many lock requests as the node with LOCKRMWT of 1
    • When all nodes have the default LOCKRMWT value of 5:
      • Another node must be generating at least 13% more lock requests than the current resource master before the lock tree will be marked for remastering
      • This introduces some “hysteresis”, to prevent thrashing
      • Because the thresholds are relative percentages, the algorithm automatically scales with increasing lock rates over time
openvms 8 3 changes improved remastering algorithm193
OpenVMS 8.3 Changes: Improved Remastering Algorithm

Mixed Version Clusters:

  • 8.3 nodes must interact properly with pre-8.3 nodes
  • In dealing with a pre-8.3 node, an 8.3 node must basically follow the old rules regarding LOCKDIRWT:
    • An 8.3 node will remaster a lock tree to a pre-8.3 node if the pre-8.3 node has a higher LOCKDIRWT
    • If LOCKDIRWT values are equal, then when an 8.3 node receives a lock request from a pre-8.3 node, the new computed activity threshold is used on the 8.3 node for the remastering decision
  • There are some cases that require special rules:
endless remastering

Endless Remastering?

[Diagram] Three nodes are interested in the same resource tree:

  • Node A – V8.3, LOCKDIRWT: 2, LOCKRMWT: 0
  • Node B – V8.3, LOCKDIRWT: 0, LOCKRMWT: 10
  • Node C – V7.3-2, LOCKDIRWT: 1

The question: with this mix of versions and settings, could mastership of the resource tree be remastered endlessly among the nodes?

Slide by Greg Jordan

endless remastering195

Endless Remastering? (continued)

[Diagram] Same three-node configuration:

  • Node A – V8.3, LOCKDIRWT: 2, LOCKRMWT: 0
  • Node B – V8.3, LOCKDIRWT: 0, LOCKRMWT: 10
  • Node C – V7.3-2, LOCKDIRWT: 1

The special rule that breaks the cycle: never remaster to a pre-8.3 node if an interested node has a higher LOCKDIRWT.

Slide by Greg Jordan

openvms 8 3 changes option to reduce deadlock wait
OpenVMS 8.3 Changes: Option to Reduce Deadlock Wait
  • SYSGEN parameter DEADLOCK_WAIT is in units of seconds; minimum 1 second
  • Some applications incur deadlocks as a matter of course (e.g. Rdb database)
  • Code has been implemented in 8.3 to allow a process to request a sub-second deadlock wait time:
    • Call system service $SET_PROCESS_PROPERTIES with an item code of PPROP$C_DEADLOCK_WAIT and an additional parameter with a delta time ranging from 0 seconds to 1 second
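The cluster-wide floor is still set by the DEADLOCK_WAIT parameter; a minimal sketch of inspecting and changing it with SYSGEN (the value is in seconds, the example value is illustrative, and this assumes the parameter is dynamic on your version; sub-second waits cannot be set this way and require the $SET_PROCESS_PROPERTIES call described above from within the application):

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW DEADLOCK_WAIT
SYSGEN> SET DEADLOCK_WAIT 5
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT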
openvms 8 3 changes pedriver window sizes
OpenVMS 8.3 Changes: PEDRIVER window sizes
  • Prior to 8.3, PEDRIVER had a limit of 31 unacknowledged packets (“window size”)
  • In multi-site clusters with long inter-site distances (i.e. hundreds of miles) or with faster LAN adapters (e.g. 10-gigabit), this could throttle activities like lock requests
  • Code has been implemented in 8.3 to allow the PEDRIVER transmit window size to be larger:
    • PEDRIVER automatically adjusts window size based on aggregate link speed, but
    • Can manually calculate and set window sizes using SCACP:
      • SCACP> CALCULATE WINDOW_SIZE /SPEED=n -
          /DISTANCE=[KILOMETERS or MILES]=d
      • SCACP> SET VC /RECEIVE_WINDOW=x and
      • SCACP> SET VC /TRANSMIT_WINDOW=x
        • (When enlarging windows, change the receive side first; when cutting back, change the transmit side first)
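A hedged usage sketch combining these commands (the node name, speed, distance, and window values are purely illustrative, and the exact qualifier units should be confirmed with SCACP's online HELP):

$ RUN SYS$SYSTEM:SCACP
SCACP> CALCULATE WINDOW_SIZE /SPEED=10000 /DISTANCE=KILOMETERS=800
SCACP> SET VC NODEB /RECEIVE_WINDOW=96      ! when enlarging, change the receive side first
SCACP> SET VC NODEB /TRANSMIT_WINDOW=96     ! then the transmit side
SCACP> EXIT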
openvms 8 3 changes 10 gigabit nic for integrity servers
OpenVMS 8.3 Changes: 10-Gigabit NIC for Integrity Servers
  • 8.3 has latent support for 10-Gigabit Ethernet on Integrity Servers
    • Support for the AB287A 10 Gigabit PCI-X LAN adapter
    • The AB287A connects to a LAN operating at 10 gigabits per second in full-duplex mode
    • The device can sustain throughput close to the bandwidth offered by the PCI or PCI-X slot it is plugged into (roughly 7 gigabits per second for PCI-X at 133 MHz)
  • This is expected to reduce lock request latency to some degree compared with 1-gigabit NICs, similar to the improvement of 1-gigabit over 100-megabit (Fast Ethernet) NICs (perhaps a one-third reduction)
openvms cluster resources
OpenVMS Cluster Resources
  • OpenVMS Documentation on the Web:
    • http://h71000.www7.hp.com/doc
      • OpenVMS Cluster Systems
      • Guidelines for OpenVMS Cluster Configurations
      • HP Volume Shadowing for OpenVMS
  • Book “VAXcluster Principles” by Roy G. Davis, Digital Press, 1993, ISBN 1-55558-112-9
openvms cluster resources200
OpenVMS Cluster Resources
  • OpenVMS Hobbyist Program (free licenses for OpenVMS, OpenVMS Cluster Software, compilers, etc.) at http://openvmshobbyist.org/
  • VAX emulators for Windows and Linux
    • simh by Bob Supnik: http://simh.trailing-edge.com/
    • TS-10 by Timothy Stark: http://sourceforge.net/projects/ts10/
    • Charon-VAX: http://www.charon-vax.com/
openvms cluster resources201
OpenVMS Cluster Resources
  • HP IT Resource Center Forums on OpenVMS:
    • http://forums1.itrc.hp.com/service/forums/familyhome.do?familyId=288
  • Encompasserve (aka DECUServe)
    • OpenVMS system with free accounts and a friendly community of OpenVMS users
      • Telnet to encompasserve.org and log in under username REGISTRATION
  • DeathRow OpenVMS Cluster
    • http://deathrow.vistech.net/
  • Usenet newsgroups:
    • comp.os.vms, vmsnet.*, comp.sys.dec
speaker contact info
Speaker Contact Info:
  • Keith Parris
  • E-mail: Keith.Parris@hp.com
  • or keithparris@yahoo.com
  • Web: http://www2.openvms.org/kparris/