
OpenVMS Cluster Internals & Data Structures, and Using OpenVMS Clusters for Disaster Tolerance

Keith Parris, Systems/Software Engineer, HP

Pre-Conference Seminars: Hosted by Encompass, your HP Enterprise Technology user group
  • For more information on Encompass visit www.encompassUS.org

Part 1: OpenVMS Cluster Internals & Data Structures

Keith Parris, Systems/Software Engineer, HP

Summary of OpenVMS Cluster Features
  • Quorum Scheme to protect against partitioned clusters (a.k.a. Split-Brain Syndrome)
  • Distributed Lock Manager to coordinate access to shared resources by multiple users on multiple nodes
  • Cluster-wide File System for simultaneous access to the file system by multiple nodes
  • Common security and management environment
  • Cluster from the outside appears to be a single system
  • The user environment appears the same regardless of which node the user is logged into
Overview of Topics
  • System Communication Architecture (SCA) and its guarantees
    • Interconnects
  • Connection Manager
    • Rule of Total Connectivity
    • Quorum Scheme
  • Distributed Lock Manager
  • Cluster State Transitions
  • Cluster Server Process
Overview of Topics
  • MSCP/TMSCP Servers
  • Support for redundant hardware
  • Version 8.3 Changes
OpenVMS Cluster Overview
  • An OpenVMS Cluster is a set of distributed systems which cooperate
  • Cooperation requires coordination, which requires communication
Foundation for Shared Access

[Slide diagram, layered top to bottom: Users; Applications; Nodes; Distributed Lock Manager; Connection Manager; Rule of Total Connectivity and Quorum Scheme; Shared resources (files, disks, tapes).]

System Communications Architecture (SCA)
  • SCA governs the communications between nodes in an OpenVMS cluster
System Communications Services (SCS)
  • System Communications Services (SCS) is the name for the OpenVMS code which implements SCA
    • The terms SCA and SCS are often used interchangeably
  • SCS provides the foundation for communication between OpenVMS nodes on a cluster interconnect
Cluster Interconnects
  • SCA has been implemented on various types of hardware:
    • Computer Interconnect (CI)
    • Digital Storage Systems Interconnect (DSSI)
    • Fiber Distributed Data Interface (FDDI)
    • Ethernet (10 megabit, Fast, Gigabit)
    • Asynchronous Transfer Mode (ATM) LAN
    • Memory Channel
    • Galaxy Shared Memory Cluster Interconnect (SMCI)
Interconnects (Storage vs. Cluster)
  • Originally, CI was the one and only Cluster Interconnect for OpenVMS Clusters
    • CI allowed connection of both OpenVMS nodes and Mass Storage Control Protocol (MSCP) storage controllers
  • LANs allowed connections to OpenVMS nodes and LAN-based Storage Servers
  • SCSI and Fibre Channel allow only connections to storage – no communications to other OpenVMS nodes
    • So now we must differentiate between Cluster Interconnects and Storage Interconnects
Interconnects within an OpenVMS Cluster
  • Storage-only Interconnects
    • Small Computer Systems Interface (SCSI)
    • Fibre Channel (FC)
  • Cluster & Storage (combination) Interconnects
    • CI
    • DSSI
    • LAN
  • Cluster-only Interconnects (No Storage)
    • Memory Channel
    • Galaxy Shared Memory Cluster Interconnect (SMCI)
    • ATM LAN
System Communications Architecture (SCA)
  • Each node must have a unique:
    • SCS Node Name
    • SCS System ID
  • Flow control is credit-based
System Communications Architecture (SCA)
  • Layers:
    • SYSAPs
    • SCS
    • Ports
    • Interconnects
LANs as a Cluster Interconnect
  • SCA is implemented in hardware by CI and DSSI port hardware
  • SCA over LANs is provided by Port Emulator software (PEDRIVER)
    • SCA over LANs is referred to as NISCA
      • NI is for Network Interconnect (an early name for Ethernet within DEC, in contrast with CI, the Computer Interconnect)
  • SCA over LANs is presently the center of gravity in OpenVMS Cluster interconnects (with storage on SANs)
SCS with Bridges and Routers
  • If compared with the 7-layer OSI network reference model, SCA has no Routing (what OSI calls Network) layer
  • OpenVMS nodes cannot route SCS traffic on each other’s behalf
  • SCS protocol can be bridged transparently in an extended LAN, but not routed
SCS on LANs
  • Because multiple independent clusters might be present on the same LAN, each cluster is identified by a unique Cluster Group Number, which is specified when the cluster is first formed.
  • As a further precaution, a Cluster Password is also specified. This helps protect against the case where two clusters inadvertently use the same Cluster Group Number. If packets with the wrong Cluster Password are received, errors are logged.
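
As an illustration of these two checks, here is a minimal sketch of the kind of validation a receiving node could apply to an incoming NISCA packet. The structure, field names, and values are assumptions made for the example; they are not the actual PEDRIVER data layout.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative packet header; a real NISCA header differs in layout. */
    struct nisca_hdr {
        unsigned int  cluster_group;        /* Cluster Group Number            */
        unsigned char cluster_password[8];  /* stored scrambled in reality     */
    };

    static unsigned int  my_group = 4242;            /* assumed local settings */
    static unsigned char my_password[8] = "secret1";

    /* Accept only packets from our own cluster; log password mismatches. */
    int accept_packet(const struct nisca_hdr *h)
    {
        if (h->cluster_group != my_group)
            return 0;                          /* some other cluster: ignore   */
        if (memcmp(h->cluster_password, my_password, sizeof my_password) != 0) {
            printf("NISCA: wrong cluster password for group %u - error logged\n",
                   h->cluster_group);
            return 0;
        }
        return 1;
    }
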
Interconnect Preference by SCS
  • When choosing an interconnect to a node, SCS chooses one “best” interconnect type, and sends all its traffic down that one type
      • “Best” is defined as working properly and having the most bandwidth
  • If the “best” interconnect type fails, it will fail over to another
  • OpenVMS Clusters can use multiple LAN paths in parallel
      • A set of paths is dynamically selected for use at any given point in time, based on maximizing bandwidth while avoiding paths that have high latency or that tend to lose packets
Interconnect Preference by SCS
  • SCS tends to select paths in this priority order:
    • Galaxy Shared Memory Cluster Interconnect (SMCI)
    • Gigabit Ethernet
    • Memory Channel
    • CI
    • Fast Ethernet or FDDI
    • DSSI
    • 10-megabit Ethernet
  • OpenVMS also allows the default priorities to be overridden with the SCACP utility
LAN Packet Size Optimization
  • OpenVMS Clusters dynamically probe and adapt to the maximum packet size based on what actually gets through at a given point in time
  • Allows taking advantage of larger LAN packet sizes:
      • Gigabit Ethernet Jumbo Frames
      • FDDI
SCS Flow Control
  • SCS flow control is credit-based
  • Connections start out with a certain number of credits
    • Credits are used as messages are sent
    • A message cannot be sent unless a credit is available
    • Credits are returned as messages are acknowledged
  • This prevents one system from over-running another system’s resources
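
A minimal sketch of this credit accounting, using made-up names and an arbitrary initial credit count rather than the actual SCS implementation:

    #include <stdbool.h>

    /* Illustrative per-connection send-credit accounting. */
    struct scs_connection {
        int send_credits;    /* messages we may still send to the peer */
    };

    /* A connection starts out with a certain number of credits. */
    void connection_open(struct scs_connection *c, int initial_credits)
    {
        c->send_credits = initial_credits;
    }

    /* A message may be sent only if a credit is available; sending uses one. */
    bool try_send_message(struct scs_connection *c)
    {
        if (c->send_credits == 0)
            return false;      /* caller must wait for credits to be returned */
        c->send_credits--;
        /* ... queue the sequenced message for transmission ... */
        return true;
    }

    /* Credits come back as the peer acknowledges our messages. */
    void on_acknowledgement(struct scs_connection *c, int credits_returned)
    {
        c->send_credits += credits_returned;
    }
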
SCS
  • SCS provides “reliable” port-to-port communications
  • SCS multiplexes messages and data transfers between nodes over Virtual Circuits
  • SYSAPs communicate via Connections over Virtual Circuits
Virtual Circuits
  • Formed between ports on a Cluster Interconnect of some flavor
  • Can pass data in 3 ways:
    • Datagrams
    • Sequenced Messages
    • Block Data Transfers
Connections over a Virtual Circuit

[Slide diagram: Node A and Node B are joined by a Virtual Circuit. Over it run Connections between SYSAP pairs: VMS$VAXcluster to VMS$VAXcluster, Disk Class Driver to MSCP Disk Server, and Tape Class Driver to MSCP Tape Server.]

Datagrams
  • “Fire and forget” data transmission method
  • No guarantee of delivery
      • But high probability of successful delivery
  • Delivery might be out-of-order
  • Duplicates possible
  • Maximum size typically 576 bytes
      • SYSGEN parameter SCSMAXDG (max. 985)
  • Example uses for Datagrams:
      • Polling for new nodes
      • Virtual Circuit formation
      • Logging asynchronous errors
Sequenced Messages
  • Guaranteed delivery (no lost messages)
  • Guaranteed ordering (first-in, first-out delivery; same order as sent)
  • Guarantee of no duplicates
  • Maximum size presently 216 bytes
      • SYSGEN parameter SCSMAXMSG (max. 985)
  • Example uses for Sequenced Messages:
      • Lock requests
      • MSCP I/O requests, and MSCP End Messages returning I/O status
Block Data Transfers
  • Used to move larger amounts of bulk data (too large for a sequenced message)
  • Data is mapped into “Named Buffers”
      • which specify location and size of memory area
  • Data movement can be initiated in either direction:
      • Send Data
      • Request Data
  • Example uses for Block Data Transfers:
      • Disk or tape data transfers
      • OPCOM messages
      • Lock/Resource Tree remastering
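
The properties of the three services described above can be summarized in code form. The constants simply restate the sizes mentioned on these slides; this is an illustrative summary, not a real SCS header.

    /* Illustrative summary of the three SCS data-transfer services. */
    enum scs_service {
        SCS_DATAGRAM,        /* no delivery/order guarantee, duplicates possible */
        SCS_SEQUENCED_MSG,   /* guaranteed, in-order, no duplicates              */
        SCS_BLOCK_TRANSFER   /* bulk data moved via pre-mapped Named Buffers     */
    };

    struct scs_service_traits {
        enum scs_service kind;
        int guaranteed_delivery;
        int guaranteed_order;
        int typical_max_bytes;   /* 0 = limited by the Named Buffer size */
    };

    static const struct scs_service_traits scs_traits[] = {
        { SCS_DATAGRAM,       0, 0, 576 },   /* SCSMAXDG, up to 985  */
        { SCS_SEQUENCED_MSG,  1, 1, 216 },   /* SCSMAXMSG, up to 985 */
        { SCS_BLOCK_TRANSFER, 1, 1, 0   },
    };
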
System Applications (SYSAPs)
  • Despite the name, these are pieces of the operating system, not user applications
  • Work in pairs (one on the node at each end of the connection) for specific purposes
  • Communicate using a Connection formed over a Virtual Circuit between nodes
  • Although unrelated to OpenVMS user processes, each is given a “Process Name”
OpenVMS Cluster Data Structuresfor Cluster Communications Using SCS
  • SCS Data Structures
    • SB (System Block)
    • PB (Path Block)
    • PDT (Port Descriptor Table)
    • CDT (Connection Descriptor Table)
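
A rough, illustrative sketch of how these four structures relate to one another (the field names and layout here are simplified assumptions, not the actual OpenVMS definitions):

    struct PDT;   /* Port Descriptor Table: one per local SCS port (e.g. PEA0)   */
    struct SB;    /* System Block: one per remote SCS node (OpenVMS or HSx)      */
    struct PB;    /* Path Block: one per virtual circuit to a given node         */
    struct CDT;   /* Connection Descriptor Table entry: one per SYSAP connection */

    struct PB {
        struct SB  *sb;          /* node this path leads to       */
        struct PDT *local_port;  /* local port the circuit uses   */
        int         load_class;  /* used to pick the "best" path  */
        int         num_connections;
    };

    struct SB {
        char       node_name[8]; /* SCS node name                 */
        unsigned   system_id;    /* SCS system ID                 */
        struct PB *paths;        /* one or more virtual circuits  */
        int        num_paths;
    };

    struct CDT {
        char       local_sysap[16];   /* e.g. "VMS$VAXcluster"    */
        char       remote_sysap[16];
        struct SB *remote_node;       /* node at the far end      */
    };
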
SCS Data Structures:SB (System Block)
  • Data displayed by

SHOW CLUSTER/CONTINUOUS

ADD SYSTEMS/ALL

  • One SB per SCS node
    • OpenVMS node or Controller node which uses SCS
SCS Data Structures:SB (System Block)
  • Data of interest:
    • SCS Node Name and SCS System ID
    • CPU hardware type/model, and HW revision level
    • Software type and version
    • Software incarnation
    • Number of paths to this node (virtual circuits)
    • Maximum sizes for datagrams and sequenced messages sent via this path
SHOW CLUSTER/CONTINUOUS display of System Block (SB)

View of Cluster from system ID 14733 node: ALPHA1          12-SEP-2006 23:07:02

+--------+--------+------------------------------+------------+---------+
|                               SYSTEMS                                 |
+--------+--------+------------------------------+------------+---------+
| NODE   | SYS_ID | HW_TYPE                      | SOFTWARE   | NUM_CIR |
+--------+--------+------------------------------+------------+---------+
| ALPHA1 | 14733  | AlphaServer ES45 Model 2     | VMS V7.3-2 |    1    |
| MARVEL | 14369  | hp AlphaServer GS1280 7/1150 | VMS V8.2   |    1    |
| ONEU   | 14488  | AlphaServer DS10L 617 MHz    | VMS V7.3-2 |    1    |
| MARVL2 | 14487  | AlphaServer GS320 6/731      | VMS V8.3   |    1    |
| MARVL3 | 14486  | AlphaServer GS320 6/731      | VMS V7.3-2 |    1    |
| IA64   | 14362  | HP rx4640 (1.30GHz/3.0MB)    | VMS V8.2-1 |    1    |
| MARVL4 | 14368  | hp AlphaServer GS1280 7/1150 | VMS V8.2   |    1    |
| WLDFIR | 14485  | AlphaServer GS320 6/731      | VMS V8.2   |    1    |
| LTLMRV | 14370  | hp AlphaServer ES47 7/1000   | VMS V8.2   |    1    |
| WILDFR | 14484  | AlphaServer GS320 6/731      | VMS V8.3   |    1    |
| AS4000 | 14423  | AlphaServer 4X00 5/466 4MB   | VMS V7.3-2 |    1    |
+--------+--------+------------------------------+------------+---------+

SHOW CLUSTER/CONTINUOUS display of System Block (SB)

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 21:28:53

+--------+----------+-------------------------+----------+---------+
|                             SYSTEMS                              |
+--------+----------+-------------------------+----------+---------+
| NODE   | SYS_ID   | HW_TYPE                 | SOFTWARE | NUM_CIR |
+--------+----------+-------------------------+----------+---------+
| VX6360 | 29831    | VAX 6000-360            | VMS V7.3 |    2    |
| HSJ501 | ******** | HSJ5                    | HSJ V52J |    1    |
| HSJ401 | ******** | HSJ4                    | HSJ YF06 |    1    |
| HSJ402 | ******** | HSJ4                    | HSJ V34J |    1    |
| HSC951 | 65025    | HS95                    | HSC V851 |    1    |
| HSJ403 | ******** | HSJ4                    | HSJ V32J |    1    |
| HSJ404 | ******** | HSJ4                    | HSJ YF06 |    1    |
| HSJ503 | ******** | HSJ5                    | HSJ V57J |    1    |
| HSJ504 | ******** | HSJ5                    | HSJ V57J |    1    |
| VX6440 | 30637    | VAX 6000-440            | VMS V7.3 |    2    |
| WILDFR | 29740    | AlphaServer GS160 6/731 | VMS V7.3 |    2    |
| WLDFIR | 29768    | AlphaServer GS160 6/731 | VMS V7.3 |    2    |
| VX7620 | 30159    | VAX 7000-620            | VMS V7.3 |    2    |
| VX6640 | 29057    | VAX 6000-640            | VMS V7.3 |    2    |
| MV3900 | 29056    | MicroVAX 3900 Series    | VMS V7.3 |    1    |
+--------+----------+-------------------------+----------+---------+

SCS Data Structures:PB (Path Block)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD CIRCUITS/ALL

  • One PB per virtual circuit to a given node
SCS Data Structures:PB (Path Block)
  • Data of interest:
    • Virtual circuit status
    • Number of connections over this path
    • Local port device name
    • Remote port ID
    • Hardware port type, and hardware revision level
    • Bit-mask of functions the remote port can perform
    • Load class value (so SCS can select the optimal path)
SHOW CLUSTER/CONTINUOUS display of Path Block (PB)

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 21:51:57

+--------+-------------------------------------------------------------------+
| SYSTEMS|                             CIRCUITS                              |
+--------+-------+-------+---------+---------+---------+----------+----------+
| NODE   | LPORT | RPORT | RP_TYPE | NUM_CON | CIR_STA | RP_REVIS | RP_FUNCT |
+--------+-------+-------+---------+---------+---------+----------+----------+
| VX6360 | PAA0  | 13    | CIBCA-B |    0    | OPEN    | 40074002 | FFFF0F00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| VX6440 | PAA0  | 15    | CIXCD   |    6    | OPEN    | 46       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| WILDFR | PAA0  | 14    | CIPCA   |    5    | OPEN    | 20       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| WLDFIR | PAA0  | 11    | CIPCA   |    4    | OPEN    | 20       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| VX7620 | PAA0  | 3     | CIXCD   |    5    | OPEN    | 47       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| VX6640 | PAA0  | 8     | CIXCD   |    5    | OPEN    | 47       | ABFF0D00 |
|        | PEA0  |       | LAN     |    0    | OPEN    | 105      | 83FF0180 |
| MV3900 | PEA0  |       | LAN     |    5    | OPEN    | 105      | 83FF0180 |
+--------+-------+-------+---------+---------+---------+----------+----------+

SHOW CLUSTER/CONTINUOUS display of Path Block (PB)

View of Cluster from system ID 14733 node: ALPHA1          12-SEP-2006 23:59:20

+--------+--------------------------------------------------------------+
| SYSTEMS|                           CIRCUITS                           |
+--------+------+-------+---------+-------+---------+----------+--------+
| NODE   | LPOR | RP_TY | CIR_STA | LD_CL | NUM_CON | RP_FUNCT | RP_OWN |
+--------+------+-------+---------+-------+---------+----------+--------+
| ALPHA1 | PEA0 | LAN   | OPEN    |   0   |    0    | 83FF0180 |   0    |
| MARVEL | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
| ONEU   | PEA0 | LAN   | OPEN    |  100  |    4    | 83FF0180 |   0    |
| MARVL2 | PEA0 | LAN   | OPEN    |  100  |    3    | 83FF0180 |   0    |
| MARVL3 | PEA0 | LAN   | OPEN    |  100  |    3    | 83FF0180 |   0    |
| IA64   | PEA0 | LAN   | OPEN    |  100  |    3    | 83FF0180 |   0    |
| MARVL4 | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
| WLDFIR | PEA0 | LAN   | OPEN    |  200  |    4    | 83FF0180 |   0    |
| LTLMRV | PEA0 | LAN   | OPEN    |  100  |    4    | 83FF0180 |   0    |
| WLDFIR | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
| AS4000 | PEA0 | LAN   | OPEN    |  200  |    3    | 83FF0180 |   0    |
+--------+------+-------+---------+-------+---------+----------+--------+

SCS Data Structures:PDT (Port Descriptor Table)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD LOCAL_PORTS/ALL

  • One PDT per local SCS port device
    • LANs all fit under one port: PEA0
SCS Data Structures:PDT (Port Descriptor Table)
  • Data of interest:
    • Port device name
    • Type of port
      • i.e. VAX CI, NPORT CI, DSSI, NI, MC, Galaxy SMCI
    • Local port ID, and maximum ID for this interconnect
    • Count of open connections using this port
    • Pointers to routines to perform various port functions
    • Load Class value
    • Counters of various types of traffic
    • Maximum data transfer size
SHOW CLUSTER/CONTINUOUS display of Port Descriptor Table (PDT) entries

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 22:17:28

+------------+------+--------+------+-----+-----+--------+----------+
|                            LOCAL_PORTS                            |
+------------+------+--------+------+-----+-----+--------+----------+
| LSYSTEM_ID | NAME | LP_STA | PORT | DGS | MSG | LP_TYP | MAX_PORT |
+------------+------+--------+------+-----+-----+--------+----------+
| 29831      | PAA0 | ONLINE |  13  | 15+ | 15+ | CIBCA  |    15    |
| 29831      | PEA0 | ONLINE |   0  | 15+ |  6  | LAN    |   223    |
+------------+------+--------+------+-----+-----+--------+----------+

SCS Data Structures:CDT (Connection Descriptor Table)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD CONNECTIONS/ALL

ADD COUNTERS/ALL

SCS Data Structures:CDT (Connection Descriptor Table)
  • Data of interest:
    • Local and remote SYSAP names
    • Local and remote Connection IDs
    • Connection status
    • Pointer to System Block (SB) of the node at the other end of the connection
    • Counters of various types of traffic
SHOW CLUSTER/CONTINUOUS display of Connection Descriptor Table (CDT) entries

View of Cluster from system ID 29831 node: VX6360          12-SEP-2006 22:28:42

+--------+-------------------------------------+--------------------------+
| SYSTEMS|             CONNECTIONS             |         COUNTERS         |
+--------+------------------+------------------+---------+---------+------+
| NODE   | LOC_PROC_NAME    | REM_PROC_NAME    | MSGS_SE | MSGS_RC | CR_W |
+--------+------------------+------------------+---------+---------+------+
| VX6360 | SCS$DIRECTORY    | ?                |         |         |      |
|        | VMS$SDA          | ?                |         |         |      |
|        | VMS$VAXcluster   | ?                |         |         |      |
|        | SCA$TRANSPORT    | ?                |         |         |      |
|        | MSCP$DISK        | ?                |         |         |      |
|        | MSCP$TAPE        | ?                |         |         |      |
| VX6440 | SCA$TRANSPORT    | SCA$TRANSPORT    |   89219 |  193862 |    0 |
|        | VMS$DISK_CL_DRVR | MSCP$DISK        |   95239 |   95243 |    0 |
|        | MSCP$DISK        | VMS$DISK_CL_DRVR |   95251 |   95250 |    0 |
|        | MSCP$TAPE        | VMS$TAPE_CL_DRVR |    1246 |    1246 |    0 |
|        | VMS$TAPE_CL_DRVR | MSCP$TAPE        |    1242 |    1247 |    0 |
|        | VMS$VAXcluster   | VMS$VAXcluster   | 2792737 | 2316397 |  211 |
| WILDFR | MSCP$TAPE        | VMS$TAPE_CL_DRVR |    1994 |    1994 |    0 |
|        | MSCP$DISK        | VMS$DISK_CL_DRVR |  152618 |  152617 |    0 |
|        | VMS$DISK_CL_DRVR | MSCP$DISK        |  985269 |  985274 |    0 |
|        | VMS$TAPE_CL_DRVR | MSCP$TAPE        |    1990 |    1990 |    0 |
|        | VMS$VAXcluster   | VMS$VAXcluster   |  227378 |  439169 |    0 |
+--------+------------------+------------------+---------+---------+------+

Connection Manager
  • The Connection Manager is code within OpenVMS that coordinates cluster membership across events such as:
    • Forming a cluster initially
    • Allowing a node to join the cluster
    • Cleaning up after a node which has failed or left the cluster
  • all the while protecting against uncoordinated access to shared resources such as disks
Rule of Total Connectivity
  • Every system must be able to talk “directly” with every other system in the cluster
    • Without having to go through another system
    • Transparent LAN bridges are considered a “direct” connection
  • If the Rule of Total Connectivity gets broken, the cluster will need to be adjusted so all members can once again talk with one another and meet the Rule.
Quorum Scheme
  • The Connection Manager enforces the Quorum Scheme to ensure that all access to shared resources is coordinated
    • Basic idea: A majority of the potential cluster systems must be present in the cluster before any access to shared resources (i.e. disks) is allowed
  • Idea comes from familiar parliamentary procedures
    • As in human parliamentary procedure, requiring a quorum before doing business prevents two or more subsets of members from meeting simultaneously and doing conflicting business
Quorum Scheme
  • Systems (and sometimes disks) are assigned values for the number of votes they have in determining a majority
  • The total number of votes possible is called the “Expected Votes” – the number of votes to be expected when all cluster members are present
  • “Quorum” is defined to be a simple majority (just over half) of the total possible (the Expected) votes
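
Concretely, OpenVMS computes quorum from Expected Votes with integer arithmetic as (EXPECTED_VOTES + 2) / 2. A small sketch of the calculation (the function name is made up for the example):

    #include <stdio.h>

    /* Quorum is a simple majority of Expected Votes:
       QUORUM = (EXPECTED_VOTES + 2) / 2, using integer division.  */
    static int quorum_for(int expected_votes)
    {
        return (expected_votes + 2) / 2;
    }

    int main(void)
    {
        /* For example, 10 expected votes gives a quorum of 6. */
        for (int ev = 1; ev <= 10; ev++)
            printf("EXPECTED_VOTES=%2d  ->  QUORUM=%d\n", ev, quorum_for(ev));
        return 0;
    }
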
Quorum Schemes
  • In the event of a communications failure,
    • Systems in the minority voluntarily suspend processing, while
    • Systems in the majority can continue to process transactions
Quorum Scheme
  • If a cluster member is not part of a cluster with quorum, OpenVMS keeps it from doing any harm by:
    • Putting all disks into Mount Verify state, thus stalling all disk I/O operations
    • Requiring that all processes can only be scheduled to run on a CPU with the QUORUM capability bit set
    • Clearing the QUORUM capability bit on all CPUs in the system, thus preventing any process from being scheduled to run on a CPU and doing any work
Quorum Schemes
  • To handle cases where there are an even number of votes
    • For example, with only 2 systems,
    • Or half of the votes in the cluster are at each of 2 geographically-separated sites
  • provision may be made for
    • a tie-breaking vote, or
    • human intervention
Quorum Schemes:Tie-breaking vote
  • This can be provided by a disk:
    • Quorum Disk
  • Or an extra system with a vote:
    • Additional cluster member node, called a “Quorum Node”
Quorum Scheme
  • A “quorum disk” can be assigned votes
    • OpenVMS periodically writes cluster membership info into the QUORUM.DAT file on the quorum disk and later reads it to re-check it; if all is well, OpenVMS can treat the quorum disk as a virtual voting member of the cluster
    • Two cluster subsets will “see” each other through their modifications of the QUORUM.DAT file as they poll its contents, and refuse to count the quorum disk’s votes
Quorum Loss
  • If too many systems leave the cluster, there may no longer be a quorum of votes left
  • It is possible to manually force the cluster to recalculate quorum and continue processing if needed
Quorum Recovery Methods
  • Software interrupt at IPL 12 from console
    • IPC> Q
  • Availability Manager or DECamds:
    • System Fix; Adjust Quorum
  • DTCS or BRS integrated tool, using same RMDRIVER (data provider client) interface as Availability Manager and DECamds
Quorum Scheme
  • If two non-cooperating subsets of cluster nodes both achieve quorum and access shared resources, this is known as a “partitioned cluster” (a.k.a. “Split Brain Syndrome”)
  • When a partitioned cluster occurs, the disk structure on shared disks is quickly corrupted
Quorum Scheme
  • Avoid a partitioned cluster by:
    • Proper setting of EXPECTED_VOTES parameter to the total of all possible votes
      • Note: OpenVMS ratchets up the dynamic cluster-wide value of Expected Votes as votes are added
      • When temporarily lowering EXPECTED_VOTES at boot time, also set WRITESYSPARAMS to zero to avoid writing a low value of EXPECTED_VOTES to the Current SYSGEN parameters
    • Make manual quorum adjustments with care:
      • Availability Manager software (or DECamds)
      • System SHUTDOWN with REMOVE_NODE option
Connection Manager and Transient Failures
  • Some communications failures are temporary and transient
    • Especially in a LAN environment
  • To avoid the disruption of unnecessarily removing a node from the cluster, the Connection Manager waits for a time after detecting a communications failure, in case the problem goes away by itself
    • This time is called the Reconnection Interval
      • SYSGEN parameter RECNXINTERVAL
        • RECNXINTERVAL is dynamic and may thus be temporarily raised if needed for something like a scheduled LAN outage
Connection Manager and Communications or Node Failures
  • If the Reconnection Interval passes without connectivity being restored, or if the node has “gone away”, the cluster cannot continue without a reconfiguration
  • This reconfiguration is called a State Transition, and one or more nodes will be removed from the cluster
Optimal Sub-cluster Selection
  • Connection manager compares potential node subsets that could make up surviving portion of the cluster
  • Pick sub-cluster with the most votes
  • If votes are tied, pick sub-cluster with the most nodes
  • If nodes are tied, arbitrarily pick a winner
    • based on comparing SCSSYSTEMID values of the set of nodes with the most-recent cluster software revision
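
These rules can be expressed as a simple comparison function. The sketch below restates the priority order above; the SCSSYSTEMID tie-break is shown only schematically, not as the exact computation OpenVMS performs.

    /* One candidate surviving subset of cluster nodes. */
    struct subcluster {
        int      votes;        /* total votes held by the nodes in this subset */
        int      node_count;   /* number of nodes in this subset               */
        unsigned tiebreak_id;  /* value derived from SCSSYSTEMIDs (schematic)  */
    };

    /* Most votes wins; if votes are tied, most nodes wins; if still tied,
       an arbitrary but deterministic tie-break decides.                       */
    static const struct subcluster *pick_survivor(const struct subcluster *a,
                                                  const struct subcluster *b)
    {
        if (a->votes != b->votes)
            return (a->votes > b->votes) ? a : b;
        if (a->node_count != b->node_count)
            return (a->node_count > b->node_count) ? a : b;
        return (a->tiebreak_id >= b->tiebreak_id) ? a : b;
    }
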
LAN reliability
  • VOTES:
    • Most configurations with satellite nodes give votes to disk/boot servers and set VOTES=0 on all satellite nodes
      • If the sole LAN adapter on a disk/boot server fails, and it has a vote, ALL satellites will leave the cluster
    • Advice: give at least as many votes to node(s) on the LAN as any single server has, or configure redundant LAN adapters
LAN redundancy and Votes

[Slide diagram: nodes on a LAN divided into Subset A and Subset B; three satellite nodes have 0 votes each and two server nodes have 1 vote each. Which subset of nodes is selected as the optimal sub-cluster?]

LAN redundancy and Votes

[Slide diagram: the same configuration (three satellites with 0 votes, two servers with 1 vote each). One possible solution: redundant LAN adapters on servers.]

LAN redundancy and Votes

[Slide diagram: the same configuration, now with 1 vote on each satellite and 2 votes on each server. Another possible solution: enough votes on the LAN to outweigh any single server node.]

Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Slide diagram: a two-site cluster with shadowsets spanning the sites. The two nodes at one site have 1 vote each; the two nodes at the other site have 0 votes. Which subset of nodes is selected as the optimal sub-cluster?]

Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Slide diagram: after the failure, the nodes at the site with 1 vote each continue; the nodes at the 0-vote site CLUEXIT. Without nodes to MSCP-serve its disks, that site's shadowset members go away.]

Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Slide diagram: one possible solution: nodes at one site have 2 votes each and nodes at the other site have 1 vote each, with shadowsets spanning the sites.]

OpenVMS Cluster Data Structuresfor the Connection Manager
  • Connection Manager Data Structures
    • CLUB (Cluster Block)
    • CSB (Cluster System Block)
    • CSV (Cluster System Vector)
Connection Manager Data Structures: CLUB (Cluster Block)
  • Only one CLUB on a given node
  • Pointed to by CLU$GL_CLUB
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD CLUSTER/ALL

Connection Manager Data Structures: CLUB (Cluster Block)
  • Data of interest:
    • Number of cluster members
    • Total number of votes
    • Current cluster-wide effective values of EXPECTED_VOTES and QUORUM
    • Various flags, like whether the cluster has quorum or not
    • Pointer to the Cluster System Block (CSB) for the local system
    • Data used during state transitions, like the identity of State Transition Coordinator Node, and what phase the transition is in
    • Time the cluster was formed, and time of last state transition
SHOW CLUSTER/CONTINUOUS display of Cluster Block (CLUB)

View of Cluster from system ID 14733 node: ALPHA1          13-SEP-2006 00:59:49

+--------+--------+----------+--------+-------------------+-------------------+
|                                   CLUSTER                                   |
+--------+--------+----------+--------+-------------------+-------------------+
| CL_EXP | CL_QUO | CL_VOTES | CL_MEM | FORMED            | LAST_TRANSITION   |
+--------+--------+----------+--------+-------------------+-------------------+
|   10   |   6    |    10    |   11   | 17-MAY-2006 14:48 | 11-SEP-2006 15:18 |
+--------+--------+----------+--------+-------------------+-------------------+

Connection Manager Data Structures:CSB (Cluster System Block)
  • Data displayed by

$ SHOW CLUSTER/CONTINUOUS

ADD MEMBERS/ALL

  • One CSB for each OpenVMS node this node is aware of
Connection Manager Data Structures: CSB (Cluster System Block)
  • Data of interest:
    • Cluster System ID (CSID) of this node
    • Status of this node (e.g. cluster member or not)
    • State of our connection to this node
      • e.g. Newly discovered node, normal connection, communications lost but still within RECNXINTERVAL wait period, lost connection too long so node is no longer in cluster
    • This node’s private values for VOTES, EXPECTED_VOTES, QDSKVOTES, LOCKDIRWT
    • Cluster software version, and bit-mask of capabilities this node has (to allow support of cluster rolling upgrades despite adding new features)
SHOW CLUSTER/CONTINUOUS display of Cluster System Block (CSB)

View of Cluster from system ID 14733 node: ALPHA1          13-SEP-2006 01:09:00

+-------+--------+-------+-----+--------+---------------+---------+----------+
|                                 MEMBERS                                    |
+-------+--------+-------+-----+--------+---------------+---------+----------+
| CSID  | DIR_WT | VOTES | EXP | QUORUM | RECNXINTERVAL | SW_VERS | BITMAPS  |
+-------+--------+-------+-----+--------+---------------+---------+----------+
| 10061 |   1    |   1   |  1  |   1    |      60       |    6    | MINICOPY |
| 1005E |   1    |   1   |  6  |   4    |      60       |    6    | MINICOPY |
| 10032 |   0    |   0   |  5  |   3    |      60       |    6    | MINICOPY |
| 1005D |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10057 |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10064 |   1    |   1   | 10  |   6    |      60       |    6    | MINICOPY |
| 1005F |   1    |   1   |  6  |   4    |      60       |    6    | MINICOPY |
| 10058 |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10060 |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 1005A |   1    |   1   |  9  |   5    |      60       |    6    | MINICOPY |
| 10043 |   1    |   1   |  5  |   3    |      60       |    6    | MINICOPY |
+-------+--------+-------+-----+--------+---------------+---------+----------+

Connection Manager Data Structures: CSV (Cluster System Vector)
  • The CSV is effectively a list of all systems
  • The Cluster System ID (CSID) acts as an index into the CSV
  • Incrementing sequence numbers within CSIDs allows re-use of CSV slots without confusion
    • This is similar to how the sequence number is incremented before a File ID is re-used
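
A sketch of the idea: the CSID combines a slot index with a sequence number, so a stale CSID from a re-used slot is recognizable. The bit split and sizes below are assumptions chosen for illustration; the real CSID layout is internal to OpenVMS.

    /* Illustrative CSID handling: an index into the CSV plus a sequence
       number that is bumped each time the slot is re-used.               */
    #define CSV_SLOTS 256

    struct csv_slot {
        unsigned short sequence;   /* current sequence number for this slot */
        int            in_use;
        /* ... pointer to the member's CSB, etc. ... */
    };

    static struct csv_slot csv[CSV_SLOTS];

    static unsigned make_csid(unsigned short index)
    {
        return ((unsigned)csv[index].sequence << 16) | index;  /* assumed split */
    }

    /* A stale CSID (from a departed member whose slot was re-used) is
       rejected because its sequence no longer matches the slot's.        */
    static int csid_is_current(unsigned csid)
    {
        unsigned short index    = csid & 0xFFFF;
        unsigned short sequence = (unsigned short)(csid >> 16);
        if (index >= CSV_SLOTS)
            return 0;
        return csv[index].in_use && csv[index].sequence == sequence;
    }
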
Distributed Lock Manager
  • The Lock Manager provides mechanisms for coordinating access to physical resources, both for exclusive access and for various degrees of sharing
Distributed Lock Manager
  • Examples of physical resources:
    • Tape drives
    • Disks
    • Files
    • Records within a file
  • as well as internal operating system cache buffers and so forth
Distributed Lock Manager
  • Physical resources are mapped to symbolic resource names, and locks are taken out and released on these symbolic resources to control access to the real resources
Distributed Lock Manager
  • System services $ENQ and $DEQ allow new lock requests, conversion of existing locks to different modes (or degrees of sharing), and release of locks, while $GETLKI allows the lookup of lock information
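
As a brief illustration of these services, the sketch below takes out an Exclusive-mode lock on an application-defined resource name and then releases it. The resource name is made up and error handling is abbreviated; this is a minimal user-mode sketch, not production code.

    #include <stdio.h>
    #include <starlet.h>      /* sys$enqw, sys$deq          */
    #include <lckdef.h>       /* LCK$K_EXMODE and friends   */
    #include <descrip.h>      /* $DESCRIPTOR                */

    /* Lock Status Block: completion status plus the lock ID assigned by $ENQ. */
    struct lksb {
        unsigned short status;
        unsigned short reserved;
        unsigned int   lock_id;
    };

    int main(void)
    {
        struct lksb lksb;
        /* Made-up, application-private resource name. */
        $DESCRIPTOR(resnam, "MYAPP_EXAMPLE_RESOURCE");

        /* Queue an Exclusive-mode lock request and wait until it is granted. */
        int status = sys$enqw(0, LCK$K_EXMODE, (void *)&lksb, 0, &resnam,
                              0, 0, 0, 0, 0, 0, 0);
        if (status & 1)                 /* odd status = service succeeded */
            status = lksb.status;       /* final status is in the LKSB    */
        if (!(status & 1)) {
            printf("$ENQW failed, status %d\n", status);
            return status;
        }

        /* ... access the resource protected by this lock ... */

        /* Release the lock. */
        status = sys$deq(lksb.lock_id, 0, 0, 0);
        return status;
    }
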
OpenVMS ClusterDistributed Lock Manager
  • Physical resources are protected by locks on symbolic resource names
  • Resources are arranged in trees:
    • e.g. File -> Data bucket -> Record
  • Different resources (disk, file, etc.) are coordinated with separate resource trees, to minimize contention
Symbolic lock resource names
  • Symbolic resource names
    • Common prefixes:
      • SYS$ for OpenVMS executive
      • F11B$ for XQP, file system
      • RMS$ for Record Management Services
    • See the book OpenVMS Internals and Data Structures by Ruth Goldenberg, et al
      • Appendix H in Alpha V1.5 version
      • Appendix A in Alpha V7.0 version
Resource names
  • Example: Device Lock
    • Resource name format is
      • “SYS$” {Device Name in ASCII text}

3A3333324147442431245F24535953 SYS$_$1$DGA233:

SYS$  (SYS$ facility)

_$1$DGA233:  (Device name)

Resource names
  • Example: RMS lock tree for an RMS indexed file:
    • Resource name format is
      • “RMS$” {File ID} {Flags byte} {Lock Volume Name}
    • Identify filespec using File ID
    • Flags byte indicates shared or private disk mount
    • Pick up disk volume name
      • This is the label as of the time the disk was mounted
  • Sub-locks are used for buckets and records within the file
Decoding an RMS File Root Resource Name

RMS$t......FDDI_COMMON ...

000000  204E4F4D4D4F435F49444446 02 000000011C74 24534D52

  24534D52                   RMS$          (RMS Facility; RMS File Root Resource)
  00 00 0001 1C74            RVN FilX Sequence_Number File_Number
                                           = File ID [7284,1,0]
  02                         Flags byte    (Disk is mounted /SYSTEM)
  204E4F4D4D4F435F49444446   FDDI_COMMON   (Disk label)

$ dump/header/id=7284/block=count=0 -
disk$FDDI_COMMON:[000000]indexf.sys

Dump of file _DSA100:[SYSEXE]SYSUAF.DAT;1 ...  (File)

RMS Data Bucket Contents

[Slide diagram: an RMS data bucket containing a series of data records.]

RMS Indexed FileBucket and Record Locks
  • Sub-locks of RMS File Lock
    • Have to look at Parent lock to identify file
  • Bucket lock:
    • 4 bytes: VBN of first block of the bucket
  • Record lock:
    • 8 bytes: Record File Address (RFA) of record
Distributed Lock Manager:Locking Modes
  • Different types of locks allow different levels of sharing:
    • EX: Exclusive access: No other simultaneous access allowed
    • PW: Protected Write: Allows writing, with other read-only users allowed
    • PR: Protected Read: Allows reading, with other read-only users; no write access allowed
    • CW: Concurrent Write: Allows writing while others write
    • CR: Concurrent Read: Allows reading while others write
    • NL: Null (future interest in the resource)
  • Locks can be requested, released, and converted between modes
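
The degree of sharing can be summarized with the standard lock-mode compatibility matrix, sketched below in code form (1 means the two modes can be granted on the same resource at the same time); it restates the mode descriptions above.

    /* Lock modes, in increasing order of restrictiveness. */
    enum lock_mode { NL, CR, CW, PR, PW, EX };

    /* compat[held][requested] == 1 if the requested mode can be granted
       while another lock is held in mode 'held' on the same resource.    */
    static const int compat[6][6] = {
        /*              NL  CR  CW  PR  PW  EX */
        /* NL held */ {  1,  1,  1,  1,  1,  1 },
        /* CR held */ {  1,  1,  1,  1,  1,  0 },
        /* CW held */ {  1,  1,  1,  0,  0,  0 },
        /* PR held */ {  1,  1,  0,  1,  0,  0 },
        /* PW held */ {  1,  1,  0,  0,  0,  0 },
        /* EX held */ {  1,  0,  0,  0,  0,  0 },
    };
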
Distributed Lock Manager:Lock Master nodes
  • OpenVMS assigns a single node at a time to keep track of all the resources in a given resource tree, and any locks taken out on those resources
    • This node is called the Lock Master node for that tree
    • Different trees often have different Lock Master nodes
    • OpenVMS dynamically moves Lock Mastership duties to the node with the most locking activity on that tree
Distributed Locks

[Slide diagram: Node A is the Lock Master for resource X. Nodes A, B, and C each hold a lock on resource X; Node A also holds copies of Node B's and Node C's locks on resource X.]

Directory Lookups
  • This is how OpenVMS finds out which node is the lock master
  • Only needed for 1st lock request on a particular resource tree on a given node
    • Resource Block (RSB) remembers master node CSID
  • Basic conceptual algorithm: Hash resource name and index into the lock directory weight vector, which has been created based on LOCKDIRWT values for each node
Lock Directory Weight Vector

Big node: PAPABR Lock Directory Weight = 2

Middle-sized node: MAMABR Lock Directory Weight = 1

Satellite node: BABYBR Lock Directory Weight = 0

Resulting lock directory weight vector: [ PAPABR, PAPABR, MAMABR ]
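
A sketch of how a directory node could be chosen from such a vector. The hash function and the vector-building code here are placeholders for illustration; OpenVMS's actual algorithm differs in detail.

    #include <string.h>

    #define MAX_DIR_ENTRIES 64

    struct node { const char *name; int lockdirwt; };

    static const char *dir_vector[MAX_DIR_ENTRIES];
    static int dir_entries;

    /* Each node appears in the vector LOCKDIRWT times. */
    static void build_directory_vector(const struct node *nodes, int n)
    {
        dir_entries = 0;
        for (int i = 0; i < n; i++)
            for (int w = 0; w < nodes[i].lockdirwt; w++)
                dir_vector[dir_entries++] = nodes[i].name;
    }

    /* Hash the resource name and index into the vector to pick the
       directory node (placeholder hash, not the one OpenVMS uses).   */
    static const char *directory_node_for(const char *resnam, size_t len)
    {
        unsigned hash = 0;
        for (size_t i = 0; i < len; i++)
            hash = hash * 31 + (unsigned char)resnam[i];
        if (dir_entries == 0)
            return NULL;
        return dir_vector[hash % dir_entries];
    }

    /* With PAPABR=2, MAMABR=1, BABYBR=0 the vector is [PAPABR, PAPABR, MAMABR],
       so PAPABR handles roughly two-thirds of directory lookups and BABYBR none. */
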

Lock Directory Lookup

[Slide diagram: a user lock request on the local node is sent to the Directory node, which points to the Lock Master node, where the user request is satisfied.]

Performance Hints
  • Avoid $DEQing a lock which will be re-acquired later
    • Instead, convert to Null lock, and convert back up later
    • Or, create locks as sub-locks of a parent lock that remains held
  • This avoids the need for directory lookups
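
A sketch of the first hint, using the same system services as the earlier example: rather than calling $DEQ, the lock is converted down to NL mode and held, then converted back up when needed (LCK$M_CONVERT identifies the lock by the lock ID already in the lock status block). Error handling is omitted.

    #include <starlet.h>
    #include <lckdef.h>

    struct lksb {
        unsigned short status;
        unsigned short reserved;
        unsigned int   lock_id;    /* must already hold the ID of the lock */
    };

    /* Instead of $DEQ followed later by a fresh $ENQ (which may need a
       directory lookup), convert the held lock down to NL and keep it... */
    static int drop_to_null(struct lksb *lksb)
    {
        return sys$enqw(0, LCK$K_NLMODE, (void *)lksb, LCK$M_CONVERT,
                        0, 0, 0, 0, 0, 0, 0, 0);
    }

    /* ...and convert it back up (here to Protected Write) when needed again. */
    static int reacquire_pw(struct lksb *lksb)
    {
        return sys$enqw(0, LCK$K_PWMODE, (void *)lksb, LCK$M_CONVERT,
                        0, 0, 0, 0, 0, 0, 0, 0);
    }
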
Performance Hints
  • For lock trees which come and go often:
    • A separate program could take out a Null lock on the root resource on each node at boot time, and just hold it “forever”
    • This avoids lock directory lookup operations for that tree
Lock Request Latencies
  • Latency depends on several things:
    • Directory lookup needed or not
      • Local or remote directory node
    • $ENQ or $DEQ operation (acquiring or releasing a lock)
    • Local (same node) or remote lock master node
      • And if remote, the speed of interconnect used
Lock Request Latency: Local vs. Remote
  • Local requests are fastest
  • Remote requests are significantly slower:
    • Code path ~20 times longer
    • Interconnect also contributes latency
    • Total latency up to 2 orders of magnitude (100x) higher than local requests
Lock Request Latency Measurement
  • I used the LOCKTIME program pair written by Roy G. Davis, author of the book VAXcluster Principles, to measure latency of lock requests locally and across different interconnects
  • LOCKTIME algorithm:
    • Take out 5000 locks on remote node, making it lock master for each of 5000 lock trees (requires ENQLM > 5000)
    • Do 5000 $ENQs, lock converts, and $DEQs, and calculate average latencies for each
    • Lock conversion request latency is roughly equivalent to round-trip time between nodes over a given interconnect
  • See LOCKTIME.COM from OpenVMS Freeware V6, [KP_LOCKTOOLS] directory
Lock Request Latency: Local

[Slide diagram: client process on the same node as the Lock Master: 2-4 microseconds.]

Lock Request Latency: Remote

[Slide diagram: client process on a client node reaching the Lock Master node across a Gigabit Ethernet switch: 200 microseconds.]

Lock Mastership
  • Lock mastership node may change for various reasons:
    • Lock master node goes down -- new master must be elected
    • OpenVMS may move lock mastership to a “better” node for performance reasons
      • LOCKDIRWT imbalance found, or
      • Activity-based Dynamic Lock Remastering
      • Lock Master node no longer has interest
Lock Mastership
  • Lock master selection criteria:
    • Interest
      • Only move resource tree to a node which is holding at least some locks on that resource tree
    • Lock Directory Weight (LOCKDIRWT)
      • Move lock tree to a node with interest and a higher LOCKDIRWT
    • Activity Level
      • Move lock tree to a node with interest and a higher average activity level
How to measure locking activity
  • OpenVMS keeps counters of lock activity for each resource tree
    • but not for each of the sub-resources
  • So you can see the lock rate for an RMS indexed file, for example
    • but not for individual buckets or records within that file
  • SDA extension LCK in OpenVMS V7.2-2 and above can show lock rates, and even trace all lock requests if needed. This displays data on a per-node basis.
  • Cluster-wide summary is available using LOCK_ACTV*.COM from OpenVMS Freeware V6, [KP_LOCKTOOLS] directory
Lock Remastering
  • Circumstances under which remastering occurs, and does not:
    • LOCKDIRWT values
      • OpenVMS tends to remaster to node with higher LOCKDIRWT values, never to node with lower LOCKDIRWT
    • Shifting initiated based on activity counters in root RSB
      • PE1 parameter being non-zero can prevent movement or place threshold on lock tree size
    • Shift if existing lock master loses interest
Lock Remastering
  • OpenVMS rules for dynamic remastering decision based on activity levels:
      • assuming equal LOCKDIRWT values
    • 1) Must meet general threshold of at least 80 lock requests so far (LCK$GL_SYS_THRSH)
    • 2) New potential master node must have at least 10 more requests per second than current master (LCK$GL_ACT_THRSH)
Lock Remastering
  • OpenVMS rules for dynamic remastering:
    • 3) Estimated cost to move (based on size of lock tree) must be less than estimated savings (based on lock rate)
      • except if new master meets criteria (2) for 3 consecutive 8-second intervals, cost is ignored
    • 4) No more than 5 remastering operations can be going on at once on a node (LCK$GL_RM_QUOTA)
Lock Remastering
  • OpenVMS rules for dynamic remastering:
    • 5) If PE1 on the current master has a negative value, remastering trees off the node is disabled
    • 6) If PE1 has a positive, non-zero value on the current master, the tree must be smaller than PE1 in size or it will not be remastered
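
Pulling the rules from these three slides together, the decision can be sketched roughly as follows. The structure and the cost/savings inputs are simplifications for illustration; only the named thresholds come from the slides above.

    /* Rough sketch of the dynamic remastering decision for one resource tree,
       assuming equal LOCKDIRWT on both nodes.                                 */

    #define SYS_THRESH 80   /* LCK$GL_SYS_THRSH: minimum lock requests so far   */
    #define ACT_THRESH 10   /* LCK$GL_ACT_THRSH: extra requests/second required */

    struct tree_stats {
        long requests_total;      /* lock requests seen so far on this tree      */
        long requests_per_sec;    /* recent activity rate on this tree           */
        int  sustained_intervals; /* consecutive 8-second intervals over thresh. */
    };

    static int should_remaster(const struct tree_stats *current_master,
                               const struct tree_stats *candidate,
                               long pe1_on_master,       /* PE1 on current master */
                               long tree_size,
                               long est_move_cost, long est_savings)
    {
        if (pe1_on_master < 0)                                /* rule 5 */
            return 0;
        if (pe1_on_master > 0 && tree_size >= pe1_on_master)  /* rule 6 */
            return 0;
        if (candidate->requests_total < SYS_THRESH)           /* rule 1 */
            return 0;
        if (candidate->requests_per_sec <
            current_master->requests_per_sec + ACT_THRESH)    /* rule 2 */
            return 0;
        if (est_move_cost >= est_savings &&
            candidate->sustained_intervals < 3)               /* rule 3 */
            return 0;
        /* rule 4, the limit of 5 concurrent remaster operations per node
           (LCK$GL_RM_QUOTA), is not modeled here */
        return 1;
    }
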
Lock Remastering
  • Implications of dynamic remastering rules:
    • LOCKDIRWT must be equal for lock activity levels to control choice of lock master node
    • PE1 can be used to control movement of lock trees OFF of a node, but not ONTO a node
    • RSB stores lock activity counts, so even high activity counts can be lost if the last lock is DEQueued on a given node and thus the RSB gets deallocated
Lock Remastering
  • Implications of dynamic remastering rules:
    • With two or more nodes of equal horsepower running the same application, lock mastership “thrashing” can occur:
      • 10 more lock requests per second is not much of a difference when you may be doing 100s or 1,000s of lock requests per second
      • Whichever new node becomes lock master may then see its own lock rate slow somewhat due to the remote lock request workload
How to Detect Lock Mastership Thrashing
  • Detection of remastering activity
    • MONITOR RLOCK in 7.3 and above (not 7.2-2)
    • SDA> SHOW LOCK/SUMMARY in 7.2 and above
    • Change of mastership node for a given resource
    • Check message counters under SDA:
      • SDA> EXAMINE PMS$GL_RM_RBLD_SENT
      • SDA> EXAMINE PMS$GL_RM_RBLD_RCVD
        • Counts which increase suddenly by a large amount indicate remastering of large tree(s)
          • SENT: Off of this node
          • RCVD: Onto this node
        • See example procedures WATCH_RBLD.COM and RBLD.COM
SDA> SHOW LOCK/SUMMARY

$ analyze/system

OpenVMS (TM) Alpha system analyzer

SDA> show lock/summary
...
Lock Manager Performance Counters:
----------------------------------
...
Lock Remaster Counters:
    Tree moved to this node                    0
    Tree moved to another node                 0
    Tree moved due to higher Activity          0
    Tree moved due to higher LOCKDIRWT         0
    Tree moved due to Single Node Locks        0
    No Quota for Operation                     0
    Proposed New Manager Declined              0
    Operations completed                       0
    Remaster Messages Sent                     0
    Remaster Messages Received                 0
    Remaster Rebuild Messages Sent             0
    Remaster Rebuild Messages Received         0

MONITOR RLOCK

                          OpenVMS Monitor Utility
                   DYNAMIC LOCK REMASTERING STATISTICS
                              on node KEITH
                          6-MAY-2002 18:42:35.07

                                       CUR      AVE      MIN      MAX
    Lock Tree Outbound Rate           0.00     0.00     0.00     0.00
         (Higher Activity)            0.00     0.00     0.00     0.00
         (Higher LCKDIRWT)            0.00     0.00     0.00     0.00
         (Sole Interest)              0.00     0.00     0.00     0.00
    Remaster Msg Send Rate            0.00     0.00     0.00     0.00
    Lock Tree Inbound Rate            0.00     0.00     0.00     0.00
    Remaster Msg Receive Rate         0.00     0.00     0.00     0.00

How to Prevent Lock Mastership Thrashing
  • Unbalanced node power
  • Unequal workloads
  • Unequal values of LOCKDIRWT
  • Non-zero values of PE1
Distributed Lock Manager: Resiliency
  • The Lock Master node for a given resource tree knows about all locks on resources in that tree from all nodes
  • Each node also keeps track of its own locks

Therefore:

  • If the Lock Master node for a given resource tree fails,
    • OpenVMS moves Lock Mastership duties to another node, and each node with locks tells the new Lock Master node about all their existing locks
  • If any other node fails, the Lock Master node frees (releases) any locks the now-departed node may have held
Distributed Lock Manager: Scalability
  • For the first lock request on a given resource tree from a given node, OpenVMS hashes the resource name to pick a node (called the Directory node) to ask about locks
  • If the Directory Node does not happen to be the Lock Master node for that tree, it replies telling which node IS the Lock Master node for that tree
  • As long as a node holds ANY locks on a resource tree, it remembers which node is the Lock Master for the tree, so it can avoid doing directory lookups for that tree
Distributed Lock Manager: Scalability
  • Thus, regardless of the number of nodes in a cluster, it takes AT MOST two off-node requests to resolve a lock request:
    • More often, only one request is needed (because we already know which node is the Lock Master), and
    • The majority of the time, NO off-node requests are needed, because OpenVMS tends to move Lock Master duties to the node with the most locking activity on the tree
  • Because Lock Directory and Lock Mastership duties are distributed among the nodes, this helps keep any single node from becoming a bottleneck in locking operations
Deadlock searches
  • The OpenVMS Distributed Lock Manager automatically detects lock deadlock conditions, and generates an error to one of the programs causing the deadlock
Deadlock searches
  • Deadlock searches can consume significant elapsed time and interrupt-state CPU time
  • MONITOR, T4, DECps, etc. can identify when deadlock searches are occurring
  • DEADLOCK_WAIT parameter controls how long we wait before starting a deadlock search
Dedicated-CPU Lock Manager
  • With 7.2-2 and above, you can choose to dedicate a CPU to do lock management work. This may help reduce MP_Synch time.
  • Using this can be helpful when:
    • You have more than 5 CPUs in the system, and
    • You’re already wasting more than a CPU’s worth in MP_Synch time contending for the LCKMGR spinlock
      • See SDA> SPL extension and SYS$EXAMPLES:SPL.COM
  • LCKMGR_MODE parameter:
    • 0 = Disabled
    • >1 = Enable if at least this many CPUs are running
  • LCKMGR_CPUID parameter specifies which CPU to dedicate to LCKMGR_SERVER process
OpenVMS Cluster Data Structuresfor the Distributed Lock Manager
  • Distributed Lock Manager
    • RSB (Resource Block)
    • LKB (Lock Block)
  • Data displayed by

$ ANALYZE/SYSTEM

SDA> SHOW RESOURCE

SDA> SHOW LOCK

Distributed Lock Manager Data Structures:RSB (Resource Block)
  • Data of interest:
    • Resource name and length
    • UIC Group (0 means System)
    • Access Mode (Kernel, Executive, Supervisor, User)
    • Resource Master (aka Lock Master) node’s CSID (0=local mastership)
    • Count of locks on this resource
    • Parent resource block, and depth within tree (0 for a Root RSB)
    • Count and queue of sub-resources, if any
    • Hash value for lookups via resource hash table
    • Activity counters used for dynamic lock remastering
Distributed Lock Manager Data Structures:LKB (Lock Block)
  • Data of interest:
    • Pointer to associated resource block (RSB)
    • Lock mode (NL, CR, CW, PR, PW, or EX) and access mode (Kernel, Executive, Supervisor, User)
    • Process ID
    • State (e.g. granted, waiting, in conversion queue)
    • Parent lock, if any
    • Number of sub-locks
    • Lock ID of this lock
    • CSID of lock master node, and remote Lock ID there
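
An illustrative sketch of how these two structures relate; the field names and types are simplified assumptions, not the actual OpenVMS definitions.

    /* Simplified, illustrative views of the lock manager structures. */
    struct RSB {                       /* Resource Block                        */
        char          resnam[31];      /* resource name                         */
        unsigned char resnam_len;
        unsigned      master_csid;     /* 0 = mastered locally                  */
        int           lock_count;      /* locks on this resource                */
        struct RSB   *parent;          /* NULL for a root RSB                   */
        int           depth;           /* 0 for a root RSB                      */
        long          activity;        /* counters used for dynamic remastering */
    };

    struct LKB {                       /* Lock Block                            */
        struct RSB   *rsb;             /* resource this lock is on              */
        int           granted_mode;    /* NL, CR, CW, PR, PW or EX              */
        unsigned      pid;             /* owning process                        */
        int           state;           /* granted / waiting / converting        */
        struct LKB   *parent_lock;     /* NULL if none                          */
        unsigned      lock_id;         /* this lock's ID                        */
        unsigned      master_csid;     /* lock master node                      */
        unsigned      remote_lock_id;  /* corresponding ID on the master        */
    };
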
Progression of Lock Manager Improvements

VAX/VMS version 4.0:

  • Introduced VAXclusters and Distributed Lock Manager
Progression of Lock Manager Improvements

VAX/VMS version 5.0:

  • Problem recognized:
    • In every cluster state transition, a Full Rebuild operation of the cluster-wide distributed lock database was required -- basically throwing away all Master copies of locks, selecting new resource masters, and then re-acquiring all locks, rebuilding the lock directory information in the process. This could take multiple minutes to complete.
  • Solution:
    • Partial and Merge rebuilds introduced
Progression of Lock Manager Improvements

VAX/VMS version 5.2:

  • Problem recognized:
    • Resource mastership remains stuck on one node.
      • Even if that node lost all interest in a resource tree, as long as there was even 1 lock anywhere in the cluster, that node remained the resource master.
      • As nodes in a cluster rebooted one by one, the remaining nodes tended to accumulate resource mastership duties over time, resulting in unbalanced loads
  • Solution:
    • Resource tree remastering is introduced. Resource trees can now be moved between nodes. LOCKDIRWT parameter allows bias or “weighting” toward or away from specific nodes.
Progression of Lock Manager Improvements

VAX/VMS version 5.4:

  • Problem recognized:
    • Remastering of resource trees after a node leaves the cluster adds to the impact of state transitions
  • Solution:
    • Resource trees are proactively remastered away from a node before it completes shutdown, minimizing the cleanup work afterward
Progression of Lock Manager Improvements

VAX/VMS version 5.4:

  • Problem recognized:
    • Lock rebuild operation is single-threaded, stretching out lock rebuild times and increasing the adverse impact of a cluster state transition
  • Solution:
    • Lock rebuild operation is made to be multi-threaded
Progression of Lock Manager Improvements

OpenVMS version 5.5:

  • Problem recognized:
    • Lock-request latency is lowest on the node which is the resource master, but VMS had no way of moving a resource tree to a busier node based on relative lock-request activity levels
  • Solution:
    • Activity-based resource tree remastering is introduced
Progression of Lock Manager Improvements

OpenVMS version 5.5-2:

  • Problem recognized:
    • Original algorithms were a bit too eager to remaster resource trees
  • Solution:
    • Scans for remastering activity are now done once every 8 seconds instead of once per second
Progression of Lock Manager Improvements

OpenVMS Alpha version 7.2:

  • Problem recognized:
    • System (S0/S1) address space is now often in short supply. Lock Manager data structures sometimes take up half of system address space on customers’ systems.
  • Solution:
    • Data structures for lock manager are moved into S2 (64-bit) address space.
Progression of Lock Manager Improvements

OpenVMS version 7.3 (and 7.2-2):

  • Problem recognized:
    • Remastering of large resource trees can take many seconds, even minutes. It can take tens of thousands of SCS sequenced messages to complete a remastering operation. Locking is blocked during a remastering operation, so that adversely impacts customer response times.
  • Solution:
    • Resource tree remastering is modified so it can use block data transfers instead of sequenced messages. This results in a reduction in resource tree remastering times by a factor of 10 to 20.
Progression of Lock Manager Improvements

OpenVMS version 8.3:

  • Problem recognized:
    • In clusters with multiple nodes of the same horsepower running the same applications, lock tree mastership can “thrash” around among the nodes.
      • One major cause is a hard-coded constant of only 10 lock requests per second needed to trigger a remastering operation.
  • Solution:
    • Algorithms for activity-based remastering are improved to add hysteresis and allow control with a new SYSGEN parameter LOCKRMWT.
Progression of Lock Manager Improvements

OpenVMS version 8.3:

  • Problem recognized:
    • LOCKDIRWT parameter got overloaded as of version 5.2 with two meanings:
      • 1) Lock Directory Weighting – relative amount of lock directory work this node should perform
      • 2) Lock Remastering Weighting – relative priority of this node in mastering shared resource trees
  • Solution:
    • LOCKRMWT parameter introduced to cover (2). LOCKDIRWT returns to its original meaning (1).
Progression of Lock Manager Improvements

OpenVMS version 8.3:

  • Problem recognized:
    • Some applications (e.g. Rdb) encounter deadlocks on a regular basis as a part of normal operation. The minimum value of 1 second for the DEADLOCK_WAIT parameter means it can take too long to detect these deadlocks.
  • Solution:
    • The deadlock waiting period can be set on a per-process basis to as little as one-hundredth of a second if needed.
Cluster State Transition Topics
  • Purpose of cluster state transitions
  • The Distributed Lock Manager’s role in State Transitions
  • Things which trigger state transitions, and how to avoid them
  • Failures and failure detection mechanisms
  • Reconnection Interval
Cluster State Transition Topics
  • Effects of a state transition
  • Sequence of events in a state transition
  • State-transition-related issues
  • Minimizing impact of state transitions
Distributed Lock Manager and Cluster State Transitions
  • The Distributed Lock Manager keeps information about locks and resources distributed around the cluster
  • Lock and Resource data may be affected by a node joining or leaving the cluster:
    • A node leaving may have had locks which need to be freed
    • Lock Master duties may need to be reassigned
    • Lock Directory Lookup distribution may also be affected when a node joins or leaves
Lock Rebuild Types
  • When a node joins or leaves the cluster, the distributed resource/lock database needs to be rebuilt.
  • Lock requests may need to be temporarily stalled while the data is redistributed
  • There are 4 types of lock rebuild operations:
    • Merge
    • Partial
    • Directory
    • Full
lock rebuild types160
Lock Rebuild Types
  • Merge:
    • Enter already-held lock tree into directory on directory node
      • Lock requests not stalled
  • Partial:
    • Release locks held by departed node, and remaster any resource trees it was resource master for
lock rebuild types161
Lock Rebuild Types
  • Directory:
    • Do what a Partial rebuild does (clean up after departed node), and also rebuild the directory information
  • Full:
    • Throw away all master copies of locks, select new resource masters, and then re-acquire locks, rebuilding directory information in the process
state transition triggers
State Transition Triggers
  • A state transition is needed when:
    • A new member is added to the cluster
    • An existing member is removed from the cluster
    • An adjustment related to votes is needed:
      • shutdown with REMOVE_NODE option
      • quorum disk validation or invalidation
      • quorum is adjusted manually via:
        • IPC> Q routine
        • Availability Manager / DECamds Quorum Fix via RMDRIVER
        • $SET CLUSTER/EXPECTED_VOTES command in DCL
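For illustration, the DCL route is a single command; a minimal sketch (the value 3 is purely illustrative and must reflect the votes you actually expect to be present):

$ SET CLUSTER/EXPECTED_VOTES=3

OpenVMS then recalculates quorum from the new EXPECTED_VOTES value, which is what triggers the vote-related state transition described above.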
new member detection on ci or dssi
New-member detection on CI or DSSI

[Timing diagram] Example with PANUMPOLL = 2, PAMAXPORT = 3, and PAPOLLINTERVAL = 5:

  • At time t=0, the port polls node IDs 0 and 1, sending a Request ID to each and receiving an ID response in return.
  • After PAPOLLINTERVAL (at t=5), it polls node IDs 2 and 3 in the same way.
  • After another PAPOLLINTERVAL (at t=10), polling wraps around to node ID 0 again.

virtual circuit formation on ci or dssi
Virtual Circuit Formation on CI or DSSI

[Diagram] Message exchange between an existing cluster member and a new node:

  • Each side sends a Request ID to the other and receives an ID response.
  • The virtual circuit is then opened with a Start, Start Acknowledge, Acknowledge handshake.

new member detection on ethernet or fddi
New-member detection on Ethernet or FDDI

[Diagram] Message exchange between the local node and a remote node:

  • The nodes discover each other via multicast Hello or Solicit-Service messages.
  • Channel-Control handshake: Channel-Control Start, Verify, Verify Acknowledge.
  • SCS handshake: Start, Start Acknowledge, Acknowledge.

failures
Failures
  • Types of failures:
    • Node failure
    • Adapter failure
    • Interconnect failure
  • Failure examples:
    • Power supply failure in network equipment
    • Operating system crash
    • Bridge reboot for firmware update, and spanning-tree reconfiguration which results
failure and repair recovery within reconnection interval
Failure and Repair/Recovery within Reconnection Interval

[Timeline] A failure occurs, and some time later it is detected and the virtual circuit is broken; this starts the RECNXINTERVAL timer. The problem is fixed and the repaired state is detected before RECNXINTERVAL expires, so the virtual circuit is simply re-opened and no node is removed from the cluster.

hard failure
Hard Failure

[Timeline] A failure occurs and is then detected, breaking the virtual circuit and starting the RECNXINTERVAL timer. RECNXINTERVAL expires without the problem being fixed, so a state transition occurs and the node is removed from the cluster.

late recovery
Late Recovery

[Timeline] A failure occurs and is detected, breaking the virtual circuit. RECNXINTERVAL expires, a state transition occurs, and the node is removed from the cluster. The problem is fixed only after that: when the fix is detected and the node learns it has been removed from the cluster, it does a CLUEXIT bugcheck.

failure detection mechanisms
Failure Detection Mechanisms
  • Mechanisms to detect a node or communications failure
    • Last-Gasp Datagram
    • Periodic checking
      • Multicast Hello packets on LANs
      • Polling on CI and DSSI
      • TIMVCFAIL check
timvcfail mechanism
TIMVCFAIL Mechanism

[Timeline, normal case] At time t=0 the local node sends a request to the remote node and receives a response. At about t=2/3 of TIMVCFAIL it sends another request and again receives a response, so the virtual circuit remains open.

timvcfail mechanism173
TIMVCFAIL Mechanism

[Timeline, failure case] At time t=0 the local node sends a request and receives a response (1). The remote node then fails at some point during the following period. The request sent at t=2/3 of TIMVCFAIL goes unanswered (2), and at t=TIMVCFAIL the virtual circuit is declared broken; the failure is thus detected no more than TIMVCFAIL after the last successful exchange.

sequence of events during a state transition
Sequence of Events During a State Transition
  • Determine new cluster configuration
    • If quorum is lost:
      • QUORUM capability bit removed from all CPUs
        • no process can be scheduled to run
      • Disks all put into mount verification
    • If quorum is not lost, continue…
  • Rebuild lock database
    • Stall lock requests
    • I/O synchronization
    • Do rebuild work
    • Resume lock handling
measuring state transition effects
Measuring State Transition Effects
  • Determine the type of the last lock rebuild:

$ ANALYZE/SYSTEM

SDA> READ SYS$LOADABLE_IMAGES:SCSDEF

SDA> EVALUATE @(@CLU$GL_CLUB + CLUB$B_NEWRBLD_REQ) & FF

Hex = 00000002 Decimal = 2 ACP$V_SWAPPRV

  • Rebuild type values:
    • Merge (locking not disabled)
    • Partial
    • Directory
    • Full
measuring state transition effects176
Measuring State Transition Effects
  • Determine the duration of the last lock request stall period:

SDA> DEFINE TOFF = @(@CLU$GL_CLUB+CLUB$L_TOFF)

SDA> DEFINE TON = @(@CLU$GL_CLUB+CLUB$L_TON)

SDA> EVALUATE TON-TOFF

Hex = 00000000.00000044 Decimal = 68 ADP$L_PROBE_CMD

This value is in units of 10-millisecond clock ticks (hundredths of a second), so in this example the lock request stall period lasted 0.68 seconds. (As in the earlier example, the symbol name SDA prints next to the value is just a coincidental match and can be ignored.)

minimizing impact of state transitions
Minimizing Impact of State Transitions
  • Configurations issues:
    • Few (e.g. exactly 3) nodes
    • Quorum node; no quorum disk
    • Isolate cluster interconnects from other traffic
  • Operational issues:
    • Avoid disk rebuilds on reboot (see the example after this list)
    • Let Last-Gasp work
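As a sketch of the disk-rebuild point above (the device name and volume label are hypothetical), volumes can be mounted with /NOREBUILD in startup procedures and rebuilt later at a quiet time; the ACP_REBLDSYSD parameter can similarly be set to 0 to defer the system-disk rebuild at boot:

$ MOUNT/SYSTEM/NOREBUILD $1$DGA100: DATA01    ! in startup command procedures
$ SET VOLUME/REBUILD $1$DGA100:               ! later, at a convenient time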
system parameters which affect state transition impact
System Parameters which affect State Transition Impact
  • Failure detection time:
    • TIMVCFAIL can be set as low as 100 (1 second) if needed
      • If cluster interconnect is the LAN, the minimum value for TIMVCFAIL is a value one second longer than the PEDRIVER Listen Timeout. The PEDRIVER Listen Timeout value may be viewed using SDA:

$ ANALYZE/SYSTEM

SDA> SHOW PORTS            ! This is done first to define the symbols

SDA> SHOW PORT/ADDR=PE_PDT

and look for output like:

"Listen Timeout       8     Hello Interval     150"

  • Reconnection interval after failure detection:
    • RECNXINTERVAL can be set as low as 1 second if needed
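A minimal SYSGEN sketch of checking and lowering these values on a running system (the values are illustrative, assume the 8-second PEDRIVER Listen Timeout shown above, and assume both parameters are dynamic on your version; otherwise WRITE CURRENT and a reboot would be needed):

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW TIMVCFAIL
SYSGEN> SET TIMVCFAIL 900        ! hundredths of a second: 9 seconds = Listen Timeout + 1
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SET RECNXINTERVAL 10     ! seconds
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT

For permanent changes, the usual practice is to put the values in MODPARAMS.DAT and run AUTOGEN.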
cluster server process csp
Cluster Server process (CSP)
  • Runs on each node
  • Assists with cluster-wide operations that require process context on a remote node, such as:
    • Mounting and dismounting volumes:
      • $MOUNT/CLUSTER and $DISMOUNT/CLUSTER
    • $BRKTHRU system service and $REPLY command
    • $SET TIME/CLUSTER
    • Distributed OPCOM communications
    • Interface between SYSMAN and SMISERVER on remote nodes
    • Cluster-Wide Process Services (CWPS)
      • Allow startup, monitoring, and control of processes on remote nodes
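For example, two of the operations above are ordinary DCL commands whose cluster-wide behavior is carried out by the CSP on each remote node (the device name and label here are hypothetical):

$ MOUNT/CLUSTER/SYSTEM $1$DGA200: APPDATA    ! CSP performs the mount on every other member
$ SET TIME/CLUSTER                           ! CSP sets the time on every cluster member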
mscp tmscp servers
MSCP/TMSCP Servers
  • Implement Mass Storage Control Protocol
  • Provide access to disks [MSCP] and tape drives [TMSCP] for systems which do not have a direct connection to the device, through a node which does have direct access and has [T]MSCP Server software loaded
  • OpenVMS includes both an MSCP server for disks and a TMSCP server for tapes
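Serving is typically enabled with SYSGEN parameters; a sketch of possible MODPARAMS.DAT entries follows (treat the serve-all value as a placeholder, since the appropriate settings depend on your configuration and OpenVMS version):

MSCP_LOAD = 1        ! load the MSCP disk server at boot
MSCP_SERVE_ALL = 1   ! which disks to serve; check the documentation for value meanings
TMSCP_LOAD = 1       ! load the TMSCP tape server at boot

These would then be applied with AUTOGEN and a reboot.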
mscp tmscp servers181
MSCP/TMSCP Servers
  • MSCP disk and tape controllers (e.g. HSJ80, HSD30) also include a [T]MSCP Server, and talk the SCS protocol, but do not have a Connection Manager, so are not cluster members
openvms support for redundant hardware
OpenVMS Support for Redundant Hardware
  • OpenVMS is very good about supporting multiple (redundant) pieces of hardware
    • Nodes
    • LAN adapters
    • Storage adapters
    • Disks
      • Volume Shadowing provides RAID-1 (mirroring); see the mount example after this list
  • OpenVMS is very good at automatic:
    • Failure detection
    • Fail-over
    • Fail-back
    • Load balancing or load distribution
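As one illustration of the shadowing item above, a two-member host-based shadow set might be mounted like this (the virtual unit, device names, and label are hypothetical):

$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA101:, $1$DGA201:) DATA01

OpenVMS then handles failure of a member, copy/merge operations when a member is returned, and distribution of reads across the members automatically.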
direct vs mscp served paths185
Direct vs. MSCP-Served Paths

[Diagram: two cluster nodes, two Fibre Channel switches, and a shadowset, illustrating direct vs. MSCP-served paths to storage]

openvms 8 3 changes improved remastering algorithm
OpenVMS 8.3 Changes: Improved Remastering Algorithm
  • New SYSGEN parameter LOCKRMWT (Lock Remaster Weight)
    • Value can be from 0 to 10
      • 0 means node never wants to be resource master
      • 10 means node always wants to be resource master
      • Values between 1 and 9 indicate a relative bias toward being resource master
    • Default value is 5
    • Difference between values on different nodes influences remastering activity (more on this shortly)
    • Parameter is dynamic
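Since the parameter is dynamic, it can be examined and adjusted on a running node; a minimal SYSGEN sketch (the value 7 is purely illustrative):

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW LOCKRMWT
SYSGEN> SET LOCKRMWT 7
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT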
openvms 8 3 changes improved remastering algorithm190
OpenVMS 8.3 Changes: Improved Remastering Algorithm
  • New SYSGEN parameter LOCKRMWT (Lock Remaster Weight)
    • If a node with LOCKRMWT of 0-9 receives a lock request from a node with a LOCKRMWT of 10, the tree will be marked for remastering to the node with a value of 10
    • If a node with LOCKRMWT of 0 receives a lock request from a node with any higher value, the tree will be marked for remastering
    • If a node with LOCKRMWT of 1-9 receives a lock request from another node with LOCKRMWT of 1-9, it computes the difference
      • Difference will be in the range of +8 to -8
      • A calculation based on this difference produces a bias expressed in relative percentages of lock activity (rather than fixed request rates)
openvms 8 3 changes improved remastering algorithm192
OpenVMS 8.3 Changes: Improved Remastering Algorithm
  • Examples:
    • If a node with LOCKRMWT of 9 receives a lock request from a node with a value of 1 (Local-minus-Remote difference of +8), the resource tree will be marked for remastering only if the node with a value of 1 is generating at least 3 times as many (200% more) lock requests
    • If a node with LOCKRMWT of 1 receives a lock request from another node with LOCKRMWT of 9 (Local-minus-Remote difference of -8), the resource tree will be marked for remastering if the node with LOCKRMWT of 9 is generating at least half (50%) as many lock requests as the node with LOCKRMWT of 1
    • When all nodes have the default LOCKRMWT value of 5:
      • Another node must be generating at least 13% more lock requests than the current resource master before the lock tree will be marked for remastering
      • This introduces some “hysteresis”, to prevent thrashing
      • Because the thresholds are relative percentages, the algorithm automatically scales with increasing lock rates over time
openvms 8 3 changes improved remastering algorithm193
OpenVMS 8.3 Changes: Improved Remastering Algorithm

Mixed Version Clusters:

  • 8.3 nodes must interact properly with pre-8.3 nodes
  • In dealing with a pre-8.3 node, an 8.3 node must basically follow the old rules regarding LOCKDIRWT:
    • An 8.3 node will remaster a lock tree to a pre-8.3 node if the pre-8.3 node has a higher LOCKDIRWT
    • If LOCKDIRWT values are equal, then when an 8.3 node receives a lock request from a pre-8.3 node, the new computed activity threshold is used on the 8.3 node for the remastering decision
  • There are some cases that require special rules:
endless remastering

Endless Remastering?

[Diagram] Three nodes are interested in the same resource tree:

  • Node A – V8.3, LOCKDIRWT: 2, LOCKRMWT: 0
  • Node B – V8.3, LOCKDIRWT: 0, LOCKRMWT: 10
  • Node C – V7.3-2, LOCKDIRWT: 1

The question: with this mix of versions and settings, could mastership of the resource tree be remastered endlessly among the nodes?

Slide by Greg Jordan

endless remastering195

Endless Remastering? (continued)

[Diagram] Same three-node configuration:

  • Node A – V8.3, LOCKDIRWT: 2, LOCKRMWT: 0
  • Node B – V8.3, LOCKDIRWT: 0, LOCKRMWT: 10
  • Node C – V7.3-2, LOCKDIRWT: 1

The special rule that breaks the cycle: never remaster to a pre-8.3 node if an interested node has a higher LOCKDIRWT.

Slide by Greg Jordan

openvms 8 3 changes option to reduce deadlock wait
OpenVMS 8.3 Changes: Option to Reduce Deadlock Wait
  • SYSGEN parameter DEADLOCK_WAIT is in units of seconds; minimum 1 second
  • Some applications incur deadlocks as a matter of course (e.g. Rdb database)
  • Code has been implemented in 8.3 to allow a process to request a sub-second deadlock wait time:
    • Call system service $SET_PROCESS_PROPERTIES with an item code of PPROP$C_DEADLOCK_WAIT and an additional parameter with a delta time ranging from 0 seconds to 1 second
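The cluster-wide floor is still set by the DEADLOCK_WAIT parameter; a minimal sketch of inspecting and changing it with SYSGEN (the value is in seconds, the example value is illustrative, and this assumes the parameter is dynamic on your version; sub-second waits cannot be set this way and require the $SET_PROCESS_PROPERTIES call described above from within the application):

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW DEADLOCK_WAIT
SYSGEN> SET DEADLOCK_WAIT 5
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT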
openvms 8 3 changes pedriver window sizes
OpenVMS 8.3 Changes: PEDRIVER window sizes
  • Prior to 8.3, PEDRIVER had a limit of 31 unacknowledged packets (“window size”)
  • In multi-site clusters with long inter-site distances (i.e. hundreds of miles) or with faster LAN adapters (e.g. 10-gigabit), this could throttle activities like lock requests
  • Code has been implemented in 8.3 to allow the PEDRIVER transmit window size to be larger:
    • PEDRIVER automatically adjusts window size based on aggregate link speed, but
    • Can manually calculate and set window sizes using SCACP:
      • SCACP> CALCULATE WINDOW_SIZE /SPEED=n -
          /DISTANCE=[KILOMETERS or MILES]=d
      • SCACP> SET VC /RECEIVE_WINDOW=x and
      • SCACP> SET VC /TRANSMIT_WINDOW=x
        • (When enlarging windows, change the receive side first; when cutting back, change the transmit side first)
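A hedged usage sketch combining these commands (the node name, speed, distance, and window values are purely illustrative, and the exact qualifier units should be confirmed with SCACP's online HELP):

$ RUN SYS$SYSTEM:SCACP
SCACP> CALCULATE WINDOW_SIZE /SPEED=10000 /DISTANCE=KILOMETERS=800
SCACP> SET VC NODEB /RECEIVE_WINDOW=96      ! when enlarging, change the receive side first
SCACP> SET VC NODEB /TRANSMIT_WINDOW=96     ! then the transmit side
SCACP> EXIT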
openvms 8 3 changes 10 gigabit nic for integrity servers
OpenVMS 8.3 Changes: 10-Gigabit NIC for Integrity Servers
  • 8.3 has latent support for 10-Gigabit Ethernet on Integrity Servers
    • Support for the AB287A 10 Gigabit PCI-X LAN adapter
    • The AB287A connects to a LAN operating at 10 gigabits per second in full-duplex mode
    • The device can sustain throughput close to the bandwidth offered by the PCI or PCI-X slot it is plugged into (roughly 7 gigabits per second for PCI-X at 133 MHz)
  • This is expected to reduce lock request latency to some degree compared with 1-gigabit NICs, similar to the improvement of 1-gigabit over 100-megabit (Fast Ethernet) NICs (perhaps a one-third reduction)
openvms cluster resources
OpenVMS Cluster Resources
  • OpenVMS Documentation on the Web:
    • http://h71000.www7.hp.com/doc
      • OpenVMS Cluster Systems
      • Guidelines for OpenVMS Cluster Configurations
      • HP Volume Shadowing for OpenVMS
  • Book “VAXcluster Principles” by Roy G. Davis, Digital Press, 1993, ISBN 1-55558-112-9
openvms cluster resources200
OpenVMS Cluster Resources
  • OpenVMS Hobbyist Program (free licenses for OpenVMS, OpenVMS Cluster Software, compilers, etc.) at http://openvmshobbyist.org/
  • VAX emulators for Windows and Linux
    • simh by Bob Supnik: http://simh.trailing-edge.com/
    • TS-10 by Timothy Stark: http://sourceforge.net/projects/ts10/
    • Charon-VAX: http://www.charon-vax.com/
openvms cluster resources201
OpenVMS Cluster Resources
  • HP IT Resource Center Forums on OpenVMS:
    • http://forums1.itrc.hp.com/service/forums/familyhome.do?familyId=288
  • Encompasserve (aka DECUServe)
    • OpenVMS system with free accounts and a friendly community of OpenVMS users
      • Telnet to encompasserve.org and log in under username REGISTRATION
  • DeathRow OpenVMS Cluster
    • http://deathrow.vistech.net/
  • Usenet newsgroups:
    • comp.os.vms, vmsnet.*, comp.sys.dec
speaker contact info
Speaker Contact Info:
  • Keith Parris
  • E-mail: Keith.Parris@hp.com
  • or keithparris@yahoo.com
  • Web: http://www2.openvms.org/kparris/