
OpenVMS Cluster Internals & Data Structures, and Using OpenVMS Clusters for Disaster Tolerance

Keith Parris
Systems/Software Engineer, HP


Pre-Conference Seminars
Hosted by Encompass, your HP Enterprise Technology user group

  • For more information on Encompass visit www.encompassUS.org


Part 1: OpenVMS Cluster Internals & Data Structures

Keith Parris
Systems/Software Engineer, HP



Summary of OpenVMS Cluster Features

  • Quorum Scheme to protect against partitioned clusters (a.k.a. Split-Brain Syndrome)

  • Distributed Lock Manager to coordinate access to shared resources by multiple users on multiple nodes

  • Cluster-wide File System for simultaneous access to the file system by multiple nodes

  • Common security and management environment

  • Cluster from the outside appears to be a single system

  • The user environment appears the same regardless of which node a user is logged into



Overview of Topics

  • System Communication Architecture (SCA) and its guarantees

    • Interconnects

  • Connection Manager

    • Rule of Total Connectivity

    • Quorum Scheme

  • Distributed Lock Manager

  • Cluster State Transitions

  • Cluster Server Process



Overview of Topics

  • MSCP/TMSCP Servers

  • Support for redundant hardware

  • Version 8.3 Changes



OpenVMS Cluster Overview

  • An OpenVMS Cluster is a set of distributed systems which cooperate

  • Cooperation requires coordination, which requires communication



Foundation for Shared Access

[Diagram: the layered foundation for shared access. Users run Applications on multiple Nodes; the nodes coordinate through the Distributed Lock Manager, which is built on the Connection Manager and its Rule of Total Connectivity and Quorum Scheme; underneath are the shared resources (files, disks, tapes).]



System Communications Architecture (SCA)

  • SCA governs the communications between nodes in an OpenVMS cluster



System Communications Services (SCS)

  • System Communications Services (SCS) is the name for the OpenVMS code which implements SCA

    • The terms SCA and SCS are often used interchangeably

  • SCS provides the foundation for communication between OpenVMS nodes on a cluster interconnect



Cluster Interconnects

  • SCA has been implemented on various types of hardware:

    • Computer Interconnect (CI)

    • Digital Storage Systems Interconnect (DSSI)

    • Fiber Distributed Data Interface (FDDI)

    • Ethernet (10 megabit, Fast, Gigabit)

    • Asynchronous Transfer Mode (ATM) LAN

    • Memory Channel

    • Galaxy Shared Memory Cluster Interconnect (SMCI)



Cluster Interconnects



Cluster Interconnects: Host CPU Overhead



Interconnects (Storage vs. Cluster)

  • Originally, CI was the one and only Cluster Interconnect for OpenVMS Clusters

    • CI allowed connection of both OpenVMS nodes and Mass Storage Control Protocol (MSCP) storage controllers

  • LANs allowed connections to OpenVMS nodes and LAN-based Storage Servers

  • SCSI and Fibre Channel allow only connections to storage – no communications to other OpenVMS nodes

    • So now we must differentiate between Cluster Interconnects and Storage Interconnects



Interconnects within an OpenVMS Cluster

  • Storage-only Interconnects

    • Small Computer Systems Interface (SCSI)

    • Fibre Channel (FC)

  • Cluster & Storage (combination) Interconnects

    • CI

    • DSSI

    • LAN

  • Cluster-only Interconnects (No Storage)

    • Memory Channel

    • Galaxy Shared Memory Cluster Interconnect (SMCI)

    • ATM LAN



System Communications Architecture (SCA)

  • Each node must have a unique:

    • SCS Node Name

    • SCS System ID

  • Flow control is credit-based



System Communications Architecture (SCA)

  • Layers:

    • SYSAPs

    • SCS

    • Ports

    • Interconnects



SCA Architecture Layers



LANs as a Cluster Interconnect

  • SCA is implemented in hardware by CI and DSSI port hardware

  • SCA over LANs is provided by Port Emulator software (PEDRIVER)

    • SCA over LANs is referred to as NISCA

      • NI is for Network Interconnect (an early name for Ethernet within DEC, in contrast with CI, the Computer Interconnect)

  • SCA over LANs is presently the center of gravity in OpenVMS Cluster interconnects (with storage on SANs)



NISCA Layering




OSI Network Model





SCS with Bridges and Routers

  • Compared with the 7-layer OSI network reference model, SCA has no Routing layer (what OSI calls the Network layer)

  • OpenVMS nodes cannot route SCS traffic on each other’s behalf

  • SCS protocol can be bridged transparently in an extended LAN, but not routed



SCS on LANs

  • Because multiple independent clusters might be present on the same LAN, each cluster is identified by a unique Cluster Group Number, which is specified when the cluster is first formed.

  • As a further precaution, a Cluster Password is also specified. This helps protect against the case where two clusters inadvertently use the same Cluster Group Number. If packets with the wrong Cluster Password are received, errors are logged.
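The receive-side check can be pictured roughly as follows. This is a conceptual C sketch only, not PEDRIVER's actual code; the structure and function names are hypothetical:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical view of the cluster-identification fields carried in
       each NISCA packet (not the real PEDRIVER packet layout). */
    typedef struct {
        unsigned short cluster_group;   /* Cluster Group Number      */
        unsigned char  cluster_pw[8];   /* Cluster Password material */
    } nisca_ident;

    /* Accept a packet only if it belongs to our cluster; log an error
       when the group number matches but the password does not. */
    static bool accept_packet(const nisca_ident *pkt, const nisca_ident *ours)
    {
        if (pkt->cluster_group != ours->cluster_group)
            return false;               /* some other cluster on this LAN */

        if (memcmp(pkt->cluster_pw, ours->cluster_pw, sizeof ours->cluster_pw) != 0) {
            fprintf(stderr, "cluster password mismatch for group %u - error logged\n",
                    pkt->cluster_group);
            return false;               /* same group number, wrong password */
        }
        return true;                    /* packet is from our own cluster */
    }

    int main(void)
    {
        nisca_ident ours = { .cluster_group = 100, .cluster_pw = "SECRET" };
        nisca_ident pkt  = { .cluster_group = 100, .cluster_pw = "WRONG!" };
        printf("packet accepted: %s\n", accept_packet(&pkt, &ours) ? "yes" : "no");
        return 0;
    }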



Interconnect Preference by SCS

  • When choosing an interconnect to a node, SCS chooses one “best” interconnect type, and sends all its traffic down that one type

    • “Best” is defined as working properly and having the most bandwidth

  • If the “best” interconnect type fails, it will fail over to another

  • OpenVMS Clusters can use multiple LAN paths in parallel

    • A set of paths is dynamically selected for use at any given point in time, based on maximizing bandwidth while avoiding paths that have high latency or that tend to lose packets



    Interconnect Preference by SCS

    • SCS tends to select paths in this priority order:

      • Galaxy Shared Memory Cluster Interconnect (SMCI)

      • Gigabit Ethernet

      • Memory Channel

      • CI

      • Fast Ethernet or FDDI

      • DSSI

      • 10-megabit Ethernet

    • OpenVMS also allows the default priorities to be overridden with the SCACP utility



    LAN Packet Size Optimization

    • OpenVMS Clusters dynamically probe and adapt to the maximum packet size based on what actually gets through at a given point in time

    • Allows taking advantage of larger LAN packet sizes:

      • Gigabit Ethernet Jumbo Frames

      • FDDI



    SCS Flow Control

    • SCS flow control is credit-based

    • Connections start out with a certain number of credits

      • Credits are used as messages are sent

      • A message cannot be sent unless a credit is available

      • Credits are returned as messages are acknowledged

    • This prevents one system from over-running another system’s resources
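The credit accounting just described can be sketched conceptually in C; the names and the initial credit count are illustrative, not the real SCS implementation:

    #include <stdbool.h>
    #include <stdio.h>

    /* Conceptual model of an SCS connection's send-credit accounting. */
    typedef struct {
        int send_credits;    /* credits currently available to the sender */
    } scs_connection;

    /* A message may be sent only if a credit is available; sending uses one. */
    static bool try_send(scs_connection *c)
    {
        if (c->send_credits == 0)
            return false;    /* must wait: receiver has no buffer reserved */
        c->send_credits--;
        /* ... queue the sequenced message for transmission ... */
        return true;
    }

    /* When the receiver acknowledges a message, the credit is returned. */
    static void on_ack(scs_connection *c)
    {
        c->send_credits++;
    }

    int main(void)
    {
        scs_connection conn = { .send_credits = 8 };   /* illustrative initial grant */
        for (int i = 0; i < 10; i++)
            printf("msg %d: %s\n", i, try_send(&conn) ? "sent" : "blocked (no credit)");
        on_ack(&conn);                                 /* an ACK frees one credit */
        printf("after ack: %s\n", try_send(&conn) ? "sent" : "blocked");
        return 0;
    }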



    SCS

    • SCS provides “reliable” port-to-port communications

    • SCS multiplexes messages and data transfers between nodes over Virtual Circuits

    • SYSAPs communicate via Connections over Virtual Circuits



    Virtual Circuits

    • Formed between ports on a Cluster Interconnect of some flavor

    • Can pass data in 3 ways:

      • Datagrams

      • Sequenced Messages

      • Block Data Transfers



    Connections over a Virtual Circuit

[Diagram: Node A and Node B are joined by a Virtual Circuit. Pairs of SYSAPs form Connections over it: VMS$VAXcluster to VMS$VAXcluster, Disk Class Driver to MSCP Disk Server, and Tape Class Driver to MSCP Tape Server.]



    Datagrams

    • “Fire and forget” data transmission method

    • No guarantee of delivery

      • But high probability of successful delivery

  • Delivery might be out-of-order

  • Duplicates possible

  • Maximum size typically 576 bytes

    • SYSGEN parameter SCSMAXDG (max. 985)

  • Example uses for Datagrams:

    • Polling for new nodes

    • Virtual Circuit formation

    • Logging asynchronous errors



    Sequenced Messages

    • Guaranteed delivery (no lost messages)

    • Guaranteed ordering (first-in, first-out delivery; same order as sent)

    • Guarantee of no duplicates

    • Maximum size presently 216 bytes

      • SYSGEN parameter SCSMAXMSG (max. 985)

  • Example uses for Sequenced Messages:

    • Lock requests

    • MSCP I/O requests, and MSCP End Messages returning I/O status



    Block Data Transfers

    • Used to move larger amounts of bulk data (too large for a sequenced message)

    • Data is mapped into “Named Buffers”

      • which specify location and size of memory area

  • Data movement can be initiated in either direction:

    • Send Data

    • Request Data

  • Example uses for Block Data Transfers:

    • Disk or tape data transfers

    • OPCOM messages

    • Lock/Resource Tree remastering



    System Applications (SYSAPs)

    • Despite the name, these are pieces of the operating system, not user applications

    • Work in pairs (one on the node at each end of the connection) for specific purposes

    • Communicate using a Connection formed over a Virtual Circuit between nodes

    • Although unrelated to OpenVMS user processes, each is given a “Process Name”



    SYSAPs



    OpenVMS Cluster Data Structures for SCA


OpenVMS Cluster Data Structures for Cluster Communications Using SCS

    • SCS Data Structures

      • SB (System Block)

      • PB (Path Block)

      • PDT (Port Descriptor Table)

      • CDT (Connection Descriptor Table)


SCS Data Structures: SB (System Block)

    • Data displayed by

      SHOW CLUSTER/CONTINUOUS

      ADD SYSTEMS/ALL

    • One SB per SCS node

      • OpenVMS node or Controller node which uses SCS


SCS Data Structures: SB (System Block)

    • Data of interest:

      • SCS Node Name and SCS System ID

      • CPU hardware type/model, and HW revision level

      • Software type and version

      • Software incarnation

      • Number of paths to this node (virtual circuits)

      • Maximum sizes for datagrams and sequenced messages sent via this path
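As a rough mental model (not the actual OpenVMS SBDEF layout), the fields above can be pictured as a C structure like this, with hypothetical field names and sizes:

    /* Illustrative only: a rough picture of the per-node System Block (SB)
       fields listed above. Not the actual OpenVMS SBDEF definition. */
    typedef struct system_block {
        char           scs_node_name[8];     /* SCS node name                  */
        unsigned int   scs_system_id;        /* SCS system ID                  */
        char           hw_type_model[32];    /* CPU hardware type/model        */
        unsigned int   hw_revision;          /* hardware revision level        */
        char           sw_type_version[16];  /* software type and version      */
        unsigned int   sw_incarnation;       /* software incarnation           */
        unsigned int   num_paths;            /* number of virtual circuits     */
        unsigned short max_datagram_size;    /* max datagram size on this path */
        unsigned short max_seq_msg_size;     /* max sequenced-message size     */
    } system_block;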



    SHOW CLUSTER/CONTINUOUS display of System Block (SB)

    View of Cluster from system ID 14733 node: ALPHA1 12-SEP-2006 23:07:02

    +—————————————————————————————————————————————————————————————————————————————+

    | SYSTEMS |

    |————————-——————————-————————————————————————————————-——————————————-—————————|

    | NODE | SYS_ID | HW_TYPE | SOFTWARE | NUM_CIR |

    |————————+——————————+————————————————————————————————+——————————————+—————————|

    | ALPHA1 | 14733 | AlphaServer ES45 Model 2 | VMS V7.3-2 | 1 |

    | MARVEL | 14369 | hp AlphaServer GS1280 7/1150 | VMS V8.2 | 1 |

    | ONEU | 14488 | AlphaServer DS10L 617 MHz | VMS V7.3-2 | 1 |

    | MARVL2 | 14487 | AlphaServer GS320 6/731 | VMS V8.3 | 1 |

    | MARVL3 | 14486 | AlphaServer GS320 6/731 | VMS V7.3-2 | 1 |

    | IA64 | 14362 | HP rx4640 (1.30GHz/3.0MB) | VMS V8.2-1 | 1 |

    | MARVL4 | 14368 | hp AlphaServer GS1280 7/1150 | VMS V8.2 | 1 |

    | WLDFIR | 14485 | AlphaServer GS320 6/731 | VMS V8.2 | 1 |

    | LTLMRV | 14370 | hp AlphaServer ES47 7/1000 | VMS V8.2 | 1 |

    | WILDFR | 14484 | AlphaServer GS320 6/731 | VMS V8.3 | 1 |

    | AS4000 | 14423 | AlphaServer 4X00 5/466 4MB | VMS V7.3-2 | 1 |

    +————————-——————————-————————————————————————————————-——————————————-—————————+



    SHOW CLUSTER/CONTINUOUS display of System Block (SB)

    View of Cluster from system ID 29831 node: VX6360 12-SEP-2006 21:28:53

    +—————————————————————————————————————————————————————————————————————————+

    | SYSTEMS |

    |————————-——————————-————————————————————————————————-——————————-—————————|

    | NODE | SYS_ID | HW_TYPE | SOFTWARE | NUM_CIR |

    |————————+——————————+————————————————————————————————+——————————+—————————|

    | VX6360 | 29831 | VAX 6000-360 | VMS V7.3 | 2 |

    | HSJ501 | ******** | HSJ5 | HSJ V52J | 1 |

    | HSJ401 | ******** | HSJ4 | HSJ YF06 | 1 |

    | HSJ402 | ******** | HSJ4 | HSJ V34J | 1 |

    | HSC951 | 65025 | HS95 | HSC V851 | 1 |

    | HSJ403 | ******** | HSJ4 | HSJ V32J | 1 |

    | HSJ404 | ******** | HSJ4 | HSJ YF06 | 1 |

    | HSJ503 | ******** | HSJ5 | HSJ V57J | 1 |

    | HSJ504 | ******** | HSJ5 | HSJ V57J | 1 |

    | VX6440 | 30637 | VAX 6000-440 | VMS V7.3 | 2 |

    | WILDFR | 29740 | AlphaServer GS160 6/731 | VMS V7.3 | 2 |

    | WLDFIR | 29768 | AlphaServer GS160 6/731 | VMS V7.3 | 2 |

    | VX7620 | 30159 | VAX 7000-620 | VMS V7.3 | 2 |

    | VX6640 | 29057 | VAX 6000-640 | VMS V7.3 | 2 |

    | MV3900 | 29056 | MicroVAX 3900 Series | VMS V7.3 | 1 |

    +————————-——————————-————————————————————————————————-——————————-—————————+


SCS Data Structures: PB (Path Block)

    • Data displayed by

      $ SHOW CLUSTER/CONTINUOUS

      ADD CIRCUITS/ALL

    • One PB per virtual circuit to a given node


SCS Data Structures: PB (Path Block)

    • Data of interest:

      • Virtual circuit status

      • Number of connections over this path

      • Local port device name

      • Remote port ID

      • Hardware port type, and hardware revision level

      • Bit-mask of functions the remote port can perform

      • Load class value (so SCS can select the optimal path)



    SHOW CLUSTER/CONTINUOUS display of Path Block (PB)

    View of Cluster from system ID 29831 node: VX6360 12-SEP-2006 21:51:57

    +————————-———————————————————————————————————————————————————————————————————+

    | SYSTEMS| CIRCUITS |

    |————————+———————-———————-—————————-—————————-—————————-——————————-——————————|

    | NODE | LPORT | RPORT | RP_TYPE | NUM_CON | CIR_STA | RP_REVIS | RP_FUNCT |

    |————————+———————+———————+—————————+—————————+—————————+——————————+——————————|

    | VX6360 | PAA0 | 13 | CIBCA-B | 0 | OPEN | 40074002 | FFFF0F00 |

    | | PEA0 | | LAN | 0 | OPEN | 105 | 83FF0180 |

    | VX6440 | PAA0 | 15 | CIXCD | 6 | OPEN | 46 | ABFF0D00 |

    | | PEA0 | | LAN | 0 | OPEN | 105 | 83FF0180 |

    | WILDFR | PAA0 | 14 | CIPCA | 5 | OPEN | 20 | ABFF0D00 |

    | | PEA0 | | LAN | 0 | OPEN | 105 | 83FF0180 |

    | WLDFIR | PAA0 | 11 | CIPCA | 4 | OPEN | 20 | ABFF0D00 |

    | | PEA0 | | LAN | 0 | OPEN | 105 | 83FF0180 |

    | VX7620 | PAA0 | 3 | CIXCD | 5 | OPEN | 47 | ABFF0D00 |

    | | PEA0 | | LAN | 0 | OPEN | 105 | 83FF0180 |

    | VX6640 | PAA0 | 8 | CIXCD | 5 | OPEN | 47 | ABFF0D00 |

    | | PEA0 | | LAN | 0 | OPEN | 105 | 83FF0180 |

    | MV3900 | PEA0 | | LAN | 5 | OPEN | 105 | 83FF0180 |

    +————————-———————-———————-—————————-—————————-—————————-——————————-——————————+



    SHOW CLUSTER/CONTINUOUS display of Path Block (PB)

View of Cluster from system ID 14733 node: ALPHA1 12-SEP-2006 23:59:20

    +————————-——————————————————————————————————————————————————————————————+

    | SYSTEMS| CIRCUITS |

    |————————+——————-———————-—————————-———————-—————————-——————————-————————|

    | NODE | LPOR | RP_TY | CIR_STA | LD_CL | NUM_CON | RP_FUNCT | RP_OWN |

    |————————+——————+———————+—————————+———————+—————————+——————————+————————|

    | ALPHA1 | PEA0 | LAN | OPEN | 0 | 0 | 83FF0180 | 0 |

    | MARVEL | PEA0 | LAN | OPEN | 200 | 3 | 83FF0180 | 0 |

    | ONEU | PEA0 | LAN | OPEN | 100 | 4 | 83FF0180 | 0 |

    | MARVL2 | PEA0 | LAN | OPEN | 100 | 3 | 83FF0180 | 0 |

    | MARVL3 | PEA0 | LAN | OPEN | 100 | 3 | 83FF0180 | 0 |

    | IA64 | PEA0 | LAN | OPEN | 100 | 3 | 83FF0180 | 0 |

    | MARVL4 | PEA0 | LAN | OPEN | 200 | 3 | 83FF0180 | 0 |

    | WLDFIR | PEA0 | LAN | OPEN | 200 | 4 | 83FF0180 | 0 |

    | LTLMRV | PEA0 | LAN | OPEN | 100 | 4 | 83FF0180 | 0 |

    | WLDFIR | PEA0 | LAN | OPEN | 200 | 3 | 83FF0180 | 0 |

    | AS4000 | PEA0 | LAN | OPEN | 200 | 3 | 83FF0180 | 0 |

    +————————-——————-———————-—————————-———————-—————————-——————————-————————+


SCS Data Structures: PDT (Port Descriptor Table)

    • Data displayed by

      $ SHOW CLUSTER/CONTINUOUS

      ADD LOCAL_PORTS/ALL

    • One PDT per local SCS port device

      • LANs all fit under one port: PEA0


SCS Data Structures: PDT (Port Descriptor Table)

    • Data of interest:

      • Port device name

      • Type of port

        • i.e. VAX CI, NPORT CI, DSSI, NI, MC, Galaxy SMCI

      • Local port ID, and maximum ID for this interconnect

      • Count of open connections using this port

      • Pointers to routines to perform various port functions

      • Load Class value

      • Counters of various types of traffic

      • Maximum data transfer size



    SHOW CLUSTER/CONTINUOUS display of Port Descriptor Table (PDT) entries

View of Cluster from system ID 29831 node: VX6360 12-SEP-2006 22:17:28

    +———————————————————————————————————————————————————————————————————+

    | LOCAL_PORTS |

    |————————————-——————-————————-——————-—————-—————-————————-——————————|

    | LSYSTEM_ID | NAME | LP_STA | PORT | DGS | MSG | LP_TYP | MAX_PORT |

    |————————————+——————+————————+——————+—————+—————+————————+——————————|

    | 29831 | PAA0 | ONLINE | 13 | 15+ | 15+ | CIBCA | 15 |

    | 29831 | PEA0 | ONLINE | 0 | 15+ | 6 | LAN | 223 |

    +————————————-——————-————————-——————-—————-—————-————————-——————————+


SCS Data Structures: CDT (Connection Descriptor Table)

    • Data displayed by

      $ SHOW CLUSTER/CONTINUOUS

      ADD CONNECTIONS/ALL

      ADD COUNTERS/ALL


SCS Data Structures: CDT (Connection Descriptor Table)

    • Data of interest:

      • Local and remote SYSAP names

      • Local and remote Connection IDs

      • Connection status

      • Pointer to System Block (SB) of the node at the other end of the connection

      • Counters of various types of traffic



    SHOW CLUSTER/CONTINUOUS display of Connection Descriptor Table (CDT) entries

    View of Cluster from system ID 29831 node: VX6360 12-SEP-2006 22:28:42

    +————————-—————————————————————————————————————-——————————————————————————+

    | SYSTEMS| CONNECTIONS | COUNTERS |

    |————————+——————————————————-——————————————————+—————————-—————————-——————|

    | NODE | LOC_PROC_NAME | REM_PROC_NAME | MSGS_SE | MSGS_RC | CR_W |

    |————————+——————————————————+——————————————————+—————————+—————————+——————|

    | VX6360 | SCS$DIRECTORY | ? | | | |

    | | VMS$SDA | ? | | | |

    | | VMS$VAXcluster | ? | | | |

    | | SCA$TRANSPORT | ? | | | |

    | | MSCP$DISK | ? | | | |

    | | MSCP$TAPE | ? | | | |

    | VX6440 | SCA$TRANSPORT | SCA$TRANSPORT | 89219 | 193862 | 0 |

    | | VMS$DISK_CL_DRVR | MSCP$DISK | 95239 | 95243 | 0 |

    | | MSCP$DISK | VMS$DISK_CL_DRVR | 95251 | 95250 | 0 |

    | | MSCP$TAPE | VMS$TAPE_CL_DRVR | 1246 | 1246 | 0 |

    | | VMS$TAPE_CL_DRVR | MSCP$TAPE | 1242 | 1247 | 0 |

    | | VMS$VAXcluster | VMS$VAXcluster | 2792737 | 2316397 | 211 |

    | WILDFR | MSCP$TAPE | VMS$TAPE_CL_DRVR | 1994 | 1994 | 0 |

    | | MSCP$DISK | VMS$DISK_CL_DRVR | 152618 | 152617 | 0 |

    | | VMS$DISK_CL_DRVR | MSCP$DISK | 985269 | 985274 | 0 |

    | | VMS$TAPE_CL_DRVR | MSCP$TAPE | 1990 | 1990 | 0 |

    | | VMS$VAXcluster | VMS$VAXcluster | 227378 | 439169 | 0 |



    Connection Manager



    Connection Manager

    • The Connection Manager is code within OpenVMS that coordinates cluster membership across events such as:

      • Forming a cluster initially

      • Allowing a node to join the cluster

      • Cleaning up after a node which has failed or left the cluster

    • all the while protecting against uncoordinated access to shared resources such as disks



    Rule of Total Connectivity

    • Every system must be able to talk “directly” with every other system in the cluster

      • Without having to go through another system

      • Transparent LAN bridges are considered a “direct” connection

    • If the Rule of Total Connectivity gets broken, the cluster will need to be adjusted so all members can once again talk with one another and meet the Rule.



    Quorum Scheme

    • The Connection Manager enforces the Quorum Scheme to ensure that all access to shared resources is coordinated

      • Basic idea: A majority of the potential cluster systems must be present in the cluster before any access to shared resources (i.e. disks) is allowed

    • Idea comes from familiar parliamentary procedures

      • As in human parliamentary procedure, requiring a quorum before doing business prevents two or more subsets of members from meeting simultaneously and doing conflicting business



    Quorum Scheme

    • Systems (and sometimes disks) are assigned values for the number of votes they have in determining a majority

    • The total number of votes possible is called the “Expected Votes” – the number of votes to be expected when all cluster members are present

    • “Quorum” is defined to be a simple majority (just over half) of the total possible (the Expected) votes
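As a quick sketch of the arithmetic: quorum is a simple majority of EXPECTED_VOTES, and the published formula is QUORUM = (EXPECTED_VOTES + 2) / 2 with integer division. The results match the EXP/QUORUM and CL_EXP/CL_QUO columns in the SHOW CLUSTER displays later in this presentation:

    #include <stdio.h>

    /* Quorum = a simple majority of the expected votes:
       the published formula is (EXPECTED_VOTES + 2) / 2, integer division. */
    static unsigned int quorum(unsigned int expected_votes)
    {
        return (expected_votes + 2) / 2;
    }

    int main(void)
    {
        /* e.g. EXPECTED_VOTES = 10 gives QUORUM = 6, as in the CLUB display later */
        for (unsigned int ev = 1; ev <= 10; ev++)
            printf("EXPECTED_VOTES=%2u -> QUORUM=%u\n", ev, quorum(ev));
        return 0;
    }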



    Quorum Schemes

    • In the event of a communications failure,

      • Systems in the minority voluntarily suspend processing, while

      • Systems in the majority can continue to process transactions



    Quorum Scheme

    • If a cluster member is not part of a cluster with quorum, OpenVMS keeps it from doing any harm by:

      • Putting all disks into Mount Verify state, thus stalling all disk I/O operations

      • Requiring that all processes can only be scheduled to run on a CPU with the QUORUM capability bit set

      • Clearing the QUORUM capability bit on all CPUs in the system, thus preventing any process from being scheduled to run on a CPU and doing any work



    Quorum Schemes

    • To handle cases where there is an even number of votes

      • For example, with only 2 systems,

      • Or half of the votes in the cluster are at each of 2 geographically-separated sites

    • provision may be made for

      • a tie-breaking vote, or

      • human intervention



    Quorum Schemes:Tie-breaking vote

    • This can be provided by a disk:

      • Quorum Disk

    • Or an extra system with a vote:

      • Additional cluster member node, called a “Quorum Node”



    Quorum Scheme

    • A “quorum disk” can be assigned votes

      • OpenVMS periodically writes cluster membership info into the QUORUM.DAT file on the quorum disk and later reads it to re-check it; if all is well, OpenVMS can treat the quorum disk as a virtual voting member of the cluster

      • If the cluster ever splits into two subsets, each subset will "see" the other through its modifications of the QUORUM.DAT file as it polls the file's contents, and will then refuse to count the quorum disk's votes



    Quorum Loss

    • If too many systems leave the cluster, there may no longer be a quorum of votes left

    • It is possible to manually force the cluster to recalculate quorum and continue processing if needed



    Quorum Recovery Methods

    • Software interrupt at IPL 12 from console

      • IPC> Q

    • Availability Manager or DECamds:

      • System Fix; Adjust Quorum

    • DTCS or BRS integrated tool, using same RMDRIVER (data provider client) interface as Availability Manager and DECamds



    Quorum Scheme

    • If two non-cooperating subsets of cluster nodes both achieve quorum and access shared resources, this is known as a “partitioned cluster” (a.k.a. “Split Brain Syndrome”)

    • When a partitioned cluster occurs, the disk structure on shared disks is quickly corrupted



    Quorum Scheme

    • Avoid a partitioned cluster by:

      • Proper setting of EXPECTED_VOTES parameter to the total of all possible votes

        • Note: OpenVMS ratchets up the dynamic cluster-wide value of Expected Votes as votes are added

        • When temporarily lowering EXPECTED_VOTES at boot time, also set WRITESYSPARAMS to zero to avoid writing a low value of EXPECTED_VOTES to the Current SYSGEN parameters

      • Make manual quorum adjustments with care:

        • Availability Manager software (or DECamds)

        • System SHUTDOWN with REMOVE_NODE option



    Connection Manager and Transient Failures

    • Some communications failures are temporary and transient

      • Especially in a LAN environment

    • To prevent the disruption of unnecessarily removing a node from the cluster, the Connection Manager waits for a while after detecting a communications failure, in the hope that the problem will go away by itself

      • This time is called the Reconnection Interval

        • SYSGEN parameter RECNXINTERVAL

          • RECNXINTERVAL is dynamic and may thus be temporarily raised if needed for something like a scheduled LAN outage



    Connection Manager and Communications or Node Failures

    • If the Reconnection Interval passes without connectivity being restored, or if the node has “gone away”, the cluster cannot continue without a reconfiguration

    • This reconfiguration is called a State Transition, and one or more nodes will be removed from the cluster



    Optimal Sub-cluster Selection

    • Connection manager compares potential node subsets that could make up surviving portion of the cluster

    • Pick sub-cluster with the most votes

    • If votes are tied, pick sub-cluster with the most nodes

    • If nodes are tied, arbitrarily pick a winner

      • based on comparing SCSSYSTEMID values of the set of nodes with the most-recent cluster software revision
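A conceptual C sketch of the comparison just listed (most votes first, then most nodes, then a deterministic tie-break); the structure and the tie-break derivation are illustrative only:

    #include <stdio.h>

    /* Conceptual summary of a candidate surviving subset. Illustrative only. */
    typedef struct {
        int          votes;          /* total votes held by the subset       */
        int          nodes;          /* number of nodes in the subset        */
        unsigned int tiebreak_id;    /* e.g. derived from SCSSYSTEMID values */
    } subset;

    /* Return <0 if a loses to b, >0 if a wins, following the rules above:
       most votes first, then most nodes, then a deterministic tie-break. */
    static int compare_subsets(const subset *a, const subset *b)
    {
        if (a->votes != b->votes) return a->votes - b->votes;
        if (a->nodes != b->nodes) return a->nodes - b->nodes;
        return (a->tiebreak_id > b->tiebreak_id) - (a->tiebreak_id < b->tiebreak_id);
    }

    int main(void)
    {
        subset a = { .votes = 2, .nodes = 2, .tiebreak_id = 14733 };
        subset b = { .votes = 2, .nodes = 3, .tiebreak_id = 14369 };
        printf("winner: subset %s\n", compare_subsets(&a, &b) > 0 ? "A" : "B");
        return 0;
    }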



    LAN reliability

    • VOTES:

      • Most configurations with satellite nodes give votes to disk/boot servers and set VOTES=0 on all satellite nodes

        • If the sole LAN adapter on a disk/boot server fails, and it has a vote, ALL satellites will leave the cluster

      • Advice: give at least as many votes to node(s) on the LAN as any single server has, or configure redundant LAN adapters



    LAN redundancy and Votes

[Diagram: five cluster nodes on a LAN, with votes 0, 0, 0, 1, and 1.]



    LAN redundancy and Votes

[Diagram: the same five nodes, with votes 0, 0, 0, 1, and 1.]



    LAN redundancy and Votes

[Diagram: the failure splits the nodes into Subset A and Subset B, with the votes 0, 0, 0, 1, and 1 divided between them.]

    Which subset of nodes is selected as the optimal sub-cluster?



    LAN redundancy and Votes

[Diagram: the same five nodes, with votes 0, 0, 0, 1, and 1.]

    One possible solution: redundant LAN adapters on servers



    LAN redundancy and Votes

[Diagram: the five nodes, now with votes 1, 1, 1, 2, and 2.]

    Another possible solution: Enough votes on LAN to outweigh any single server node



    Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Diagram: a two-site cluster; the two nodes at one site have 1 vote each, the two nodes at the other site have 0 votes each, and shadowsets span the two sites.]



    Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Diagram: the same two-site cluster, with votes 1 and 1 at one site and 0 and 0 at the other, and shadowsets spanning the sites.]

    Which subset of nodes is selected as the optimal sub-cluster?



    Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

[Diagram: the nodes at the site holding the votes (1 and 1) continue; the nodes at the zero-vote site CLUEXIT. Without nodes to MSCP-serve the disks at the departed site, those shadowset members go away.]



    Optimal Sub-cluster Selection example: Two-Site Cluster with Unbalanced Votes

One possible solution: [Diagram: give the nodes at one site 2 votes each and the nodes at the other site 1 vote each, with shadowsets still spanning the sites.]



    OpenVMS Cluster Data Structures for the Connection Manager


OpenVMS Cluster Data Structures for the Connection Manager

    • Connection Manager Data Structures

      • CLUB (Cluster Block)

      • CSB (Cluster System Block)

      • CSV (Cluster System Vector)



    Connection Manager Data Structures: CLUB (Cluster Block)

    • Only one CLUB on a given node

    • Pointed to by CLU$GL_CLUB

    • Data displayed by

      $ SHOW CLUSTER/CONTINUOUS

      ADD CLUSTER/ALL



    Connection Manager Data Structures: CLUB (Cluster Block)

    • Data of interest:

      • Number of cluster members

      • Total number of votes

      • Current cluster-wide effective values of EXPECTED_VOTES and QUORUM

      • Various flags, like whether the cluster has quorum or not

      • Pointer to the Cluster System Block (CSB) for the local system

      • Data used during state transitions, like the identity of State Transition Coordinator Node, and what phase the transition is in

      • Time the cluster was formed, and time of last state transition



    SHOW CLUSTER/CONTINUOUS display of Cluster Block (CLUB)

    View of Cluster from system ID 14733 node: ALPHA1 13-SEP-2006 00:59:49

    +—————————————————————————————————————————————————————————————————————————————+

    | CLUSTER |

    |————————-————————-——————————-————————-———————————————————-———————————————————|

    | CL_EXP | CL_QUO | CL_VOTES | CL_MEM | FORMED | LAST_TRANSITION |

    |————————+————————+——————————+————————+———————————————————+———————————————————|

    | 10 | 6 | 10 | 11 | 17-MAY-2006 14:48 | 11-SEP-2006 15:18 |

    +————————-————————-——————————-————————-———————————————————-———————————————————+


Connection Manager Data Structures: CSB (Cluster System Block)

    • Data displayed by

      $ SHOW CLUSTER/CONTINUOUS

      ADD MEMBERS/ALL

    • One CSB for each OpenVMS node this node is aware of



    Connection Manager Data Structures: CSB (Cluster System Block)

    • Data of interest:

      • Cluster System ID (CSID) of this node

      • Status of this node (e.g. cluster member or not)

      • State of our connection to this node

        • e.g. Newly discovered node, normal connection, communications lost but still within RECNXINTERVAL wait period, lost connection too long so node is no longer in cluster

      • This node’s private values for VOTES, EXPECTED_VOTES, QDSKVOTES, LOCKDIRWT

      • Cluster software version, and bit-mask of capabilities this node has (to allow support of cluster rolling upgrades despite adding new features)


SHOW CLUSTER/CONTINUOUS display of Cluster System Block (CSB)

    View of Cluster from system ID 14733 node: ALPHA1 13-SEP-2006 01:09:00

    +——————————————————————————————————————————————————————————————————————————————+

    | MEMBERS |

    |————————-————————-———————-—————-————————-———————————————-—————————-———————————|

    | CSID | DIR_WT | VOTES | EXP | QUORUM | RECNXINTERVAL | SW_VERS | BITMAPS |

    |————————+————————+———————+—————+————————+———————————————+—————————+———————————|

    | 10061 | 1 | 1 | 1 | 1 | 60 | 6 | MINICOPY |

    | 1005E | 1 | 1 | 6 | 4 | 60 | 6 | MINICOPY |

    | 10032 | 0 | 0 | 5 | 3 | 60 | 6 | MINICOPY |

    | 1005D | 1 | 1 | 9 | 5 | 60 | 6 | MINICOPY |

    | 10057 | 1 | 1 | 9 | 5 | 60 | 6 | MINICOPY |

    | 10064 | 1 | 1 | 10 | 6 | 60 | 6 | MINICOPY |

    | 1005F | 1 | 1 | 6 | 4 | 60 | 6 | MINICOPY |

    | 10058 | 1 | 1 | 9 | 5 | 60 | 6 | MINICOPY |

    | 10060 | 1 | 1 | 9 | 5 | 60 | 6 | MINICOPY |

    | 1005A | 1 | 1 | 9 | 5 | 60 | 6 | MINICOPY |

    | 10043 | 1 | 1 | 5 | 3 | 60 | 6 | MINICOPY |

    +————————-————————-———————-—————-————————-———————————————-—————————-———————————+



    Connection Manager Data Structures: CSV (Cluster System Vector)

    • The CSV is effectively a list of all systems

    • The Cluster System ID (CSID) acts as an index into the CSV

    • Incrementing sequence numbers within CSIDs allows re-use of CSV slots without confusion

      • This is similar to how the sequence number is incremented before a File ID is re-used
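Conceptually, a CSID can be pictured as a CSV slot index combined with a sequence number that is bumped each time the slot is re-used, so a stale CSID never matches a new occupant. The field widths in this C sketch are hypothetical:

    #include <stdio.h>

    /* Conceptual CSID: a CSV slot index plus a sequence number that is bumped
       each time the slot is re-used. Field widths here are illustrative. */
    typedef unsigned int csid_t;

    static csid_t make_csid(unsigned int seq, unsigned int slot)
    {
        return (seq << 16) | (slot & 0xFFFF);
    }

    static unsigned int csid_slot(csid_t csid) { return csid & 0xFFFF; }
    static unsigned int csid_seq(csid_t csid)  { return csid >> 16;    }

    int main(void)
    {
        csid_t old_member = make_csid(1, 0x5D);   /* node that once occupied slot 0x5D */
        csid_t new_member = make_csid(2, 0x5D);   /* slot re-used after it left        */

        /* Same slot, different sequence number: a stale CSID can be detected. */
        printf("slot %X reused: old CSID %08X vs new CSID %08X (seq %u -> %u)\n",
               csid_slot(old_member), old_member, new_member,
               csid_seq(old_member), csid_seq(new_member));
        return 0;
    }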



    Distributed Lock Manager



    Distributed Lock Manager

    • The Lock Manager provides mechanisms for coordinating access to physical devices, both for exclusive access and for various degrees of sharing



    Distributed Lock Manager

    • Examples of physical resources:

      • Tape drives

      • Disks

      • Files

      • Records within a file

    • as well as internal operating system cache buffers and so forth



    Distributed Lock Manager

    • Physical resources are mapped to symbolic resource names, and locks are taken out and released on these symbolic resources to control access to the real resources



    Distributed Lock Manager

    • System services $ENQ and $DEQ allow new lock requests, conversion of existing locks to different modes (or degrees of sharing), and release of locks, while $GETLKI allows the lookup of lock information


OpenVMS Cluster Distributed Lock Manager

    • Physical resources are protected by locks on symbolic resource names

    • Resources are arranged in trees:

      • e.g. File → Data bucket → Record

    • Different resources (disk, file, etc.) are coordinated with separate resource trees, to minimize contention



    Symbolic lock resource names

    • Symbolic resource names

      • Common prefixes:

        • SYS$ for OpenVMS executive

        • F11B$ for XQP, file system

        • RMS$ for Record Management Services

      • See the book OpenVMS Internals and Data Structures by Ruth Goldenberg, et al

        • Appendix H in Alpha V1.5 version

        • Appendix A in Alpha V7.0 version



    Resource names

    • Example: Device Lock

      • Resource name format is

        • “SYS$” {Device Name in ASCII text}

3A3333324147442431245F24535953 = SYS$_$1$DGA233:

  SYS$         (SYS$ facility)
  _$1$DGA233:  (Device name)
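A small C sketch of that layout (the helper name is hypothetical): it builds the counted "SYS$" + device-name string and then dumps the bytes with higher addresses to the left, as a VMS dump does, reproducing the hex string above:

    #include <stdio.h>
    #include <string.h>

    /* Build a device-lock resource name of the form "SYS$" + device name,
       as described above. Helper name and buffer size are illustrative. */
    static size_t make_device_resnam(char *buf, size_t buflen, const char *devnam)
    {
        snprintf(buf, buflen, "SYS$%s", devnam);
        return strlen(buf);            /* resource names are counted strings */
    }

    int main(void)
    {
        char resnam[64];
        size_t len = make_device_resnam(resnam, sizeof resnam, "_$1$DGA233:");
        printf("resource name (%zu bytes): %s\n", len, resnam);

        /* VMS-style dumps show higher addresses to the left, so printing the
           bytes last-to-first reproduces the hex string shown above. */
        for (size_t i = len; i-- > 0; )
            printf("%02X", (unsigned char)resnam[i]);
        printf("\n");
        return 0;
    }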



    Resource names

    • Example: RMS lock tree for an RMS indexed file:

      • Resource name format is

        • “RMS$” {File ID} {Flags byte} {Lock Volume Name}

      • Identify filespec using File ID

      • Flags byte indicates shared or private disk mount

      • Pick up disk volume name

        • This is the volume label as of the time the disk was mounted

    • Sub-locks are used for buckets and records within the file



    Decoding an RMS File Root Resource Name

RMS$t......FDDI_COMMON ...
000000 204E4F4D4D4F435F49444446 02 000000011C74 24534D52

  24534D52                  = RMS$          (RMS Facility; RMS File Root Resource)
  00 00 0001 1C74           = RVN, FilX, Sequence_Number, File_Number
                              → File ID (7284,1,0)
  02                        = Flags byte    (Disk is mounted /SYSTEM)
  204E4F4D4D4F435F49444446  = FDDI_COMMON   (Disk label)

$ dump/header/id=7284/block=count=0 -
  disk$FDDI_COMMON:[000000]indexf.sys
Dump of file _DSA100:[SYSEXE]SYSUAF.DAT;1 ...   (File)
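The same decoding can be expressed as a small C sketch. The offsets follow the breakdown above; the structure and helper names are illustrative rather than an RMS-internal definition:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative decoder for the RMS file-root resource name laid out above:
       "RMS$" + File ID bytes + flags byte + 12-character volume label. */
    struct rms_root {
        uint16_t file_number;
        uint16_t sequence_number;
        uint8_t  fid_ext[2];         /* RVN / FilX bytes (both zero here)      */
        uint8_t  flags;              /* e.g. 02 = disk mounted /SYSTEM         */
        char     volume_label[13];   /* 12-character label, NUL-terminated     */
    };

    static int decode_rms_root(const unsigned char *resnam, size_t len, struct rms_root *out)
    {
        if (len < 4 + 6 + 1 + 12 || memcmp(resnam, "RMS$", 4) != 0)
            return -1;                               /* not an RMS root resource */
        out->file_number     = (uint16_t)(resnam[4] | resnam[5] << 8);
        out->sequence_number = (uint16_t)(resnam[6] | resnam[7] << 8);
        out->fid_ext[0]      = resnam[8];
        out->fid_ext[1]      = resnam[9];
        out->flags           = resnam[10];
        memcpy(out->volume_label, resnam + 11, 12);
        out->volume_label[12] = '\0';
        return 0;
    }

    int main(void)
    {
        /* The resource name from the example above, in memory order (low address first). */
        const unsigned char resnam[] =
            "RMS$\x74\x1C\x01\x00\x00\x00\x02" "FDDI_COMMON ";
        struct rms_root r;
        if (decode_rms_root(resnam, sizeof resnam - 1, &r) == 0)
            printf("File ID (%u,%u,%u), flags %02X, volume '%s'\n",
                   r.file_number, r.sequence_number, r.fid_ext[0], r.flags, r.volume_label);
        return 0;
    }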



    Internal Structure of an RMS Indexed File



    RMS Data Bucket Contents

[Diagram: a Data Bucket containing a series of Data Records.]


RMS Indexed File Bucket and Record Locks

    • Sub-locks of RMS File Lock

      • Have to look at Parent lock to identify file

    • Bucket lock:

      • 4 bytes: VBN of first block of the bucket

    • Record lock:

      • 8 bytes: Record File Address (RFA) of record


Distributed Lock Manager: Locking Modes

    • Different types of locks allow different levels of sharing:

      • EX: Exclusive access: No other simultaneous access allowed

      • PW: Protected Write: Allows writing, with other read-only users allowed

      • PR: Protected Read: Allows reading, with other read-only users; no write access allowed

      • CW: Concurrent Write: Allows writing while others write

      • CR: Concurrent Read: Allows reading while others write

      • NL: Null (future interest in the resource)

    • Locks can be requested, released, and converted between modes


Distributed Lock Manager: Lock Mode Combinations
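As commonly documented for the OpenVMS lock manager, the mode combinations can be summarized in a small compatibility table; here is a C sketch of it (Y means the two modes can be granted on the same resource at the same time):

    #include <stdbool.h>
    #include <stdio.h>

    /* Lock modes in increasing strength, per the list above. */
    enum lkmode { NL, CR, CW, PR, PW, EX };
    static const char *name[] = { "NL", "CR", "CW", "PR", "PW", "EX" };

    /* Mode compatibility as commonly documented for the OpenVMS lock manager. */
    static const bool compatible[6][6] = {
        /*         NL CR CW PR PW EX */
        /* NL */ {  1, 1, 1, 1, 1, 1 },
        /* CR */ {  1, 1, 1, 1, 1, 0 },
        /* CW */ {  1, 1, 1, 0, 0, 0 },
        /* PR */ {  1, 1, 0, 1, 0, 0 },
        /* PW */ {  1, 1, 0, 0, 0, 0 },
        /* EX */ {  1, 0, 0, 0, 0, 0 },
    };

    int main(void)
    {
        printf("    NL CR CW PR PW EX\n");
        for (int held = NL; held <= EX; held++) {
            printf("%s ", name[held]);
            for (int req = NL; req <= EX; req++)
                printf(" %c ", compatible[held][req] ? 'Y' : '-');
            printf("\n");
        }
        return 0;
    }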


Distributed Lock Manager: Lock Master nodes

    • OpenVMS assigns a single node at a time to keep track of all the resources in a given resource tree, and any locks taken out on those resources

      • This node is called the Lock Master node for that tree

      • Different trees often have different Lock Master nodes

      • OpenVMS dynamically moves Lock Mastership duties to the node with the most locking activity on that tree



    Distributed Locks

[Diagram: Node A is the Lock Master for resource X. Node B and Node C each hold their own lock on resource X; Node A holds its own lock on resource X plus copies of Node B's and Node C's locks on resource X.]



    Directory Lookups

    • This is how OpenVMS finds out which node is the lock master

    • Only needed for 1st lock request on a particular resource tree on a given node

      • Resource Block (RSB) remembers master node CSID

    • Basic conceptual algorithm: Hash resource name and index into the lock directory weight vector, which has been created based on LOCKDIRWT values for each node



    Lock Directory Weight Vector

Big node: PAPABR, Lock Directory Weight = 2

Middle-sized node: MAMABR, Lock Directory Weight = 1

Satellite node: BABYBR, Lock Directory Weight = 0

Resulting lock directory weight vector: PAPABR, PAPABR, MAMABR
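Putting the last two slides together, a conceptual C sketch of the lookup: build the weight vector from each node's LOCKDIRWT (one slot per unit of weight), hash the root resource name, and index into the vector to pick the directory node. The hash function and data layout are illustrative only, not the OpenVMS internals:

    #include <stdio.h>

    /* Conceptual sketch only: choose a lock directory node by hashing the
       resource name into a vector built from LOCKDIRWT values. */
    struct node { const char *name; int lockdirwt; };

    static unsigned int hash_resnam(const char *resnam)
    {
        unsigned int h = 5381;                  /* simple string hash (djb2) */
        while (*resnam)
            h = h * 33 + (unsigned char)*resnam++;
        return h;
    }

    int main(void)
    {
        const struct node nodes[] = {
            { "PAPABR", 2 }, { "MAMABR", 1 }, { "BABYBR", 0 },
        };

        /* Build the weight vector: each node appears LOCKDIRWT times. */
        const char *vector[16];
        int n = 0;
        for (size_t i = 0; i < sizeof nodes / sizeof nodes[0]; i++)
            for (int w = 0; w < nodes[i].lockdirwt; w++)
                vector[n++] = nodes[i].name;    /* -> PAPABR, PAPABR, MAMABR */

        /* First lock request on a tree: hash the root resource name and index
           into the vector to find the directory node to ask. */
        const char *resnam = "RMS$....FDDI_COMMON";
        const char *dir_node = vector[hash_resnam(resnam) % n];
        printf("directory node for \"%s\" is %s\n", resnam, dir_node);
        return 0;
    }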



    Lock Directory Lookup

[Diagram: a user lock request on the local node goes first to the Directory node, which identifies the Lock Master node; the reply from the Lock Master node satisfies the user's request.]


    Performance Hints

    • Avoid $DEQing a lock which will be re-acquired later

      • Instead, convert to Null lock, and convert back up later

      • Or, create locks as sub-locks of a parent lock that remains held

    • This avoids the need for directory lookups
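A minimal C sketch of this hint, assuming the documented SYS$ENQW/SYS$DEQ interfaces and the standard starlet headers; the resource name is made up and error handling is abbreviated:

    #include <descrip.h>
    #include <lckdef.h>
    #include <ssdef.h>
    #include <starlet.h>
    #include <stdio.h>

    /* Lock status block, as documented: condition value, reserved word, lock ID. */
    struct lksb { unsigned short status, reserved; unsigned int lkid; };

    int main(void)
    {
        struct lksb lksb = {0};
        $DESCRIPTOR(resnam, "DEMO_APPL_RESOURCE");      /* made-up resource name */
        unsigned int st;

        /* Take out the lock in Protected Write mode. */
        st = sys$enqw(0, LCK$K_PWMODE, (void *)&lksb, 0, &resnam,
                      0, 0, 0, 0, 0, 0, 0);
        if (!(st & 1) || !(lksb.status & 1)) return st;

        /* ... use the protected resource ... */

        /* Instead of $DEQ, convert down to NL: the lock (and knowledge of the
           lock master) is retained, so re-acquiring it needs no directory lookup. */
        st = sys$enqw(0, LCK$K_NLMODE, (void *)&lksb, LCK$M_CONVERT, &resnam,
                      0, 0, 0, 0, 0, 0, 0);

        /* Later: convert back up to PW when the resource is needed again. */
        st = sys$enqw(0, LCK$K_PWMODE, (void *)&lksb, LCK$M_CONVERT, &resnam,
                      0, 0, 0, 0, 0, 0, 0);

        /* Only when genuinely finished with the resource, release the lock. */
        st = sys$deq(lksb.lkid, 0, 0, 0);
        printf("final status %u\n", st);
        return SS$_NORMAL;
    }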



    Performance Hints

    • For lock trees which come and go often:

      • A separate program could take out a Null lock on the root resource on each node at boot time, and just hold it “forever”

      • This avoids lock directory lookup operations for that tree



    Lock Request Latencies

    • Latency depends on several things:

      • Directory lookup needed or not

        • Local or remote directory node

      • $ENQ or $DEQ operation (acquiring or releasing a lock)

      • Local (same node) or remote lock master node

        • And if remote, the speed of interconnect used



    Lock Request Latency: Local vs. Remote

    • Local requests are fastest

    • Remote requests are significantly slower:

      • Code path ~20 times longer

      • Interconnect also contributes latency

      • Total latency up to 2 orders of magnitude (100x) higher than local requests



    Lock Request Latency Measurement

    • I used the LOCKTIME program pair written by Roy G. Davis, author of the book VAXcluster Principles, to measure latency of lock requests locally and across different interconnects

    • LOCKTIME algorithm:

      • Take out 5000 locks on remote node, making it lock master for each of 5000 lock trees (requires ENQLM > 5000)

      • Do 5000 $ENQs, lock converts, and $DEQs, and calculate average latencies for each

      • Lock conversion request latency is roughly equivalent to round-trip time between nodes over a given interconnect

    • See LOCKTIME.COM from OpenVMS Freeware V6, [KP_LOCKTOOLS] directory



    Lock Request Latency: Local

Client process on the same node as the Lock Master: 2-4 microseconds

Lock Request Latency: Remote

Client across Gigabit Ethernet: 200 microseconds

[Diagram: client process on a client node reaching the Lock Master node through a Gigabit Ethernet switch.]



    Lock Request Latencies



    Lock Mastership

    • Lock mastership node may change for various reasons:

      • Lock master node goes down -- new master must be elected

      • OpenVMS may move lock mastership to a “better” node for performance reasons

        • LOCKDIRWT imbalance found, or

        • Activity-based Dynamic Lock Remastering

        • Lock Master node no longer has interest



    Lock Mastership

    • Lock master selection criteria:

      • Interest

        • Only move resource tree to a node which is holding at least some locks on that resource tree

      • Lock Directory Weight (LOCKDIRWT)

        • Move lock tree to a node with interest and a higher LOCKDIRWT

      • Activity Level

        • Move lock tree to a node with interest and a higher average activity level



    How to measure locking activity

    • OpenVMS keeps counters of lock activity for each resource tree

      • but not for each of the sub-resources

    • So you can see the lock rate for an RMS indexed file, for example

      • but not for individual buckets or records within that file

    • SDA extension LCK in OpenVMS V7.2-2 and above can show lock rates, and even trace all lock requests if needed. This displays data on a per-node basis.

    • Cluster-wide summary is available using LOCK_ACTV*.COM from OpenVMS Freeware V6, [KP_LOCKTOOLS] directory



    Lock Remastering

    • Circumstances under which remastering occurs, and does not:

      • LOCKDIRWT values

        • OpenVMS tends to remaster to node with higher LOCKDIRWT values, never to node with lower LOCKDIRWT

      • Shifting initiated based on activity counters in root RSB

        • PE1 parameter being non-zero can prevent movement or place threshold on lock tree size

      • Shift if existing lock master loses interest



    Lock Remastering

    • OpenVMS rules for dynamic remastering decision based on activity levels:

      • assuming equal LOCKDIRWT values

    • 1) Must meet general threshold of at least 80 lock requests so far (LCK$GL_SYS_THRSH)

    • 2) New potential master node must have at least 10 more requests per second than current master (LCK$GL_ACT_THRSH)



    Lock Remastering

    • OpenVMS rules for dynamic remastering:

      • 3) Estimated cost to move (based on size of lock tree) must be less than estimated savings (based on lock rate)

        • except if new master meets criteria (2) for 3 consecutive 8-second intervals, cost is ignored

      • 4) No more than 5 remastering operations can be going on at once on a node (LCK$GL_RM_QUOTA)



    Lock Remastering

    • OpenVMS rules for dynamic remastering:

      • 5) If PE1 on the current master has a negative value, remastering trees off the node is disabled

      • 6) If PE1 has a positive, non-zero value on the current master, the tree must be smaller than PE1 in size or it will not be remastered
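Pulling the rules on the last few slides together, a conceptual C sketch of the decision; the thresholds follow the slides, but the data structure and the cost/savings estimates are simplified and illustrative:

    #include <stdbool.h>
    #include <stdio.h>

    /* Conceptual restatement of the dynamic remastering rules above. */
    struct master_view {
        long activity;           /* recent lock requests/sec from the current master */
        long total_requests;     /* requests counted so far on the tree              */
        long tree_size;          /* number of locks/resources in the tree            */
        long pe1;                /* PE1 SYSGEN parameter on the current master       */
        int  remasters_active;   /* remaster operations already in progress          */
        int  intervals_better;   /* consecutive 8-second intervals rule 2 has held   */
    };

    static bool should_remaster(const struct master_view *cur, long new_activity)
    {
        if (cur->total_requests < 80)             /* rule 1: LCK$GL_SYS_THRSH */
            return false;
        if (new_activity < cur->activity + 10)    /* rule 2: LCK$GL_ACT_THRSH */
            return false;
        if (cur->pe1 < 0)                         /* rule 5: remastering off  */
            return false;
        if (cur->pe1 > 0 && cur->tree_size >= cur->pe1)  /* rule 6: size cap  */
            return false;
        if (cur->remasters_active >= 5)           /* rule 4: LCK$GL_RM_QUOTA  */
            return false;

        /* Rule 3: move only if the estimated savings outweigh the move cost,
           unless rule 2 has held for 3 consecutive 8-second intervals. */
        long est_cost    = cur->tree_size;                 /* grows with tree size */
        long est_savings = new_activity - cur->activity;   /* grows with lock rate */
        if (cur->intervals_better >= 3)
            return true;
        return est_savings > est_cost;
    }

    int main(void)
    {
        struct master_view v = { .activity = 50, .total_requests = 5000,
                                 .tree_size = 20, .pe1 = 0,
                                 .remasters_active = 0, .intervals_better = 0 };
        printf("remaster? %s\n", should_remaster(&v, 75) ? "yes" : "no");
        return 0;
    }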



    Lock Remastering

    • Implications of dynamic remastering rules:

      • LOCKDIRWT must be equal for lock activity levels to control choice of lock master node

      • PE1 can be used to control movement of lock trees OFF of a node, but not ONTO a node

      • RSB stores lock activity counts, so even high activity counts can be lost if the last lock is DEQueued on a given node and thus the RSB gets deallocated



    Lock Remastering

    • Implications of dynamic remastering rules:

      • With two or more nodes of equal horsepower running the same application, lock mastership “thrashing” can occur:

        • 10 more lock requests per second is not much of a difference when you may be doing 100s or 1,000s of lock requests per second

        • Whichever new node becomes lock master may then see its own lock rate slow somewhat due to the remote lock request workload



    How to Detect Lock Mastership Thrashing

    • Detection of remastering activity

      • MONITOR RLOCK in 7.3 and above (not 7.2-2)

      • SDA> SHOW LOCK/SUMMARY in 7.2 and above

      • Change of mastership node for a given resource

      • Check message counters under SDA:

        • SDA> EXAMINE PMS$GL_RM_RBLD_SENT

        • SDA> EXAMINE PMS$GL_RM_RBLD_RCVD

          • Counts which increase suddenly by a large amount indicate remastering of large tree(s)

            • SENT: Off of this node

            • RCVD: Onto this node

          • See example procedures WATCH_RBLD.COM and RBLD.COM



    SDA> SHOW LOCK/SUMMARY

    $ analyze/system

    OpenVMS (TM) Alpha system analyzer

    SDA> show lock/summary

    ...

    Lock Manager Performance Counters:

    ----------------------------------

    ...

    Lock Remaster Counters:

    Tree moved to this node 0

    Tree moved to another node 0

    Tree moved due to higher Activity 0

    Tree moved due to higher LOCKDIRWT 0

    Tree moved due to Single Node Locks 0

    No Quota for Operation 0

    Proposed New Manager Declined 0

    Operations completed 0

    Remaster Messages Sent 0

    Remaster Messages Received 0

    Remaster Rebuild Messages Sent 0

    Remaster Rebuild Messages Received 0



    MONITOR RLOCK

    OpenVMS Monitor Utility

    DYNAMIC LOCK REMASTERING STATISTICS

    on node KEITH

    6-MAY-2002 18:42:35.07

    CUR AVE MIN MAX

    Lock Tree Outbound Rate 0.00 0.00 0.00 0.00

    (Higher Activity) 0.00 0.00 0.00 0.00

    (Higher LCKDIRWT) 0.00 0.00 0.00 0.00

    (Sole Interest) 0.00 0.00 0.00 0.00

    Remaster Msg Send Rate 0.00 0.00 0.00 0.00

    Lock Tree Inbound Rate 0.00 0.00 0.00 0.00

    Remaster Msg Receive Rate 0.00 0.00 0.00 0.00



    How to Prevent Lock Mastership Thrashing

    • Unbalanced node power

    • Unequal workloads

    • Unequal values of LOCKDIRWT

    • Non-zero values of PE1



    Distributed Lock Manager: Resiliency

    • The Lock Master node for a given resource tree knows about all locks on resources in that tree from all nodes

    • Each node also keeps track of its own locks

      Therefore:

    • If the Lock Master node for a given resource tree fails,

      • OpenVMS moves Lock Mastership duties to another node, and each node with locks tells the new Lock Master node about all their existing locks

    • If any other node fails, the Lock Master node frees (releases) any locks the now-departed node may have held



    Distributed Lock Manager: Scalability

    • For the first lock request on a given resource tree from a given node, OpenVMS hashes the resource name to pick a node (called the Directory node) to ask about locks

    • If the Directory Node does not happen to be the Lock Master node for that tree, it replies telling which node IS the Lock Master node for that tree

    • As long as a node holds ANY locks on a resource tree, it remembers which node is the Lock Master for the tree, so it can avoid doing directory lookups for that tree



    Distributed Lock Manager: Scalability

    • Thus, regardless of the number of nodes in a cluster, it takes AT MOST two off-node requests to resolve a lock request:

      • More often, only one request is needed (because we already know which node is the Lock Master), and

      • The majority of the time, NO off-node requests are needed, because OpenVMS tends to move Lock Master duties to the node with the most locking activity on the tree

    • Because Lock Directory and Lock Mastership duties are distributed among the nodes, this helps keep any single node from becoming a bottleneck in locking operations



    Deadlock searches

    • The OpenVMS Distributed Lock Manager automatically detects lock deadlock conditions, and generates an error to one of the programs causing the deadlock



    Deadlock searches

    • Deadlock searches can take lots of time and interrupt-state CPU time

    • MONITOR, T4, DECps, etc. can identify when deadlock searches are occurring

    • DEADLOCK_WAIT parameter controls how long we wait before starting a deadlock search



    Dedicated-CPU Lock Manager

    • With 7.2-2 and above, you can choose to dedicate a CPU to do lock management work. This may help reduce MP_Synch time.

    • Using this can be helpful when:

      • You have more than 5 CPUs in the system, and

      • You’re already wasting more than a CPU’s worth in MP_Synch time contending for the LCKMGR spinlock

        • See SDA> SPL extension and SYS$EXAMPLES:SPL.COM

    • LCKMGR_MODE parameter:

      • 0 = Disabled

      • >1 = Enable if at least this many CPUs are running

    • LCKMGR_CPUID parameter specifies which CPU to dedicate to LCKMGR_SERVER process



    OpenVMS Cluster Data Structures for the Distributed Lock Manager


OpenVMS Cluster Data Structures for the Distributed Lock Manager

    • Distributed Lock Manager

      • RSB (Resource Block)

      • LKB (Lock Block)

    • Data displayed by

      $ ANALYZE/SYSTEM

      SDA> SHOW RESOURCE

      SDA> SHOW LOCK


Distributed Lock Manager Data Structures: RSB (Resource Block)

    • Data of interest:

      • Resource name and length

      • UIC Group (0 means System)

      • Access Mode (Kernel, Executive, Supervisor, User)

      • Resource Master (aka Lock Master) node’s CSID (0=local mastership)

      • Count of locks on this resource

      • Parent resource block, and depth within tree (0 for a Root RSB)

      • Count and queue of sub-resources, if any

      • Hash value for lookups via resource hash table

      • Activity counters used for dynamic lock remastering


Distributed Lock Manager Data Structures: LKB (Lock Block)

    • Data of interest:

      • Pointer to associated resource block (RSB)

      • Lock mode (e.g. EX, PW, PR, CW, CR, NL) and access mode (Kernel, Executive, Supervisor, User)

      • Process ID

      • State (e.g. granted, waiting, in conversion queue)

      • Parent lock, if any

      • Number of sub-locks

      • Lock ID of this lock

      • CSID of lock master node, and remote Lock ID there



    Progression of Lock Manager Improvements in OpenVMS



    Progression of Lock Manager Improvements

    VAX/VMS version 4.0:

    • Introduced VAXclusters and Distributed Lock Manager



    Progression of Lock Manager Improvements

    VAX/VMS version 5.0:

    • Problem recognized:

      • In every cluster state transition, a Full Rebuild operation of the cluster-wide distributed lock database was required -- basically throwing away all Master copies of locks, selecting new resource masters, and then re-acquiring all locks, rebuilding the lock directory information in the process. This could take multiple minutes to complete.

    • Solution:

      • Partial and Merge rebuilds introduced



    Progression of Lock Manager Improvements

    VAX/VMS version 5.2:

    • Problem recognized:

      • Resource mastership remains stuck on one node.

        • Even if that node lost all interest in a resource tree, as long as there was even 1 lock anywhere in the cluster, that node remained the resource master.

        • As nodes in a cluster rebooted one by one, the remaining nodes tended to accumulate resource mastership duties over time, resulting in unbalanced loads

    • Solution:

      • Resource tree remastering is introduced. Resource trees can now be moved between nodes. LOCKDIRWT parameter allows bias or “weighting” toward or away from specific nodes.


    Progression of lock manager improvements145 l.jpg

    Progression of Lock Manager Improvements

    VAX/VMS version 5.4:

    • Problem recognized:

      • Remastering of resource trees after a node leaves the cluster adds to the impact of state transitions

    • Solution:

      • Resource trees are proactively remastered away from a node before it completes shutdown, minimizing the cleanup work afterward


    Progression of lock manager improvements146 l.jpg

    Progression of Lock Manager Improvements

    VAX/VMS version 5.4:

    • Problem recognized:

      • Lock rebuild operation is single-threaded, stretching out lock rebuild times and increasing the adverse impact of a cluster state transition

    • Solution:

      • Lock rebuild operation is made to be multi-threaded


    Progression of lock manager improvements147 l.jpg

    Progression of Lock Manager Improvements

    OpenVMS version 5.5:

    • Problem recognized:

      • Lock-request latency is lowest on the node which is the resource master, but VMS had no way of moving a resource tree to a busier node based on relative lock-request activity levels

    • Solution:

      • Activity-based resource tree remastering is introduced


    Progression of lock manager improvements148 l.jpg

    Progression of Lock Manager Improvements

    OpenVMS version 5.5-2:

    • Problem recognized:

      • Original algorithms were a bit too eager to remaster resource trees

    • Solution:

      • Scans for remastering activity are now done once every 8 seconds instead of once per second


    Progression of lock manager improvements149 l.jpg

    Progression of Lock Manager Improvements

    OpenVMS Alpha version 7.2:

    • Problem recognized:

      • System (S0/S1) address space is now often in short supply. Lock Manager data structures sometimes take up half of system address space on customers’ systems.

    • Solution:

      • Data structures for lock manager are moved into S2 (64-bit) address space.


    Progression of lock manager improvements150 l.jpg

    Progression of Lock Manager Improvements

    OpenVMS version 7.3 (and 7.2-2):

    • Problem recognized:

      • Remastering of large resource trees can take many seconds, even minutes. It can take tens of thousands of SCS sequenced messages to complete a remastering operation. Locking is blocked during a remastering operation, so that adversely impacts customer response times.

    • Solution:

      • Resource tree remastering is modified so it can use block data transfers instead of sequenced messages. This results in a reduction in resource tree remastering times by a factor of 10 to 20.


    Progression of lock manager improvements151 l.jpg

    Progression of Lock Manager Improvements

    OpenVMS version 8.3:

    • Problem recognized:

      • In clusters with multiple nodes of the same horsepower running the same applications, lock tree mastership can “thrash” around among the nodes.

        • One major cause is a hard-coded threshold of only 10 lock requests per second to trigger a remastering operation.

    • Solution:

      • Algorithms for activity-based remastering are improved to add hysteresis and allow control with a new SYSGEN parameter LOCKRMWT.


    Progression of lock manager improvements152 l.jpg

    Progression of Lock Manager Improvements

    OpenVMS version 8.3:

    • Problem recognized:

      • LOCKDIRWT parameter got overloaded as of version 5.2 with two meanings:

        • Lock Directory Weighting – relative amount of lock directory work this node should perform

        • Lock Remastering Weighting – relative priority of this node in mastering shared resource trees

    • Solution:

      • A new LOCKRMWT parameter is introduced to cover the remastering weighting; LOCKDIRWT returns to its original meaning, the lock directory weighting.


    Progression of lock manager improvements153 l.jpg

    Progression of Lock Manager Improvements

    OpenVMS version 8.3:

    • Problem recognized:

      • Some applications (e.g. Rdb) encounter deadlocks on a regular basis as a part of normal operation. The minimum value of 1 second for the DEADLOCK_WAIT parameter means it can take too long to detect these deadlocks.

    • Solution:

      • The deadlock waiting period can be set on a per-process basis to as little as one-hundredth of a second if needed.


    Slide154 l.jpg

    Cluster State Transitions


    Cluster state transition topics l.jpg

    Cluster State Transition Topics

    • Purpose of cluster state transitions

    • The Distributed Lock Manager’s role in State Transitions

    • Things which trigger state transitions, and how to avoid them

    • Failures and failure detection mechanisms

    • Reconnection Interval


    Cluster state transition topics156 l.jpg

    Cluster State Transition Topics

    • Effects of a state transition

    • Sequence of events in a state transition

    • State-transition-related issues

    • Minimizing impact of state transitions


    Distributed lock manager and cluster state transitions l.jpg

    Distributed Lock Manager and Cluster State Transitions

    • The Distributed Lock Manager keeps information about locks and resources distributed around the cluster

    • Lock and Resource data may be affected by a node joining or leaving the cluster:

      • A node leaving may have had locks which need to be freed

      • Lock Master duties may need to be reassigned

      • Lock Directory Lookup distribution may also be affected when a node joins or leaves


    Implications of state transitions on lock directory vector l.jpg

    Implications of State Transitions on Lock Directory Vector


    Lock rebuild types l.jpg

    Lock Rebuild Types

    • When a node joins or leaves the cluster, the distributed resource/lock database needs to be rebuilt.

    • Lock requests may need to be temporarily stalled while the data is redistributed

    • There are 4 types of lock rebuild operations:

      • Merge

      • Partial

      • Directory

      • Full


    Lock rebuild types160 l.jpg

    Lock Rebuild Types

    • Merge:

      • Enter already-held lock tree into directory on directory node

        • Lock requests not stalled

    • Partial:

      • Release locks held by departed node, and remaster any resource trees it was resource master for


    Lock rebuild types161 l.jpg

    Lock Rebuild Types

    • Directory:

      • Do what a Partial rebuild does (clean up after departed node), and also rebuild the directory information

    • Full:

      • Throw away all master copies of locks, select new resource masters, and then re-acquire locks, rebuilding directory information in the process


    Lock rebuild types162 l.jpg

    Lock Rebuild Types


    State transition triggers l.jpg

    State Transition Triggers

    • A state transition is needed when:

      • A new member is added to the cluster

      • An existing member is removed from the cluster

      • An adjustment related to votes is needed:

        • shutdown with REMOVE_NODE option

        • quorum disk validation or invalidation

        • quorum is adjusted manually via:

          • IPC> Q routine

          • Availability Manager / DECamds Quorum Fix via RMDRIVER

          • $SET CLUSTER/EXPECTED_VOTES command in DCL
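
    • For example, after a voting member is permanently removed, the quorum basis can be lowered cluster-wide with a DCL command like this (the value 3 is illustrative):

        $ SET CLUSTER/EXPECTED_VOTES=3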


    New member detection on ci or dssi l.jpg

    New-member detection on CI or DSSI

    [Timing diagram: with PANUMPOLL = 2, PAMAXPORT = 3, and PAPOLLINTERVAL = 5, the port polls two node IDs per polling interval. At time t=0 it sends Request ID packets to node IDs 0 and 1 (each answered by an ID response); at t=5 it polls node IDs 2 and 3; at t=10 it wraps around to node ID 0 again.]


    Virtual circuit formation on ci or dssi l.jpg

    Virtual Circuit Formation on CI or DSSI

    [Message-sequence diagram: the existing cluster member and the new node exchange Request ID and ID response packets in both directions, then complete the handshake with Start, Start Acknowledge, and Acknowledge messages to open the virtual circuit.]


    New member detection on ethernet or fddi l.jpg

    New-member detection on Ethernet or FDDI

    [Message-sequence diagram: the local node learns of the remote node from a multicast Hello or Solicit-Service packet, then performs the Channel-Control handshake (Channel-Control Start, Verify, Verify Acknowledge) followed by the SCS handshake (Start, Start Acknowledge, Acknowledge).]


    Failures l.jpg

    Failures

    • Types of failures:

      • Node failure

      • Adapter failure

      • Interconnect failure

    • Failure examples:

      • Power supply failure in network equipment

      • Operating system crash

      • Bridge reboot for firmware update, and spanning-tree reconfiguration which results


    Failure and repair recovery within reconnection interval l.jpg

    Failure and Repair/Recovery within Reconnection Interval

    [Timeline: a failure occurs and is detected (virtual circuit broken), which starts the RECNXINTERVAL timer. The problem is fixed and the repaired state is detected before RECNXINTERVAL expires, so the virtual circuit is simply re-opened and no node is removed.]


    Hard failure l.jpg

    Hard Failure

    [Timeline: a failure occurs and is detected (virtual circuit broken). The problem is not fixed, so when RECNXINTERVAL expires a state transition removes the node from the cluster.]


    Late recovery l.jpg

    Late Recovery

    [Timeline: a failure occurs and is detected (virtual circuit broken). RECNXINTERVAL expires and a state transition removes the node from the cluster. The problem is then fixed and the repair is detected, but when the removed node learns it has been removed from the cluster, it does a CLUEXIT bugcheck.]


    Failure detection mechanisms l.jpg

    Failure Detection Mechanisms

    • Mechanisms to detect a node or communications failure

      • Last-Gasp Datagram

      • Periodic checking

        • Multicast Hello packets on LANs

        • Polling on CI and DSSI

        • TIMVCFAIL check


    Timvcfail mechanism l.jpg

    TIMVCFAIL Mechanism

    [Timing diagram: the local node exchanges a request and response with the remote node at time t=0 and again at t=2/3 of TIMVCFAIL; checks occur at intervals of one-third of TIMVCFAIL, so each node regularly confirms that its partner is still responding.]


    Timvcfail mechanism173 l.jpg

    TIMVCFAIL Mechanism

    [Timing diagram: the request at t=0 is answered (1), but the remote node fails some time afterward. The request sent at t=2/3 of TIMVCFAIL receives no response (2), so at t=TIMVCFAIL the virtual circuit is declared broken.]


    Sequence of events during a state transition l.jpg

    Sequence of Events During a State Transition

    • Determine new cluster configuration

      • If quorum is lost:

        • QUORUM capability bit removed from all CPUs

          • no process can be scheduled to run

        • Disks all put into mount verification

      • If quorum is not lost, continue…

    • Rebuild lock database

      • Stall lock requests

      • I/O synchronization

      • Do rebuild work

      • Resume lock handling


    Measuring state transition effects l.jpg

    Measuring State Transition Effects

    • Determine the type of the last lock rebuild:

      $ ANALYZE/SYSTEM

      SDA> READ SYS$LOADABLE_IMAGES:SCSDEF

      SDA> EVALUATE @(@CLU$GL_CLUB + CLUB$B_NEWRBLD_REQ) & FF

      Hex = 00000002 Decimal = 2 ACP$V_SWAPPRV

    • Rebuild type values:

      • Merge (locking not disabled)

      • Partial

      • Directory

      • Full


    Measuring state transition effects176 l.jpg

    Measuring State Transition Effects

    • Determine the duration of the last lock request stall period:

      SDA> DEFINE TOFF = @(@CLU$GL_CLUB+CLUB$L_TOFF)

      SDA> DEFINE TON = @(@CLU$GL_CLUB+CLUB$L_TON)

      SDA> EVALUATE TON-TOFF

      Hex = 00000000.00000044 Decimal = 68 ADP$L_PROBE_CMD

      This value is in units of 10-millisecond clock ticks (hundredths of a second), so in this example the lock-request stall period lasted 0.68 seconds. (The symbol name that SDA prints beside the value is simply a symbol that happens to have the same value and can be ignored.)


    Minimizing impact of state transitions l.jpg

    Minimizing Impact of State Transitions

    • Configuration issues:

      • Few (e.g. exactly 3) nodes

      • Quorum node; no quorum disk

      • Isolate cluster interconnects from other traffic

    • Operational issues:

      • Avoid disk rebuilds on reboot (see the sketch after this list)

      • Let Last-Gasp work
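
    • A minimal sketch of avoiding disk rebuilds at reboot (device name and volume label are illustrative): set the ACP_REBLDSYSD parameter to 0 so the system disk is not rebuilt at boot, mount other volumes with /NOREBUILD in the startup procedures, and rebuild them later during off-peak hours:

        $ MOUNT/SYSTEM/NOREBUILD $1$DGA100: DATA1    ! In startup: defer the rebuild

        $ SET VOLUME/REBUILD $1$DGA100:              ! Later, at a quiet time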


    System parameters which affect state transition impact l.jpg

    System Parameters which affect State Transition Impact

    • Failure detection time:

      • TIMVCFAIL can be set as low as 100 (1 second) if needed

        • If cluster interconnect is the LAN, the minimum value for TIMVCFAIL is a value one second longer than the PEDRIVER Listen Timeout. The PEDRIVER Listen Timeout value may be viewed using SDA:

          $ ANALYZE/SYSTEM

          SDA> SHOW PORTS                 ! This is to define symbols such as PE_PDT

          SDA> SHOW PORT/ADDR=PE_PDT      ! Look for output like:

          ... "Listen Timeout       8     Hello Interval     150"

    • Reconnection interval after failure detection:

      • RECNXINTERVAL can be set as low as 1 second if needed
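
    • An illustrative sketch of adjusting both parameters on a running system (the values are examples only; on recent versions both parameters are dynamic, and any change should also be added to MODPARAMS.DAT so AUTOGEN preserves it):

        $ RUN SYS$SYSTEM:SYSGEN

        SYSGEN> USE ACTIVE

        SYSGEN> SET TIMVCFAIL 1600      ! 16 seconds, in hundredths of a second

        SYSGEN> SET RECNXINTERVAL 20    ! 20 seconds

        SYSGEN> WRITE ACTIVE

        SYSGEN> EXIT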


    Cluster server process csp l.jpg

    Cluster Server process (CSP)

    • Runs on each node

    • Assists with cluster-wide operations that require process context on a remote node, such as:

      • Mounting and dismounting volumes:

        • $MOUNT/CLUSTER and $DISMOUNT/CLUSTER

      • $BRKTHRU system service and $REPLY command

      • $SET TIME/CLUSTER

      • Distributed OPCOM communications

      • Interface between SYSMAN and SMISERVER on remote nodes

      • Cluster-Wide Process Services (CWPS)

        • Allow startup, monitoring, and control of processes on remote nodes
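
    • For example, a cluster-wide SYSMAN command exercises the SYSMAN-to-SMISERVER interface listed above:

        $ RUN SYS$SYSTEM:SYSMAN

        SYSMAN> SET ENVIRONMENT/CLUSTER

        SYSMAN> DO SHOW TIME            ! Runs on every cluster member via SMISERVER

        SYSMAN> EXIT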


    Mscp tmscp servers l.jpg

    MSCP/TMSCP Servers

    • Implement Mass Storage Control Protocol

    • Provide access to disks [MSCP] and tape drives [TMSCP] for systems which do not have a direct connection to the device, through a node which does have direct access and has [T]MSCP Server software loaded

    • OpenVMS includes an MSCP server for disks and a TMSCP server for tapes


    Mscp tmscp servers181 l.jpg

    MSCP/TMSCP Servers

    • MSCP disk and tape controllers (e.g. HSJ80, HSD30) also include a [T]MSCP Server, and talk the SCS protocol, but do not have a Connection Manager, so are not cluster members


    Openvms support for redundant hardware l.jpg

    OpenVMS Support for Redundant Hardware

    • OpenVMS is very good about supporting multiple (redundant) pieces of hardware

      • Nodes

      • LAN adapters

      • Storage adapters

      • Disks

        • Volume Shadowing provides RAID-1 (mirroring)

    • OpenVMS is very good at automatic:

      • Failure detection

      • Fail-over

      • Fail-back

      • Load balancing or load distribution


    Direct vs mscp served paths l.jpg

    Direct vs. MSCP-Served Paths

    [Diagram: two nodes connected to shared SCSI storage through a SCSI hub.]


    Direct vs mscp served paths184 l.jpg

    Direct vs. MSCP-Served Paths

    [Diagram: the same two-node shared-SCSI configuration, highlighting a different access path.]


    Direct vs mscp served paths185 l.jpg

    Direct vs. MSCP-Served Paths

    [Diagram: two nodes, each connected to redundant Fibre Channel switches, with a shadowset of disks behind the switches.]


    Direct vs mscp served paths186 l.jpg

    Direct vs. MSCP-Served Paths

    [Diagram: the same dual-switch Fibre Channel configuration, highlighting a different access path to the shadowset.]


    Direct vs mscp served paths187 l.jpg

    Direct vs. MSCP-Served Paths

    [Diagram: the same dual-switch Fibre Channel configuration, highlighting yet another access path to the shadowset.]


    Slide188 l.jpg

    OpenVMS Version 8.3 Changes


    Openvms 8 3 changes improved remastering algorithm l.jpg

    OpenVMS 8.3 Changes: Improved Remastering Algorithm

    • New SYSGEN parameter LOCKRMWT (Lock Remaster Weight)

      • Value can be from 0 to 10

        • 0 means node never wants to be resource master

        • 10 means node always wants to be resource master

        • Values between 1 and 9 indicate a relative bias toward being resource master

      • Default value is 5

      • Difference between values on different nodes influences remastering activity (more on this shortly)

      • Parameter is dynamic
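
    • Since LOCKRMWT is dynamic, it can be adjusted on a running node; a sketch (the value 7 is illustrative):

        $ RUN SYS$SYSTEM:SYSGEN

        SYSGEN> USE ACTIVE

        SYSGEN> SET LOCKRMWT 7

        SYSGEN> WRITE ACTIVE

        SYSGEN> EXIT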


    Openvms 8 3 changes improved remastering algorithm190 l.jpg

    OpenVMS 8.3 Changes: Improved Remastering Algorithm

    • New SYSGEN parameter LOCKRMWT (Lock Remaster Weight)

      • If a node with LOCKRMWT of 0-9 receives a lock request from a node with a LOCKRMWT of 10, the tree will be marked for remastering to the node with a value of 10

      • If a node with LOCKRMWT of 0 receives a lock request from a node with any higher value, the tree will be marked for remastering

      • If a node with LOCKRMWT of 1-9 receives a lock request from another node with LOCKRMWT of 1-9, it computes the difference

        • Difference will be in the range of +8 to -8

        • Calculation on this difference results in a bias based on relative percentages of lock activity (not fixed rates)


    Openvms 8 3 changes improved remastering algorithm191 l.jpg

    OpenVMS 8.3 Changes: Improved Remastering Algorithm


    Openvms 8 3 changes improved remastering algorithm192 l.jpg

    OpenVMS 8.3 Changes: Improved Remastering Algorithm

    • Examples:

      • If a node with LOCKRMWT of 9 receives a lock request from a node with a value of 1 (Local-Remote difference of +8), then the node with a value of 1 must be generating at least 3 times as many (200% more) lock requests for the resource tree to be marked for remastering

      • If a node with LOCKRMWT of 1 receives a lock request from another node with LOCKRMWT of 9 (Local-Remote difference of -8), then if the node with LOCKRMWT of 9 is generating at least half (50%) as many lock requests as the node with LOCKRMWT of 1 the resource tree will be marked for remastering

      • Nodes all have a default LOCKRMWT value of 5:

        • A node must be generating at least 13% more lock requests than the current resource master before the lock tree will be marked for remastering

        • This introduces some “hysteresis”, to prevent thrashing

        • This algorithm automatically scales with increasing lock rates over time


    Openvms 8 3 changes improved remastering algorithm193 l.jpg

    OpenVMS 8.3 Changes: Improved Remastering Algorithm

    Mixed Version Clusters:

    • 8.3 nodes must interact properly with pre-8.3 nodes

    • In dealing with a pre-8.3 node, an 8.3 node must basically follow the old rules regarding LOCKDIRWT:

      • An 8.3 node will remaster a lock tree to a pre-8.3 node if the pre-8.3 node has a higher LOCKDIRWT

      • If LOCKDIRWT values are equal, then when an 8.3 node receives a lock request from a pre-8.3 node, the new computed activity threshold is used on the 8.3 node for the remastering decision

    • There are some cases that require special rules:


    Endless remastering l.jpg

    Endless Remastering?

    [Diagram: a resource with interested locks on three nodes:
      Node A – V8.3, LOCKDIRWT 2, LOCKRMWT 0
      Node B – V8.3, LOCKDIRWT 0, LOCKRMWT 10
      Node C – V7.3-2, LOCKDIRWT 1
    The mix of old (LOCKDIRWT) and new (LOCKRMWT) weighting rules in this configuration could cause mastership of the resource tree to circulate endlessly among the nodes.]

    Slide by Greg Jordan


    Endless remastering195 l.jpg

    Endless Remastering?

    [Same diagram, with one remastering step crossed out by the special rule: never remaster to a pre-8.3 node if an interested node has a higher LOCKDIRWT. In the configuration above (Node A – V8.3, LOCKDIRWT 2, LOCKRMWT 0; Node B – V8.3, LOCKDIRWT 0, LOCKRMWT 10; Node C – V7.3-2, LOCKDIRWT 1), this breaks the cycle.]

    Slide by Greg Jordan


    Openvms 8 3 changes option to reduce deadlock wait l.jpg

    OpenVMS 8.3 Changes: Option to Reduce Deadlock Wait

    • SYSGEN parameter DEADLOCK_WAIT is in units of seconds; minimum 1 second

    • Some applications incur deadlocks as a matter of course (e.g. Rdb database)

    • Code has been implemented in 8.3 to allow a process to request a sub-second deadlock wait time:

      • Call system service $SET_PROCESS_PROPERTIES with an item code of PPROP$C_DEADLOCK_WAIT and an additional parameter with a delta time ranging from 0 seconds to 1 second


    Openvms 8 3 changes pedriver window sizes l.jpg

    OpenVMS 8.3 Changes: PEDRIVER Window Sizes

    • Prior to 8.3, PEDRIVER had a limit of 31 unacknowledged packets (“window size”)

    • In multi-site clusters with long inter-site distances (e.g. hundreds of miles) or with faster LAN adapters (e.g. 10-Gigabit), this could throttle activities like lock requests

    • Code has been implemented in 8.3 to allow the PEDRIVER transmit window size to be larger:

      • PEDRIVER automatically adjusts window size based on aggregate link speed, but

      • Can manually calculate and set window sizes using SCACP:

        • SCACP> CALCULATE WINDOW_SIZE /SPEED=n /DISTANCE=KILOMETERS=d (or /DISTANCE=MILES=d)

        • SCACP> SET VC /RECEIVE_WINDOW=x and

        • SCACP> SET VC /TRANSMIT_WINDOW=x

          • (When increasing, enlarge the receive side first; when decreasing, cut back the transmit side first)
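
    • A hedged usage sketch (node name, speed, distance, and window values are illustrative; check SCACP HELP for the exact syntax on your version, and review the calculated value before applying it):

        $ RUN SYS$SYSTEM:SCACP

        SCACP> CALCULATE WINDOW_SIZE/SPEED=1000/DISTANCE=MILES=500

        SCACP> SET VC NODEB /RECEIVE_WINDOW=160     ! Enlarge the receive side first

        SCACP> SET VC NODEB /TRANSMIT_WINDOW=160

        SCACP> EXIT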


    Openvms 8 3 changes 10 gigabit nic for integrity servers l.jpg

    OpenVMS 8.3 Changes: 10-Gigabit NIC for Integrity Servers

    • 8.3 has latent support for 10-Gigabit Ethernet on Integrity Servers

      • Support for the AB287A 10 Gigabit PCI-X LAN adapter

      • The AB287A connects to a LAN operating at 10 gigabits per second in full-duplex mode

      • The device can sustain throughput close to the bandwidth offered by the PCI or PCI-X slot it is plugged into (roughly 7 gigabits per second for PCI-X at 133 MHz)

    • This is expected to reduce lock-request latency to some degree (perhaps by a third) compared with 1-Gigabit NICs, similar to the improvement that 1-Gigabit NICs offered over 100-megabit (Fast Ethernet) NICs


    Openvms cluster resources l.jpg

    OpenVMS Cluster Resources

    • OpenVMS Documentation on the Web:

      • http://h71000.www7.hp.com/doc

        • OpenVMS Cluster Systems

        • Guidelines for OpenVMS Cluster Configurations

        • HP Volume Shadowing for OpenVMS

    • Book “VAXcluster Principles” by Roy G. Davis, Digital Press, 1993, ISBN 1-55558-112-9


    Openvms cluster resources200 l.jpg

    OpenVMS Cluster Resources

    • OpenVMS Hobbyist Program (free licenses for OpenVMS, OpenVMS Cluster Software, compilers, etc.) at http://openvmshobbyist.org/

    • VAX emulators for Windows and Linux

      • simh by Bob Supnik: http://simh.trailing-edge.com/

      • TS-10 by Timothy Stark: http://sourceforge.net/projects/ts10/

      • Charon-VAX: http://www.charon-vax.com/


    Openvms cluster resources201 l.jpg

    OpenVMS Cluster Resources

    • HP IT Resource Center Forums on OpenVMS:

      • http://forums1.itrc.hp.com/service/forums/familyhome.do?familyId=288

    • Encompasserve (aka DECUServe)

      • OpenVMS system with free accounts and a friendly community of OpenVMS users

        • Telnet to encompasserve.org and log in under username REGISTRATION

    • DeathRow OpenVMS Cluster

      • http://deathrow.vistech.net/

    • Usenet newsgroups:

      • comp.os.vms, vmsnet.*, comp.sys.dec


    Speaker contact info l.jpg

    Speaker Contact Info:

    • Keith Parris

    • E-mail: [email protected]

    • or [email protected]

    • Web: http://www2.openvms.org/kparris/

