Presentation Transcript
using openvms clusters for disaster tolerance

Using OpenVMS Clusters for Disaster Tolerance

Keith Parris

Systems/Software Engineer

June 16, 2008


“Among operating systems, OpenVMS and Unix seem to be favored more than others. Alpha/OpenVMS, for example, has built-in clustering technology that many companies use to mirror data between sites. Many financial institutions, including Commerzbank, the International Securities Exchange and Deutsche Borse AG, rely on VMS-based mirroring to protect their heavy-duty transaction-processing systems.”

“Disaster Recovery: Are you ready for trouble?” by Drew Robb

Computerworld, April 25, 2005

http://www.computerworld.com/securitytopics/security/recovery/story/0,10801,101249,00.html


“Deutsche Borse, a German exchange for stocks and derivatives, has deployed an OpenVMS cluster over two sites situated 5 kilometers apart. … ‘DR is not about cold or warm backups, it's about having your data active and online no matter what,’ says Michael Gruth, head of systems and network support at Deutsche Borse. ‘That requires cluster technology which is online at both sites.’”

“Disaster Recovery: Are you ready for trouble?” by Drew Robb

Computerworld, April 25, 2005

http://www.computerworld.com/securitytopics/security/recovery/story/0,10801,101249,00.html

high availability ha
High Availability (HA)
  • Ability for application processing to continue with high probability in the face of common (mostly hardware) failures
  • Typical technique: redundancy

    • Redundancy schemes: 2N (mirroring, duplexing), 3N (triplexing), N+1 (sparing)

high availability ha1
High Availability (HA)
  • Typical technologies:
    • Redundant power supplies and fans
    • RAID / mirroring for disks
    • Clusters of servers
    • Multiple NICs, redundant routers
    • Facilities: Dual power feeds, n+1 Air Conditioning units, UPS, generator
fault tolerance ft
Fault Tolerance (FT)
  • Ability for a computer system to continue operating despite hardware and/or software failures
  • Typically requires:
    • Special hardware with full redundancy, error-checking, and hot-swap support
    • Special software
  • Provides the highest availability possible within a single datacenter

  • Examples: VAXft, NonStop

disaster recovery dr
Disaster Recovery (DR)
  • Disaster Recovery is the ability to resume operations after a disaster
    • Disaster could be as bad as destruction of the entire datacenter site and everything in it
    • But many events short of total destruction can also disrupt service at a site:
      • Power loss in the area for an extended period of time
      • Bomb threat (or natural gas leak) prompting evacuation of everyone from the site
      • Water leak
      • Air conditioning failure
disaster recovery dr1
Disaster Recovery (DR)
  • Basic principle behind Disaster Recovery:
    • To be able to resume operations after a disaster implies off-site data storage of some sort
disaster recovery dr2
Disaster Recovery (DR)
  • Typically,
    • There is some delay before operations can continue (many hours, possibly days), and
    • Some transaction data may have been lost from IT systems and must be re-entered
  • Success hinges on ability to restore, replace, or re-create:
      • Data (and external data feeds)
      • Facilities
      • Systems
      • Networks
      • User access
dr methods
DR Methods
  • Tape Backup
  • Expedited hardware replacement
  • Vendor Recovery Site
  • Data Vaulting
  • Hot Site
dr methods tape backup
DR Methods: Tape Backup
  • Data is copied to tape, with off-site storage at a remote site
  • Very-common method. Inexpensive.
  • Data lost in a disaster is:
    • All the changes since the last tape backup that is safely located off-site
  • There may be significant delay before data can actually be used
dr methods expedited hardware replacement
DR Methods: Expedited Hardware Replacement
  • Vendor agrees that in the event of a disaster, a complete set of replacement hardware will be shipped to the customer within a specified (short) period of time
    • HP has Quick Ship program
  • Typically there would be at least several days of delay before data can be used
dr methods vendor recovery site
DR Methods: Vendor Recovery Site
  • Vendor provides datacenter space, compatible hardware, networking, and sometimes user work areas as well
    • When a disaster is declared, systems are configured and data is restored to them
  • Typically there are hours to days of delay before data can actually be used
dr methods data vaulting
DR Methods: Data Vaulting
  • Copy of data is saved at a remote site
    • Periodically or continuously, via network
    • Remote site may be own site or at a vendor location
  • Minimal or no data may be lost in a disaster
  • There is typically some delay before data can actually be used
dr methods hot site
DR Methods: Hot Site
  • Company itself (or a vendor) provides pre-configured compatible hardware, networking, and datacenter space
  • Systems are pre-configured, ready to go
    • Data may already be resident at the Hot Site thanks to Data Vaulting
  • Typically there are minutes to hours of delay before data can be used
disaster tolerance vs disaster recovery
Disaster Tolerance vs. Disaster Recovery
  • Disaster Recovery is the ability to resume operations after a disaster.
  • Disaster Tolerance is the ability to continue operations uninterrupted despite a disaster.

“Disaster Tolerance… Expensive? Yes. But if an hour of downtime costs you millions of dollars, or could result in loss of life, the price is worth paying. That's why big companies are willing to spend big to achieve disaster tolerance.”

Enterprise IT Planet, August 18, 2004

http://www.enterpriseitplanet.com/storage/features/article.php/3396941

disaster tolerant hp platforms
Disaster-Tolerant HP Platforms
  • OpenVMS
  • HP-UX and Linux
  • Tru64
  • NonStop
  • Microsoft
disaster tolerance ideals
Disaster Tolerance Ideals
  • Ideally, Disaster Tolerance allows one to continue operations uninterrupted despite a disaster:
    • Without any appreciable delays
    • Without any lost transaction data
disaster tolerance vs disaster recovery1
Disaster Tolerance vs. Disaster Recovery
  • Businesses vary in their requirements with respect to:
    • Acceptable recovery time
    • Allowable data loss
  • So some businesses need only Disaster Recovery, and some need Disaster Tolerance
    • And many need DR for some (less-critical) functions and DT for other (more-critical) functions
disaster tolerance vs disaster recovery2
Disaster Tolerance vs. Disaster Recovery
  • Basic Principle:
    • Determine requirements based on business needs first,
      • Then find acceptable technologies to meet the needs of each area of the business
disaster tolerance and business needs
Disaster Tolerance and Business Needs
  • Even within the realm of businesses needing Disaster Tolerance, business requirements vary with respect to:
    • Acceptable recovery time
    • Allowable data loss
  • Technologies also vary in their ability to achieve the Disaster Tolerance ideals of no data loss and zero recovery time
  • So we need ways of measuring needs and comparing different solutions
quantifying disaster tolerance and disaster recovery requirements
Quantifying Disaster Tolerance and Disaster Recovery Requirements
  • Commonly-used metrics:
    • Recovery Point Objective (RPO):
      • Amount of data loss that is acceptable, if any
    • Recovery Time Objective (RTO):
      • Amount of downtime that is acceptable, if any
recovery point objective rpo
Recovery Point Objective (RPO)
  • Recovery Point Objective is measured in terms of time
  • RPO indicates the point in time to which one is able to recover the data after a failure, relative to the time of the failure itself
  • RPO effectively quantifies the amount of data loss permissible before the business is adversely affected

(Timeline: the Recovery Point Objective is the interval from the last backup to the disaster)

recovery time objective rto
Recovery Time Objective (RTO)
  • Recovery Time Objective is also measured in terms of time
  • Measures downtime:
    • from time of disaster until business can continue
  • Downtime costs vary with the nature of the business, and with outage length

(Timeline: the Recovery Time Objective is the interval from the disaster until business resumes)

disaster tolerance vs disaster recovery based on rpo and rto metrics
Disaster Tolerance vs. Disaster Recovery, based on RPO and RTO Metrics

(Chart: Recovery Point Objective [increasing data loss] on one axis, Recovery Time Objective [increasing downtime] on the other; Disaster Tolerance sits at the zero-RPO, zero-RTO corner, while Disaster Recovery occupies the region of larger RPO and RTO)

examples of business requirements and rpo rto values
Examples of Business Requirements and RPO / RTO Values
  • Greeting card manufacturer
    • RPO = zero; RTO = 3 days
  • Online stock brokerage
    • RPO = zero; RTO = seconds
  • ATM machine
    • RPO = hours; RTO = minutes
  • Semiconductor fabrication plant
    • RPO = zero; RTO = minutes
      • but data protection by geographical separation is not needed
recovery point objective rpo1
Recovery Point Objective (RPO)
  • RPO examples, and technologies to meet them:
    • RPO of 24 hours:
      • Backups at midnight every night to off-site tape drive, and recovery is to restore data from set of last backup tapes
    • RPO of 1 hour:
      • Ship database logs hourly to remote site; recover database to point of last log shipment
    • RPO of a few minutes:
      • Mirror data asynchronously to remote site
    • RPO of zero:
      • Shadow data to remote site (strictly synchronous replication)
recovery time objective rto1
Recovery Time Objective (RTO)
  • RTO examples, and technologies to meet them:
    • RTO of 72 hours:
      • Restore tapes to configure-to-order systems at vendor DR site
    • RTO of 12 hours:
      • Restore tapes to system at hot site with systems already in place
    • RTO of 4 hours:
      • Data vaulting to hot site with systems already in place
    • RTO of 1 hour:
      • Disaster-tolerant cluster with controller-based cross-site disk mirroring
recovery time objective rto2
Recovery Time Objective (RTO)
  • RTO examples, and technologies to meet them:
    • RTO of 10 seconds:
      • Disaster-tolerant cluster with:
        • Redundant inter-site links, carefully configured
          • To avoid bridge Spanning Tree Reconfiguration delay
        • Host-based Volume Shadowing for data replication
          • To avoid time-consuming manual failover process with controller-based mirroring
        • Tie-breaking vote at a 3rd site
          • To avoid loss of quorum after site failure
        • Distributed Lock Manager and Cluster-Wide File System (or the equivalent in database software), allowing applications to run at both sites simultaneously
          • To avoid having to start applications at failover site after the failure
clustering
Clustering
  • Allows a set of individual computer systems to be used together in some coordinated fashion
cluster types
Cluster types
  • Different types of clusters meet different needs:
    • Scalability clusters allow multiple nodes to work on different portions of a sub-dividable problem
      • Workstation farms, compute clusters, Beowulf clusters
    • Availability clusters allow one node to take over application processing if another node fails
  • For Disaster Tolerance, we’re talking primarily about Availability clusters
    • (geographically dispersed)
high availability clusters
High Availability Clusters
  • Transparency of failover and degrees of resource sharing differ:
    • “Shared-Nothing” clusters
    • “Shared-Storage” clusters
    • “Shared-Everything” clusters
shared nothing clusters
“Shared-Nothing” Clusters
  • Data may be partitioned among nodes
  • Only one node is allowed to access a given disk or to run a specific instance of a given application at a time, so:
    • No simultaneous access (sharing) of disks or other resources is allowed (and this must be enforced in some way), and
    • No method of coordination of simultaneous access (such as a Distributed Lock Manager) exists, since simultaneous access is never allowed
shared storage clusters
“Shared-Storage” Clusters
  • In simple “Fail-over” clusters, one node runs an application and updates the data; another node stands idly by until needed, then takes over completely
  • In more-sophisticated clusters, multiple nodes may access data, but typically one node at a time “serves” a file system to the rest of the nodes, and performs all coordination for that file system
shared everything clusters
“Shared-Everything” Clusters
  • “Shared-Everything” clusters allow any application to run on any node or nodes
    • Disks are accessible to all nodes under a Cluster File System
    • File sharing and data updates are coordinated by a Lock Manager
cluster file system
Cluster File System
  • Allows multiple nodes in a cluster to access data in a shared file system simultaneously
  • View of file system is the same from any node in the cluster
distributed lock manager
Distributed Lock Manager
  • Allows systems in a cluster to coordinate their access to shared resources, such as:
    • Mass-storage devices (disks, tape drives)
    • File systems
    • Files, and specific data within files
    • Database tables
multi site clusters
Multi-Site Clusters
  • Consist of multiple sites with one or more systems, in different locations
  • Systems at each site are all part of the same cluster
  • Sites are typically connected by bridges or bridge-routers
    • Pure routers don’t pass the SCS cluster protocol used within OpenVMS clusters
      • Work is underway to support IP as a cluster interconnect option, as noted in the OpenVMS Roadmap
trends
Trends
  • Increase in disasters
  • Longer inter-site distances for better protection
  • Business pressures for shorter distances for performance
  • Increasing pressure not to bridge LANs between sites

Trends

  • Increase in Disasters

“Natural disasters have quadrupled over the last two decades, from an average of 120 a year in the early 1980s to as many as 500 today.”

Continuity Insights Magazine

Nov./Dec. 2007 issue, page 10


“There has been a six-fold increase in floods since 1980. The number of floods and wind-storms have increased from 60 in 1980 to 240 last year.”

Continuity Insights Magazine

Nov./Dec. 2007 issue, page 10


“Meanwhile the number of geothermal events, such as earthquakes and volcanic eruptions, has stayed relatively static.”

Continuity Insights Magazine

Nov./Dec. 2007 issue, page 10

increase in disasters
Increase in Disasters

http://www.oxfam.org/en/files/bp108_climate_change_alarm_0711.pdf/download

increase in number killed by disasters
Increase in Number Killed by Disasters

http://www.oxfam.org/en/files/bp108_climate_change_alarm_0711.pdf/download


Trends

  • Longer inter-site distances for better protection

“Some CIOs are imagining potential disasters that go well beyond the everyday hiccups that can disrupt applications and networks. Others, recognizing how integral IT is to business today, are focusing on the need to recover instantaneously from any unforeseen event.” … “It's a different world. There are so many more things to consider than the traditional fire, flood and theft.”

“Redefining Disaster“

Mary K. Pratt, Computerworld, June 20, 2005

http://www.computerworld.com/hardwaretopics/storage/story/0,10801,102576,00.html


“The blackout has pushed many companies to expand their data center infrastructures to support data replication between two or even three IT facilities -- one of which may be located on a separate power grid.”

Computerworld, August 2, 2004

http://www.computerworld.com/securitytopics/security/recovery/story/0,10801,94944,00.html


“You have to be far enough apart to make sure that conditions in one place are not likely to be duplicated in the other.” … “A useful rule of thumb might be a minimum of about 50 km, the length of a MAN, though the other side of the continent might be necessary to play it safe.”

“Disaster Recovery Sites: How Far Away is Far Enough?”

Drew Robb, Datamation, October 4, 2005

http://www.enterprisestorageforum.com/continuity/features/article.php/3552971

trends longer inter site distances for better protection
Trends: Longer inter-site distances for better protection
  • In the past, protection focused on risks like fires, floods, and tornadoes; 1 to 5 miles between sites was considered fine.
  • Right after 9/11, 60 to 100 miles looked much better.
  • After the Northeast Blackout of 2003, and with growing awareness that a terrorist group obtaining a nuclear device and wiping out an entire metropolitan area is no longer inconceivable:
    • Resulting pressure is for inter-site distances of 1,000 to 1,500 miles
  • Challenges:
    • Telecommunications links
    • Latency due to speed of light adversely affects performance

Trends

  • Business pressures for shorter distances for performance

“A 1-millisecond advantage in trading applications can be worth $100 million a year to a major brokerage firm, by one estimate.”

Richard Martin, InformationWeek,

April 23, 2007


“The fastest systems, running from traders' desks to exchange data centers, can execute transactions in a few milliseconds -- so fast, in fact, that the physical distance between two computers processing a transaction can slow down how fast it happens.”

Richard Martin, InformationWeek,

April 23, 2007


“This problem is called data latency -- delays measured in split seconds. To overcome it, many high-frequency algorithmic traders are moving their systems as close to the Wall Street exchanges as possible.”

Richard Martin, InformationWeek,

April 23, 2007


Trends

  • Increasing pressure not to bridge LANs between sites
trends increasing resistance to lan bridging
Trends: Increasing Resistance to LAN Bridging
  • In the past, setting up a VLAN spanning sites for an OpenVMS disaster-tolerant cluster was common
  • Networks are now IP-centric
  • IP network mindset sees LAN bridging as “bad,” sometimes even “totally unacceptable”
  • Alternatives:
    • Separate, private link for OpenVMS Multi-site Cluster
    • Metropolitan Area Networks (MANs) using MPLS
    • Ethernet-over-IP (EoIP)
    • SCS-over-IP support planned for OpenVMS 8.4
disaster tolerant clusters foundation
Disaster-Tolerant Clusters: Foundation
  • Goal: Survive loss of an entire datacenter (or 2)
  • Foundation:
    • Two or more datacenters a “safe” distance apart
    • Cluster software for coordination
    • Inter-site link for cluster interconnect
    • Data replication of some sort for 2 or more identical copies of data, one at each site
disaster tolerant clusters foundation1
Disaster-Tolerant Clusters: Foundation
  • Foundation:

Management and monitoring tools

      • Remote system console access or KVM system
      • Failure detection and alerting, for things like:
        • Network monitoring (especially for inter-site link)
        • Shadowset member loss
        • Node failure
      • Quorum recovery tool or mechanism (for 2-site clusters with balanced votes)
    • HP toolsets for managing OpenVMS DT Clusters:
      • Tools included in DTCS package (Windows-based)
      • CockpitMgr (OpenVMS-based)
disaster tolerant clusters foundation2
Disaster-Tolerant Clusters: Foundation
  • Foundation:

Knowledge and Implementation Assistance

    • Feasibility study, planning, configuration design, and implementation assistance, plus staff training
      • HP recommends HP Disaster Tolerant Services consulting services to meet this need:
        • http://h20219.www2.hp.com/services/cache/10597-0-0-225-121.aspx
disaster tolerant clusters foundation3
Disaster-Tolerant Clusters: Foundation
  • Foundation:

Procedures and Documentation

    • Carefully-planned (and documented) procedures for:
      • Normal operations
      • Scheduled downtime and outages
      • Detailed diagnostic and recovery action plans for various failure scenarios
disaster tolerant clusters foundation4
Disaster-Tolerant Clusters: Foundation
  • Foundation:

Data Replication

    • Data is constantly replicated or copied to a 2nd site (and possibly a 3rd), so data is preserved in a disaster
    • Solution must also be able to redirect applications and users to the site with the up-to-date copy of the data
    • Examples:
      • OpenVMS Volume Shadowing Software
      • Continuous Access for EVA or XP
      • Database replication
      • Reliable Transaction Router
disaster tolerant clusters foundation5
Disaster-Tolerant Clusters: Foundation
  • Foundation:

Complete redundancy in facilities and hardware

    • Second site with its own storage, networking, computing hardware, and user access mechanisms in place
    • Sufficient computing capacity is in place at the alternate site(s) to handle expected workloads alone if one site is destroyed
    • Monitoring, management, and control mechanisms are in place to facilitate fail-over
planning for disaster tolerance
Planning for Disaster Tolerance
  • Remembering that the goal is to continue operating despite loss of an entire datacenter,
    • All the pieces must be in place to allow that:
      • User access to both sites
      • Network connections to both sites
      • Operations staff at both sites
    • Business can’t depend on anything that is only at either site
planning for dt site selection
Planning for DT: Site Selection

Sites must be carefully selected:

  • Avoid hazards
    • Especially hazards common to both (and the loss of both datacenters at once which might result from that)
  • Make them a “safe” distance apart
  • Select site separation in a “safe” direction
planning for dt what is a safe distance
Planning for DT: What is a “Safe Distance”

Analyze likely hazards of proposed sites:

  • Natural hazards
    • Fire (building, forest, gas leak, explosive materials)
    • Storms (Tornado, Hurricane, Lightning, Hail, Ice)
    • Flooding (excess rainfall, dam breakage, storm surge, broken water pipe)
    • Earthquakes, Tsunamis
planning for dt what is a safe distance1
Planning for DT: What is a “Safe Distance”

Analyze likely hazards of proposed sites:

  • Man-made hazards
    • Nearby transportation of hazardous materials (highway, rail)
    • Terrorist with a bomb
    • Disgruntled customer with a weapon
    • Enemy attack in war (nearby military or industrial targets)
    • Civil unrest (riots, vandalism)
planning for dt site separation distance
Planning for DT: Site Separation Distance
  • Make sites a “safe” distance apart
  • This must be a compromise. Factors:
    • Risks
    • Performance (inter-site latency)
    • Interconnect costs
    • Ease of travel between sites
    • Availability of workforce
planning for dt site separation distance1
Planning for DT: Site Separation Distance
  • Select site separation distance:
    • 1-3 miles: protects against most building fires, natural gas leaks, armed intruders, terrorist bombs
    • 10-30 miles: protects against most tornadoes, floods, hazardous material spills, release of poisonous gas, non-nuclear military bomb strike
    • 100-300 miles: protects against most hurricanes, earthquakes, tsunamis, forest fires, most biological weapons, most power outages, suitcase-sized nuclear bomb
    • 1,000-3,000 miles: protects against “dirty” bombs, major region-wide power outages, and possibly military nuclear attacks

Threat Radius


"You have to be far enough away to be beyond the immediate threat you are planning for.“…"At the same time, you have to be close enough for it to be practical to get to the remote facility rapidly.“

“Disaster Recovery Sites: How Far Away is Far Enough?” By Drew Robb

Enterprise Storage Forum, September 30, 2005

http://www.enterprisestorageforum.com/continuity/features/article.php/3552971


“Survivors of hurricanes, floods, and the London terrorist bombings offer best practices and advice on disaster recovery planning.”

“A Watertight Plan” By Penny Lunt Crosman, IT Architect, Sept. 1, 2005

http://www.itarchitect.com/showArticle.jhtml?articleID=169400810

planning for dt site separation direction
Planning for DT: Site Separation Direction
  • Select site separation direction:
    • Not along same earthquake fault-line
    • Not along likely storm tracks
    • Not in same floodplain or downstream of same dam
    • Not on the same coastline
    • Not in line with prevailing winds (that might carry hazardous materials or radioactive fallout)
planning for dt providing total redundancy
Planning for DT: Providing Total Redundancy
  • Redundancy must be provided for:
    • Datacenter and facilities (A/C, power, user workspace, etc.)
    • Data
      • And data feeds, if any
    • Systems
    • Network
    • User access and workspace
    • Workers themselves
planning for dt life after a disaster
Planning for DT: Life After a Disaster
  • Also plan for continued operation after a disaster
    • Surviving site will likely have to operate alone for a long period before the other site can be repaired or replaced
    • If surviving site was “lights-out”, it will now need to have staff on-site
    • Provide redundancy within each site
      • Facilities: Power feeds, A/C
      • Mirroring or RAID to protect disks
      • Clustering for servers
      • Network redundancy
planning for dt life after a disaster1
Planning for DT: Life After a Disaster
  • Plan for continued operation after a disaster
    • Provide enough capacity within each site to run the business alone if the other site is lost
      • and handle normal workload growth rate
    • Having 3 full datacenters is an option to seriously consider:
      • Leaves two redundant sites after a disaster
      • Leaves 2/3 capacity instead of ½
planning for dt life after a disaster2
Planning for DT: Life After a Disaster
  • When running workload at both sites, be careful to watch utilization.
  • Utilization over 35% will result in utilization over 70% if one site is lost
  • Utilization over 50% will mean there is no possible way one surviving site can handle all the workload
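  • A quick worked example (illustrative numbers): if each of two sites runs at 40% of its own capacity, the survivor of a site loss must absorb 40% + 40% = 80% of its capacity — workable, but with little headroom. At 55% each, the survivor would need 55% + 55% = 110%, which is impossible.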
planning for dt testing
Planning for DT: Testing
  • Separate test environment is very helpful, and highly recommended
  • Spreading test environment across inter-site link is best
  • Good practices require periodic testing of a simulated disaster. Allows you to:
    • Validate your procedures
    • Train your people
inter site link s
Inter-site Link(s)
  • Sites linked by:
    • DS-3/T3 (E3 in Europe) or ATM circuits from a TelCo
    • Microwave link
    • Radio Frequency link (e.g. UHF, wireless)
    • Free-Space Optics link (short distance, low cost)
    • “Dark fiber” where available (or lambdas over WDM):
      • Ethernet over fiber (10 Mb, Fast, Gigabit, 10-Gigabit)
      • Fibre Channel
      • FDDI
      • Fiber links between Memory Channel switches (up to 3 km)
dark fiber availability example
Dark Fiber Availability Example

Source: AboveNet above.net

dark fiber availability example1
Dark Fiber Availability Example

Source: AboveNet above.net

inter site link options
Inter-site Link Options
  • Sites linked by:
    • Wave Division Multiplexing (WDM), in either Coarse (CWDM) or Dense (DWDM) Wave Division Multiplexing flavors
      • Can carry any of the types of traffic that can run over a single fiber
    • Individual WDM channel(s) from a vendor, rather than entire dark fibers
inter site link choices
Inter-Site Link Choices
  • Service type choices
    • Telco-provided data circuit service, own microwave link, FSO link, dark fiber, lambda?
    • Dedicated bandwidth, or shared pipe?
    • Single or multiple (redundant) links? If redundant links, then:
      • Diverse paths?
      • Multiple vendors?
cross site data replication methods
Cross-site Data Replication Methods
  • Hardware
    • Storage controller
  • Software
    • Host-based Volume Shadowing (mirroring) software
    • Database replication or log-shipping
    • Transaction-processing monitor or middleware with replication functionality, e.g. Reliable Transaction Router (RTR)
host based volume shadowing
Host-Based Volume Shadowing
  • Host software keeps multiple disks identical
  • All writes go to all shadowset members
  • Reads can be directed to any one member
    • Different read operations can go to different members at once, helping throughput
  • Synchronization (or Re-synchronization after a failure) is done with a Copy operation
  • Re-synchronization after a node failure is done with a Merge operation
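
As an illustration (not from the original slides), a two-member shadowset is created and mounted with a single MOUNT command; the virtual-unit name, member device names, and volume label below are hypothetical:

$ ! Form shadowset DSA1: from one member at each site (illustrative names)
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA101:, $1$DGA201:) DATA1 DISK$DATA1

Applications then direct all their I/O to the virtual unit DSA1: rather than to the individual members.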
managing replicated data
Managing Replicated Data
  • If the inter-site link fails, both sites might conceivably continue to process transactions, and the copies of the data at each site would continue to diverge over time
  • This is called a “Partitioned Cluster” in the OpenVMS world, or “Split-Brain Syndrome” in the UNIX world
  • The most common solution to this potential problem is a Quorum-based scheme
quorum schemes
Quorum Schemes
  • Idea comes from familiar parliamentary procedures
  • Systems and/or disks are given votes
  • Quorum is defined to be a simple majority of the total votes
quorum schemes1
Quorum Schemes
  • In the event of a communications failure,
    • Systems in the minority voluntarily suspend processing, while
    • Systems in the majority can continue to process transactions
quorum schemes2
Quorum Schemes
  • To handle cases where there are an even number of votes
    • For example, with only 2 systems,
    • Or where half of the votes are at each of 2 sites
  • provision may be made for
    • a tie-breaking vote, or
    • human intervention
quorum schemes tie breaking vote
Quorum Schemes:Tie-breaking vote
  • This can be provided by a disk:
    • Quorum Disk
  • Or by a system with a vote, located at a 3rd site
    • Additional OpenVMS cluster member, called a “quorum node”
quorum scheme under openvms
Quorum Scheme Under OpenVMS
  • Rule of Total Connectivity
  • VOTES
  • EXPECTED_VOTES
  • Quorum
  • Loss of Quorum
  • Selection of Optimal Sub-cluster after a failure
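
As a sketch of how the votes are typically configured (values purely illustrative), VOTES and EXPECTED_VOTES are system parameters, most often set in MODPARAMS.DAT and applied with AUTOGEN:

! Excerpt from SYS$SYSTEM:MODPARAMS.DAT on one node (illustrative values)
VOTES = 1                   ! votes contributed by this node
EXPECTED_VOTES = 5          ! total votes configured across the entire cluster
! If a quorum disk is used as the tie-breaker instead of a quorum node:
DISK_QUORUM = "$1$DGA10"    ! quorum disk device name (hypothetical)
QDSKVOTES = 1               ! votes contributed by the quorum disk

Quorum is a simple majority of the total votes; the Connection Manager applies these rules at each cluster state transition.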
quorum configurations in multi site clusters
Quorum configurations in Multi-Site Clusters
  • 3 sites, equal votes in 2 sites
    • Intuitively ideal; easiest to manage & operate
    • 3rd site contains a “quorum node”, and serves as tie-breaker

(Diagram: Site A with 2 votes, Site B with 2 votes, and a 3rd site with 1 vote)

quorum configurations in multi site clusters1
Quorum configurations in Multi-Site Clusters
  • 3 sites, equal votes in 2 sites
    • Hard to do in practice, due to cost of inter-site links beyond on-campus distances
      • Could use links to 3rd site as backup for main inter-site link if links are high-bandwidth and connected together
      • Could use 2 less-expensive, lower-bandwidth links to 3rd site, to lower cost
quorum configurations in 3 site clusters
Quorum configurations in 3-Site Clusters

(Diagram: nodes [N] at three sites connected through bridges [B]; inter-site links range from 10-megabit circuits up to DS3, GbE, FC, or ATM)

quorum configurations in multi site clusters2
Quorum configurations in Multi-Site Clusters
  • 2 sites:
    • Most common & most problematic:
      • How do you arrange votes? Balanced? Unbalanced?
      • If votes are balanced, how do you recover from loss of quorum which will result when either site or the inter-site link fails?

Site A

Site B

quorum configurations in two site clusters
Quorum configurations in Two-Site Clusters
  • One solution: Unbalanced Votes
    • More votes at one site
    • Site with more votes can continue without human intervention in the event of loss of either the other site or the inter-site link
    • Site with fewer votes pauses on a failure and requires manual action to continue after loss of the other site

(Diagram: Site A with 2 votes can continue automatically; Site B with 1 vote requires manual intervention to continue alone)

quorum configurations in two site clusters1
Quorum configurations in Two-Site Clusters
  • Unbalanced Votes
    • Common mistake:
      • Give more votes to Primary site
      • Leave Standby site unmanned
    • Result: cluster can’t run without Primary site, unless there is human intervention, which is unavailable at the (unmanned) Standby site

(Diagram: staffed Site A with 1 vote can continue automatically; lights-out Site B with 0 votes requires manual intervention to continue alone)

quorum configurations in two site clusters2
Quorum configurations in Two-Site Clusters
  • Unbalanced Votes
    • Also very common in remote-shadowing-only clusters (for data vaulting -- not full disaster-tolerant clusters)
      • 0 votes is a common choice for the remote site in this case
        • But that has its dangers

(Diagram: Site A with 1 vote can continue automatically; remote Site B with 0 votes requires manual intervention to continue alone)

optimal sub cluster selection
Optimal Sub-cluster Selection

The Connection Manager compares potential node subsets that could make up the surviving portion of the cluster

  • Pick sub-cluster with the most votes
  • If votes are tied, pick sub-cluster with the most nodes
  • If nodes are tied, arbitrarily pick a winner
    • based on comparing SCSSYSTEMID values of set of nodes with most-recent cluster software revision
two site cluster with unbalanced votes1
Two-Site Cluster with Unbalanced Votes

(Diagram: four nodes across two sites sharing shadowsets, with votes of 1, 1, 0, and 0.) Which subset of nodes is selected as the optimal sub-cluster?

two site cluster with unbalanced votes2
Two-Site Cluster with Unbalanced Votes

(Diagram: the nodes at the site holding the votes [1 and 1] continue; the nodes at the site with 0 votes CLUEXIT.)

two site cluster with unbalanced votes3
Two-Site Cluster with Unbalanced Votes

(Diagram: one possible solution is to assign votes of 2, 2, 1, and 1 to the four nodes.)

quorum configurations in two site clusters3
Quorum configurations in Two-Site Clusters
  • Balanced Votes
    • Equal votes at each site
    • Manual action required to restore quorum and continue processing in the event of either:
      • Site failure, or
      • Inter-site link failure

(Diagram: Site A with 2 votes and Site B with 2 votes; either site requires manual intervention to continue alone)

quorum recovery methods
Quorum Recovery Methods
  • Software interrupt at IPL 12 from console
    • IPC> Q
  • Availability Manager (or DECamds):
    • System Fix; Adjust Quorum
  • DTCS (or BRS) integrated tool, using same RMDRIVER (DECamds client) interface as DECamds / AM
host based volume shadowing and storageworks continuous access
Host-Based Volume Shadowing and StorageWorks Continuous Access
  • Fibre Channel introduces new capabilities into OpenVMS disaster-tolerant clusters
fibre channel and scsi in clusters
Fibre Channel and SCSI in Clusters
  • Fibre Channel and SCSI are Storage-Only Interconnects
    • Provide access to storage devices and controllers
    • Cannot carry SCS protocol (e.g. Connection Manager and Lock Manager traffic)
      • Need SCS-capable Cluster Interconnect also:
        • Memory Channel, Computer Interconnect (CI), DSSI, FDDI, Ethernet, or Galaxy Shared Memory Cluster Interconnect (SMCI)
    • Fail-over between a direct path and an MSCP-served path is first supported in OpenVMS version 7.3-1
host based volume shadowing1
Host-Based Volume Shadowing

(Diagram: two nodes joined by an SCS-capable interconnect; each node has its own FC switch and EVA array, and the two EVAs’ disks form a host-based shadowset)

host based volume shadowing with inter site fibre channel link
Host-Based Volume Shadowing with Inter-Site Fibre Channel Link

(Diagram: the same two-node configuration, with an inter-site FC link joining the two FC switches)

direct vs mscp served paths
Direct vs. MSCP-Served Paths

(Diagram: the same configuration, used to contrast the direct FC path to the remote EVA with the MSCP-served path through the remote node)

direct vs mscp served paths1
Direct vs. MSCP-Served Paths

(Diagram: the same configuration, continuing the direct vs. MSCP-served path comparison)

host based volume shadowing with inter site fibre channel link1
Host-Based Volume Shadowing with Inter-Site Fibre Channel Link

(Diagram: the same two-site configuration with the inter-site FC link and host-based shadowset)

new failure scenarios scs link ok but fc link broken
New Failure Scenarios: SCS link OK but FC link broken

(Direct-to-MSCP-served path failover provides protection)

(Diagram: the same configuration with the inter-site FC link broken while the SCS-capable interconnect remains up)

new failure scenarios scs link broken but fc link ok
New Failure Scenarios: SCS link broken but FC link OK

(Quorum scheme provides protection)

(Diagram: the same configuration with the SCS-capable interconnect broken while the inter-site FC link remains up)

cross site shadowed system disk
Cross-site Shadowed System Disk
  • With only an SCS link between sites, it was impractical to have a shadowed system disk and boot nodes from it at multiple sites
  • With a Fibre Channel inter-site link, it becomes possible to do this
    • but it is probably still not a good idea (single point of failure for the cluster)
shadowing performance
Shadowing Performance
  • Shadowing is primarily an availability tool, but:
    • Shadowing can often improve performance (e.g. for reads)
    • Shadowing can also decrease performance (e.g. for writes, and during merges and copies)
shadowset state hierarchy
Shadowset State Hierarchy
  • Transient states:
    • Mini-Merge state
    • Copy state (Mini-Copy or Full Copy)
    • Full Merge state
  • Steady state
copy and merge
Copy and Merge
  • How a Copy or Merge gets started:
    • A $MOUNT command triggers a Copy
    • This implies a Copy is a Scheduled operation
    • Departure of a system from the cluster while it has a shadowset mounted triggers a Merge
    • This implies a Merge is an Unscheduled operation
i o algorithms used by shadowing
I/O Algorithms Used by Shadowing
  • Variations:
    • Reads vs. Writes
    • Steady-State vs. Copy vs. Merge
      • And for Copies and Merges,
        • Full- or Mini- operation
        • Copy or Merge Thread I/Os
        • Application I/Os: Ahead of or Behind the Copy/Merge “Fence”
application read during steady state
Application Read during Steady-State

Application

  • Read from any single member

SHDRIVER

1. Read

Member

Member

Member

application write during steady state
Application Write during Steady-State

Application

  • Write to all members in parallel

SHDRIVER

1. Write

1. Write

1. Write

Member

Member

Member

full copy and full merge thread algorithms
Full-Copy and Full-Merge Thread Algorithms
  • Action or Event triggers Copy or Merge:
    • $MOUNT command triggers Copy (scheduled operation)
    • System departure from cluster while it has shadowset mounted triggers Merge (unscheduled operation)
  • SHADOW_SERVER on one node picks up Thread from work queue and does I/Os for the Copy or Merge Thread
    • SHADOW_MAX_COPY parameter controls maximum number of Threads on a node
  • SHAD$COPY_BUFFER_SIZE logical name controls number of blocks handled at a time
    • No double-buffering or other speed-up tricks are used
full copy and full merge thread algorithms1
Full-Copy and Full-Merge Thread Algorithms
  • Start at first Logical Block on disk (LBN zero)
  • Process 127 blocks at a time from beginning to end
  • Symbolic “Fence” separates processed area from un-processed area

Beginning

Processed area

Fence

Un-processed area

Thread

Progress

End

full copy thread algorithm
Full-Copy Thread Algorithm
  • Read from source
  • Compare with target
  • If different, write data to target and start over at Step 1.

Shadow_Server

1. Read

2. Compare

3. Write

Source

Target

full copy thread algorithm1
Full-Copy Thread Algorithm

Why the odd algorithm?

  • To ensure correct results in cases like:
    • With application writes occurring in parallel with the copy thread
      • application writes during a Full Copy go to the set of source disks first, then to the set of target disk(s)
    • On system disks with an OpenVMS node booting or dumping while shadow copy is going on
  • To avoid the overhead of having to use the Distributed Lock Manager for every I/O
speeding shadow copies
Speeding Shadow Copies
  • Implications:
    • Shadow copy completes fastest if data is identical beforehand
      • Fortunately, this is the most common case – re-adding a member that previously belonged to the shadowset
full copy thread algorithm data identical
Full-Copy Thread Algorithm (Data Identical)
  • Read from source
  • Compare with target

Shadow_Server

1. Read

2. Compare

Source

Target

full copy thread algorithm data different
Full-Copy Thread Algorithm (Data Different)
  • Read from source
  • Compare with target (difference found)
  • Write to target
  • Read from source
  • Compare with target (difference seldom found). If different, write to target and start over at Step 4.

Shadow_Server

1. Read

4. Read

2. Compare

5. Compare

3. Write

6. Write

Source

Target

speeding shadow copies1
Speeding Shadow Copies
  • If data is very different, empirical tests have shown that it is faster to:
    • Do BACKUP/PHYSICAL from source shadowset to /FOREIGN-mounted target disk
    • Then do shadow copy afterward
  • than to simply initiate the shadow copy with differing data.
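
A sketch of that sequence with hypothetical device names (the exact mount states required by BACKUP/PHYSICAL, and whether the source must be quiesced, depend on the OpenVMS version — treat this as an outline, not a tested procedure):

$ MOUNT/FOREIGN $1$DGA201:                      ! disk to be (re)added, mounted foreign
$ BACKUP/PHYSICAL DSA1: $1$DGA201:              ! block-level copy from the source shadowset
$ DISMOUNT $1$DGA201:
$ MOUNT/SYSTEM DSA1: /SHADOW=$1$DGA201: DATA1   ! shadow copy now runs against nearly identical data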
why a merge is needed
Why a Merge is Needed
  • If a system that has a shadowset mounted (and so may be writing to it) crashes, then:
    • The state of any “in-flight” Write I/Os is indeterminate
      • All, some, or none of the members may have been written to
    • An application Read I/O could potentially return different data for the same block on different disks
  • Since Shadowing guarantees that the data on all shadowset members must appear to be identical, this uncertainty must be removed
    • This is done by a Merge operation, coupled with special treatment of Reads until the Merge is complete
what happens in full merge state
What Happens in Full Merge State

Full Merge operation will:

  • Read and compare data on all members
  • Fix any differences found
  • Moves across disk from start to end, 127 blocks at a time
  • Progress tracked by a “Fence” separating the merged (behind the fence) area from the unmerged (ahead of the fence) area

Special treatment of Reads:

  • Every application read I/O must be merged:
    • Read data from each member (for just the area of the read)
    • Detect any differences
    • Overwrite any different data with data from Master member
    • Return the now-consistent read data to the application
  • This must be done for any Reads ahead of the Merge Fence
hbvs merge operations
HBVS Merge Operations

Q: Why doesn’t it matter which member we choose to duplicate to the others on a merge? Don’t we need to ensure we replicate the newest data?

A: Shadowing reproduces the semantics of a regular disk:

1) Each time you read a block, you always get the same data returned

2) Once you do a write and get successful status back, you always expect to get the new data instead

3) If the system fails while you are doing a write, and the application doesn’t get successful status returned, you don’t know if the new or old data is on disk. You have to use journaling technology to handle this case.

Shadowing provides the same behavior.

full merge thread algorithm
Full-Merge Thread Algorithm
  • Read from master
  • Compare with other member(s)
  • If different, write data to target(s) and start over at Step 1.

Shadow_Server

1. Read

2. Compare

2. Compare

3. Write

3. Write

Master

Member

Member

speeding shadow copies mini copy operations
Speeding Shadow Copies: Mini-Copy Operations
  • If one knows ahead of time that one shadowset member will temporarily be removed, one can set things up to allow a Mini-Copy instead of a Full-Copy when the disk is returned to the shadowset
  • With OpenVMS Version 8.3, Automatic Mini-Copy on Volume Processing allows a Mini-Merge write bitmap to be designated as Multi-Use. If a member must be removed from the shadowset, that bitmap is converted into a Mini-Copy bitmap, so that a Mini-Copy (rather than a Full Copy) can restore redundancy once the member is repaired and returned to the shadowset. (A sketch of the commands follows.)
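
A hedged sketch of the planned-removal case (device names and volume label are hypothetical; confirm the exact /POLICY keywords against the MOUNT and DISMOUNT documentation for your OpenVMS version):

$ ! Remove a member while keeping a write bitmap for a later Mini-Copy
$ DISMOUNT/POLICY=MINICOPY $1$DGA201:
$ ! ... planned maintenance on the removed member ...
$ ! Return the member; only blocks flagged in the bitmap are copied
$ MOUNT/SYSTEM DSA1: /SHADOW=$1$DGA201: /POLICY=MINICOPY DATA1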
mini copy thread algorithm
Mini-Copy Thread Algorithm
  • Only for areas the write bitmap indicates have changed:
  • Read from source
  • Write to target
  • Read from source
  • Compare with target
  • If different (quite unlikely), write data to target and start over at Step 3.

Shadow_Server

1. Read

3. Read

4. Compare

2. Write

5. Write

Source

Target

mini copy vs full copy
Mini-Copy vs. Full-Copy

If data is different, Full-Copy algorithm requires 5 I/O operations per 127-block segment:

  • Read source
  • Compare target
  • Write target
  • Read source
  • Compare target

Mini-Copy requires only 4 I/O operations per segment:

  • Read source
  • Write target
  • Read source
  • Compare target

So Mini-Copy is faster even if 100% of data has changed

compare operation
Compare operation
  • On MSCP controllers (HSJ, HSD), this is an MSCP Compare Host Data operation
    • Data is passed to the controller, which compares it with the data on disk, returning status
      • Cache is explicitly bypassed for the comparison
        • So Read-Ahead and Read Cache do not help with Compares
      • But Read-Ahead can be very helpful with Source disk Reads, particularly when the disks are close to identical
  • On SCSI / Fibre Channel controllers (EVA, MSA, XP, HSG, HSZ), this is emulated in DKDRIVER
    • Data is read from the disk into a second buffer and compared with the data using the host CPU (in interrupt state, at IPL 8)
      • Read-ahead cache can be very helpful, particularly when disks are close to identical
speeding shadow copies2
Speeding Shadow Copies
  • Read-Ahead cache in storage controller subsystems can assist with source read I/Os
  • By default, Shadowing reads from source members in a round-robin fashion
    • But with 3-member shadowsets, with two source members, I/Os are divided among 2 sources
  • $SET DEVICE /COPY_SOURCE device
    • Applied to shadowset member, makes the member the preferred source for copies & merges
    • Applied to Virtual Unit, current master member is used as source for copies & merges
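
For example (device names hypothetical):

$ SET DEVICE/COPY_SOURCE $1$DGA101:   ! prefer this member as the copy/merge source
$ SET DEVICE/COPY_SOURCE DSA1:        ! or: use the current master member of the virtual unit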
full merge thread algorithm1
Full-Merge Thread Algorithm
  • Read from master
  • Compare with other member(s)
  • If different, write data to target(s) and start over at Step 1.

Shadow_Server

1. Read

2. Compare

2. Compare

3. Write

3. Write

Master

Member

Member

host based mini merge
Host-Based Mini-Merge
  • One or more OpenVMS systems keep a bitmap of areas that have been written to recently
    • Other nodes notify (update) these nodes as new areas are written to
    • Bitmaps are cleared periodically
  • If a node crashes, only areas recently written to, as indicated by the bitmap, must be merged
  • If all the bitmaps are lost, a Full Merge is needed
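
To illustrate, Host-Based Mini-Merge is enabled by defining an HBMM policy on the virtual unit with the SET SHADOW command; the qualifier syntax below is from memory and version-dependent, so treat it as an assumption and verify with HELP SET SHADOW:

$ ! Keep write bitmaps on up to 6 nodes chosen from the whole cluster (illustrative policy)
$ SET SHADOW DSA1: /POLICY=HBMM=(MASTER_LIST=*, COUNT=6)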
speeding shadow copies merges
Speeding Shadow Copies/Merges
  • Because of the sequential, non-pipelined nature of the shadow-copy algorithm, progressing across the entire LBN range of the disk, to speed shadow copies/merges:
    • Rather than forming controller-based stripesets and shadowing those, shadow individual disks in parallel, and combine them into RAID-0 (0+1) arrays with host-based RAID software
speeding shadow copies merges1
Speeding Shadow Copies/Merges
  • Dividing a disk into 4 partitions at the controller level and shadow-copying all 4 in parallel takes only 40% of the time required to shadow-copy the entire disk as a whole
application i os
Application I/Os
  • During Full-Copy
  • During Full-Merge
application read during full copy
Application Read During Full-Copy
  • In Processed area:
  • Pick any member to satisfy a read from

Processed area

Fence

  • In Un-processed area:
  • Pick any source member to satisfy a read from (never read from target)

Un-processed area

application read during full copy already copied area
Application Read during Full-Copy; already-copied area

Application

  • Read from any single member
  • Return data to application

SHDRIVER

1. Read

Member

Member

Member

application read during full copy not yet copied area
Application Read during Full-Copy; not-yet-copied area

Application

  • Read from any source disk
  • Return data to application

SHDRIVER

1. Read

Source

Source

Target

Full-Copy underway

application read during full merge
Application Read During Full-Merge
  • In Processed area:
  • Pick one member to read from

Processed area

Fence

  • In Un-processed area:
  • Merge the area covered by the read operation
  • Return data

Un-processed area

application read during full merge not yet merged area
Application Read during Full-Merge; not-yet-merged area

Application

  • Read from source
  • Compare with target(s)
  • If different, write data to target(s) and start over at Step 1.
  • Return data to application

SHDRIVER

1. Read

2. Compare

2. Compare

3. Write

3. Write

Master

Member

Member

application read during full merge already merged area
Application Read during Full-Merge; already-merged area

Application

  • Read from any single member

SHDRIVER

1. Read

Member

Member

Member

application write during full copy
Application Write During Full-Copy
  • In Processed area:
  • Write to all members in parallel

Processed area

Fence

  • In Un-processed area:
  • Write to all source members in parallel
  • Write to target disk

Un-processed area

application write during full copy already copied area
Application Write during Full-Copy; already-copied area

Application

  • Write to all members in parallel

SHDRIVER

1. Write

1. Write

1. Write

Source

Source

Target

Full-Copy underway

application write during full copy not yet copied area
Application Write during Full-Copy; not-yet-copied area

Application

  • Write to all source member(s); wait until these writes have completed
  • Write to all target(s)

SHDRIVER

1. Write

1. Write

2. Write

Source

Source

Target

Full-Copy underway

application writes during full merge
Application Writes During Full-Merge
  • Regardless of area:
  • Write to all members in parallel

Processed area

Fence

Un-processed area

application write during full merge
Application Write during Full-Merge

Application

  • Write to all members in parallel

SHDRIVER

1. Write

1. Write

1. Write

Master

Member

Member

Full-Merge underway

shadowing between sites in multi site clusters
Shadowing Between Sites in Multi-Site Clusters

Because Direct operations are lower in latency than MSCP-served operations, even when the inter-site distance is small:

It is generally best to have an inter-site Fibre Channel link, if possible.

And because inter-site latency can be much greater than intra-site latency, due to the speed of light:

If the inter-site distance is large, it is best to direct Read operations to the local disks, not remote disks

  • Write operations have to go to all disks in a shadowset, remote as well as local members

If the inter-site distance is small, performance may be best if reads are spread across all members at both sites

shadowset member selection for reads read cost
Shadowset Member Selection for Reads: Read Cost
  • Shadowing assigns a default “read cost” for each shadowset member, with values in this order (lowest cost to highest cost):
      • DECram device
      • Directly-connected device in the same physical location
      • Directly-connected device in a remote location
      • DECram MSCP-served device
      • All other MSCP-served devices
    • Shadowing adds the Read Cost to the Queue Length and picks the member with the lowest value to do the read
      • Round-robin distribution for disks with equal values
shadowset member selection for reads read cost1
Shadowset Member Selection for Reads: Read Cost
  • For even greater control, one can override the default “read cost” to a given disk from a given node:

$ SET DEVICE /READ_COST=nnn $1$DGAnnn:

For reads, OpenVMS adds the Read Cost to the queue length and picks the member with the lowest value to do the read

  • To return all member disks of a shadowset to their default values after such changes have been done, one can use the command:
    • $ SET DEVICE /READ_COST = 1 DSAnnn:
shadowing between sites local vs remote reads on fibre channel
Shadowing Between Sites: Local vs. Remote Reads on Fibre Channel
  • With an inter-site Fibre Channel link, Shadowing can’t tell which Fibre Channel disks are local and which are remote, so we have to give it some help:
    • To indicate which site a given disk is at, do:

$ SET DEVICE/SITE=xxx $1$DGAnnn:

    • To indicate which site a given OpenVMS node is at, do:

$ SET DEVICE/SITE=xxx DSAnn:

for each virtual unit, from each OpenVMS node, specifying the appropriate SITE value for each OpenVMS node.
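
Putting those together for a hypothetical two-site configuration (site numbers and device names are illustrative):

$ ! On every node, tell Shadowing where each member lives
$ SET DEVICE/SITE=1 $1$DGA101:       ! member located at Site 1
$ SET DEVICE/SITE=2 $1$DGA201:       ! member located at Site 2
$ ! On each node, tag the virtual unit with that node's own site
$ SET DEVICE/SITE=1 DSA1:            ! issued on nodes at Site 1
$ SET DEVICE/SITE=2 DSA1:            ! issued on nodes at Site 2

Reads then favor the member whose SITE value matches the node's own, while writes still go to all members.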

data protection mechanisms and scenarios
Data Protection Mechanisms and Scenarios
  • Protection of the data is obviously extremely important in a disaster-tolerant cluster
  • We’ll examine the mechanisms Volume Shadowing uses to protect the data
  • We’ll look at one scenario that has happened in real life and resulted in data loss:
    • “Wrong-way Shadow Copy”
  • We’ll also look at two obscure but potentially dangerous scenarios that theoretically could occur and would result in data loss:
    • “Creeping Doom” and “Rolling Disaster”
protecting shadowed data
Protecting Shadowed Data
  • Shadowing keeps a “Generation Number” in the SCB on shadow member disks
  • Shadowing “Bumps” the Generation number at the time of various shadowset events, such as mounting, dismounting, or membership changes (addition of a member or loss of a member)
protecting shadowed data1
Protecting Shadowed Data
  • Generation number is designed to monotonically increase over time, never decrease
  • Implementation is based on OpenVMS timestamp value
    • During a “Bump” operation it is increased
      • to the current time value, or
        • if the generation number already represents a time in the future for some reason (such as time skew among cluster member clocks), then it is simply incremented
    • The new value is stored on all shadowset members at the time of the Bump operation
protecting shadowed data2
Protecting Shadowed Data
  • Generation number in SCB on removed members will thus gradually fall farther and farther behind that of current members
  • In comparing two disks, a later generation number should always be on the more up-to-date member, under normal circumstances
wrong way shadow copy scenario
“Wrong-Way Shadow Copy” Scenario
  • Shadow-copy “nightmare scenario”:
    • Shadow copy in “wrong” direction copies old data over new
  • Real-life example:
    • Inter-site link failure occurs
    • Due to unbalanced votes, Site A continues to run
      • Shadowing increases generation numbers on Site A disks after removing Site B members from shadowset
wrong way shadow copy
Wrong-Way Shadow Copy

Site B

Site A

Incoming transactions

(Site now inactive)

Inter-site link

Data being updated

Data becomes stale

Generation number still at old value

Generation number now higher

wrong way shadow copy1
Wrong-Way Shadow Copy
  • Site B is brought up briefly by itself
    • Shadowing can’t see Site A disks. Shadowsets mount with Site B disks only. Shadowing bumps generation numbers on Site B disks. Generation number is now greater than on Site A disks.
wrong way shadow copy2
Wrong-Way Shadow Copy

[Diagram: At Site B, the isolated nodes are rebooted just to check hardware and the shadowsets are mounted, so Site B's generation number is now highest even though its data is still stale. Site A continues processing incoming transactions and updating its data; its generation number is unaffected.]

wrong way shadow copy3
Wrong-Way Shadow Copy
  • Link gets fixed. “Just to be safe,” they decide to reboot the cluster from scratch. The main site is shut down. The remote site is shut down. Then both sites are rebooted at once.
  • Shadowing compares shadowset generation numbers and thinks the Site B disks are more current, and copies them over to Site A’s disks. Result: Data Loss.
wrong way shadow copy4
Wrong-Way Shadow Copy

[Diagram: Before the link is restored, the entire cluster is taken down “just in case,” then rebooted from scratch. Site B's generation number is highest, so a shadow copy runs across the inter-site link from Site B to Site A: Site A's valid data is overwritten with Site B's still-stale data.]

protecting shadowed data3
Protecting Shadowed Data
  • If Shadowing can’t “see” a later disk’s SCB (e.g. because that site, or the link to it, is down), it may use an older member and then update the Generation number to a current timestamp value
  • New /POLICY=REQUIRE_MEMBERS qualifier on MOUNT command prevents a mount unless all of the listed members are present for Shadowing to compare Generation numbers on
  • New /POLICY=VERIFY_LABEL qualifier on MOUNT means the volume label on a member must be SCRATCH_DISK, or it won’t be added to the shadowset as a full-copy target (see the example below)
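As a hedged example (the shadowset unit number, member device names, and volume label below are made up for illustration), a mount command using both qualifiers might look like:

$ MOUNT/SYSTEM DSA42: /SHADOW=($1$DGA101:, $1$DGA201:) DATA1 -
      /POLICY=(REQUIRE_MEMBERS, VERIFY_LABEL)

With REQUIRE_MEMBERS, the mount fails unless both listed members are accessible for the generation-number comparison; with VERIFY_LABEL, a disk is accepted as a full-copy target only if it is labeled SCRATCH_DISK.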
creeping doom scenario
“Creeping Doom” Scenario

[Diagram: Two sites connected by an inter-site link, with a shadowset spanning disks at both sites.]

creeping doom scenario1
“Creeping Doom” Scenario

[Diagram: A lightning strike hits the network room, taking out all of the inter-site link(s); the shadowset still spans both sites.]

creeping doom scenario2
“Creeping Doom” Scenario
  • First symptom is failure of link(s) between two sites
    • Forces choice of which datacenter of the two will continue
  • Transactions then continue to be processed at chosen datacenter, updating the data
creeping doom scenario3
“Creeping Doom” Scenario

[Diagram: With the inter-site link down, the chosen site continues processing incoming transactions and its data is being updated; the other site is now inactive and its data becomes stale.]

creeping doom scenario4
“Creeping Doom” Scenario
  • In this scenario, the same failure which caused the inter-site link(s) to go down expands to destroy the entire datacenter
creeping doom scenario5
“Creeping Doom” Scenario

[Diagram: The failure spreads to destroy the surviving datacenter: the data containing the updates is destroyed, leaving only the stale data at the other site.]

creeping doom scenario6
“Creeping Doom” Scenario
  • Transactions processed after “wrong” datacenter choice are thus lost
    • Commitments implied to customers by those transactions are also lost
creeping doom scenario7
“Creeping Doom” Scenario
  • Techniques for avoiding data loss due to “Creeping Doom”:
    • Tie-breaker at 3rd site helps in many (but not all) cases
    • 3rd copy of data at 3rd site
rolling disaster scenario
“Rolling Disaster” Scenario
  • Problem or scheduled outage makes one site’s data out-of-date
  • While doing a shadowing Full-Copy to update the disks at the formerly-down site, a disaster takes out the primary site
rolling disaster scenario1
“Rolling Disaster” Scenario

[Diagram: A shadow full-copy operation runs across the inter-site link from the source disks at the up-to-date site to the target disks at the formerly-down site.]

rolling disaster scenario2
“Rolling Disaster” Scenario

[Diagram: A disaster destroys the source disks while the shadow copy is still in progress; the copy is interrupted, leaving only partially-updated disks at the target site.]

rolling disaster scenario3
“Rolling Disaster” Scenario
  • Techniques for avoiding data loss due to “Rolling Disaster”:
    • Keep copy (backup, snapshot, clone) of out-of-date copy at target site instead of over-writing the only copy there. Perhaps perform shadow copy to a different set of disks.
      • The surviving copy will be out-of-date, but at least you’ll have some copy of the data
  • Keeping a 3rd copy of the data at a 3rd site is the only way to ensure no data is lost
quorum configurations in two site clusters4
Quorum configurations in Two-Site Clusters
  • Balanced Votes
    • Note: Using the REMOVE_NODE option with SHUTDOWN.COM (post V6.2) when taking down a node effectively “unbalances” the votes:

[Diagram: Four nodes with 1 vote each: total votes 4, quorum 3. One node is shut down with REMOVE_NODE, leaving three nodes with 1 vote each: total votes 3, quorum 2.]
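For reference, OpenVMS derives quorum from the total (expected) votes as the integer value (votes + 2) / 2, so 4 votes gives a quorum of 3 while 3 votes gives a quorum of 2; shutting down a node with REMOVE_NODE reduces the vote total, and therefore the quorum, so the remaining nodes can better tolerate a further failure.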

easing system management of disaster tolerant clusters tips and techniques
Easing System Management of Disaster-Tolerant Clusters: Tips and Techniques
  • Create a “cluster-common” disk
  • Use AUTOGEN “include” files
  • Use “Cloning” technique to ease workload of maintaining multiple system disks
system management of disaster tolerant clusters cluster common disk
System Management of Disaster-Tolerant Clusters: Cluster-Common Disk
  • Create a “cluster-common” disk
    • Cross-site shadowset
    • Mount it in SYLOGICALS.COM
    • Put all cluster-common files there, and define logical names in SYLOGICALS.COM to point to them (a sketch follows this list):
      • SYSUAF, RIGHTSLIST
      • Queue file, LMF database, etc.
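A hedged sketch of the idea (the virtual unit, member device names, volume label, and directory name below are invented for illustration; only SYSUAF and RIGHTSLIST are standard logical names):

$ ! Excerpt from SYS$MANAGER:SYLOGICALS.COM
$ MOUNT/SYSTEM/NOASSIST DSA1: /SHADOW=($1$DGA110:, $1$DGA210:) CLUCOMMON
$ DEFINE/SYSTEM/EXEC CLUSTER_COMMON DSA1:[CLUSTER_COMMON]
$ DEFINE/SYSTEM/EXEC SYSUAF     CLUSTER_COMMON:SYSUAF.DAT
$ DEFINE/SYSTEM/EXEC RIGHTSLIST CLUSTER_COMMON:RIGHTSLIST.DAT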
system management of disaster tolerant clusters cluster common disk1
System Management of Disaster-Tolerant Clusters: Cluster-Common Disk
  • Put startup files on cluster-common disk also; and replace startup files on all system disks with a pointer to the common one:
    • e.g. SYS$STARTUP:SYSTARTUP_VMS.COM contains only:
    • $ @CLUSTER_COMMON:SYSTARTUP_VMS
  • To allow for differences between nodes, test for node name in common startup files, e.g.
    • $ NODE = F$GETSYI(“NODENAME”)
    • $ IF NODE .EQS. “GEORGE” THEN ...
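Extending that idea into a hedged sketch (the node names and the procedures being invoked are hypothetical):

$ ! Excerpt from CLUSTER_COMMON:SYSTARTUP_VMS.COM
$ NODE = F$GETSYI("NODENAME")
$ IF NODE .EQS. "GEORGE" THEN @CLUSTER_COMMON:START_BATCH_QUEUES.COM
$ IF NODE .EQS. "MARTHA" THEN @CLUSTER_COMMON:START_WEB_SERVER.COM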
system management of disaster tolerant clusters autogen include files
System Management of Disaster-Tolerant Clusters: AUTOGEN “include” Files
  • Create a MODPARAMS_COMMON.DAT file on the cluster-common disk which contains system parameter settings common to all nodes
    • For multi-site or disaster-tolerant clusters, also create one of these for each site
  • Include an AGEN$INCLUDE_PARAMS line in each node-specific MODPARAMS.DAT to pull in the common parameter settings (see the sketch below)
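A hedged sketch of a node-specific MODPARAMS.DAT (the logical name, file names, node name, and parameter values are invented for illustration):

AGEN$INCLUDE_PARAMS CLUSTER_COMMON:MODPARAMS_COMMON.DAT
AGEN$INCLUDE_PARAMS CLUSTER_COMMON:MODPARAMS_COMMON_SITE_A.DAT
! Node-specific settings follow
SCSNODE = "GEORGE"
MIN_PAGEDYN = 2000000

Running AUTOGEN on any node then picks up the cluster-wide and site-wide settings plus that node's own additions.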
system management of disaster tolerant clusters system disk cloning
System Management of Disaster-Tolerant Clusters: System Disk “Cloning”
  • Use “Cloning” technique to replicate system disks
  • Goal: Avoid doing “n” upgrades for “n” system disks
system management of disaster tolerant clusters system disk cloning1
System Management of Disaster-Tolerant Clusters: System Disk “Cloning”
  • Create “Master” system disk with roots for all nodes. Use Backup to create Clone system disks.
    • To minimize disk space, move dump files off system disk
  • Before an upgrade, save any important system-specific info from Clone system disks into the corresponding roots on the Master system disk
    • Basically anything that’s in SYS$SPECIFIC:[*]
    • Examples: ALPHAVMSSYS.PAR, MODPARAMS.DAT, AGEN$FEEDBACK.DAT, ERRLOG.SYS, OPERATOR.LOG, ACCOUNTNG.DAT
  • Perform upgrade on Master disk
  • Use BACKUP to copy Master disk to Clone disks again.
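A hedged sketch of that last step (device names are invented; the clone disk is assumed to be mounted /FOREIGN, as required for the output volume of an image backup):

$ MOUNT/FOREIGN $1$DGA200:                      ! clone system disk
$ BACKUP/IMAGE/VERIFY $1$DGA100: $1$DGA200:     ! master disk -> clone disk
$ DISMOUNT $1$DGA200: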
historical context
Historical Context

Example: New York City, USA

  • 1993 World Trade Center bombing raised awareness of DR and prompted some improvements
  • Sept. 11, 2001 has had dramatic and far-reaching effects
    • Scramble to find replacement office space
    • Many datacenters moved off Manhattan Island, some out of NYC entirely
    • Increased distances to DR sites
    • Induced regulatory responses (in USA & abroad)
trends and driving forces in the us
Trends and Driving Forces in the US
  • BC, DR and DT in a post-9/11 world:
    • Recognition of greater risk to datacenters
      • Particularly in major metropolitan areas
    • Push toward greater distances between redundant datacenters
      • It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area
trends and driving forces in the us1
Trends and Driving Forces in the US
  • "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System“
    • http://www.sec.gov/news/studies/34-47638.htm
  • Agencies involved:
    • Federal Reserve System
    • Department of the Treasury
    • Securities & Exchange Commission (SEC)
  • Applies to:
    • Financial institutions critical to the US economy

us draft interagency white paper
US Draft Interagency White Paper

The early “concept release” inviting input made mention of a 200-300 mile limit (only as part of an example when asking for feedback as to whether any minimum distance value should be specified or not):

“Sound practices. Have the agencies sufficiently described expectations regarding out-of-region back-up resources? Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200 - 300 miles between primary and back-up sites)? What factors should be used to identify such a minimum distance?”

us draft interagency white paper1
US Draft Interagency White Paper

This induced panic in several quarters:

  • NYC feared additional economic damage of companies moving out
  • Some pointed out the technology limitations of certain synchronous mirroring products, and of Fibre Channel at the time, which typically limited them to distances of about 100 miles or 100 kilometers

Revised draft contained no specific distance numbers; just cautionary wording

Ironically, that same non-specific wording now often results in DR datacenters 1,000 to 1,500 miles away

us draft interagency white paper2
US Draft Interagency White Paper

“Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.”

“Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location.”

us draft interagency white paper3
US Draft Interagency White Paper

“Organizations should establish back-up facilities a significant distance away from their primary sites.”

“The agencies expect that, as technology and business processes … continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites.”

ripple effect of regulatory activity within the usa
Ripple effect of Regulatory Activity Within the USA
  • National Association of Securities Dealers (NASD):
    • Rule 3510 & 3520
  • New York Stock Exchange (NYSE):
    • Rule 446
ripple effect of regulatory activity outside the usa
Ripple effect of Regulatory Activity Outside the USA
  • United Kingdom: Financial Services Authority:
    • Consultation Paper 142 – Operational Risk and Systems Control
  • Europe:
    • Basel II Accord
  • Australian Prudential Regulation Authority
    • Prudential Standard for business continuity management APS 232 and guidance note AGN 232.1
  • Monetary Authority of Singapore (MAS)
    • “Guidelines on Risk Management Practices – Business Continuity Management” affecting “Significantly Important Institutions” (SIIs)
resiliency maturity model project
Resiliency Maturity Model project
  • The Financial Services Technology Consortium (FSTC) has begun work on a Resiliency Maturity Model
    • Taking inspiration from the Carnegie Mellon Software Engineering Institute’s Capability Maturity Model (CMM) and Networked Systems Survivability Program
    • Intent is to develop industry standard metrics to evaluate an institution’s business continuity, disaster recovery, and crisis management capabilities
long distance cluster issues
Long-distance Cluster Issues
  • Latency due to speed of light becomes significant at higher distances. Rules of thumb:
    • About 1 ms per 100 miles, one-way
    • About 1 ms per 50 miles round-trip latency
  • Actual circuit path length can be longer than highway mileage between sites
  • Latency can adversely affect performance of
    • Remote I/O operations
    • Remote locking operations
differentiate between latency and bandwidth
Differentiate between latency and bandwidth
  • Can’t get around the speed of light and its latency effects over long distances
    • Higher-bandwidth link doesn’t mean lower latency
san extension
SAN Extension
  • Fibre Channel distance over fiber is limited to about 100 kilometers
    • Shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers
  • Various vendors provide “SAN Extension” boxes to connect Fibre Channel SANs over an inter-site link
  • See SAN Design Reference Guide Vol. 4 “SAN extension and bridging”:
    • http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
disk data replication
Disk Data Replication
  • Data mirroring schemes
    • Synchronous
      • Slower, but no chance of data loss in conjunction with a site loss
    • Asynchronous
      • Faster, and works for longer distances
        • but can lose seconds’ or minutes’ worth of data (more under high loads) in a site disaster
continuous access synchronous replication
Continuous Access: Synchronous Replication

[Diagram, step 1: A node issues a write, which travels through its local FC switch to the EVA controller in charge of the mirrorset.]

continuous access synchronous replication1
Continuous Access: Synchronous Replication

[Diagram, step 2: The controller in charge of the mirrorset forwards the write across the inter-site link to the EVA at the other site.]

continuous access synchronous replication2
Continuous Access: Synchronous Replication

[Diagram, step 3: The remote EVA returns a success status to the controller in charge of the mirrorset.]

continuous access synchronous replication3
Continuous Access: Synchronous Replication

[Diagram, step 4: The controller in charge of the mirrorset returns a success status to the node that issued the write.]

continuous access synchronous replication4
Continuous Access: Synchronous Replication

[Diagram, step 5: Only now, after both copies have been written, does the application continue.]

continuous access asynchronous replication
Continuous Access: Asynchronous Replication

[Diagram, step 1: A node issues a write, which travels through its local FC switch to the EVA controller in charge of the mirrorset.]

continuous access asynchronous replication1
Continuous Access: Asynchronous Replication

[Diagram, step 2: The controller in charge of the mirrorset immediately returns a success status to the node, before the data has been replicated to the other site.]

continuous access asynchronous replication2
Continuous Access: Asynchronous Replication

[Diagram, step 3: The application continues as soon as the local success status arrives.]

continuous access asynchronous replication3
Continuous Access: Asynchronous Replication

[Diagram, step 4: The controller in charge of the mirrorset later forwards the write across the inter-site link to the EVA at the other site.]

synchronous versus asynchronous replication and link bandwidth

Synchronous versus Asynchronous Replication and Link Bandwidth

[Chart: Application write bandwidth in MB/sec plotted over a day (8 am, 12 noon, 5 pm, midnight), with levels marked for Synchronous replication (RPO = 0), Asynchronous replication with an RPO of at most 2 hours, and Asynchronous replication with an RPO of many hours: the tighter the RPO, the closer the required inter-site link bandwidth must be to the peak application write rate.]

data replication and long distances
Data Replication and Long Distances
  • Some vendors claim synchronous mirroring is impossible at a distance over 100 kilometers, 100 miles, or 200 miles, because their product cannot support synchronous mirroring over greater distances
  • OpenVMS Volume Shadowing does synchronous mirroring
    • Acceptable application performance is the only limit found so far on inter-site distance for HBVS
long distance synchronous host based mirroring software tests
Long-distance Synchronous Host-based Mirroring Software Tests
  • OpenVMS Host-Based Volume Shadowing (HBVS) software (host-based mirroring software)
  • SAN Extension used to extend SAN using FCIP boxes
  • AdTech box used to simulate distance via introduced packet latency
  • No OpenVMS Cluster involved across this distance (no OpenVMS node at the remote end; just “data vaulting” to a “distant” disk controller)
mitigating the impact of distance
Mitigating the Impact of Distance
  • Do transactions as much as possible in parallel rather than serially
    • May have to find and eliminate hidden serialization points in applications and system software
    • Have to avoid dependencies between parallel transactions in flight at one time
  • Minimize number of round trips between sites to complete a transaction
minimizing round trips between sites
Minimizing Round Trips Between Sites
  • Some vendors have Fibre Channel SCSI-3 protocol tricks to do writes in 1 round trip vs. 2
    • e.g. Brocade’s “FastWrite” or Cisco’s “Write Acceleration”
  • Application design can also affect number of round-trips required between sites
mitigating impact of inter site latency
Mitigating Impact of Inter-Site Latency
  • How applications are distributed across a multi-site OpenVMS cluster can affect performance
    • This represents a trade-off among performance, availability, and resource utilization
application scheme 1 hot primary cold standby
Application Scheme 1: Hot Primary/Cold Standby
  • All applications normally run at the primary site
    • Second site is idle, except for data replication work, until primary site fails, then it takes over processing
  • Performance will be good (all-local locking)
  • Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
  • Wastes computing capacity at the remote site
application scheme 2 hot hot but alternate workloads
Application Scheme 2: Hot/Hot but Alternate Workloads
  • All applications normally run at one site or the other, but not both; data is mirrored between sites, and the opposite site takes over upon a failure
  • Performance will be good (all-local locking)
  • Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
  • Second site’s computing capacity is actively used
application scheme 3 uniform workload across sites
Application Scheme 3: Uniform Workload Across Sites
  • All applications normally run at both sites simultaneously. (This would be considered the “norm” for most OpenVMS clusters.)
  • Surviving site takes all load upon failure
  • Performance may be impacted (some remote locking) if inter-site distance is large
  • “Fail-over” time will be excellent, and risk low (all systems are already in use running the same applications, thus constantly being tested)
  • Both sites’ computing capacity is actively used
work arounds being used today
Work-arounds being used today
  • Multi-hop replication
    • Synchronous to nearby site
    • Asynchronous to far-away site
  • Transaction-based replication
    • e.g. replicate transaction (a few hundred bytes) with Reliable Transaction Router instead of having to replicate all the database page updates (often 8 kilobytes or 64 kilobytes per page) and journal log file writes behind a database
promising areas for investigation
Promising areas for investigation

Parallelism as a potential solution

  • Rationale:
    • Adding 30 milliseconds to a typical transaction for a human may not be noticeable
      • Having to wait for many 30-millisecond transactions in front of yours slows things down

Applications in the future may have to be built with greater awareness of inter-site latency

  • Promising directions:
    • Allow many more transactions to be in flight in parallel
    • Each will take longer, but overall throughput (transaction rate) might be the same or even higher than now
testing simulation
Testing / Simulation
  • Before incurring the risk and expense of site selection, datacenter construction, and inter-site link procurement:
  • Test within a single-datacenter test environment, with distance simulated by introducing packet latency
  • A couple of vendors / products:
    • Shunra STORM Network Emulator
    • Spirent AdTech
contrast compare hbvs with ca
Contrast / Compare HBVS with CA

Implementation

  • HBVS:
    • HBVS is implemented in software at the host level
  • CA:
    • CA is implemented in software (firmware) at the controller subsystem level
contrast compare hbvs with ca1
Contrast / Compare HBVS with CA

Writing Data

  • HBVS:
    • Host writes data to multiple disk LUNs in parallel
  • CA:
    • Host writes data to a LUN on one controller; that controller talks to other controller(s) and sends data to them so they can write it to their disk(s)
host based volume shadowing2
Host-Based Volume Shadowing

[Diagram: Two sites, each with a node, an FC switch, and an EVA, joined by an SCS-capable interconnect between the nodes; the disks at the two sites form a single host-based shadowset.]

host based volume shadowing writes to all members
Host-Based Volume Shadowing: Writes to All Members

[Diagram: The host writes to the shadowset members at both sites in parallel, through the local and remote FC switches.]

continuous access
Continuous Access

[Diagram: Two sites joined by an inter-site SCS link between the nodes and an inter-site FC link between the FC switches; the EVAs at the two sites form a controller-based mirrorset.]

continuous access writes thru master primary side controller
Continuous Access: Writes thru Master/Primary-side Controller

[Diagram: Writes from the nodes are directed to the controller in charge of the mirrorset; that controller then writes the data across the inter-site link to the EVA at the other site.]

contrast compare hbvs with ca2
Contrast / Compare HBVS with CA

Reading Data

  • HBVS:
    • Host can read from any member (selection is based on Read_Cost criteria), so in a multi-site cluster all 2 or 3 sites can read from their local copy and not incur inter-site latency
  • CA:
    • Host must read a given LUN from a single controller at a time; selection of replication direction (and thus which controller to direct reads/writes to) is made at controller level
    • In multi-site configurations, reads will only be local from 1 site and remote from the other site(s)
      • Proxy reads are possible but discouraged for performance reasons; hosts might as well direct the I/O request to the controller in charge vs. asking another controller to relay the request
host based volume shadowing reads from local member
Host-Based Volume Shadowing: Reads from Local Member

[Diagram: Each node reads from the shadowset member at its own site, so reads do not cross the inter-site link.]

continuous access reads from one site are remote one local
Continuous Access: Reads from one site are Remote, one Local

[Diagram: All reads go through the controller in charge of the mirrorset, so the node at that controller's site reads locally while the node at the other site must read across the inter-site link.]

controller based data replication data access at remote site
Controller-based Data Replication: Data Access at Remote Site
  • All access to a disk unit is typically from one controller at a time
    • Failover involves controller commands
      • Manual, or scripted (but still take some time to perform)
    • Read-only access may be possible at remote site with some products; other controller may be able to redirect a request (“Proxy” I/O) to the primary controller in some products

[Diagram: One site holds the primary copy, with all access allowed; replication flows to the secondary copy at the other site, where no access is allowed.]

continuous access failover required on controller failure
Continuous Access: Failover Required on Controller Failure

[Diagram: When the controller in charge of the mirrorset fails, the nodes must switch to access the data through the controller at the other site.]

controller based data replication bi directional replication
Controller-Based Data Replication: “Bi-directional” Replication
  • CA can only replicate a given disk unit in one direction or the other
  • Some disk units may be replicated in one direction, and other disk units in the opposite direction at the same time; this may be called “bi-directional” replication even though it is just two “one-way” copies

[Diagram: One disk unit has its primary copy at Site A (handling transactions) replicating to a secondary at Site B, while another disk unit has its primary at Site B replicating to a secondary at Site A; each individual disk unit is still replicated in only one direction.]

host based volume shadowing writes in both directions at once
Host-Based Volume Shadowing: Writes in Both Directions at Once

[Diagram: Nodes at both sites write to the same shadowset simultaneously, with each write going to the members at both sites.]

contrast compare hbvs with ca3
Contrast / Compare HBVS with CA

Number of replicated copies:

  • HBVS:
    • Can support 2-member or 3-member shadowsets
      • Some customers have 3 sites with 1 shadowset member at each of the 3 sites, and can survive the destruction of any 2 sites and continue without data loss
  • CA:
    • Can support one remote copy (2-site protection)
contrast compare hbvs with ca4
Contrast / Compare HBVS with CA

Host CPU Overhead:

  • HBVS:
    • Extremely low CPU overhead (typically un-measurable)
      • Reads: Compare queue length + Read_Cost for each disk and select one disk to read from
      • Writes: Clone I/O Request Packet and send to 2 (or 3) disks
      • Copy/Merge operations: Shadow_Server process issues 127-block I/Os
      • No RAID-5 parity calculation
  • CA:
    • No additional host CPU overhead
      • Controller CPU overload may be of concern under some circumstances
contrast compare hbvs with ca5
Contrast / Compare HBVS with CA

Replication between different disk/controller types:

  • HBVS:
    • Can support shadowsets across different storage subsystems
      • Can be mix of technologies, e.g. local SCSI and Fibre Channel
      • Even supports shadowing different-sized disk units
  • CA:
    • Must replicate to a like storage controller
      • But CA-XP supports layering on top of other Fibre Channel storage and replicating it between XP boxes
contrast compare hbvs with ca6
Contrast / Compare HBVS with CA

OS Platform Support:

  • HBVS:
    • Protects OpenVMS data
      • Can serve shadowed data to other platforms with NFS, Advanced Server / Samba, HTTP, FTP, etc.
  • CA:
    • Supports multiple OS platforms
contrast compare hbvs with ca7
Contrast / Compare HBVS with CA

Type(s) of mirroring supported:

  • HBVS:
    • Does strictly synchronous mirroring
      • Advanced System Concepts, Inc. RemoteSHADOW product does asynchronous host-based replication
  • CA:
    • Can do either synchronous or asynchronous mirroring
why hasn t ca been more widely used in openvms cluster configurations
Why hasn’t CA been more widely used in OpenVMS Cluster configurations?
  • Failover Time:
    • Failover with HBVS is automatic, and can happen within seconds
    • Failover with CA tends to be manual, or scripted at best, resulting in longer downtime in conjunction with a disaster
  • Mirrorsets under Continuous Access can only be accessed through one controller or the other at a time
    • With HBVS, reads can be local to a site, which are faster with long inter-site distances, and the capacity of both controllers can be used at once with heavy read rates
  • Inter-site link costs:
    • HBVS can use MSCP-serving over a LAN (or MC) link
    • CA requires an inter-site SAN link, maybe SAN Extension
  • License cost
    • HBVS licenses are often less expensive than CA
when is ca most popular
When is CA most popular?
  • Cases where the Recovery Point Objective (RPO, measuring allowable data loss) is zero or low, but where the Recovery Time Objective (RTO, measuring allowable downtime) is fairly large (for example, a day or more). In this case, a Remote Vaulting solution (constantly or periodically copying data to a remote site) may be all that is needed, rather than a full-blown multi-site disaster-tolerant cluster.
when is ca most popular1
When is CA most popular?
  • An OpenVMS environment where the distance between datacenter sites is on the order of 1,000 miles or more, and where the business need is Disaster Recovery rather than Disaster Tolerance, so that asynchronous replication is required for acceptable performance. This assumes a non-zero RPO requirement, allowing for some amount of data loss in connection with a disaster.
when is ca most popular2
When is CA most popular?
  • Customer sites with a variety of different operating-system (OS) platforms, where OpenVMS plays only a minor role. Since Continuous Access may already be the chosen data-replication solution for all of the other OS platforms, it may be simpler to use Continuous Access for OpenVMS as well and follow the same recovery/failover procedures, rather than setting up and maintaining a separate HBVS-based solution for only the minor OpenVMS piece.
when is ca most popular3
When is CA most popular?
  • Customers who need both the zero RPO of synchronous replication and the added protection of long inter-site distances. These customers may use a “Multi-Hop” configuration.
slide283

Data Replication over Long Distances: Multi-Hop Replication

  • It may be desirable to synchronously replicate data to a nearby “short-haul” site, and asynchronously replicate from there to a more-distant site
    • This is sometimes called “cascaded” data replication

[Diagram: The primary site replicates synchronously over a short-haul link (about 100 miles) to a secondary site, which in turn replicates asynchronously over a long-haul link (about 1,000 miles) to a tertiary site.]

openvms disaster proof configuration application
OpenVMS Disaster-Proof configuration & application

[Diagram: A quorum node (Integrity rx2620), the Integrity Superdome “SDboom,” and the Alpha ES40 “Kaboom” each send a stream of I/Os to a shadowset spanning an XP24000 and an XP12000. All I/Os need to complete to all spindles before an I/O is considered done. When a spindle drops out, the shadowset is re-formed with the remaining members, and I/Os “in flight” wait for the shadowset to be re-formed. The longest outstanding request for an I/O during the DP demo was 13.71 seconds.]

openvms system parameter settings for the disaster proof demonstration
OpenVMS System Parameter Settings for the Disaster Proof Demonstration
  • SHADOW_MBR_TMO lowered from default of 120 down to 8 seconds
  • RECNXINTERVAL lowered from default of 20 down to 10 seconds
  • TIMVCFAIL lowered from default of 1600 to 400 (4 seconds, in 10-millisecond clock units) to detect node failure in 4 seconds, worst-case, (detecting failure at the SYSAP level)
  • LAN_FLAGS bit 12 set to enable Fast LAN Transmit Timeout (give up on a failed packet transmit in 1.25 seconds, worst case, instead of an order of magnitude more in some cases)
  • PE4 set to hexadecimal 0703 (Hello transmit interval of 0.7 seconds, nominal; Listen Timeout of 3 seconds), to detect node failure in 3-4 seconds at the PEDRIVER level
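A hedged sketch of how these settings might be expressed in MODPARAMS.DAT (the values are simply those listed above; whether the demo set them this way or interactively via SYSGEN is not stated here):

SHADOW_MBR_TMO = 8      ! remove a failed shadowset member after 8 seconds
RECNXINTERVAL = 10      ! reconnection interval of 10 seconds
TIMVCFAIL = 400         ! detect node failure in 4 seconds (units of 10 ms)
LAN_FLAGS = %X1000      ! bit 12: enable Fast LAN Transmit Timeout
PE4 = %X0703            ! Hello interval 0.7 sec; Listen Timeout 3 sec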
disaster proof demo timeline
Disaster Proof Demo Timeline
  • Time = 0: Explosion occurs
  • Time around 3.5 seconds: Node failure detected, via either PEDRIVER Hello Listen Timeout or TIMVCFAIL mechanism. VC closed; Reconnection Interval starts.
  • Time = 8 seconds: Shadow Member Timeout expires; shadowset members removed.
  • Time around 13.5 seconds: Reconnection Interval expires; State Transition begins.
  • Time = 13.71 seconds: Recovery complete; Application processing resumes.
disaster proof demo timeline1
Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; failure detected via PEDRIVER Hello Listen Timeout or TIMVCFAIL timeout at T = about 3.5 seconds (the Failure Detection Time).]

disaster proof demo timeline2
Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; Shadow Member Timeout expires and the failed shadowset members are removed at T = 8 seconds.]

disaster proof demo timeline3
Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; failure detected via PEDRIVER Hello Listen Timeout or TIMVCFAIL timeout at T = about 3.5 seconds; the Reconnection Interval then runs until T = about 13.5 seconds, when the cluster State Transition begins.]

disaster proof demo timeline4
Disaster Proof Demo Timeline

[Timeline: Explosion at T = 0; the cluster State Transition begins at T = about 13.5 seconds (node removed from cluster, lock database rebuilt); recovery is complete and the application resumes at T = 13.71 seconds.]

real life example credit lyonnais paris
Real-Life Example:Credit Lyonnais, Paris
  • Credit Lyonnais fire in May 1996
  • OpenVMS multi-site cluster with data replication between sites (Volume Shadowing) saved the data
  • Fire occurred over a weekend, and DR site plus quick procurement of replacement hardware allowed bank to reopen on Monday

Source: Metropole Paris

slide300

“In any disaster, the key is to protect the data. If you lose your CPUs, you can replace them. If you lose your network, you can rebuild it. If you lose your data, you are down for several months. In the capital markets, that means you are dead. During the fire at our headquarters, the DIGITAL VMS Clusters were very effective at protecting the data.”

Patrick Hummel, IT Director, Capital Markets Division, Credit Lyonnais

slide301

Headquarters for Manhattan's Municipal Credit Union (MCU) were across the street from the World Trade Center, and were devastated on Sept. 11. “It took several days to salvage critical data from hard-drive arrays and back-up tapes and bring the system back up” ... “During those first few chaotic days after Sept. 11, MCU allowed customers to withdraw cash from its ATMs, even when account balances could not be verified. Unfortunately, up to 4,000 people fraudulently withdrew about $15 million.”

Ann Silverthorn, Network World Fusion, 10/07/2002

http://www.nwfusion.com/research/2002/1007feat2.html

real life examples commerzbank on 9 11
Real-Life Examples: Commerzbank on 9/11
  • Datacenter near WTC towers
  • Generators took over after power failure, but dust & debris eventually caused A/C units to fail
  • Data replicated to remote site 30 miles away
  • One AlphaServer continued to run despite 104° F temperatures, running off the copy of the data at the opposite site after the local disk drives had succumbed to the heat
  • See http://h71000.www7.hp.com/openvms/brochures/commerzbank/
slide303

“Because of the intense heat in our datacenter, all systems crashed except for our AlphaServer GS160... OpenVMS wide-area clustering and volume-shadowing technology kept our primary system running off the drives at our remote site 30 miles away.”

Werner Boensch, Executive Vice President, Commerzbank, North America

real life examples of openvms international securities exchange
Real-Life Examples of OpenVMS: International Securities Exchange
  • All-electronic stock derivatives (options) exchange
  • First new stock exchange in the US in 26 years
  • Went from nothing to majority market share in 3 years
  • OpenVMS Disaster-Tolerant Cluster at the core, surrounded by other OpenVMS systems
  • See http://h71000.www7.hp.com/openvms/brochures/ise/
slide305

“OpenVMS is a proven product that’s been battle tested in the field. That’s why we were extremely confident in building the technology architecture of the ISE on OpenVMS AlphaServer systems.”

Danny Friel, Sr. Vice President, Technology / Chief Information Officer, International Securities Exchange

slide306

“ We just had a disaster at one of our 3 sites 4 hours ago. Both the site\'s 2 nodes and 78 shadow members dropped when outside contractors killed all power to the computer room during maintenance. Fortunately the mirrored site 8 miles away and a third quorum site in another direction kept the cluster up after a minute of cluster state transition.”

Lee Mah, Capital Health Authority

writing in comp.os.vms, Aug. 20, 2004

slide307

“I have lost an entire data center due to a combination of a faulty UPS combined with a car vs. powerpole, and again when we needed to do major power maintenance. Both times, the remaining half of the cluster kept us going.”

Ed Wilts, Merrill Corporation

writing in comp.os.vms, July 22, 2005

business continuity not just it
Business Continuity: Not Just IT
  • The goal of Business Continuity is the ability for the entire business, not just IT, to continue operating despite a disaster.
  • Not just computers and data:
    • People
    • Facilities
    • Communications: Data networks and voice
    • Transportation
    • Supply chain, distribution channels
    • etc.
business continuity resources
Business Continuity Resources
  • Disaster Recovery Journal:
    • http://www.drj.com/
  • Continuity Insights Magazine:
    • http://www.continuityinsights.com//
  • Contingency Planning & Management Magazine
    • http://www.contingencyplanning.com/
  • All are high-quality journals. The first two are available free to qualified subscribers
  • All hold conferences as well
multi os disaster tolerant reference architectures whitepaper
Multi-OS Disaster-Tolerant Reference Architectures Whitepaper
  • Entitled “Delivering high availability and disaster tolerance in a multi-operating-system HP Integrity server environment”
  • Describes DT configurations across all of HP’s platforms: HP-UX, OpenVMS, Linux, Windows, and NonStop
  • http://h71028.www7.hp.com/ERC/downloads/4AA0-6737ENW.pdf
tabb research report
Tabb Research Report
  • "Crisis in Continuity: Financial Markets Firms Tackle the 100 km Question"
    • available from https://h30046.www3.hp.com/campaigns/2005/promo/wwfsi/index.php?mcc=landing_page&jumpid=ex_R2548_promo/fsipaper_mcc%7Clanding_page
draft interagency white paper
Draft Interagency White Paper
  • "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System“
    • http://www.sec.gov/news/studies/34-47638.htm
  • Agencies involved:
    • Federal Reserve System
    • Department of the Treasury
    • Securities & Exchange Commission (SEC)
  • Applies to:
    • Financial institutions critical to the US economy

    • But many other agencies around the world are adopting similar rules
business continuity and disaster tolerance services from hp
Business Continuity and Disaster Tolerance Services from HP

Web resources:

  • BC Services:
    • http://h20219.www2.hp.com/services/cache/10107-0-0-225-121.aspx
  • DT Services:
    • http://h20219.www2.hp.com/services/cache/10597-0-0-225-121.aspx
openvms disaster tolerant cluster resources
OpenVMS Disaster-Tolerant Cluster Resources
  • OpenVMS Documentation at OpenVMS website:
    • OpenVMS Cluster Systems
    • HP Volume Shadowing for OpenVMS
    • Guidelines for OpenVMS Cluster Configurations
  • OpenVMS High-Availability and Disaster-Tolerant Cluster information at the HP corporate website: http://h71000.www7.hp.com/availability/index.html and http://h18002.www1.hp.com/alphaserver/ad/disastertolerance.html
  • More-detailed seminar and workshop notes at http://www2.openvms.org/kparris/ and http://www.geocities.com/keithparris/
  • Book “VAXcluster Principles” by Roy G. Davis, Digital Press, 1993, ISBN 1-55558-112-9
disaster proof demonstration
Disaster-Proof Demonstration
  • HP proves disaster tolerance works by blowing up a datacenter’s worth of hardware
  • See video at http://hp.com/go/DisasterProof/
  • For the earlier Bulletproof XP video, see:
  • http://h71028.www7.hp.com/ERC/cache/320954-0-0-0-121.html?ERL=true
slide318
T4
  • T4 collects (and TLviz displays) performance data. To determine the inter-site bandwidth used, you can measure the write data rate with T4’s Fibre Channel Monitor (FCM) data collector, since writes translate into remote shadowset writes. http://h71000.www7.hp.com/openvms/products/t4/index.html
    • But typically, bandwidth required to do a Shadowing Full-Copy operation in a reasonable amount of time swamps the peak application write rate and will be the major factor in inter-site link bandwidth selection
availability manager
Availability Manager
  • Availability Manager:
  • http://h71000.www7.hp.com/openvms/products/availman/index.html
    • Useful for performance monitoring and can also do a quorum adjustment to recover quorum
speaker contact info
Speaker Contact Info:

