
Disaster Recovery 2.0


Presentation Transcript


  1. Disaster Recovery 2.0 A paradigm shift in DR Architecture. VCAP-DCD, VCP5 • Iwan ‘e1’ Rahabok • Staff SE, Strategic Accounts • +65 9119-9226 | e1@vmware.com | virtual-red-dot.blogspot.com | sg.linkedin.com/in/e1ang

  2. Business Requirements • Protect the Business in the event of Disaster. • It is similar to Insurance: it is no longer acceptable to run a business without DR protection. • The question is now about how we cut the DR cost & complexity (people cost, technology cost, etc.)

  3. Disaster did strike in Singapore • 29 June 2004: Electricity Supply Interruption. • More than 300,000 homes were left in the dark; about 30% of Singapore was affected. • What if both your Prod and DR datacenters were in that 30%? • Caused by the disruption of natural gas supply from West Natuna, Indonesia. A valve at the gas receiving station operated by ConocoPhillips tripped. The loss of natural gas supply caused 5 units of the combined-cycle gas turbines (CCGT) at Tuas Power Station, Power Seraya Power Station and SembCorp Cogen to trip. • Some of the CCGTs could not switch to diesel successfully. Investigation into the incident is in progress. • Other Similar Incidents: the first disruption in natural gas supply occurred on 5 Aug 2002, when a valve in the gas receiving station tripped, leading to a power blackout.

  4. Disaster Recovery (DR) >< Disaster Avoidance (DA) • DA requires that the Disaster must be avoidable. • DA implies that there is Time to respond to an impending Disaster. The time window must be large enough to evacuate all necessary systems. • Once avoided, for all practical purposes, there is no more disaster: no recovery required, no panic & chaos. • DA is about Preventing (no downtime). DR is about Recovering (already down). • Two opposite contexts. It is insufficient to have DA only; DA does not protect the business when Disaster strikes. Get DR in place first, then DA.

  5. DR Context: It’s a Disaster, so… • It might strike when we’re not ready. E.g. the IT team is having an offsite meeting and the next flight is 8 hours away, or key technical personnel are not around (e.g. sick or on holiday). • We can’t assume Production is up. There might be nothing for us to evacuate or migrate to the DR site. • Even if the servers are up, we might not be able to access them (e.g. the network is down). • Even if it’s up, we can’t assume we have time to gracefully shut down or migrate. Shutting down multi-tier apps is complex and takes time when you have 100s of them. • We can’t assume certain systems will not be affected. • A DR Exercise should involve the entire datacenter. Assume the worst, and start from that point.

  6. Singapore MAS Guidelines • MAS is very clear that DR means the Disaster has happened and there is an outage. • Clause 8.3.3 states the Total Site should be tested. So if you are not doing an entire-DC test, you are not in compliance.

  7. DR: Assumptions • A company-wide DR Solution shall assume: • Production is down or not accessible. The entire datacenter, not just some systems. • Key personnel are not available: storage admin, network admin, AD admin, VMware admin, DBA, security, Windows admin, RHEL admin, etc. Intelligence should be built into the system to eliminate reliance on human experts. • Manual Run Books are not 100% up to date. Manual documents (Word, Excel, etc.) covering every step to recover an entire datacenter are prone to human error: they contain thousands of steps, written by multiple authors. Automation & virtualisation reduce this risk.

  8. DR Principles • To Business Users, the actual DR experience must be identical to the Dry Run they have experienced. In a panic or chaotic situation, users should deal with something they are trained on. This means the Dry Run has to simulate Production (without shutting down Production). • Dry Runs must be done regularly. This ensures: new employees are covered; existing employees do not forget; the procedures are not outdated (hence incorrect or damaging). Annual is too long a gap, especially if many users or departments are involved. • The DR System must be a replica of the Production System. Testing with a system that is not identical to production renders the Dry Run invalid. Manually maintaining 2 copies of 100s of servers, network, storage and security settings is a classic example of an invalid Dry Run, as the DR System is not the Production system. • System >< Datacenter. Normally, the DR DC is smaller. System here means a collection of servers, storage, network and security that make up “an application from the business point of view”.

  9. Datacenter-wide DR Solution: Technical Requirements • Fully Automated • Eliminate reliance on many key personnel. • Eliminate outdated (hence misleading) manual runbooks. • Enable frequent Dry Runs, with 0 impact to Production. • Production must not be shut down, as this impacts the business. Once you shut down production, it is no longer a Dry Run. An Actual Run is great, but it is not practical, as the Business will not allow the entire datacenter to go down regularly just for IT to test infrastructure. • No clashing with Production hostnames and IP addresses. • If Production is not impacted, then users can take time to test DR. No need to finish within a certain time window anymore. • Scalable to the entire datacenter: 1000s of servers. • Cover all aspects of the infrastructure, not just server + storage. Network, Security and Backup have to be included so the entire datacenter can be failed over automatically.

  10. DR 1.0 architecture (current thinking) • Typical DR 1.0 solution (at infrastructure layer) has the following properties:

  11. DR 1.0 architecture: Limitations • Technically, it is not even a DR solution. We do not recover the Production System; we merely mount production Data on a different System. The only way for the System to be recovered is to do SAN boot on the DR Site. • Can’t prove to audit that DR = Production. Registry changes, config changes, etc. are hard to track at the OS and Application level. • Manual mapping of data drives to the associated servers on the DR site. Not a scalable solution, as manual updates don’t scale well to 1000s of servers. • Heavy on scripting, which is not tested regularly. • DR Testing relies heavily on IT expertise.

  12. DR Requirements: Summary

  13. R01: DR Copy = Production Copy • Solution: replicate System + Data, not just the data drive (LUN). OS, Apps, settings, etc. • Implication of the solution: • If the Production network is not stretched, the server will be unreachable. Changing the IP will break the Application. • If the Production network is stretched, the IP Address and Hostname will conflict with Production. Changing the Hostname will definitely break the Application. • A stretched L2 network is not a full solution. Entire LAN isolation is the solution. • Solution: the entire Dry Run network must be isolated (a bubble network). • No conflict with Production, as it’s actually identical. It’s a shadow of the Production LAN. • All network services (AD, DNS, DHCP, Proxy) must exist in the Shadow Prod LAN. • Implication of the solution: • For VMs, this is easily done via vSphere and SRM. • Physical Servers need to be connected to the Dry Run LAN. A permanent connection simplifies things and eliminates the risk of an accidental update to production.
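The bubble-network idea can be illustrated with a small sketch: each production port group is mapped to an isolated "shadow" equivalent for the dry run, so recovered servers keep their IPs and hostnames but can never reach the real production LAN. All network names here are illustrative, not from the deck; SRM performs this mapping natively for VMs.

```python
# Hypothetical production-to-shadow port-group mapping for a Dry Run.
# The shadow networks are isolated (bubble) copies of the production LANs.
PROD_TO_SHADOW = {
    "Prod-LAN-10.10.10.x": "Shadow-Prod-LAN-10.10.10.x",
    "Prod-DMZ":            "Shadow-Prod-DMZ",
}

def dry_run_networks(vm_networks):
    """Return the isolated networks a recovered VM attaches to during a Dry Run."""
    return [PROD_TO_SHADOW[n] for n in vm_networks]

print(dry_run_networks(["Prod-LAN-10.10.10.x"]))
# ['Shadow-Prod-LAN-10.10.10.x']
```

The point of the mapping is that the recovered system is bit-for-bit identical (R01) while the isolation guarantees zero production impact (R03).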

  14. R02: Identical User Experience • VDI is a natural companion to DR, as it makes the “front-end” experience seamless. Users use a Virtual Desktop as their day-to-day desktop, and VDI enables us to DR the desktop too. There are Production desktop pools (desktop.abc.com) and on-demand DR Test desktop pools (desktop-dr.abc.com). • During a Dry Run: users connect to desktop.abc.com for production and desktop-dr.abc.com for the Dry Run. Having 2 desktops means the environments are completely isolated. • During an actual Disaster: desktop-dr.abc.com is renamed to desktop.abc.com, as the original desktop.abc.com is down (affected by the same Disaster). Users connect to desktop.abc.com, just as they do in their day-to-day work, hence creating an identical experience.
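The desktop-pool rename described above is, at its core, a DNS flip. A minimal sketch, with the zone reduced to a plain dict and hostnames/IPs invented for illustration (in practice this is a record update on the DNS service in the recovered environment):

```python
# Toy DNS zone: names and addresses are illustrative.
zone = {
    "desktop.abc.com":    "10.10.10.50",    # production VDI (down in a disaster)
    "desktop-dr.abc.com": "192.168.10.50",  # DR-site VDI pool
}

def fail_over_desktop(zone):
    """Point the day-to-day name at the DR pool, as in an actual disaster.

    Users keep typing the same name they are trained on (Requirement R02)."""
    zone["desktop.abc.com"] = zone["desktop-dr.abc.com"]
    return zone

fail_over_desktop(zone)
print(zone["desktop.abc.com"])
# 192.168.10.50
```

During a Dry Run the flip is *not* performed, which is exactly why the two names must both exist: same desktops, isolated environments.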

  15. R03: No impact on Production during Dry Run • To achieve the above, the DR Solution: • Cannot require Production to be shut down or stopped. It must be Business as Usual. • Must be an independent, full copy altogether, with no reliance on Production components: network, security, AD, DNS, Load Balancer, etc.

  16. R04: Frequent Dry Run • To achieve the above, the DR Solution cannot: • Be laborious or prone to human error. A fully automated solution addresses this. • Touch the production system or network. So it has to be an isolated environment. A Shadow Production LAN solves this. • VMware SRM provides the automation component for VMs. You should have full confidence that the Actual Fail Over will work. This can only be achieved if you can do frequent Dry Runs.

  17. Solution: Dealing with Physical Servers • [Diagram: Singapore (Prod Site) and Singapore (DR Site), with a Shadow Production LAN on the DR side. Each production server has an identical twin in the Shadow Production LAN: CRM-Web-Server.vmware.com (10.10.10.10), CRM-App-Server.vmware.com (10.10.10.20), CRM-DB-Server.vmware.com (10.10.10.30). A separate CRM-DB-Server-Test.vmware.com (20.20.20.30) sits outside the shadow LAN.]

  18. Physical Servers: Dual-boot option • The Physical Server must be dual-boot (OS): • Normal Operation: Test/Dev environment (default boot). • Dry Run or DR: Shadow Production network. • [Diagram: Shadow Production LAN (10.10.10.x) and the Datacenter 2 LAN (20.20.20.x), bridged by a Jump Box VM. Without a Jump Box, we cannot access the Shadow Production LAN during a Dry Run. It runs on an ESXi host connected to both LANs.]

  19. Physical Servers: Dual-partition option • 1 physical box with a DR Partition and a Test/Dev Partition. • [Diagram: the same Jump Box arrangement as the previous slide, bridging the Shadow Production LAN (10.10.10.x) and the Datacenter 2 LAN (20.20.20.x).]

  20. Typical Physical Network: it’s 1 network • ABC Corp operates in many countries in Asia, with Singapore being the HQ. A system may consist of servers from more than 1 country. • DNS service for Windows is provided by MS AD; DNS service for non-Windows is provided by non-AD DNS. • Users (from any country) can access any server (physical or virtual) in any country, as basically there is only 1 “network”. There is routing to connect the various LANs. • In 1 “network”, we can’t have 2 machines with the same hostname or the same IP. Each LAN has its own network address. Hence a change of IP address is required when moving from the Prod Site to the DR Site. • [Diagram: Singapore (Prod Site), Country X (any site) and Singapore (DR Site), each with Production PMs and VMs, AD/DNS and non-AD DNS, plus a Users Site, all on one routed Production Network.]
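Because each LAN has its own network address, a failed-over server needs a new IP. One simple way to express the translation is to preserve the host's offset within the subnet; the sketch below does this with the standard `ipaddress` module. The subnets match those used in the deck's diagrams, but treating re-IP as a pure offset mapping is a simplifying assumption (real re-IP is usually handled by recovery tooling or scripts, and must also update DNS).

```python
import ipaddress

# Illustrative subnets from the diagrams: production vs DR site.
PROD = ipaddress.ip_network("10.10.10.0/24")
DR   = ipaddress.ip_network("192.168.10.0/24")

def re_ip(addr: str) -> str:
    """Translate a production address to the same host offset in the DR subnet."""
    offset = int(ipaddress.ip_address(addr)) - int(PROD.network_address)
    return str(ipaddress.ip_address(int(DR.network_address) + offset))

print(re_ip("10.10.10.30"))
# 192.168.10.30
```

Note that this is exactly the step a Shadow Production LAN avoids during a Dry Run: inside the bubble, servers keep their original 10.10.10.x addresses.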

  21. Site 2 needs to have 2 distinct Networks • [Diagram: Shadow Production LAN (10.10.10.x) hosting the DR Server, and the Datacenter 2 LAN (20.20.20.x) hosting the Test/Dev Server, bridged by a Jump Box VM on an ESXi host connected to both LANs.]

  22. Mode: Normal Operation or During Dry Run • [Diagram: Site 1 with the Production LAN (10.10.10.x); a Users Site with the Desktop LAN (30.30.30.x); Site 2 with the isolated Shadow Production LAN (10.10.10.x) and the Non-Prod LAN (20.20.20.x), connected via the jump box.]

  23. Mode: Partial DR • [Diagram: Site 1 Production LAN (10.10.10.x), Site 2 Non-Prod LAN (20.20.20.x), and the Users Site Desktop LAN (30.30.30.x).]

  24. Summary: DR 2.0 and 1.0

  25. Pre-Failover • The User (10.30.30.30) sends a DNS Query for www.abc.com to the Global DNS Load Balancer. • DNS Response: Virtual IP 1 (10.10.10.10), on the Prod Site (10.10.10.0/24). • HTTP GET to 10.10.10.10. • Source NAT at the site load balancer: source IP changed from 10.30.30.30 to 10.20.20.20. • Load balance: the VIP is mapped to the server IP, 10.10.10.10 => 10.20.20.31 (Production PMs and VMs on 10.20.20.0/24). • The DR Site (192.168.10.0/24) holds VIP 2.

  26. Post-Failover • The User (10.30.30.30) sends a DNS Query for www.abc.com to the Global DNS Load Balancer. • DNS Response: Virtual IP 2 (192.168.10.10), on the DR Site (192.168.10.0/24). • HTTP GET to 192.168.10.10. • Source NAT at the site load balancer: source IP changed from 10.30.30.30 to 10.20.20.20. • Load balance: the VIP is mapped to the server IP, 192.168.10.10 => 10.20.20.31 (recovered Production PMs and VMs on 10.20.20.0/24).

  27. DR Dry Run • The User (10.30.30.30) sends a DNS Query for www-dr-test.abc.com to the Global DNS Load Balancer. • DNS Response: Virtual IP 2 (192.168.10.10), on the DR Site. • HTTP GET to 192.168.10.10. • Source NAT: source IP changed from 10.30.30.30 to 10.20.20.20. • Load balance: the VIP is mapped to the server IP, 192.168.10.10 => 10.20.20.31 (DR Test PMs and VMs on 10.20.20.0/24). • Production PMs and VMs keep running on the Prod Site.
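The three resolution flows above (pre-failover, post-failover, dry run) can be condensed into one sketch: a toy resolver standing in for the Global DNS Load Balancer. The VIPs and names are taken from the slides; reducing the GSLB decision to a site-health boolean is an assumption for illustration.

```python
# VIPs from the slides: VIP 1 on the Prod Site, VIP 2 on the DR Site.
VIPS = {"prod": "10.10.10.10", "dr": "192.168.10.10"}

def gslb_resolve(name: str, prod_site_up: bool) -> str:
    """Toy Global DNS Load Balancer: pick a VIP by queried name and site health."""
    if name == "www-dr-test.abc.com":
        return VIPS["dr"]                              # Dry Run: always the DR VIP
    if name == "www.abc.com":
        return VIPS["prod"] if prod_site_up else VIPS["dr"]
    raise KeyError(name)

print(gslb_resolve("www.abc.com", True))              # pre-failover  -> 10.10.10.10
print(gslb_resolve("www.abc.com", False))             # post-failover -> 192.168.10.10
print(gslb_resolve("www-dr-test.abc.com", True))      # dry run       -> 192.168.10.10
```

The separate www-dr-test name is what lets users exercise the full DR path while www.abc.com continues to serve production untouched.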

  28. Making it Work • Strict enforcement to have external users use the VIP. • Strict enforcement to have peer vApp stacks use the VIP. • The DNS failover setting at the global site load-balancer would have to be manual: a network admin is needed to update www.abc.com on the global site load-balancer to reflect the VIP at the secondary DC. • Server load-balancer use is only applicable for serving specific applications. Application support with load-balancers is vendor dependent, with varying depth of app support. • Applications will need to support source NAT. Some applications have known issues when used in conjunction with NAT (e.g. FTP); however, server load-balancers may be able to mitigate these issues. Vendor dependent. • Not running a stretched VLAN means VMs with strong systemic dependencies must be placed on the same site, possibly as a vApp. Communication between VMs across sites can only be done using a VIP, where a specific function and pool of VMs must have already been configured.

  29. DA • From the view of DR

  30. DA & DR in a virtual environment • DR and DA solutions do not fit well together in vSphere 5: • DA requires 1 vCenter, because DA needs long-distance migration, which doesn’t work across 2 vCenters. • DR requires 2 vCenters, because vCenter prevents the same VM from appearing twice in the same vCenter, and we can’t assume the vCenter on the main site is recoverable. • There is confusion around DR + DA. You cannot have DA + DR on the same “system”. You need 3 instances: 1 primary, 1 secondary for DR purposes, 1 secondary for DA purposes. • The next slides explain the limitations of some DA solutions for the DR use case. This is not to criticise the DA solutions; they are good solutions for the DA use case.

  31. DA Solution: Stretched Cluster (+ Long-Distance vMotion) • When actual DR strikes… • We can’t assume Production is up, hence vMotion is not a solution. • HA will kick in and boot all VMs. Boot order will not be honoured. • Challenge of the above solution: how do we Test? • A DR Solution must be tested regularly, as per Requirement R04, and the test must be identical from the user’s point of view, as per Requirement R02. • So the test will have to be like this: cut replication, then mount the LUNs, then add the VMs into VC, then boot the VMs. • But we cannot mount the LUNs on the same vCenter, as they have the same signature! Even if we could, we would need to know the exact placement of each VM (which is complex), and we still cannot boot 2 copies of a VM on the same vCenter! This means the Production VMs must be down, which fails Requirement R03. • Conclusion: Stretched Cluster does not even qualify as a DR Solution, as it can’t be tested and it’s 100% manual.

  32. DA Solution: 2 Clusters in 1 VC (+ Long-Distance vMotion) • This is a variant of Stretched Cluster. It fixes the risk & complexity of Stretched Cluster, and there is no performance impact from uncontrolled long-distance vMotion. • When actual DR strikes… • We can’t assume Production is up, hence vMotion is not a solution. • HA will not even kick in, as it’s a separate cluster. In fact, the VMs will be in an error state, appearing italicised in vCenter. • Challenge of the above solution: how do we Test? All the issues facing Stretched Cluster apply. • Conclusion: 2-Cluster is inferior to Stretched Cluster from a DR point of view.

  33. Stretched Datacenter: View from the Network • Stretching a network across 2 physical datacenters (say 40 km apart) is complex even with no virtualisation and all physical servers. • A lot of VMware folks don’t appreciate the complexity & implications (design, operational, performance, upgrade, troubleshooting) when a network is stretched across 2 physical datacenters.

  34. Active/Active or Active/Passive • Which one makes sense?

  35. Background • Active/Active Datacenter has many levels of definition: • Both DCs are actively running workloads, so neither is idle. This means Site 2 can be running non-Production workloads, like Test/Dev and DR. • Both DCs are actively running Production workloads. Building on the previous level, this means Site 2 must run Production workloads. • Both DCs are actively running Production workloads, with application-level failover. Building on the previous level, the same App runs on both sides, but the instance on Site 2 is not serving users; it is waiting for an application-level failover. This is typically done via a geo-cluster solution. • Both DCs are actively running Production workloads, with A/A application-level operation. Both Apps are running, normally behind a global Load Balancer. No need to fail over, as each App is “complete”: it has the full data, and it does not need to tell the other App when its data is updated. No transaction-level integrity required. This is the ideal, but most apps cannot do this, as the data cannot be split; you can only have 1 copy of the data. • In the vSphere context, Active/Active vSphere means both vSphere sites are actively running Production VMs.

  36. A closer look at Active/Active • [Diagram: two Active/Active vCenters, each with Prod Clusters (250 Prod VMs) and T/D Clusters (500 Test/Dev VMs), with lots of traffic between Prod and Prod and between T/D and T/D. Compared with an Active/Passive pair: one vCenter with Prod Clusters (500 Prod VMs), the other with T/D Clusters (1000 Test/Dev VMs).]

  37. MAS TRM Guideline • It states “near” 0, not 0. It states “should”, not “must”. It states “critical”, not all systems. • So A/A is only for a subset. This points to an Application-level solution, not an Infrastructure-level one. We can add this capability without changing the architecture, as shown on the next slide.

  38. Adding Active/Active to a mostly Active/Passive vSphere • [Diagram: the Active/Passive pair from before (one vCenter with Prod Clusters and 500 Prod VMs, the other with T/D Clusters and 1000 Test/Dev VMs), with Global LBs added at both sites and 1 Cluster of 50 VMs running Active/Active behind them.]

  39. Thank You
