
DM235 Building a Robust Business Continuation Plan

Jit Biswas, Systems Director, Prudential (jit.biswas@prudential.com)


Presentation Transcript


  1. DM235 Building a Robust Business Continuation Plan • Jit Biswas • Systems Director • Prudential • jit.biswas@prudential.com

  2. Agenda • Preparing a Business Continuation Plan • Topologies for High Availability • Backups • Clusters • Warm Standby • Peer to Peer Replication

  3. A Word about Prudential • Prudential Market Strengths • $26.9 billion in revenue • $195.9 billion in statutory assets • $1 trillion+ of coverage in force for individual and group customers • Prudential is a large, widely distributed, U.S.-based company with an international presence • Over 30 million customers • 2,300 firms • Approximately 65,000 employees

  4. Prudential’s Technology Strengths • Sybase is Prudential’s database standard • $1 billion+ annual technology budget • Prudential.com gets 600,000+ hits monthly • Almost 5,000 IT employees

  5. Preparing the Business Continuation Plan • Remember, a disaster plan is never a fixed, finished document; it evolves. • Be systematic in your plan: don't try to outguess Nature and plan for a flood, a hurricane, a fire, etc. • Appoint a second in command in case the primary contact is injured or unavailable.

  6. How to start planning? • Common elements in any disaster: loss of information, loss of access to information and facilities, loss of people. • Make a matrix with these three as the columns and each of your activities as a row (include functions such as "accounts receivable" and "payroll," depending on your situation). • Then work out how you would respond to loss of information, access, and/or personnel for each function.
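
The matrix described on slide 6 could be laid out along the following lines. This is only a minimal sketch: the function names and the sample response text are hypothetical and are not part of the presentation.

```python
# Loss categories named on the slide become the columns of the matrix.
LOSS_TYPES = ("information", "access", "people")


def build_matrix(activities):
    """Return {activity: {loss_type: None}} with one empty cell per pair."""
    return {activity: {loss: None for loss in LOSS_TYPES} for activity in activities}


def unplanned_cells(matrix):
    """List (activity, loss_type) pairs that still have no documented response."""
    return [
        (activity, loss)
        for activity, row in matrix.items()
        for loss, response in row.items()
        if response is None
    ]


# Rows are business activities; the response text is a hypothetical example.
matrix = build_matrix(["accounts receivable", "payroll"])
matrix["payroll"]["information"] = "restore payroll database from off-site backup"

for activity, loss in unplanned_cells(matrix):
    print(f"No response planned yet: {activity} / loss of {loss}")
```

Listing the empty cells makes it easy to see which function and loss combinations still need a planned response.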

  7. To Do’s • List individual responsibilities ahead of time, and assign specific people to each task. This includes tasks such as notifying your suppliers where to deliver, calling your most important customers to tell them what has happened, and calling your Board members. • Protect critical records that exist only on paper, such as "pending" contracts, advertising, research, and loan applications.

  8. To Do’s • Set clear priorities among your activities. Not everything will come back to normal at the same time. • Decide beforehand the longest amount of time you are willing to be "dead in the water" for each of your activities. • Plan for the loss of your leased lines and for having to relocate to a different site.
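
One way to turn the "longest tolerable downtime" decision into a recovery order is sketched below. The activity names and time limits are hypothetical examples, not figures from the presentation.

```python
from datetime import timedelta

# Hypothetical activities and the longest downtime each can tolerate.
max_downtime = {
    "payroll": timedelta(days=3),
    "customer call center": timedelta(hours=4),
    "accounts receivable": timedelta(days=1),
}


def recovery_order(limits):
    """Order activities so the least downtime-tolerant one is restored first."""
    return sorted(limits, key=limits.get)


print(recovery_order(max_downtime))
```

Sorting by tolerated downtime is just one possible priority rule; a real plan may also weigh dependencies between activities.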

  9. More to Do’s • Keep copies of all of your forms off site. This includes extra checks so that you can buy the emergency supplies you need. • Keep a copy of your disaster plan at home. Make sure it includes the home phone numbers of the service people you rely on: your insurance agent, plumber, electrician, etc.

  10. Consider the Issues • What is the greatest risk? • How are various groups within your company affected by downtime? • What preventative measures are in place right now? • How will your data be recovered?

  11. Create a Procedure • Define how to deal with various aspects of the network, including loss of servers, bridges/routers, etc. • Specify who arranges for repairs or reconstruction and how the data recovery process occurs. • Create a checklist or test procedure to verify that everything is back to normal once repairs and data recovery have taken place.

  12. What’s your risk? • A hurricane took out the reservation system of a major airline, forcing agents to write tickets by hand and causing huge losses. • The World Trade Center bombing: one of the banks in the building lost revenues estimated at US$20 million per day, or $13,889 per minute.
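
The per-minute figure on slide 12 follows from dividing the daily loss by the 1,440 minutes in a day. A small sketch of that conversion (the function name is illustrative):

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes


def loss_per_minute(loss_per_day):
    """Convert a daily revenue loss into a per-minute figure."""
    return loss_per_day / MINUTES_PER_DAY


print(round(loss_per_minute(20_000_000)))  # prints 13889
```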

  13. How expensive is downtime? • Hardware repairs and missed sales opportunities are only the most obvious costs. • Lost productivity and idle employees • Increased technical support costs such as onsite repair • Missed SLAs • Loss of customer confidence and goodwill • Legal liability

  14. Don't Be Caught with Your Data Down • 67% of companies that go through a disaster lasting more than two weeks are out of business within two years. • People don't plan to fail, they simply fail to plan.

  15. Average Cost of Downtime

  16. Pyramid of Availability

  17. Hardware Redundancy: The First Line of Defense • Hardware redundancy protects against computer and disk failure. • Forms of hardware redundancy: RAID (redundant array of inexpensive disks), disk mirroring • Hardware redundancy cannot protect against failures that cause corrupted data to be written to both the primary and the redundant disk.

  18. RAID levels • How are the disks arranged into logical volumes? • RAID-0 (striping) increases overall performance but significantly reduces overall volume reliability. • Various combinations of RAID-1 (mirroring) and RAID-0 increase performance while also increasing reliability. • RAID-5 also tends to increase both performance and reliability. • Plan for restoring data to a RAID-5 volume to take approximately two to three times as long as it took to back it up.

  19. Cold Standby: Backup and Restore • Operational databases are backed up daily so that all will not be lost if a system failure destroys or corrupts your data. • One limitation of this approach is the time required to restore a database. • While a database is recovering, it is inaccessible to end users.

  20. Creating a Warm Standby with Sybase Replication Server • To avoid the problem of potentially corrupted data, one can use Sybase Replication Server to create a warm standby that can be brought up in the event of a system failure. • Replication is usually combined with redundant hardware. • The combination of logical replication and hardware redundancy provides greater protection against loss of availability than either mechanism alone.

  21. Active/Active Hot Standby • ASE 12.0 includes a Companion Server option. • Two ASE servers act as companions in either an asymmetric (master-slave) or a symmetric (active/active hot standby) configuration. • In a two-node hardware cluster with two ASE servers running, both servers actively run applications. • If server 1 goes down, server 2 opens the devices of server 1 and brings them online while continuing to handle its own clients.

  22. Active/Active Config with apps. running on both servers

  23. Companion Server takes over in the event of a failure

  24. Automatic Client Failover • Clients should not have to reconnect in the event of a failover. • Open Client 12.0 has been enhanced to automatically try to connect to the companion server in the event of a failover. • If a transaction is in a partially completed state, an error message is generated to say that a failover has occurred and that the current transaction must be resubmitted.
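
The client-side behaviour on slide 24 (automatic reconnection to the companion, with an error telling the application to resubmit an interrupted transaction) might be modelled roughly as below. The `Connection` class, the `FailoverOccurred` exception, and the server names are hypothetical stand-ins written for this illustration; they are not the Open Client API.

```python
class FailoverOccurred(Exception):
    """Signals that the connection moved to the companion mid-transaction."""


class Connection:
    """Hypothetical stand-in for a database connection; not a Sybase API."""

    def __init__(self, servers):
        self.servers = list(servers)  # e.g. [primary, companion]
        self.current = self.servers[0]
        self.down = set()  # servers simulated as failed

    def execute(self, transaction):
        if self.current in self.down:
            # Reconnect to the first server that is still up.
            alive = [s for s in self.servers if s not in self.down]
            if not alive:
                raise ConnectionError("no server available")
            self.current = alive[0]
            raise FailoverOccurred(f"failed over to {self.current}")
        return f"{transaction} committed on {self.current}"


def run_with_resubmit(conn, transaction, max_attempts=2):
    """Resubmit the transaction when a failover interrupts it."""
    for _ in range(max_attempts):
        try:
            return conn.execute(transaction)
        except FailoverOccurred:
            continue  # the connection now points at the companion
    raise RuntimeError("transaction could not be completed")


conn = Connection(["server1", "server2"])
conn.down.add("server1")  # simulate the primary failing
print(run_with_resubmit(conn, "UPDATE accounts"))
```

The point of the sketch is the division of labour: the connection layer handles reconnection, while the application remains responsible for resubmitting the interrupted transaction.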

  25. Automatic Client Failover

  26. Failback • ASE 12.0 shuts down individual databases from one server while the connections that were using that database are held on the companion server. • Once the shutdown is complete, the primary server can be restarted, and the companion’s proxy databases can be re-established. • Support for failback enables a seamless move back to the original configuration once the primary server has been restarted.

  27. Agenda • Preparing a Business Continuation Plan • Topologies for High Availability • Backups • Clusters • Warm Standby • Peer to Peer Replication

  28. Are you backed up? • Did you create a new file today? Billion new files are created every day! • Is your file protected? 82 percent don’t! • What would it cost you to lose that file? $50,000/hr loss to re-create data • $18,000 is the average hourly cost of downtime for PC networks.

  29. Backup: Your First Layer of Protection • Create a multi-layered backup schedule: full backup, incremental backup, differential backup • Rotate the media according to a well-defined schedule • Grandfather-Father-Son: this scheme uses daily (Son), weekly (Father), and monthly (Grandfather) backup media sets. • Tower of Hanoi
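
Slide 29 names the Tower of Hanoi rotation without describing it. As it is commonly described, the first media set is reused every second session, the next every fourth, and so on, which works out to picking the set from the number of trailing zero bits in the session number. The sketch below assumes that common definition; the function name and the A/B/C labels are illustrative.

```python
def hanoi_media_set(session, num_sets):
    """Return the 0-based media set for a 1-based backup session number."""
    if session < 1 or num_sets < 1:
        raise ValueError("session and num_sets must be positive")
    # Count trailing zero bits: 1 -> 0, 2 -> 1, 3 -> 0, 4 -> 2, ...
    trailing_zeros = (session & -session).bit_length() - 1
    # The last set absorbs every session that would need a higher set.
    return min(trailing_zeros, num_sets - 1)


labels = "ABC"
schedule = "".join(labels[hanoi_media_set(day, 3)] for day in range(1, 9))
print(schedule)  # prints ABACABAC
```

With three sets, set A is overwritten every other session while set C holds the oldest copies, so a small number of tapes still covers a range of backup ages.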

  30. Logical Backups: Use Third party tools • Physical backups: byte-for-byte image; faster than logical backups; the entire volume is backed up as a single entity. • Logical backups: read the superblock to obtain the names of all the directories in the file system; slower than physical backups; their benefit is the ability to restore single tables instead of the whole database.

  31. Client Backup • How many clients are there? • What types of clients are there? • Do the clients have their own backup devices? • How are the clients distributed? • How autonomous are the client systems?

  32. Tape Environment • What are the temperature and humidity like? Optimal operating conditions are 10-40 deg C, storage at 16-32 deg C, and humidity between 20-80%. • How often are the drive heads cleaned? • How old are the drives and tapes?
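
The ranges on slide 32 lend themselves to a simple automated check. The sketch below is an assumption-laden illustration: the function names are made up, and treating the limits as inclusive is a choice made here, not something the slide states.

```python
# Ranges quoted on the slide (degrees Celsius and percent relative humidity).
OPERATING_TEMP_C = (10, 40)
STORAGE_TEMP_C = (16, 32)
HUMIDITY_PERCENT = (20, 80)


def in_range(value, limits):
    """True when value lies within the inclusive (low, high) limits."""
    low, high = limits
    return low <= value <= high


def tape_environment_ok(temp_c, humidity, stored=False):
    """Check a reading against the operating or storage limits."""
    temp_limits = STORAGE_TEMP_C if stored else OPERATING_TEMP_C
    return in_range(temp_c, temp_limits) and in_range(humidity, HUMIDITY_PERCENT)


print(tape_environment_ok(35, 50))               # prints True
print(tape_environment_ok(35, 50, stored=True))  # prints False
```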

  33. Agenda • Preparing a Business Continuation Plan • Topologies for High Availability • Backups • Clusters • Warm Standby • Peer to Peer Replication

  34. Clusters • A cluster is a group of computers (referred to as nodes) connected in a way that lets them work as a single, continuously available system. • Servers that must be highly available and scalable, such as intranet servers, which are increasingly relied upon for daily operation, are good candidates for "conversion" into a cluster. • The extra nodes help ensure uptime and increase the server's throughput and storage capacity.

  35. Third Party Cluster software for ASE • Adaptive Server Enterprise 12.0’s Companion Server option is certified to work with the following high availability solutions: • Sun Microsystems – Sun Cluster • IBM – HACMP • Hewlett-Packard – ServiceGuard • Compaq – TruCluster • Microsoft – Windows NT MSCS • ASE 12.0 uses these solutions to detect system failure and initiate a failover.

  36. How do clusters work? • The typical topology for an HA cluster is as follows: • Two nodes are connected by Ethernet or FDDI. • A "heartbeat" is passed between the nodes on the private network to monitor the health of each node. • The storage arrays are redundantly connected to the servers. • Only one node "owns" a given logical diskset; the other node can take over ownership in the case of failover.
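
The heartbeat and diskset-ownership idea on slide 36 can be sketched as follows. This is a simplified illustration, not how any particular cluster product works: the class and function names, the node names, and the 5-second timeout are all assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from a peer node (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = None

    def record_heartbeat(self, now):
        """Remember when the peer was last heard from."""
        self.last_seen = now

    def peer_is_alive(self, now):
        """The peer counts as alive if a heartbeat arrived within the timeout."""
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.timeout_seconds


def owner_after_check(current_owner, standby, monitor, now):
    """Return which node should own the diskset at time `now`."""
    return current_owner if monitor.peer_is_alive(now) else standby


monitor = HeartbeatMonitor(timeout_seconds=5)  # timeout value is an assumption
monitor.record_heartbeat(now=100)
print(owner_after_check("node1", "node2", monitor, now=103))  # prints node1
print(owner_after_check("node1", "node2", monitor, now=110))  # prints node2
```

Here the monitor runs on the standby node and watches the current owner; once heartbeats stop for longer than the timeout, the standby takes over ownership of the diskset.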

  37. How a Cluster Works • A cluster has two adapters, a primary and a non-primary adapter. • The primary adapter is the adapter that controls the RAID arrays. • When the cluster is first configured and the systems are turned on, the adapter that has the higher unique ID is automatically defined to be the primary adapter.

  38. Failover in a cluster • Each adapter checks periodically that it can still communicate with its system. • If one adapter stops operating, the other adapter detects this. • If the non-primary adapter detects that it has lost access to the other adapter, the non-primary adapter becomes the primary adapter. • Commands that were sent to the original primary adapter after access was lost are sent again, to the new primary adapter. This action is called failover.

  39. Failover of a cluster • If writes are in progress to an array (or have occurred within the last 20 seconds) when failover occurs, the array is rebuilt after the new primary adapter has taken control. • If an array has one of its members missing (that is, the array is in the exposed or degraded state) when a failover occurs, the status of the array becomes offline and an error is logged. Manual intervention is needed to resolve this error.

  40. Failback • After a failover has occurred and a new adapter has been installed in place of the faulty one, the new adapter might have an ID that is higher than that of the remaining (current primary) adapter. • Under that condition, the new adapter becomes the primary adapter. This action is called failback.
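
The adapter rules on slides 37 to 40 (the higher ID starts as primary, the survivor takes over on failover, and a replacement with a higher ID takes over on failback) can be captured in a toy model. The class name and the integer IDs are hypothetical, and the sketch deliberately leaves out command resending and array rebuilds.

```python
class AdapterPair:
    """Toy model of the two-adapter behaviour on slides 37-40."""

    def __init__(self, id_a, id_b):
        self.online = {id_a, id_b}
        self.primary = max(self.online)  # higher unique ID starts as primary

    def fail(self, adapter_id):
        """Take an adapter offline; the survivor becomes primary (failover)."""
        self.online.discard(adapter_id)
        if adapter_id == self.primary and self.online:
            self.primary = max(self.online)

    def install(self, adapter_id):
        """Add a replacement; it takes over only if its ID is higher (failback)."""
        self.online.add(adapter_id)
        if adapter_id > self.primary:
            self.primary = adapter_id


pair = AdapterPair(10, 20)
print(pair.primary)  # prints 20
pair.fail(20)
print(pair.primary)  # prints 10
pair.install(30)
print(pair.primary)  # prints 30
```

Note that a replacement with a lower ID than the surviving adapter would leave the current primary in place, which is why the slide says the new adapter "might" become primary.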

  41. Agenda • Preparing a Business Continuation Plan • Topologies for High Availability • Backups • Clusters • Warm Standby • Peer to Peer Replication

  42. Replication Topologies • Replicate sites read only (and do not update primary data) • Remote primary update with client connection • Remote site request for primary update, with local changes • Distributed primary fragments • Corporate roll-up • Warm standby • Peer to peer topology

  43. Replicate Sites Read Only (and do not update primary data) • Replicate sites need only read-only access to data. • Updates are done at the primary site and are propagated to the replicate sites. • Replicate sites use replicate copies in read-only mode. • Only a local client request can update the primary.
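
The read-only replicate topology on slide 43 boils down to one-way propagation: writes are accepted only at the primary and pushed out to the copies. The sketch below models that with plain Python objects. All class and method names are hypothetical, and the propagation is synchronous here purely for simplicity; it is not a model of how Replication Server moves data.

```python
class ReadOnlyError(Exception):
    """Raised when a replicate site is asked to modify data."""


class PrimarySite:
    def __init__(self):
        self.data = {}
        self.replicates = []

    def update(self, key, value):
        """Apply a local client update, then propagate it one way."""
        self.data[key] = value
        for replicate in self.replicates:
            replicate._apply(key, value)


class ReplicateSite:
    def __init__(self, primary):
        self._copy = dict(primary.data)
        primary.replicates.append(self)

    def _apply(self, key, value):
        """Called only by the replication path, never by clients."""
        self._copy[key] = value

    def read(self, key):
        return self._copy[key]

    def update(self, key, value):
        raise ReadOnlyError("replicate copies are read only")


primary = PrimarySite()
replica = ReplicateSite(primary)
primary.update("policy-1", "active")
print(replica.read("policy-1"))  # prints active
```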

  44. Replicate Sites Read Only (and do not update primary data) • [Diagram: one-way replication from the primary database to read-only replicate sites]

  45. Remote primary Update with Client Connections • Replicate sites: replicate sites wish to change data; updates are done at the primary site and are propagated to replicate sites; replicate sites use replicate copies in read-only mode. • Updating primary data: replicate sites remotely log in or use direct client connections to make changes to primary data.

  46. Remote primary Update with Client Connections • [Diagram: one-way replication from the primary site to a read-only replicate site; the replicate site reads locally and writes through a remote login to the primary]

  47. Remote Site Request for Primary Update with Local Changes • Replicate sites: replicate sites wish to change data; updates are done at the primary site and are propagated to replicate sites; replicate sites use replicate copies in read-only mode, but change a local copy of their replicate data. • Updating primary data: remote (replicate) sites change their own local copy and request changes in the primary to bring everything up to date.

  48. Remote Site Request for Primary Update with Local Changes • [Diagram labels: Applied Function, Request Function]

  49. Distributed Primary Fragments • Partitioned primary data: primary data is fragmented into any number of databases in the system; only one image of the primary data set exists, but not in a single table in a single server. • Remote sites: remote sites control their own primary data and can change it.
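
The distributed primary fragments idea, where each site is primary for its own slice of the data and the full data set is the union of the slices, might be sketched like this. The class, method names, and account keys are hypothetical; the site names are the ones shown on the following slide.

```python
class FragmentedTable:
    """Each site owns (is primary for) the rows stored under its own name."""

    def __init__(self, sites):
        self.fragments = {site: {} for site in sites}

    def update(self, requesting_site, owner_site, key, value):
        """Allow a change only when the requester owns that fragment."""
        if requesting_site != owner_site:
            raise PermissionError(f"{requesting_site} does not own {owner_site} data")
        self.fragments[owner_site][key] = value

    def full_image(self):
        """Union of all fragments: the single logical image of the data."""
        return {
            (site, key): value
            for site, rows in self.fragments.items()
            for key, value in rows.items()
        }


table = FragmentedTable(["New York", "Chicago", "London", "Singapore"])
table.update("London", "London", "acct-7", 250)
table.update("Chicago", "Chicago", "acct-7", 900)
print(len(table.full_image()))  # prints 2
```

Keying the combined image by (site, key) keeps rows from different fragments distinct even when two sites happen to use the same local key.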

  50. Distributed Primary Fragments • [Diagram sites: New York, Chicago, London, Singapore]
