CIST 1601 Information Security Fundamentals

Chapter 13 Disaster Recovery and Incident Response
Collected and Compiled By JD Willard, MCSE, MCSA, Network+, Microsoft IT Academy Administrator
Computer Information Systems Technology
Albany Technical College

Understanding Business Continuity Business Continuity Planning (BCP) is the process of implementing policies, controls, and procedures to counteract the effects of losses, outages, or failures of critical business processes. The critical business functions are those functions that must be established as soon as possible for a business to succeed after a catastrophic event. The primary goal of the BCP is to ensure that the company maintains its long-term business goals both during and after the disruption, and mainly focuses on the continuity of the data, telecommunications, and information systems infrastructures. As part of the business continuity plan, natural disasters should be considered. Natural disasters include tornadoes, floods, hurricanes, and earthquakes. Hardware failure should also be considered. This hardware can be limited to a single computer component, but can include network link or communications line failures. The majority of the unplanned downtime experienced by a company is usually due to hardware failure. Two of the key components of BCP are Business Impact Analysis (BIA) and risk assessment.

Understanding Business Continuity Business continuity planning is a more comprehensive approach to provide guidance so the organization can continue making sales and collecting revenue. Business continuity is primarily concerned with the processes, policies, and methods that an organization follows to minimize the impact of a system failure, network failure, or the failure of any key component needed for operation—essentially, whatever it takes to ensure that the business continues. As with disaster recovery planning, it covers natural and man-made disasters. Utilities, high availability, backups, and fault tolerance are key components of business continuity.

Undertaking Business Impact Analysis Business Impact Analysis (BIA) is the process of evaluating all the critical systems in an organization to determine impact and recovery plans. The key components of a BIA include the following:
Identifying critical functions: “What functions are necessary to continue operations until full service can be restored?” A small or overlooked application in a department may be critical for operations. Every department should be evaluated to ensure that no critical processes are overlooked.
Prioritizing critical business functions: When business is continued after an event, operations must be prioritized as to essential and nonessential functions. You should be clear about which applications or systems have priority for the resources available. Your company may find itself choosing to restore e-mail before it restores its website.
Calculating a time frame for critical systems loss: How long can the organization survive without a critical function? Your organization may need to evaluate and attempt to identify the maximum time that a particular function can be unavailable. This dictates the contingencies that must be made to keep losses from exceeding the allowable period.
Estimating the tangible and intangible impact on the organization: Your organization will suffer losses in an outage, such as lost production and lost sales. Intangible losses will also be a factor. Your discovery of these effects can greatly increase the company’s realization of how much a loss of service will truly cost.
A BIA can help identify what insurance is needed in order for the organization to feel safe.

Utilities Utilities consist of services such as electricity, water, mail, and natural gas that are essential aspects of business continuity. Where possible, you should include fallback measures that allow for interruptions in these services. In the vast majority of cases, electricity and water are restored—at least on an emergency basis—fairly rapidly. Disasters, such as a major earthquake or hurricane, can overwhelm utility companies and government agencies, and services may be interrupted for quite a while. Critical infrastructure may be unavailable for days, weeks, or even months. If possible, build infrastructures that don’t have single points of failure or connection. As an administrator, it’s impossible to prepare for every emergency, but you can plan for those that could conceivably happen.

High Availability High availability refers to the process of keeping services and systems operational during an outage (for example, a power or telephone outage). With high availability, the goal is to have key services available 99.999 percent of the time (also known as five nines availability). High availability and fault tolerance refer to implementing mechanisms such as redundant array of independent disks (RAID), fault-tolerant servers, and clustered servers, which ensure that your business can continue to operate when a system failure occurs. Implementing fault-tolerant systems and redundant technologies, and performing regular backups of your servers, are all ways to keep systems highly available.
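As a rough illustration of what an availability target means in practice, the sketch below converts a percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for this example, not part of the course material.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% available -> about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year.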

Server clustering in a networked environment Redundancy Redundancy refers to systems that are either duplicated or that fail over to other systems in the event of a malfunction. Failover refers to a system that is developing a malfunction automatically switching its processes to another system to continue operations. Clustering is the process of providing failover capabilities for servers by using multiple servers together. A cluster consists of several servers providing the same services. If one server in the cluster fails, the other servers continue to operate. Clustering is a form of server redundancy. It might be necessary to set up redundant servers so that the business can still function in the event of hardware or software failure. A simple equipment failure might result in days of downtime while the problem is repaired. A single point of failure is any piece of equipment that can bring your operation down if it stops working. Neglecting single points of failure can prove disastrous. In disaster recovery planning, you might need to consider redundant connections between branches or sites. If records must be available between offices, the connection between them is a single point of failure that requires redundancy. If all your business is web based, it is a good idea to have some ISP redundancy so that customers still have access if the Internet connection goes down. If the majority of your business is telephone based, you might look for redundancy in the phone system as opposed to the ISP. In this cluster, each system has its own data storage and data-processing capabilities. The system that is connected to the network has the additional task of managing communication between the cluster and its users. Many clustering systems allow all the systems in the cluster to share a single disk system.

Fault Tolerance Fault tolerance is primarily the ability of a system to sustain operations in the event of a component failure. It can be built into a server by adding a second power supply, a second CPU, and other key components. The redundancy strategy N+1 means that you have the number of components you need, plus one extra to plug into any system should it be needed. Tandem, Stratus, and HP all offer fault-tolerant implementations in which everything is N+1 and multiple computers are used to provide 100 percent availability of a single server. It is imperative that fault tolerance be built into your electrical infrastructure as well. At a bare minimum, an uninterruptible power supply (UPS) with surge protection should accompany every server and workstation. A UPS protects computers from power loss due to power outages. It contains a battery that keeps a computer running during a power sag or power outage, and gives a user time to save any unsaved data when a power outage occurs. In an online UPS, the computer is always running off of battery power, and the battery is continuously being recharged. There is no switchover time, and these supplies generally provide the best isolation from power line problems. An “offline” or “standby” UPS usually derives power directly from the power line until power fails, and then switches to battery. Ferro-resonant units operate in the same way as a standby UPS, except that a ferro-resonant transformer is used to filter the output. This transformer is designed to hold energy long enough to cover the switch from line power to battery power, which effectively eliminates the transfer time. Backup power can also be provided by a generator. A generator can be used during rolling blackouts, emergency blackouts, or electrical problems. A backup generator runs on gasoline or diesel to generate electricity and provides power for a limited time. Brownouts are short-term decreases in voltage levels triggered by faults on the utility provider’s systems. To protect your environment from such damaging fluctuations in power, always connect your sensitive electronic equipment to power conditioners, surge protectors, and a UPS, which provides the best protection of all.

Redundant Array of Independent Disks Redundant Array of Independent Disks (RAID) enables a group, or array, of hard disks to act as a single hard disk. RAID 0 is disk striping. RAID 0 stores files in stripes, which are small blocks of data that are written across the disks in an array. Parts of a large file might be stored on every disk in a RAID 0 array. RAID 0 provides no fault tolerance. If any drive in a disk striping volume fails, the entire disk space is unavailable and the data is lost. This RAID implementation is primarily used for performance purposes and not for providing data availability during hard disk failures. RAID 1 includes both disk mirroring and disk duplexing. With disk mirroring, two hard disks are connected to a single hard disk controller, and a complete copy of a file is stored on each hard disk in a mirror set. Disk duplexing, which is similar to disk mirroring, uses a separate hard disk controller for each hard disk. RAID 1 provides full redundancy. All data is stored on both disks, which means that when one disk fails, the other disk continues to operate and the data can be retrieved from it. This allows you to replace the failed disk without interrupting business operation. This solution requires a minimum of two disks and offers 100% redundancy. Usable capacity in RAID 1 is 50% of the total disk space, because the other 50% is used for redundancy.
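The difference between striping and mirroring can be sketched in a few lines of Python. This is a toy model, assuming in-memory byte strings in place of real disks; the function names and the two-byte stripe size are hypothetical choices for the example.

```python
def stripe(data: bytes, disks: int, stripe_size: int) -> list[list[bytes]]:
    """RAID 0 style layout: deal fixed-size stripes across the disks in turn."""
    layout: list[list[bytes]] = [[] for _ in range(disks)]
    for offset in range(0, len(data), stripe_size):
        disk_index = (offset // stripe_size) % disks
        layout[disk_index].append(data[offset:offset + stripe_size])
    return layout


def mirror(data: bytes, disks: int = 2) -> list[bytes]:
    """RAID 1 style layout: every disk holds a complete copy of the data."""
    return [data for _ in range(disks)]


if __name__ == "__main__":
    payload = b"ABCDEFGH"
    # Each disk holds only part of the file, so losing one disk loses the file.
    print(stripe(payload, disks=2, stripe_size=2))
    # Each disk holds the whole file, so either disk alone can serve it.
    print(mirror(payload))
```

In the striped layout no single disk can reproduce the file, which is why RAID 0 offers no fault tolerance; in the mirrored layout half of the raw capacity is spent on the second copy.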

Redundant Array of Independent Disks RAID 3, disk striping with a parity disk, uses RAID 0 with a separate disk that stores parity information. Here, when a disk in the array fails, the system can continue to operate while the failed disk is being removed. Parity information is a value that is based on the value of the specific data stored on each disk. RAID 5 is referred to as disk striping with parity across multiple disks. RAID 5 also stores files in disk stripes, but one stripe is a parity stripe, which provides fault tolerance. The parity information is stored on a drive separate from its data so that in the event of a single drive failure, information on the functioning disks can be used to reconstruct the data from the failed disk. RAID 5 requires at least three hard disks but typically uses five to seven disks. The maximum number of disks supported is 32.
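Parity in RAID 3 and RAID 5 is commonly computed as a bytewise XOR of the data blocks in a stripe. The sketch below shows how a lost block can be rebuilt from the surviving blocks plus parity. It models a single stripe only and does not rotate parity across disks as RAID 5 does; the helper name and sample bytes are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three data blocks from one stripe, one per data disk (sample values).
    data = [b"\x0f\xa0", b"\x33\x11", b"\xf0\x0c"]
    parity = xor_blocks(data)

    # Simulate losing the second disk, then rebuild its block from the rest.
    survivors = [data[0], data[2], parity]
    rebuilt = xor_blocks(survivors)
    print(rebuilt == data[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block, which is why the array tolerates a single drive failure but not two.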

Disaster Recovery Depending on Backups Backups are duplicate copies of key information. One important method of ensuring business continuity is to back up mission-critical servers and data. Computer records are usually backed up using a backup program, backup systems, and backup procedures. Data should be backed up regularly, and you should store a copy of your backup offsite. Several types of storage mechanisms are available for data storage:
Working copies: Also referred to as shadow copies, these are partial or full backups that are stored at the computer center for immediate use in recovering a system or lost file, if necessary.
Onsite storage: Refers to a location on the site of the computer center that the company uses to store data locally. Onsite storage containers are used to store backup media. These containers are classed according to fire, moisture, and pressure resistance.
Offsite storage: Refers to a location away from the computer center where backup media are kept. It can be as simple as keeping a copy of backup media at a remote office, or as complicated as a nuclear-hardened, high-security storage facility. The storage facility should be bonded, insured, and inspected on a regular basis to ensure that all storage procedures are being followed. Most offsite storage facilities charge based on the amount of space you require and the frequency of access you need to the stored information.

Disaster Recovery Crafting a Disaster-Recovery Plan A disaster-recovery plan is a written document that defines how the organization will recover from a disaster and how to restore business with minimum delay. The disaster-recovery plan deals with site relocation in the event of an emergency, a natural disaster, or a service outage. As part of the business continuity plan, it mainly focuses on alternate procedures for processing transactions in the short term. It is carried out when the emergency occurs and immediately following the emergency. A contingency plan would be part of a disaster-recovery plan.

Disaster Recovery Understanding Backup Plan Issues When selecting backup devices and media, you should consider the physical characteristics or type of the drive (media type, capacity, and speed) and the rotation scheme (the frequency of backups and the tape retention time). The backup time is the amount of time a tape takes to back up the data. It is based on the speed of the device and the amount of data being backed up. The restoration time is the amount of time a tape takes to restore the data. It is based on the speed of the device, the amount of data being restored, and the type of backups used. The retention time is the amount of time a tape is stored before its data is overwritten. The longer the retention time, the more media sets will be needed for backup purposes. A longer retention time will give you more flexibility for restoration. The life of a tape is the amount of time a tape is used before being destroyed. The life of a tape is based on the amount of time it is used. Most vendors provide an estimate of backup media life.
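Since backup time depends on device speed and data volume, a first-order estimate is simply volume divided by throughput. The sketch below assumes a constant transfer rate and 1 GB = 1000 MB; real backup jobs also pay for compression, verification, and media changes, so treat the result as a lower bound. The function name and figures are hypothetical.

```python
def backup_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Estimate backup duration in hours from data size and sustained device speed."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    seconds = (data_gb * 1000) / rate_mb_per_s  # assumes 1 GB = 1000 MB
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical job: 360 GB written to a device that sustains 100 MB/s.
    print(f"Estimated backup window: {backup_hours(360, 100):.2f} hours")
```

The same arithmetic applies to restoration time, with the extra factor that an incremental strategy may require reading several media sets in sequence.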

Database transaction auditing process Disaster Recovery Database systems Most modern database systems provide the ability to globally back up data or database and also provide transaction auditing and data-recovery capabilities. Transaction, or audit files, can be stored directly on archival media. In the event of a system outage or data loss, the audit file can be used to roll back the database and update it to the last transactions made. User files Word-processing documents, spreadsheets, and other user files are extremely valuable to an organization. By doing a regular backup on user systems, you can protect these documents and ensure that they’re recoverable in the event of a loss. With the cost of media being relatively cheap, including the user files in a backup every so often is highly recommended. If backups that store only the changed files are created, keeping user files safe becomes a relatively less-painful process. Applications Although you can back up applications, it is usually considered a waste of backup space as these items don’t change often and can usually be re-installed from original media. You should keep a single up-to-date version that is available for download and reinstallation.

Disaster Recovery Knowing the Backup Types A full backup provides a complete backup of all files on a server or disk, with the end result being a complete archive of the system at the specific time when the backup was performed. The archive attribute is cleared. Because of the amount of data that is backed up, full backups can take a long time to complete. A full backup is used as the baseline for any backup strategy and most appropriate when using offsite archiving. While the backup is being run, the system should not be used. In the event of a total loss of data, restoration from a full backup will be faster than other methods.

Disaster Recovery Knowing the Backup Types An incremental backup backs up files that have been created or changed since the immediately preceding backup, regardless of whether the preceding backup was a full backup, a differential backup, or an incremental backup, and resets the archive bit. Incremental backups build on each other; for example, the second incremental backup contains all of the changes made since the first incremental backup. Incremental backups are smaller than full backups, and are also the fastest backup type to perform. When restoring the data, the full backup must be restored first, followed by each incremental backup in order.

Disaster Recovery Knowing the Backup Types A differential backup includes all files created or modified since the last full backup without resetting the archive bit. Differential backups are not dependent on each other. Each differential backup contains the changes made since the last full backup. Therefore, differential backups can take significantly longer than incremental backups. Differential backups tend to grow as the week progresses and no new full backups have been performed. When restoring the data, the full backup must be restored first, followed by the most recent differential backup.
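The way the archive bit drives full, incremental, and differential backups can be modeled with a small simulation. This is a toy sketch, assuming a dictionary that maps file names to their archive bit (True means "changed since it was last cleared"); the function and file names are hypothetical.

```python
def full_backup(files: dict[str, bool]) -> set[str]:
    """Copy every file and clear every archive bit."""
    saved = set(files)
    for name in files:
        files[name] = False
    return saved


def incremental_backup(files: dict[str, bool]) -> set[str]:
    """Copy files whose archive bit is set, then clear those bits."""
    saved = {name for name, bit in files.items() if bit}
    for name in saved:
        files[name] = False
    return saved


def differential_backup(files: dict[str, bool]) -> set[str]:
    """Copy files whose archive bit is set, leaving the bits untouched."""
    return {name for name, bit in files.items() if bit}


if __name__ == "__main__":
    for strategy in (incremental_backup, differential_backup):
        files = {"a.doc": True, "b.xls": True, "c.txt": True}
        full_backup(files)
        files["a.doc"] = True          # Monday: a.doc is modified
        monday = strategy(files)
        files["b.xls"] = True          # Tuesday: b.xls is modified
        tuesday = strategy(files)
        print(strategy.__name__, sorted(monday), sorted(tuesday))
```

In this model the second incremental run picks up only Tuesday's change, while the second differential run picks up both changes since the full backup, which matches the restore rules: every incremental in order, or just the latest differential.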

Grandfather, Father, Son backup method Disaster Recovery Developing a Backup Plan Grandfather-father-son backup refers to the most common rotation scheme for rotating backup media. Originally designed for tape backup, it works well for any hierarchical backup strategy. It allows for a minimum usage of backup media. The basic method is to define three sets of backups: daily, weekly, and monthly. For short-term archival, the monthly backup is referred to as the grandfather, the weekly backup is the father, and the daily backup is the son. The last backup of the month becomes the archived backup for that month. For long-term archival, the annual backup is referred to as the grandfather, the monthly backup is the father, and the weekly backup is the son. The last backup of the year becomes the annual backup for the year.
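The short-term rotation can be sketched as a function that assigns each calendar day to a tier. The rules used here are assumptions for the example only: the last calendar day of the month is the grandfather, any other Friday is the father, and every remaining day is a son. Real schedules often use the last business day or a different weekday.

```python
import calendar
from datetime import date


def gfs_tier(day: date) -> str:
    """Classify a backup date as 'grandfather', 'father', or 'son' (assumed rules)."""
    last_day_of_month = calendar.monthrange(day.year, day.month)[1]
    if day.day == last_day_of_month:
        return "grandfather"   # monthly backup, archived for that month
    if day.weekday() == 4:     # Monday is 0, so 4 is Friday
        return "father"        # weekly backup
    return "son"               # daily backup


if __name__ == "__main__":
    for sample in (date(2024, 1, 8), date(2024, 1, 5), date(2024, 1, 31)):
        print(sample.isoformat(), gfs_tier(sample))
```

Because son media are reused every week and father media every month, only the grandfather sets accumulate, which is how the scheme keeps media usage low.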

Full Archival backup method Disaster Recovery Developing a Backup Plan The Full Archival method keeps all data that has ever been on the system during a backup and stores it either onsite or offsite for later retrieval. In short, all full backups, all incremental backups, and any other backups are permanently kept somewhere. One major problem involves keeping records of what information has been archived. For these reasons, many larger companies don’t find this to be an acceptable method of keeping backups.

A backup server archiving server files Disaster Recovery Developing a Backup Plan The Backup Server method establishes a server with large amounts of disk space whose sole purpose is to back up data. All files on all servers are copied to the backup server on a regular basis; over time, this server’s storage requirements can become enormous. The advantage is that all backed-up data is available online for immediate access. If a system or server malfunctions, the backup server can be accessed to restore information from the last backups performed on that system. Several software manufacturers make backup software that creates hierarchies of files: over time, if a file isn’t accessed, it’s moved to slower media and may eventually be stored offline. This helps reduce the disk storage requirements, yet it still keeps the files that are most likely to be needed for recovery readily available. In this instance, the files on the backup server contain copies of all the information and data on the APPS, ACCTG, and DB servers.

System regeneration process for a workstation or server Disaster Recovery Notice that the installation CDs are being used for the base OS and applications. Recovering a System Workstation and server failures, accidental deletion, virus infection, and natural disasters are all reasons why information might need to be restored from backup copies. When a system fails, you’ll be unable to reestablish operation without regenerating all of the system’s components. This process includes making sure hardware is functioning, restoring or installing the operating systems, restoring or installing applications, and restoring data files. When you install a new system, make a full backup of it before any data files are created. Some operating systems, such as Windows Server 2008, allow you to create a model user system as a disk image on a server; the disk image is downloaded and installed when a failure occurs.

Disaster Recovery Planning for Alternate Sites Hot, cold, and warm sites are maintained in facilities that are owned by another company. Hot sites generally contain everything you need to bring your IT facilities up. Warm sites provide some capabilities, including computer systems and media capabilities, in the event of a disaster. Cold sites do not provide any infrastructure to support a company’s operations and require the most setup time. Hot site A hot site is up and available 24 hours a day, seven days a week, and has the advantage of a very quick return to business, as well as the ability to test a disaster recovery plan (DRP) without affecting current operations. It is similar to the original site in that it is equipped with all necessary hardware, software, network, and Internet connectivity fully installed, configured, and operational. It usually “mirrors” the configuration of the corporate facility. Usually, testing is as simple as switching over after ensuring it contains the latest versions of your data. When setting up a hot site, ensure that it is sufficiently far from the corporate facility being mirrored so that it is not affected by the same disaster. Hot sites are traditionally more expensive, but they can be used for operations and recovery testing before an actual catastrophic event occurs. They require a lot of administration time to ensure that the site is ready within the maximum tolerable downtime (MTD). Expense, administration time, and the need for extensive security controls are disadvantages to using a hot site. Recovery time and testing availability are two advantages to using a hot site.

Disaster Recovery Planning for Alternate Sites Warm site A warm site represents a compromise between a hot site, which is a very expensive site and a cold site, which is not preconfigured. A warm site usually only contains the power, phone, network ports, and other base services required. When a disaster occurs at the corporate facility, additional effort is needed to bring the computers, data, and resources to the warm site. A warm site is harder to test than a hot site, but easier to test than a cold site. It only contains telecommunications equipment. Therefore, to properly test disaster recovery procedures at the warm site, alternate computer equipment such as servers would need to be set up and configured. Warm sites are less expensive than hot sites, but more expensive than cold sites. The recovery time of a warm site is slower than for a hot site, but faster than for a cold site. Warm sites usually require less administration time because only the telecommunications equipment is maintained, not the computer equipment.

Disaster Recovery Planning for Alternate Sites Cold site A cold site does not provide any equipment. These sites are merely a prearranged request to use facilities if needed. A cold site is usually only made up of empty office space, electricity, raised flooring, air conditioning, and telecommunications lines and bathrooms. A cold site still needs networking equipment and complete configuration before it can operate when a disaster strikes the corporate facilities. This DRP option is the cheapest. To properly test disaster recovery procedures at the cold site, alternate telecommunications and computer equipment would need to be set up and configured. Recovery time and testing availability are two disadvantages to using a cold site. Expense and administration time are two advantages to using a cold site.

Incident Response Policies Incident-response policies define how an organization will respond to an incident. An incident is any of the following:
Any attempt to violate a security policy
A successful penetration
A compromise of a system
Unauthorized access to information
Systems failures
Disruption of services
It’s important that an incident-response policy establish at least the following items:
Outside agencies that should be contacted or notified in case of an incident
Resources used to deal with an incident
Procedures to gather and secure evidence
List of information that should be collected about an incident
Outside experts who can be used to address issues if needed
Policies and guidelines regarding how to handle an incident

Understanding Incident Response An incident is the occurrence of any event that endangers a system or network. Incident response encompasses forensics (identifying what has occurred) and refers to the process of identifying, investigating, repairing, documenting, and adjusting procedures to prevent another incident. It’s a good idea to include the procedures you’ll generally follow in an incident response plan (IRP). The IRP outlines what steps are needed and who is responsible for deciding how to handle a situation. A chain of custody tells how the evidence made it from the crime scene to the courtroom, including documentation of how the evidence was collected, preserved, and analyzed.

Understanding Incident Response Step One: Identifying the Incident The first step is to identify the incident and determine whether it is a real incident or just a false positive. A false positive occurs when the software classifies an action as a possible intrusion when it is actually a nonthreatening action. When a suspected incident is flagged, first responders are those who must ascertain whether it truly is an incident or a false alarm. When the response team has determined that an incident occurred, the next step in incident analysis involves considering how to handle it by taking a comprehensive look at the incident activity to determine the scope, priority, and threat of the incident. Escalation involves consulting policies and appropriate management, and determining how best to conduct an investigation into the incident.

Understanding Incident Response Step Two: Investigating the Incident The process of investigating an incident involves searching logs, files, and any other sources of data about the nature and scope of the incident. If possible, you should determine whether this is part of a larger attack, a random event, or a false positive. You might find that the incident doesn’t require a response if it can’t be successful. Your investigation might conclude that a change in policies is required to deal with a new type of threat.

Understanding Incident Response Step Three: Repairing the Damage In keeping with the severity of the incident, the organization can act to mitigate the impact of the incident by containing it and eventually restoring operations back to normal. Most operating systems provide the ability to create a disaster-recovery process using distribution media or backups of system state files. In the case of a DoS attack, a system reboot may be all that is required. Your operating system manufacturer will typically provide detailed instructions or documentation on how to restore services in the event of an attack. Just as every network, regardless of size, should have a firewall, it should also be protected by antivirus software that is enabled and current. If a system has been severely compromised it may need to be regenerated from scratch. In that case, you’re highly advised to do a complete disk format or repartition to ensure that nothing is lurking on the disk, waiting to infect your network again.

Understanding Incident Response Step Four: Documenting and Reporting the Response You should document the steps you take to identify, detect, and repair the system or network. It is important to accurately determine the cause of each incident so that it can be fully contained and the exploited vulnerabilities can be mitigated to prevent similar incidents from occurring in the future. Many help-desk software systems provide detailed methods you can use to record procedures and steps. You should also report the incident to law enforcement and/or CERT (www.cert.org) so that others can be aware of the type of attack and help look for proactive measures to prevent this from happening again. You might also want to inform the software or system manufacturer.

Understanding Incident Response Step Five: Adjusting Procedures After an incident has been successfully managed, revisit the procedures and policies in place in your organization to determine what changes, if any, need to be made. The following questions might be included in a policy or procedure manual: How did the policies work or not work in this situation? What did we learn about the situation that was new? What should we do differently next time? These simple questions can help you adjust procedures. This process is called a postmortem, the equivalent of an autopsy.

Succession Planning Succession planning outlines those internal to the organization who have the ability to step into positions when they open. By identifying key roles that cannot be left unfilled and associating internal employees who can step into those roles, you can groom those employees to make sure they are up to speed when it comes time for them to fill those positions.

Reinforcing Vendor Support Service-Level Agreements A service-level agreement (SLA) is an agreement between a company and a vendor in which the vendor agrees to provide certain functions for a specified period. SLAs establish the contracted requirements for service through software and hardware vendors, utilities, facility management, and ISPs. The following are key measures in SLAs:
Mean Time Between Failures (MTBF) is the average length of time a component will last, given average use. Usually, this number is given in hours or days. MTBF is helpful in evaluating a system’s reliability and life expectancy.
Mean Time to Repair (MTTR) is the measurement of how long it takes to repair a system or component once a failure occurs. In the case of a computer system, if the MTTR is 24 hours, this tells you it will typically take 24 hours to repair it when it breaks.
Code Escrow Agreements Code escrow refers to the storage and conditions of release of source code provided by a vendor. Code escrow allows customers to access the source code of installed systems under specific conditions, such as the bankruptcy of a vendor. Make sure your agreements provide you with either the source code for projects you’ve had done or a code escrow clause to acquire the software if the company goes out of business.
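MTBF and MTTR are often combined into a steady-state availability estimate, MTBF / (MTBF + MTTR). That relation is a common rule of thumb rather than something stated in these slides, and the figures below are hypothetical, so read this as a sketch for comparing SLA offers rather than a contractual definition.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate the fraction of time a component is up, given MTBF and MTTR in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical component: fails every 1000 hours on average, 24 hours to repair.
    estimate = availability(mtbf_hours=1000, mttr_hours=24)
    print(f"Estimated availability: {estimate:.2%}")
```

The estimate makes the trade-off visible: cutting repair time raises availability just as surely as buying more reliable hardware does.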

