Network planning considering reliability aspects

Takács György, Lecture 5, October 6, 2010

Fundamentals of reliability issues in network planning
• Networks must be dimensioned above the exact specification calculated from the demand parameters; they always need some spare capacity.
• Plan for unpredictable situations.
• Reliability and availability dimensioning are an important part of network dimensioning and planning.

Organizations are increasingly reliant on computer networks for business or mission-critical applications. The scope and size of these networks have expanded so rapidly over the past two decades that considerable effort and expense are now targeted at keeping network resources available, sometimes 24 hours a day, all year. Traditionally this area of network design has been the preserve of large mainframe sites and those sites requiring high levels of protection (such as nuclear power plants). However, the explosion of Web-based business methods means that many more organizations are now eager to maintain high availability in order to minimize service losses.

If the network is poorly designed, and insufficient attention is paid to providing availability in core systems, users can experience anything from slow response times to complete loss of service (referred to as downtime) for extended periods. The technical issues in maintaining high availability are both complex and subtle, and it is the network designer’s job to balance loss probability against cost, providing guidance to senior management on the likelihood of failures and their impact on the business.

Networks are rarely static environments, and budgets are finite. In practice network designers are required to make a range of pragmatic and technical decisions that address, accept, mitigate, or transfer the risks of failure, all within the constraints of a budget. The designer must also ensure that the solutions provided are scalable, so that additional nodes, services, and capacity can be added without major upheaval and without adversely affecting existing users. Downtime for truly business- and mission-critical systems can equate to losses of millions of dollars per minute; these organizations, therefore, demand high-availability (HA) networks and are often prepared to go to extraordinary lengths to achieve them.


Failure knows no boundaries in a network design, and the smallest component failure can effectively bring down a whole business without warning (e.g., a failed hard disk controller on your core e-business server could stop all transactions). For practical reasons organizations are invariably broken down into teams responsible for different aspects of IT (desktop support, communications, applications, database, cabling, etc.). When a problem occurs, it is all too common for application staff to blame the network and vice versa. To maintain HA networks, different disciplines must work together, both at the design phase and subsequently. Good diagnostic, monitoring, and management tools can also help.

Planning for failure
When designing a reliable data network, network designers are well advised to keep two quotations in mind at all times:
"Anything that can go wrong, will go wrong." (Murphy)
"Whatever can go wrong will go wrong at the worst possible time and in the worst possible way . . . Expect the unexpected." (Douglas Adams, The Hitchhiker’s Guide to the Galaxy)

Failure refers to a situation where the observed behavior of a system differs from its specified behavior. A failure occurs because of an error, which is in turn caused by a fault. The time lapse between the error occurring and the resulting failure is called the error latency.
• Faults can be hard (permanent) or soft (transient).
• For example, a cable break is a hard fault, whereas intermittent noise on the line is a soft fault.

Planning for failure
Single Point of Failure (SPOF) indicates that a system or network can be rendered inoperable, or significantly impaired in operation, by the failure of one single component. For example, a single hard disk failure could bring down a server; a single router failure could break all connectivity for a network.
Multiple points of failure indicate that a system or network can be rendered inoperable through a chain or combination of failures (as few as two). For example, failure of a single router, plus failure of a backup modem link, could mean that all connectivity is lost for a network.
In general it is much more expensive to cope with multiple points of failure, and often financially impractical.
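One way to make the SPOF idea concrete is to model the network as a graph and ask, for each node, whether the remaining nodes can still reach each other once that node has failed. The sketch below is a minimal brute-force illustration; the topology and the node names (`lan_a`, `router_1`, and so on) are hypothetical and chosen only for the example.

```python
from collections import deque


def is_connected(adjacency, removed):
    """Return True if all nodes except `removed` can still reach each other."""
    nodes = [node for node in adjacency if node != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def single_points_of_failure(adjacency):
    """List every node whose failure alone disconnects the network."""
    return sorted(node for node in adjacency if not is_connected(adjacency, node))


# Hypothetical topology: two LANs reach the core through one router.
single_router = {
    "lan_a": {"router_1"},
    "lan_b": {"router_1"},
    "router_1": {"lan_a", "lan_b", "core"},
    "core": {"router_1"},
}

# Same topology with a second, parallel router (hypothetical).
dual_router = {
    "lan_a": {"router_1", "router_2"},
    "lan_b": {"router_1", "router_2"},
    "router_1": {"lan_a", "lan_b", "core"},
    "router_2": {"lan_a", "lan_b", "core"},
    "core": {"router_1", "router_2"},
}

spof_single = single_points_of_failure(single_router)
spof_dual = single_points_of_failure(dual_router)
print(spof_single, spof_dual)
```

In the first topology the lone router is reported as a single point of failure; duplicating it removes all single points of failure, although a combination of two failures can still isolate parts of the network.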

Fault tolerance indicates that every component in the chain supporting the system has redundant features or is duplicated. A fault-tolerant system will not fail because any one component fails (i.e., it has no single point of failure). The system should also provide recovery from multiple failures. Components are often overengineered or purposely underutilized to ensure that while performance may be affected during an outage, the system will perform within predictable, acceptable bounds.

Fault resilience implies that at least one of the modules or components within a system is backed up with a spare (e.g., a power supply). This may be in hot standby, cold standby, or load-sharing mode. In contrast with fault-tolerant systems, not all modules or components are necessarily redundant (i.e., there may be several single points of failure). For example, a fault-resilient router may have multiple power supplies but only one routing processor. By definition, one fault-resilient component does not make the entire system fault tolerant.

Disaster recovery is the process of identifying all potential failures, their impact on the system/network as a whole, and planning the means to recover from such failures.

Calculating the true cost of downtime
Network designers are largely unfamiliar with financial models. It is, however, imperative in designing reliable networks that the designer gathers some basic financial data in order to cost-justify and direct suitable technical solutions. The data may come from line managers or financial support staff and may not be readily collated. Without these data the scale of the problem is undefined, and it will be hard to convince senior financial and operational management that additional features are necessary.

To illustrate the point let us consider a hypothetical consumer-oriented business (such as an airline, car rental, vacation, or hotel reservation call center). The call center is required to be online 24 hours a day, 7 days a week, 365 days a year. The business has 800 staff involved in call handling (transactions), each with an average burdened cost of $25 an hour (i.e., the cost of providing a desk, heating, lighting, phone, data point, etc.). There is a small profit made on each transaction, plus a large profit on any actual sale that can be closed. We assume here that there are on average three sales closed per hour.

• Cost of Idle Staff = Headcount × Burdened Cost × Downtime
• Production Losses = Headcount × Transactions per Hour × Profit per Transaction × Downtime
• Lost Sales = Headcount × Sales per Hour × Profit per Sale × Downtime
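The three formulas can be sketched in a few lines of Python. The headcount (800), the burdened cost ($25 an hour), and the three sales per hour come from the call center example; the transactions per hour, profit per transaction, and profit per sale are not given in the text, so the values below are purely hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class CallCenter:
    headcount: int
    burdened_cost_per_hour: float   # dollars per staff member per hour
    transactions_per_hour: float    # per staff member (hypothetical value below)
    profit_per_transaction: float   # dollars (hypothetical value below)
    sales_per_hour: float           # per staff member, as the formula implies
    profit_per_sale: float          # dollars (hypothetical value below)


def downtime_cost(site, downtime_hours):
    """Apply the three downtime formulas and return each part plus the total."""
    idle_staff = site.headcount * site.burdened_cost_per_hour * downtime_hours
    production_losses = (site.headcount * site.transactions_per_hour
                         * site.profit_per_transaction * downtime_hours)
    lost_sales = (site.headcount * site.sales_per_hour
                  * site.profit_per_sale * downtime_hours)
    return {
        "idle_staff": idle_staff,
        "production_losses": production_losses,
        "lost_sales": lost_sales,
        "total": idle_staff + production_losses + lost_sales,
    }


call_center = CallCenter(
    headcount=800,
    burdened_cost_per_hour=25.0,
    transactions_per_hour=10.0,   # assumption, not from the text
    profit_per_transaction=0.5,   # assumption, not from the text
    sales_per_hour=3.0,
    profit_per_sale=5.0,          # assumption, not from the text
)

costs = downtime_cost(call_center, downtime_hours=1.0)
for item, dollars in costs.items():
    print(f"{item}: ${dollars:,.2f}")
```

With real figures from line managers or finance staff substituted for the placeholders, the same function gives the cost of any outage length, which is the number needed to justify spending on redundancy.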


Developing a disaster recovery plan
All networks are vulnerable to disruption. Sometimes these disruptions may come from the most unlikely sources. Natural events such as flooding, fire, lightning strikes, earthquakes, tidal waves, and hurricanes are all possible, as well as fuel shortages, electricity strikes, viruses, hackers, system failures, and software bugs. History shows us that these events do happen regularly. As recently as 1999 and 2000 we saw the seemingly impossible: power shortages in California threatened to cripple Silicon Valley, and a combination of fuel shortages, train safety issues, and massive flooding ….

In fact, various studies indicate that the majority of system failures can be attributed to a relatively small set of events. These include, in decreasing order of frequency, natural disaster, power failure, systems failure, sabotage/viruses, fire, and human error. There is also a general consensus that companies that take longer than a full business week to get back online run a high risk of being forced out of business entirely (some analysts state as high as 50 percent).

A general approach to the creation of a Disaster Recovery (DR) plan:
• Benchmark the current design: Perform a full risk assessment for all key systems and the network as a whole. Identify key threats to system and network integrity. Analyze core business requirements and identify core processes and their dependence upon the network. Assign monetary values to the loss of service or systems.
• Define the requirements: Based on business needs, determine an acceptable recovery window for each system and the network as a whole. If practical, specify a worst-case recovery window and a target recovery window. Specify priorities for mission- or business-critical systems.

• Define the technical solution: Determine the technical response to these challenges by evaluating alternative recovery models, and select solutions that best meet the business requirements. Ensure that a full cost analysis of each solution is provided, together with the recovery times anticipated under catastrophic failure conditions and lesser degrees of failure.
• Develop the recovery strategy: Formulate a crisis management plan identifying the processes to be followed and the response of key personnel to failure scenarios. Describe where automation and manual intervention are required. Set priorities to clearly identify the order in which systems should be brought back online.

• Develop an implementation strategy: Determine how new/additional technology is to be deployed and over what time period. Document changes to the existing design. Identify how new/additional processes and responsibilities are to be communicated.
• Develop a test program: Determine how business- and mission-critical systems may be exercised and what the expected results should be. Define procedures for rectifying test failures. Run tests to see if the strategy works; if not, make refinements until satisfied.
• Implement continuous monitoring and improvements: Once the disaster recovery plan is established, hold regular reviews to ensure that the plan stays synchronized as the network grows or design features are modified.

Disaster recovery models

Tape or CD site backup: Tape or CD-ROM backup and restore are widely used DR methods for sites. Traditionally, key data repositories and configuration files are backed up nightly or every other night. Backup media are transported and securely stored at a different location. This enables complete data recovery should the main site systems be compromised. If the primary site becomes inoperable, the plan is to ship the media back, reboot, and resume normal operations.
Pros and Cons: This is a low-cost solution, but the recovery window could range from a few hours to several days; this may prove unacceptable for many businesses. Media reliability may not be 100 percent and, depending upon the backup frequency, valuable data may be lost.

Electronic vaulting: With remote electronic vaulting, data are archived automatically to tape or CD over the network to a secure remote site. Electronic vaulting ideally requires a dedicated network connection to support large or frequent background data transfers; otherwise, archiving must be performed during off-peak periods or low-utilization periods (e.g., via a nightly backup). Backup procedures can, however, be optimized by archiving only incremental changes since the last archive, reducing both traffic levels and network unavailability.
Pros and Cons: The operating costs for electronic vaulting can be up to four times more expensive than simple tape or CD backup; however, this approach can be entirely automated. Unlike simple media backup there is no requirement to transport backup data physically. Recovery still depends on the most recent backup copy, but this is likely to be more recent due to automation. Electronic vaulting is more reliable and significantly decreases the recovery window (typically, just a few hours).
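The incremental optimization can be illustrated with a small sketch that archives only files modified since the last archive run. It is a simplified stand-in, not a vaulting product: it copies to a local directory rather than over a network link, it uses file modification times as the change indicator (an assumption), and the function name `incremental_archive` and the file names are hypothetical.

```python
import os
import shutil
import tempfile


def incremental_archive(source_dir, archive_dir, last_archive_time):
    """Copy files modified after last_archive_time; return their relative paths."""
    copied = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            source_path = os.path.join(root, name)
            # Only files changed since the previous archive are transferred.
            if os.path.getmtime(source_path) > last_archive_time:
                relative = os.path.relpath(source_path, source_dir)
                target = os.path.join(archive_dir, relative)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(source_path, target)
                copied.append(relative)
    return sorted(copied)


# Demo with two hypothetical files: one older and one newer than the last archive.
with tempfile.TemporaryDirectory() as source, tempfile.TemporaryDirectory() as vault:
    for name, mtime in (("old.cfg", 1000), ("new.cfg", 2000)):
        path = os.path.join(source, name)
        with open(path, "w") as handle:
            handle.write("example")
        os.utime(path, (mtime, mtime))  # set access and modification times
    copied = incremental_archive(source, vault, last_archive_time=1500)

print(copied)
```

Only the file changed after the last archive is transferred, which is what keeps the traffic generated by each vaulting run small.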

Data replication/disk mirroring: Remote disk mirroring provides faster recovery and less data loss than remote electronic vaulting. Since data are transferred to disk rather than tape, performance impacts are minimized. With disk mirroring you can maintain a complete replica file system image at the backup site; all changes made to production data are tracked and automatically backed up. Data are typically synchronized in the background, and when the recovery site is initialized or when a failed site comes back online, all data are resynchronized from the replica to production storage. Note that data may be available only in read-only mode at the recovery site if the original site fails (to ensure at least one copy is protected), so services will recover but applications that are required to update data may be somewhat compromised unless some form of local data cache is available until the primary storage comes online. A disk mirroring solution should ideally be able to use a variety of disks using industry-standard interfaces (e.g., SCSI, Fibre Channel, etc.).

Data replication/disk mirroring
Pros and Cons: Data replication is more expensive than the previous two models, and for large sites considerable traffic volumes can be generated. Ideally, a private storage network should be deployed to separate storage traffic from user traffic. Although this is closer to optimal, it requires more maintenance than the earlier models.

Server mirroring and clustering: These techniques can be used to significantly reduce the recovery time to acceptable levels. Ideally, servers should be running live and in parallel, distributing load between them but located at different physical locations. If incremental changes are frequently synchronized between servers, then backup could be a matter of seconds, and only a few transactions may be lost (assuming there isn’t large-scale telecommunications or power disruption and staff are well briefed on what to do and what not to do in such circumstances). The increasing focus on electronic commerce and large-scale applications such as ERP means that this configuration is becoming increasingly common.

Server mirroring and clustering
Pros and Cons: This approach is widely used at data centers for major financial and retail institutions but is often too expensive to justify for small businesses. Server mirroring requires more infrastructure to achieve (high-speed wide area links, more routers, more firewalls, and tight management and control systems).

Storage Area Networks (SANs) and Optical Storage Networks (OSNs): There is increasing interest in moving mission- and business-critical data off the main network and offloading it onto a privately managed infrastructure called a Storage Area Network (SAN). Storage can be optically attached via standard high-speed interfaces such as Fibre Channel and SCSI (with optical extenders), providing a physical separation of storage from 600 meters to 10 kilometers. Servers are directly attached to this network (typically via Fibre Channel or ESCON/FICON interfaces [5]) and are also attached to the main user network. SANs may be further extended (to thousands of kilometers) via technologies such as Dense Wavelength Division Multiplexing (DWDM), forming optical storage networks. This allows multiple sites to share storage over reliable high-speed private links.

Storage Area Networks (SANs) and Optical Storage Networks (OSNs)
Pros and Cons: This approach is an excellent model for disaster recovery and storage optimization. It significantly increases complexity and cost (though storage consolidation may recover some of these costs), and it is, therefore, appropriate only for major enterprises at present. One big attraction for many large enterprises is that the whole storage infrastructure can be outsourced to a Storage Service Provider (SSP). This facilitates a very reliable DR model (some providers are currently quoting four-nines (99.99 percent) availability).

Quantifying availability
• A% = (Operational Time / Total Time) × 100
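A short sketch shows what an availability percentage means in practice by converting it into expected downtime per year. The function names are hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def availability_percent(operational_hours, total_hours):
    """Availability as a percentage: operational time over total time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * operational_hours / total_hours


def annual_downtime_minutes(availability_pct):
    """Expected downtime per year, in minutes, for a given availability."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR * 60


# One hour of outage in a full year of required operation.
one_hour_outage = availability_percent(HOURS_PER_YEAR - 1, HOURS_PER_YEAR)

# Downtime budget implied by a four-nines (99.99 percent) availability figure.
four_nines_downtime = annual_downtime_minutes(99.99)

print(f"{one_hour_outage:.4f}% available")
print(f"{four_nines_downtime:.2f} minutes of downtime per year")
```

A four-nines figure therefore leaves a budget of under an hour of downtime per year, which helps when comparing quoted availability against the cost of downtime.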


Mean Time Between Service Outages (MTBSO) or Mean Time Between Failure (MTBF) is the average time (expressed in hours) that a system has been working between service outages and is typically greater than 2,000 hours. Since modern network devices may have a short working life (typically five years), MTBF is often a predicted value, based on stress-testing systems and then forecasting availability in the future. Devices with moving mechanical parts such as disk drives often exhibit lower MTBFs than systems that use fixed components (e.g., flash memory).

Mean Time To Repair (MTTR) is the average time to repair systems that have failed and is usually several orders of magnitude less than MTBF. MTTR values may vary markedly, depending upon the type of system under repair and the nature of the failure. Typical values range from 30 minutes through to 3 or 4 hours. A typical MTTR for a complex system with little inherent redundancy might be several hours.
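MTBF and MTTR are commonly combined into an availability figure using the standard steady-state relation A = MTBF / (MTBF + MTTR). That relation is not stated in the text here and is used below as an assumption; the sample values (2,000 hours and 4 hours) are taken from the typical ranges mentioned.

```python
def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# MTBF of 2,000 hours with a 4-hour repair time.
slow_repair = availability_from_mtbf(2000, 4)

# Same MTBF with a 30-minute repair time.
fast_repair = availability_from_mtbf(2000, 0.5)

print(f"4 h repair:   {slow_repair:.4%}")
print(f"0.5 h repair: {fast_repair:.4%}")
```

The comparison shows that, for the same MTBF, cutting repair time raises availability, which is why spares and good diagnostics matter as much as reliable hardware.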


For a series system:
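A series system works only if every one of its components works. Under the common assumption that components fail independently (an assumption made here, not a statement from the slides), the system availability is the product of the component availabilities: A_series = A_1 × A_2 × ... × A_n. A minimal sketch, with hypothetical component values:

```python
import math


def series_availability(component_availabilities):
    """Availability of components in series, assuming independent failures."""
    for value in component_availabilities:
        if not 0.0 <= value <= 1.0:
            raise ValueError("availabilities must be fractions between 0 and 1")
    return math.prod(component_availabilities)


# Hypothetical chain: access link, router, server.
chain = [0.999, 0.999, 0.99]
chain_availability = series_availability(chain)

print(f"{chain_availability:.6f}")
```

The result is always lower than the availability of the weakest component, so every extra element placed in series reduces end-to-end availability.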

