Maximizing IP/MPLS Network Availability - Methods and Calculations

Availability of IP/MPLS networks Sanjay Kalra October 2002

Agenda • Introduction • How to measure Availability • Network Design example • One Router vs. Two Routers • Software Dependability • Summary

Definition of Availability Availability is the probability that an item will be able to perform its designed functions at the stated performance level, within the stated conditions and in the stated environment when called upon to do so. Availability = Reliability / (Reliability + Recovery)
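The ratio above is often quantified by taking reliability as mean time between failures (MTBF) and recovery as mean time to repair (MTTR). That mapping and the sample figures below are assumptions for illustration, not values from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = reliability / (reliability + recovery).

    Assumption: reliability is expressed as MTBF and recovery as MTTR,
    both in hours.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical element: one failure every 10,000 hours, 1 hour to restore.
a = availability(10_000, 1)
print(f"{a * 100:.4f} %")
```

Shortening recovery time raises availability just as lengthening the time between failures does, which is why both appear in the definition.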

Quantification

PSTN: The Yardstick? • PSTN end-to-end availability: 99.94% • Individual elements have an availability of 99.99% • One cut-off call in 8,000 calls (3 min for an average call); five ineffective calls in every 10,000 calls. [Figure: end-to-end PSTN path between the two facility entrances, with per-segment unavailability values 0.005%, 0.005%, 0.01%, 0.01%, 0.005%, 0.005% and 0.02% across the NI, AN, LE and LD segments. NI: Network Interface, LE: Local Exchange, LD: Long Distance, AN: Access Network.] Source: http://www.packetcable.com/downloads/specs/pkt-tr-voipar-v01-001128.pdf

Service Effect on Network Availability • In an IP network, availability is a function of the service being offered. Source: www.t1.org

IP Network Expectations [Table not recoverable from the extraction. Rating legend: L: Low, M: Medium, H: High]

Agenda • Introduction • How to measure Network Availability • Network Design example • One Router vs. Two Routers • Software Dependability • Summary

The Port Method • Based on the port count in the network • Does not take port bandwidth into account, e.g. an OC-192 port and a 64k port each count as one port • Good for dedicated access service, because ports are tied to customers • Availability (%) = [(total number of ports x sample period) - (number of impacted ports x outage duration)] / (total number of ports x sample period) x 100

The Port Method Example • Network with 10,000 active access ports • An access router with 100 access ports fails for 30 minutes • Total available port-hours = 10,000 * 24 = 240,000 • Total down port-hours = 100 * 0.5 = 50 • Availability for a single day = ((240,000 - 50) / 240,000) * 100 ≈ 99.9792 %
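The port-method arithmetic can be sketched as a small function. The function name and the list-of-outages input shape are illustrative choices, not part of the original slides.

```python
def port_availability(total_ports, sample_hours, outages):
    """Port-method availability in percent.

    outages: iterable of (impacted_ports, outage_hours) pairs.
    """
    total_port_hours = total_ports * sample_hours
    down_port_hours = sum(ports * hours for ports, hours in outages)
    return (total_port_hours - down_port_hours) / total_port_hours * 100


# Slide example: 10,000 ports over one day, 100 ports down for 30 minutes.
pct = port_availability(10_000, 24, [(100, 0.5)])
print(f"{pct:.6f} %")  # 99.979167 %
```

Passing several pairs in `outages` accumulates down port-hours across multiple incidents in the same sample period.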

The Bandwidth Method • Based on the amount of bandwidth available in the network • Takes the bandwidth of ports into account • Good for core routers • Availability (%) = [(total amount of BW x sample period) - (amount of BW impacted x outage duration)] / (total amount of BW in network x sample period) x 100

The Bandwidth Method Example • Total capacity of network: 100 Gbit/s • An access router with 1 Gbit/s of bandwidth fails for 30 minutes • Total BW available in network for a day = 100 * 24 = 2,400 Gbit/s-hours • Total BW lost in outage = 1 * 0.5 = 0.5 Gbit/s-hours • Availability for a single day = ((2,400 - 0.5) / 2,400) * 100 ≈ 99.9792 %
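A minimal sketch of the bandwidth method follows. The function name is illustrative, and the two extra outages (a 64 kbit/s port and a 10 Gbit/s port) are hypothetical inputs added to show how the method weights outages by capacity.

```python
def bandwidth_availability(total_gbps, sample_hours, outages):
    """Bandwidth-method availability in percent.

    outages: iterable of (impacted_gbps, outage_hours) pairs.
    """
    total = total_gbps * sample_hours            # Gbit/s-hours offered
    lost = sum(gbps * hours for gbps, hours in outages)
    return (total - lost) / total * 100


# Slide example: 100 Gbit/s network, 1 Gbit/s lost for 30 minutes in a day.
example = bandwidth_availability(100, 24, [(1, 0.5)])
print(f"{example:.6f} %")  # 99.979167 %

# Hypothetical comparison: the same 30-minute outage on ports of
# very different size has a very different impact.
small = bandwidth_availability(100, 24, [(64e-6, 0.5)])  # 64 kbit/s port
large = bandwidth_availability(100, 24, [(10, 0.5)])     # 10 Gbit/s port
print(small > example > large)  # True
```

Under the port method both hypothetical outages would count as one port each, which is the limitation the bandwidth method addresses.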

Defects Per Million • Used in PSTN networks, defined as the number of blocked calls per one million calls, averaged over one year • DPM = [(number of impacted customers x outage duration) / (total number of customers x sample period)] x 10^6

Defects Per Million Example • Network with 10,000 active access ports • An access router with 100 access ports fails for 30 minutes • Total available port-hours = 10,000 * 24 = 240,000 • Total down port-hours = 100 * 0.5 = 50 • Daily DPM = (50 / 240,000) * 1,000,000 ≈ 208
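The DPM figure is the same unavailable fraction as in the port method, scaled to one million instead of one hundred. A short sketch, with illustrative function names:

```python
def dpm(total_units, sample_hours, outages):
    """Defects per million over the sample period.

    outages: iterable of (impacted_units, outage_hours) pairs, where a
    unit is a customer or a port.
    """
    down = sum(units * hours for units, hours in outages)
    return down / (total_units * sample_hours) * 1_000_000


def dpm_to_availability_pct(dpm_value):
    """Convert DPM to availability in percent."""
    return 100 - dpm_value / 10_000


daily = dpm(10_000, 24, [(100, 0.5)])
print(round(daily))                             # 208
print(f"{dpm_to_availability_pct(daily):.6f}")  # 99.979167
```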

Calculating Availability: Series [Figure: elements E1, E2, E3 connected in series] • Multiplicative method: As = E1 x E2 x E3 = .999999 x .999999 x .999991 = .9999890 • Additive method of UA (unavailability): .000001 + .000001 + .000009 = .0000110 • Total availability of a system (As) is always less than that of the least available element. One weak link significantly weakens this chain!
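Both series methods can be compared directly. This sketch uses the three element values from the slide; the function and variable names are illustrative.

```python
import math


def series_availability(elements):
    """Multiplicative method: product of the element availabilities."""
    return math.prod(elements)


def series_availability_additive(elements):
    """Additive method: 1 minus the sum of element unavailabilities.

    A close approximation when every unavailability is small.
    """
    return 1 - sum(1 - e for e in elements)


elements = [0.999999, 0.999999, 0.999991]
exact = series_availability(elements)
approx = series_availability_additive(elements)

print(f"{exact:.7f}")   # 0.9999890
print(f"{approx:.7f}")  # 0.9999890
print(exact < min(elements))  # True: the chain is weaker than its weakest link
```

The additive result is slightly pessimistic because it ignores the small chance of simultaneous failures, but at these availability levels the difference is far below the precision shown.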

Calculating Availability: Parallel [Figure: elements E1 and E2 connected in parallel] • For 1-out-of-2 redundancy • Additive rule: As = E1 + E2 - E1 x E2 • As = .999999 + .999999 - (.999999 x .999999) = .999999999999 • Multiplicative rule: As = 1 - [(1 - E1)(1 - E2)] • Not for parallel systems where both elements are required • Assumption is that switchover time is zero
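The two parallel rules are algebraically identical, which a short sketch can confirm. Function names are illustrative, and like the slide it assumes zero switchover time and that either element alone can carry the load.

```python
def parallel_multiplicative(e1, e2):
    """As = 1 - (1 - E1)(1 - E2): the system fails only if both fail."""
    return 1 - (1 - e1) * (1 - e2)


def parallel_additive(e1, e2):
    """As = E1 + E2 - E1*E2: the same quantity, expanded."""
    return e1 + e2 - e1 * e2


e1 = e2 = 0.999999
a_mult = parallel_multiplicative(e1, e2)
a_add = parallel_additive(e1, e2)

print(f"{a_mult:.12f}")
print(abs(a_mult - a_add) < 1e-15)
```

Two "six nines" elements in parallel yield roughly twelve nines, since the combined unavailability is the product of the individual unavailabilities.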

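System Calculation: Series • Simple E-3 network with one E-3 trunk [Figure: chain of five numbered elements, with labels Server, ATM, ATM and E-3] • Availabilities (%) shown for the five numbered elements: 99.98, 99.99, 99.992, 99.992, 99.95 • A second row of five values, 99.9959 each • Availability: 99.8835% • Yearly downtime = (1 - Availability) * 525,600 minutes/year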
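The end-to-end figure and the yearly downtime can be reproduced from the ten values on the slide. Which value belongs to which element is not recoverable from the extraction, so the two lists below are just the values as given; treating all ten as a single series chain is an assumption, although the additive result matches the slide.

```python
import math

MINUTES_PER_YEAR = 525_600

# Values in percent, as listed on the slide (element mapping unknown).
numbered_elements_pct = [99.98, 99.99, 99.992, 99.992, 99.95]
second_row_pct = [99.9959] * 5
all_pct = numbered_elements_pct + second_row_pct


def yearly_downtime_minutes(availability):
    """Yearly downtime = (1 - availability) * 525,600 minutes/year."""
    return (1 - availability) * MINUTES_PER_YEAR


# Additive method: subtract the summed unavailability from 100 %.
additive_pct = 100 - sum(100 - p for p in all_pct)

# Multiplicative method: product of the availabilities as fractions.
product_pct = math.prod(p / 100 for p in all_pct) * 100

print(f"additive:       {additive_pct:.4f} %")
print(f"multiplicative: {product_pct:.4f} %")
print(f"downtime at 99.8835 %: {yearly_downtime_minutes(0.998835):.1f} min/year")
```

The additive method gives 99.8835 %, the slide's figure; the product comes out marginally higher. At 99.8835 % the formula yields roughly 612 minutes, a little over ten hours of downtime per year.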

System Calculation: Parallel (1) • System 1 availability: 99.6341 • System 2 availability: 99.4311 • S1 & S2 network: 99.9979 • Availability, Data Centre to Customer CPE: 99.9661% [Figure: two parallel paths from the Data Centre to the customer CPE. Node and link labels: Internet Gateway, Data Centre, Server, Core, Edge, ATM Hub, CPE, E-3, STM-1, STM-16. Per-element availabilities as extracted (mapping to elements not recoverable): 99.9845, 99.9831, 99.9831, 99.9831, 99.9831, 99.9563, 99.95, 99.975, 99.8200, 99.9750, 99.9932, 99.975, 99.82, 99.82, 99.95, 99.9831, 99.9563, 99.9831, 99.9831, 99.9831, 99.9831, 99.9831]

System Calculation: Parallel (2) • System 1 availability: 99.6958 (was 99.6341) • System 2 availability: 99.4828 (was 99.4311) • Availability, Data Centre to Customer CPE: 99.9974% (was 99.9661) • 3 9's to 4 9's! [Figure: same two-path network, now with NxE-1 links. Node and link labels: Internet Gateway, Data Center, Server, Core, Edge, CPE, NxE-1, E-3, STM-1, STM-16. Per-element availabilities as extracted (mapping to elements not recoverable): 99.9845, 99.9831, 99.9831, 99.9831, 99.9831, 99.9932, 99.999, 99.975, 99.8200, 99.9850, 99.9850, 99.975, 99.82, 99.82, 99.999, 99.9831, 99.9932, 99.9831, 99.9831, 99.9831, 99.9831, 99.9831]

Router Redundancy • Typical network designs have two routers for: redundancy, capacity planning • Redundancy within routers: power supply, fans, routing engines, switching planes, forwarding plane • Do we still need two routers, or is one enough?

One Router Versus Two Routers • One router with full internal redundancy (control plane, forwarding plane, power supply, fan, line card): router availability 99.99979; link availability = router availability = 99.99979 • Two routers with no redundancy at router level (99.99015 each): link availability = parallel system availability = 99.999999 • HW cost of the two-router configuration is 110% of the one-router configuration [Figure label: OC-48 LH]
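The comparison can be checked with the parallel rule. This is a minimal sketch using the two router availabilities from the slide; it assumes independent failures and zero switchover time, and the downtime conversion uses 525,600 minutes per year.

```python
MINUTES_PER_YEAR = 525_600

single_redundant_router = 0.9999979  # 99.99979 %, full internal redundancy
basic_router = 0.9999015             # 99.99015 %, no internal redundancy

# Two basic routers in parallel: the link is down only if both are down.
two_basic_routers = 1 - (1 - basic_router) ** 2

for name, a in [("one redundant router", single_redundant_router),
                ("two basic routers", two_basic_routers)]:
    downtime = (1 - a) * MINUTES_PER_YEAR
    print(f"{name}: {a * 100:.6f} %, about {downtime:.3f} min/year")
```

In practice switchover time is not zero, so the two-router figure is an upper bound on what the parallel design delivers.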

One Router Advantages • Cost savings • Lower OPEX • Faster convergence • For some PE routers a single router might be the only option: service state is maintained on a per-flow basis for some network-based services (e.g. firewall, NAT); TDM links are usually connected to a single edge router; many customers terminate on a single router

One Router Disadvantages • Single point of failure • Configuration and upgrades have to be exact • Capacity management has to be exact • The main cost of a router is the line cards, not the chassis • What if there is a DoS attack against the router?

One Router Disadvantages • Physical maintenance (e.g. a location change) is not possible without downtime • Still need protection against link failure • Physical separation to protect against natural disasters is not possible • Networks have always been designed with two routers!

SW to HW Reliability Differences • Software reliability is not a function of manufacturing • Software does not degrade over time • Physical environmental changes have no effect • All software failures are the result of design/user errors

SW to HW Reliability Differences • Software can only be repaired by redesign • MTTR is not measurable since code must be rewritten to fix a bug. • Software bugs can be highly contagious • The science of software correctness is still immature and is difficult to apply to software as complex and quickly changing as IP routing

Summary • There is no standard way to measure IP availability • Availability in IP networks depends on the service being offered • The choice between one and two routers depends on requirements • A lot of development is under way in IP networks to improve availability, e.g. Graceful Restart, NSF, Fast Reroute
