Availability of network application services
Availability of Network Application Services. Candidacy Exam, Jong Yul Kim, February 3, 2010. Motivation: critical systems are being deployed on the internet, for example Next Generation 9-1-1, business transactions, and the smart grid. How do we enhance the availability of these critical services?


Availability of Network Application Services

Candidacy Exam

Jong Yul Kim

February 3, 2010


Motivation

  • Critical systems are being deployed on the internet, for example:

    • Next Generation 9-1-1

    • Business transactions

    • Smart grid

  • How do we enhance the availability of these critical services?


Scope

  • What can application service providers do to enhance end-to-end service availability?

  • The problems are that:

    • The underlying internet is unreliable.

    • Servers fail.


Scope

  • Level of abstraction: system design

    • On top of networks treated as clouds (though we can probe them)

    • Using servers as nodes

  • The following are important for availability, but not in scope:

    • Techniques that ISPs use.

    • Techniques to enhance availability of individual components of the system.

    • Defense against security vulnerability attacks.




The underlying internet is unreliable.

  • Symptoms of unavailability in the application

    • Web : “Unable to connect to server.”

    • VoIP : call establishment failure, call drop[2]

  • Symptoms in the network during unavailability

    • High percentage of packet loss

    • Long bursts of packet loss[1,2]

      • 23% of lost packets belong to outages[2]

    • Packet delay[2]

1 Dahlin et al. End-to-End WAN Service Availability. IEEE/ACM TON 2003.

2 Jiang and Schulzrinne. Assessment of VoIP Service Availability in the Current Internet. PAM 2003.


The underlying internet is unreliable.

  • Network availability figures

    • Paths to popular web servers: 99.6%[1]

    • Call success probability: 99.53%[2]

1 Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.

2 Jiang and Schulzrinne. Assessment of VoIP Service Availability in the Current Internet. PAM 2003.
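
To put these percentages in perspective, an availability figure can be converted into expected downtime. The sketch below is plain arithmetic rather than a method from the cited papers. The function name `downtime_hours_per_year` and the 365-day year are assumptions for illustration, and treating the call success probability as time-based availability is a simplification.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected hours of unavailability per year for a given availability (%)."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for label, pct in [("Paths to popular web servers", 99.6),
                       ("VoIP call success probability", 99.53)]:
        hours = downtime_hours_per_year(pct)
        print(f"{label}: {pct}% -> about {hours:.1f} hours/year")
```

Under these assumptions, both figures correspond to well over a day of unavailability per year.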


The underlying internet is unreliable.

  • Possible causes of these symptoms

    • Network congestion

    • BGP updates

      • 90% of lost packets near BGP updates are part of a burst of 1500 packets or longer. At least 30 seconds of continuous loss.[1]

    • Intra-domain link failure[1,2]

    • Software bugs in routers[2]

1 Kushman et al. Can you hear me now?!: it must be BGP. SIGCOMM CCR 2007.

2 Jiang and Schulzrinne. Assessment of VoIP Service Availability in the Current Internet. PAM 2003.


The underlying internet is unreliable.

  • Challenges

    • Quick re-routing around network failures

    • With limited control of the network



Resilient Overlay Network[1]

  • An overlay of cooperative nodes inside the network

  • Each node is like a router

    • It has its own forwarding table

    • Link-state algorithm used to construct map of the RON network

  • Nodes actively probe the paths between one another

    • Fully meshed

    • = Fast detection of path outages

1 Andersen et al. Resilient Overlay Networks. SOSP 2001.


Resilient Overlay Network

  • Outage defined as

    outage(r, p) = 1 when the observed packet loss rate, averaged over an interval r, is larger than p on the path.

  • “RON Win” defined as

    when the average loss rate on the Internet ≥ p, the RON loss rate < p

  • Results (r = 30 minutes)

    • RON1: 12 nodes, >64 hours

    • RON2: 16 nodes, >85 hours

1 Andersen et al. Resilient Overlay Networks. SOSP 2001.
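
The outage and “RON Win” definitions on this slide can be written as small predicates. This is a minimal sketch, assuming the probes of one interval r are available as a list of booleans (True = delivered). The function names and that data layout are illustrative and do not come from the RON paper.

```python
from typing import Sequence


def loss_rate(delivered: Sequence[bool]) -> float:
    """Fraction of probes lost within one averaging interval r."""
    if not delivered:
        raise ValueError("need at least one probe in the interval")
    return 1.0 - sum(delivered) / len(delivered)


def outage(delivered: Sequence[bool], p: float) -> int:
    """outage(r, p): 1 if the loss rate over the interval exceeds p, else 0."""
    return 1 if loss_rate(delivered) > p else 0


def ron_win(internet_loss: float, ron_loss: float, p: float) -> bool:
    """A "RON win": the Internet path is at or above p while RON stays below p."""
    return internet_loss >= p and ron_loss < p
```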


Resilient Overlay Network

  • Merits

    • Re-routes quickly around network failures

    • Good recovery from core network failures

  • Limitations

    • Does not recover from edge failures or last-hop failures

    • Active probing is expensive

    • Trades off scalability for reliability[1]

    • Clients and servers cannot use RON directly.

    • Need to have a strategy to deploy RON nodes.

    • May violate ISP contracts

1 Andersen et al. Resilient Overlay Networks. SOSP 2001.



One-Hop Source Routing

  • Main idea is to try another path through an intermediary if mine does not work.

    • “Spray and pray”

    • Issue a random packet to 4 others every 5 seconds and see if they succeed (called random-4)

1 Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.
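
The "try another path through an intermediary" idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `try_path` is a hypothetical callback standing in for an actual send attempt (with `via=None` meaning the direct path), and k = 4 mirrors the random-4 policy named on the slide.

```python
import random
from typing import Callable, Optional, Sequence


def send_with_one_hop_fallback(
    dest: str,
    intermediaries: Sequence[str],
    try_path: Callable[[Optional[str], str], bool],
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return "direct", the intermediary that worked, or None if all attempts fail."""
    rng = rng or random.Random()
    # First try the normal (direct) Internet path.
    if try_path(None, dest):
        return "direct"
    # Direct path failed: pick up to k intermediaries at random and try each one.
    candidates = rng.sample(list(intermediaries), min(k, len(intermediaries)))
    for via in candidates:
        if try_path(via, dest):
            return via
    return None
```

Because the intermediaries are chosen at random and no path state is kept, the sender needs no probing or routing table, which is what makes the scheme simple and stateless.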


One-Hop Source Routing

  • Results (two slides of result figures; not preserved in this transcript)

1 Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.

One-Hop Source Routing

  • Merits

    • Simple, stateless

    • Good for applications where users are tolerant of faults

    • Good for rerouting around core network failures

  • Limitations

    • Not always guaranteed to find alternate path

    • Cannot reroute around edge network failures

    • Intermediaries

      • How does the client find them?

      • Where to place them?

      • Will they cooperate?

1 Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.



Multi-homing

  • Use multiple ISPs to allow network path redundancy at the edge network.

Figure: Availability gains by adding best 2 and 3 ISPs based on RTT performance.

(Overlay routing achieved 100% availability during the 5-day testing period.)[1]

1 Akella et al. A Comparison of Overlay Routing and Multihoming Route Control. SIGCOMM 2004.


Multi-homing

  • Availability depends on choosing:[1]

    • Good upstream ISPs

    • ISPs that do not have path overlaps upstream

    • Hints at the need for path diversity

  • Some claim that gains are comparable to overlay routing.[2]

1 Akella et al. A Measurement-Based Analysis of Multihoming. SIGCOMM 2003.

2 Akella et al. A Comparison of Overlay Routing and Multihoming Route Control. SIGCOMM 2004.
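
A back-of-the-envelope calculation shows why path overlap matters. The sketch below assumes each ISP link fails independently, which is an idealization and not a claim from the cited papers. Shared upstream paths break that assumption and reduce the gain. The function name and the sample availabilities are hypothetical.

```python
from math import prod
from typing import Iterable


def combined_availability(link_availabilities: Iterable[float]) -> float:
    """Probability that at least one link is up, assuming independent failures."""
    return 1.0 - prod(1.0 - a for a in link_availabilities)


if __name__ == "__main__":
    # Hypothetical example: ISP links that are each available 99% of the time.
    print(f"{combined_availability([0.99, 0.99]):.4f}")
    print(f"{combined_availability([0.99, 0.99, 0.99]):.6f}")
```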



Server failures.

  • Hardware failures

    • More use of commodity hardware and components

      • “Well designed and manufactured HW: >1% fail/year.”[1]

  • Software failures

    • Concurrent programs are prevalent and so are bugs.

      • Cases of introducing another bug while trying to fix one.[2]

  • Environmental failures

    • Electricity outage

    • Operator error

1 Patterson. Recovery oriented computing: A new research agenda for a new century. Keynote address, HPCA 2002.

2 Lu et al. Learning from Mistakes – A comprehensive study on real world concurrency bug characteristics. ASPLOS 2008.



One-site Clustering

  • Three models for organizing servers in a cluster[1]

    • Cluster-based Web system: one virtual IP address exposed to client

    • Virtual Web cluster: one virtual IP address shared by all servers

    • Distributed Web system: IP addresses of servers exposed to client

1 Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Survey 2002.


Cluster-based Web System

  • One IP address exposed to client

  • Request routing

    • By the web switch

  • Server selection

    • Content-aware

    • Client-aware

    • Server-aware

1 Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Survey 2002.


Virtual Web System

  • All servers share same IP address

  • Request routing

    • All servers have same MAC address

    • Or layer 2 multicast

  • Server selection

    • By hash function

1 Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Survey 2002.
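
Because every server in this model receives every request, each server can decide locally whether a request is its own. The sketch below shows one way hash-based selection could work. The use of SHA-256 over the client address and port, and the names `owner_index` and `should_accept`, are assumptions for illustration, since the survey does not prescribe a particular hash.

```python
import hashlib


def owner_index(client_ip: str, client_port: int, num_servers: int) -> int:
    """Map a client endpoint to exactly one server index in [0, num_servers)."""
    digest = hashlib.sha256(f"{client_ip}:{client_port}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def should_accept(my_index: int, client_ip: str, client_port: int,
                  num_servers: int) -> bool:
    """Each server runs this filter; only the owner keeps the packet."""
    return owner_index(client_ip, client_port, num_servers) == my_index
```

Since all servers compute the same deterministic hash, exactly one of them accepts a given client and the rest silently drop the packet, with no coordination needed.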


Distributed Web System

  • Multiple IP addresses exposed

  • Request routing

    • Primary by DNS

    • Secondary by server

  • Server selection

    • By DNS

1 Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Survey 2002.



Improving Availability of PRESS[1]

Distributed Web Model

  • PRESS Web Server

    • Cooperative nodes

    • Serve from cache instead of disk

    • All servers have a map of where cached files are

  • Availability mechanism

    • Servers are organized in a ring structure. Each sends heartbeat to next one in the ring.

    • Three heartbeats missing: predecessor has crashed.

    • Node restarts – broadcasts its IP address to join cluster again.

1 Nagaraja et al. Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services. SC 2003.
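
The ring heartbeat rule ("three heartbeats missing: predecessor has crashed") can be sketched as a small monitor. This is a simplified illustration, not PRESS code: timestamps are passed in explicitly so the logic is easy to follow, and the class name and heartbeat interval are assumptions.

```python
class PredecessorMonitor:
    """Tracks heartbeats from the previous node in the ring."""

    def __init__(self, interval_s: float, max_missed: int = 3, now: float = 0.0):
        self.interval_s = interval_s      # expected time between heartbeats
        self.max_missed = max_missed      # three, per the slide
        self.last_heartbeat = now

    def on_heartbeat(self, now: float) -> None:
        """Record the arrival time of a heartbeat from the predecessor."""
        self.last_heartbeat = now

    def missed(self, now: float) -> int:
        """Number of whole heartbeat intervals elapsed since the last heartbeat."""
        return int((now - self.last_heartbeat) // self.interval_s)

    def predecessor_crashed(self, now: float) -> bool:
        """Declare a crash once max_missed intervals pass without a heartbeat."""
        return self.missed(now) >= self.max_missed
```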


Improving Availability of PRESS[1]

Cluster-based Web Model!

  • To enhance availability

    • Add front-end node to mask failures

      • Layer 4 switch (LVS) with IP tunneling

    • Robust group membership

    • Application-level heart beats

    • Fault Model Enforcement

      • If there’s fault, crash whole node and restart.

  • 99.5% → 99.99%

  • Cluster-based model!

1 Nagaraja et al. Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services. SC 2003.


Effects of design on service availability

[Figure not preserved; the original slide marks one design as “Best choice!”]



Reliable Server Pooling (RSerPool)

“RSerPool provides an application-independent set of services and protocols for building fault-tolerant and highly-available client/server applications.”[1]

  • Specifies behavior of registrar, server, and client

  • Two protocols designed over SCTP

    • Aggregate Server Access Protocol (ASAP)

      • Used between server and registrar, and between client and registrar

    • Endpoint haNdlespace Redundancy Protocol (ENRP)

      • Used among registrars

1 Lei et al. An Overview of Reliable Server Pooling Protocols. RFC 5351. IETF 2008.


Reliable Server Pooling (RSerPool)

Each registrar is home registrar to a set of servers. Each maintains a complete view of pools by synchronizing with each other.

  • A server adds itself to a pool by registering with a registrar. That registrar becomes its home registrar.

  • Home registrar probes server status.

  • Clients cache list of servers from registrar.

  • Clients also report failures to registrar.

Servers provide service through normal application protocol such as HTTP.


Registrars

  • Announce themselves to servers, clients, and other registrars using IP multicast or by static configuration.

  • Share all knowledge about pool with other registrars

    • All server updates (registration, re-registration, deregistration) are announced by home registrar.

    • Maintain connections to peer registrars using SCTP

    • Checksum mechanism used to audit consistency of handlespace.

  • Keep a count of unreachability reports for each server in the pool; if a threshold is reached, the server is removed from the pool.
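
The report-counting rule in the last bullet can be sketched as follows. This is an illustration only: the class name `RegistrarPool`, the default threshold of 3, and the choice to clear the counter on re-registration are assumptions, not details taken from the RSerPool documents.

```python
from collections import Counter


class RegistrarPool:
    """One registrar's view of a pool, with per-server unreachability counts."""

    def __init__(self, threshold: int = 3):  # threshold value is an assumption
        self.threshold = threshold
        self.servers = set()
        self.reports = Counter()

    def register(self, server_id: str) -> None:
        """Add a server; (re-)registration clears old reports (assumption)."""
        self.servers.add(server_id)
        self.reports[server_id] = 0

    def report_unreachable(self, server_id: str) -> bool:
        """Record one client report; return True if the server was removed."""
        if server_id not in self.servers:
            return False
        self.reports[server_id] += 1
        if self.reports[server_id] >= self.threshold:
            self.servers.discard(server_id)
            del self.reports[server_id]
            return True
        return False
```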


Fault detection mechanisms[1]

1 Dreibholz. Reliable Server Pooling – Evaluation, Optimization, and Extension of a Novel IETF Architecture. Ph.D. Thesis 2007.


Reliability mechanisms[1]

  • Registrar takeover

    • Each registrar already has a complete view

    • Registrars volunteer to take over the failed one

    • Conflict resolved by comparing registrar ID (lowest one wins)

    • Send probes to servers under failed registrar

1 Dreibholz. Reliable Server Pooling – Evaluation, Optimization, and Extension of a Novel IETF Architecture. Ph.D. Thesis 2007.


Reliability mechanisms[1]

  • Server failover

    • Uses client-side state-keeping mechanism

      • A state cookie is sent to the client via control channel. The cookie serves as a checkpoint.

      • When server fails, client connects to a new server and sends the state cookie. The new server picks up from that state.

      • Uses ASAP session layer between client and server.

      • Data and control channels are multiplexed over a single connection to enforce the correct order of transaction and cookie.

1 Dreibholz. Reliable Server Pooling – Evaluation, Optimization, and Extension of a Novel IETF Architecture. Ph.D. Thesis 2007.
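
The client-side state-keeping idea can be illustrated with an in-process simulation. Everything here is hypothetical scaffolding rather than the ASAP session layer: `CountingServer` is a toy service that sums numbers, the cookie is a JSON blob, and `fail_after` only exists to simulate a crash.

```python
import json
from typing import Callable, Optional, Sequence


class CountingServer:
    """Toy pool element: sums values and returns a state cookie after each one."""

    def __init__(self, cookie: Optional[bytes] = None,
                 fail_after: Optional[int] = None):
        # Resume from the checkpoint in the cookie, or start from scratch.
        state = json.loads(cookie) if cookie else {"total": 0, "next_index": 0}
        self.total = state["total"]
        self.next_index = state["next_index"]
        self.handled = 0
        self.fail_after = fail_after  # simulate a crash after this many requests

    def process(self, value: int) -> bytes:
        if self.fail_after is not None and self.handled >= self.fail_after:
            raise ConnectionError("simulated server failure")
        self.handled += 1
        self.total += value
        self.next_index += 1
        state = {"total": self.total, "next_index": self.next_index}
        return json.dumps(state).encode()


def run_job(values: Sequence[int], first_server: CountingServer,
            make_server: Callable[[Optional[bytes]], CountingServer]) -> int:
    """Client loop: keep the latest cookie and hand it to a new server on failure."""
    server, cookie, i = first_server, None, 0
    while i < len(values):
        try:
            cookie = server.process(values[i])
            i = json.loads(cookie)["next_index"]
        except ConnectionError:
            server = make_server(cookie)  # new server picks up from the checkpoint
    return server.total
```

The server keeps no durable session state of its own in this sketch. The client carries the checkpoint, so any pool member that understands the cookie can continue the session.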


Evaluation of RSerPool

  • Merits

    • Includes the client in system design

    • Every element keeps an eye on each other

    • There are mechanisms for registrar takeover and server failover

    • Multiple layers of failure detection. (e.g. SCTP level and application level.)

  • Limitations

    • The client doesn’t do much to help when unavailability problems occur

    • Too much overhead in maintaining consistent view of the pools. Not scalable to many servers.

    • Use of IP multicast confines deployment of registrars and servers to multicast domain. Most likely, clients would have to be statically configured.

    • Assumes that client connectivity problem is a server problem. Servers at one site will all get dropped if there’s a big network problem.



Distributed Clusters

  • Distributed clusters

    • Intrinsically has both network path and server redundancy

    • How can we utilize these redundancies?

  • Study of availability techniques of

    • Content Distribution Networks

    • Domain Name System



Akamai vs. Limelight[1]

  • Akamai

    • Philosophy: go closer to client

      • Scatter small clusters all over the place

    • Scale: 27,000 servers, 65 countries, 656 ASes

    • Two-level DNS

  • Limelight

    • Philosophy: go closer to ISPs

      • Large clusters at few key locations near many ISPs

    • Scale: 4,100 servers, 18 countries, has own AS

    • Flat DNS, uses anycast

1 Huang et al. Measuring and Evaluating Large-Scale CDNs. IMC 2008.


Akamai vs. Limelight

  • Methodology

    • Connect to port 80 once every hour.

    • Failure = two consecutive connection errors.

  • Results

    • Server and cluster availability are higher for Limelight.

  • But service availability may be different!

1 Huang et al. Measuring and Evaluating Large-Scale CDNs. IMC 2008.
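
The measurement rule above (one TCP connection attempt per hour, with a failure declared after two consecutive errors) can be sketched as below. This is not the study's tooling: the 5-second timeout and the decision to count each run of consecutive errors as a single failure event are assumptions.

```python
import socket
from typing import Sequence


def probe(host: str, port: int = 80, timeout_s: float = 5.0) -> bool:
    """One probe: True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def count_failures(probe_ok: Sequence[bool]) -> int:
    """Count failure events: runs of two or more consecutive failed probes."""
    failures = 0
    run = 0
    for ok in probe_ok:
        if ok:
            run = 0
        else:
            run += 1
            if run == 2:  # count each run once, when it reaches two errors
                failures += 1
    return failures
```

Requiring two consecutive errors filters out isolated connection glitches, at the cost of missing outages shorter than the probing interval.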


Akamai Failure Model

  • Failure model

    “We assume that a significant and constantly changing number of components or other failures occur at all times in the network.”[1]

    • Components: link, machine, rack, data center, multi-network…

    • This leads to having small clusters scattered across diverse ISPs and geographic locations.

1 Afergan et al. Experience with some Principles for Building an Internet-Scale Reliable System. USENIX WORLDS 2005.


Akamai: scalability and availability

  • Use two-level DNS to direct clients to server

    • Top level directs clients to a region e.g. g.akamai.net

    • Region resolves lower level queries.

    • Top level returns low-level name servers in multiple regions

    • Low level returns short DNS TTL (20 seconds)

    • Servers use ARP to take over for a failed server

  • Use internet for inter-cluster communication

    • Uses multi-path routing that’s directed by SW logic = overlay network?

1 Afergan et al. Experience with some Principles for Building an Internet-Scale Reliable System. USENIX WORLDS 2005.



DNS Root Servers[1,2]

  • Redundancy

    • Redundant hardware that takes over failed one with or without human intervention

      • At least 3 recommended, with one in a remote site[3]

    • Backups of the zone file stored at off-site locations

    • Connectivity to the internet

  • Diversity

    • Geographically located in 130 places in 53 countries

      • Topological diversity matters more

    • Hardware, software, operating system of servers

    • Diverse organizations, personnel, operational processes

    • Distribution of zone files within root server operator

1 Bush et al. Root Name Server Operational Requirements. RFC 2870. IETF 2000.

2 http://www.icann.org/en/committees/security/dns-security-update-1.htm

3 Elz et al. Selection and Operation of Secondary DNS Servers. RFC 2182. IETF 1997.


The use of anycast for availability[1]

  • Basic anycast

    • Announce identical IP address

    • Routing system takes client request to closest node

  • Hierarchical anycast

    • Global vs. local nodes

    • If any node fails, stop announcement

    • Global node takes over automatically

1 Abley, Hierarchical Anycast for Global Service Distribution. ISC Technical Note 2003-1. 2003.
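
The failover behavior of hierarchical anycast can be modeled in a few lines. This is a toy model under stated assumptions, not how BGP actually selects routes: the `hops` distances, region names, and the rule that a local node's announcement is only visible inside its own region are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AnycastNode:
    name: str
    scope: str                 # "global" or "local"
    region: str                # where a local announcement is visible
    hops: Dict[str, int]       # hypothetical routing distance per client region
    announcing: bool = True    # a failed node stops announcing the prefix


def route(client_region: str, nodes: List[AnycastNode]) -> Optional[str]:
    """Pick the closest node whose announcement reaches the client."""
    visible = [
        n for n in nodes
        if n.announcing and (n.scope == "global" or n.region == client_region)
    ]
    if not visible:
        return None
    return min(visible, key=lambda n: n.hops[client_region]).name
```

When a local node withdraws its announcement, clients in its region fall back to the nearest global node without any change on the client side.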


Is anycast good for everyone?[1]

  • Not really…

  • Packets for long sessions may go to another node if the routing dynamics change

    • Service time and stability of routing

  • A lot of routing considerations

    • Aggregated prefixes

    • Multiple services from a prefix

    • Consideration of route propagation radius

1 Abley and Lindqvist, Operation of Anycast Services. RFC 4786. IETF 2006.



Conclusion

  • Techniques presented here attempt to recover from either network failure or server failure through redundancy

    • There is redundancy in network path

    • Server redundancy can be handled

  • An application service provider may have to use a combination of these techniques.


Conclusion

  • CDNs and DNS have worked well so far…

    • Can these be applied to other applications?

  • System designed for availability is only as good as the failure model.

  • Further research on availability

    • Few techniques

    • Not so much about availability in literature.

