RON: Resilient Overlay Networks

Presentation Transcript


  1. RON: Resilient Overlay Networks David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris http://nms.lcs.mit.edu/ron/

  2. Overlay Networks • An overlay network is a computer network built on top of another network. • Nodes are connected by virtual links. • Each virtual link corresponds to a path in the underlying network • What if you want to experiment with a new routing protocol? • What if you want a network that provides new capabilities valuable to some peers and applications? • Overlay networks are not new • Gnutella, Chord, Pastry, Kelips, VPNs… • RON is an overlay on top of the underlying Internet

  3. Why RON? • BGP scales well, but is not fault-tolerant • Detailed information exists only inside ASes; information between ASes is filtered and summarized • Some links are invisible, preventing BGP from making a good decision • BGP’s fault recovery takes many minutes • 3 minutes minimum detection + recovery time; often 15 minutes (Labovitz 97-00) • 40% of outages took 30+ minutes to repair (Labovitz 97-00) • 5% of faults last more than 2.75 hours (Chandra 01) • Link outages are common • 10% of routes available < 95% of the time (Labovitz 97-00) • 65% of routes available < 99.9% of the time (Labovitz 97-00) • 3.3% of all routes had serious problems (Paxson 95-97) • Route selection in BGP uses fixed and simple metrics

  4. Motivation: Network Redundancy • Multiple paths exist between most hosts • Many paths are hidden due to private peering • Indirect paths may offer better performance • Non-transitive reachability • A and C can’t reach each other but B can reach them both • Try to exploit redundancy in underlying Internet

  5. Motivation: Network Redundancy

  6. RON’s Goal • Fast failure detection and recovery • In seconds (reduced by a factor of 10) • Tighter integration with applications • What is fatal for one application may be acceptable for another • Optimize routes for latency, throughput, etc. • Expressive policy routing • Fine-grained policy specification • e.g. keep commercial traffic off Internet2

  7. What can RON do? • Videoconferencing • Multi-person collaboration • Virtual Private Networks (VPNs) across the public Internet • Branch offices of companies

  8. What does RON do? • Small network: 3-50 nodes • Continuous measurement of each pair-wise virtual link • A trade-off between scalability and recovery efficiency • Compute path properties • Based on different metrics, e.g. latency, loss rate… • Pick the best path out of the direct and indirect ones (a sketch follows below) • One indirect hop is enough • Forward traffic over that path
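
The path-selection step above can be made concrete with a small sketch. This is not the paper's code, just a minimal illustration assuming RON has already gathered a pairwise latency matrix from its probes; the function and variable names are invented for the example.

```python
def best_path(src, dst, latency, nodes):
    """Pick the lower-latency option between the direct path and every
    single-intermediate overlay path (one indirect hop is enough in RON)."""
    best = ([src, dst], latency[src][dst])             # the direct Internet path
    for hop in nodes:
        if hop in (src, dst):
            continue
        indirect = latency[src][hop] + latency[hop][dst]
        if indirect < best[1]:
            best = ([src, hop, dst], indirect)
    return best

# Example: routing via B beats a congested direct A -> C link.
latency = {"A": {"C": 180, "B": 20}, "B": {"C": 30}}
print(best_path("A", "C", latency, ["A", "B", "C"]))   # (['A', 'B', 'C'], 50)
```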

  9. Design • A set of APIs used by a RON client to interact with RON • A router that computes the forwarding tables (link-state dissemination through RON) • A forwarding path that receives and sends packets, asks the router for the best path, and also passively probes for performance data (sketched below)
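
As a rough illustration of how these pieces fit together, here is a hypothetical skeleton. The class and method names are assumptions made for this sketch, not the actual RON API.

```python
class Router:
    """Runs link-state dissemination over the RON and builds the forwarding tables."""
    def __init__(self):
        self.forwarding_table = {}                  # (destination, policy) -> next overlay hop
    def lookup(self, dst, policy="default"):
        return self.forwarding_table.get((dst, policy), dst)   # fall back to the direct path

class Forwarder:
    """Sends and receives packets, asks the router for the best path, and passively
    records performance data from the traffic it carries."""
    def __init__(self, router):
        self.router = router
        self.samples = []                           # passive latency/loss observations
    def send(self, packet, dst, policy="default"):
        next_hop = self.router.lookup(dst, policy)
        self.samples.append((dst, next_hop))        # in a real node: timestamp, size, outcome
        return next_hop                             # in a real node: encapsulate and transmit
```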

  10. Failure Detection • Active monitoring: send probes on each virtual link • Probe interval: 12 seconds • Probe timeout: 3 seconds • Routing update interval: 14 seconds • Passive measurement • Detect failure in under 20s • Faster than any TCP timeout
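
A minimal sketch of the probing loop, using the timing constants from this slide. The quick re-probing after a loss is how detection stays well under 20 seconds on average; the transport (send_probe) is left as a stub, and the loss threshold is an assumption for illustration.

```python
import time

PROBE_INTERVAL = 12   # seconds between probes on each virtual link
PROBE_TIMEOUT  = 3    # seconds before a probe is considered lost

def monitor_link(peer, send_probe, losses_for_outage=4):
    """Probe every 12 s; after a loss, re-probe back-to-back (each attempt is
    bounded by the 3 s timeout), so an outage is declared in under ~20 s on average."""
    losses = 0
    while True:
        ok = send_probe(peer, timeout=PROBE_TIMEOUT)   # stub: returns True on a reply
        if ok:
            losses = 0
            time.sleep(PROBE_INTERVAL)                 # normal low-rate probing
        else:
            losses += 1
            if losses >= losses_for_outage:
                return "link down"                     # the router now routes around this link
            # no sleep here: the probe timeout itself spaces the retries
```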

  11. Policy Routing • RON allows users or administrators to define the types of traffic allowed on particular links • Traditionally, routing is based on destination and source addresses, but RON allows for routing based on other information • Router computes a forwarding table for each policy • Packets classified with policy tag and routed accordingly
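
A hedged sketch of tag-based policy routing: one forwarding table per policy, a classifier that assigns a tag, and a lookup keyed on (policy, destination). The policy names, table contents, and classifier rule are purely illustrative.

```python
FORWARDING = {
    "default":      {"siteB": "siteB"},           # direct path is acceptable
    "no-internet2": {"siteB": "commercial-hop"},  # keep commercial traffic off Internet2
}

def classify(packet):
    """Assign a policy tag from packet metadata (here: a toy sender-based rule)."""
    return "no-internet2" if packet.get("commercial") else "default"

def route(packet, dst):
    policy = packet.setdefault("policy", classify(packet))   # classify once, tag the packet
    return FORWARDING[policy][dst]                            # per-policy forwarding table

print(route({"commercial": True}, "siteB"))   # -> 'commercial-hop'
```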

  12. RON overhead • Probe packets: 69 bytes • Probing and routing state traffic grows as O(N²) • Size is therefore restricted: typically one node per site • To achieve 12-25 second recovery, the overhead is reasonable: about 10% of the bandwidth of a broadband Internet connection (a rough calculation follows below)
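
A back-of-the-envelope check of these numbers for a 50-node RON. The 69-byte probe and the 12/14-second intervals come from the slides; the assumed routing-update size is an illustration only, chosen to show how the O(N²) state traffic dominates.

```python
N = 50
probe_bytes, probe_interval = 69, 12
update_bytes, update_interval = 60 + 20 * N, 14   # assumed size of a link-state update

probe_bps  = (N - 1) * probe_bytes * 8 / probe_interval    # outbound probing, per node
update_bps = (N - 1) * update_bytes * 8 / update_interval  # routing updates, per node
print(round(probe_bps / 1000, 1), "kbit/s of probing")            # ~2.3 kbit/s
print(round((probe_bps + update_bps) / 1000, 1), "kbit/s total")  # ~32 kbit/s, i.e. roughly
# 10% of a few-hundred-kbit/s broadband uplink of that era
```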

  13. Experiments • Real-world deployment of RON at several Internet sites • RON1: 12 hosts in the US and Europe • 64 hours of measurements in March 2001 • RON2: 16 hosts • 85 hours of measurements in May 2001 • In RON1, the outage detection and path selection mechanisms successfully routed around 100% of outages; RON2 achieved a 60% success rate

  14. Experiments: Loss Rate • Measured in RON1 • Loss rates averaged over 30-minute intervals • The samples measure unidirectional loss • In some cases RON does worse because the RON router uses bidirectional information to optimize unidirectional loss rates

  15. Experiments: Latency • Measured in RON1 • 5-minute average latencies • Shown as a CDF

  16. Experiments: Throughput • Measured in RON1 • 2,035 samples in total • 5% of samples saw their throughput doubled, while 1% received less than 50% of the direct path's throughput

  17. Experiments: Flooding Attack • Run on the Utah Network Emulation Testbed • The attack begins at the 5th second • RON takes about 13 seconds to reroute the connection • Shown as a receiver-side TCP sequence trace

  18. Drawbacks • NAT • Naming: cache a “reply to” address/port pair • If two hosts are both behind NATs: treat the path as an outage and attempt to route around it • Violation of AUPs and BGP transit policies • Since RONs are small, this can be resolved at an administrative level

  19. Thoughts • RON deals with failure recovery and lets the Internet focus on scalability • The authors provide implementation details for their idea • Overlay networks are used to solve the path failure detection and recovery problem • An overlay is to the network what a virtual machine is to the computer

  20. Discussions • Is RON scalable? How many nodes can be in a RON? • What is the downside of fine-grained policy routing? Can it work on ordinary PCs? • What happens if many overlay networks are built on top of the Internet? • How does node distribution affect performance? • Why does RON increase the average latency on some fast paths? • Overhead • Why use link-state rather than distance-vector routing to build the routing tables? • Speed of convergence • Size

  21. DHT Distributed Hash Table

  22. Flash Back • DHT = Distributed Hash Table • A distributed service that provides hash table semantics • Two fundamental operations: Insert, Lookup • Performance concerns • Operation complexity • Load balance • Locality • Maintenance • Fault tolerance • Scalability
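
To make the two operations concrete, here is a toy, single-node stand-in with hash-table semantics; a real DHT spreads the same table across many nodes, which is exactly what the systems on the following slides do. The class and values are invented for the example.

```python
import hashlib

class ToyDHT:
    """Single-node stand-in: same insert/lookup interface, no distribution."""
    def __init__(self):
        self.table = {}
    def _key(self, name):
        return hashlib.sha1(name.encode()).hexdigest()   # uniform, consistent key space
    def insert(self, name, value):
        self.table[self._key(name)] = value
    def lookup(self, name):
        return self.table.get(self._key(name))

dht = ToyDHT()
dht.insert("song.mp3", "198.51.100.7")
print(dht.lookup("song.mp3"))   # -> '198.51.100.7'
```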

  23. Flash Back • Napster • O(1) message for every operation • Server becomes single point of failure • Gnutella • O(N) lookup latency • Chord • O(log(N)) lookup latency • Stabilization protocol not efficient

  24. Pastry • Some similarities between Pastry and Chord • Uses SHA-1 to generate node ids and message key values • A file is stored at the node whose id is closest to its key • Data lookup and insertion take O(log(N)) time • Some aspects in which Pastry outperforms Chord • Provides better locality than Chord • Better membership maintenance

  25. Design of Pastry • Think of ids and keys as a series of digits in base 2^b • In each routing step, the message gets one digit closer to the destination • Each Pastry node maintains three tables: • Routing table • Neighborhood set • Leaf set (a digit-manipulation sketch follows below)
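
A small sketch of the digit view of ids, assuming b = 4 and 32-bit ids for readability (real Pastry ids are 128 bits); the helper names are invented for the example.

```python
B = 4                                    # digit size in bits, so digits are hex characters
DIGITS = 32 // B                         # a 32-bit id has 8 such digits

def digits(node_id):
    """Most-significant-first base-2^b digits of an integer id."""
    return [(node_id >> (B * (DIGITS - 1 - i))) & (2**B - 1) for i in range(DIGITS)]

def shared_prefix_len(a, b):
    """How many leading digits two ids have in common (this drives routing progress)."""
    da, db = digits(a), digits(b)
    n = 0
    while n < DIGITS and da[n] == db[n]:
        n += 1
    return n

print(digits(0x65A1FC04)[:4])                     # [6, 5, 10, 1]
print(shared_prefix_len(0x65A1FC04, 0x65B2CA11))  # 2 leading digits in common
```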

  26. Pastry – Routing Table • A log_{2^b}(N) × 2^b table • The (i,j) entry is a node that: • shares the same first (i-1) digits with the present node's id • has j as the ith digit of its id • Typical value of b is 4 • Resembles Chord's finger table

  27. Pastry – Neighbor Set & Leaf Set • Neighbor set M • |M| nodes that are closest (according to the proximity metric) to the host • Used to maintain locality properties • Leaf set L • |L| nodes whose ids are closest to the host's • Divided into two sets: |L|/2 nodes with ids greater than the host's and |L|/2 nodes with ids smaller than the host's • Resembles the successor/predecessor lists in Chord • Typical values of |L|, |M|: 2^b or 2×2^b

  28. Chord • The kth finger-table element differs from the node's id only in the last k bits => resembles Pastry's routing table • The successor list holds the nodes numerically closest to the node => resembles Pastry's leaf set

  29. Pastry

  30. Pastry – Routing • Given a message with key k: • Check whether k is covered by the range of the leaf set • If so, forward to the proper node and we are done • Otherwise, check the routing table for a node that is one digit closer to k • When both checks fail, forward to the numerically closest node among the routing table, leaf set, and neighbor set • Case 3 is unlikely • 2% when |L| = 2^b, 0.6% when |L| = 2×2^b • Expected number of routing steps is O(log_{2^b}(N)) (a condensed sketch follows below)
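
A condensed sketch of the three cases above. It assumes ids and keys are tuples of base-2^b digits, the leaf set and neighbor set are lists of such ids, and the routing table is a dict keyed by (row, digit); this is an approximation of the algorithm, not the paper's implementation.

```python
def route(key, my_id, leaf_set, routing_table, neighbors):
    """Return the next hop for `key`, following Pastry's three routing cases."""
    def to_int(d):                       # an id (tuple of hex digits) as an integer
        return int("".join(f"{x:x}" for x in d), 16)
    def prefix(a, b):                    # shared-prefix length in digits
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    # Case 1: key lies within the numeric range covered by the leaf set
    candidates = leaf_set + [my_id]
    if min(map(to_int, candidates)) <= to_int(key) <= max(map(to_int, candidates)):
        return min(candidates, key=lambda n: abs(to_int(n) - to_int(key)))

    # Case 2: a routing-table entry that shares one more digit with the key
    p = prefix(key, my_id)
    entry = routing_table.get((p, key[p]))        # row p, column = next digit of key
    if entry is not None:
        return entry

    # Case 3 (rare): any known node numerically closer that shares >= p digits
    known = leaf_set + list(routing_table.values()) + neighbors
    closer = [n for n in known
              if prefix(key, n) >= p
              and abs(to_int(n) - to_int(key)) < abs(to_int(my_id) - to_int(key))]
    return min(closer, key=lambda n: abs(to_int(n) - to_int(key))) if closer else my_id
```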

  31. Pastry – Node Join • Suppose X wants to join and it knows a nearby node A • A uses the routing protocol to find Z, the node whose id is numerically closest to X's • Every node on the route from A to Z sends its routing table to X • X constructs its routing table from the received ones • X uses the leaf set of Z and the neighbor set of A as its own • X informs every node that appears in its tables

  32. Pastry – Node Leave • Passive failure detection: no heartbeat! • A failure is discovered only when a node tries to send a message to the failed node • If the leaving node is in the leaf set • Ask the largest/smallest node in the leaf set for a replacement • If the leaving node is in the routing table • Ask nodes in the same row for a replacement • If the leaving node is in the neighbor set • Not specified, but this can easily be done by asking other nodes in the neighbor set

  33. Pastry -- Locality • A node tends to select nearby nodes to put into its routing table • A joining node X gets its routing table from nodes on the route from A to Z, which are also close to X • To further improve locality, X obtains routing tables from its neighbors and updates its own

  34. Pastry – Locality • There are fewer and fewer candidate nodes as the level number goes up • The optimal distance therefore increases as the level number goes up • Let the route from A to Z be A -> B_1 -> B_2 -> … -> Z • B_i will be a reasonable choice for the ith row of X, since it is in the (i-1)th row of B_{i-1}

  35. Pastry – Experiment on efficiency b = 4, |L| = 16, |M| = 32, 200,000 lookups

  36. Pastry – Experiment on efficiency b = 4, |L| = 16, |M| = 32, N = 100,000, and 200,000 lookups

  37. Pastry – Experiment on locality b = 4, |L| = 16, |M| = 32, and 200,000 lookups

  38. Pastry – Experiment on locality SL: no locality WT: no 2nd stage WTF: with 2nd stage b = 4, |L| = 16, |M| = 32, N = 5,000

  39. Pastry – Experiment on fault tolerance b = 4, |L| = 16, |M| = 32, N = 5,000 with 500 failing nodes; averages are taken only over affected queries

  40. Kelips • O(1) file lookup cost • O(√N) storage per node • Uses gossip to implement multicast • Very good resistance to node failures • Stores only file metadata

  41. Kelips • The nodes are divided evenly into √N affinity groups • SHA-1 is used to obtain a node's id • The id, taken modulo the number of affinity groups, determines which group to join • Every node maintains the following tables • Affinity group view: the set of nodes in the same affinity group • Contacts: a constant-sized (2 in the implementation) set of nodes in every other affinity group • Filetuples: a set of tuples representing every file stored in the affinity group (see the sketch below)
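
A sketch of the per-node soft state and the hash-based group assignment, assuming k = √N groups; the field and function names are illustrative.

```python
import hashlib, math

def affinity_group(name, n_nodes):
    """Map a node (or file) name into one of k = round(sqrt(N)) affinity groups."""
    k = round(math.sqrt(n_nodes))
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return h % k                                  # group index in [0, k)

node_state = {
    "group_view": {},    # members of my own affinity group -> last heartbeat time
    "contacts":   {},    # other group index -> a couple of contact nodes in that group
    "filetuples": {},    # filename homed in my group -> IP of the node storing it
}
print(affinity_group("node-17.example.org", 1500))   # a group index in [0, 39) for N = 1500
```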

  42. Kelips – An Example

  43. Kelips – Insertion • Insertion: • Obtain the hash value of the file and find the corresponding affinity group • Send the request to the closest contact of that affinity group • The contact randomly picks a member in the affinity group to store the metadata • The storing node uses gossiping to disseminate the message • O(1) complexity, O(log N) for gossiping
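
Seen from the requesting node, the insertion steps above might look like this sketch; the transport (send) and the contact's random choice of a group member are stubbed out, and the names are illustrative.

```python
import hashlib, math

def home_group(filename, n_nodes):
    """Same hash-based mapping as for nodes: the file's home affinity group."""
    k = round(math.sqrt(n_nodes))
    return int(hashlib.sha1(filename.encode()).hexdigest(), 16) % k

def insert(filename, my_ip, n_nodes, contacts, send):
    g = home_group(filename, n_nodes)         # step 1: hash the file to its home group
    contact = contacts[g][0]                  # step 2: a contact of that affinity group
    send(contact, ("INSERT", filename, my_ip))
    # steps 3-4 happen remotely: the contact picks a random member of group g, which
    # stores the (filename, my_ip) tuple and gossips it within the group
```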

  44. Kelips – Data Lookup • Data lookup: • Obtain the hash value of the file and find the corresponding affinity group • Send the request to the closest contact of that affinity group • The contact looks up its filetuple table and returns the IP of the node holding the file • Costs 2 message transmission times (see the sketch below)
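
The contact's side of a lookup is just a local table consultation, which is why the whole operation costs two message transmissions; the table contents below are made up for the example.

```python
# Soft state at the contact: filename -> IP of the node that stores the file
filetuples = {"song.mp3": "198.51.100.7"}

def handle_lookup(filename):
    """Message 1 arrives here; the return value is message 2, the reply."""
    return filetuples.get(filename)           # None if the tuple expired or never existed

print(handle_lookup("song.mp3"))              # the requester then fetches from 198.51.100.7
```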

  45. Kelips – Maintenance • The affinity group view, contacts, and filetuples all require heartbeats to keep from expiring • Gossiping is used to send heartbeats: • Gossip messages consist of a fixed amount of recently received information • The information may need to be divided into several packets • In every round, a node randomly chooses a constant-sized set of peers to forward information to • To improve locality, a node tends to choose nodes that are close to it • Incurs constant traffic per node • Node joins/leaves are trivial to handle (a gossip-round sketch follows below)
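
A minimal sketch of one gossip round under these rules; the fan-out and payload bound are illustrative constants, not the paper's tuned values.

```python
import random

FANOUT, MAX_ITEMS = 2, 10                      # illustrative constants

def gossip_round(my_entries, peers, send):
    """my_entries: list of (key, heartbeat_time) pairs, newest first."""
    payload = my_entries[:MAX_ITEMS]           # bounded amount of the freshest information
    for peer in random.sample(peers, min(FANOUT, len(peers))):
        send(peer, payload)                    # constant per-round traffic per node

def merge(local, incoming):
    """Receiver side: keep the newer heartbeat for each key so entries don't expire."""
    for key, heartbeat in incoming:
        if heartbeat > local.get(key, 0):
            local[key] = heartbeat
    return local
```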

  46. Kelips – Experiment on Load Balance N = 1500 with 38 affinity groups Load balance is particularly important in Kelips

  47. Kelips – Experiment on Insertion N = 1000 with 30 affinity groups Note: No failure at all

  48. Kelips -- Experiment N = 1000 with 30 affinity groups 500 nodes are deleted at time t = 1300

  49. Comparison and Thoughts • Pastry requires O(log(N)) storage and O(log(N)) lookup complexity • Passive failure detection • Saves bandwidth, but may not cope with frequent node joins/leaves • Security: what if some nodes are malicious? • Keep redundant routing tables and choose randomly among them • Replicate data among numerically nearby nodes • Kelips uses O(√N) storage and O(1) lookup complexity • Sacrifices memory for efficiency • May not scale to millions of nodes • The system size must be known in advance • Not a serious issue if the size doesn't change dramatically • Adaptive membership maintenance • Security • We can replicate metadata, but that uses more bandwidth
