
Self-healing in Routing: Failure Analysis, and Improvements

Qi Li, Tsinghua University, Aug. 28, 2008

Presentation Transcript


  1. Self-healing in Routing: Failure Analysis, and Improvements
     Qi Li, Tsinghua University, Aug. 28, 2008

  2. Outline
     • Problem Statement
     • Analysis of Self-Healing Routing
     • Existing Improvement Solutions
     • Our Self-Healing Solution
     • Conclusion and Future Work
     AsiaFI, Student Workshop

  3. Problem Statement
     • Routing (intra- and inter-domain) is a critical element of the Internet infrastructure.
     • How robust is it against large-scale failures and attacks?
       • Cisco routers caused a major outage in Japan in 2007.
       • An earthquake in Taiwan caused undersea cable damage in 2006.
     • We need to improve it, but how?

  4. Internet Routing
     • Not a homogeneous network
       • A network of autonomous systems (ASes)
       • Each AS is under the control of an ISP.
       • Large variation in AS sizes: a typical heavy-tailed distribution.
     • Inter-AS routing
       • Border Gateway Protocol (BGP), a path-vector protocol.
       • Serious scalability and recovery issues.
     • Intra-AS routing
       • Several algorithms; they usually work fine.
       • Central control, smaller network, …

  5. Measurements – Prefix Growth
     • Routing table sizes grow 2x faster than real growth.
     • One (conservative) analysis predicts 2M entries in 10 years.

  6. Measurements – BGP Updates

  7. Distribution of Updates – Main Observation
     • Most of the network is very stable.
     • Parts of the network are very unstable.
     • Everybody pays for the instability.
     • The problem is getting worse.

  8. Routing Failure Causes
     • Large-area router or link damage (e.g., an earthquake)
     • Large-scale failure due to a buggy software update
     • High-bandwidth cable cuts
     • Router configuration errors
       • Aggregation of large un-owned IP blocks; this happens when prefixes are aggregated for efficiency.
       • Incorrect policy settings resulting in large-scale delivery failures
     • Network-wide congestion (DoS attack)
     • Malicious route advertisements via worms

  9. Outline
     • Problem Statement
     • Analysis of Self-Healing Routing
     • Existing Improvement Solutions
     • Our Self-Healing Solution
     • Conclusion and Future Work

  10. Existing Routing Protocols
     • Normal process of IP-based self-healing routing
       • Failure detection
       • Failure notification
       • Forwarding path re-computation
     • Recovery time of existing routing protocols
       • RIP: hundreds of seconds; suffers from count-to-infinity.
       • OSPF: tens of seconds.
       • BGP: several minutes or longer; may fail to converge because of policy conflicts.
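The re-computation step above can be illustrated with a minimal sketch. It assumes a link-state style router that already knows the failed link (detection and notification are taken as done) and simply reruns Dijkstra. The topology, the link costs, and the names `shortest_paths` and `without_link` are hypothetical, chosen only for illustration.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra: return (dist, next_hop) dictionaries as seen from source."""
    dist = {source: 0}
    next_hop = {}
    heap = [(0, source, None)]  # (cost, node, first hop used to reach node)
    done = set()
    while heap:
        cost, node, first = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if first is not None:
            next_hop[node] = first
        for nbr, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if nbr not in done and new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                # Neighbors of the source are their own first hop.
                heapq.heappush(heap, (new_cost, nbr, nbr if first is None else first))
    return dist, next_hop


def without_link(graph, a, b):
    """Return a copy of graph with the bidirectional link a-b removed."""
    return {
        n: {m: w for m, w in nbrs.items() if {n, m} != {a, b}}
        for n, nbrs in graph.items()
    }


# Hypothetical four-router topology with symmetric link costs.
graph = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 2},
    "D": {"B": 1, "C": 2},
}

dist_before, hops_before = shortest_paths(graph, "A")
dist_after, hops_after = shortest_paths(without_link(graph, "B", "D"), "A")
print("before failure:", hops_before["D"], dist_before["D"])
print("after failure: ", hops_after["D"], dist_after["D"])
```

The sketch covers only the last of the three steps. In a real network the convergence times listed above are dominated by detection timers, flooding or update propagation, and (for BGP) path exploration, none of which is modeled here.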

  11. The State Transition under Failure
     • A simple state transition model for analyzing routing convergence.

  12. The Problems of Transient Failures
     • Routing blackhole
       • Traffic is silently dropped without informing the source that the data did not reach its intended recipient.
     • Routing loop
       • The path to a particular destination forms a loop.
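Both transient problems can be seen by walking per-router forwarding tables hop by hop. The following is a minimal sketch under assumed data: `fibs` maps each router to a `{destination: next_hop}` table, and the function name `trace` and the three example tables are hypothetical.

```python
def trace(fibs, src, dst):
    """Follow next hops from src toward dst and classify the outcome."""
    path = [src]
    visited = {src}
    node = src
    while node != dst:
        nxt = fibs.get(node, {}).get(dst)
        if nxt is None:
            # No route at this hop: the packet is silently dropped.
            return "blackhole", path
        path.append(nxt)
        if nxt in visited:
            # The packet returns to a router it has already passed.
            return "loop", path
        visited.add(nxt)
        node = nxt
    return "delivered", path


# Consistent tables: A forwards to B, B delivers to D.
stable = {"A": {"D": "B"}, "B": {"D": "D"}}
# Transient state: B has already switched back toward A, A still points at B.
looping = {"A": {"D": "B"}, "B": {"D": "A"}}
# Transient state: B has withdrawn its route and has no replacement yet.
dropping = {"A": {"D": "B"}, "B": {}}

for name, fibs in [("stable", stable), ("looping", looping), ("dropping", dropping)]:
    print(name, trace(fibs, "A", "D"))
```

The loop case arises because routers update their tables at different times during convergence, so neighboring tables can briefly disagree.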

  13. Outline
     • Problem Statement
     • Analysis of Self-Healing Routing
     • Existing Improvement Solutions
     • Our Self-Healing Solution
     • Conclusion and Future Work

  14. Traditional Fast Reroute Solutions
     • The major improvement in intra-domain routing is fast reroute (FRR).
     • SONET rings significantly reduce recovery time, but they are expensive.
     • FRR with MPLS-TE is hard to deploy because it introduces considerable complexity into the core network.
     • IP-FRR, developed by the IETF, still has shortcomings; e.g., a loop-free alternate (LFA) requires a neighbor whose shortest path does not contain the failed node.
     • Layer 3 tunnels provide pre-computed path protection, but may not eliminate the routing loops introduced by tunneling.
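The LFA shortcoming mentioned above can be made concrete. The sketch below applies the basic loop-free condition from the IETF IP-FRR work (RFC 5286): a neighbor N of source S is a usable alternate toward destination D when dist(N, D) < dist(N, S) + dist(S, D), i.e., N's own shortest path does not lead back through S. The two small topologies, their costs, and the helper names are hypothetical.

```python
INF = float("inf")


def undirected(links):
    """Build an adjacency dict from (a, b, cost) triples."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def all_pairs(graph):
    """Floyd-Warshall shortest-path distances between all node pairs."""
    nodes = list(graph)
    dist = {a: {b: (0 if a == b else graph[a].get(b, INF)) for b in nodes} for a in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def loop_free_alternates(graph, s, d):
    """Neighbors of s, other than the primary next hop, that satisfy the LFA condition."""
    dist = all_pairs(graph)
    primary = min(graph[s], key=lambda n: graph[s][n] + dist[n][d])
    return sorted(
        n for n in graph[s]
        if n != primary and dist[n][d] < dist[n][s] + dist[s][d]
    )


# N has a cheap direct link to D, so it can serve as an alternate for S.
covered = undirected([("S", "E", 1), ("E", "D", 1), ("S", "N", 1), ("N", "D", 2)])
# N's direct link is expensive, so N's shortest path to D goes back through S.
uncovered = undirected([("S", "E", 1), ("E", "D", 1), ("S", "N", 1), ("N", "D", 5)])

print(loop_free_alternates(covered, "S", "D"))
print(loop_free_alternates(uncovered, "S", "D"))
```

In the second topology no neighbor qualifies, which is the coverage gap the slide refers to: whether an LFA exists depends entirely on the topology and link costs.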

  15. State Transition of Improved Solution
     • State transition with protection and damping: improving availability and stability.

  16. BGP Fast Convergence Solutions
     • Major problem in BGP
       • Theoretical analysis and measurement results indicate that path exploration in path-vector protocols prolongs routing convergence.
     • Several solutions address this problem:
       • RCN carries root-cause information in BGP updates, so that all obsolete routes are eliminated and only valid alternative routes are chosen and propagated.
       • Ghost Flushing (GF) improves BGP convergence by expediting the removal of outdated "ghost" information in the Internet.
     • Drawbacks …
       • Network fail-over events in GF, transient routing problems, …

  17. Outline
     • Problem Statement
     • Analysis of Self-Healing Routing
     • Existing Improvement Solutions
     • Our Self-Healing Solution
       • Requirements of Solution
       • Routing Protection
       • Evaluation Metrics
     • Conclusion and Future Work

  18. Self-healing Routing
     • The goal of self-healing routing
       • After a link or a node is destroyed, the network can restore or repair routes by itself.
     • Self-healing routing approaches
       • Routing Restoration (Fast Routing Convergence): attempts to find a new path on demand to restore connectivity when a failure occurs.
       • Routing Protection: based on fixed, predetermined failure recovery; a working path is set up for traffic forwarding together with an alternate protection path.

  19. Requirements of Solution
     • Simplicity
       • The solution should be simple and should not add much complexity to core networks; MPLS, by contrast, requires its own fundamental infrastructure.
     • Easy Deployment and Management
       • MPLS-based solutions are not good candidates because it is hard to pre-compute a backup path for every node.
     • Efficiency
       • Protection need not be deployed to cover 100% of the network, especially when multiple failures happen.
     • Incremental Deployment Support
       • This is an important factor when designing a novel routing protocol, because no one can guarantee that it can be deployed everywhere at once.

  20. Requirements of Solution (cont.)
     • Business Model Support
       • The solution should take into account the business model of path protection in production networks.
       • To protect unstable network areas and backbone areas, contracts between different ISPs should be signed to guarantee routing availability in these areas.
     • Low Cost
       • The path protection solution should provide routes without much extra computation or additional computing power on routers, and should guarantee packet delivery performance with low packet loss.
       • The solution should cover protection under both short-term and long-term network failures.

  21. Principle of our solution
     • The key idea of routing protection is a tradeoff between the additional cost introduced by tunneling and the packet loss caused by failures.
     • Fast Failure Detection
       • Requirements: simplicity, fast detection, easy implementation, and no change to existing routing protocols.
       • Bidirectional Forwarding Detection (BFD) is applied directly.
     • Path Protection Technique
       • Although two types of routing need to be considered, intra-domain and inter-domain, there is no need to provide separate path protection techniques for different routing instances.
       • To eliminate the problems introduced by L3 tunnels, we choose L2TP as the protection technique.
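To give a feel for how fast BFD-based detection can be, here is a small sketch of the detection-time rule for BFD asynchronous mode as we read it in RFC 5880: the local detection time is the detect multiplier received from the remote system times the agreed remote transmit interval, where the agreed interval is the larger of the local required minimum receive interval and the remote desired minimum transmit interval. The timer values below are illustrative assumptions, not figures from the talk.

```python
def bfd_detection_time_ms(remote_detect_mult, local_required_min_rx_ms, remote_desired_min_tx_ms):
    """Detection time in milliseconds for BFD asynchronous mode."""
    # The remote side transmits no faster than the local side is willing to receive.
    agreed_tx_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_tx_ms


# Assumed example: 50 ms timers on both sides, multiplier 3.
fast = bfd_detection_time_ms(3, 50, 50)
# Assumed example: the local side can only receive every 100 ms.
slower = bfd_detection_time_ms(3, 100, 50)
print(fast, slower)
```

With such assumed timers a failure is declared within a few hundred milliseconds, compared with the tens of seconds to minutes that the routing protocols themselves need.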

  22. Principle of our solution (cont.)
     • Tunnel Deactivation
       • Tunnels should be deactivated once a short-term failure recovers, or once routes converge again after a long-term failure, e.g., for loop avoidance or performance reasons. A tunnel deactivation mechanism is therefore essential to guarantee normal data forwarding.
     • Figure legend: LAC = L2TP Access Concentrator; LNS = L2TP Network Server
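The activation and deactivation rules described on the last two slides can be sketched as a small state machine. This is only an illustration of the stated behavior, not the authors' implementation: the class `ProtectionTunnel`, its state names, and its event methods are hypothetical, and the actual L2TP signaling between LAC and LNS is not modeled.

```python
from enum import Enum


class TunnelState(Enum):
    STANDBY = "standby"  # traffic follows the normal routed path
    ACTIVE = "active"    # traffic is diverted into the protection tunnel


class ProtectionTunnel:
    """Hypothetical controller for one pre-established protection tunnel."""

    def __init__(self):
        self.state = TunnelState.STANDBY
        self.history = []

    def _set(self, state, reason):
        if state != self.state:
            self.state = state
            self.history.append((state.value, reason))

    def on_failure_detected(self):
        # E.g., a BFD session on the protected path goes down.
        self._set(TunnelState.ACTIVE, "failure detected")

    def on_link_recovered(self):
        # Short-term failure: the original path works again.
        self._set(TunnelState.STANDBY, "short-term failure recovered")

    def on_routes_converged(self):
        # Long-term failure: routing has converged on a new path.
        self._set(TunnelState.STANDBY, "routes converged")


tunnel = ProtectionTunnel()
tunnel.on_failure_detected()   # short-term failure begins
tunnel.on_link_recovered()     # ...and ends before routing reacts
tunnel.on_failure_detected()   # long-term failure begins
tunnel.on_routes_converged()   # routing has found a new path
for entry in tunnel.history:
    print(entry)
```

Leaving the tunnel active after either event would keep traffic on a detour that is no longer needed, which is where the loop-avoidance and performance concerns come from.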

  23. Evaluation metrics of routing system
     • Two metrics to evaluate a routing system
       • Availability refers to the ability of the routing system to deliver packets normally, whether or not network failures happen.
       • Stability refers to the routing dynamics of the routing system, whether or not network failures happen.
     • Routing paths provided by tunnels guarantee routing availability, while delaying route updates during long-term failures and eliminating them during short-term failures improves the stability of the routing system.
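A toy model can show how the two metrics move under protection. The definitions used here are assumptions made for illustration, since the slides do not give formulas: availability is taken as the fraction of time ticks in which packets are delivered, and stability is proxied by the number of route updates sent (fewer is more stable). The function `simulate`, its parameters, and all numeric values are hypothetical.

```python
def simulate(horizon, fail_start, fail_end, detect, converge, protected, hold):
    """Return (availability, update_count) for one failure lasting [fail_start, fail_end)."""
    delivered = 0
    for t in range(horizon):
        failed = fail_start <= t < fail_end
        if not failed:
            ok = True
        elif protected:
            # The tunnel carries traffic as soon as the failure is detected.
            ok = t >= fail_start + detect
        else:
            # Traffic resumes only after the routing protocol re-converges.
            ok = t >= fail_start + converge
        delivered += ok
    duration = fail_end - fail_start
    if protected and duration <= hold:
        updates = 0  # short-term failure: route updates are suppressed entirely
    else:
        updates = 2  # one update at failure (possibly delayed), one at recovery
    return delivered / horizon, updates


# Assumed scenario: 100 ticks, failure during ticks 10..19, detection after
# 1 tick, convergence after 8 ticks, updates held back for up to 15 ticks.
plain = simulate(100, 10, 20, detect=1, converge=8, protected=False, hold=15)
tunneled = simulate(100, 10, 20, detect=1, converge=8, protected=True, hold=15)
print("unprotected:", plain)
print("protected:  ", tunneled)
```

In this assumed scenario the protected case loses traffic only during the detection tick and sends no updates, while the unprotected case loses traffic until convergence and sends two. Real measurements would of course depend on the topology and timers.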

  24. Outline
     • Problem Statement
     • Analysis of Self-Healing Routing
     • Existing Improvement Solutions
     • Our Self-Healing Solution
     • Conclusion and Future Work

  25. Conclusion and Future Work
     • There are many interesting problems in the Internet.
       • Routing issues in the Internet are being actively addressed.
       • Many of the problems are hard: there are no easy solutions, and tradeoffs have to be made.
     • Our solution addresses the self-healing problems of routing.
     • Further study and measurement of our solution
       • Development of a prototype and experimental analysis on CERNET2

  26. Thanks
     Q&A
     liqi@csnet1.cs.tsinghua.edu.cn
