

  1. Minimizing Wide-Area Performance Disruptions in Inter-Domain Routing Yaping Zhu yapingz@cs.princeton.edu Advisor: Prof. Jennifer Rexford Princeton University

  2. Minimize Performance Disruptions • Network changes affect user experience • Equipment failures • Routing changes • Network congestion • Network operators have to react and fix problems • Fix equipment failure • Change route selection • Change server selection

  3. Diagnosis Framework: Enterprise Network • Diagnosis loop: measure network changes, diagnose, then fix (equipment, config, etc.) • Within its own enterprise network, the operator has full control and full visibility

  4. Challenges to Minimize Wide-Area Disruptions • The Internet is composed of many networks • ISP (Internet Service Provider): provides connectivity • CDN (Content Distribution Network): provides services • Each network has limited visibility and control • Figure: clients connect through small ISPs and a large ISP to reach the CDN

  5. ISP’s Challenge: Provide Good Transit for Packets • Limited visibility • Small ISP: lack of visibility into the problem • Limited control • Large ISP: lack of direct control to fix congestion

  6. CDN’s Challenge: Maximize Performance for Services • Limited visibility • CDN: can’t figure out the exact root cause • Limited control • CDN: lack of direct control to fix the problem

  7. Summary of Challenges of Wide-Area Diagnosis • Measure: large volume and diverse kinds of data • Diagnosis today: ad hoc • Takes a long time to get back to customers • Does not scale to a large number of events • Our Goal: Build Systems for Wide-Area Diagnosis • Formalize and automate the diagnosis process • Analyze a large volume of measurement data

  8. Techniques and Tools for Wide-Area Diagnosis

  9. Rethink Routing Protocol Design • Many performance problems are caused by routing • Route selection is not based on performance • 42.2% of the large latency increases in a large CDN correlated with inter-domain routing changes • No support for multi-path routing • Our Goal: A Routing Protocol for Better Performance • Fast convergence to reduce disruptions • Route selection based on performance • Scalable multi-path to avoid disruptions • Less complexity for fewer errors

  10. Thesis Outline

  11. Route Oracle: Where Have All the Packets Gone? Work with: Jennifer Rexford Aman Shaikh and Subhabrata Sen AT&T Research

  12. Route Oracle: Where Have All the Packets Gone? • Inputs: • Destination: IP address • When? Time • Where? Ingress router • Outputs: • Where does the packet leave the network? Egress router • What’s the route to the destination? AS path • Figure: an IP packet enters the AT&T IP network at an ingress router and leaves at an egress router, following an AS path toward the destination IP address

  13. Application: Service-Level Performance Management • Troubleshoot a CDN throughput drop • Case provided by the AT&T ICDS (Intelligent Content Distribution Service) project • Figure: an AT&T CDN server in Atlanta serves Atlanta users; traffic can leave AT&T at a router in Atlanta or leave AT&T in Washington DC and cross Sprint

  14. Background: IP Prefix and Prefix Nesting • IP prefix: IP address / prefix length • E.g. 12.0.0.0 / 8 stands for [12.0.0.0, 12.255.255.255] • Suppose the routing table has routes for prefixes: • 12.0.0.0/8: [12.0.0.0-12.255.255.255] • 12.0.0.0/16: [12.0.0.0-12.0.255.255] • [12.0.0.0-12.0.255.255] covered by both the /8 and /16 prefixes • Prefix nesting: IPs covered by multiple prefixes • 24.2% of IP addresses are covered by more than one prefix
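
To make prefix nesting concrete, here is a minimal sketch (not from the talk) that uses Python's ipaddress module to list every prefix in a toy two-entry routing table that covers a given IP address; the table contents are just the example above.

```python
import ipaddress

# Toy routing table with the two nested prefixes from the example above.
routing_table = [
    ipaddress.ip_network("12.0.0.0/8"),
    ipaddress.ip_network("12.0.0.0/16"),
]

def prefix_set(ip_str):
    """Return every prefix in the table covering the IP (its 'prefix set')."""
    ip = ipaddress.ip_address(ip_str)
    return [p for p in routing_table if ip in p]

print(prefix_set("12.0.1.1"))    # nested: covered by both the /8 and the /16
print(prefix_set("12.128.0.1"))  # covered only by the /8
```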

  15. Background: Longest Prefix Match (LPM) • BGP update format • by IP prefix • egress router, AS path • Longest prefix match (LPM): • Routers use LPM to forward IP packets • LPM changes as routes are announced and withdrawn • 13.0% of BGP updates cause LPM changes • Challenge: determine the route for an IP address -> find the LPM for the IP address -> track LPM changes for the IP address
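
A similarly minimal sketch of longest prefix match over the same kind of toy table, illustrating how a single withdrawal can change the LPM for an address (illustrative code, not the Route Oracle implementation):

```python
import ipaddress

def longest_prefix_match(ip_str, prefixes):
    """Among the prefixes covering the IP, return the most specific one."""
    ip = ipaddress.ip_address(ip_str)
    matches = [p for p in prefixes if ip in p]
    return max(matches, key=lambda p: p.prefixlen) if matches else None

table = {ipaddress.ip_network("12.0.0.0/8"), ipaddress.ip_network("12.0.0.0/16")}
print(longest_prefix_match("12.0.1.1", table))       # 12.0.0.0/16
table.discard(ipaddress.ip_network("12.0.0.0/16"))   # a withdrawal changes the LPM
print(longest_prefix_match("12.0.1.1", table))       # now falls back to 12.0.0.0/8
```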

  16. Challenge: Scale of the BGP Data • Data collection: BGP Monitor • Has a BGP session with each router • Receives incremental updates of best routes • Data scale • Dozens of routers (one per city) • Each router has many prefixes (~300K) • Each router receives lots of updates (millions per day) • Figure: BGP routers send their best routes to a software router, which feeds a centralized server

  17. Background: BGP is an Incremental Protocol • Incremental protocol • Routes that have not changed are not re-announced • How to log routes for an incremental protocol? • Routing table dump: daily • Incremental updates: every 15 minutes

  18. Route Oracle: Interfaces and Challenges • Challenges • Track the longest prefix match • Scale of the BGP data • Need to answer queries • At scale: for many IP addresses • In real time: for network operation • Figure: Route Oracle takes BGP routing data and query inputs (destination IP address, ingress router, time) and outputs the egress router and AS path

  19. Strawman Solution: Track LPM Changes by Forwarding Table • How to implement • Run routing software to update the forwarding table • The forwarding table answers queries based on LPM • Answer a query for one IP address • Suppose: n prefixes in the routing table at t1, m updates from t1 to t2 • Time complexity: O(n+m) • Space complexity: O(P), where P stands for the number of prefixes covering the query IP address

  20. Strawman Solution: Track LPM Changes by Forwarding Table • Answer queries for k IP addresses • Keep all prefixes in the forwarding table • Space complexity: O(n) • Time complexity: major steps • Initialize n routes: n*log(n) + k*n • Process m updates: m*log(n) + k*m • In sum: (n+m)*(log(n)+k) • Goal: reduce query processing time • Trade more space for less time: pre-processing • Storing pre-processed results does not scale for 2^32 IPs • Need to track LPM scalably

  21. Track LPM Scalably: Address Range • Prefix set • Collection of all matching prefixes for given IP address • Address range • Contiguous addresses that have the same prefix set • E.g. 12.0.0.0/8 and 12.0.0.0/16 in routing table • [12.0.0.0-12.0.255.255] has prefix set {/8, /16} • [12.1.0.0-12.255.255.255] has prefix set {/8} • Benefits of address range • Track LPM scalably • No dependency between different address ranges
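
A sketch of how address ranges can be derived from prefix boundaries, assuming nothing beyond the definition above (function and variable names are illustrative; the actual Route Oracle uses a tree-based structure, described on the next slide):

```python
import ipaddress

def address_ranges(prefixes):
    """Split the covered address space into contiguous ranges whose
    addresses all share the same prefix set."""
    # Every prefix start, and the address just past every prefix end, is a boundary.
    points = set()
    for p in prefixes:
        points.add(int(p.network_address))
        points.add(int(p.broadcast_address) + 1)
    points = sorted(points)

    ranges = []
    for lo, hi in zip(points, points[1:]):
        pset = [p for p in prefixes
                if int(p.network_address) <= lo and int(p.broadcast_address) >= hi - 1]
        if pset:  # skip gaps not covered by any prefix
            ranges.append((ipaddress.ip_address(lo), ipaddress.ip_address(hi - 1), pset))
    return ranges

prefixes = [ipaddress.ip_network("12.0.0.0/8"), ipaddress.ip_network("12.0.0.0/16")]
for lo, hi, pset in address_ranges(prefixes):
    print(f"[{lo}-{hi}] prefix set: {pset}")
# Prints the two ranges from the slide:
# [12.0.0.0-12.0.255.255] with prefix set {/8, /16}
# [12.1.0.0-12.255.255.255] with prefix set {/8}
```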

  22. Track LPM by Address Range: Data Structure and Algorithm • Tree-based data structure: each node stands for an address range • Real-time algorithm processes incoming updates • Figure: a routing table with 12.0.0.0/8, /16, and /24 yields the address ranges [12.0.0.0-12.0.0.255] (prefix set {/8, /16, /24}), [12.0.1.0-12.0.255.255] ({/8, /16}), and [12.1.0.0-12.255.255.255] ({/8})

  23. Track LPM by Address Range: Complexity • Pre-processing: for n initial routes in the routing table and m updates • Time complexity: O((n+m)*log(n)) • Space complexity: O(n+m) • Query processing: for k queries • Time complexity: O((n+m)*k) • Parallelization using c processors: O((n+m)*k/c)

  24. Route Oracle: System Implementation • Input: BGP routing data (daily table dump, 15-minute updates) • Precomputation: daily snapshot of routes by address range, plus incremental route updates per address range • Query inputs: destination IP, ingress router, time • Query processing output for each query: egress router, AS path

  25. Query Processing: Optimizations • Optimize for multiple queries • Amortize the cost of reading address range records: across multiple queried IP addresses • Parallelization • Observation: address range records could be processed independently • Parallelization on multi-core machine
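
A rough sketch of these two optimizations, assuming address-range records have already been materialized as a list of (start, end, route) tuples with integer bounds; names like answer_chunk are illustrative, not from the actual system:

```python
from multiprocessing import Pool

def answer_chunk(args):
    """Answer all queried IPs (as integers) that fall inside one chunk of
    address-range records. A single pass over the chunk is amortized
    across all queried IPs."""
    chunk, query_ips = args
    answers = {}
    for start, end, route in chunk:
        for ip in query_ips:
            if start <= ip <= end:
                answers[ip] = route
    return answers

def answer_queries(range_records, query_ips, workers=4):
    """Address ranges are independent, so chunks can be processed in parallel."""
    size = max(1, len(range_records) // workers)
    chunks = [range_records[i:i + size] for i in range(0, len(range_records), size)]
    with Pool(workers) as pool:
        results = pool.map(answer_chunk, [(c, query_ips) for c in chunks])
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged
```

On platforms that spawn worker processes, calls to answer_queries should sit under an `if __name__ == "__main__":` guard.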

  26. Performance Evaluation: Pre-processing • Experiment on SMP server • Two quad-core Xeon X5460 Processors • Each CPU: 3.16 GHz and 6 MB cache • 16 GB of RAM • Experiment design • BGP updates received over fixed time-intervals • Compute the pre-processing time for each batch of updates • Can we keep up? pre-processing time • 5 mins updates: ~2 seconds • 20 mins updates: ~5 seconds

  27. Performance Evaluation: Query Processing • Query for one IP (duration: 1 day) • Route Oracle 3-3.5 secs; Strawman approach: minutes • Queries for many IPs: scalability (duration: 1 hour)

  28. Performance Evaluation: Query Parallelization

  29. Conclusion

  30. NetDiag: Diagnosing Wide-Area Latency Changes for CDNs Work with: Jennifer Rexford Benjamin Helsley, Aspi Siganporia, and Sridhar Srinivasan Google Inc.

  31. Background: CDN Architecture • Life of a client request • Front-end (FE) server selection • Latency map • Load balancing (LB) • Figure: client, AS path, ingress router, front-end server (FE) inside the CDN network, and egress router

  32. Challenges • Many factors contribute to latency increase • Internal factors • External factors • Separate cause from effect • e.g., FE changes lead to ingress/egress changes • The scale of a large CDN • Hundreds of millions of users, grouped by ISP/Geo • Clients served at multiple FEs • Clients traverse multiple ingress/egress routers

  33. Contributions • Classification: • Separating cause from effect • Identify threshold for classification • Metrics: analyze over sets of servers and routers • Metrics for each potential cause • Metrics by an individual router or server • Characterization: • Events of latency increases in Google’s CDN (06/2010)

  34. Background: Client Performance Data • Performance data format: IP prefix, FE, Requests Per Day (RPD), Round-Trip Time (RTT)

  35. Background: BGP Routing and Netflow Traffic • Netflow traffic (at edge routers): 15 mins by prefix • Incoming traffic: ingress router, FE, bytes-in • Outgoing traffic: egress router, FE, bytes-out • BGP routing (at edge routers): 15 mins by prefix • Egress router and AS path

  36. Background: Joint Data Set • Joins the BGP routing data, Netflow traffic data, and performance data • Granularity • Daily • By IP prefix • Format • FE, requests per day (RPD), round-trip time (RTT) • List of {ingress router, bytes-in} • List of {egress router, AS path, bytes-out}
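
A sketch of what a single joint record might look like, directly mirroring the fields listed above (field names are illustrative, not Google's actual schema):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class JointRecord:
    """One joint-data-set record: daily granularity, keyed by IP prefix."""
    day: str        # e.g. "2010-06-15"
    prefix: str     # e.g. "12.0.0.0/16"
    fe: str         # serving front-end server
    rpd: int        # requests per day
    rtt_ms: float   # round-trip time
    ingress: List[Tuple[str, int]] = field(default_factory=list)      # (ingress router, bytes-in)
    egress: List[Tuple[str, str, int]] = field(default_factory=list)  # (egress router, AS path, bytes-out)
```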

  37. Classification of Latency Increases • Decision tree: group performance data by region and identify events • First split: FE change vs. FE latency increase • FE changes: latency map change vs. load balancing (using the latency map and FE capacity and demand) • FE latency increase: routing changes, ingress router vs. egress router / AS path (using BGP routing and Netflow traffic)

  38. Case Study: Flash Crowd Leads Some Requests to a Distant Front-End Server • Identify event: RTT doubled for an ISP in Malaysia • RPD (requests per day) jumped: RPD2/RPD1 = 2.5 • Diagnose: follow the decision tree • 97.9% of the increase by FE changes • 32.3% of the FE change by load balancing

  39. Classification: FE Server and Latency Metrics (roadmap: the classification decision tree from slide 37)

  40. FE Change vs. FE Latency Increase • RTT: weighted by requests from FEs • Break down RTT change by two factors • FE change • Clients switch from one FE to another (with higher RTT) • FE latency change • Clients using the same FE, latency to FE increases

  41. FE Change vs. FE Latency Change Breakdown • FE change component • FE latency change component • Important properties • Analysis over a set of FEs • The two components sum up to 1
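
The formulas from this slide were not captured in the transcript. The sketch below shows one standard way to split a request-weighted RTT change into an FE-change term and an FE-latency term that sum to the total change; dividing each by the total gives two fractions that sum to 1. It is an illustration consistent with the slide's description, not necessarily the exact weighting used in NetDiag.

```python
def rtt_breakdown(day1, day2):
    """day1, day2: dict FE -> (fraction of requests, RTT in ms) for a client group.
    Returns (fe_change_term, fe_latency_term); the two sum to the total RTT change."""
    fes = set(day1) | set(day2)

    def get(d, fe):
        return d.get(fe, (0.0, 0.0))  # an absent FE contributes nothing

    # Clients switching to a different FE (valued at that FE's day-2 latency).
    fe_change = sum((get(day2, fe)[0] - get(day1, fe)[0]) * get(day2, fe)[1] for fe in fes)
    # Clients staying on the same FE whose latency to that FE changed.
    fe_latency = sum(get(day1, fe)[0] * (get(day2, fe)[1] - get(day1, fe)[1]) for fe in fes)
    return fe_change, fe_latency

day1 = {"FE-A": (0.8, 20.0), "FE-B": (0.2, 60.0)}
day2 = {"FE-A": (0.5, 20.0), "FE-B": (0.5, 60.0)}
change, latency = rtt_breakdown(day1, day2)
total = sum(p * r for p, r in day2.values()) - sum(p * r for p, r in day1.values())
print(change / total, latency / total)  # fractions of the increase; they sum to 1
```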

  42. FE Changes: Latency Map vs. Load Balancing (roadmap: the classification decision tree from slide 37)

  43. FE Changes: Latency Map vs. Load Balancing • Classify FE changes by two metrics: • Fraction of traffic shifted by the latency map • Fraction of traffic shifted by load balancing

  44. Latency Map: Closest FE Server • Calculate the latency map • Latency map format: (prefix, closest FE) • Aggregate by groups of clients: list of (FE_i, r_i), where r_i is the fraction of requests directed to FE_i by the latency map • Define the latency map metric
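
A small sketch of the aggregation step, building the list of (FE_i, r_i) for one client group from a per-prefix latency map. The latency map metric itself was not captured on this slide; it could, for example, compare these fractions between two days. Names are illustrative.

```python
def aggregate_latency_map(latency_map, requests):
    """latency_map: dict prefix -> closest FE.
    requests: dict prefix -> number of requests from that prefix (one client group).
    Returns dict FE_i -> r_i, the fraction of the group's requests the map sends to FE_i."""
    total = sum(requests.values())
    r = {}
    for prefix, count in requests.items():
        fe = latency_map.get(prefix)
        if fe is not None and total > 0:
            r[fe] = r.get(fe, 0.0) + count / total
    return r

latency_map = {"1.2.3.0/24": "FE-Atlanta", "1.2.4.0/24": "FE-DC"}
requests = {"1.2.3.0/24": 900, "1.2.4.0/24": 100}
print(aggregate_latency_map(latency_map, requests))  # {'FE-Atlanta': 0.9, 'FE-DC': 0.1}
```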

  45. Load Balancing: Avoiding Busy Servers • FE request distribution change • Fraction of requests shifted by the load balancer • Sum only if positive: target request load > actual load • Metric: more traffic load balanced on day 2
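
A hedged sketch of the load-balancing metric as this slide describes it: sum only the positive gaps where the latency map's target load exceeds the load the FE actually served, then compare that fraction across the two days. Function and variable names are assumptions, not the actual NetDiag code.

```python
def load_balanced_fraction(target_load, actual_load):
    """target_load, actual_load: dict FE -> requests per day.
    Sums only the positive gaps (target > actual), i.e. requests the load
    balancer moved away from busy FEs, as a fraction of total requests."""
    total = sum(target_load.values())
    if total == 0:
        return 0.0
    shifted = sum(max(0.0, target_load[fe] - actual_load.get(fe, 0.0)) for fe in target_load)
    return shifted / total

# Metric from the slide: more traffic is load-balanced on day 2 than on day 1, e.g.
# load_balanced_fraction(target_day2, actual_day2) - load_balanced_fraction(target_day1, actual_day1)
```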

  46. FE Latency Increase: Routing Changes • Correlate with routing changes: • Fraction of traffic that shifted ingress router • Fraction of traffic that shifted egress router / AS path

  47. Routing Changes: Ingress, Egress, AS Path • Identify the FE with largest impact • Calculate fraction of traffic which shifted routes • Ingress router: • f1j, f2j: fraction of traffic entering ingress j on days 1 and 2 • Egress router and AS path • g1k, g2k: fraction of traffic leaving egress/AS path k on day 1, 2
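
A sketch of the traffic-shift computation for these routing-change metrics, applied either to the ingress fractions f1j, f2j or to the egress/AS-path fractions g1k, g2k. The 0.5 normalization (so fully disjoint routing counts as 1.0) is an assumption, not confirmed by the slide.

```python
def traffic_shift(frac_day1, frac_day2):
    """frac_day1, frac_day2: dict key -> fraction of the FE's traffic on each day,
    keyed by ingress router (f1j, f2j) or by (egress router, AS path) (g1k, g2k).
    Returns the fraction of traffic that changed routes between the two days."""
    keys = set(frac_day1) | set(frac_day2)
    return 0.5 * sum(abs(frac_day2.get(k, 0.0) - frac_day1.get(k, 0.0)) for k in keys)

f1 = {"ingress-ATL": 0.9, "ingress-DC": 0.1}
f2 = {"ingress-ATL": 0.4, "ingress-DC": 0.6}
print(traffic_shift(f1, f2))  # 0.5: half of the traffic shifted ingress routers
```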

  48. Identify Significant Performance Disruptions (roadmap: the classification decision tree from slide 37)

  49. Identify Significant Performance Disruptions • Focus on large events • Large increases: >= 100 msec, or doubles • Many clients: for an entire region (country/ISP) • Sustained period: for an entire day • Characterize latency changes • Calculate daily latency changes by region
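
The event filter on this slide reduces to a simple predicate over the daily per-region latency change, sketched below with the thresholds stated above:

```python
def is_significant_event(rtt_day1_ms, rtt_day2_ms):
    """Flag a large, sustained latency increase for a region (country/ISP):
    the daily RTT increases by at least 100 ms, or at least doubles."""
    return (rtt_day2_ms - rtt_day1_ms) >= 100.0 or rtt_day2_ms >= 2.0 * rtt_day1_ms

print(is_significant_event(80.0, 190.0))   # True: +110 ms
print(is_significant_event(40.0, 90.0))    # True: more than doubled
print(is_significant_event(120.0, 150.0))  # False
```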

  50. Latency Characterization for Google’s CDN • Apply the classification to one month of data (06/2010)
