Internet Measurement for Self-Driving Networks

Learn about the importance of internet measurement in self-driving networks and how it can improve availability, latency, and traffic engineering. Explore the challenges and future directions in this field.

Presentation Transcript


  1. Internet Measurement for Self-Driving Networks • Matt Calder • Minerva Chen, Jose Nunez de Caceres Estrada, Diego Perez Botero, Madhura Phadke, Manuel Schröder • April 4, 2019

  2. Background • Azure Frontdoor – Microsoft’s content delivery network • 1st and (recently) 3rd party CDN • Servers deployed on Microsoft's network edge • Global application load balancing • Reverse proxy / Split TCP • Dedicated Internet measurement team • Systems used across Microsoft • Network experimentation • Network monitoring • User-facing analytics • Build out and capacity planning • Traffic engineering

  3. We need Self-driving Networks • Use cases • Availability drops / outages • Example: Seattle Comcast users see a failure rate of 10% over the last 10 minutes • Latency regressions • Example: P90 RTT of users in Taiwan increased by 100 ms • Such issues are frequent and unending • Operate self-driving networks to mitigate them • Traffic engineering • The building blocks that enable self-driving networks took years to put in place
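The availability use case on this slide (a 10% failure rate over the last 10 minutes for one user group) can be sketched as a sliding-window check. This is only an illustration, not the system described in the talk: the class name `FailureRateWindow`, the `min_samples` guard, and the in-memory event list are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FailureRateWindow:
    """Tracks request outcomes for one user group over a sliding time window."""

    window_seconds: float = 600.0  # "last 10 minutes" from the slide
    threshold: float = 0.10        # 10% failure rate from the slide
    min_samples: int = 50          # assumed guard against tiny sample sizes
    _events: deque = field(default_factory=deque)  # (timestamp, failed) pairs

    def record(self, timestamp: float, failed: bool) -> None:
        """Add one request outcome and drop outcomes older than the window."""
        self._events.append((timestamp, failed))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def failure_rate(self) -> float:
        """Fraction of failed requests currently inside the window."""
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def is_degraded(self) -> bool:
        """True when there are enough samples and the rate meets the threshold."""
        return (
            len(self._events) >= self.min_samples
            and self.failure_rate() >= self.threshold
        )
```

One window would be kept per user group (for example, per metro and ISP pair); the sample-count guard is there so that a handful of failed requests from a small population does not read as an outage.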

  4. Outline Introduction Path to Self-Driving Solutions Data Collection Odin Measurement Platform Challenges and Future Directions

  5. Leading up to Self-driving Networks • #1 No insight • Customer-reported incidents • Unable to measure the problem, or unaware of how to • Ad-hoc measurement and analysis • Get info from customer • Run tcpdump on a production machine, copy the trace locally, write a script to analyze it • Run traceroute from a prod machine; have the customer send a traceroute from their end, or find a looking glass • Time-consuming investigations for engineers • Best outcome is a troubleshooting guide

  6. Leading up to Self-driving Networks • #1 No insight • #2 Automate Data Collection • Data is there when you need it • Measurements • Telemetry • Large geo-distributed service • Data ingestion latency • Raw or aggregate • Queryability

  7. Leading up to Self-Driving Networks • #1 No insight • #2 Automate Data Collection • #3 Automate Issue Detection • Methodology • Invest in statistics, data quality, and validation • Translate networking domain knowledge into process • Schedule recurring jobs to look for issues • Produce reports
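A recurring detection job of the kind listed on this slide could, for instance, compare the P90 RTT of a user group against a baseline period. The sketch below is a hedged illustration: the nearest-rank percentile, the function names, the 100 ms default (borrowed from the Taiwan example earlier in the deck), and the `min_samples` data-quality guard are assumptions, not the team's actual methodology.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p90_regression(baseline_rtts_ms, current_rtts_ms,
                   threshold_ms=100.0, min_samples=30):
    """Return the P90 increase in ms if it meets the threshold, else None.

    Returns None as well when either period has too few samples, so that
    thin data never produces a reported regression.
    """
    if len(baseline_rtts_ms) < min_samples or len(current_rtts_ms) < min_samples:
        return None
    delta = percentile(current_rtts_ms, 90) - percentile(baseline_rtts_ms, 90)
    return delta if delta >= threshold_ms else None
```

A scheduled job would run this per user group and write any non-`None` result into a report.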

  8. Leading up to Self-Driving Networks • #1 No insight • #2 Automate Data Collection • #3 Automate Issue Detection • #4 Alerting • True test of #3 • Raise alert to on-call engineer • Follow troubleshooting guide • Too many alerts -> on-call burnout • Issues are mostly short-lived
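Because the slide notes that issues are mostly short-lived and that too many alerts burn out the on-call engineer, one common mitigation is to page only when an issue persists across several consecutive detection intervals. The slide does not say this is what the team did; the class below, including its name and the default of three intervals, is an assumed sketch of that general technique.

```python
class DebouncedAlert:
    """Fires once when a condition has been unhealthy for N intervals in a row."""

    def __init__(self, required_consecutive=3):
        self.required = required_consecutive  # assumed persistence requirement
        self.streak = 0                       # current run of unhealthy intervals
        self.active = False                   # True once the alert has fired

    def observe(self, unhealthy):
        """Record one interval; return True only when the alert should fire."""
        if not unhealthy:
            # Recovery resets the streak and re-arms the alert.
            self.streak = 0
            self.active = False
            return False
        self.streak += 1
        if self.streak >= self.required and not self.active:
            self.active = True
            return True
        return False
```

With the default setting, a two-interval blip never pages anyone, and a long-running issue pages once rather than on every interval.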

  9. Arriving at Self-driving Networks • #1 No insight • #2 Automate Data Collection • #3 Automate Issue Detection • #4 Alerting • #5 Closing the loop • Feed data to traffic engineering system • Auth DNS, BGP • Missing piece is measurements of alternate paths • Examples • Change egress traffic links • Change ingress traffic PoPs
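To make the "closing the loop" step concrete, the sketch below picks an egress link from per-link RTT measurements. It assumes that measurements of the alternate paths already exist, which the slide names as the missing piece. The function name, the use of medians, and the `min_samples` and `min_gain_ms` guards are illustrative assumptions rather than the traffic engineering system's real logic.

```python
from statistics import median


def choose_egress(current_link, rtt_samples_ms, min_samples=20, min_gain_ms=10.0):
    """Return the link to use, given a dict of link name -> list of RTT samples.

    The current link is kept unless a well-measured alternative beats its
    median RTT by at least min_gain_ms.
    """
    medians = {
        link: median(samples)
        for link, samples in rtt_samples_ms.items()
        if len(samples) >= min_samples  # ignore links with too little data
    }
    # Without data for the current link or a second candidate, do nothing.
    if current_link not in medians or len(medians) < 2:
        return current_link
    best_link = min(medians, key=medians.get)
    if medians[current_link] - medians[best_link] >= min_gain_ms:
        return best_link
    return current_link
```

The minimum-gain guard keeps traffic from flapping between links whose performance is nearly identical, and the sample guard keeps a link with five lucky measurements from winning.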

  10. Outline Introduction Path to Self-Driving Solutions Data Collection Odin Measurement Platform Challenges and Future Directions

  11. Data Collection at Azure Frontdoor • Passive Server-side • Client requests instrumented at server • Collect TCP and application layer metrics

  12. Data Collection at Azure Frontdoor • Passive Server-side • Client requests instrumented at server • Collect TCP and application layer metrics • Active Server-side • Traceroute, ping • From servers to Internet destinations

  13. Data Collection at Azure Frontdoor • Passive Server-side • Client requests instrumented at server • Collect TCP and application layer metrics • Active Server-side • Traceroute, ping • From servers to Internet destinations • Active Client-side • + HTTP(S) • From Microsoft users to Internet destinations

  14. Data Collection at Azure Frontdoor • Passive Server-side • Client requests instrumented at server • Collect TCP and application layer metrics • Active Server-side • Traceroute, ping • From servers to Internet destinations • Active Client-side • + HTTP(S) • From Microsoft users to Internet destinations • Azure Global Telemetry • Data Access: Real-time, Near Real-time, Offline

  15. Measurement Limitations • Passive server-side • Issue 1: No explicit outage signal • Issue 2: Alternate path exploration adds risk

  16. Measurement Limitations • Passive server-side • Issue 1: No explicit outage signal • Issue 2: Alternate path exploration adds risk • Active layer 3 measurements from servers • Issue 1: Poor coverage • 74% of end-users are unresponsive • Issue 2: Missing layer 7 behaviors • HTTP redirection • SSL/TLS

  17. Outline Introduction Path to Self-Driving Solutions Data Collection Odin Measurement Platform Challenges and Future Directions

  18. Odin Design • 1. Client-side Platform • 2. Active Measurement • 3. Application Layer • [Diagram: Odin client sends HTTP(S) GET tiny.png to Server A (20 ms); report "A: 20ms" goes to the Report Endpoint at Microsoft, which feeds Offline Analysis and Online Alerting]

  19. Odin Design • 1. Client-side Platform • 2. Active Measurement • 3. Application Layer • 4. Both Web and Rich Clients • [Diagram: Odin inside a Stock Ticker application on a Desktop User's machine sends HTTP(S) GET tiny.png to Server B (20 ms); report "B: 20ms" goes to the Report Endpoint at Microsoft, which feeds Offline Analysis and Online Alerting]

  20. Odin Design • 1. Client-side Platform • 2. Active Measurement • 3. Application Layer • 4. Both Web and Rich Clients • 5. Explicit Failure Notification • [Diagram: Odin inside a Stock Ticker application on a Desktop User's machine sends HTTP(S) GET tiny.png to Server B; report "B: ERROR" goes to the Report Endpoint at Microsoft, which feeds Offline Analysis and Online Alerting]
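The measurement shown in these Odin slides (fetch a small object over HTTP(S), time it, and report either the latency or an explicit error) can be sketched as follows. This is an illustrative reconstruction, not Odin's code: the function names, the report dictionary layout, and the injectable `fetch` and `clock` parameters are assumptions made so that the example runs without a network.

```python
import time
import urllib.request


def http_get(url, timeout=5.0):
    """Fetch a URL and discard the body; raises on any network or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def measure(target_id, url, fetch=http_get, clock=time.monotonic):
    """Time one fetch and build a report, with an explicit failure signal."""
    start = clock()
    try:
        fetch(url)
    except Exception as exc:
        # Explicit failure notification: the report says the fetch failed
        # instead of the measurement silently going missing.
        return {"target": target_id, "ok": False, "error": type(exc).__name__}
    elapsed_ms = round((clock() - start) * 1000, 1)
    return {"target": target_id, "ok": True, "latency_ms": elapsed_ms}
```

The failure branch is the point of the design: a passive server-side log never sees a request that could not reach the server, whereas the client can still report that its attempt failed.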

  21. Odin Design • Examples showed measurements to the application server • Want richer measurements • [Diagram: Odin inside a Mail application, with Server B; Report Upload Endpoint at Microsoft feeds Offline Analysis and Online Alerting]

  22. Odin Design • Orchestration Service • Target URLs • Primary and backup report endpoints • [Diagram: Odin client, Orchestration Service, and Report Upload Endpoint at Microsoft, which feeds Offline Analysis and Online Alerting]

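  23. Odin Design • [Diagram with numbered steps 1 to 3: Orchestration Service; Odin client sends GET tiny.png to Server B (20ms); report "m1.contoso.com: 20ms" goes to the Report Endpoint, which feeds Offline Analysis and Online Alerting; also shown: Microsoft Europe, Microsoft U.S., m3.contoso.com]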

  24. Odin Design: Fault tolerance • Need to receive measurements even if Microsoft’s network is unavailable • [Diagram: Odin client sends GET tiny.gif to Server B; report "B: ERROR" is addressed to the Report Endpoint at Microsoft; also shown: Orchestration Service, Offline Analysis, Online Alerting]

  25. Odin Design: Fault tolerance • [Diagram: Odin client sends GET tiny.gif to Server B; report "B: ERROR" goes through a Report Proxy on a 3rd Party Network to the Report Endpoint at Microsoft; also shown: Orchestration Service, Offline Analysis, Online Alerting]
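The fault-tolerance idea on these two slides (reports must still arrive when Microsoft's network is unreachable, so a report proxy on a third-party network serves as a fallback) amounts to trying an ordered list of report endpoints. The sketch below assumes the order primary, backup, then proxy; the function name, the `send` callback, and the hostnames in the usage example are hypothetical.

```python
def upload_report(report, endpoints, send):
    """Try each endpoint in order; return the one that accepted the report.

    `send(endpoint, report)` is expected to raise OSError (which includes
    ConnectionError and TimeoutError) when the endpoint is unreachable.
    Returns None if every endpoint fails.
    """
    for endpoint in endpoints:
        try:
            send(endpoint, report)
            return endpoint
        except OSError:
            continue  # fall through to the next, more independent endpoint
    return None
```

A caller might pass endpoints such as `["https://report-primary.example", "https://report-backup.example", "https://proxy.thirdparty.example"]`, with the last entry hosted outside the network being measured so that an outage there does not also hide its own evidence.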

  26. Summary: Odin enables Self-driving Networks • Coverage • No better vantage points than your actual customers • Safety • Don’t need to experiment/measure with prod traffic • Ability to validate • Flexibility • Supports enterprise network and privacy requirements • Fault tolerance • Measurements available during outages

  27. Outline Introduction Path to Self-Driving Solutions Data Collection Odin Measurement Platform Challenges and Future Directions

  28. Challenges and Future Directions • Data Quality • How do we build systems which avoid making bad decisions? • When do we get humans back in the loop? • TE keeps fixing recurring problems • May require change in service, additional capacity • Need for collaboration • Issues impacting common resources e.g. IXPs, transit, end-users • Self-driving networks will route accordingly • Want to help fix underlying issue • Still need to email NOC • Signals published by content providers • Network operators subscribe

  29. Thanks!
