
Validating Datacenters at Scale

Validating Datacenters at Scale. Karthick Jayaraman, Nikolaj Bjørner, Jitu Padhye, Amar Agrawal, Ashish Bhargava, Paul-Andre C Bissonnette, Shane Foster, Andrew Helwer, Mark Kasten, Ivan Lee, Anup Namdhari, Haseeb Niaz,



  1. Validating Datacenters at Scale. Karthick Jayaraman, Nikolaj Bjørner, Jitu Padhye, Amar Agrawal, Ashish Bhargava, Paul-Andre C Bissonnette, Shane Foster, Andrew Helwer, Mark Kasten, Ivan Lee, Anup Namdhari, Haseeb Niaz, Aniruddha Parkhi, Hanukumar Pinnamraju, Adrian Power, Neha Milind Raje, Parag Sharma. Microsoft Azure Networking.

  2. Hyperscale Azure Datacenter Network • 54 regions worldwide • 140 countries [Figure: counts of network devices, maintenance changes/day, servers, and policies; the values are not preserved in the transcript]

  3. Reliability at Hyperscale • Is the network operating as expected? • Will my change affect the network?

  4. Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?

  5. Forwarding Information Base (FIB) • Determines the forwarding behavior of each device • Uses longest prefix matching • The FIBs of all devices collectively determine the forwarding behavior of the network [Figure: a device with interfaces i1, i2, i3, i4 forwarding packets with dstIp=100.26.0.1 and dstIp=100.25.0.1]
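
Longest prefix matching, as named on slide 5, can be illustrated with a minimal sketch. The FIB contents below (which prefix maps to which interface, and the default route) are assumptions made for illustration; the slide only shows the interface names and the two destination addresses.

```python
import ipaddress

def longest_prefix_match(fib, dst_ip):
    """Return the next hops of the most specific FIB entry covering dst_ip, or None."""
    addr = ipaddress.ip_address(dst_ip)
    best = None  # (network, next_hops) of the longest matching prefix seen so far
    for prefix, next_hops in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hops)
    return best[1] if best else None

# Hypothetical FIB: the prefix-to-interface mapping is assumed, not taken from the slide.
fib = {
    "0.0.0.0/0": ["i1", "i2"],   # default route with two equal-cost next hops
    "100.25.0.0/16": ["i3"],
    "100.26.0.0/16": ["i4"],
}

print(longest_prefix_match(fib, "100.26.0.1"))
print(longest_prefix_match(fib, "100.25.0.1"))
```

A linear scan keeps the sketch short; production routers use tries or TCAMs for the same lookup.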

  6. Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?

  7. What is the intent? • All-pairs ToR reachability [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4 in two clusters (Cluster 1, Cluster 2); ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  8. What is the intent? • All-pairs ToR reachability • Traffic must follow the shortest path • Intra-cluster path length = 2 • Intra-datacenter path length = 4 [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4 in two clusters (Cluster 1, Cluster 2); ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
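
The path lengths on slide 8 (2 within a cluster, 4 across clusters) can be reproduced on a small model of the topology. This is a sketch under stated assumptions: the link layout (each ToR connects to every leaf in its cluster, each leaf connects to every D router) and the assignment of ToRs to clusters are guesses from the figure labels, not something the transcript confirms.

```python
from collections import deque

def build_topology():
    """Build an adjacency map for a small, assumed two-cluster topology."""
    leaves = {"cluster1": ["A1", "A2", "A3", "A4"], "cluster2": ["B1", "B2", "B3", "B4"]}
    tors = {"cluster1": ["ToR1", "ToR2"], "cluster2": ["ToR3", "ToR4"]}
    spines = ["D1", "D2", "D3", "D4"]
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for cluster, cluster_leaves in leaves.items():
        for leaf in cluster_leaves:
            for tor in tors[cluster]:
                link(tor, leaf)    # assumed: every ToR links to every leaf in its cluster
            for spine in spines:
                link(leaf, spine)  # assumed: every leaf links to every D router
    return adj

def hop_count(adj, src, dst):
    """Breadth-first search; returns the shortest hop count, or None if unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def check_path_length_intent(adj, cluster_of):
    """Return (src, dst, expected, actual) for every ToR pair that misses the intent."""
    violations = []
    tors = sorted(cluster_of)
    for src in tors:
        for dst in tors:
            if src == dst:
                continue
            expected = 2 if cluster_of[src] == cluster_of[dst] else 4
            actual = hop_count(adj, src, dst)
            if actual != expected:
                violations.append((src, dst, expected, actual))
    return violations

adj = build_topology()
cluster_of = {"ToR1": 1, "ToR2": 1, "ToR3": 2, "ToR4": 2}  # assumed cluster membership
print(check_path_length_intent(adj, cluster_of))
```

Note that this checks the topology, not the FIBs: it only shows where the numbers 2 and 4 come from.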

  9. What is the intent? • All-pairs ToR reachability • Traffic must follow the shortest path • All Equal-Cost Multi-Path (ECMP) routes must be available [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4 in two clusters (Cluster 1, Cluster 2); ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  10. Where does the intent come from? • Network Graph Service provides the topology • Automatic intent extraction yields: all-pairs ToR reachability; traffic must follow the shortest path; ECMP redundancy [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  11. Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?

  12. Challenges • Prior work: Anteater [Mai 2011], HSA [Kazemian 2012], VeriFlow [Khurshid 2013], Libra [Zeng 2014], NetKAT [Anderson 2014], NoD [Lopes 2015], Symmetries [Plotkin 2016, Beckett 2018] • All-pairs ToR reachability analysis is O(N^3) • Taking a composite FIB snapshot is a hard engineering problem

  13. Local Validation • Exploit the Azure network’s regular structure • Each router has a fixed role for a set of addresses • It is enough to verify that the role is enforced on each router • Decompose the intent into local contracts [Figure: topology with backbone routers R1-R4, spine routers D1-D4, leaf routers A1-A4 and B1-B4, and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
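
The idea of checking a per-device contract against that device's FIB alone can be sketched as follows. Everything concrete here is hypothetical: the contract format (prefix mapped to the expected set of next hops), the prefix-to-ToR assignment, and the sample FIBs are illustrations, not the format RCDC uses. The comparison is also simplified to exact prefix entries rather than full longest-prefix-match semantics.

```python
def check_contracts(device, fib, contracts):
    """Compare one device's FIB with its local contracts.

    Returns a list of violation strings; an empty list means the device
    honors every contract. Only this device's FIB is needed, so devices
    can be checked independently of each other.
    """
    violations = []
    for prefix, expected in contracts.items():
        actual = set(fib.get(prefix, ()))
        if actual != set(expected):
            missing = sorted(set(expected) - actual)
            extra = sorted(actual - set(expected))
            violations.append(f"{device} {prefix}: missing={missing} extra={extra}")
    return violations

# Hypothetical contracts for leaf router A1: specific prefixes go down to the
# ToR that owns them, everything else goes up to all spine routers (ECMP).
a1_contracts = {
    "10.0.0.0/16": ["ToR1"],
    "11.0.0.0/16": ["ToR2"],
    "0.0.0.0/0": ["D1", "D2", "D3", "D4"],
}

a1_fib_ok = dict(a1_contracts)
# A FIB that still forwards traffic but has lost one ECMP next hop.
a1_fib_degraded = {**a1_contracts, "0.0.0.0/0": ["D1", "D2", "D3"]}

print(check_contracts("A1", a1_fib_ok, a1_contracts))
print(check_contracts("A1", a1_fib_degraded, a1_contracts))
```

Because each check touches one device, the work grows linearly with the number of devices and parallelizes trivially, which is the motivation the slide gives for decomposing the global intent.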

  14. What are the contracts? • ToR1 contracts: default contracts and specific contracts [Figure: topology with backbone routers R1-R4, spine routers D1-D4, leaf routers A1-A4 and B1-B4, and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  15. What are the contracts? • A1 contracts [Figure: topology with backbone routers R1-R4, spine routers D1-D4, leaf routers A1-A4 and B1-B4, and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  16. What are the contracts? [Figure: topology with backbone routers R1-R4, spine routers D1-D4, leaf routers A1-A4 and B1-B4, and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  17. Live Monitoring of Forwarding Behavior • Validation time for one datacenter < 3 minutes [Figure: Network Graph Service, reachability invariants, and error reports, shown alongside the topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  18. Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?

  19. Latent Error [Figure: topology with backbone routers R1-R4, spine routers D1-D4, leaf routers A1-A4 and B1-B4, and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  20. Latent Errors [Figure: topology with backbone routers R1-R4, spine routers D1-D4, leaf routers A1-A4 and B1-B4, and ToR1-ToR4; ToR prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]

  21. What did we do about the errors? O(100) • Risk categorization • Role of the device • Number of additional faults required to cause an impact

  22. Experience: Types of Errors • Categories: software bugs, hardware failures, operational drift, migrations • Examples: a software bug that caused RIB-FIB inconsistency; operationally down links; BGP sessions that are shut; port channels not configured on T1s; two T1 sets configured with the same ASN

  23. Reliability at Hyperscale • Is the network operating as expected? • Will my change affect the network?

  24. Verifying Device Access-Control Lists (ACL) • Rule columns: srcIp, dstIp, protocol, action • Example rows: (*, 100.64.0.0/16, UDP, deny); (*, *, *, permit); (*, *, *, deny) [Figure: Contracts and Policy, Parsers, bit-vector logic formulas, Z3: Check, SecGuru]
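
The rule format on slide 24 (srcIp, dstIp, protocol, action) can be made concrete with a small first-match evaluator. Note the swapped-in technique: the slide describes encoding policies and contracts as bit-vector formulas checked by Z3, which covers all packets symbolically, whereas this sketch only evaluates individual sample packets. The rule ordering, the implicit deny, and the sample packets are assumptions for illustration.

```python
import ipaddress

ANY = "*"

def field_matches(rule_value, packet_value, is_ip):
    """A wildcard matches anything; IP fields match by prefix, others by equality."""
    if rule_value == ANY:
        return True
    if is_ip:
        return ipaddress.ip_address(packet_value) in ipaddress.ip_network(rule_value)
    return rule_value == packet_value

def evaluate(acl, packet):
    """First-match semantics: return the action of the first matching rule.

    Falls back to "deny" when no rule matches (an assumed implicit deny).
    """
    src, dst, proto = packet
    for r_src, r_dst, r_proto, action in acl:
        if (field_matches(r_src, src, True)
                and field_matches(r_dst, dst, True)
                and field_matches(r_proto, proto, False)):
            return action
    return "deny"

# Hypothetical policy built from the rows shown on the slide.
acl = [
    (ANY, "100.64.0.0/16", "UDP", "deny"),
    (ANY, ANY, ANY, "permit"),
]

# A contract expressed as (sample packet, expected action) pairs.
contract = [
    (("10.0.0.1", "100.64.3.4", "UDP"), "deny"),
    (("10.0.0.1", "100.64.3.4", "TCP"), "permit"),
]

failures = [(pkt, want) for pkt, want in contract if evaluate(acl, pkt) != want]
print(failures)
```

Sampling can miss violations that a symbolic check would find, which is why a solver-based approach is preferable for real policies.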

  25. Refactoring a Large Legacy ACL • Before: Edge ACL of several thousand lines; intent was poorly understood; difficult to make changes • Refactor: move out service-specific protections • After: Edge ACL of a few hundred lines

  26. Refactoring a Large Legacy ACL [Figure: workflow steps labeled Regression contracts, SecGuru, Fix errors in policy, and Deploy refactored ACL; the error report has the fields "Contract expects:" and "Policy only allows:"]

  27. Refactoring a Large Legacy ACL

  28. Summary • Captured and checked intent in Azure datacenters • Incorporated verification to monitor drift and check the impact of changes • Optimized for hyperscale

  29. More Challenges • Wide area networks • Better abstractions for intent • Model-based testing of device firmware • Verifying virtual network policies • Contact: dmaltz@microsoft.com, karjay@microsoft.com, padhye@microsoft.com
