
Traffic Engineering with Forward Fault Correction (FFC)

Traffic Engineering with Forward Fault Correction (FFC). Hongqiang "Harry" Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University). Cloud services require large network capacity. Cloud services: growing traffic. Cloud networks: expensive.


Presentation Transcript


  1. Traffic Engineering with Forward Fault Correction (FFC). Hongqiang "Harry" Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University)

  2. Cloud services require large network capacity. Cloud services: growing traffic. Cloud networks: expensive (e.g. cost of WAN: $100M/year).

  3. TE is critical to effectively utilizing networks. Traffic engineering in datacenter networks: • DevoFlow • MicroTE • … Traffic engineering in WAN networks: • Microsoft SWAN • Google B4 • …

  4. But TE is also vulnerable to faults. Diagram: the TE controller and the network exchange a network view and network configuration, with frequent updates for high utilization. The loop is exposed to control-plane faults and data-plane faults.

  5. Control plane faults: failures or long delays to configure a network device. The TE controller sends TE configurations to a switch. Causes: RPC failure, firmware bugs, overloaded CPU, memory shortage.

  6. Congestion due to control plane faults. s2→s4 (10). New flows (traffic demands): s1→s2 (10), s1→s3 (10), s1→s4 (10), s3→s4 (10).

  7. Data plane faults: link and switch failures. Rescaling: sending traffic proportionally to residual paths. Flows: s2→s4 (10), s3→s4 (10).
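
The rescaling rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rescale`, the tunnel names, and the representation of a flow as a map from tunnel to traffic fraction are all assumptions made for the example.

```python
def rescale(weights, failed):
    """Redistribute one flow's traffic over the tunnels that survive a fault.

    weights: dict mapping tunnel name -> fraction of the flow's traffic
             (hypothetical representation; fractions sum to 1).
    failed:  set of tunnel names that traverse a failed link or switch.
    """
    # Keep only the residual (surviving) tunnels.
    residual = {t: w for t, w in weights.items() if t not in failed}
    total = sum(residual.values())
    if total == 0:
        raise ValueError("no residual tunnel carries traffic for this flow")
    # Renormalize so each residual tunnel keeps its relative share.
    return {t: w / total for t, w in residual.items()}


if __name__ == "__main__":
    # Tunnel t1 fails; t2 and t3 absorb its share in a 3:2 ratio.
    print(rescale({"t1": 0.5, "t2": 0.3, "t3": 0.2}, {"t1"}))
```

Because the surviving tunnels absorb the failed tunnel's share, links on those tunnels can become overloaded, which is the congestion risk the slide points at.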

  8. Control and data plane faults in practice. In production networks: • Faults are common. • Faults cause severe congestion. Control plane: fault rate = 0.1% to 1% per TE update. Data plane: fault rate = 25% per 5 minutes.

  9. State of the art for handling faults. • Heavy over-provisioning: big loss in throughput. • Reactive handling of faults (control plane faults: retry; data plane faults: re-compute TE and update networks): cannot prevent congestion, slow (seconds to minutes), blocked by control plane faults.

  10. How about handling congestion proactively?

  11. Forward fault correction (FFC) in TE. • [Bad news] Individual faults are unpredictable. • [Good news] The number of simultaneous faults is small. Analogy: for packet loss, FEC uses careful data encoding to guarantee no information loss under up to k arbitrary packet drops. For network faults, FFC uses careful traffic distribution to guarantee no congestion under up to k arbitrary faults.

  12. Example: FFC for control plane faults Non-FFC

  13. Example: FFC for control plane faults Control Plane FFC (k=1)

  14. Trade-off: network utilization vs. robustness. Non-FFC: throughput 50. Control plane FFC with k=1: throughput 47. Control plane FFC with k=2: throughput 44.

  15. Systematically realizing FFC in TE Formulation: How to merge FFC into existing TE framework? Computation: How to find FFC-TE efficiently?

  16. Basic TE linear programming formulations. LP formulation: sizes of flows; TE decisions: traffic on paths; TE objective: maximizing throughput. Basic TE constraints: deliver all granted flows; no overloaded link. FFC constraints: no overloaded link under up to a given number of control plane faults, link failures, and switch failures.
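
The symbols of the basic LP were lost in this transcript. One way to write it that matches the labels on the slide is sketched below. All symbols are assumptions introduced here for illustration: $b_f$ is the granted size of flow $f$, $d_f$ its demand, $T_f$ its set of paths, $a_{f,t}$ the traffic of $f$ on path $t$, and $c_e$ the capacity of link $e$.

```latex
\begin{aligned}
\max\quad & \sum_{f} b_f
  && \text{(maximize throughput)} \\
\text{s.t.}\quad
  & \sum_{t \in T_f} a_{f,t} \ge b_f
  && \forall f \quad \text{(deliver all granted flows)} \\
  & \sum_{f} \sum_{t \in T_f,\; t \ni e} a_{f,t} \le c_e
  && \forall e \quad \text{(no overloaded link)} \\
  & 0 \le b_f \le d_f, \quad a_{f,t} \ge 0
  && \forall f,\ t \in T_f
\end{aligned}
```

The FFC constraints listed on the slide would be added on top of these basic constraints.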

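  17. Formulating control plane FFC. What is the total load on a link under faults? A flow whose update fails keeps its load from the old TE; a flow whose update succeeds has its load from the new TE. Each possible fault (on one flow, on another, and so on) gives a different total load that must fit on the link. Challenge: too many constraints. With n flows and FFC protection k, each link needs one constraint per possible set of faulty flows, which grows combinatorially in n and k.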
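
To make the blow-up concrete, the sketch below counts fault sets per link. It assumes one constraint per non-empty set of at most k faulty flows; the exact count on the original slide is not recoverable from this transcript, and the function name is hypothetical.

```python
from math import comb


def naive_constraints_per_link(n, k):
    """Count non-empty sets of at most k faulty flows among n flows.

    Assumption for illustration: the naive formulation needs one
    constraint per such set, for each link.
    """
    return sum(comb(n, j) for j in range(1, k + 1))


if __name__ == "__main__":
    n, k = 1000, 3
    print(naive_constraints_per_link(n, k))  # naive count per link
    print(k * n)                             # order of the compressed O(kn) form
```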

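  18. An efficient and precise solution to FFC. Our approach: a lossless compression from $O(n^k)$ constraints to $O(kn)$ constraints. Requirement: total load under faults $\le$ link capacity, that is, total additional load due to faults $\le$ link spare capacity. Let $x_i$ be the additional load due to fault $i$. Given $X = \{x_1, \dots, x_n\}$, FFC requires that the sum of any $k$ elements of $X$ is $\le$ link spare capacity, which naively takes $O(n^k)$ constraints. Define $y_m$ as the $m$-th largest element of $X$. The requirement then becomes $\sum_{m=1}^{k} y_m \le$ link spare capacity, a single constraint per link. Remaining question: how to express $y_m$ in terms of $X$?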
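
The compression rests on a simple fact: every k-subset sum is within the spare capacity exactly when the sum of the k largest elements is. The sketch below checks this numerically; the function names and the sample loads are made up for illustration, and it assumes k is at most the number of elements.

```python
from itertools import combinations


def ffc_ok_naive(extra_loads, k, spare):
    """One check per k-subset of faults: the naive formulation."""
    return all(sum(subset) <= spare for subset in combinations(extra_loads, k))


def ffc_ok_top_k(extra_loads, k, spare):
    """Single check on the sum of the k largest additional loads."""
    return sum(sorted(extra_loads, reverse=True)[:k]) <= spare


if __name__ == "__main__":
    loads = [4, 1, 3, 2]  # hypothetical additional load per fault
    for spare in (6, 7):
        print(spare, ffc_ok_naive(loads, 2, spare), ffc_ok_top_k(loads, 2, spare))
```

The two checks agree because the k largest elements form the worst-case subset, so bounding their sum bounds every other subset as well.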

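  19. Sorting network. Each comparison takes two inputs $a$, $b$ and outputs $\max\{a, b\}$ and $\min\{a, b\}$. The 1st round of comparisons yields $y_1$ (1st largest) and the 2nd round yields $y_2$ (2nd largest). In the example with two rounds, the constraint is $y_1 + y_2 \le$ link spare capacity. • Complexity: O(kn) additional variables and constraints. • Throughput: optimal in control-plane and data plane if paths are disjoint.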
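
The rounds of comparisons can be pictured as partial bubble-sort passes: each pass is a chain of compare-exchange steps that carries the largest remaining value to the end. The sketch below runs this on plain numbers to show the comparison count; it is not the LP encoding itself, and the function name and sample values are assumptions for illustration.

```python
def k_largest_by_rounds(values, k):
    """Extract the k largest values using k rounds of compare-exchange.

    Returns (largest, comparisons): the k largest values in descending
    order and the number of comparisons used, which is at most k * (n - 1).
    """
    xs = list(values)
    n = len(xs)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(values)")
    largest = []
    comparisons = 0
    for r in range(k):
        # One round: push the maximum of the first n - r entries to the end.
        for i in range(n - r - 1):
            hi, lo = max(xs[i], xs[i + 1]), min(xs[i], xs[i + 1])
            xs[i], xs[i + 1] = lo, hi
            comparisons += 1
        largest.append(xs[n - r - 1])
    return largest, comparisons


if __name__ == "__main__":
    print(k_largest_by_rounds([3, 9, 1, 7, 5], 2))
```

Since each round uses fewer than n comparisons, k rounds stay within k * n comparisons, in line with the O(kn) bound on the slide.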

  20. FFC extensions • Differential protection for different traffic priorities • Minimizing congestion risks without rate limiters • Control plane faults on rate limiters • Uncertainty in current TE • Different TE objectives (e.g. max-min fairness) • …

  21. Evaluation overview. • Testbed experiment: FFC can be implemented in commodity switches; FFC has no data loss due to congestion under faults. • Large-scale simulation: a WAN network with O(100) switches and O(1000) links; injecting faults based on real failure reports; multiple priority traffic in a well-utilized network; single priority traffic in a well-provisioned network.

  22. FFC prevents congestion with negligible throughput loss. [Figure: Ratio (%) of FFC throughput / optimal throughput and of FFC data loss / non-FFC data loss, for single priority traffic and for high priority (high FFC protection), medium priority (low FFC protection), and low priority (no FFC protection) traffic. Annotations on the bars: 160% and <0.01%.]

  23. Conclusions. • Centralized TE is critical to high network utilization but is vulnerable to control and data plane faults. • FFC proactively handles these faults. • Guarantee: no congestion when #faults ≤ k. • Efficiently computable with low throughput overhead in practice. Diagram labels: FFC; heavy network over-provisioning; high risk of congestion.
