
Peregrine : An All-Layer-2 Container Computer Network

Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huan. ∗Computer Science Department, Stony Brook University; †Industrial Technology Research Institute, Taiwan.


Presentation Transcript


  1. Peregrine: An All-Layer-2 Container Computer Network • Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huan • ∗Computer Science Department, Stony Brook University • †Industrial Technology Research Institute, Taiwan

  2. Outline • Motivation • Layer 2 + Layer 3 design • Requirements for cloud-scale DC • Problems of Classic Ethernet to the Cloud • Solutions • Related Solutions • Peregrine’s Solution • Implementation and Evaluation • Software architecture • Performance Evaluation

  3. L2 + L3 Architecture: Problems • Bandwidth bottleneck • Problem: configuration • Routing table in the routers • IP assignment • DHCP coordination • VLAN and STP • Problem: forwarding table size • Commodity switch: 16-32k entries • Virtual machine mobility constrained to a physical location • Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System

  4. Requirements for Cloud-Scale DC • Any-to-any connectivity with non-blocking fabric • Scale to more than 10,000 physical nodes • Virtual machine mobility • Large layer 2 domain • Fast fail-over • Quick failure detection and recovery • Support for Multi-tenancy • Share resources between different customers • Load balancing routing • Efficiently use all available links

  5. Solution: A Huge L2 Switch! • Single L2 network • Non-blocking backplane bandwidth • Config-free, plug-and-play • Linear cost and power scaling • Scale to 1 million VMs • However, Ethernet does not scale! (figure: many VMs attached to one Layer 2 switch)

  6. Revisit Ethernet: Spanning Tree Topology (figure: topology with hosts N1-N8 and switches s1-s4)

  7. Revisit Ethernet: Spanning Tree Topology (figure: same topology with a root switch and ports marked R, D, and B)

  8. Revisit Ethernet: Broadcast and Source Learning • Benefit: plug-and-play (figure: same topology with a root switch and ports marked R, D, and B)

  9. Ethernet’s Scalability Issues • Limited forwarding table size • Commodity switch: 16k to 64k entries • STP as a solution to loop prevention • Not all physical links are used • No load-sensitive dynamic routing • Slow fail-over • Fail-over latency is high (> 5 seconds) • Broadcast overhead • Typical Layer 2 consists of hundreds of hosts

  10. Related Works / Solution Strategies • Scalability: • Clos network / Fat-tree to scale out • Alternative to STP • Link aggregation, e.g. LACP, Layer 2 trunking • Routing protocols to layer 2 network • Limited forwarding table size • Packet header encapsulation or re-writing • Load balancing • Randomness or traffic engineering approach

  11. Design of the Peregrine

  12. Peregrine’s Solutions • Not all links are used: disable Spanning Tree Protocol • L2 loop prevention: redirect broadcast and block flooding packets • Source learning and forwarding: calculate all routes for all node-pairs by Route Server • Limited switch forwarding table size: Mac-in-Mac two stage forwarding by Dom0 kernel module

  13. ARP Intercept and Redirect • DS: Directory Service • RAS: Route Algorithm Server • 1. DS-ARP • 2. DS-Reply • 3. Send data (figure: hosts A and B, switches sw1-sw4, DS and RAS, with control flow and data flow marked)
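
The three steps on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the `DirectoryService` class, the `Location` fields, and the example addresses are hypothetical and are not taken from the Peregrine implementation. The sketch assumes the DS keeps a table from a VM's IP address to its MAC and to the switch it attaches to.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Location:
    vm_mac: str      # MAC address of the destination VM
    switch_mac: str  # MAC of the switch the VM attaches to (assumed field)


class DirectoryService:
    """Hypothetical DS: answers ARP queries by unicast lookup instead of broadcast."""

    def __init__(self) -> None:
        self._table: Dict[str, Location] = {}

    def register(self, ip: str, loc: Location) -> None:
        self._table[ip] = loc

    def resolve(self, ip: str) -> Optional[Location]:
        return self._table.get(ip)


def handle_arp_request(ds: DirectoryService, target_ip: str) -> Optional[Location]:
    # Step 1 (DS-ARP): the intercepted ARP request goes to the DS as a unicast query.
    # Step 2 (DS-Reply): the DS returns the location, or None if the IP is unknown.
    return ds.resolve(target_ip)


ds = DirectoryService()
ds.register("10.0.0.2", Location(vm_mac="00:00:00:00:00:0b",
                                 switch_mac="00:00:00:00:04:00"))
reply = handle_arp_request(ds, "10.0.0.2")
# Step 3 (Send data) would use reply.vm_mac and reply.switch_mac to build frames.
```

Because the query is a unicast lookup, no ARP broadcast has to be flooded through the switches.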

  14. Peregrine’s Solutions • Not all links are used: disable Spanning Tree Protocol • L2 loop prevention: redirect broadcast and block flooding packets • Source learning and forwarding: calculate all routes for all node-pairs by Route Server • Fast fail-over: primary and backup routes for each pair • Limited switch forwarding table size: Mac-in-Mac two stage forwarding by Dom0 kernel module

  15. Mac-in-Mac Encapsulation • 1. ARP redirect • 2. B is located at sw4 • 3. sw4 • 4. Encap sw4 in source MAC • 5. Decap and restore original frame • Frame fields: IDA, DA, SA when encapsulated; DA, SA when decapsulated (figure: hosts A and B, switches sw1-sw4, DS, with control flow and data flow marked)
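
A minimal Python sketch of the encapsulate/decapsulate round trip follows. The exact header layout is not recoverable from this transcript, so the sketch assumes a generic form in which the encapsulated frame carries an intermediate destination address (IDA, the egress switch's MAC) in front of the original DA and SA. The `Frame` and `EncapFrame` types and the example MAC strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    da: str         # destination MAC of the original frame
    sa: str         # source MAC of the original frame
    payload: bytes


@dataclass(frozen=True)
class EncapFrame:
    ida: str        # intermediate destination: MAC of the egress switch (assumed layout)
    inner: Frame    # original frame, carried unchanged


def encap(frame: Frame, egress_switch_mac: str) -> EncapFrame:
    # Stage 1: the sender addresses the frame to the destination's switch.
    return EncapFrame(ida=egress_switch_mac, inner=frame)


def decap(encapsulated: EncapFrame) -> Frame:
    # Stage 2: the outer address is stripped and the original frame is restored.
    return encapsulated.inner


original = Frame(da="mac-B", sa="mac-A", payload=b"hello")
on_the_wire = encap(original, egress_switch_mac="mac-sw4")
restored = decap(on_the_wire)
```

Under this assumed layout, switches in the middle of the network forward on `ida` only, so their forwarding tables need entries for switch MACs rather than for every VM MAC.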

  16. Fast Fail-Over • Goal: Fail-over latency < 100 msec • Application agnostic • TCP timeout: 200ms • Strategy: Pre-compute a primary and backup route for each VM • Each VM has two virtual MACs • When a link fails, notify hosts using affected primary routes that they should switch to corresponding backup routes
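
The strategy above can be sketched in Python as below. This is an illustrative model, not the Peregrine code: the `RoutePair` structure, the `on_link_failure` function, and the example MAC and switch names are all hypothetical. It assumes each VM is reachable through two pre-computed virtual MACs, one bound to the primary route and one to the backup route.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]


def norm(a: str, b: str) -> Link:
    """Order the endpoints so that (a, b) and (b, a) name the same link."""
    return (a, b) if a <= b else (b, a)


@dataclass
class RoutePair:
    primary_mac: str           # virtual MAC bound to the primary route
    backup_mac: str            # virtual MAC bound to the backup route
    primary_links: Set[Link]   # links traversed by the primary route
    use_backup: bool = False

    def active_mac(self) -> str:
        return self.backup_mac if self.use_backup else self.primary_mac


def on_link_failure(routes: Dict[str, RoutePair], failed: Link) -> List[str]:
    """Switch every destination whose primary route uses the failed link."""
    affected = []
    for vm, pair in routes.items():
        if failed in pair.primary_links and not pair.use_backup:
            pair.use_backup = True
            affected.append(vm)
    return affected


routes = {
    "vmB": RoutePair("mac-B-primary", "mac-B-backup",
                     {norm("sw1", "sw3"), norm("sw3", "sw4")}),
    "vmC": RoutePair("mac-C-primary", "mac-C-backup",
                     {norm("sw1", "sw2")}),
}
affected = on_link_failure(routes, norm("sw4", "sw3"))
```

Because both routes are computed ahead of time, the reaction to a failure in this model is a table lookup and a flag flip rather than a route recomputation.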

  17. When a Network Link Fails

  18. IMPLEMENTATION & Evaluation

  19. Software Architecture

  20. Review All Components • DS: How fast can DS handle requests (ARP request rate)? • RAS: How long does RAS take to process a request? • Performance of: MIM, DS, RAS, switch? (figure: hosts A and B, switches sw1-sw7, DS, RAS, MIM module, ARP redirect, and backup route)

  21. Mac-In-Mac Performance • Time spent for decap/encap/total: 1us / 5us / 7us (2.66GHz CPU) • Around 2.66K / 13.3K / 18.6K cycles (figure: CDF of processing time)
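
The cycle counts follow from multiplying each time by the 2.66 GHz clock rate, as this short worked calculation shows:

```latex
\text{cycles} = t \times f:\qquad
\begin{aligned}
1\,\mu\text{s} \times 2.66\,\text{GHz} &= 2660 \approx 2.66\text{K} \\
5\,\mu\text{s} \times 2.66\,\text{GHz} &= 13300 = 13.3\text{K} \\
7\,\mu\text{s} \times 2.66\,\text{GHz} &= 18620 \approx 18.6\text{K}
\end{aligned}
```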

  22. Aggregate Throughput for Multiple VMs 1. ARP table size < 1k 2. Measure TCP throughput of 1, 2, and 4 VMs communicating with each other.

  23. ARP Broadcast Rate in a Data Center • What’s the ARP traffic rate in the real world? • From 2456 hosts, the CMU CS department reports 1150 ARP/sec at peak, 89 ARP/sec on average. • From 3800 hosts on a university network, there are around 1000 ARP/sec at peak, < 100 ARP/sec on average. • To scale to 1M nodes, 20K-30K ARP/sec on average. • Current optimal DS: 100K ARP/sec

  24. Fail-over time and its breakdown • Average fail-over time: 75ms • Switch: 25 ~ 45 ms • sending trap (soft unplug) • RS: 25ms • receiving trap and processing • DS: 2ms • receiving info from RS and inform DS • The rest is network delay and dom0 processing time

  25. Conclusion • A unified Layer-2-only network for LAN and SAN • Centralized control plane and distributed data plane • Use only commodity Ethernet switches • Army of commodity switches vs. few high-port-density switches • Requirements on switches: run fast and have a programmable routing table • Centralized load-balancing routing using real-time traffic matrix • Fast fail-over using pre-computed primary/backup routes

  26. Questions? Thank you

  27. Review All Components: Result • DS: 100K ARP/sec (ARP request rate) • RS: 25ms per request • Packet processing: 7us • Link down: 35ms (figure: hosts A and B, switches sw1-sw7, ARP redirect, and backup route)

  28. Backup slides Thank you

  29. OpenFlow Architecture • OpenFlow switch: A data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set • OpenFlow controller: A control plane that sets up the flow rules in the flow tables of OpenFlow switches • OpenFlow protocol: A secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches

  30. OpenFlow Controller OpenFlow Protocol (SSL/TCP) Control Path OpenFlow Data Path (Hardware)

  31. Conclusion and Contribution • Using commodity switches to build a large scale layer 2 network • Provide solutions to Ethernet’s scalability issues • Suppressing broadcast • Load balancing route calculation • Controlling MAC forwarding table • Scale up to one million VMs by Mac-in-Mac two stage forwarding • Fast fail-over • Future work • High Availability of DS and RAS, master-slave model • Inter

  32. Comparisons • Scalable and available data center fabrics • IEEE 802.1aq: Shortest Path Bridging • IETF TRILL • Competitors: Cisco, Juniper, Brocade • Differences: commodity switches, centralized load balancing routing and proactive backup route deployment • Network virtualization • OpenStack Quantum API • Competitors: Nicira, NEC • Generality carries a steep performance price • Every virtual network link is a tunnel • Differences: Simpler and more efficient because it runs on L2 switches directly

  33. Three Stage Clos Network (m,n,r) (figure: three stages of switches, with sizes labeled m x n, n x m, and r x r)

  34. Clos Network Theory • Clos(m, n, r) configuration: rn inputs, rn outputs • 2r n x m switches + m r x r switches, less than one rn x rn switch • Each r x r switch can in turn be implemented as a 3-stage Clos network • Clos(m,n,r) is rearrangeably non-blocking iff m >= n • Clos(m,n,r) is strictly non-blocking iff m >= 2n-1
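
The switch-count comparison and the two non-blocking conditions can be checked with a few lines of Python. The function names are illustrative, and the cost measure (crosspoints, one per input/output pair inside each switch) is an assumption used here to make "less than rn x rn" concrete.

```python
def clos_crosspoints(m: int, n: int, r: int) -> int:
    """Crosspoints in Clos(m, n, r): r ingress (n x m), m middle (r x r), r egress (m x n)."""
    return 2 * r * n * m + m * r * r


def crossbar_crosspoints(n: int, r: int) -> int:
    """Crosspoints in a single (rn) x (rn) crossbar with the same port count."""
    return (r * n) ** 2


def is_rearrangeably_nonblocking(m: int, n: int) -> bool:
    return m >= n


def is_strictly_nonblocking(m: int, n: int) -> bool:
    return m >= 2 * n - 1


# Example: 1024 ports built from n = r = 32.
rearrangeable = clos_crosspoints(m=32, n=32, r=32)   # 98304
strict = clos_crosspoints(m=63, n=32, r=32)          # 193536
single = crossbar_crosspoints(n=32, r=32)            # 1048576
```

For this port count, both Clos variants need far fewer crosspoints than the single crossbar. The saving is not universal: for very small n and r the three-stage design can cost more.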

  35. Link Aggregation

  36. ECMP: Equal-Cost Multipath • Pros: multiple links are used • Cons: hash collisions, flows re-converge downstream onto a single link
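
A small Python sketch of hash-based ECMP link selection illustrates both the benefit and the collision problem. It is a simplified model: real switches typically use a hardware hash over header fields, and SHA-256 is used here only as a deterministic stand-in. The `ecmp_pick` function, the flow tuples, and the link names are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple

# (src IP, dst IP, src port, dst port, protocol)
FlowKey = Tuple[str, str, int, int, str]


def ecmp_pick(flow: FlowKey, links: Sequence[str]) -> str:
    """Map a flow to one of the equal-cost links by hashing its header fields."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(links)
    return links[index]


links = ["link0", "link1", "link2", "link3"]
flow = ("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
chosen = ecmp_pick(flow, links)

# Five flows over four links: at least two flows must share a link.
flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 80, "tcp") for i in range(5)]
choices = [ecmp_pick(f, links) for f in flows]
```

All packets of one flow hash to the same link, which preserves packet order, but the selection ignores load, so two heavy flows can land on the same link while other links sit idle.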

  37. Example: Brocade Data Center • L3 ECMP • Link aggregation • Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers

  38. PortLand • Scale-out: Three-layer, multi-root topology • Hierarchical, encodes location into MAC address • Location Discovery Protocol to find shortest path, route by MAC • Fabric Manager maintains IP to MAC mapping • 60-80 ms failover, centrally controlled and notified

  39. VL2: Virtual Layer 2 • Three layer, Clos network • Flat, IP-in-IP, Location Address (LA) and Application Address (AA) • Link-state routing to disseminate LA • VLB + flow-based ECMP • Depend on ECMP to detect link failure • Packet interception at S • VL2 Directory Service

  40. Monsoon • Three layer, multi-root topology • 802.1ah MAC-in-MAC encapsulation, source routing • Centralized routing decision • VLB + MAC rotation • Depend on LSA to detect failures • Packet interception at S • Monsoon Directory Service: IP <-> (server MAC, ToR MAC)

  41. TRILL and SPB • TRILL: Transparent Interconnect of Lots of Links, IETF • IS-IS as a topology management protocol • Shortest path forwarding • New TRILL header • Transit hash to select next-hop • SPB: Shortest Path Bridging, IEEE • IS-IS as a topology management protocol • Shortest path forwarding • 802.1ah MAC-in-MAC • Compute 16 source node based trees

  42. TRILL Packet Forwarding • Link-state routing • TRILL header • A-ID: nickname of A • C-ID: nickname of C • HopC: hop count • Ref: NIL Data Communications

  43. SPB Packet Forwarding • Link-state routing • 802.1ah Mac-in-Mac • I-SID: Backbone Service Instance Identifier • B-VID: backbone VLAN identifier • Ref: NIL Data Communications

  44. Re-arrangeable non-blocking Clos network • Example: N=6, n=2, k=2 • Three-stage Clos network • Condition: k>=n • An unused input at ingress switch can always be connected to an unused output at egress switch • Existing calls may have to be rearranged (figure: input, ingress, middle, egress, and output stages, with switch sizes labeled kxn, nxk, and (N/n)x(N/n), here 2x2 and 3x3)

  45. Features of Peregrine network • Utilize all links • Load balancing routing algorithm • Scale up to 1 million VMs • Two stage dual mode forwarding • Fast fail over • Load balancing routing algorithm

  46. Goal • Given a mesh network and traffic profile • Load balance the network resource utilization • Prevent congestion by balancing the network load to support as much traffic load as possible • Provide fast recovery from failure • Provide primary-backup route to minimize recovery time (figure: primary and backup routes from S to D)

  47. Factors • How to combine them into one number for a particular candidate route? • Only hop count • Hop count and link residual capacity • Hop count, link residual capacity, and link expected load • Hop count, link residual capacity, link expected load, and additional forwarding table entries required

  48. Route Selection: idea • Which route is better from S1 to D1: leave C-D free, or share it with S2-D2? • S2-D2 shares link C-D • Link C-D is more important! • Idea: use it as sparingly as possible (figure: nodes A, B, C, D with endpoints S1, D1, S2, D2)

  49. Route Selection: hop count and residual capacity • Traffic matrix: S1 -> D1: 1G, S2 -> D2: 1G • Options for S1 -> D1: leave C-D free, or share it with S2-D2 • Using hop count or residual capacity makes no difference! (figure: nodes A, B, C, D with endpoints S1, D1, S2, D2)

  50. Determine Criticality of a Link • Importance of a link = fraction of all (s, d) routes that pass through link l • Expected load of a link at initial state, computed from the bandwidth demand matrix for s and d
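
The two quantities on this slide can be illustrated with a short Python sketch. The slide's formulas did not survive in this transcript, so the sketch rests on an assumption: the expected load of a link is taken to be the sum, over all (s, d) pairs, of the link's criticality for that pair multiplied by the pair's bandwidth demand. The function names and the example topology (loosely modeled on the S1/D1, S2/D2 example from the route-selection slides) are hypothetical.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]
Pair = Tuple[str, str]
Route = List[Link]


def criticality(routes: List[Route], link: Link) -> float:
    """Fraction of the candidate routes for one (s, d) pair that use the link."""
    if not routes:
        return 0.0
    return sum(1 for route in routes if link in route) / len(routes)


def expected_load(candidates: Dict[Pair, List[Route]],
                  demand: Dict[Pair, float],
                  link: Link) -> float:
    """Assumed form: sum over (s, d) of criticality times bandwidth demand."""
    return sum(criticality(routes, link) * demand.get(pair, 0.0)
               for pair, routes in candidates.items())


candidates = {
    # S1 -> D1 can go through A-B or through C-D.
    ("S1", "D1"): [
        [("S1", "A"), ("A", "B"), ("B", "D1")],
        [("S1", "C"), ("C", "D"), ("D", "D1")],
    ],
    # S2 -> D2 can only go through C-D.
    ("S2", "D2"): [
        [("S2", "C"), ("C", "D"), ("D", "D2")],
    ],
}
demand = {("S1", "D1"): 1.0, ("S2", "D2"): 1.0}  # Gbit/s

load_cd = expected_load(candidates, demand, ("C", "D"))
load_ab = expected_load(candidates, demand, ("A", "B"))
```

In this example link C-D carries a higher expected load than A-B because the S2-D2 pair has no alternative to it, which matches the earlier observation that C-D is the more important link.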
