
EyeQ: (An engineer's approach to) Taming network performance unpredictability in the Cloud


Presentation Transcript


  1. EyeQ: (An engineer's approach to) Taming network performance unpredictability in the Cloud. Vimal, Mohammad Alizadeh, Balaji Prabhakar, David Mazières, Changhoon Kim, Albert Greenberg.

  2. What are we depending on? Many customers don't even realise they have network issues: the advice is just "spin up more VMs!", which makes the app even more network dependent. From "5 Lessons We've Learned Using AWS": "… in the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency." http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html Netflix's answer: overhaul apps to deal with variability.

  3. Cloud: a Warehouse-Scale Computer. Multi-tenancy increases cluster utilisation. Provisioning the warehouse covers CPU, memory, disk, and the network. http://research.google.com/people/jeff/latency.html

  4. Sharing the Network. • Policy: the sharing model. • Mechanism: computing rates, and enforcing rates on entities, per-VM (multi-tenant) or per-service (search, map-reduce, etc.). Can we achieve a VM spec like 2GHz VCPU, 15GB memory, 1Gbps network? Customer X specifies only the thickness of each pipe; there is no traffic matrix (the Hose Model). (Figure: Tenant X's virtual switch connecting VM1 … VMn, and Tenant Y's connecting VM1 … VMi.)
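Under the hose model above, admission control is simple: a VM's bandwidth guarantee fits on a server as long as the guarantees already placed there leave room on the NIC. A minimal sketch in Python (the helper name and the 10Gbps default are illustrative, not from the EyeQ code):

```python
def can_place(server_guarantees_gbps, new_vm_gbps, nic_capacity_gbps=10.0):
    """Hose-model admission check: a new VM fits if its guarantee plus the
    guarantees already placed on this server stay within NIC capacity."""
    return sum(server_guarantees_gbps) + new_vm_gbps <= nic_capacity_gbps

# A server already hosting VMs with 4 and 3 Gbps guarantees:
print(can_place([4.0, 3.0], 2.0))  # True: 9 <= 10
print(can_place([4.0, 3.0], 4.0))  # False: 11 > 10
```

The point of the hose model is that this per-pipe check is all the tenant-facing spec requires; no pairwise traffic matrix is ever declared.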

  5. Why is it hard? (1) • The default policy is insufficient: 1 vs. many TCP flows, UDP, etc. • Traditional QoS mechanisms scale poorly. • Bandwidth demands can be random and bursty, and short: few-millisecond requests, from 10–100KB up to 10–100MB. • Timescales matter! Guarantees are needed on the order of a few RTTs (milliseconds).

  6. Seconds: an Eternity. (Figure: one long-lived TCP flow and a bursty UDP session, ON for 5ms and OFF for 15ms, sharing a 10G pipe through a single switch.)
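The ON/OFF numbers above show why second-scale averages hide the problem. A quick worked calculation (the line rate of 10Gbps during bursts is an assumption from the shared-pipe figure):

```python
# Bursty UDP source: sends at line rate for 5 ms, idles for 15 ms.
on_ms, off_ms, line_rate_gbps = 5, 15, 10.0

# Time-averaged offered load over one ON/OFF cycle.
avg_gbps = line_rate_gbps * on_ms / (on_ms + off_ms)
print(avg_gbps)  # 2.5
```

Averaged over a second, the source looks like a modest 2.5Gbps flow, yet every 5ms burst saturates the full 10G pipe, starving the TCP flow at millisecond timescales.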

  7. Under the hood. (Figure: inside the shared switch.)

  8. Why is it hard? (2) • The switch sees contention, but lacks VM state. • The receiver host has VM state, but does not see contention. Two complications: (1) drops in the network mean servers don't see true demand; (2) TCP's back-off makes true demand even harder to detect.

  9. Key Idea: Bandwidth Headroom. • Bandwidth guarantees come down to managing congestion. • Congestion: link utilisation reaches 100%, at millisecond timescales. • So don't allow 100% utilisation: with 10% headroom, the receiver can detect congestion early. (Figure: on a single switch, TCP and UDP share an N x 10G pipe limited to 9G.) Headroom handles a single switch; what about a network?
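The 10% headroom makes receiver-side detection trivial: the RX host just compares its measured arrival rate against the headroom threshold over a millisecond interval. A minimal sketch, assuming a 10G link with the slide's 10% headroom (function name and parameters are illustrative, not the EyeQ API):

```python
def is_congested(bytes_rx, interval_s, line_rate_gbps=10.0, headroom=0.10):
    """RX-side congestion detector: flag congestion when the measured
    utilisation exceeds the headroom threshold (9 Gbps on a 10G link)."""
    rate_gbps = bytes_rx * 8 / interval_s / 1e9
    return rate_gbps > line_rate_gbps * (1 - headroom)

# 1.2 MB received in 1 ms is ~9.6 Gbps, above the 9 Gbps threshold:
print(is_congested(1_200_000, 0.001))  # True
# 1.0 MB in 1 ms is 8 Gbps, below it:
print(is_congested(1_000_000, 0.001))  # False
```

Because the threshold sits below line rate, congestion is signalled before queues build and drops occur, which is exactly what makes millisecond-timescale reaction possible.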

  10. Network design: the old: over-subscription. http://bradhedlund.com/2012/04/30/network-that-doesnt-suck-for-cloud-and-big-data-interop-2012-session-teaser/

  11. Network design: the new: (1) uniform capacity across racks; (2) over-subscription only at the Top-of-Rack. http://bradhedlund.com/2012/04/30/network-that-doesnt-suck-for-cloud-and-big-data-interop-2012-session-teaser/

  12. Mitigating Congestion in a Network. Each server connects to the fabric through a 10Gbps pipe. Load balancing (ECMP, etc.) spreads traffic across the fabric; admissibility comes from end-to-end congestion control (EyeQ). If the aggregate rate into a pipe stays below 10Gbps, the fabric is congestion free; if it exceeds 10Gbps, the fabric gets congested. Load balancing + admissibility = a hotspot-free network core. [VL2, FatTree, Hedera, MicroTE]
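The admissibility condition above can be stated as a simple invariant: with good load balancing, the core stays hotspot-free as long as the aggregate rate into every destination pipe stays under its capacity. A sketch of that check (the flow representation is hypothetical, chosen just to illustrate the invariant):

```python
def fabric_congestion_free(flows, capacity_gbps=10.0):
    """flows: list of (src, dst, rate_gbps) tuples. With load balancing
    (ECMP, etc.) spreading traffic across core paths, the fabric is
    congestion free iff the aggregate rate into each destination's
    access pipe stays below its capacity."""
    into = {}
    for _, dst, rate in flows:
        into[dst] = into.get(dst, 0.0) + rate
    return all(r < capacity_gbps for r in into.values())

print(fabric_congestion_free([("a", "x", 6.0), ("b", "x", 3.0)]))  # True: 9 < 10
print(fabric_congestion_free([("a", "x", 6.0), ("b", "x", 5.0)]))  # False: 11 > 10
```

EyeQ's job is to enforce this invariant continuously from the edge, without any state in the core switches.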

  13. EyeQ Platform. Untrusted VMs on each host sit behind a software vswitch. The TX vswitch runs adaptive rate limiters; the RX vswitch runs congestion detectors. End-to-end flow control operates vswitch-to-vswitch across the data centre fabric: the RX component detects congestion and returns congestion feedback, and the TX component reacts. (Figure: two senders rate-limited to 3Gbps and 6Gbps.)
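The detect/react loop above can be sketched as a rate-limiter update rule driven by RX feedback. EyeQ's actual controller is RCP-inspired; this sketch substitutes a simpler AIMD-style rule, and every name and constant here is illustrative rather than taken from the EyeQ code:

```python
def update_rate(rate_gbps, congested, line_rate_gbps=10.0,
                beta=0.5, step_gbps=0.1):
    """TX-side adaptive rate limiter: cut the limit multiplicatively on
    RX congestion feedback, otherwise probe additively toward line rate.
    (AIMD stand-in for EyeQ's RCP-like controller.)"""
    if congested:
        return max(rate_gbps * beta, step_gbps)
    return min(rate_gbps + step_gbps, line_rate_gbps)

r = 8.0
r = update_rate(r, congested=True)   # feedback arrived: 8.0 -> 4.0
r = update_rate(r, congested=False)  # no feedback: 4.0 -> 4.1
```

Because both the detector (RX) and the limiter (TX) live in the vswitches, the loop closes in a few RTTs with no per-VM state in the fabric.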

  14. Does it work? With EyeQ, TCP gets 6Gbps and UDP gets 3Gbps: EyeQ improves utilisation and provides protection. (Figure: throughput without EyeQ vs. with EyeQ.)

  15. State: only at the edge. EyeQ makes the fabric behave like one big switch.

  16. Thanks! jvimal@stanford.edu. EyeQ: load balancing + bandwidth headroom + admissibility at millisecond timescales = the network as one big switch = bandwidth sharing at the edge. Linux and Windows implementations for 10Gbps; ~1700 lines of C code. http://github.com/jvimal/perfiso_10g (Linux kmod). No documentation, yet.
