Masking Failures from Application Performance in Data Center Networks with Shareable Backup

Dingming Wu+, Yiting Xia+*, Xiaoye Steven Sun+, Xin Sunny Huang+, Simbarashe Dzinamarira+, T. S. Eugene Ng+
+Rice University, *Facebook, Inc.

Data Center Network Should be Reliable but…

Network Failures Are Disruptive
• Median case of failures: 10% less traffic delivered
• Worst 20% of failures: 40% less traffic delivered
(Gill et al., SIGCOMM 2011)

Today's Failure Handling: Rerouting
• Fast local rerouting → inflated path length
• Global optimal rerouting → high latency of route updates
• Impacts flows not traveling through the failure location

Impact on Coflow Completion Time (CCT)
• Facebook coflow trace
• k=16 Fat-tree network
• Global optimal rerouting

Do We Have Other Options?
• Restore network capacity immediately after failure
• Be cost efficient
  -- Small pool of backup switches
• How do we achieve that?

Circuit Switches
• Physical layer device
• Circuit controlled by software
• Examples
  -- optical 2D-MEMS switch, 40 µs, $10 per-port cost
  -- electrical cross-point switch, 70 ns, $3 per-port cost
[Figure: circuit switch with ports A, B, C, D]
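
The idea of a software-controlled circuit can be illustrated with a minimal sketch. The `CircuitSwitch` class, its port numbering, and its method names are hypothetical and not taken from the talk; the sketch only models a symmetric port-to-port mapping that software can rewire.

```python
class CircuitSwitch:
    """Toy model of a circuit switch: each port is wired to at most one peer port."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.peer = {}  # port -> connected port (kept symmetric)

    def disconnect(self, port):
        """Tear down the circuit on `port`, if any, on both of its ends."""
        other = self.peer.pop(port, None)
        if other is not None:
            self.peer.pop(other, None)

    def connect(self, a, b):
        """Set up a circuit between ports a and b, replacing any existing circuits."""
        for p in (a, b):
            if not 0 <= p < self.num_ports:
                raise ValueError(f"port {p} out of range")
        if a == b:
            raise ValueError("cannot connect a port to itself")
        self.disconnect(a)
        self.disconnect(b)
        self.peer[a] = b
        self.peer[b] = a
```

For example, if port 0 faces a server and port 1 faces a regular switch, calling `connect(0, 2)` moves the server onto whatever device is attached to port 2.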

Ideal Architecture
[Figure labels: circuit switch, backup switch, servers, regular switches]
• Entire network shares one backup switch
• Unreasonably high port count of circuit switch
• Replace any failed switch when necessary
• Single point of failure

How to Make It Practical
• Feasibility
  -- small port-count circuit switches
• Scalability
  -- partition network into failure groups
  -- distribute circuit switches across the network
• Low cost
  -- small backup pool
  -- share backup switches per failure group

ShareBackup Architecture
An original Fat-tree with k=6
• Partition the switches into failure groups, each with k/2 switches
• Add backup switches per failure group
[Figure labels: core layer, agg. layer, edge layer]
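
The partitioning step can be sketched as follows. The slide only states that each failure group has k/2 switches; grouping edge and aggregation switches by pod and splitting the core into k/2 groups is an assumption made for this illustration, and the switch names are invented.

```python
def failure_groups(k):
    """Partition the switches of a k-ary fat-tree into groups of k/2 switches.

    Assumed grouping (not confirmed by the slides): one edge group and one
    aggregation group per pod, plus k/2 core groups.
    """
    if k % 2 != 0:
        raise ValueError("k must be even")
    half = k // 2
    groups = {}
    for pod in range(k):
        groups[("edge", pod)] = [f"edge-p{pod}-{i}" for i in range(half)]
        groups[("agg", pod)] = [f"agg-p{pod}-{i}" for i in range(half)]
    for g in range(half):
        groups[("core", g)] = [f"core-g{g}-{i}" for i in range(half)]
    return groups


groups = failure_groups(6)
print(len(groups), "groups of", len(groups[("edge", 0)]), "switches each")
```

Under these assumptions a k=6 fat-tree has 45 switches split into 15 groups of 3.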

Edge Layer
[Figure labels: edge switches 0, 1, 2; backup switch; circuit switches; servers]

Aggregation Layer
[Figure labels: agg. switches 0, 1, 2; backup switch; circuit switches; edge switches 0, 1, 2; backup switch]

Core Layer
[Figure labels: core switches 0 to 8; circuit switches; aggregation switches; backup switch]

Recover First, Diagnose Later
• Failure recovery
  -- switch failure: replaced by backups via circuit reconfiguration
  -- link failure: switches on both sides are replaced
• Automatic failure diagnosis performed offline
  -- details in the paper
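
A minimal sketch of the "recover first, diagnose later" bookkeeping, assuming one backup pool per failure group. The `FailureGroup` class and the function names are hypothetical; the actual circuit reconfiguration is not modeled, only which switches are active, spare, or waiting for offline diagnosis.

```python
class FailureGroup:
    """Tracks active switches, spare backups, and switches awaiting diagnosis."""

    def __init__(self, switches, backups):
        self.active = set(switches)
        self.backups = list(backups)
        self.offline = []  # replaced switches, to be diagnosed later

    def replace(self, failed):
        """Swap `failed` for a backup; return the backup, or None if none is left."""
        if failed not in self.active:
            raise ValueError(f"{failed} is not an active switch in this group")
        if not self.backups:
            return None
        backup = self.backups.pop(0)
        self.active.remove(failed)
        self.active.add(backup)
        self.offline.append(failed)
        return backup


def recover_switch_failure(group, failed):
    # Switch failure: the failed switch is replaced by a backup.
    return group.replace(failed)


def recover_link_failure(group_a, switch_a, group_b, switch_b):
    # Link failure: the switches on both sides are replaced, since the faulty
    # end is unknown until offline diagnosis runs.
    return group_a.replace(switch_a), group_b.replace(switch_b)
```

Returning `None` when the pool is empty leaves the decision about what to do next to the caller.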

Live Impersonation of Failed Switch
[Figure labels: backup switch; edge switches 0, 1, 2; servers]
Routing table of every edge switch:
• Routing Table 0: VLAN 0
• Routing Table 1: VLAN 1
• Routing Table 2: VLAN 2
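
One way to picture the shared routing table is a lookup keyed by VLAN ID, so that any switch in the group (including the backup) can forward on behalf of any other. This is only a sketch: the function names, addresses, and port numbers are invented, and how packets come to carry the VLAN ID is not covered here.

```python
def build_combined_table(per_switch_tables):
    """Merge per-switch routing tables into one table keyed by (vlan_id, dst).

    `per_switch_tables` maps a VLAN ID (one per switch in the failure group)
    to that switch's own {destination: output_port} table.
    """
    combined = {}
    for vlan_id, table in per_switch_tables.items():
        for dst, port in table.items():
            combined[(vlan_id, dst)] = port
    return combined


def forward(combined, vlan_id, dst):
    """Return the output port for `dst`, using the table selected by `vlan_id`."""
    return combined[(vlan_id, dst)]


# Hypothetical tables for edge switches 0, 1, and 2.
tables = {
    0: {"10.0.0.2": 1, "10.0.1.2": 4},
    1: {"10.0.0.2": 4, "10.0.1.2": 1},
    2: {"10.0.0.2": 5, "10.0.1.2": 5},
}
combined = build_combined_table(tables)
print(forward(combined, 1, "10.0.1.2"))
```

Because every switch in the group holds the same combined table, a backup needs no routing update at failover time.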

What Does the Control System Do?
• Collects keep-alive messages & link status reports from switches
• Reconfigures circuit switches under failures
• Performs offline failure diagnosis
• Implications
  -- needs to talk to many circuit switches and packet switches
  -- keeps a large amount of state about circuit/switch/link status
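
Keep-alive collection can be sketched as a table of last-heard timestamps. The `KeepAliveMonitor` class and the timeout value are hypothetical; the talk does not specify how missed keep-alives are turned into failure decisions.

```python
class KeepAliveMonitor:
    """Flags switches whose latest keep-alive is older than a timeout."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds; an illustrative parameter
        self.last_seen = {}     # switch name -> time of latest keep-alive

    def heard_from(self, switch, now):
        """Record a keep-alive from `switch` received at time `now`."""
        self.last_seen[switch] = now

    def suspected_failed(self, now):
        """Return the switches that have been silent for longer than the timeout."""
        return sorted(
            s for s, t in self.last_seen.items() if now - t > self.timeout
        )
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic and easy to test.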

Distributed Control System
• One controller for a failure group of k/2 switches
  -- configures the circuit switches adjacent to switches in the group
• Maintains only local circuit configurations in its group
  -- does not share state with other controllers
• Talks to circuit switches using an out-of-band control network

Summary
• Fast Failure Recovery
  -- as fast as the underlying circuit switching technology
• Live Impersonation
  -- Traffic is redirected to the backups in the physical layer
  -- Switches in a failure group have the same routing tables and use VLAN ID for differentiation
  -- Regular switches recovered from failures become backup switches themselves
Fast failure recovery, no path dilation, no routing disturbance

Evaluation
• Bandwidth advantage
  -- iPerf throughput on testbed
• Application performance
  -- MapReduce job completion time

Bandwidth Advantage
• 4 racks, 8 servers, 12 switches
• 8 iPerf flows saturate the network core
ShareBackup restores network to full capacity regardless of failure locations

Application Performance
• MapReduce Sort w/ 100 GB input data
[Figure annotations: 1.2X, 4.2X]
ShareBackup preserves application performance under failures!

Extra Cost
• Small port-count circuit switches: very inexpensive
  -- e.g. $3 per-port cost for cross-point switches
• Small backup switch pool
  -- 1 backup per failure group is usually enough
  -- k = 48 fat-tree with 27648 servers: ~6.7% extra network cost
• Partial deployment
  -- failures more destructive at edge layer
  -- employ backup only for ToR failures
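
The sizes behind the k = 48 example can be checked with standard fat-tree arithmetic. The sketch below only counts switches and backups, assuming one backup per failure group of k/2 switches; it does not reproduce the ~6.7% figure, whose cost model is not given on the slide.

```python
def fat_tree_counts(k, backups_per_group=1):
    """Count servers, switches, failure groups, and backups in a k-ary fat-tree."""
    half = k // 2
    servers = k * half * half     # k^3 / 4
    edge = k * half               # k pods, k/2 edge switches each
    agg = k * half                # k pods, k/2 aggregation switches each
    core = half * half            # (k/2)^2 core switches
    switches = edge + agg + core  # 5k^2 / 4
    groups = switches // half     # groups of k/2 switches
    backups = groups * backups_per_group
    return {
        "servers": servers,
        "switches": switches,
        "failure_groups": groups,
        "backup_switches": backups,
        "extra_switch_fraction": backups / switches,
    }


print(fat_tree_counts(48))
```

For k = 48 this gives 27648 servers, matching the slide, along with 2880 switches and 120 backup switches, so backups alone add about 4.2% to the switch count.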

Conclusion
• ShareBackup: an architectural solution for failure recovery in DCNs
  -- uses circuit switching for fast failover
  -- is an economical approach to using backups in networks
  -- preserves application performance under failures
• Key takeaways:
  -- rerouting is not the only approach for failure recovery
  -- fast, transparent failure recovery is possible through careful backup placement & fast circuit switching

Backup: Control System Failures
• Circuit switch software failure / control channel failure
  -- circuit switches become unresponsive
  -- keep existing circuit configurations, data plane is not impacted
  -- fall back to rerouting
• Hardware/power failure
  -- controller will receive lots of failure reports in a short time
  -- call for human intervention
• Controller failure
  -- state replication on shadow controllers
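
The "lots of failure reports in a short time" condition could be detected with a sliding-window counter. The `BurstDetector` class, the window length, and the threshold are all hypothetical; the talk does not say how the controller recognizes this case.

```python
from collections import deque


class BurstDetector:
    """Signals when too many failure reports arrive within a time window."""

    def __init__(self, window, threshold):
        self.window = window        # seconds; illustrative parameter
        self.threshold = threshold  # reports per window that count as a burst
        self.times = deque()        # timestamps of recent reports

    def report(self, now):
        """Record one failure report; return True if a burst is in progress."""
        self.times.append(now)
        # Drop reports that have fallen out of the window.
        while self.times and now - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.threshold
```

A `True` result would be the point at which a controller stops automatic recovery and calls for human intervention.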

Backup: Offline Failure Diagnosis
• Recycle healthy switch
  -- Only one switch has failed
  -- Back to normal after reboot
• Chain up circuit switches using side ports
[Figure labels: aggregation switch, circuit switches, edge switches]
