Why does the Cloud stop computing?
Why does the Cloud stop computing? Lessons from hundreds of service outages
Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar
COS @ SoCC '16
Outages
• Bugs: 2 years ago @ SoCC ’14, study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)
Public reports!
• Headline news and post-mortem reports
• Providers’ transparency
• Untapped information
• Pros/cons
  + Detailed root causes
  + Detailed chain of failures
  + Downtime durations
  + Zero false positives
  -- (Very) incomplete
  -- (High) variance
COS: Cloud Outage Study
• 32 services
• 597 outages
• Between 2009 and 2015
• ~70% report downtimes
• ~60% report root causes
Downtime/year
• On average
  • 6% of services do not reach 99% availability (>88 hours of downtime)
  • 78% do not reach 99.9% (>8.8 hours)
• Worst year
  • 31% do not reach 99%
  • 81% do not reach 99.9%
• 5-nine availability?
  • It’s just a dream?
[Chart: downtime per year, in hours]
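The hour thresholds on this slide follow from simple arithmetic on a year of 8,760 hours. A minimal sketch, assuming a non-leap year; the function name `downtime_budget_hours` is made up for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def downtime_budget_hours(availability_percent: float) -> float:
    """Maximum downtime per year (in hours) that still meets the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the yearly downtime budget for two nines, three nines, and five nines.
for target in (99.0, 99.9, 99.999):
    hours = downtime_budget_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

A 99% target allows about 87.6 hours per year and 99.9% about 8.76 hours, which matches the rounded ">88" and ">8.8" figures above. Five nines leaves roughly five minutes per year.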
Root causes (sorted by count)
Interesting Root Causes
• Upgrade
  • Involves multiple layers
  • “a code push behaved differently in widespread use than it had during testing”
  • To understand/reproduce, need full ecosystem
Interesting Root Causes
• Human mistakes
  • Rare now (vs. 10 years ago)
  • Config/upgrade software bugs
  • Bugs in automation process
  • Similar issues?
  • But root cause origins are different
Config vs. Upgrade Research
• Upgrade #1, need more research?
  • Paper count in last few years
• Challenges:
  • Multi-layer
  • Full ecosystem needed
  • Multi-year?
  • Reproducible bugs from industry (benchmarks)?
Interesting Root Causes
• Bugs
  • What types of bugs lead to outages? Why are they not masked?
  • (please see the paper)
  • “Cascading” bugs
• “DynamoDB Storage servers query the metadata service for their membership”
• “But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
• “As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
[Diagram: storage servers query the metadata service; the service is busy, the query times out, and the servers remove themselves]
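The chain quoted above can be sketched in a few lines. This is a toy model, not DynamoDB's actual logic: the function name, the server count, and the timing values are all hypothetical, and every server is assumed to see the same response time from the shared metadata service.

```python
def servers_in_service(num_servers: int, response_time_s: float, timeout_s: float) -> int:
    """Count servers that still take requests after one membership query each.

    Toy assumption: all servers query the same metadata service and observe
    the same response time, so they all succeed or all time out together.
    """
    in_service = 0
    for _ in range(num_servers):
        got_membership = response_time_s <= timeout_s
        if got_membership:
            in_service += 1
        # Otherwise the server removes itself from taking requests.
    return in_service


print(servers_in_service(100, response_time_s=0.5, timeout_s=1.0))  # 100
print(servers_in_service(100, response_time_s=1.5, timeout_s=1.0))  # 0
```

Because every replica depends on the same slow service, adding more storage servers does not help here: they all drop out at once.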
• “Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
• “data collection servers … had a failure”
• “this inability to contact a data collection server triggered a latent memory leak bug in the storage servers…”
• “EBS servers continued trying in a way that slowly consumed system memory”
[Diagram: EBS storage servers contact the data collection servers; the collection servers fail, and the storage servers leak memory]
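A latent leak of this kind tends to sit on an error path that rarely runs. The sketch below is a hypothetical illustration, not the actual EBS bug: the class, its fields, and the payload size are invented, and the "leak" is simply an unbounded retry buffer that only grows while the collector is unreachable.

```python
class StorageServer:
    """Toy storage server that reports to a data collection server."""

    def __init__(self) -> None:
        # Payloads kept for a later retry. Nothing bounds this list, which is
        # the (hypothetical) latent leak: it only grows on the failure path.
        self.pending_reports: list[bytes] = []

    def report(self, collector_up: bool, payload: bytes) -> None:
        if collector_up:
            # Normal path: the report is delivered and nothing accumulates.
            self.pending_reports.clear()
            return
        # Failure path: keep the payload and try again later.
        self.pending_reports.append(payload)

    def memory_held(self) -> int:
        """Bytes currently held in the retry buffer."""
        return sum(len(p) for p in self.pending_reports)


server = StorageServer()
for _ in range(1000):
    server.report(collector_up=False, payload=b"x" * 64)
print(server.memory_held())  # 64000 bytes held after 1000 failed reports
```

While the collector is healthy the buffer stays empty, so testing under normal conditions would never exercise the growth.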
(more in the paper)
Where is the SPOF (single point of failure)? Redundancies, redundancies, redundancies! Yes, we did that. So, why do outages still happen?
Failure recovery chain: Failure Detection → Failover → Backups
Imperfect failure recovery chain: Incomplete Failure Detection → Failover that Fails → Backups that also Fail
Imperfect failure recovery chain
• Incomplete error/failure detection
  • Undetected (specific type of) memory leaks
  • Load spikes of authentication requests
  • “an unexpected hardware behavior”
Imperfect failure recovery chain
• Failover/recovery that fails
  • Bad PLC fails to activate backup power generators
  • Failed network switch failover
  • DC failover fails due to cold cache problems
  • Recovery/re-mirroring storm
Imperfect failure recovery chain
• Multiple failures!
  • Double failures of power, network, storage or server components
  • Diverse failures: network + server; storage + fibre cut
  • Cascading bugs …
  • … that caused many/all redundancies to fail
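One way to see why redundancy alone does not rule out outages is to compare independent failures with a common-mode event such as a cascading bug. The sketch below uses a standard probability model; the function name and all numbers are illustrative assumptions, not measurements from the study.

```python
def p_all_replicas_fail(p_independent: float, replicas: int, p_common: float = 0.0) -> float:
    """Probability that every replica is down at the same time.

    Either a common-mode event (e.g. a cascading bug) takes out all replicas
    at once, or, absent that event, each replica fails independently.
    """
    p_independent_all = p_independent ** replicas
    return p_common + (1 - p_common) * p_independent_all


# Illustrative numbers only: 1% independent failure chance, 3 replicas.
print(p_all_replicas_fail(0.01, 3))                  # independent failures only
print(p_all_replicas_fail(0.01, 3, p_common=0.001))  # plus a rare common-mode event
```

With these made-up inputs, the independent term is about one in a million, so even a one-in-a-thousand common-mode event dominates the result by roughly three orders of magnitude.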
COS Database:
• Email us / Check our website
• More correlations between …
  • Root cause & downtime
  • Service maturity & downtime
  • Root cause & impacts
  • Root cause & fixes
  • Etc.
Conclusion
• Features and failures are racing with each other
• “Biggest/worst cloud outages of 20YY”: a new year’s tradition
• Hope COS tells the cause
• Many more examples/details in the papers
Thank you! Questions?
ceres.cs.uchicago.edu
ucare.cs.uchicago.edu
Manually extract outage “metadata”
Classifications:
A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.
#Outages/year
• On average
  • 1/3 of the services, at least 3 unplanned outages per year
• Worst year (between ’09-’14)
  • ½ of the services, at least 4 unplanned outages per year
Downtime by root cause (sorted by median downtime)
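The ranking named on this slide (group outages by root cause, then sort by median downtime) can be sketched as follows. The records below are invented toy values, not data from COS, and the flat `(root_cause, downtime_hours)` record shape is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import median


def median_downtime_by_cause(records):
    """Return (root_cause, median_hours) pairs, longest median downtime first."""
    by_cause = defaultdict(list)
    for cause, hours in records:
        by_cause[cause].append(hours)
    ranked = [(cause, median(hours)) for cause, hours in by_cause.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy records (made up): (root cause, downtime in hours).
toy_records = [
    ("upgrade", 1.0), ("upgrade", 3.0),
    ("load", 10.0),
    ("config", 0.5), ("config", 2.5), ("config", 4.0),
]
for cause, hours in median_downtime_by_cause(toy_records):
    print(f"{cause}: {hours} h")
```

The median is less sensitive than the mean to a single very long outage, which matters given the high variance in publicly reported downtimes.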
Maturity helps?
• Does service maturity help?
• Based on outage count:
  • In 2014, 24 outages occurred from 9-year-old services
Maturity helps?
• Based on downtime:
  • In 2014, 267 hours of downtime from 17-year-old services
• More mature → more popular → more users → more complex
Interesting Root Causes
• Load
  • Spikes of non-monitored requests
    • User requests (monitored)
    • Database index accesses
    • Authentication requests (cryptographic consumption)
  • Misconfiguration
    • Ex: traffic redirection
    • Take-away: be careful with traffic-related code/configs
  • Recovery feedback loop
Interesting Root Causes
• Cross (dependencies)
  • Amazon Web Services
    • Airbnb, Bitbucket, Dropbox, Foursquare, GitHub, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
  • Azure
    • Xbox Live and “52 other services”
  • Google DC (co-location)
    • Google Gmail, Search, Drive, YouTube
    • (40% drop of internet traffic for 5 mins)
Studies of failures, enough?
• Not all report “d”owntimes
• Most study only a few services (data behind company walls)