What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. Presenter: Richeng Huang

This is the cloud computing era! • Cloud systems are developing rapidly. • They are complex and need better dependability. • What bugs do we have? • How do we classify them? • Are there cloud-unique bugs? • How should dependability tools improve?

Cloud Bug Study (CBS) • 6 target systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume • 1-year study • Issues in a 3-year window: Jan 2011 to Jan 2014 • ~21,000 issues reviewed • ~3,600 (17%) “vital” issues selected for in-depth study • Vital: issues that affect real deployed systems.

Why these 6 systems? • Distributed cloud computing framework • Scalable storage systems • Distributed key-value stores • Synchronization services • Streaming systems

Methodology • Issue repository analysis • Issue classifications • Cloud Bug Study DB (CBSDB)

Issue Repositories • Luckily, each Apache Software Foundation project maintains a highly organized issue repository. • For example: ZooKeeper’s issue repository

Example • Title • Description • Time to resolution • Discussion • Type & priority

Several Classifications • Aspects: reliability, performance, availability, security, consistency, scalability, topology, QoS • Hardware: processor, disk, memory, network, node • Hardware failures: corrupt, limp, stop • Software bug types: logic, error handling, optimization, config, race, hang, space, load • Implications: failed operation, performance, component downtime, data loss, data staleness, data corruption
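
The classifications on this slide can be read as a small tagging schema. The following is a minimal sketch, assuming each issue carries a set of values per classification; the names `TAXONOMY`, `TaggedIssue`, `invalid_tags`, and the issue id are hypothetical and are not taken from CBSDB.

```python
from dataclasses import dataclass

# Category values are copied from the slide; the keys are illustrative names.
TAXONOMY = {
    "aspect": {"reliability", "performance", "availability", "security",
               "consistency", "scalability", "topology", "qos"},
    "hardware": {"processor", "disk", "memory", "network", "node"},
    "hardware_failure": {"corrupt", "limp", "stop"},
    "software_bug": {"logic", "error handling", "optimization", "config",
                     "race", "hang", "space", "load"},
    "implication": {"failed operation", "performance", "component downtime",
                    "data loss", "data staleness", "data corruption"},
}


@dataclass(frozen=True)
class TaggedIssue:
    issue_id: str
    tags: dict  # maps a classification name to a set of values


def invalid_tags(issue):
    """Return (classification, value) pairs that are not in TAXONOMY."""
    bad = []
    for name, values in issue.tags.items():
        allowed = TAXONOMY.get(name, set())
        for value in sorted(values):
            if value not in allowed:
                bad.append((name, value))
    return bad


# Hypothetical issue: a node fail-stop that is handled badly and causes downtime.
example = TaggedIssue(
    issue_id="EXAMPLE-1",
    tags={
        "aspect": {"availability"},
        "hardware": {"node"},
        "hardware_failure": {"stop"},
        "software_bug": {"error handling"},
        "implication": {"component downtime"},
    },
)
```

An issue may carry several values per classification, which is why each tag is a set rather than a single string.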

Aspects: Reliability • Reliability (45%) • Operation and job failures/errors, data loss/corruption/staleness • Legend: CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper

Aspects: Performance • Reliability (45%) • Performance (22%)

Aspects: Availability • Reliability (45%) • Performance (22%) • Availability (16%)

Aspects: Security • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%)

There are new aspects in cloud systems • Classical: • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • New: data consistency, scalability, topology, QoS

Aspects: Data consistency • Data consistency (5%) • Permanently inconsistent replicas • Various root causes: • Buggy operational protocols • Concurrency bugs and node failures

Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) • Scalability: small numbers, but important and hard to test at small scale.

Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) • Topology: cross-data-center deployments, different racks.

Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) • QoS: typically vertical/cross-system QoS.

Killer Bugs • Bugs that simultaneously affect multiple nodes or even the entire cluster • Single points of failure (SPoF) still exist in many forms: • Positive feedback loop • Buggy failover • Repeated bugs after failover • Distributed deadlock • …

Killer Bugs • The figure shows heat maps of the correlation between the scope of killer bugs (multiple nodes or whole cluster) and hardware/software root causes. A killer bug can be caused by multiple root causes. The number in each cell represents the bug count.
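
Because one killer bug can have several root causes, the cell counts in such a heat map can add up to more than the number of bugs. A minimal sketch of that counting, using hypothetical tagged bugs rather than the study's data:

```python
from collections import Counter

# Hypothetical tagged killer bugs: (scope, set of root causes).
killer_bugs = [
    ("cluster", {"logic", "node stop"}),
    ("multiple nodes", {"race"}),
    ("cluster", {"logic"}),
]


def heat_map_counts(bugs):
    """Count bugs per (scope, root cause) cell; a bug with several
    root causes contributes to several cells."""
    cells = Counter()
    for scope, causes in bugs:
        for cause in causes:
            cells[(scope, cause)] += 1
    return cells


counts = heat_map_counts(killer_bugs)
```

Here three bugs produce four cell increments, because the first bug is counted under both of its root causes.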

Positive feedback loop • Figure: high load → false failure → recovery → more load • Example case in Cassandra: more nodes → high gossip traffic → more false failures
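
The positive feedback loop on this slide can be illustrated with a toy model. This is a sketch under invented assumptions, not a model of Cassandra: load above a hypothetical `capacity` is misread as node failures, and each false failure triggers recovery work that adds `recovery_cost` units of load.

```python
def simulate_feedback(initial_load, capacity=100.0, recovery_cost=1.5, rounds=10):
    """Toy model of a positive feedback loop.

    Load above `capacity` is counted as false failures, and every false
    failure adds `recovery_cost` units of recovery load in the next round.
    All parameters are illustrative assumptions.
    """
    load = initial_load
    history = [load]
    for _ in range(rounds):
        false_failures = max(0.0, load - capacity)
        load = initial_load + recovery_cost * false_failures
        history.append(load)
    return history


stable = simulate_feedback(90.0)    # stays below capacity, nothing happens
runaway = simulate_feedback(110.0)  # recovery load keeps feeding itself
```

In this model the system is stable as long as the base load stays under capacity, but once it crosses the threshold, recovery work generates more false failures than it resolves and the load grows every round.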

Repeated bugs after failover • A key to no-SPoF: after a successful failover, the system should resume the previously failed operation. • But with software bugs, after a failover the system runs the same buggy logic again. • In HBase, a region server dies due to bad handling of corrupt region files; the live region server that takes over runs the same code and also dies. • Eventually, all region servers go offline.
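
The repeated-bug pattern can be sketched in a few lines. This is an illustrative model only, not HBase code: `open_region`, `failover`, and the region dictionary are hypothetical, and the "bug" is simply that a corrupt region raises an unhandled error on whichever server opens it.

```python
class CorruptRegionError(Exception):
    """Raised by the (hypothetical) buggy region-opening logic."""


def open_region(region):
    # Buggy handling: a corrupt region file crashes the server
    # instead of being skipped or quarantined.
    if region["corrupt"]:
        raise CorruptRegionError(region["name"])


def failover(servers, region):
    """Hand the region to each live server in turn.

    Returns the list of servers that died while trying to open it.
    """
    dead = []
    for server in servers:
        try:
            open_region(region)
            return dead  # region is served; failover succeeded
        except CorruptRegionError:
            dead.append(server)  # same code path, same crash
    return dead


dead_servers = failover(["rs1", "rs2", "rs3"], {"name": "r1", "corrupt": True})
```

Since every server runs the same logic on the same input, failover does not mask the fault; it only moves the crash to the next server until none are left.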

HW faults vs. SW faults

HW faults and modes • 299 cases of improper handling of node fail-stop failures • A memory card running at 25% of normal speed causes problems in an HBase deployment.

Software bug types • Logic (29%) • Error handling (18%) • Optimization (15%) • Configuration (14%) • Data race (12%) • Hang (4%): deadlock • Space (4%) • Load (4%)

Implications • Failed operation (42%) • Performance (23%) • Downtime (18%) • Data loss (7%) • Data corruption (5%) • Data staleness (5%)

Software/Hardware Faults & Implications • Still a long way from a highly dependable system. • Catch all faults!

Cloud Bug Study database (CBSDB) • A total of 21,399 issues (3,655 vital) • Open to the public • Supports bug evolution analysis.
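
As a quick arithmetic check, the exact totals on this slide agree with the rounded "~21,000 issues, ~3,600 (17%) vital" figures given earlier in the deck. The variable names below are illustrative.

```python
total_issues = 21_399  # all issues in CBSDB, from the slide
vital_issues = 3_655   # issues marked "vital", from the slide

vital_fraction = vital_issues / total_issues
print(f"{vital_fraction:.1%}")  # prints 17.1%
```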

System evolution • Hadoop 2.0

Conclusion • The largest bug study for cloud systems to date • Provides insights into many intricate bugs • Unique bugs in cloud systems • Killer bugs • Cloud Bug Study (CBS) database

Comments • The study required a huge amount of manual effort, which is neither efficient nor maintainable. • The study reports the distribution of issues but offers no suggestions or solutions for them. • The study analyzes issues that have all been resolved. This information is retrievable from the repositories, and experts and developers can get the implications from the issue reports themselves. • CBSDB is not active, and keeping it current would take a large amount of maintenance time. • The authors do not explicitly state how this study should be used for future development.

Thoughts and Discussion from Piazza • Combine machine learning and NLP techniques for the classification and tagging task. - Hongwei Wang • They don’t provide a possible solution to the question “why are cloud systems not 100% dependable?” - Eric Badger • They say cloud systems are still far from 100% dependable. • Need an automatic analysis tool. - Sanchit Gupta
