What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. Presenter: Richeng Huang 1
This is the cloud computing era! • Cloud systems are developing rapidly. • They are complex and need better dependability. What bugs do we have? How should we classify them? Are there cloud-unique bugs? How should dependability tools improve? 2
Cloud Bug Study (CBS) • 6 target systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume • 1-year study • Issues in a 3-year window: Jan 2011 to Jan 2014 • ~21,000 issues reviewed • ~3,600 (17%) “vital” issues selected for in-depth study • Vital: issues that affect real deployed systems. 3
Why these 6 systems? • Distributed cloud computing frameworks • Scalable storage systems • Distributed key-value stores • Synchronization services • Streaming systems 4
Methodology • Issue Repositories Analysis • Issue Classifications • Cloud Bug Study DB (CBSDB) 5
Issue Repositories • Luckily, each Apache Software Foundation project maintains a highly organized issue repository • For example: ZooKeeper’s issue repository 6
Example issue fields: title, description, time to resolve, discussion, type & priority 7
Several Classifications • Aspects: reliability, performance, availability, security, consistency, scalability, topology, QoS • Hardware: processor, disk, memory, network, node • Hardware failures: corrupt, limp, stop • Software bug types: logic, error handling, optimization, config, race, hang, space, load • Implications: failed operation, performance, component downtime, data loss, data staleness, data corruption 8
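The classification scheme on this slide can be pictured as a set of tags attached to each issue. What follows is only a minimal sketch under stated assumptions: the names `IssueTags`, `Aspect`, `BugType`, `Implication`, and `count_tags`, and the `EXAMPLE-…` identifiers, are hypothetical and are not the actual CBSDB schema. It shows multi-label tagging with the categories listed above and how per-category counts could be tallied.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Aspect(Enum):
    RELIABILITY = "reliability"
    PERFORMANCE = "performance"
    AVAILABILITY = "availability"
    SECURITY = "security"
    CONSISTENCY = "consistency"
    SCALABILITY = "scalability"
    TOPOLOGY = "topology"
    QOS = "qos"


class BugType(Enum):
    LOGIC = "logic"
    ERROR_HANDLING = "error handling"
    OPTIMIZATION = "optimization"
    CONFIG = "config"
    RACE = "race"
    HANG = "hang"
    SPACE = "space"
    LOAD = "load"


class Implication(Enum):
    FAILED_OPERATION = "failed operation"
    PERFORMANCE = "performance"
    DOWNTIME = "component downtime"
    DATA_LOSS = "data loss"
    DATA_STALENESS = "data staleness"
    DATA_CORRUPTION = "data corruption"


@dataclass
class IssueTags:
    """Hypothetical record: one issue may carry several tags per category."""
    issue_id: str   # made-up identifier, not a real tracker key
    system: str     # e.g. "HB" for HBase, "CS" for Cassandra
    aspects: set = field(default_factory=set)
    bug_types: set = field(default_factory=set)
    implications: set = field(default_factory=set)


def count_tags(issues, attribute):
    """Count how many issues carry each tag in the given category."""
    counts = Counter()
    for issue in issues:
        counts.update(getattr(issue, attribute))
    return counts


# Invented sample data, for illustration only.
issues = [
    IssueTags("EXAMPLE-1", "HB", {Aspect.RELIABILITY},
              {BugType.ERROR_HANDLING}, {Implication.DOWNTIME}),
    IssueTags("EXAMPLE-2", "CS", {Aspect.PERFORMANCE, Aspect.SCALABILITY},
              {BugType.LOAD}, {Implication.PERFORMANCE}),
]
print(count_tags(issues, "aspects"))
```

Because an issue can carry several tags in one category, counts per category can sum to more than the number of issues.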
Aspects: Reliability • Reliability (45%) • Operation & job failures/errors, data loss/corruption/staleness • Legend: CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper 9
Aspects: Performance • Reliability (45%) • Performance (22%) 10
Aspects: Availability • Reliability (45%) • Performance (22%) • Availability (16%) 11
Aspects: Security • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) 12
There are new aspects in cloud systems • Classical: • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • New: Data consistency, scalability, topology, QoS 13
Aspects: Data consistency • Data consistency (5%) • Permanently inconsistent replicas • Various root causes: • Buggy operational protocols • Concurrency bugs and node failures 14
Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) Scalability: small numbers, but important and hard to test at small scale 15
Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) Topology: cross-DC, different racks 16
Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) QoS: typically vertical/cross-system QoS. 17
Killer Bugs • Bugs that simultaneously affect multiple nodes or even the entire cluster • SPoF still exists in many forms: • Positive feedback loop • Buggy failover • Repeated bugs after failover • Distributed deadlock • … 18
Killer Bugs • The figure shows heat maps of correlation between scope of killer bugs (multiple nodes or whole cluster) and hardware/software root causes. A killer bug can be caused by multiple root causes. The number in each cell represents the bug count 19
Positive feedback loop • Example case in Cassandra [Figure: cycle diagram with the labels "more nodes", "high gossip traffic", "false failure", "recovery", "high load", and "more false failure"] 20
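The dynamics of a positive feedback loop can be illustrated with a toy model. This is a sketch under loud assumptions, not Cassandra's gossip or failure-detection logic: the function `simulate_feedback` and its `gain` parameter are hypothetical, with `gain` standing for how many new false failure detections the recovery load from one false failure triggers in the next round.

```python
def simulate_feedback(initial_false_failures, gain, rounds):
    """Toy model: each false failure triggers recovery work, and the extra
    load causes `gain` new false failures in the next round.
    `gain` is an assumed parameter, not a measured quantity."""
    history = [initial_false_failures]
    for _ in range(rounds):
        history.append(history[-1] * gain)
    return history


# gain > 1: the loop amplifies itself and the cluster destabilizes.
print(simulate_feedback(1, 2, 3))    # [1, 2, 4, 8]
# gain < 1: the disturbance dies out.
print(simulate_feedback(8, 0.5, 3))  # [8, 4.0, 2.0, 1.0]
```

The point of the sketch is only that whether the loop runs away or settles depends on whether each round of recovery creates more failures than it resolves.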
Repeated bugs after failover • A key to no-SPoF: after a successful failover, the system should resume the previously failed operation • But with software bugs, after a failover the system will run the same buggy logic again… • In HBase, a region server dies due to bad handling of corrupt region files; HBase fails over to another live region server, which runs the same code and also dies. • Eventually, all region servers go offline 21
Hardware faults and modes • 299 issues involve improper handling of node fail-stop failures • A memory card running at 25% of normal speed caused problems in an HBase deployment. 23
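The cascade described on this slide can be sketched as a small simulation. This is an illustrative model only, not HBase code: `open_region`, `failover_chain`, and the `handles_corruption` flag are hypothetical names, and a Python exception stands in for a server crash.

```python
def open_region(region_is_corrupt, handles_corruption):
    """Stand-in for the code path every server runs when assigned a region."""
    if region_is_corrupt and not handles_corruption:
        raise RuntimeError("server crashed on corrupt region file")


def failover_chain(servers, region_is_corrupt, handles_corruption=False):
    """Reassign one region to the next live server until it opens or no
    server is left. Returns (alive, crashed)."""
    alive = list(servers)
    crashed = []
    while alive:
        try:
            open_region(region_is_corrupt, handles_corruption)
            break  # the region is being served; failover succeeded
        except RuntimeError:
            # The same buggy logic runs on the next server after failover.
            crashed.append(alive.pop(0))
    return alive, crashed


print(failover_chain(["rs1", "rs2", "rs3"], region_is_corrupt=True))
# ([], ['rs1', 'rs2', 'rs3'])
print(failover_chain(["rs1", "rs2", "rs3"], True, handles_corruption=True))
# (['rs1', 'rs2', 'rs3'], [])
```

Failover protects against a fault that is local to one machine; when the fault travels with the data or the code, each failover just selects the next victim.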
Software bug types • Logic (29%) • Error handling (18%) • Optimization (15%) • Configuration (14%) • Data race (12%) • Hang (4%) - Deadlock • Space (4%) • Load (4%) 24
Implications • Failed operation (42%) • Performance (23%) • Downtimes (18%) • Data loss (7%) • Data corruption (5%) • Data staleness (5%) 25
Software/Hardware Faults & Implications • Still a long way from a highly dependable system. • Catch all faults! 26
Cloud Bug Study database (CBSDB) • A total of 21,399 issues (3,655 vital) • Open to the public • Bug evolution analysis. 27
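As a quick consistency check, the two counts on this slide reproduce the roughly 17% vital share quoted earlier in the deck. The variable names are illustrative only.

```python
total_issues = 21_399  # all issues in CBSDB, from the slide
vital_issues = 3_655   # issues marked "vital", from the slide

vital_share = vital_issues / total_issues
print(f"{vital_share:.1%}")  # 17.1%
```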
System evolution Hadoop 2.0 28
Conclusion • The largest bug study of cloud systems to date • Provides insights into many intricate bugs • Unique bugs in cloud systems • Killer bugs • Cloud Bug Study (CBS) database 29
Comments • This study required a huge amount of human effort, which is neither efficient nor maintainable. • The study reports the distribution of issues but offers no suggestions or solutions for them. • The study analyzes only issues that have already been resolved. This information is retrievable from the repositories, and experts and developers can infer the implications from the issue reports themselves. • CBSDB is not actively maintained; keeping it current involves a large amount of maintenance time. • The authors did not explicitly state how this study should be used for future development. 30
Thoughts and Discussion from Piazza • Combine machine learning and NLP techniques for the classification and tagging task. - Hongwei Wang • They don’t provide a possible solution to the problem “why are cloud systems not 100% dependable?” - Eric Badger • They say cloud systems are still far from 100% dependable. • Need an automatic analysis tool. - Sanchit Gupta 31