What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria

Presentation Transcript


  1. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. Presenter: Richeng Huang

  2. This is the cloud computing era! • Cloud systems are developing rapidly. • They are complex, and their dependability needs to improve. • What bugs do they have? • How should we classify them? • Are there cloud-unique bugs? • How should dependability tools improve?

  3. Cloud Bug Study (CBS) • 6 target systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume • 1-year study • Issues in a 3-year window: Jan 2011 to Jan 2014 • ~21,000 issues reviewed • ~3,600 (17%) “vital” issues selected for in-depth study • Vital: issues that affect real deployed systems

  4. Why these 6 systems? • Distributed cloud computing framework • Scalable storage systems • Distributed key-value stores • Synchronization services • Streaming systems

  5. Methodology • Issue Repositories Analysis • Issue Classifications • Cloud Bug Study DB (CBSDB)

  6. Issue Repositories • Luckily, each Apache Software Foundation project maintains a highly organized issue repository • For example: ZooKeeper’s issue repository

  7. Example • [Figure: an example issue report, annotated with its title, description, time to resolve, discussion, and type & priority]

  8. Several Classifications • Aspects: reliability, performance, availability, security, consistency, scalability, topology, QoS • Hardware: processor, disk, memory, network, node • Hardware failures: corrupt, limp, stop • Software bug types: logic, error handling, optimization, config, race, hang, space, load • Implications: failed operation, performance, component downtime, data loss, data staleness, data corruption
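
The classification axes above can be pictured as tag vocabularies attached to each issue. The following is a minimal sketch under that assumption; the `IssueTags` class, its field names, and the `EXAMPLE-1` identifier are hypothetical and are not taken from CBSDB's actual schema.

```python
from dataclasses import dataclass, field

# Tag vocabularies copied from the slide; the constant names are illustrative.
ASPECTS = {"reliability", "performance", "availability", "security",
           "consistency", "scalability", "topology", "qos"}
HARDWARE = {"processor", "disk", "memory", "network", "node"}
HARDWARE_FAILURES = {"corrupt", "limp", "stop"}
SOFTWARE_BUG_TYPES = {"logic", "error handling", "optimization", "config",
                      "race", "hang", "space", "load"}
IMPLICATIONS = {"failed operation", "performance", "component downtime",
                "data loss", "data staleness", "data corruption"}


@dataclass
class IssueTags:
    """Hypothetical record: one issue may carry several tags per axis."""
    issue_id: str
    aspects: set = field(default_factory=set)
    software_bug_types: set = field(default_factory=set)
    implications: set = field(default_factory=set)

    def unknown_tags(self):
        """Return the tags that are not in the slide's vocabularies."""
        return ((self.aspects - ASPECTS)
                | (self.software_bug_types - SOFTWARE_BUG_TYPES)
                | (self.implications - IMPLICATIONS))


# A made-up issue, not a real entry from any repository.
example = IssueTags("EXAMPLE-1",
                    aspects={"reliability"},
                    software_bug_types={"error handling"},
                    implications={"data loss"})
```

Using sets per axis reflects that a single issue can fall under more than one label on the same axis.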

  9. Aspects: Reliability • Reliability (45%) • Operation & job failures/errors, data loss/corruption/staleness • Legend: CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper

  10. Aspects: Performance • Reliability (45%) • Performance (22%)

  11. Aspects: Availability • Reliability (45%) • Performance (22%) • Availability (16%)

  12. Aspects: Security • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%)

  13. There are new aspects in cloud systems • Classical: • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • New: data consistency, scalability, topology, QoS

  14. Aspects: Data consistency • Data consistency (5%) • Permanently inconsistent replicas • Various root causes: • Buggy operational protocols • Concurrency bugs and node failures

  15. Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) • Note: small numbers, but important and hard to test at small scale

  16. Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) • Note: cross-data-center (DC), different racks

  17. Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) • Note: typically in vertical/cross-system QoS

  18. Killer Bugs • Bugs that simultaneously affect multiple nodes or even the entire cluster • Single points of failure (SPoF) still exist in many forms • Positive feedback loop • Buggy failover • Repeated bugs after failover • Distributed deadlock • …

  19. Killer Bugs • The figure shows heat maps of the correlation between the scope of killer bugs (multiple nodes or whole cluster) and hardware/software root causes. A killer bug can have multiple root causes. The number in each cell is the bug count.
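
Because one killer bug can have several root causes, the cells of such a heat map can sum to more than the number of bugs. A minimal sketch of the counting, using made-up records (the records and the `heatmap_counts` helper are illustrative assumptions, not CBSDB data or tooling):

```python
from collections import Counter

# Made-up records for illustration only; these are not entries from CBSDB.
# Each record: (scope of the killer bug, list of its root causes).
toy_killer_bugs = [
    ("cluster", ["logic", "node stop"]),
    ("multiple nodes", ["race"]),
    ("cluster", ["logic"]),
]


def heatmap_counts(bugs):
    """Count bugs per (scope, root cause) cell.

    A bug with several root causes contributes to several cells.
    """
    cells = Counter()
    for scope, causes in bugs:
        for cause in set(causes):  # count each cause once per bug
            cells[(scope, cause)] += 1
    return cells


cells = heatmap_counts(toy_killer_bugs)
```

Here three toy bugs fill four cell counts in total, which is why the cell values should not be added up to get a bug total.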

  20. Positive feedback loop • Example case in Cassandra • [Figure: feedback cycle with the labels High Load, False Failure, Recovery, High Gossip Traffic, More nodes, More False Failure]

  21. Repeated bugs after failover • A key to no-SPoF: after a successful failover, the system should resume the previously failed operation • But with software bugs, after a failover the system runs the same buggy logic again… • In HBase, a region server dies because it mishandles corrupt region files; the live region server that takes over runs the same code and also dies • Eventually, all region servers go offline
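
The cascade described above can be sketched as a toy simulation. This is not HBase code: `serve_region`, `run_with_failover`, and the server names are hypothetical, and the only assumption is that every server runs the same handler with the same deterministic bug.

```python
def serve_region(region_is_corrupt):
    """Toy region handler with a deterministic bug: it crashes on corrupt files."""
    if region_is_corrupt:
        raise RuntimeError("unhandled corrupt region file")


def run_with_failover(servers, region_is_corrupt):
    """Hand the region to the next live server whenever the current one dies.

    Returns (live_servers, dead_servers) once the region is served
    or no live server remains.
    """
    live = list(servers)
    dead = []
    while live:
        server = live[0]
        try:
            serve_region(region_is_corrupt)  # same code on every server
            return live, dead
        except RuntimeError:
            # Failover "succeeds", but the next server repeats the same bug.
            live.pop(0)
            dead.append(server)
    return live, dead


live, dead = run_with_failover(["rs1", "rs2", "rs3"], region_is_corrupt=True)
```

With a corrupt region, `live` ends up empty and all three toy servers are in `dead`; failover moves the work but does nothing about a bug that every replica shares.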

  22. HW faults vs. SW faults

  23. HW faults and modes • 299 cases of improper handling of node fail-stop failures • A memory card running at 25% of normal speed caused problems in an HBase deployment

  24. Software bug types • Logic (29%) • Error handling (18%) • Optimization (15%) • Configuration (14%) • Data race (12%) • Hang (4%), e.g. deadlock • Space (4%) • Load (4%)

  25. Implications • Failed operation (42%) • Performance (23%) • Downtime (18%) • Data loss (7%) • Data corruption (5%) • Data staleness (5%)

  26. Software/Hardware Faults & Implications • Still a long way from a highly dependable system • Catch all faults!

  27. Cloud Bug Study database (CBSDB) • A total of 21,399 issues (3,655 vital) • Open to the public • Bug evolution analysis
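
As a quick check, the exact counts on this slide agree with the rounded "~3,600 (17%)" figure given earlier in the talk. The variable names below are illustrative.

```python
total_issues = 21_399  # issues reviewed, from the slide
vital_issues = 3_655   # issues marked vital, from the slide

# Share of reviewed issues that were selected for in-depth study.
vital_share = 100 * vital_issues / total_issues
print(f"{vital_share:.1f}% of reviewed issues are vital")
```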

  28. System evolution • Hadoop 2.0

  29. Conclusion • The largest bug study of cloud systems to date • Provides insights into many intricate bugs • Unique bugs in cloud systems • Killer bugs • Cloud Bug Study (CBS) database

  30. Comments • The study required a huge amount of human effort, which is neither efficient nor maintainable. • The study reports the distribution of issues but offers no suggestions or solutions for them. • The study analyzes issues that have all been resolved. This information is retrievable from the repositories; experts and developers can infer the implications from the issue reports themselves. • CBSDB is not active, and maintaining it takes a large amount of time. • The authors do not explicitly say how the study should be used in future development.

  31. Thoughts and Discussion from Piazza • Combine machine learning and NLP techniques for the classification and tagging task. - Hongwei Wang • They don’t provide a possible solution to the question “why are cloud systems not 100% dependable?” - Eric Badger • They say it is still far from 100% dependable. • Need an automatic analysis tool. - Sanchit Gupta