
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

This study examines the bugs and issues found in various cloud systems, focusing on reliability, performance, availability, security, consistency, scalability, topology, and QoS. The study analyzes a database of over 21,000 issues, with a detailed analysis of 3,600 vital issues. The findings provide valuable insights for improving cloud dependability tools in the future.




Presentation Transcript


  1. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria

  2. First, let’s ask Google

  3. Cloud era: no deep root causes…

  4. What does the reliability research community do? • Bug studies • A Study of Linux File System Evolution. In FAST ’13. • A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08. • Precomputing Possible Configuration Error Diagnoses. In ASE ’11. • …

  5. Open sourced cloud software • Publicly accessible bug repositories

  6. Study to solve… • What bugs “live” in the cloud? • Are there new classes of bugs unique to cloud systems? • How should cloud dependability tools evolve in the near future? • Many other questions…

  7. Cloud Bug Study (CBS) • 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume • 11 people, 1-year study • Issues in a 3-year window: Jan 2011 to Jan 2014 • ~21,000 issues reviewed • ~3,600 “vital” issues → in-depth study • Cloud Bug Study (CBS) database

  8. Classifications • Aspects – reliability, performance, availability, security, consistency, scalability, topology, QoS • Hardware failures – types of hardware and types of hardware failures • Software bug types – logic, error handling, optimization, config, race, hang, space, load • Implications – failed operation, performance, component downtime, data loss, data staleness, data corruption • ~25,000 annotations in total, about 7 annotations per issue

  9. Cloud Bug Study (CBS) database • Open to public

  10. Outline • Introduction • Methodology • Overview of results • Other CBS database use cases • Conclusion

  11. Methodology • 6 systems, 3-year span, 2011 to 2014 • 20–30 new issues a day! • 17% “vital” issues affecting real deployments • 3655 vital issues

  12. Example issue • Each issue report includes a title, time to resolve, type & priority, description, and discussion

  13. Outline • Introduction • Methodology • Overview of results • Other CBS database use cases • Conclusion

  14. Classifications for each vital issue • Aspects • Hardware types and failure modes • Software bug types • Implications • Bug scopes

  15. Overview of results • Aspects • Hardware faults vs. software faults • Implications

  16. Aspects • CS = Cassandra • FL = Flume • HB = HBase • HD = HDFS • MR = MapReduce • ZK = ZooKeeper

  17. Aspects: Reliability • Reliability (45%) • Operation & job failures/errors, data loss/corruption/staleness

  18. Aspects: Performance • Reliability • Performance (22%)

  19. Aspects: Availability • Reliability • Performance • Availability (16%) • Node and cluster downtime

  20. Aspects: Security • Reliability • Performance • Availability • Security (6%)

  21. Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. software faults • Implications

  22. Aspects: Data consistency • Data consistency (5%) • Permanent inconsistent replicas • Various root causes: • Buggy operational protocol • Concurrency bugs and node failures

  23. Cassandra cross-DC synchronization • Replicas across data centers end up permanently inconsistent (e.g., some hold A’, B’, C’ while another still holds A, B, C) • Background operational protocols are often buggy!

  24. Aspects: Scalability • Data consistency • Scalability (2%) • A small number does not mean unimportant! • Only found at scale • Large cluster size • Large data • Large load • Large failures

  25. Large cluster • In Cassandra, a ring position change triggered an O(n³) calculation • On a 100x larger cluster → CPU explosion
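
The cubic blow-up is easy to see numerically. Below is a minimal sketch; the function and the work it counts are illustrative, not Cassandra's actual ring code:

```python
def ring_update_cost(n):
    """Count units of work for a hypothetical O(n^3) ring
    recalculation: every pair of nodes rescans the whole ring."""
    ops = 0
    for _ in range(n):          # for each node...
        for _ in range(n):      # ...against every other node...
            for _ in range(n):  # ...rescan the entire ring
                ops += 1
    return ops

# A 10x larger cluster costs 1000x more work, which is why a
# calculation that is invisible on a small cluster becomes a
# CPU explosion on one 100x larger.
assert ring_update_cost(100) == 1_000 * ring_update_cost(10)
```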

  26. Large data • In HBase, an insufficient lookup operation across 100K regions (R1, R2, R3, …, R100K) took tens of minutes
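
The lookup problem is the familiar linear-scan-vs-index trade-off, which only shows up at large data sizes. A toy sketch (region names and layout are illustrative, not HBase internals):

```python
def find_region_linear(regions, key):
    """O(n) per lookup: scan all region metadata for every request.
    Fine at 100 regions, tens of minutes in aggregate at 100K."""
    hit = None
    for start, name in regions:   # regions sorted by start key
        if start <= key:
            hit = name
    return hit

def build_index(regions):
    """O(1) per lookup after a one-time indexing pass."""
    return {start: name for start, name in regions}

regions = [(i, f"R{i}") for i in range(100_000)]
index = build_index(regions)
assert find_region_linear(regions, 42) == index[42] == "R42"
```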

  27. Large load • In HDFS, 1000x small files written in parallel • The system was not expecting small files!
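
One way to see why many small files hurt: per-file fixed overhead (metadata RPCs, block allocation) dominates the per-byte cost. A back-of-the-envelope sketch with made-up cost constants, not measured HDFS numbers:

```python
RPC_COST = 10       # fixed per-file overhead (arbitrary units)
BYTE_COST = 0.001   # per-byte streaming cost (arbitrary units)

def write_cost(files, bytes_each):
    """Total cost of writing `files` files of `bytes_each` bytes."""
    return files * (RPC_COST + bytes_each * BYTE_COST)

# Same total data, very different cost profiles:
many_small = write_cost(100_000, 1_000)    # 100K x 1 KB files
few_large = write_cost(100, 1_000_000)     # 100 x 1 MB files
assert many_small > 10 * few_large         # overhead dominates
```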

  28. Large failure • An AM managing 16,000 tasks fails • Un-optimized connection handling during recovery (1K, 2K, …, 16K tasks) • Time cost: 7+ hours
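
The recovery cost here is linear in the number of tasks when each task opens its own connection; reusing one connection per host amortizes it. A toy model (the 16,000-task count mirrors the slide, but the functions and host layout are illustrative):

```python
def reconnect_per_task(tasks, hosts):
    """Un-optimized recovery: one fresh connection per task."""
    return len(list(tasks))

def reconnect_pooled(tasks, hosts):
    """Optimized recovery: one shared connection per distinct host."""
    return len({hosts[t % len(hosts)] for t in tasks})

tasks = range(16_000)
hosts = [f"node{i}" for i in range(100)]
assert reconnect_per_task(tasks, hosts) == 16_000  # hours of setup
assert reconnect_pooled(tasks, hosts) == 100       # one per host
```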

  29. From above examples… • Protocol algorithms must anticipate • Large cluster sizes • Large data • Large request load of various kinds • Large scale failures • The need for scalability bug detection tools

  30. Aspects: Topology • Data consistency • Scalability • Topology (1%) • Systems have problem when deployed on some network topology • Cross DC • Different racks • New layering architecture • Typically unseen in pre-deployment

  31. Aspects: QoS • Data consistency • Scalability • Topology • QoS (1%) • Fundamental for multi-tenant systems • Two main points • Horizontal/intra-system QoS • Vertical/cross-system QoS

  32. Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. software faults • Implications

  33. HW faults vs. SW faults “Hardware can fail, and reliability should come from software.”

  34. HW faults and modes • 299 issues involve improper handling of node fail-stop failures • Example: a memory card running at 25% of normal speed caused problems in an HBase deployment

  35. Hardware faults vs. Software faults • Hardware failures, components and modes • Software bug types

  36. Software bug types: Logic • Logic (29%) • Many domain-specific issues

  37. Software bug types: Error handling • Logic • Error handling (18%) • Aspirator, Yuan et al. [OSDI ’14]
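
The core anti-pattern in this class is the swallowed exception: a catch block that silently drops a failure. A minimal Python sketch of the shape of the bug (the studied systems are Java, and the function names here are illustrative):

```python
def flush_swallow(write):
    """Buggy pattern: the failure disappears; data loss goes unnoticed."""
    try:
        write()
    except IOError:
        pass  # bug: empty handler swallows the error

def flush_checked(write, errors):
    """Better: record the failure and re-raise so callers can react."""
    try:
        write()
    except IOError as e:
        errors.append(e)
        raise

def failing_write():
    raise IOError("disk full")

errors = []
flush_swallow(failing_write)        # returns silently despite failure
try:
    flush_checked(failing_write, errors)
except IOError:
    pass                            # caller sees and handles the error
assert len(errors) == 1
```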

  38. Software bug types: Optimization • Logic • Error handling • Optimization (15%)

  39. Software bug types: Configuration • Logic • Error handling • Optimization • Configuration (14%) • Automating Configuration Troubleshooting. [OSDI ’10] • Precomputing Possible Configuration Error Diagnoses. [ASE ’11] • Do Not Blame Users for Misconfigurations. [SOSP ’13]

  40. Software bug types: Race • Race (12%) • < 50% local concurrency bugs • Buggy thread interleaving • Tons of work • > 50% distributed concurrency bugs • Reordering of messages, crashes, timeouts • More work is needed • SAMC [OSDI ’14]
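
Distributed concurrency bugs of this kind can often be exposed by enumerating message orderings, which is roughly what model checkers such as SAMC do systematically. A toy two-message protocol (entirely illustrative, not from any studied system):

```python
import itertools

def violates(order):
    """A node applies messages in network-delivery order. The intended
    invariant: 'commit' must never be applied before 'prepare'."""
    prepared = False
    for msg in order:
        if msg == "prepare":
            prepared = True
        elif msg == "commit" and not prepared:
            return True  # reordering triggered the bug
    return False

# Enumerate all delivery orders and collect the buggy interleavings.
buggy = [o for o in itertools.permutations(["prepare", "commit"])
         if violates(o)]
assert buggy == [("commit", "prepare")]
```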

  41. Software bug types: Hang • Hang (4%) • Classical deadlock • Un-served jobs, stalled operations, … • Root causes? • How to detect them?
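
One classical way to detect the deadlock subclass of hangs is to look for cycles in the observed lock-acquisition order: thread 1 takes A then B while thread 2 takes B then A. A small sketch (the graph representation and edges are illustrative):

```python
def has_cycle(edges):
    """edges: observed (held_lock, acquired_lock) pairs.
    A cycle in this graph means a potential deadlock."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    def reachable(src, dst, seen=()):
        if src == dst:
            return True
        return any(reachable(n, dst, seen + (src,))
                   for n in graph.get(src, ()) if n not in seen)

    # a cycle exists if any edge's target can reach back to its source
    return any(reachable(b, a) for a, b in edges)

assert has_cycle([("A", "B"), ("B", "A")])      # classic deadlock
assert not has_cycle([("A", "B"), ("A", "C")])  # consistent lock order
```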

  42. Software bug types: Space • Space (4%) • Big data + leak = Big leak • Clean-up operations must be flawless.

  43. Software bug types: Load • Load (4%) • Happen when systems face high request load • Relates to QoS and admission control

  44. Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. software faults • Implications

  45. Implications • Failed operation (42%) • Performance (23%) • Downtimes (18%) • Data loss (7%) • Data corruption (5%) • Data staleness (5%)

  46. Root causes Every implication can be caused by all kinds of hardware and software faults!

  47. “Killer” bugs • Bugs that simultaneously affect multiple nodes or even the entire cluster • Single Point of Failure still exists in many forms • Positive feedback loop • Buggy failover • Repeated bugs after failover • …

  48. Outline • Introduction • Methodology • Overview of results • Other CBS database use cases • Conclusion

  49. CBS database • 50+ per-system and aggregate graphs from mining the CBS database over the past year • Still more waiting to be studied…

  50. Components with most issues • Cross-system issues are prevalent! • How should we enhance reliability for interactions among multiple cloud systems?
