Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures

This presentation by Daniel Taylor discusses the motivation behind treating software bugs as allergies, different approaches to surviving software failures, and introduces the Rx approach which is comprehensive, safe, noninvasive, efficient, and informative.

Presentation Transcript


  1. Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures Presented by: Daniel Taylor

  2. Outline • Motivation • Approaches to surviving failures • Rx Approach • Rx Design • Experimental Results • Future Work • Evaluation • Discussion

  3. Motivation • System Availability • Gartner report: 1 hour of downtime = $6 million • Affected by software failures • Software defects cause up to 40% of system failures • Memory-related and concurrency bugs account for over 60% of system vulnerabilities

  4. Motivation • Treat bugs as allergies • Examples of environmental bugs • Memory management • Buffer overflows • Dangling pointers • Timing • Races • Message ordering • User Request • Malicious users • Bad requests

  5. Approaches to surviving failures • 1) Rebooting/System restart • Designed for hardware failures • Fails to fix deterministic bugs • Unavailability • Warm-up period • Micro-rebooting

  6. Approaches to surviving failures • 2) Checkpointing and recovery • Checkpoint, rollback on failure, re-execute • Designed for hardware failures • Fails to fix deterministic bugs • Progressive retry – method to re-order messages • Only works for message ordering bugs • N-version programming – run a different implementation on re-execution • Requires extra software development

  7. Approaches to surviving failures • 3) Application-specific recovery • Multi-process model • Spawn new processes if old ones fail • Cannot deal with deterministic bugs • Cannot deal with shared data corruption • Exception handling • Programmer must expect failures

  8. Approaches to surviving failures • 4) Non-conventional methods • Failure-oblivious computing • Artificial values for buffer overflows • Reactive immune systems • Speculative error code for crashed functions • Unsafe methods, not appropriate for critical applications • Hard to debug if the “fix” does something strange

  9. Rx Approach • Treat bugs like real-life allergies • Remove the allergen to see if it helps • Goals: • Comprehensive – survive software bugs • Safe - no uncertainty or introduced errors • Noninvasive – no modifications • Efficient – good performance, reduce downtime • Informative – help diagnose bugs

  10. Rx Approach • Keep checkpoints • Fail > Rollback > Change Environment > Re-Execute • Disable modifications if it succeeds
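The cycle on slide 10 can be sketched in a few lines. This is a minimal illustration, not the Rx implementation: `rx_execute`, `Allergen`, `copy_request`, and the environment keys are hypothetical names, the "checkpoint" is just a deep copy of a dict, and the failure is raised explicitly where a real sensor would detect it.

```python
import copy


class Allergen(Exception):
    """Stands in for a failure reported by a sensor (hypothetical name)."""


def rx_execute(task, state, env_changes):
    """Run task(state, env); on failure roll back and retry in changed environments.

    Returns the name of the change that let the task succeed, or None if the
    first, unmodified run succeeded.
    """
    checkpoint = copy.deepcopy(state)            # keep a checkpoint
    try:
        task(state, env={})                      # normal execution, no changes
        return None
    except Allergen:
        pass                                     # failure detected
    for name, env in env_changes:
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # roll back to the checkpoint
        try:
            task(state, env=env)                 # re-execute in the changed environment
            return name
        except Allergen:
            continue                             # this change did not help; try the next
    raise RuntimeError("no environment change avoided the failure")


def copy_request(state, env):
    """Toy task: an 8-byte buffer that overflows unless it is padded."""
    capacity = 8 + env.get("pad_bytes", 0)
    state["log"].append("started")
    if len(state["request"]) > capacity:
        raise Allergen("buffer overflow")
    state["buffer"] = state["request"]


state = {"request": "x" * 12, "log": []}
fix = rx_execute(
    copy_request,
    state,
    [("zero_fill", {"zero_fill": True}), ("pad_buffers", {"pad_bytes": 16})],
)
print(fix, state["log"])
```

Each retry starts from the checkpoint, so side effects of failed attempts (here, the extra log entries) do not leak into the surviving run. The changed environment is used only for the re-execution; the next call to `rx_execute` starts again with no changes, which mirrors the "disable modifications if it succeeds" step.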

  11. Rx Approach • Execution Environment • Anything external to the application affecting it • Low Level – Hardware • Middle Level – OS Kernel: scheduling, VM system, FS, drivers, etc. • High Level – libraries • Change must be: • Correctness-preserving – follow APIs, do the same thing • Avoid bugs – potentially fix a software defect

  12. Rx Approach • Environmental changes and bugs

  13. Rx Design • 5 parts • Sensors • Checkpoint and Rollback (CR) • Environment Wrapper • Proxy • Control Unit

  14. Rx Design: Sensors • Detect failures and inform the control unit • Two types of sensors: • Detect software errors • Detect bugs before they cause crashes • Only the 1st is implemented • Provide information about the type of exception, memory address, and stack signature

  15. Rx Design: Checkpoint and Rollback • Checkpoints are automatically and transparently taken • Application memory, accessed files and file pointers are copied by copy-on-write • Kept in memory (no disk accesses), old checkpoints can be written to disk • Using checkpoints too far back takes too long

  16. Rx Design: Checkpoint and Rollback • Based on previous work, Flashback in 2004 • Because Rx doesn’t require determinism, it avoids overhead

  17. Rx Design: Environment Wrappers • Carry out the environment changes during re-execution • Memory Wrapper • Intercepts memory library calls (malloc, free, etc) • Supports 4 environmental changes • Delaying free • Padding buffers • Allocation isolation • Zero-filling • Safe, no changes to API
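To make three of the four memory changes concrete, here is a toy allocator wrapper. It is a sketch under loud assumptions: `MemoryWrapper` and its parameters are hypothetical, buffers are Python `bytearray` objects rather than real heap blocks, and the byte `0xAA` is used to model uninitialized memory. Allocation isolation is not modeled. A real wrapper would interpose on the C library's `malloc` and `free` instead.

```python
from collections import deque


class MemoryWrapper:
    """Toy stand-in for a wrapper around malloc/free (hypothetical API)."""

    def __init__(self, pad=0, zero_fill=False, delay_free=0):
        self.pad = pad                # extra bytes added to every buffer
        self.zero_fill = zero_fill    # clear each buffer before handing it out
        self.delay_free = delay_free  # how many freed buffers to hold back from reuse
        self._free_list = []          # buffers available for reuse
        self._quarantine = deque()    # freed buffers that may not be reused yet

    def malloc(self, size):
        need = size + self.pad
        for i, candidate in enumerate(self._free_list):
            if len(candidate) >= need:
                buf = self._free_list.pop(i)   # reuse: may still hold stale data
                break
        else:
            buf = bytearray(b"\xAA" * need)    # 0xAA models uninitialized memory
        if self.zero_fill:
            buf[:] = bytes(len(buf))           # overwrite every byte with zero
        return buf

    def free(self, buf):
        self._quarantine.append(buf)
        # Release only the oldest buffers beyond the delay window.
        while len(self._quarantine) > self.delay_free:
            self._free_list.append(self._quarantine.popleft())


plain = MemoryWrapper()
a = plain.malloc(4)
plain.free(a)
print(plain.malloc(4) is a)     # freed block is handed out again immediately

delayed = MemoryWrapper(delay_free=2)
b = delayed.malloc(4)
delayed.free(b)
print(delayed.malloc(4) is b)   # freed block is still quarantined
```

The caller sees the same `malloc`/`free` interface in every configuration, which is the sense in which these changes are safe: a correct program cannot tell the difference, while a program with a dangling pointer, a small overflow, or an uninitialized read may stop failing.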

  18. Rx Design: Environment Wrappers • Message Wrapper • Implemented in the proxy, controls message ordering • Changes include • Shuffling requests • Randomized packet sizes • Helps avoid non-deterministic bugs • No change to execution – server should not expect any ordering or size
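The two message-level changes can be illustrated as follows. This is only a sketch: `random_packets` and `shuffle_requests` are hypothetical helpers, and a seeded `random.Random` is passed in so that a run can be reproduced.

```python
import random


def random_packets(payload, rng, max_size=8):
    """Split a byte string into packets of random size between 1 and max_size."""
    packets = []
    i = 0
    while i < len(payload):
        n = rng.randint(1, max_size)
        packets.append(payload[i:i + n])   # the last packet may be shorter than n
        i += n
    return packets


def shuffle_requests(requests, rng):
    """Return the same requests in a different, randomly chosen order."""
    shuffled = list(requests)              # leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)
data = b"GET /index.html HTTP/1.1\r\n\r\n"
packets = random_packets(data, rng)
print(b"".join(packets) == data)           # the byte stream itself is unchanged
print(shuffle_requests(["r1", "r2", "r3"], rng))
```

Neither helper alters what is delivered, only how it is cut up or ordered, which matches the slide's point that a correct server should not depend on any particular ordering or packet size.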

  19. Rx Design: Environment Wrappers • Process Scheduling • Change priority • Signal Delivery • Signals are recorded and can be delivered randomly • Dropping User Requests • Binary search to narrow down possible bad user request and drop
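The binary search for a bad request might look like the sketch below. It assumes exactly one request in the batch triggers the failure and that the failure reproduces deterministically on replay. `find_bad_request` and the `replay_ok` callback are hypothetical: in a real system `replay_ok` would roll back to a checkpoint, replay the given requests, and report whether the server survived.

```python
def find_bad_request(requests, replay_ok):
    """Return the index of the single request whose replay causes the failure."""
    lo, hi = 0, len(requests)            # the bad request lies in requests[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if replay_ok(requests[lo:mid]):
            lo = mid                     # first half is clean, search the second half
        else:
            hi = mid                     # failure reproduced in the first half
    return lo


requests = ["GET /a", "GET /b", "GET /%n%n", "GET /c", "GET /d"]


def replay_ok(batch):
    # Stand-in for "roll back, replay the batch, check the sensors".
    return all("%n" not in request for request in batch)


bad = find_bad_request(requests, replay_ok)
survivors = requests[:bad] + requests[bad + 1:]
print(bad, survivors)
```

With n recorded requests this needs about log2(n) replays instead of n, which matters because every replay costs a rollback.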

  20. Rx Design: Proxy • Records and replays messages on re-execution • Simply forwards messages during normal execution • On recovery, the proxy • Replays requests • Carries out message-related environment changes • Buffers incoming messages for after failure recovery • Keeps track of which requests received responses
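A single-threaded sketch of the proxy's bookkeeping follows. The class and method names are hypothetical, the server is modeled as a plain callable, and rollback of the server itself is assumed to happen elsewhere between `begin_recovery` and `finish_recovery`.

```python
class Proxy:
    """Toy record-and-replay proxy (hypothetical API, not the Rx code)."""

    def __init__(self, server):
        self.server = server       # callable: request -> response, may raise
        self.log = []              # requests received since the last checkpoint
        self.answered = set()      # indices into log whose response reached the client
        self.pending = []          # requests that arrived while recovering
        self.recovering = False

    def handle(self, request):
        if self.recovering:
            self.pending.append(request)       # buffer until recovery finishes
            return None
        self.log.append(request)               # record, then simply forward
        response = self.server(request)
        self.answered.add(len(self.log) - 1)
        return response

    def begin_recovery(self):
        self.recovering = True

    def finish_recovery(self, reorder=None):
        """Replay the log, then serve buffered requests; return the new replies."""
        order = list(range(len(self.log)))
        if reorder is not None:
            order = reorder(order)             # message-related environment change
        replies = {}
        for i in order:
            response = self.server(self.log[i])    # replay against the rolled-back server
            if i not in self.answered:             # never answer the same request twice
                replies[i] = response
        self.recovering = False
        for request in self.pending:               # now serve what arrived meanwhile
            self.log.append(request)
            replies[len(self.log) - 1] = self.server(request)
        self.pending = []
        self.answered.update(replies)              # adds the dict's keys (log indices)
        return replies
```

Tracking `answered` is what keeps recovery invisible to clients: a request that already received its response is replayed to rebuild server state, but its second response is discarded.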

  21. Rx Design: Control Unit • Coordinates the other components and performs 3 functions: • Directs checkpointing and requests rollbacks • Diagnoses failures based on symptoms and experiences and chooses changes to use • Gives an information report for programmers • Keeps a failure table to judge how well each environmental change works for future reference
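One way to picture the failure table is as a score per (symptom, environment change) pair. The scoring rule below is an assumption made for illustration, as are the class name, the change names, and the symptom string; the slides do not say how Rx actually ranks changes.

```python
from collections import defaultdict


class FailureTable:
    """Remembers which environment changes helped for which failure symptom."""

    def __init__(self, changes):
        self.changes = list(changes)   # default order in which changes are tried
        # symptom -> change -> score (assumed rule: +1 per survival, -1 per repeat failure)
        self.scores = defaultdict(lambda: defaultdict(int))

    def record(self, symptom, change, survived):
        self.scores[symptom][change] += 1 if survived else -1

    def ranked(self, symptom):
        """Changes to try for this symptom, best score first; ties keep default order."""
        seen = self.scores[symptom]
        return sorted(self.changes, key=lambda change: -seen[change])


table = FailureTable(["delay_free", "pad_buffers", "zero_fill"])
table.record("SIGSEGV in parse_header", "delay_free", survived=False)
table.record("SIGSEGV in parse_header", "pad_buffers", survived=True)
print(table.ranked("SIGSEGV in parse_header"))
```

On a repeat of the same symptom, the control unit could try `pad_buffers` first and skip straight past changes that have already failed, which shortens recovery for recurring bugs.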

  22. Rx Design • Multi-threaded process checkpointing • Threads must be at the user level before taking a checkpoint because of kernel locks and state issues • A signal makes threads exit blocked calls to take the checkpoint, then Rx retries them • Big I/O problems with this method, cannot set checkpoint interval too short

  23. Experimental Results • 4 different sets of tests • Surviving failures • Performance overhead • Malicious requests • Learning from previous failures • Tested with 4 real applications • Apache httpd – web server • Squid – proxy server • MySQL – database server • CVS – version control server • 6 bugs: data race, buffer overflow, uninitialized read, dangling pointer, stack overflow, double free

  24. Experimental Results • Alternatives are the whole program restart or a rollback and re-execute method • Rx provides availability and is faster than restart methods except in the case of very simple programs (CVS) • If the bug is deterministic, restarting will likely cause a crash again

  25. Experimental Results

  26. Experimental Results

  27. Experimental Results

  28. Future Work • Inter-server communication • If Rx is on all systems, it can roll back any that it needs to when a failure occurs • Coordinated checkpoints • Unavoidable bugs/failures • Memory leaks – requires whole program restart • Deadlocks • Semantic bugs that have nothing to do with the environment • Undetectable bugs – need better sensors • Implement Proxy at the kernel level

  29. Evaluation • Safe/fast recovery from certain bugs, but not all bugs • Masks failures from users, provides availability • Rx was only tested on I/O-bound applications; overhead may be larger for computation-bound applications

  30. Discussion • Questions?
