Shimin Chen LBA Reading Group Presentation

Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures. Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP’05.

Presentation Transcript


  1. Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures. Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP’05. Shimin Chen, LBA Reading Group Presentation

  2. Motivation • High availability is important • Critical applications: process control, etc. • Financial company: an hour of downtime costs $6 million • SW defects account for up to 40% of system failures • Common: memory-related bugs and concurrency bugs • Bugs still occur in production runs • Even after SW companies spend enormous effort on testing • Hence the need for mechanisms to survive software bugs

  3. Previous Work on Surviving SW Failures • Four categories: • Rebooting • Checkpointing and recovery • Application-specific mechanisms • Recent proposals: • Failure-oblivious computing • Reactive immune system

  4. Previous Work 1: Rebooting • Schemes: • Whole program restart • Micro-rebooting of partial system components • SW rejuvenation (proactively restart processes) • Problems: • Cannot deal with deterministic bugs • Restart time

  5. Previous Work 2: General checkpointing and recovery • Schemes: • Checkpoint, rollback, re-execute • Or use a backup server • Problems: • Cannot deal with deterministic bugs • Progressive retry in distributed systems: • Reorder messages to get around SW bugs, but not bugs on single system • N-version programming: • Too expensive

  6. Previous Work 3: Application-Specific Recovery Mechanisms • Multi-process model (MPM) • Kill a request-handling process and start a new one • Problems: • Cannot handle deterministic bugs • What if shared data structure is corrupted?

  7. Previous Work 4: Recent Non-Conventional Proposals • Failure-oblivious computing • Manufacture values for out-of-bound reads • Discard out-of-bound writes • Reactive immune system • Detect failures of function calls • Forcefully return from the function with a manufactured error return value (e.g. -1 for int, 0 for unsigned int etc.) • Problem: • Unsafe for correctness-critical applications (e.g. banking)
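
To make the safety concern on this slide concrete, here is a minimal Python sketch of the failure-oblivious idea. The class name `ObliviousBuffer` and the constant manufactured value are assumptions for illustration only; they are not taken from the failure-oblivious computing work itself, and the sketch handles integer indices only.

```python
class ObliviousBuffer:
    """Fixed-size buffer that hides out-of-bound accesses instead of failing."""

    def __init__(self, size, manufactured=0):
        self._data = [0] * size
        self._manufactured = manufactured  # assumed constant stand-in value

    def _in_bounds(self, index):
        return 0 <= index < len(self._data)

    def __getitem__(self, index):
        # Out-of-bound read: return a manufactured value rather than raising.
        if self._in_bounds(index):
            return self._data[index]
        return self._manufactured

    def __setitem__(self, index, value):
        # Out-of-bound write: silently discard it.
        if self._in_bounds(index):
            self._data[index] = value


balance = ObliviousBuffer(2)
balance[5] = 100     # discarded without any error
print(balance[5])    # prints 0, the manufactured value, not 100
```

The program keeps running, but it now computes with a value nobody wrote, which is why the slide calls this unsafe for correctness-critical applications.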

  8. New Proposal: Rx • Roll back the program to a recent checkpoint when a bug is detected • Dynamically change the execution environment based on the failure symptoms • Re-execute the buggy code in the new environment • Features: • Comprehensive: can deal with deterministic bugs • Safe: does not speculatively “fix” bugs, but changes the environment • Noninvasive: no changes to app source code • Efficient • Informative: helps locate the bugs
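
The rollback, change, re-execute cycle on this slide can be sketched roughly as follows. This is a toy model, not the Rx implementation: `rx_recover`, `copy_request`, and the `"padding"` environment key are hypothetical names, a deep copy stands in for a real checkpoint, and Python exceptions stand in for failure symptoms.

```python
import copy


def rx_recover(state, region, environments):
    """Run `region`; on failure, roll back and retry under each environment."""
    checkpoint = copy.deepcopy(state)      # stand-in for a real checkpoint
    log = []                               # kept for offline diagnosis
    for env in [{}] + list(environments):  # first attempt: unchanged environment
        attempt = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            result = region(attempt, env)
        except Exception as exc:           # failure symptom detected
            log.append((env, type(exc).__name__))
            continue
        log.append((env, "ok"))
        return result, log
    raise RuntimeError(f"no environment change survived the failure: {log}")


def copy_request(state, env):
    # The toy "allocator" honours the environment: optional extra padding.
    buf = [0] * (8 + env.get("padding", 0))
    for i, byte in enumerate(state["request"]):  # bug: no bounds check
        buf[i] = byte
    return buf[:8]


result, log = rx_recover({"request": list(range(10))}, copy_request,
                         [{"padding": 4}])
print(log)  # first attempt fails with IndexError, padded retry succeeds
```

Note that the buggy code itself is never modified; only the environment it runs in changes, which is the sense in which the slide calls Rx safe and noninvasive.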

  9. Outline • Introduction • Main Idea of Rx • Rx Design & Implementation • Evaluation • Summary

  10. Main Idea Record the changes for offline diagnosis

  11. Useful Execution Environmental Changes • Must be safe and may avoid bugs • Memory management based • Buffer overflows, dangling pointers, etc. • Timing based • Concurrency bugs • User request based • Dropping unexpected (malicious) user requests • As a last resort

  12. Outline • Introduction • Main Idea of Rx • Rx Design & Implementation • Evaluation • Summary

  13. Rx Components Overview • [Figure: diagram with five components numbered 1 to 5; not reproduced in this transcript]

  14. Sensors for Detecting SW Failures • OS-raised exceptions: • Assertion failures, segfault, divide-by-zero, etc. • Fine-grain detection: • Buffer overflow, accesses to freed memory, etc. • Only the OS-raised exception sensors are implemented
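
As a loose analogy for an OS-raised-exception sensor, the sketch below catches Python exceptions and turns them into a symptom record (exception type, failing function, call chain). Everything here is an assumption for illustration: `detect_failure` is a hypothetical name, and Python exceptions only approximate the signals a real sensor would see.

```python
import traceback


def detect_failure(region, *args):
    """Run `region`; return (result, None) or (None, symptom) on failure."""
    try:
        return region(*args), None
    except (AssertionError, ZeroDivisionError) as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        symptom = (
            type(exc).__name__,              # exception type
            frames[-1].name,                 # function where the failure occurred
            tuple(f.name for f in frames),   # call chain, outermost first
        )
        return None, symptom


def average(values):
    return sum(values) / len(values)   # fails on an empty list


print(detect_failure(average, []))
```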

  15. Checkpoint and Rollback (Flashback) • Memory state: fork-like operation • Files: keep a copy of each accessed file and its file pointer for a checkpoint • Checkpoint management: • Equal intervals or exponential landmarks • Limit oldest checkpoint by considering recovery time goal • Multi-threaded process checkpointing • Send a signal to all threads to make them exit from blocked syscalls with EINTR • Take checkpoint • Library wrapper in Rx retries syscalls • High cost so cannot be frequent
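
The last step of the multi-threaded scheme, a library wrapper that retries system calls interrupted by the checkpoint signal, might look roughly like this. The helper `retry_on_eintr`, its `max_retries` cap, and the `flaky_read` stand-in are hypothetical; Python reports EINTR as `InterruptedError`, and since version 3.5 its standard library already retries most interrupted calls on its own, so this is purely illustrative.

```python
import errno


def retry_on_eintr(syscall, *args, max_retries=10):
    """Call `syscall(*args)`, retrying when it is interrupted (EINTR)."""
    for _ in range(max_retries):
        try:
            return syscall(*args)
        except InterruptedError:
            continue   # interrupted, e.g. by the checkpoint signal: try again
    raise InterruptedError(errno.EINTR, "still interrupted after retries")


# Stand-in for a blocking read that is interrupted twice by checkpoints.
attempts = {"count": 0}


def flaky_read(fd):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise InterruptedError(errno.EINTR, "interrupted by checkpoint signal")
    return b"data"


print(retry_on_eintr(flaky_read, 3))   # prints b'data' after two retries
```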

  16. Environment Wrappers • Memory wrapper: (intercepting library calls) • Delaying free: • keep a freed buffer for a threshold (process) time • FIFO recycling • Padding buffers: • add fixed-size padding to both ends of allocated buffers • Allocation isolation: • put allocated buffers in isolated locations • Zero-filling • Do the above during re-execution for failed code region only
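
A toy model of three of these memory-wrapper changes (padding, zero-filling, delayed free with FIFO recycling) is sketched below over a `bytearray` heap. `RxAllocator` and its parameters are hypothetical, the delay is counted in number of frees rather than process time, and allocation isolation is not modelled; a real wrapper would intercept `malloc` and `free` in the C library instead.

```python
from collections import deque


class RxAllocator:
    """Toy allocator over a bytearray that models memory-wrapper changes."""

    def __init__(self, heap_size, padding=0, zero_fill=False, delay_free=0):
        self.heap = bytearray(b"\xaa" * heap_size)  # 0xAA models stale data
        self.top = 0
        self.padding = padding        # bytes added at both ends of a buffer
        self.zero_fill = zero_fill
        self.delay_free = delay_free  # frees to wait before recycling (assumed unit)
        self.free_list = []           # (offset, size) blocks ready for reuse
        self.quarantine = deque()     # freed blocks, recycled in FIFO order

    def malloc(self, size):
        total = size + 2 * self.padding
        for i, (offset, block_size) in enumerate(self.free_list):
            if block_size >= total:   # first fit among recycled blocks
                del self.free_list[i]
                break
        else:                         # nothing to reuse: bump-allocate
            offset = self.top
            if offset + total > len(self.heap):
                raise MemoryError("toy heap exhausted")
            self.top += total
        start = offset + self.padding
        if self.zero_fill:
            self.heap[start:start + size] = bytes(size)
        return start

    def free(self, start, size):
        self.quarantine.append((start - self.padding, size + 2 * self.padding))
        while len(self.quarantine) > self.delay_free:
            self.free_list.append(self.quarantine.popleft())


normal = RxAllocator(64)
a, b = normal.malloc(4), normal.malloc(4)
print(b - (a + 4))   # prints 0: a one-byte overflow of a lands in b

padded = RxAllocator(64, padding=8)
a, b = padded.malloc(4), padded.malloc(4)
print(b - (a + 4))   # prints 16: the overflow lands in padding instead
```

With `delay_free` set above zero, a freed block is not handed out again immediately, so a dangling pointer keeps seeing its old contents for a while instead of another object's data.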

  17. Other Wrappers • Message wrapper (in proxy) • Randomly shuffle the message order across different connections while keeping the message order within each connection • Randomize packet sizes • Process scheduling: change the process priority • Signal delivery: randomize hw interrupt delivery time while preserving order • Dropping user requests • Binary search for bad requests • Drop at most 10% of requests
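
Two of these wrappers lend themselves to a short sketch: shuffling messages across connections while preserving per-connection order, and binary search for a bad request. Both functions below are hypothetical illustrations. The search assumes a single culprit request and a caller-supplied `replay_ok` predicate that re-executes a request list and reports success; the 10% drop limit is not modelled.

```python
import random
from collections import defaultdict, deque


def shuffle_across_connections(messages, rng):
    """Reorder (connection, payload) pairs; order within a connection is kept."""
    queues = defaultdict(deque)
    for conn, payload in messages:
        queues[conn].append(payload)
    shuffled = []
    while queues:
        conn = rng.choice(sorted(queues))   # pick any connection with mail
        shuffled.append((conn, queues[conn].popleft()))
        if not queues[conn]:
            del queues[conn]
    return shuffled


def find_bad_request(requests, replay_ok):
    """Return the index of the single request whose removal avoids the failure."""
    if not requests:
        raise ValueError("no requests to search")
    lo, hi = 0, len(requests)               # culprit index lies in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Replay with requests[lo:mid] dropped.
        if replay_ok(requests[:lo] + requests[mid:]):
            hi = mid                        # failure gone: culprit was dropped
        else:
            lo = mid                        # failure remains: culprit is later
    return lo


msgs = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3)]
print(shuffle_across_connections(msgs, random.Random(0)))

reqs = ["GET /a", "GET /b", "BAD", "GET /c", "GET /d"]
print(find_bad_request(reqs, lambda rs: "BAD" not in rs))   # prints 2
```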

  18. Proxy

  19. Control Unit • Coordinate checkpoint/rollback, environment changes, etc. • Failure vector <S1, S2, …, Sm> per failure symptom (exception type, PC address, call chain, etc.) • Si is the score for environmental change #i • If change #i is successful, Si++; if it fails, Si-- • Try the changes with scores greater than a certain threshold first
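
The score-vector bookkeeping described on this slide can be sketched as follows. The class `ControlUnit`, its method names, the default threshold of 0, and the choice to order above-threshold changes by descending score are assumptions; the slide only states that changes above the threshold are tried first.

```python
from collections import defaultdict


class ControlUnit:
    """Keeps one score vector <S1..Sm> per failure symptom."""

    def __init__(self, changes, threshold=0):
        self.changes = list(changes)      # environmental changes #1..#m
        self.threshold = threshold
        self.scores = defaultdict(lambda: [0] * len(self.changes))

    def plan(self, symptom):
        """Return the changes to try, above-threshold scores first."""
        scores = self.scores[symptom]
        indices = range(len(self.changes))
        preferred = [i for i in indices if scores[i] > self.threshold]
        others = [i for i in indices if scores[i] <= self.threshold]
        preferred.sort(key=lambda i: -scores[i])   # assumed tie-breaking rule
        return [self.changes[i] for i in preferred + others]

    def report(self, symptom, change, succeeded):
        """Si++ if change #i avoided the failure, Si-- otherwise."""
        i = self.changes.index(change)
        self.scores[symptom][i] += 1 if succeeded else -1


unit = ControlUnit(["padding", "delay_free", "shuffle_messages"])
symptom = ("SIGSEGV", 0x4005D0)           # hypothetical (exception type, PC)
unit.report(symptom, "padding", succeeded=False)
unit.report(symptom, "delay_free", succeeded=True)
print(unit.plan(symptom))   # delay_free is now tried first for this symptom
```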

  20. Outline • Introduction • Main Idea of Rx • Rx Design & Implementation • Evaluation • Summary

  21. Setup • A client machine and a server machine • 2.4GHz x86 CPU, 512KB L2 cache, 1GB DRAM • 100Mbps Ethernet • Injected bugs

  22. Overall Results

  23. Checkpoint Overhead • Time: with checkpoint interval of 200ms, 5% overhead (MySQL) • Workloads: • apache, squid: 5 threads, GET files with size uniform [1KB, 512KB] • CVS: client exports a 30KB file • MySQL: 5 client threads, transactions on a small table

  24. Summary • Rx: re-executing the buggy program region in a modified execution environment • Not a panacea: • Semantic bugs, resource leaks • Latent bugs (long delay from bug to symptom)
