
Recovery Management in QuickSilver

Recovery Management in QuickSilver. Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan, IBM Almaden Research Center. Introduction: Problem Domain. Recovery management in distributed OSs. Trends in contemporary research: extensibility and distribution.


Presentation Transcript


  1. Recovery Management in QuickSilver Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan IBM Almaden Research Center

  2. Introduction: Problem Domain • Recovery management in distributed OSs • Trends in contemporary research: • Extensibility and Distribution

  3. Contemporary Recovery Techniques • timeouts: how to distinguish slow from dead? • connectionless protocols / stateless servers: some actions can’t be made idempotent; retries can cause problems • virtual circuits: can’t handle multiple servers • replication: too expensive for some uses; how to detect failures?

  4. QuickSilver: what’s so special? • Fundamental trade-off: generality & efficiency (QuickSilver) vs. ease of use (Camelot, Argus, etc.) • Transparency isn’t always best!

  5. QuickSilver: specs and features • Client-server model • System services are processes • IPC message-passing • More complicated set of failure modes (to handle more specific cases) • Atomic transactions

  6. Server Classes • Common server classes: volatile (window manager), replicated + volatile (name server), recoverable (file server) • Long-running transactions need log support

  7. Design Goals • Programs should be resilient to external process and machine failure • Server processes should contain their own recovery code • Uniform system-wide architecture for recovery management • Logically related activities must execute atomically

  8. Transaction Structure • Everything belongs to a transaction • Globally unique transaction identifiers (tid) • Each transaction has one owner and multiple participants • Owner can commit or abort • Participants can only abort
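The ownership rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Transaction` class, its method names, and the use of a random UUID as the tid are hypothetical and are not taken from QuickSilver itself.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Hypothetical model: one owner, many participants, one global tid."""
    owner: str
    # Assumption: a random UUID stands in for a globally unique tid.
    tid: str = field(default_factory=lambda: uuid.uuid4().hex)
    participants: set = field(default_factory=set)
    state: str = "active"

    def join(self, process: str) -> None:
        """Record a process as a participant in this transaction."""
        self.participants.add(process)

    def commit(self, caller: str) -> None:
        # Only the owner may ask for commit.
        if caller != self.owner:
            raise PermissionError("only the owner can commit")
        self._finish("committed")

    def abort(self, caller: str) -> None:
        # The owner and any participant may abort.
        if caller != self.owner and caller not in self.participants:
            raise PermissionError("caller is not part of this transaction")
        self._finish("aborted")

    def _finish(self, new_state: str) -> None:
        if self.state != "active":
            raise RuntimeError("transaction already " + self.state)
        self.state = new_state
```

For example, after `t = Transaction(owner="client")` and `t.join("file_server")`, a call to `t.commit("file_server")` is rejected, while `t.abort("file_server")` ends the transaction.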

  9. Recovery Manager: Components • Transaction Manager: manages commit coordination by communicating with servers at its own node and with transaction managers at other nodes • Log Manager: serves as a common recovery log both for the TM’s commit log and the server’s recovery data • Deadlock Detector: detects and resolves global deadlocks (not implemented)

  10. QuickSilver System Structure

  11. Transaction Manager • Tracks transactions for processes on host • Manages distributed commit protocol • Distributed transaction is a tree: only need to know your superior and your immediate subordinates • Several alternative commit protocols available to servers: 1-phase (used by volatile servers), 2-phase (used by recoverable servers)
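The tree-structured commit described on this slide can be sketched as follows. This is a simplified model, not QuickSilver's actual protocol: the `Node` class, the `will_commit` flag, and `run_commit` are hypothetical names, only the two-phase variant is modelled, and failures and logging are ignored. The point it shows is that each node talks only to its superior and its immediate subordinates.

```python
class Node:
    """Hypothetical transaction manager at one node of the commit tree."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit  # assumption: the local vote is fixed up front
        self.superior = None
        self.subordinates = []
        self.outcome = None

    def add_subordinate(self, child):
        """Attach a child node; each node knows only its direct neighbours."""
        child.superior = self
        self.subordinates.append(child)
        return child

    def prepare(self):
        """Phase 1: gather votes from the subtree rooted at this node."""
        # Build the list first so every subordinate is asked, even after a "no".
        votes = [child.prepare() for child in self.subordinates]
        return self.will_commit and all(votes)

    def decide(self, outcome):
        """Phase 2: record the outcome and pass it down to subordinates."""
        self.outcome = outcome
        for child in self.subordinates:
            child.decide(outcome)


def run_commit(root):
    """Run both phases from the coordinator at the root of the tree."""
    outcome = "commit" if root.prepare() else "abort"
    root.decide(outcome)
    return outcome
```

With a chain such as workstation, node A, node B, a single "no" vote at node B travels up through node A to the root, and the abort decision travels back down the same path.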

  12. 2-Phase Commit • Voting options • abort: undo my action, announce abort to others in 2nd phase • commit-read-only: no recoverable resources modified, don’t include me in 2nd phase • commit-volatile: same as read-only, but notify me of results of 2nd phase • commit-recoverable: recoverable state modified, notify me of results of 2nd phase
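The four voting options can be illustrated with a small tally function. This is a sketch under stated assumptions: the `Vote` enum and `tally` function are hypothetical, and it assumes that an aborting voter does not itself need a second-phase notification, which the slide does not state.

```python
from enum import Enum


class Vote(Enum):
    ABORT = "abort"
    COMMIT_READ_ONLY = "commit-read-only"
    COMMIT_VOLATILE = "commit-volatile"
    COMMIT_RECOVERABLE = "commit-recoverable"


def tally(votes):
    """Return (outcome, participants to notify in the second phase).

    `votes` maps a participant name to its Vote.
    """
    # A single abort vote aborts the whole transaction.
    outcome = "abort" if Vote.ABORT in votes.values() else "commit"
    # Read-only voters drop out after phase 1; volatile and recoverable
    # voters asked to be told the result. Assumption: aborters already
    # know the outcome and are not notified again.
    notify = sorted(
        name
        for name, vote in votes.items()
        if vote in (Vote.COMMIT_VOLATILE, Vote.COMMIT_RECOVERABLE)
    )
    return outcome, notify
```

The read-only vote is what lets a participant that modified no recoverable resources skip the second phase entirely, which saves one round of messages for it.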

  13. Transaction Coordination • Transaction coordinator at transaction birth-site • Usually a user workstation, likely to fail • Migrate or replicate coordinator for reliability

  14. Log Manager • Log manager provides optional services • Backpointers for log replay • Block I/O access • Log replication • Log archival • Servers tell LM what they need • Not penalized for services they don’t use • LM does not interpret data – servers determine recovery strategy
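One way to picture the "pay only for what you use" idea is a log manager where each server registers the optional services it wants and the log stores opaque bytes. The sketch below is hypothetical throughout: the `LogManager` class, its method names, and the service labels are illustrative, and only the backpointer service is actually modelled.

```python
class LogManager:
    """Hypothetical shared log that never interprets server payloads."""

    OPTIONAL_SERVICES = {"backpointers", "block_io", "replication", "archival"}

    def __init__(self):
        self._records = []   # entries are (server, payload, previous index or None)
        self._services = {}  # server name -> set of requested services
        self._last = {}      # server name -> index of its latest record

    def register(self, server, services=()):
        """A server tells the log manager which optional services it needs."""
        unknown = set(services) - self.OPTIONAL_SERVICES
        if unknown:
            raise ValueError("unknown services: " + ", ".join(sorted(unknown)))
        self._services[server] = set(services)

    def append(self, server, payload):
        """Append opaque bytes; return the record's index in the log."""
        index = len(self._records)
        prev = None
        # Backpointer bookkeeping is done only for servers that asked for it.
        if "backpointers" in self._services[server]:
            prev = self._last.get(server)
            self._last[server] = index
        self._records.append((server, payload, prev))
        return index

    def replay_backwards(self, server):
        """Yield one server's payloads, newest first, by following backpointers."""
        if "backpointers" not in self._services[server]:
            raise RuntimeError(server + " did not request backpointers")
        index = self._last.get(server)
        while index is not None:
            _, payload, index = self._records[index]
            yield payload
```

Because the payloads are plain bytes, the recovery strategy (what to write, and how to interpret it on replay) stays entirely with each server.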

  15. QuickSilver Distributed IPC

  16. Structure of a Distributed Transaction

  17. Open questions - ??? • Efficiency vs. Transparency? • Still relevant for today’s hardware? • …
