
Peng Liu Pennsylvania State University University Park, PA 16802 July 20, 2007

MURI: Autonomic Recovery of Enterprise-wide Systems After Attack or Failure with Forward Correction: System-Level Design & Implementation


Presentation Transcript


  1. MURI: Autonomic Recovery of Enterprise-wide Systems After Attack or Failure with Forward Correction: System-Level Design & Implementation • Peng Liu, Pennsylvania State University, University Park, PA 16802 • July 20, 2007

  2. Outline • Recovery angle of enterprise health care • The recovery problem • The state-of-the-art • Our goal • Year-by-year plan: overview • Year one plan: zoomed-in

  3. The recovery problem: (1) Patient systems • A human patient vs. a “patient” system • A “patient” system: app processes, OS • Components: code, stack, heap, (VM) pages, files, sockets, PCB, page tables, registers, sys. calls, drivers, … • Threat: virus infection

  4. (2) Why a compromised system could be called a patient • A “patient” system ↔ a human patient • Process ↔ organ • Text, stack, heap, pages, files ↔ tissues • Memory unit, register, disk block, … ↔ cells • OS ↔ neuro + blood • PCB, page tables, drivers, sys. calls, scheduler, sockets, interceptions, … ↔ neuro sub-systems, blood sub-systems • Memory unit, register, disk block, … ↔ cells

  5. The recovery problem: (3) System state transition • A system’s state is determined by the state values of its components: stack, heap, files, registers, … • Timeline: State at 8am → State at 9am → State at 10am → … → State at 5pm • Checkpoints: C-8am, C-9am, C-10am, …, C-5pm • Component x is poisoned by an attack at 9:30am • Fact: “infection” can propagate

  6. (4) Simplest full-system recovery: before • Body timeline: Body at week 1 → Body at week 2 → Body at week 3 → … → Body at week 10 • Lymph, gallbladder, etc. • System timeline: State at 8am → State at 9am → State at 10am → … → State at 5pm • Checkpoints: C-8am, C-9am, C-10am, …, C-5pm • Component x (liver) is poisoned by an attack at 9:30am

  7. (5) Simplest full-system recovery: after • Body timeline: Body at week 1 → Body at week 2 • Bob’s memory after week 2 is lost: very painful for Bob • System timeline: State at 8am → State at 9am • Checkpoints: C-8am, C-9am • Component x (liver) is poisoned by an attack at 9:30am • The work after 9am is lost • Memory-less recovery
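
The memory-less recovery on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the function name `memoryless_recover`, the dictionary of checkpoints keyed by hour of day, and the sample state values are all hypothetical.

```python
from bisect import bisect_left


def memoryless_recover(checkpoints, attack_time):
    """Return (time, state) of the latest checkpoint taken strictly before the attack.

    checkpoints: dict mapping a checkpoint time (hour of day) to a state dict.
    Everything done after the returned time is discarded.
    """
    times = sorted(checkpoints)
    idx = bisect_left(times, attack_time) - 1
    if idx < 0:
        raise ValueError("no clean checkpoint precedes the attack")
    t = times[idx]
    return t, dict(checkpoints[t])


# Hypothetical checkpoints C-8am, C-9am, C-10am, C-5pm (17 = 5pm).
checkpoints = {
    8: {"report": "v1"},
    9: {"report": "v2"},
    10: {"report": "v3", "x": "poisoned"},
    17: {"report": "v5", "x": "poisoned"},
}

restored_at, state = memoryless_recover(checkpoints, attack_time=9.5)
lost_hours = max(checkpoints) - restored_at  # work after the restored checkpoint is lost
```

With the attack at 9:30am, the sketch restores C-9am and throws away all later work, including the clean updates to `report`. That loss is what memory-preserving recovery tries to avoid.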

  8. (6) Memory-preserving, full-system recovery • What is memory-preserving recovery? • When we perform surgeries on the liver, do not roll back the state of the brain • When we repair an infected process, do not roll back any uninfected process • Memory-preserving recovery requires fine-grained process-level (i.e., organ-level) and operation-level forward correction surgeries • Memory-preserving recovery is challenging • Due to infection propagation, it is hard to know which (part of an) organ should be cut off and which should be kept

  9. (7) Full-body anesthesia vs. local anesthesia • There are two ways to perform surgeries: • Full-body “anesthesia”: The machine is halted during recovery • Local “anesthesia”: The uninfected processes can still be executed as usual • For non-stop enterprise computing, local “anesthesia” is required

  10. The recovery problem: (8) Infection quarantine • Why quarantine? • The under-repair components are infectious → prevent them from infecting clean processes • Execution of the uninfected processes may interfere with the surgeries → guarantee correctness • Quarantine = “disinfection” + local “anesthesia” • Quarantine strategies • Two-way quarantine • One-way quarantine
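
The two quarantine strategies can be illustrated as an access-control check. The sketch below is an interpretation, not the authors' design. It assumes that two-way quarantine blocks every access that crosses the tainted/untainted boundary, and that one-way quarantine lets a tainted process read (a version of) an untainted component but never write to it, while untainted processes may not touch tainted components. The names `Quarantine` and `access_allowed` are hypothetical.

```python
from enum import Enum


class Quarantine(Enum):
    TWO_WAY = "two-way"
    ONE_WAY = "one-way"


def access_allowed(policy, subject_tainted, object_tainted, mode):
    """Decide whether a process (subject) may access a component (object).

    mode is "read" or "write". Accesses that stay on one side of the
    tainted/untainted boundary are always allowed in this sketch.
    """
    if mode not in ("read", "write"):
        raise ValueError("mode must be 'read' or 'write'")
    if subject_tainted == object_tainted:
        return True
    if policy is Quarantine.TWO_WAY:
        # Tainted components are totally contained: nothing crosses the boundary.
        return False
    # ONE_WAY (assumed semantics): only tainted -> untainted reads are permitted.
    return bool(subject_tainted and mode == "read")
```

In a VMM-based design, a check of this kind would sit on the interception path for file, memory, and socket operations. Where exactly it is enforced is left open here.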

  11. The state-of-the-art • Memory-less recovery • Re-playable systems • Process checkpointing • Process migration • Memory-preserving subsystem recovery with full-body anesthesia

  12. The state-of-the-art: (1) memory-less recovery • One-button recovery • A standard feature in laptops (HP, Dell, etc.) • The OS will lose all “memory” • Simplest full-system recovery • Checkpoint-based • E.g., the whole state of a VM at time t can be copied to disk (State Procurement) • Will lose “memory” after the moment of attack

  13. The state-of-the-art: (2) re-playable systems • E.g., ReVirt can log and replay all operations of a virtual machine • re-playable ≠ recoverable • ReVirt cannot detangle bad operations from good ones • ReVirt cannot replay only the unaffected good operations • ReVirt cannot do forward correction • ReVirt cannot do local anesthesia • ReVirt cannot quarantine infection

  14. The state-of-the-art: (3) process checkpointing • Per-process checkpointing: Flashback (and Rx) can checkpoint the whole state of a process at time t in RAM • checkpoint-able ≠ recoverable • Flashback cannot detangle bad operations from good ones within the same process • Flashback cannot track taint-propagation channels • Flashback cannot do forward correction • Flashback cannot quarantine infection

  15. The state-of-the-art: (4) process migration • Process migration: • A Pod is a group of processes “tangled” with each other • Zap can migrate a Pod from machine A to B • Migrate-able ≠ recoverable • Zap cannot detangle bad operations from good ones • A partially infected Pod has to be totally “discarded” • Zap cannot track taint-propagation • Zap cannot do forward correction

  16. (5) Memory-preserving subsystem recovery with full-body anesthesia • Taser can do memory-preserving recovery, however, • Not full-system recovery: It can only repair file systems • Taser requires full-body anesthesia • Taser cannot quarantine infection • Taser cannot do on-the-fly surgeries • Compared with our blueprint, Taser does not have the capabilities to do: • Remote surgeries • Nested recovery • Replicated Recovery • Non-stop Recovery

  17. Our goal • Do memory-preserving, self-recoverable, non-stop enterprise computing: • Fine-grained recovery surgeries • Forward correction • Keep good “memory” in a consistent way • Remove bad “memory” • Local “anesthesia” • Quarantine infection during recovery • Transparent to uninfected processes

  18. Challenges: Multi-Granularity Recovery • Machine-level recovery • Processes are usually “tangled” with each other • It is not hard to checkpoint a VM, but • It is hard to detangle bad operations from good ones • Pod-level recovery • Zap can checkpoint and migrate a Pod, but • It is hard to do detangling • A partially infected Pod has to be totally “discarded” • Process-level recovery • A partially infected process has to be totally “discarded” • Need to track taint-propagation channels • Operation-level recovery: the desired granularity • Need fine-grained surgeries inside the “body” of a VM • Very hard to do selective replay or migration • Tough tradeoffs between recoverability and consistency

  19. Outline • Recovery angle of enterprise health care • The recovery problem • The state-of-the-art • Our goal • Year-by-year plan: overview • Year one plan: zoomed-in

  20. Recovery Services: Roadmap • Initial Capability: focus on processes and files; VMM-based logger; per-process atomicity; dependency analysis; quarantine via VMM; roll-forward correction • Silver Capability: holistic recovery (sockets, shared memory, DBMS, attributes, …; control dependencies: process forking, workflows); remote “surgeries” (EHCC sends instructions to remote surgery agents) • Gold Capability: nested recovery (intra-process checkpointing, nested transactions) • Platinum Capability: replicated recovery (heterogeneous VM replica, standby VM) • Diamond Capability: non-stop recovery (transparent switching, stateful migration)

  21. New recovery capabilities: basic ones

  22. New recovery capabilities: advanced ones • New capabilities can • Provide transactional atomicity & consistency • Do non-stop warm-start or hot-start recovery • Perform remote surgeries • Do intra-process checkpointing • Do nested recovery within a process • Do heterogeneous VM replication • Construct standby VM • Do stateful recovery-driven process migration • Side benefits: • Improved observation/inspection capability • Improved diagnosis/forensics capability • Improved detection capability

  23. Outline • Recovery angle of enterprise health care • The recovery problem • The state-of-the-art • Our goal • Year-by-year plan: overview • Year one plan: zoomed-in

  24. Year one: Initial Capability • Scope: local health care • Focus: app processes, files • Logging: VMM based • Atomicity: per-process • Dependency analysis based detangling • Local anesthesia via host kernel • Quarantine via VMM • On-the-fly, roll-forward correction

  25. System architecture • [Figure: architecture diagram. Labels recovered from the figure: App A, App B, display, process, stack, heap, task structure, timer, keyboard, ports, CPU, disks, cache, Guest OS (two instances), log, dependency analyzer, VMM auditor, quarantine, hook, roll-forward correction instruction generator, surgery agent, host kernel, drivers]

  26. Why run “patient” systems in a VM? • Enhanced security • App processes are isolated in a separate VM • The host kernel does not directly interact with the app processes • Although any component of a “patient” system may be compromised, the host kernel is quite safe • The audits and recovery code are well protected • Enhanced observation/inspection capability • Much easier to do local anesthesia • Much easier to quarantine • Much easier to perform repair surgeries • Downside: performance degradation

  27. Year one work plan • Team 1: QEMU-based implementation • Team 2: UML-based implementation • Each team has two graduate students • Goal of each implementation: • Phase I: be able to do incremental logging • Phase II: be able to do damage assessment and detangling • Phase III: be able to perform on-the-fly repair “surgeries”

  28. Phase I: do incremental logging • VM state-procurement techniques have recently been proposed, but • If the checkpoints are taken frequently → too much overhead • If the checkpoints are not taken frequently → “memory” loss • A better idea is logging only the changes • Any operation could change the state • If we log every state change → too much • So which changes do not need to be logged? • Are we able to log all changes? • QEMU-based CPU emulator can log every change • UML-based logger can log every trap to OS
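
The idea of logging only the changes can be sketched as a base checkpoint plus a sequence of deltas. This is a toy model under stated assumptions: states are plain dictionaries, snapshots are recorded in non-decreasing time order, and the names `diff`, `apply_delta`, and `IncrementalLog` are hypothetical. A real QEMU- or UML-based logger would capture changes at the instruction or system-call level instead.

```python
def diff(old, new):
    """Return the delta that turns `old` into `new`."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "del": removed}


def apply_delta(state, delta):
    """Return a new state with `delta` applied; the input is not modified."""
    result = dict(state)
    result.update(delta["set"])
    for key in delta["del"]:
        result.pop(key, None)
    return result


class IncrementalLog:
    """Base checkpoint plus time-ordered deltas (times must not decrease)."""

    def __init__(self, base):
        self.base = dict(base)
        self._last = dict(base)
        self.deltas = []  # list of (time, delta) pairs

    def record(self, time, state):
        delta = diff(self._last, state)
        if delta["set"] or delta["del"]:  # skip snapshots with no change
            self.deltas.append((time, delta))
            self._last = dict(state)

    def state_at(self, time):
        """Rebuild the state as of `time` by replaying deltas onto the base."""
        state = dict(self.base)
        for t, delta in self.deltas:
            if t > time:
                break
            state = apply_delta(state, delta)
        return state


log = IncrementalLog({"heap": 1, "file": "a"})
log.record(9, {"heap": 2, "file": "a"})
log.record(10, {"heap": 2, "file": "a"})  # unchanged, so nothing is logged
log.record(11, {"heap": 2})               # "file" was deleted
```

Any intermediate state can then be reconstructed on demand, for example `log.state_at(9.5)`, without storing a full checkpoint for each point in time.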

  29. Phase II: do damage assessment • The goal is to detangle tainted operations from untainted operations • Dependency analysis is required in order to do detangling • We have built various kinds of dependency graphs for data processing systems • We will extend these graphs to capture the taint-propagation channels in a VM • Fine-grained VM information flow analysis techniques have recently been proposed • Although their purpose is intrusion detection, they may be applied to serve our recovery purposes
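
Dependency-based damage assessment can be illustrated with a single pass over an operation log. The sketch assumes a deliberately simple model that is not taken from the slides: each logged operation lists the components it reads and writes, an operation is tainted if it reads any tainted component, and everything a tainted operation writes becomes tainted. The conservative choice that a clean overwrite does not "heal" a component is also an assumption. `Op` and `assess_damage` are hypothetical names.

```python
from collections import namedtuple

# One logged operation: which process ran it and which components it touched.
Op = namedtuple("Op", "op_id process reads writes")


def assess_damage(log, initially_tainted):
    """Return (tainted operation ids, tainted components) for a time-ordered log."""
    tainted_objs = set(initially_tainted)
    tainted_ops = []
    for op in log:
        if tainted_objs & set(op.reads):
            tainted_ops.append(op.op_id)
            tainted_objs |= set(op.writes)  # taint propagates through writes
    return tainted_ops, tainted_objs


log = [
    Op(op_id=1, process="A", reads={"x"}, writes={"f1"}),
    Op(op_id=2, process="B", reads={"f2"}, writes={"f3"}),
    Op(op_id=3, process="B", reads={"f1"}, writes={"f4"}),
    Op(op_id=4, process="C", reads={"f3"}, writes={"f5"}),
]

# Component "x" is the one poisoned by the attack.
tainted_ops, tainted_objs = assess_damage(log, initially_tainted={"x"})
```

Here process B runs one tainted operation (3) and one untainted operation (2). Separating the two is what operation-level detangling means, as opposed to discarding the whole process.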

  30. Phase III: perform on-the-fly repair surgeries • How to do local anesthesia? • Let the host kernel not schedule any tainted process that is under surgery • How to quarantine? • Let the VMM enforce the quarantine policy • Two-way quarantine: the tainted components are totally contained • One-way quarantine: a tainted process may access a version of an untainted component, but not vice versa • How to do forward correction surgeries? • Naïve idea: Replace the state value of every tainted component with a clean version • Real challenge: How to keep the clean versions of, e.g., 50 components consistent with one another
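
One way to picture forward correction is to start from a clean checkpoint and replay only the untainted operations, so that all repaired components come from the same consistent history. The sketch below assumes a simplified log in which each entry records the literal values an operation wrote, and assumes the set of tainted operation ids has already been produced by damage assessment. The function name `forward_correct` and the sample data are hypothetical, and this is not presented as the project's actual algorithm.

```python
def forward_correct(checkpoint, log, tainted_ops):
    """Rebuild a state in which tainted operations never happened.

    checkpoint:  clean state dict taken before the attack
    log:         time-ordered list of (op_id, writes) pairs, where `writes`
                 maps a component name to the value the operation wrote
    tainted_ops: set of operation ids identified as tainted
    """
    state = dict(checkpoint)
    for op_id, writes in log:
        if op_id in tainted_ops:
            continue  # drop the bad "memory"
        state.update(writes)  # keep the good "memory"
    return state


checkpoint = {"x": "clean", "f1": 0, "f2": 0}
log = [
    (1, {"x": "poisoned"}),        # the attack
    (2, {"f2": 7}),                # good work
    (3, {"f1": "derived-from-x"}), # infected by reading x
    (4, {"f2": 8}),                # good work
]

repaired = forward_correct(checkpoint, log, tainted_ops={1, 3})
```

Unlike a plain rollback to the checkpoint, the repaired state keeps the result of the good operations (`f2` ends at 8) while `x` and `f1` return to their clean values.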

  31. Questions?
