
VERNIER: Virtualized Execution Realizing Network Infrastructures Enhancing Reliability

Project Overview

July 2006



Background

  • Commercial-off-the-shelf (COTS) software

    • Large organizations, including DoD, have become dependent on it

    • Yet, most COTS software is not dependable enough for critical applications

      • Security breaches

      • Misconfiguration

      • Bugs

  • Large, homogeneous COTS deployments, such as those in DoD, accentuate the risk, since many users

    • Experience the same failures caused by the same vulnerabilities, configuration errors, and bugs

    • Suffer the same costly, adverse consequences

  • Alternatives, such as government-funded development of high-assurance systems, present significant barriers in

    • Cost

    • Functionality

    • Performance



VERNIER Project Objectives

  • Develop new technologies to deliver the benefits of scaling techniques to large application communities

    • Provide enhanced survivability to the DoD computing infrastructure

    • Enhance the cost, functionality, and performance advantages of COTS computing environments

    • Investigate and develop new technologies aimed at enabling communities of systems running similar, widely available COTS software to perform more robustly in the face of attacks and software faults

  • Deliver a demonstrated, functioning, transition-ready system that implements these new application community (AC) survivability technologies

    • Technical approach: Augmented virtual machine monitor

    • Commercial transition partner: VMware, Inc.



Project Scope

  • Collaborative detection and diagnosis of failures

  • Collaborative response to failures

  • Advanced situational awareness capabilities

    • Collective understanding of community state

    • Predictive capability: Early warning of potential future problems

  • Key goal: turn the size and homogeneity of the user community into an advantage by converting scattered deployments of vulnerable COTS systems into cohesive, survivable application communities that detect, diagnose, and recover from their own failures

  • What COTS?

    • Microsoft Windows, IE, Office suite, and the like



Research Challenges

  • Extracting behavioral models from binary programs

    • Breakthrough novel techniques required

    • Quasi-static state analysis for black-box binaries

  • Scaled information sharing

    • Networked application communities sharing knowledge about the software they run

  • Intelligent, comprehensive recovery

  • Predictive situational awareness

    • Automatic, easy-to-understand gauges



Breakthrough Capabilities



Expected Results and Impact

  • COTS Product (VMware) with breakthrough capabilities for application communities

  • Scalability to 100K nodes running augmented VMware and custom Vernier software

  • Automatic collaborative failure diagnosis and recovery

  • Survivable robust system

  • Community-aware solution



VERNIER Team

  • SRI International, Menlo Park, CA

    • Patrick Lincoln, Principal Investigator

    • Steve Dawson, Project manager; integration

    • Linda Briesemeister, Knowledge sharing; collaborative response

    • Hassen Saidi, Learning-based diagnosis; code analysis; situation awareness

  • Stanford University

    • John Mitchell, Stanford PI; code analysis; host-based detection and response

    • Dan Boneh, Knowledge sharing protocols

    • Mendel Rosenblum, VMM infrastructure; collaborative response; transition liaison

    • Alex Aiken, Quasi-static binary analysis

    • Liz Stinson, Botswat; system security

  • Palo Alto Research Center (PARC)

    • Jim Thornton, PARC PI; configuration monitoring and response; situation awareness

    • Dirk Balfanz, Community response management

    • Glenn Durfee, Configuration monitoring and response; situation awareness

  • Technology transition partner: VMware, Inc.



VERNIER Technical Approach



Notional Host System Architecture



An Abstraction-Based Diagnosis Capability for VERNIER



Objectives

Based on the general principle: "Much of security amounts to making sure that an application does what it is supposed to do... and nothing else!"

  • Build models of application behavior (what the application is supposed to do).

  • Monitor application behavior and report malfunctions and unintended behaviors (deviations from the expected behavior).

  • Use the recorded execution traces as raw data for a set of abstraction-based diagnosis engines (why did the deviation from the intended behavior occur? ... to the extent that we can do a good job of answering such a question).

  • Share the state of alerts and diagnoses among the nodes of the community (sharing the bad news... but also the good!).

  • Aggregate the diagnosis outputs and the alerts into a situation awareness gauge.



Approach

We combine a set of well-known and well-established techniques:

  • Building increasingly accurate models of application behavior:

    • Static analysis combined with predicate abstraction to build Dyck and CFG models used for static analysis-based intrusion detection

  • Implementing mechanisms for monitoring sequences of states and actions of an application for the following purposes (a signature-check sketch follows this list):

    • Check whether a known bad sequence is executed (signature-based!)

    • Check for previously unknown variations of known bad sequences (correlation!)

    • Find root causes of unexpected malfunctions and malicious exploits (diagnosis)

  • Diagnosis is performed using techniques borrowed from

    • Delta debugging (root-cause diagnosis)

    • Anomaly detection (correlation)

  • The situation awareness gauge is implemented as a platform-independent web interface
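
The signature check mentioned above can be made concrete with a small sketch. The following is illustrative only, not VERNIER code: it tests whether a monitored call trace contains a known bad call sequence as an ordered subsequence, and the call and signature names are hypothetical.

```python
from typing import Sequence

def contains_bad_signature(trace: Sequence[str], signature: Sequence[str]) -> bool:
    """True if `signature` occurs in `trace` as an ordered (not necessarily
    contiguous) subsequence of monitored system/method calls."""
    remaining = iter(trace)
    # `call in remaining` advances the iterator, so order is enforced.
    return all(call in remaining for call in signature)

# Hypothetical trace and known-bad pattern:
trace = ["NtOpenFile", "NtReadFile", "connect", "send", "NtClose"]
bad_signature = ["NtReadFile", "connect", "send"]
print(contains_bad_signature(trace, bad_signature))  # True -> raise an alert
```

The correlation check for variations of known bad sequences would relax this exact-match requirement (for example, by allowing bounded edits), which is beyond this sketch.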



Monitoring-Based Diagnosis

  • We combine these techniques into two phases:

    • Monitoring: Applications are monitored and sequences of executions along with configurations are stored.

    • Diagnosis: Differences between good runs and bad runs are the first clues used for diagnosis

  • Traces of executions are sequences of:

    • System calls

    • Method calls

    • Changes in configurations

  • The more information is stored, the better the chance that malfunctions and malicious behaviors are properly diagnosed (the good-run/bad-run comparison is sketched below)
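
A minimal sketch of that good-run/bad-run comparison, assuming traces are simply lists of recorded events (system calls, method calls, configuration changes); the event strings below are made up for illustration.

```python
def trace_delta(good_runs, bad_runs):
    """First diagnostic clue: events seen only in failing runs (prime
    suspects) and events seen only in good runs (missing expected behavior)."""
    good_events = set().union(*good_runs) if good_runs else set()
    bad_events = set().union(*bad_runs) if bad_runs else set()
    return {"only_in_bad": bad_events - good_events,
            "only_in_good": good_events - bad_events}

# Hypothetical stored traces:
good = [["open(config)", "read(config)", "render()"]]
bad = [["open(config)", "read(config)", "write(registry Run key)", "render()"]]
print(trace_delta(good, bad)["only_in_bad"])  # {'write(registry Run key)'}
```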



Quasi-static binary analysis and predicate abstraction-based intrusion detection

  • Use static analysis to recover the control flow graph of the application.

    • The CFG is generated by compilers for source code.

    • Recover the class hierarchy for object code of OO applications.

  • Build a pushdown system: a model that represents an over-approximation of the sequences of method and system calls of the application (a simplified sketch follows this list).

    • Deal with context sensitivity to match exit calls to return locations.

  • Use predicate abstraction and data flow analysis to refine the pushdown system and obtain a more accurate model.

    • Improve the knowledge about arguments to monitored calls.
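
As a rough illustration of how such a model can be enforced at run time, the sketch below uses a plain finite transition relation as a stand-in for the pushdown system, so there is no call/return stack and no predicate refinement; the states and call names are invented.

```python
# Stand-in for the recovered model: abstract program locations and the
# observable calls allowed between them. A real pushdown system would also
# track the call stack to match call sites with return locations.
TRANSITIONS = {
    ("entry", "CreateFile"): "file_open",
    ("file_open", "ReadFile"): "file_open",
    ("file_open", "CloseHandle"): "entry",
}

def accepted_by_model(calls, start="entry"):
    """True if the observed call sequence stays within the model, i.e. within
    the over-approximation of allowed behavior; False signals a deviation."""
    state = start
    for call in calls:
        state = TRANSITIONS.get((state, call))
        if state is None:
            return False  # deviation from the model: raise an alert
    return True

print(accepted_by_model(["CreateFile", "ReadFile", "CloseHandle"]))  # True
print(accepted_by_model(["CreateFile", "connect"]))                  # False
```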



Better Models and Better Monitoring

We are not just interested in detecting intrusions, but also in generating high-level explanations of why an application deviates from its intended behavior.

  • CFG and Dyck models are over-approximations of the application's behavior (potential attacks are discovered only when the application's behavior deviates from the model).

  • We will use the runs of the application to generate under-approximations of the application's behavior!

  • Alternatively, every model representing an over-approximation has a dual that represents an under-approximation (over- and under-approximations don't have to be the same type of model!).

  • We will combine over- and under-approximations to reduce the risk of missing possible attacks.

  • We will refine the over- and under-approximations to improve the application model.


Combining Over- and Under-Approximations

  • Over-approximation: constructed by static analysis

  • Under-approximation: constructed from observed runs

  • Behavior within the under-approximation is safe

  • Behavior outside the over-approximation is unsafe

  • Behavior in between is suspicious and is the source of diagnosis (sketched below)
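
A sketch of the three-way verdict described above, treating the two models abstractly as membership predicates over traces; the example under-approximation is a set of observed good traces and the over-approximation is a set of statically allowed call sequences, all of them invented for illustration.

```python
def classify(trace, in_under_approx, in_over_approx):
    """Map a trace to the three regions described above."""
    if in_under_approx(trace):
        return "safe"        # inside the under-approximation (seen in good runs)
    if in_over_approx(trace):
        return "suspicious"  # allowed by the static model but never observed:
                             # this gap is the input to diagnosis
    return "unsafe"          # outside the over-approximation: treat as an attack

# Hypothetical models:
observed_good = {("CreateFile", "ReadFile", "CloseHandle")}
statically_allowed = {("CreateFile", "ReadFile", "CloseHandle"),
                      ("CreateFile", "ReadFile"),
                      ("CreateFile", "CloseHandle")}

print(classify(("CreateFile", "ReadFile"),
               observed_good.__contains__,
               statically_allowed.__contains__))   # "suspicious"
print(classify(("CreateFile", "connect"),
               observed_good.__contains__,
               statically_allowed.__contains__))   # "unsafe"
```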



What if we don’t have a model of the application?

  • We can monitor the application as a black box and intercept system calls:

    • Learn a model of good behaviors

    • Learn a model of bad behaviors

  • Anomalies are differences between good and bad behaviors

  • Borrow from delta-debugging techniques to find the root causes of misbehaviors (a simplified sketch follows)
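
For the delta-debugging bullet, here is a compressed sketch of Zeller-style ddmin minimization: it shrinks a failing trace while a caller-supplied oracle still reports the failure. The oracle and event names are hypothetical, and a production implementation would add caching and more careful termination handling.

```python
def ddmin(failing, fails, granularity=2):
    """Simplified delta-debugging minimization: return a (locally) minimal
    subsequence of `failing` for which the oracle `fails` still reports the
    failure. Trimmed-down version of Zeller's ddmin for illustration."""
    while len(failing) >= 2:
        chunk = max(1, len(failing) // granularity)
        subsets = [failing[i:i + chunk] for i in range(0, len(failing), chunk)]
        reduced = False
        for i in range(len(subsets)):
            complement = sum(subsets[:i] + subsets[i + 1:], [])
            if fails(complement):            # failure persists without this chunk
                failing = complement
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(failing):
                break
            granularity = min(len(failing), granularity * 2)
    return failing

# Hypothetical oracle: the run fails whenever the trace contains both events.
fails = lambda trace: "load(plugin.dll)" in trace and "write(autorun)" in trace
print(ddmin(["open(a)", "load(plugin.dll)", "read(b)", "write(autorun)"], fails))
# -> ['load(plugin.dll)', 'write(autorun)']
```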



Configuration-based Detection, Diagnosis, Recovery, and Situational Awareness



Importance of Configuration

  • Static configuration state highly correlated with system behavior

    • Many attacks/bugs/errors introduced by way of a substantive change to configuration

      “A central problem in system administration is the construction of a secure and scalable scheme for maintaining configuration integrity of a computer system over the short term, while allowing configuration to evolve gradually over the long term” – Mark Burgess, author of cfengine


AC Opportunity

  • Leverage the scale of the population to learn which states in configuration space are bad

[Chart: reliability vs. adaptability; the goal ("want to be here") is high reliability and high adaptability]

Today: Every configuration change is an uncontrolled experiment

AC Future: Configuration changes managed as controlled, reversible trials



Live Monitoring of Configuration State

  • State analysis

    • Comparative diagnosis

    • Vulnerability assessment

    • Clustering similar nodes and contextualizing observations

  • Detect change events

    • Cluster low-level changes into transactions (a sketch follows this list)

    • Log events for problem detection, mitigation and user interaction

    • Share events in real-time for situational awareness

  • Active learning

    • Automated experiments to isolate root causes

    • Managed testing of official changes like patch installation
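
One plausible reading of "cluster low-level changes into transactions" is time-based grouping, sketched below; the two-second gap threshold and the (timestamp, description) event shape are assumptions, not the VERNIER design.

```python
def cluster_into_transactions(events, gap_seconds=2.0):
    """Group low-level change events (registry writes, file modifications)
    into coarser transactions whenever consecutive events are separated by
    less than `gap_seconds`. Events are (timestamp, description) pairs."""
    transactions, current = [], []
    last_ts = None
    for ts, change in sorted(events):
        if last_ts is not None and ts - last_ts > gap_seconds:
            transactions.append(current)
            current = []
        current.append(change)
        last_ts = ts
    if current:
        transactions.append(current)
    return transactions

# Hypothetical event stream captured by the in-guest agent:
events = [(0.0, "registry: add Run key entry updater.exe"),
          (0.4, "file: create updater.exe"),
          (35.0, "file: modify report.doc")]
print(len(cluster_into_transactions(events)))  # 2 transactions
```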



    Live Control of Configuration State

    • Modification for Reversibility and Experimentation

      • Coarse-grained: VM rollback

      • Medium-grained: Installer/Uninstaller activation

      • Fine-grained: Direct manipulation of low-level state elements (sketched after this list)

    • Prevention

      • In-progress detection of changes

      • Interruption of change sequence

      • Reversal of partial effects
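
A toy illustration of fine-grained reversibility: record the prior value of a configuration element before changing it so the change can be undone later. The ReversibleChange class and the dict-based store are hypothetical stand-ins, not the VERNIER agent API.

```python
import copy

class ReversibleChange:
    """Capture the prior value of a configuration key before modifying it,
    so the change can be rolled back if it is later diagnosed as bad
    (coarser options: installer/uninstaller activation, VM rollback)."""
    def __init__(self, store, key, new_value):
        self.store, self.key, self.new_value = store, key, new_value
        self.old_value = copy.deepcopy(store.get(key))

    def apply(self):
        self.store[self.key] = self.new_value

    def revert(self):
        if self.old_value is None:
            self.store.pop(self.key, None)
        else:
            self.store[self.key] = self.old_value

# Illustrative use on an in-memory stand-in for low-level config state:
config = {"proxy": "none"}
change = ReversibleChange(config, "proxy", "10.0.0.1:8080")
change.apply()
change.revert()
print(config)  # {'proxy': 'none'}
```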



    Identifying Badness

    • Objective Deterministic Criteria

      • Rootkit detection from structural features

      • Published attack signatures

    • Objective Heuristic Criteria

      • Performance outside of normal parameters

    • Subjective End-User Report

      • Dialog with user to gather info, e.g. temporal data for failure appearance

    • Administrative Policy

      • Rules specified by administrators within community


    Local Components

    [Architecture diagram. On each physical host, a VMM (VM Kernel) runs: the COTS App VM and an Experimental VM, each an App OS with applications (App 1, App 2) and an in-guest Agent; a VERNIER VM with Console (UI), Comm, and Diag components on the VERNIER OS Base; and a VERNIER Monitor/Control layer. Numbered interfaces: (1) VERNIER-Agent, (2) VERNIER-VMM, (3) VERNIER-Community.]


    Key Interfaces

    • 1. VERNIER-Agent (TCP/IP, XML?)

      • Registry change events

      • Filesystem change events

      • Install events

      • Manipulate registry

      • Manipulate filesystem

      • Control System Restore

    • 2. VERNIER-VMM (?); see the sketch after this list

      • Suspend, Resume, Checkpoint, Revert, Clone, Reset

      • Lock memory, Process events

      • Read memory, Read/write disk

    • 3. VERNIER-Community (?)

      • Cluster management

      • Experience reports: Unknown, Prevalent, Known Bad, Presumed Good

      • State exchange

      • Experiment request/response
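
To make interface (2) concrete, here is one possible rendering of the listed VERNIER-VMM operations as an abstract Python interface. The operation names come from the slide; the signatures, types, and the treatment of "Process events" as an event query are assumptions.

```python
from abc import ABC, abstractmethod
from typing import List

class VernierVMMInterface(ABC):
    """Hypothetical shape of interface (2): VERNIER Monitor/Control to VMM."""

    @abstractmethod
    def suspend(self, vm_id: str) -> None: ...

    @abstractmethod
    def resume(self, vm_id: str) -> None: ...

    @abstractmethod
    def checkpoint(self, vm_id: str) -> str:
        """Snapshot the VM and return a snapshot identifier."""

    @abstractmethod
    def revert(self, vm_id: str, snapshot_id: str) -> None: ...

    @abstractmethod
    def clone(self, vm_id: str) -> str:
        """Clone the VM (e.g. to create an Experimental VM); return its id."""

    @abstractmethod
    def reset(self, vm_id: str) -> None: ...

    @abstractmethod
    def lock_memory(self, vm_id: str, start: int, length: int) -> None: ...

    @abstractmethod
    def process_events(self, vm_id: str) -> List[dict]:
        """Return recent guest process events observed by the VMM."""

    @abstractmethod
    def read_memory(self, vm_id: str, start: int, length: int) -> bytes: ...

    @abstractmethod
    def read_disk(self, vm_id: str, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write_disk(self, vm_id: str, offset: int, data: bytes) -> None: ...
```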


    Local Functions

    [Block diagram of per-host functions: detectors (Config Change Detector, Network Event Detector, Behavior Event Detector); event sources (Agent inside the guest, Event Stream, Network Tap, VMM, Firewall); Analysis & Diagnosis (Configuration Analysis, Behavior Analysis, Traffic Analysis); Response Controller; Console; Communication Manager (link to the Community); and a Local DB holding local condition detail, event logs, labeled condition signatures, state snapshots, and experimental data.]



    Adapting and Extending Host-based, Run-time Win32 Bot Detection for VERNIER



    Exploit botnet characteristic: ongoing command and control

    • Network-based approaches:

      • Filtering (protocol, port, host, content-based)

      • Look for traffic patterns (e.g. DynDNS – Dagon)

      • Hard to do well (bots can encrypt traffic or permute it to look like ‘normal’ traffic, …); bot writers control the arena

    • Host-based approaches:

      • Ours: we have more information at the host level.

        Since the bot is controlled externally, use this meta-level behavioral signature as the basis of detection



    Our approach

    • Look at the syscalls made by a program

      • In particular, at certain of their arguments (our sinks)

    • Possible sources for these sinks:

      • local: { mouse, keyboard, file I/O, … }

      • remote: { network I/O }

    • An instance of external control occurs when data from a remote source reaches a sink (a toy sketch follows this list)

    • This works surprisingly well: for all bots tested (ago, dsnx, evil, g-sys, sd, spy), every command that exhibited external control was detected
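
A toy sketch of the cause-and-effect flavor of this check: values read from the network are tagged, and any tagged value that reaches a sink argument is flagged as external control. Real taint tracking must also propagate tags through copies and transformations, which this sketch skips; all names and the bot command are made up.

```python
class Tainted(str):
    """A string whose bytes originated in network (remote-source) input."""

def recv_from_network(payload: str) -> Tainted:
    return Tainted(payload)                # source: tag remote input

def check_sink(syscall: str, arg) -> bool:
    """Sink check: flag externally controlled arguments to selected syscalls."""
    if isinstance(arg, Tainted):
        print(f"ALERT: externally controlled argument to {syscall}: {arg!r}")
        return True
    return False

cmd = recv_from_network(".download http://bad.example/payload.exe")
check_sink("URLDownloadToFile", cmd)              # flagged: remote data in a sink
check_sink("CreateFile", "C:\\local\\notes.txt")  # local data: not flagged
```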



    Big picture



    Design



    Two modes

    • Cause-and-effect semantics:

      • Tight relationship between receipt of some data over the network and subsequent use of some portion of that data in a sink

    • Correlative semantics: looser relationship

      • Use of some data that is the same as some data received over the network

      • Why necessary?



    Behaviors: ideally disjoint; at the lowest level in the call stack



    Correlative semantics

    • Why is this necessary?

    • Bots with C library functions statically linked in are, in effect, making unconstrained out-of-band (OOB) copies

    • In general, almost as good as cause-and-effect semantics (static vs. dynamic linking); a sketch of the correlative check follows this list

      • Exceptions: commands that format received parameters (e.g., via sprintf)
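
A minimal sketch of the correlative check: instead of tracking data flow, compare sink arguments against data recently received over the network. The window size and minimum match length are arbitrary assumptions, and (as the exception above notes) arguments reformatted via sprintf-style calls would evade a plain substring match.

```python
from collections import deque

class CorrelativeMonitor:
    """Looser 'correlative' check: flag a sink argument that contains a
    sufficiently long substring of recently received network data, even when
    no explicit data flow was tracked (e.g. a statically linked C library
    copied the bytes out of band)."""
    def __init__(self, window=32, min_match=8):
        self.recent = deque(maxlen=window)   # recently received messages
        self.min_match = min_match

    def on_network_receive(self, data: str) -> None:
        self.recent.append(data)

    def on_sink(self, syscall: str, arg: str) -> bool:
        for msg in self.recent:
            for i in range(len(msg) - self.min_match + 1):
                if msg[i:i + self.min_match] in arg:
                    print(f"ALERT: {syscall} argument correlates with network input")
                    return True
        return False

mon = CorrelativeMonitor()
mon.on_network_receive(".download http://bad.example/payload.exe")
mon.on_sink("URLDownloadToFile", "http://bad.example/payload.exe")  # flagged
```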



    Benign program testing

    • Tested against some benign programs that interact with the network

      • Firefox, mIRC, Unreal IRCd

    • 3 contextual false positives

      • IRCd: sent on X heard on Y

      • Firefox: dereferencing embedded links

    • Artificial false positives: quite a few

      • mIRC: DCC capabilities

      • Firefox: saving contents to a file, …



    False positives

    • contextual false positives – not present in bots

      • The external-control heuristic triggered correctly, but these actions under these circumstances are widely accepted as non-malicious

    • artificial false positives – not present in bots

      • The definition of external control implies no user input agreeing to the particular behavior

      • But we don’t track “explicitly clean” data (that received via keyboard or mouse)

    • spurious false positives

      • any other incorrect flagging of external control



    Our mechanism — review

    • Single behavioral meta-signature detects wide variety of behaviors on majority of Win32 bots

      • Resilient to differences in implementation

    • Resilient in face of unconstrained OOB copies

    • Resilient to encryption – w/some constraints

    • Resilient to changes in command-and-control protocol (e.g. from IRC to HTTP) and parameters (e.g. for rendezvous point)



    Knowledge Sharing in VERNIER



    Knowledge Sharing

    • Need: Communication is the core concept of a community

      • Application communities rely on the ability to share knowledge: reliable, efficient, authentic, and secure

    • Approach: two-tier peer-to-peer platform

      • Tuple space (ala Linda)

      • Considering JavaSpaces implementation of tuple spaces

      • Two-tier for better scalability

        • If needed, hypercube hashtable index (ala Obreiter and Graf)

    • Benefits: Reliable, efficient (local) knowledge sharing

    • Competition: Other possible methods for knowledge sharing include explicit messaging, centralized database, and statically indexed knowledge structures.

      • Other approaches lack scalability, are unreliable, and can be difficult to secure



    Knowledge Sharing Levels

    • Lower level (within a cluster)

      • Tuple space (ala Linda (Gelernter))

      • Simple queries

        • (*, name, *) returns all records regarding ‘name’ (see the sketch after this list)

      • Concurrent access and update

    • Higher level (supernodes)

      • Nodes aggregate knowledge of an entire cluster

      • Use abstraction to summarize current situation

      • Application-level multicast to push out summaries

      • Supernode pushes all summary updates into local tuple space
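
A toy in-memory tuple space showing the query style from this slide; real Linda/JavaSpaces systems add blocking take/read operations, leases, and transactions, none of which are sketched here. The tuple contents are invented.

```python
WILDCARD = "*"

class TupleSpace:
    """Minimal Linda-style tuple space: write tuples, read by template.
    A '*' field in the template matches any value in that position."""
    def __init__(self):
        self.tuples = []

    def write(self, tup):
        self.tuples.append(tuple(tup))

    def read(self, template):
        def matches(tup):
            return len(tup) == len(template) and all(
                t == WILDCARD or t == f for t, f in zip(template, tup))
        return [t for t in self.tuples if matches(t)]

space = TupleSpace()
space.write(("alert", "iexplore.exe", "suspicious-config-change"))
space.write(("status", "winword.exe", "healthy"))
# The slide's example query (*, name, *): all records regarding a given name.
print(space.read((WILDCARD, "iexplore.exe", WILDCARD)))
```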



    Group Communication

    • Group communication is key

      • For higher level, certain usual assumptions

        • Reliable delivery

        • Ordered message delivery

    • Spread (www.spread.org) as a basis for implementation of group communication

      • Building on Secure Spread and on Progress Software's (progress.com) more secure, reliable, and scalable variants of Spread



    Group Communication Security and Privacy: Secrecy and Authenticity

    • Security and privacy are critical aspects of VERNIER

    • Must authenticate reports and ensure correctness

    • Confidentiality of reports

      • Protecting user privacy (my files, my keystrokes)

      • Protect aspects of applications

      • Protect configuration information

      • Protect vulnerability detection information

    • Community members send status reports to local supernode

    • Reports propagated throughout network



    Group Communication Security

    • Defense against:

      • Network attacks sending forged messages to supernodes

        + PKI (a signing/verification sketch follows this slide)

      • Compromised community member sending false reports

        + Statistical anomaly detection (e.g., EMERALD)

        + Virtualization

        Any report generated within a compromised virtual machine must be consistent with what is observed outside the virtualization layer
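
As a stand-in for the PKI defense against forged messages, the sketch below signs and verifies a status report with Ed25519 keys from the pyca/cryptography package; key distribution, certificates, and the actual VERNIER report format are out of scope, and the report body is made up.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Each community member holds a key pair; the supernode knows the public keys.
member_key = Ed25519PrivateKey.generate()
member_pub = member_key.public_key()

report = b'{"node": "host-17", "event": "config-change", "verdict": "suspicious"}'
signature = member_key.sign(report)

# Supernode side: accept only reports that verify under a known member key,
# which defeats forged messages injected from the network.
try:
    member_pub.verify(signature, report)
    print("report accepted")
except InvalidSignature:
    print("forged or corrupted report rejected")
```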



    Group Communication Security (continued)

    • Secure audit logs

      • Secure log of all P2P status reports

      • Enable post-mortem analysis on detected attacks

      • Cryptographic protection of the log (Boneh, Waters); a tamper-evident hash-chain sketch follows this slide

    • Sanitizing status reports

      • Status reports reveal private information

      • Special encryption enabling reads only by credentialed members and search (as in search over an encrypted database) by the community

    • Mitigating denial of service attacks on supernodes

      • Re-election of supernodes when under attack

    • Securing configuration update messages

      • PKI authenticating legitimate reports from community members
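
A simple hash-chain sketch of the tamper-evident audit-log idea: each entry commits to everything before it, so truncation or in-place edits are detectable post mortem. This is only an illustration; the Boneh-Waters work cited above additionally supports encrypted, searchable logs, which is not shown.

```python
import hashlib
import json

class HashChainedLog:
    """Append-only log in which each entry commits to the previous head, so
    post-mortem analysis can detect truncation or in-place modification."""
    def __init__(self):
        self.entries = []
        self.head = b"\x00" * 32           # genesis value

    def append(self, record: dict) -> None:
        payload = json.dumps(record, sort_keys=True).encode()
        self.head = hashlib.sha256(self.head + payload).digest()
        self.entries.append((payload, self.head))

    def verify(self) -> bool:
        h = b"\x00" * 32
        for payload, stored in self.entries:
            h = hashlib.sha256(h + payload).digest()
            if h != stored:
                return False
        return True

log = HashChainedLog()
log.append({"peer": "node-3", "report": "status-ok"})
log.append({"peer": "node-9", "report": "suspected-bot"})
print(log.verify())  # True; tampering with any stored entry makes this False
```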



    Schedule, Experimentation, and Evaluation



    Schedule and Milestones



    Experimentation and Evaluation

    • Project testbed

      • Network of 300 virtual hosts

        • 30 server-class physical hosts

        • 10 virtual nodes per server

      • Three clusters, one at each participant site

    • Software

      • Host OS: Linux

      • Guest (community) OS: Microsoft Windows

      • Applications: IE browser (possibly others); MS Office

    • Simulations and scalability

      • Financially infeasible to scale to thousands of nodes

      • Plan is to use hybrid simulation to test scalability

        • Real (live) nodes provide actual data

        • Simulated nodes use synthesized data generated by perturbing data collected from real clusters’ supernodes



    Proposed Success Criteria

    • Metrics and targets (team-defined)

      • False positives (FP) / False negatives (FN)

        • Phase 1: FP < 10%, FN < 20%

        • Phase 2: FP < 1%, FN < 2% (order of magnitude improvement)

      • Percent loss of network availability

        • Phase 1: At most 20% per node, with at most 80% over any 500ms interval

        • Phase 2: At most 5% per node, with at most 20% over any 500ms interval

      • Average time to recovery

        • Phase 1: Assuming a fix exists (not a FN), at most 30 minutes to recover the entire community

        • Phase 2: At most 10 minutes

      • Average network and computational overhead

        • No more than 30% slowdown for applications

        • No more than 100 KB/s average VERNIER-induced network traffic per node

      • Percent accuracy of prediction

        • Phase 1: Effects of problems predicted within 15 minutes of onset; set of nodes wrongly predicted (either way) differs by no more than 40% of actual

        • Phase 2: Prediction within 5 minutes; predicted set differs by no more than 20%

