
Cognitive Support for Intelligent Survivability Management



  1. Cognitive Support for Intelligent Survivability Management CSISM TEAM June 21, 2007

  2. Outline • Introduction • Status, results and plans for technical thrusts • Multi-layer reasoning for cyber-defense administration • Knowledge representation and rules for system wide reasoning (OLC) • Fast containment response and policies (ILC) • Improving defense parameters and strategies by learning augmentation • Implementation and Integration • Conclusions

  3. CSISM Introduction and Background

  4. Problem Domain: Self-Regenerative Systems • Cyber-Defense • Survivable systems • Automated… • Self-improving… [Figure: level of service vs. time after the start of a focused attack, comparing undefended, survivable (3rd Gen.), and regenerative systems] • Survivable (3rd Gen.): graceful degradation; adaptive response limited to static use of diversity and policy; event interpretation and response selection by human experts • Regenerative: retain level of service and improve defense; static and dynamic use of artificial diversity; use of wide-area distribution; automated interpretation of observations and response selection, augmented by learning from past experience • Our focus: automated interpretation of observations and response selection.

  5. Cyber-Defense Decision-Making Landscape

  6. Challenges • Goal: Automate the reasoning performed by expert cyber-defense administrators • Effective, reusable, easy to port and retarget • Challenges: • Making sense of low-level information (alerts, observations) to drive low-level defense mechanisms (block, isolate, etc.) such that higher-level objectives (survive, continue to operate) are achieved • Doing it as well as human experts • Additional difficulties • Rapid and real-time decision-making and response • Uncertainty due to incomplete and imperfect information • Widely varying operating conditions (no alerts to 100s of alerts per second) • New symptoms and changes in the adversary’s strategy

  7. Approach: Multi-Layer Reasoning • Multi-perspective, multi-hypothesis deliberation • Keep all options open: delay the bindings • Divide and conquer • Response selection based on current utility as well as potential adversarial counter-responses • A simple “match” is insufficient against an intelligent adversary • Unpredictability to counter gaming • Contain while deliberating • Buy time • Learning-based dynamic modification of defense parameters and strategies • “Immunity” against repeats and variants [Diagram: OLC Interpret/Select Response loop, with the ILC and Learning components]

  8. Knowledge Representation and Rules for System-wide Reasoning

  9. Objectives • Represent knowledge of cyber-defense • Allow reasoning about attack and defense, including look-ahead • Automate most reasoning • Encode enough detail to estimate relative goodness of alternatives in most situations • Extract knowledge from Red Team encounters; attempt to generalize • Separate generic, reusable, knowledge from system-specific

  10. Achievements • Classification of knowledge • Classification of reasoning • Breadth-first: • Relationship between alerts, accusations, corruption, flooding, failures • Instantiate for DPASA • Depth-first: • DPASA registration protocol • Run 6, Nov 2005 Red Team exercise • Encode knowledge and reasoning • 1st-order logic prototype • Soar rules and data • Representing concepts, instances and relations– use of a common ontology (Adventium’s Netbase)

  11. Kinds of Knowledge • Symptomatic: possible explanations for a given anomalous event • Both generic and system-specific • Relational: constraints that reinforce or eliminate possible explanations • Mostly system-specific • Teleological: possible attacker goals and actions that may be used to accomplish the goals • Mostly generic • Reactive: possible defensive countermeasures for a given attack • Both generic and system-specific Focus so far has been on symptomatic, relational, and reactive knowledge.

  12. Kinds of Reasoning • Restrictive • From observations of past events and knowledge of system properties, deduce good explanations and good defensive responses • (the reasoning restricts what is possible) • Predictive • Look ahead, comparing alternatives Focus so far has been on restrictive reasoning.

  13. Example from Run 11, Nov 2005 [Diagram: Server 2 (Windows) and Server 3 (Solaris) each accuse Server 1 (Linux) of violating a protocol; Server 4 (Linux) is also shown] Reasoning: Under the most likely assumption, no common-mode failure and exploit of at most one OS, Servers 2 and 3 can’t both be lying, so Server 1 must be corrupt. It’s not restartable, so quarantine it. Note that no information source is completely trusted.
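
As an illustration of the restrictive reasoning above, here is a minimal Java sketch of the accusation-corroboration step. The class, record, and method names are hypothetical; the actual OLC encodes this logic as Soar rules (and, for prototyping, in prover9).

```java
import java.util.*;

/** Minimal sketch of the Run 11 restrictive-reasoning pattern (hypothetical names). */
public class AccusationReasoner {

    /** One server accuses another of violating a protocol. */
    record Accusation(String accuser, String accused) {}

    /**
     * Assumptions: no common-mode failure and exploit of at most one OS.
     * If a host is accused by servers running two different OSes, the
     * accusers cannot both be lying, so the accused host is judged corrupt.
     */
    static Optional<String> findCorruptHost(List<Accusation> accusations,
                                            Map<String, String> osOf) {
        Map<String, Set<String>> accuserOsByTarget = new HashMap<>();
        for (Accusation a : accusations) {
            accuserOsByTarget.computeIfAbsent(a.accused(), k -> new HashSet<>())
                             .add(osOf.get(a.accuser()));
        }
        for (var e : accuserOsByTarget.entrySet()) {
            if (e.getValue().size() >= 2) {   // accusers span at least two OSes
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> osOf = Map.of("Server1", "Linux", "Server2", "Windows",
                                          "Server3", "Solaris", "Server4", "Linux");
        List<Accusation> accusations = List.of(new Accusation("Server2", "Server1"),
                                               new Accusation("Server3", "Server1"));
        // Prints Server1; it is not restartable, so the response is to quarantine it.
        findCorruptHost(accusations, osOf).ifPresent(System.out::println);
    }
}
```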

  14. (Simplified) Example from Run 6, Nov 2005 [Diagram: Monitors 1-4 each report communication from Client 1 but accuse Client 2, on the Client2 LAN, of not delivering heartbeats] Reasoning: All 4 monitors claim to have received communication from one client but accuse another client of not delivering heartbeats. They can’t all be lying. The communication path for some must be OK, so either Client 2 or its LAN is bad. Ping Client2 to determine which.
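
The disambiguation step (an active probe while the hypotheses stay open) could be sketched as follows; the names are hypothetical and the real decision is made by OLC rules.

```java
/** Sketch of the Run 6 disambiguation step (hypothetical names). */
public class HeartbeatDiagnosis {

    enum Hypothesis { CLIENT2_BAD, CLIENT2_LAN_BAD }

    /**
     * All monitors received traffic from Client 1 but none saw heartbeats from
     * Client 2, so either Client 2 or its LAN is bad. A ping over the same LAN
     * separates the two hypotheses: an answer means the LAN path works and the
     * fault is pinned on the client; no answer points at the LAN (or the
     * client's connection to it).
     */
    static Hypothesis diagnose(boolean client2AnswersPing) {
        return client2AnswersPing ? Hypothesis.CLIENT2_BAD : Hypothesis.CLIENT2_LAN_BAD;
    }
}
```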

  15. OLC Reasoning Flow

  16. Rapid Prototyping • Use automatic theorem prover • “prover9”, McCune, UNM • 1st order • encode restrictive reasoning • Advantage over Soar: • Existing algorithm for deep reasoning • Easier to get started • Disadvantages compared to Soar: • Goals are not selected automatically • Reasoning algorithm can’t be controlled • Non-1st-order reasoning not available
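
For example, the Run 11 constraints might be encoded in first-order form roughly as follows (our illustrative notation, not the exact prover9 input):

$$\forall x\,\forall y\;\big(\mathit{accuses}(x,y)\rightarrow \mathit{corrupt}(x)\lor \mathit{corrupt}(y)\big)$$
$$\forall x\,\forall y\;\big(\mathit{corrupt}(x)\land \mathit{corrupt}(y)\land x\neq y\rightarrow \mathit{os}(x)=\mathit{os}(y)\big)$$

Given accuses(Server2, Server1), accuses(Server3, Server1), and os(Server2) ≠ os(Server3), the only consistent conclusion is corrupt(Server1).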

  17. Encoding in Soar • Soar is based on more than 20 years of research into human cognition. It uses pattern-directed inference and hierarchical control to reason in a manner similar to human thinking. • The OLC inference engine will use coherence theory to search for a set of hypotheses that is maximally consistent with the observations and with its experience; we anticipated the need, but our implementation has not yet faced a situation that requires it. • Managing the complexity of knowledge acquisition: • Use of Herbal to generate Soar rules from a higher-level representation • Use of a standard ontology and Protégé

  18. Conclusion and Next Steps • A good start: • Knowledge and reasoning sufficient for defense of DPASA in some Red Team exercises, e.g., run 6 • Rough estimate of coverage: • Existing rules would reason about all alerts and defend successfully in roughly half of Nov 2005 runs in which human operators also defended successfully • 2nd half will be harder • Needed now: • Immediately: rules for flooding; redundant groups; phases of mission • Soon: attacker objectives in larger-scale attacks

  19. Fast Containment Response and Policies

  20. Inner Loop Controller (ILC) Objectives Attempt to contain and correct the problem at the earliest stage possible • Policy-driven: implement policies and tactics from the OLC on a single host • Autonomous: high-speed response can work when disconnected from the OLC by an attack or failure • Flexible: policies can be updated at any time • Adaptive: use learned characteristics of the host and monitored services to tune the policy • Low impact on mission: able to back out of defensive decisions when warranted

  21. Survey of ILC Work • Requirements • The threat model, Performance, Range of sensing and response, OLC communications • Design • Study typical applications and recovery needs • Policies • First Prototype • Dynamically configurable rule-based policies • Plans for Integration and Testing • With the testbed emulating the DPASA survivable JBI • As a stand-alone program on real host

  22. ILC Prototype-1 Architecture • Java Driver Program • Instantiate reasoning components, start load • System API • OLC Communications • Sensing and Response • Jess Inference Engine • Policy Modules • For each application and service monitored [Diagram: Java driver on top of the System API (Java+Jess) and the Jess rule engine, with policy modules A-D and saved state files holding Jess facts and rules]
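
A minimal sketch of how such a Java driver might bring up the Jess engine and load one policy module. The policy file path and the alert fact are hypothetical (the fact template would be defined in the loaded module); only standard Jess API calls (jess.Rete, batch, reset, assertString, run) are used.

```java
import jess.Rete;
import jess.JessException;

/**
 * Minimal sketch of an ILC-style Java driver (hypothetical file and fact
 * names; not the actual CSISM code). It instantiates the Jess rule engine,
 * loads a per-service policy module, and feeds it one observation.
 */
public class IlcDriverSketch {
    public static void main(String[] args) throws JessException {
        Rete engine = new Rete();

        // Load a rule-based policy module for one monitored defense mechanism
        // (e.g., SELinux enforcement), expressed as Jess facts and rules.
        engine.batch("policies/selinux-policy.clp");   // hypothetical path
        engine.reset();

        // Assert a sensed event; detection rules may create a problem
        // instance and response rules may fire a containment action.
        engine.assertString("(alert (service selinux) (type enforcement-off))");
        engine.run();
    }
}
```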

  23. Components of ILC Response Internal objects used in implementing ILC responses. [Diagram: the Detection API reports status and settings of a monitored service S; detection rules for S turn evidence E into a problem instance P of a known problem type; problem types and response policies drive the Response API, with internal timers]
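
A rough sketch of what these internal objects might look like in the ILC's Java/Jess setting (field names and shapes are our assumptions, not the actual prototype):

```java
import java.time.Instant;
import java.util.*;

/** Evidence reported by detection rules about a monitored service S. */
class Evidence {
    final String service;      // the monitored service S
    final String observation;  // e.g., a status or setting that looked anomalous
    final Instant when;
    Evidence(String service, String observation, Instant when) {
        this.service = service; this.observation = observation; this.when = when;
    }
}

/** A problem instance P groups evidence under one problem type. */
class ProblemInstance {
    final String problemType;                      // keys into the response policies
    final List<Evidence> evidence = new ArrayList<>();
    Instant timerExpiry;                           // internal timer (escalate or back out)
    ProblemInstance(String problemType) { this.problemType = problemType; }
}

/** Maps problem types to ordered response actions issued via the Response API. */
class ResponsePolicy {
    final Map<String, List<String>> actionsByProblemType = new HashMap<>();
}
```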

  24. ILC Status – June 2007 • Requirements and design for ILC • Working Java Driver • Initializes Jess inference engine • Remote access to ILC for policy manipulation or remote debugging • Preliminary System API modules for • ILC embedded in emulated test environment • Standalone ILC for Linux host • Initial ties with learning/adaptation module • Sample policy modules • for SELinux, EFWAgent (Typical defense mechanisms)

  25. Next Steps • Integration with emulated test environment • Flesh out API, make compatible with ontology • Explore interactions with OLC, e.g. strategies involving dynamic ILC policy changes • Complete ties to the learning module • More sample application policies • Explore broader range of behaviors, e.g. nondeterminism • Standalone Testing • Install ILC on workstation and/or server and monitor live applications/services • Probe ILC response under failures and attacks

  26. Improving Defense Parameters and Strategies

  27. Learning Augmentation: Motivation • Why learning? • Extremely difficult to capture all the complexities of the system, particularly interactions among activities • The system is dynamic (static configuration gets out of date) • CSISM will learn to • improve the defensive posture • better knowledge (about the attacks or attacker), better policies • improve how the system responds to symptoms • better connection between response actions and their triggers Adaptation is the key to survival

  28. Development Plan for Learning in CSISM • Responses under normal conditions (Calibration) • Situation-dependent responses under attack conditions • Multi-stage attacks

  29. Analysis: RegTime by Quad [Plot of registration times by quad] Quads 0 and 1 are slower than Quads 2 and 3. Complex domain: human calibration (incorrectly) claimed that Quad 1 was slowest, missing Quad 0.

  30. Analysis: Registration Times by Client Type [Plot of registration times by client type] caf_plan, chem_haz, and maf_plan are slower than other clients. Complex domain: human calibration (incorrectly) claimed that caf_plan and maf_plan were slowest because of the hand-typed password, missing chem_haz.

  31. Step 1: Calibration • Calibrate the parameters of rules for normal operating conditions • Important first step because it learns how to respond to normal conditions • Initially, timing parameters from the ILC, e.g. • Client registration, PSQ server local probes, SELinux enforcement, SELinux flapping, file integrity checks • Core challenge, by training approach:
  • Offline training: + good data, + complex environment, - dynamic system
  • Online training: - unknown data, + complex environment, + dynamic system
  • Human: + good data, - complex environment, - dynamic system
  • CSISM’s experimental sandbox: + good data (self-labeled), + complex environment, + dynamic system
  • Very hard for the adversary to “train” the learner! • Sandbox approach successfully tried in SRS phase 1

  32. Step 1: Calibration • Using the algorithm of Last & Kandel • Calculates a membership score for each sample, based on how similar it is to nearby samples (the distance-to-density ratio) • If score < threshold, it is an outlier • It can make estimates even for multi-modal data [Plot: membership score per sample, with a threshold line separating outliers from the rest]
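
A rough, self-contained Java sketch of the distance-to-density idea with a beta parameter (our simplification for illustration, not the exact Last & Kandel formulation; the sample data are made up):

```java
/**
 * Simplified membership scoring: each sample gets a score reflecting how much
 * support it receives from nearby samples; beta controls how quickly support
 * decays with distance. Samples whose score falls below a threshold are outliers.
 */
public class CalibrationSketch {

    /** Membership score of sample x given all observed samples. */
    static double membership(double x, double[] samples, double beta) {
        double support = 0.0;
        for (double s : samples) {
            // Nearby samples contribute close to 1, distant samples close to 0.
            support += Math.exp(-beta * Math.abs(x - s));
        }
        // Normalize so a sample sitting in a dense cluster scores near 1.
        return support / samples.length;
    }

    public static void main(String[] args) {
        double[] regTimesMs = {110, 120, 118, 125, 900};  // made-up registration times
        double beta = 0.005, threshold = 0.5;
        for (double x : regTimesMs) {
            double score = membership(x, regTimesMs, beta);
            System.out.printf("%6.0f ms  score=%.2f  %s%n",
                    x, score, score < threshold ? "OUTLIER" : "ok");
        }
    }
}
```

Because the score is driven by local density rather than a single mean and variance, the same scheme still works when the data are multi-modal.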

  33. Results for CombOps Registration [Plots of membership scores for Beta = 0.001, 0.0025, and 0.005] If the threshold were 0.90, then the x-values inside the green box would be OK.

  34. Results for all Registration times [Plots of membership scores for Beta = 0.0001 and 0.0005]

  35. Results for all Registration times [Plot for Beta = 0.0005] In the demo, you’ll see these two “shoulder” points, indicating upper and lower limits. As more observations are collected, the estimates become more confident of the range of expected values (i.e., tighter estimates to observations).

  36. Status, Development Plan & Future steps • Responses under normal conditions (Calibration) • Analyze DPASA data (done) • Integrate with ILC (single node) (done) • Add experimentation sandbox (single-node) • Calibrate across nodes • Situation-dependent responses under attack conditions • Multi-stage attacks

  37. Implementation and Integration

  38. Objectives and Assumptions • Objectives • CSISM components should be reusable and portable • Maximize genericity, with a clear demarcation between system-specific and generic parts • Standardized representation, generating CSISM internal representations from a higher-level specification • Evaluation framework should be “system scale”, easy to construct, easy to inject attack effects into, easy to interface with • Emulation • Assumptions • Soar can process alerts as fast as they are generated (not to say that the OLC input will not be flooded) • The survivable system ensures that alerts make it to the OLC and Learner • The survivable system ensures the ILC process runs with higher privilege • If the target is not corrupt, the OLC’s command will be executed by the survivable system • Source IP addresses are not spoofed (can be satisfied by the ADF cards) • Challenges Addressed • Standardized representation of concepts, instances and relationships involved in a survivable system • Time handling in reasoning and evaluation • Thread handling in the reasoning engine

  39. Integration Framework

  40. Achievement Summary • OLC • Reasoning about accusations, information flow, and some context and protocol specific situations covering all alerts in half of the DPASA attack runs • A subset of these is exercisable by the emulated testbed, the rest are tested from Soar (apart from rapid prototyping in Prover9) • ILC • Confirmation that reactive response policies for typical defended applications or defense mechanisms can be built from small, reusable rule-based components • Learning Augmentation • Calibration– set up and initial example (e.g., registration time) • Validation framework for CSISM capabilities • Emulation of a subset of ODV survivable JBI implemented, ongoing • Integration • OLC-system under test • Learner-ILC

  41. Next Steps • Challenges/obstacles? • Consistent set of hypotheses • Coherence theory • Plan for next steps in individual tasks • Outlined in earlier sections • Plan for next steps in Integration • KR-work fully integrated with the OLC and system under test • Fuller emulation • ILC- system under test integration • ILC-OLC and Learning-OLC integration • More attack variations and support for red team access • Improved viewport into reasoning and metrics

  42. Conclusion • Good start, gathered momentum • Preliminary results are promising • OLC coverage • ILC feasibility • Learning insights • Cross-project integration potential • Looked into SPDR in more detail • Reasoning about attack plan recognition and OLC bin 3 • ILC and DRED • Same ontological representation • Would like to look into • Other projects, for example: • VICI defense against rootkits to protect the ILC • Other issues (e.g., timeliness) • Of defense • Interference with the timeliness requirements of the system under test • Evaluation vehicle

  43. Backup notes

  44. Enforcement Off (no-enforcement.soar) • Current: • Interpretation: a node reports that process protection is off; we note that self-accusation • Response selection: an enforcement-off self-accusation causes blocking of all ADF NICs on that host • Next step: • Treat the self-accusation generically: many alerts will be “self-accusations” and they will be handled by a single set of rules • Response selection will consider other actions like restarting a process, rebooting a host, blocking the NICs or isolating the LAN
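
A minimal Java sketch of the current interpretation/response-selection rule and the planned generic handling (hypothetical names; the real rules live in no-enforcement.soar):

```java
/** Sketch of enforcement-off handling (hypothetical names). */
public class SelfAccusationSketch {

    enum Response { RESTART_PROCESS, REBOOT_HOST, BLOCK_ADF_NICS, ISOLATE_LAN }

    /** A node reporting about itself (e.g., "process protection is off") is a self-accusation. */
    static boolean isSelfAccusation(String reporter, String subject) {
        return reporter.equals(subject);
    }

    /** Current rule: an enforcement-off self-accusation blocks all ADF NICs on that host. */
    static Response select(String alertType, String reporter, String subject) {
        if (isSelfAccusation(reporter, subject) && alertType.equals("enforcement-off")) {
            return Response.BLOCK_ADF_NICS;
        }
        // Planned generic handling: weigh the other candidate responses
        // (restart the process, reboot the host, isolate the LAN) by utility.
        return Response.RESTART_PROCESS;  // placeholder default for the sketch
    }
}
```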

  45. Registration (callback.soar, prepare-registartion.soar, reboot.soar, gui-up.soar) • Observing that a client is invited sets up an expectation (that the GUI should appear in the future) • If the GUI does not appear, that triggers interpretation (see below) • Current: • An intermediate condition with an ordered prescription for remedies • Reboot the client: it’s a client issue that rebooting may fix • Re-register from another SM: if there is an SM/DC/AP issue, this may solve the problem • If all quads are exhausted, try refreshing the AP refs and reinviting • If there is a reason to suspect a quad, try isolating that SM before the refresh • Future: • Hypotheses that the client or the inviting SM may be bad, or the path may be bad • Restrictive reasoning considering info flow and other incoming events to narrow or eliminate hypotheses • Maximally consistent set of hypotheses • Select response based on utilities (and predictive reasoning)
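
The ordered prescription above could be sketched as a simple escalation ladder (hypothetical names; the actual logic is spread across the listed Soar rules):

```java
/** Sketch of the ordered remediation ladder for a missing client GUI (hypothetical names). */
public class RegistrationRemediation {

    enum Step { REBOOT_CLIENT, REREGISTER_FROM_ANOTHER_SM, REFRESH_AP_REFS_AND_REINVITE }

    /**
     * The expectation "GUI should appear after the invitation" was violated.
     * First reboot the client, then re-register from another SM (one quad at a
     * time), and only refresh the AP refs and re-invite once all quads are
     * exhausted. (If a quad is suspect, its SM would be isolated before the refresh.)
     */
    static Step next(boolean clientRebooted, int quadsTried, int totalQuads) {
        if (!clientRebooted) return Step.REBOOT_CLIENT;
        if (quadsTried < totalQuads) return Step.REREGISTER_FROM_ANOTHER_SM;
        return Step.REFRESH_AP_REFS_AND_REINVITE;
    }
}
```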
