Automatic Trust Management for Adaptive Survivable Systems (ATM for ASS’s) Howard Shrobe, MIT AI Lab; Jon Doyle, MIT Lab for Computer Science
A Motivating Example: Background • In the MIT AI Lab, an ensemble of computers runs a Visual Surveillance and Monitoring application. • On January 12, 2001 several of the machines experience unusual traffic from outside the lab. • Intrusion detection systems report several password scans and other probes. • After about 3 days of varying levels of such activity, things seem to return to normal. • For another 3 weeks no unusual activity is noticed. • Then a crucial machine (Harding) begins to experience unusually high load averages, and the components that run on this machine begin to receive less than the expected quality of service. • The load average, degradation of service, consumption of disk space, and amount of traffic to and from unknown outside machines continue to increase to annoying levels. • Then they level off.
A Motivating Example: The Quandary • On March 2, a high performance machine in the ensemble (Grant) crashes. • The application has been written in a way that allows it to migrate the computations on Grant. • Harding has been behaving oddly and is heavily loaded. • Grant’s computations are critical to the application. • Should the system migrate Grant’s computations to Harding? [Figure: Grant and Harding with computation C1; labels: Load Average, “?”, Potentially Hacked.]
A Motivating Example: Explaining the Decision • The system needed to run the computations somewhere. • Although more loaded than expected, Harding was still the best pool of available resources. • Other machines were even more heavily loaded with other critical computations of the application. • Hackers had correctly guessed a user password on Harding. • They had set up a public FTP site containing pirated software. • They had not, in fact, gained root access. • There was, therefore, no serious worry in migrating critical computations to Harding. [Figure: load averages of Grant, Harding, Thing1, and Thing2, with computation C1; labels: “?”, Potentially Hacked, Hack isn’t relevant.]
A Different Example • The application was being run to protect a US embassy in Africa during a period of international tension. • We had observed a variety of information attacks being aimed at Harding. • At least some of these attacks are of a type known to be effective in gaining root access to a machine like Harding. • They are followed by a period of no anomalous behavior other than a periodic low volume communication with an unknown outside host. • When Grant crashes, should Harding be used as the backup?
The Explanation • It is likely that an intruder has gained root access to Harding. • It is also likely that the intent of the intrusion is malicious and political. • It is less likely, but still possible, that the periodic connection to an outside host is an attempt to contact a control source for a “go signal” that will initiate serious spoofing of the application. • Under these circumstances, it is wiser to shift the computations to a more trusted machine than Harding, even though that machine is more heavily loaded than Harding.
The Core Thesis Survivable systems make careful judgments about the trustworthiness of their computational environment and they make rational resource allocation decisions based on their assessment of trustworthiness.
The Thesis in Detail: Trust Model • It is crucial to estimate to what degree and for what purposes a computational resource may be trusted. • This influences decisions about: • What tasks should be assigned to which resources. • What contingencies should be provided for. • How much effort to spend watching over the resources. • The trust estimate depends on having a model of the possible ways in which a computational resource may be compromised.
The Thesis in Detail: Perpetual Analytic Monitoring • Trust Models depend on having a system for long term monitoring and analysis of the computational infrastructure. • Monitoring must detect complex temporal patterns. • E.g. “a period of attacks followed by quiescence followed by increasing degradation of service” • The monitoring system must assimilate information from: • Self-checking observation points within the application itself • Intrusion detection systems • Firewalls, filtering routers • Other health status indicators
The Thesis in Detail: Adaptive Survivable Systems • The application itself must be capable of self-monitoring and diagnosis • It must know the purposes of its components • It must check that these are achieved • If these purposes are not achieved, it must localize and characterize the failure • The application itself must be capable of adaptation so that it can best achieve its purposes within the available infrastructure. • It must have more than one way to effect each critical computation • It should choose an alternative approach if the first one failed • It should make its initial choices in light of the trust model
The Thesis in Detail: Rational Resource Allocation • This depends on the ability of the application, monitoring, and control systems to engage in rational decision making about what resources they should use to achieve the best balance of expected benefit to risk. • The amount of resources dedicated to monitoring should vary with the threat level • The methods used to achieve computational goals and the location of the computations should vary with the threat • Somewhat compromised systems will sometimes have to be used to achieve a goal • Sometimes doing nothing will be the best choice
The Active Trust Management Architecture [Figure: architecture diagram. Labeled components: Self Adaptive Survivable Systems; Trust Model (Trustworthiness, Compromises, Attacks); Perpetual Analytical Monitoring; Rational Decision Making; Rational Resource Allocation; Trend Templates; System Models & Domain Architecture; Other Information Sources (Intrusion Detectors).]
The Nature of a Trust Model • Trust is a continuous, probabilistic notion • All computational resources must be considered suspect to some degree. • Trust is a dynamic notion • The degree of trustworthiness may change with further compromises • The degree of trustworthiness may change with efforts at amelioration • The degree of trustworthiness may depend on the political situation and the motivation of a potential attacker • Trust is a multidimensional notion • A system may be trusted to deliver a message while not being trusted to preserve its privacy. • A system may be unsafe for one user but relatively safe for another. • The Trust Model must occupy at least three tiers • The trustworthiness of each resource for each specific purpose • The nature of the compromises to the resources • The nature of the attacks on the resources • Most work has only looked at attacks (intrusion detection).
Tiers of a Trust Model • Attack Level: history of “bad” behaviors • penetration, denial of service, unusual access, Flooding • Compromise Level: state of mechanisms that provide: • Privacy: stolen passwords, stolen data, packet snooping • Integrity: parasitized, changed data, changed code • Authentication: changed keys, stolen keys • Non-repudiation: compromised keys, compromised algorithms • QoS: slow execution • Command and Control Properties: compromises to the monitoring infrastructure • Trust Level: degree of confidence in key properties • Compromise states • Intent of attackers • Political situation
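The three tiers can be pictured as one record per resource. The following is a minimal sketch, not the authors' representation: the class name ResourceTrust, the per-property compromise probabilities, and the independence assumption used to combine them are all hypothetical, and the sketch ignores attacker intent and the political situation.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceTrust:
    """Hypothetical three-tier trust record for one computational resource."""
    name: str
    # Attack level: history of observed "bad" behaviors.
    attacks: list = field(default_factory=list)
    # Compromise level: property name -> estimated probability it is compromised.
    compromises: dict = field(default_factory=dict)

    def trust(self, required_properties):
        """Trust level for one purpose.

        Assumption: the purpose succeeds only if none of the required
        properties is compromised, and compromises are independent.
        """
        p = 1.0
        for prop in required_properties:
            p *= 1.0 - self.compromises.get(prop, 0.0)
        return p


# Illustrative numbers only: a stolen password hurts privacy far more than integrity.
harding = ResourceTrust(
    name="Harding",
    attacks=["password scan", "probe"],
    compromises={"privacy": 0.9, "integrity": 0.1},
)
deliver_message = harding.trust(["integrity"])  # trusted for this purpose
keep_secret = harding.trust(["privacy"])        # not trusted for this one
```

The same resource gets a different trust value for each purpose, which is the multidimensional behavior the trust model calls for.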
Perpetual Analytical Monitoring (MAITA) • Collects evidence from a broad variety of sources • Intrusion detection systems • Network monitors • Firewalls • Self monitoring application systems • Filters, aggregates, correlates and conditions the data • Matches these against a knowledge base of trend templates • Templates represent temporal patterns indicative of the “etiology of a disease” • Degree of match is indicative of the likelihood of compromise • Sends alerts to running applications in the case of “alarm situations” • Can be directed to increase/decrease its activity to: • Bolster trust in a desirable resource • Carefully monitor a potentially “dicey situation” • Free “computrons” for more important activities
What is a Trend Template? • A Temporal Pattern characteristic of etiology • Key Features: • Landmark points • Temporal constraints between Landmarks • Intervals bounded by Landmarks • Value Constraints relating Variables within the intervals • Trend templates represent a conditional probability of the conclusion given a match to the template • Probabilistic inference also depends on the degree of fit to each trend template • More than a single template may match the data • Trend templates are used to recognize compromised resources after the fact. • Trend templates are also used to recognize attacks • Some attacks cannot be expressed without them
Trend Templates [Figure: example trend template for the compromise “Stolen Password, Exposed FTP Site”. Intervals shown: Increasing Attacks and Probes; Variable but High Rate of Attacks and Probes; Dropping Rate of Attacks; Quiet; Increasing Disk Use, Increasing External Access. Duration labels shown: 1 - 3 days, 1 - 3 days, 1 - 3 hours, 1 - 3 weeks, 1 - 3 weeks.]
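A trend template of this kind can be encoded as an ordered list of intervals with duration bounds. The following is a minimal sketch, not the MAITA representation: the phase labels, the assignment of the slide's duration labels to particular phases, and the position-by-position scoring rule are assumptions made for illustration.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY


@dataclass(frozen=True)
class Phase:
    """One interval of a template, bounded by two landmark points."""
    label: str
    min_hours: float
    max_hours: float


# Hypothetical encoding of the "stolen password, exposed FTP site" pattern.
# Which duration belongs to which phase is an assumption.
STOLEN_PASSWORD_TEMPLATE = [
    Phase("increasing_attacks", 1 * HOURS_PER_DAY, 3 * HOURS_PER_DAY),
    Phase("high_attacks", 1 * HOURS_PER_DAY, 3 * HOURS_PER_DAY),
    Phase("dropping_attacks", 1, 3),
    Phase("quiet", 1 * HOURS_PER_WEEK, 3 * HOURS_PER_WEEK),
    Phase("increasing_disk_and_external_access",
          1 * HOURS_PER_WEEK, 3 * HOURS_PER_WEEK),
]


def match_degree(template, observed):
    """Crude degree of fit in [0, 1].

    `observed` is a time-ordered list of (label, duration_hours) pairs.
    A template phase counts as matched when the observation at the same
    position has the same label and a duration inside the phase bounds.
    """
    matched = sum(
        1
        for phase, (label, hours) in zip(template, observed)
        if label == phase.label and phase.min_hours <= hours <= phase.max_hours
    )
    return matched / len(template)


# An attack history that is only three phases old is a partial match.
partial = match_degree(
    STOLEN_PASSWORD_TEMPLATE,
    [("increasing_attacks", 48), ("high_attacks", 30), ("dropping_attacks", 2)],
)
```

A real matcher would also need value constraints within each interval and a probabilistic reading of the fit; the score here only illustrates how landmarks and temporal constraints combine.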
Adaptive Survivable Systems [Figure: Development Environment and Runtime Environment. Labeled elements: Component Asset Base (tasks Foo, A, B, C, each with methods 1, 2, 3); Plan Structures; Synthesized Sentinels; Self Monitoring (Condition-1 between A and B); Diagnosis & Recovery (Diagnostic Service, Repair Plan Selector, Rollback Designer, Resource Allocator, Enactment); Rational Selection (“To: Execute Foo, Method 3 is most attractive”); Super routines (Layer1, Layer2, Layer3); alerts. Example dependency annotations: “Post Condition 1 of Foo because Post Cond 2 of B and Post Cond 1 of C”; “PreReq 1 of B because Post Cond 1 of A”.]
How to Build Adaptive Survivable Systems • Make Systems Fundamentally dynamic • Systems should have more than one way to achieve any goal • Always allow choices to be made (or revised) late • Inform the runtime environment with design information • Systems should know the purposes of their component computations • Systems should know the quality of different methods for the same goal • Make the System responsible for achieving its goals • Systems should diagnose the failure to achieve intended goals • Systems should select alternative techniques in the event of failure • Build a trust model by pervasive monitoring and temporal analysis • Optimize dynamically in light of the trust model • balance between quality of goal achieved vs risk encountered
Dynamic Rational Component Selection • Systems have more than one method for each task. • Each method specifies • Quality of Service provided • Resources Consumed • Likelihood of success • Likelihood of success is updated to reflect the current state of the trust model • Select the method with the greatest Expected Net Benefit • Generalizes “Method Dispatch”: replace the notion of “Most Specific Method” by that of “Most Beneficial Method”
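Selecting the “Most Beneficial Method” can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' algorithm: it takes Expected Net Benefit to be (probability of success × benefit) minus cost, and it folds the trust model in by scaling each method's baseline success likelihood by the trust placed in the resource it runs on. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Method:
    name: str
    resource: str     # machine the method would run on
    benefit: float    # value of the quality of service if the method succeeds
    cost: float       # value of the resources consumed
    p_success: float  # baseline likelihood of success


def expected_net_benefit(method, trust):
    """ENB = P(success) * benefit - cost, with P(success) scaled by trust.

    `trust` maps resource name -> trust in [0, 1]; unknown resources
    default to full trust (an assumption of this sketch).
    """
    p = method.p_success * trust.get(method.resource, 1.0)
    return p * method.benefit - method.cost


def select_method(methods, trust):
    """Dispatch to the most beneficial method rather than the most specific."""
    return max(methods, key=lambda m: expected_net_benefit(m, trust))


methods = [
    Method("run-on-harding", "Harding", benefit=100.0, cost=10.0, p_success=0.95),
    Method("run-on-thing1", "Thing1", benefit=100.0, cost=30.0, p_success=0.90),
]

# Password-level compromise only: Harding is still the better choice.
mild = select_method(methods, {"Harding": 0.9, "Thing1": 0.95})
# Likely root compromise: the more loaded (costlier) machine wins.
severe = select_method(methods, {"Harding": 0.3, "Thing1": 0.95})
```

With these illustrative numbers the choice flips from Harding to Thing1 as trust in Harding drops, mirroring the two motivating examples.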
Informing the Runtime with Design Info • Plan Structures • Goals • Invariants • Dependencies • Dispatching • Alternative Methods for common tasks • Applicability conditions for alternatives • Decision Criteria for selecting methods • Active Sentinels • Monitors for expected conditions • Data Collectors for runtime statistics • Data on long term failure rates
Making the System Responsible for Achieving Its Goals [Figure: recovery flow. Stages: Localization & Characterization; Scope of Recovery; Selection of Alternative. Labeled elements: Diagnostic Service, Repair Plan Selector, Rollback Designer, Resource Allocator, Concrete Repair Plan, Resource Plan, Monitor, Enactment, alerts; components A and B linked by Condition-1; link labels: requires, prerequisite, achieves.]
The Space of Intrusion Detection [Figure: classification of detection approaches. Representation: Structural Model/Pattern versus Statistical Profile. Basis: Discrepancy from Good (model of expected behavior; unsupervised learning from normal runs) versus Match to Bad (handcoded structural models of attacks; supervised learning from attack runs). Resulting categories: Symptom, Anomaly, Suspicious, Violation.] • A symptom may indicate an attack or a compromise
Model Based Troubleshooting: GDE [Figure: circuit of three Times components and two Plus components, annotated with input, predicted, and observed values.] • Conflicts: Blue or Violet Broken • Diagnoses: • Green Broken, Red with compensating fault • Green Broken, Yellow with masking fault
Adding Failure Models • In addition to modeling the normal behavior of each component, we can provide models of known abnormal behavior • Each model can have an associated probability • A “leak model” covering unknown failures/compromises covers residual probabilities. • The diagnostic task becomes finding the most likely set of models (one for each component) consistent with the observations. • The search process is best-first search with joint probability as the metric • Generate the next most likely diagnoses if interested • Example (Component2, Delay: 2, 4): Normal: Delay 2, 4, Probability 90%; Delayed: Delay 4, +inf, Probability 9%; Accelerated: Delay -inf, 2, Probability 1%
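The best-first search over mode assignments can be sketched as follows. This is a minimal illustration, not the authors' diagnostic engine: it assumes two components connected in series, a single observation of total delay, independent mode probabilities, and no leak model. The mode probabilities follow the slide; Accelerated is taken to mean a delay below 2. Component1 and the observed delay of 9 are hypothetical.

```python
import heapq
import itertools
import math

# Delay interval implied by each behavioral mode.
INTERVALS = {
    "normal": (2, 4),
    "delayed": (4, math.inf),
    "accelerated": (-math.inf, 2),
}
MODES = [("normal", 0.90), ("delayed", 0.09), ("accelerated", 0.01)]


def diagnoses(components, consistent):
    """Yield (joint probability, assignment) in decreasing probability order.

    components: dict of component name -> list of (mode, probability).
    consistent: function taking an assignment dict and returning bool.
    Partial assignments are ranked by the product of the probabilities
    chosen so far, which never underestimates any completion.
    """
    names = list(components)
    heap = [(-1.0, ())]  # (negated joint probability, modes chosen so far)
    while heap:
        neg_p, chosen = heapq.heappop(heap)
        if len(chosen) == len(names):
            assignment = dict(zip(names, chosen))
            if consistent(assignment):
                yield -neg_p, assignment
            continue
        for mode, p in components[names[len(chosen)]]:
            heapq.heappush(heap, (neg_p * p, chosen + (mode,)))


def series_delay_check(observed_total):
    """Consistency test for components in series (an assumed topology)."""
    def consistent(assignment):
        low = sum(INTERVALS[m][0] for m in assignment.values())
        high = sum(INTERVALS[m][1] for m in assignment.values())
        return low <= observed_total <= high
    return consistent


components = {"Component1": MODES, "Component2": MODES}
# Total delay of 9 rules out "both normal" (at most 4 + 4 = 8).
top3 = list(itertools.islice(diagnoses(components, series_delay_check(9)), 3))
```

Because the generator is lazy, asking for further diagnoses simply continues the search, which matches the “generate the next most likely diagnoses if interested” step.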
Moving to a Bayesian Framework • The model has two levels of detail, specifying the computations, the underlying resources, and the mapping of computations to resources • Each resource has models of its state of compromise • The modes of the resource models are linked to the modes of the computational models by conditional probabilities • The model can be viewed as a Bayesian Network [Figure: Component 1 is located on Node17. Component 1 has models Normal: Delay 2, 4; Delayed: Delay 4, +inf; Accelerated: Delay -inf, 2. Node17 has models Normal: Probability 90%; Parasite: Probability 9%; Other: Probability 1%. Links between resource modes and computation modes are labeled with conditional probabilities (.2, .4, .3).]
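The link between resource modes and computation modes amounts to one application of Bayes' rule. The sketch below is illustrative only: the prior over resource modes follows the slide (Normal 90%, Parasite 9%, Other 1%), but the conditional probability table is hypothetical, since the slide does not say which links carry the values .2, .4, and .3.

```python
def posterior(prior, likelihood, observed_mode):
    """P(resource mode | observed computation mode) by Bayes' rule.

    prior: dict of resource mode -> P(resource mode).
    likelihood: dict of resource mode -> dict of computation mode ->
                P(computation mode | resource mode).
    """
    joint = {r: prior[r] * likelihood[r][observed_mode] for r in prior}
    total = sum(joint.values())
    return {r: j / total for r, j in joint.items()}


# Prior over the resource's state of compromise (from the slide).
prior = {"normal": 0.90, "parasite": 0.09, "other": 0.01}

# Hypothetical conditional probabilities; each row sums to 1.
likelihood = {
    "normal":   {"normal": 0.8, "delayed": 0.1, "accelerated": 0.1},
    "parasite": {"normal": 0.3, "delayed": 0.6, "accelerated": 0.1},
    "other":    {"normal": 0.4, "delayed": 0.3, "accelerated": 0.3},
}

# Seeing the computation run slowly raises the belief in a parasite
# from the 9% prior to roughly 37% under these assumed numbers.
after_delay = posterior(prior, likelihood, "delayed")
```

A full network would combine evidence from every computation located on the same resource; this single-link update only shows the direction of the inference from observed behavior back to the state of compromise.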
Final Model Probabilities
Resource         Hacked (Posterior)   Hacked (Prior)   Normal (Posterior)   Normal (Prior)
Trader-Joe       .324                 .300             .676                 .700
Bonds-R-Us       .207                 .200             .793                 .800
JPMorgan-Net     .450                 .150             .550                 .850
WallSt-Server    .267                 .100             .733                 .900
Computation      Mode       Probability
Web-Server       Off-Peak   .028
                 Peak       .541
                 Normal     .432
Dollar-Monitor   Slow       .738
                 Normal     .262
Yen-Monitor      Slower     .516
                 Slow       .339
                 Normal     .145
Bond-Trader      Slow       .590
                 Fast       .000
                 Normal     .410
Currency-Trader  Slow       .612
                 Fast       .065
                 Normal     .323
Work Plan • New start July 1, 2000 • Tasks in Base Effort • Trust Models: Ontologies of Attacks, Compromises, Intentions, and Trust Positions • Perpetual Analytic Monitoring: Trend templates dealing with compromises, informed by IT systems and self-monitoring applications • Rational Trust Management: Decision theoretic models and algorithms for allocating resources to computations • Test bed for above • Tasks in Options • Self-Adaptive Application Infrastructure: Synthesis of monitors, diagnostic techniques,
Major Risks • How to guard against monitoring infrastructure itself becoming compromised or denied service • Comprehensiveness of the ontologies (attacks, compromises, trust states, etc.) • Decision making in real-time (method selection, resource allocation)
Mitigations • In principle, much of the other technology could be used to harden our infrastructure • Our techniques for adaptivity can be applied to our infrastructure • Other projects are also working on cataloging the same knowledge; we will broaden our list of collaborators • Replace explicit decision theoretic techniques by qualitative analogs and by switching between rule-based policies that approximate decision theoretic conclusions.
Milestones • Trust models • Publish ontology of attacks, compromises etc • Perpetual Analytic Monitoring • Demonstrate trust monitoring library • Demonstrate reconfiguration of monitoring infrastructure • Final exam in Testbed • Rational Trust Management • Demonstrate initial models and algorithms • Demonstrate capability in high-frequency real-time environment • Options: • Demonstrate initial prototype of Adaptive System infrastructure which utilizes the initial trust model ontology. • Demonstrate prototype of integrated Adaptive System development and runtime environment, including all aspects of intrusion tolerance. • Final exam in testbed
Testing Strategy • Deploy small application in the AI Lab environment • Open university environment • Subject to frequent hackery • Use this as Test Bed • Incrementally deploy research techniques in test bed • Measure effectiveness • Against usual background • Against intentional, staged attacks (by us and our friends)