Computational Resiliency. Steve J. Chapin and Susan Older, Syracuse University; Gregg Irvin, Mobium Enterprises. OASIS PI Meeting.
Recap: What is Computational Resiliency? The ability to sustain application operation and dynamically restore the level of assurance during an attack. Application-centric self-defense, built on replication, migration, functionality mutation, and camouflage.
Computational Resiliency (Diagram: Mission Critical Application; Attack; Result of Attack: Degraded Application trying to perform Mission Critical Function; Computational Resiliency Techniques applied to correct situation; Degraded Application sufficiently improved by Resiliency to perform Mission Critical Function.)
Example of CRLib (Diagram labels: “Safe Zone”, OASIS protection; “The Wild”, limited protection.)
The Players • Rocky & Bullwinkle: our heroes, both air and ground forces. • Dudley: representative of allied power. • Boris & Natasha: Directed by shadowy figure (Fearless Leader). Mission: big trouble for Moose and Squirrel. • Snidely: attempting to disrupt Dudley’s jobs.
The Benign State Rocky’s job; Bullwinkle’s job; Dudley’s job (low priority).
The Attacks Snidely: blocked at firewall. Dudley does nothing.
The Attacks Natasha attacks Rocky; caught by IDS.
The Attacks Rocky’s job migrates back into safe zone; Dudley must give up resources.
The Attacks Boris attacks Bullwinkle’s job. Some attacks succeed.
The Attacks Bullwinkle’s job employs camouflage, decoys, and migration.
Multi-Faceted Approach • Strong theoretical basis • reason about conformance to policy • Computational resiliency library • dynamic application management • System software support • scheduling/policy frameworks
Computational Resiliency Library • Group messaging • group contains multiple nodes • all nodes receive all messages to group • Replication/recovery with migration • liveness check at synchronization points • application readiness restored via node creation and migration
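The group-messaging rule above (every node in a group receives every message sent to the group) can be sketched as follows. This is a minimal illustration only: the `Group` class, its `send` method, and the node names are hypothetical and are not the CRLib API.

```python
class Group:
    """Hypothetical stand-in for a group: a named set of replica nodes."""

    def __init__(self, name, node_ids):
        self.name = name
        # One inbox per node; every node sees every message sent to the group.
        self.inboxes = {node_id: [] for node_id in node_ids}

    def send(self, message):
        """Deliver one copy of the message to every node in the group."""
        for inbox in self.inboxes.values():
            inbox.append(message)
        return len(self.inboxes)  # number of copies delivered


group = Group("task-A", ["a0", "a1", "a2"])
copies = group.send("step 1 input")
```

Because each replica holds the same message history, any surviving node can stand in for a lost one, which is what makes replication-based recovery possible.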
Groups and Messaging One group per cooperating task in a distributed computation. (Diagram labels: Group 1, Group 2, Group 3, node, channel.)
Group Messaging Detail In actuality, each member of Group 1 has a channel to each member of Group 2. (Diagram labels: Group 1, Group 2.)
Mapping of Nodes to Processors (channels not shown) • Nodes of group mapped across processors • Multiple nodes as threads in a single process • One or more processes per processor (Diagram labels: Group, Processor.)
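One way to picture the mapping is a placement table from processors to the group nodes they host. The sketch below assumes a simple round-robin policy so that replicas of one group land on different processors where possible; the slide does not state the actual placement policy, and `place_nodes` is a hypothetical name.

```python
def place_nodes(groups, processors):
    """Round-robin each group's nodes across processors (assumed policy).

    groups: dict mapping group name -> number of replica nodes
    processors: list of processor names
    Returns a dict: processor -> list of (group, node_index) hosted there.
    """
    placement = {p: [] for p in processors}
    for group, replica_count in groups.items():
        for i in range(replica_count):
            processor = processors[i % len(processors)]
            placement[processor].append((group, i))
    return placement


placement = place_nodes({"g1": 3, "g2": 2}, ["cpu0", "cpu1", "cpu2"])
```

Nodes that share a processor could then run as threads in one process, as the slide describes.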
Periodic Liveness Check • Done at user-defined synchronization points in the computation • All group members send ping messages to all others in the same group • Local Group Leader (1 per group) elected (responsible for restoring intra-group replication level) • LGLs elect Global Group Leader (responsible for inter-group coordination)
Periodic Liveness Check II • LGLs determine local status by fiat, restore replication level, and report to GGL • create new threads via cloning LGL • consensus option is in place but currently unused • GGL reports results of LGL actions to other LGLs. • LGL and GGL return to normal duty
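The two liveness-check slides can be condensed into a small simulation. This is a sketch under stated assumptions, not the library's implementation: the election rule (lowest live node id wins), the clone naming, and the function name `liveness_round` are all hypothetical, and ping exchange is reduced to a set of nodes that respond.

```python
def liveness_round(groups, alive, target_replicas):
    """Simulate one liveness check.

    groups: dict group name -> list of node ids (strings)
    alive: set of node ids that answer pings
    target_replicas: desired number of live nodes per group
    Returns (new_groups, lgls, ggl, reports).
    """
    new_groups, lgls, reports = {}, {}, {}
    for name, nodes in groups.items():
        live = [n for n in nodes if n in alive]
        if not live:
            # Whole group lost; cloning a leader cannot recover it.
            new_groups[name], reports[name] = [], "lost"
            continue
        lgl = min(live)  # assumed election rule: lowest live id wins
        lgls[name] = lgl
        missing = target_replicas - len(live)
        # The LGL decides by fiat and restores the level by cloning itself.
        clones = [f"{lgl}-clone{i}" for i in range(max(missing, 0))]
        new_groups[name] = live + clones
        reports[name] = f"restored {len(clones)}"
    # LGLs elect the GGL (same assumed rule); it relays the reports.
    ggl = min(lgls.values()) if lgls else None
    return new_groups, lgls, ggl, reports


groups = {"g1": ["a0", "a1", "a2"], "g2": ["b0", "b1", "b2"]}
alive = {"a1", "a2", "b0", "b1", "b2"}  # a0 has been taken out
new_groups, lgls, ggl, reports = liveness_round(groups, alive, 3)
```

Here group g1 loses a0, elects a1 as its LGL, and regains a third replica by cloning a1, while g2 is untouched.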
Current Issues • Exploring through in-house red teaming and modeling • Efficiency of basic mechanisms • multiplicative communication load • additive computation load • Efficacy of basic mechanisms • Window of attack between liveness checks • Attack during liveness check • agreement algorithms
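The efficiency concern can be made concrete with a little arithmetic. The sketch below is one reading of "multiplicative" and "additive", assuming every replica of a sending group uses its channel to every replica of the receiving group and every replica performs the full computation; the function names are hypothetical.

```python
def physical_messages(senders, receivers):
    """Copies sent for one logical group-to-group message (multiplicative)."""
    return senders * receivers


def liveness_pings(group_size):
    """Pings in one check when every member pings all other members."""
    return group_size * (group_size - 1)


def compute_units(replicas_per_group, work_per_node=1):
    """Total work when every replica runs the task (additive in replicas)."""
    return sum(replicas_per_group) * work_per_node


msgs = physical_messages(3, 3)   # two groups with three replicas each
pings = liveness_pings(3)
work = compute_units([3, 3])
```

With three replicas per group, one logical message becomes nine physical messages, while total computation only triples per group, so communication is likely the first cost to watch as replication levels rise.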
Next Steps • Additional policy choices (not necessarily orthogonal) • agreement protocols • replication/recovery methods • message passing schemes • Tool for user policy expression • state-dependent policy specified via “Chinese menu” approach • logical predicates, state transitions
Next Steps • π-calculus-based formal model for core library behavior • Split/merge for groups • all nodes in a group must be identical • basis for load balancing, functionality mutation • First demo at summer PI meeting, 2001
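Split and merge can be illustrated with a toy model in which a group is a role plus a list of nodes, and "identical" is approximated as "same role". This is purely a sketch of the idea: the dictionary layout and the `split` and `merge` helpers are hypothetical, and the planned library semantics may differ.

```python
def split(group, count):
    """Move `count` nodes into a new group; both groups keep the same role."""
    if not 0 < count < len(group["nodes"]):
        raise ValueError("split must leave both groups non-empty")
    kept = {"role": group["role"], "nodes": group["nodes"][:-count]}
    moved = {"role": group["role"], "nodes": group["nodes"][-count:]}
    return kept, moved


def merge(a, b):
    """Combine two groups; allowed only when their nodes are identical in role."""
    if a["role"] != b["role"]:
        raise ValueError("all nodes in a group must be identical")
    return {"role": a["role"], "nodes": a["nodes"] + b["nodes"]}


g = {"role": "solver", "nodes": ["n0", "n1", "n2", "n3"]}
kept, moved = split(g, 1)
rejoined = merge(kept, moved)
```

In this model, load balancing amounts to splitting a group and placing the new group elsewhere, and functionality mutation would change the role of a split-off group, after which it could no longer be merged back.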
Schedule (Timeline: 6/00, 12/00, 6/01, 12/01, 6/02, 12/02, 6/03, 12/03.) Tasks: Basic π-calc; Formal equivalence; Policy/Protocol Analysis; Basic CRLib.
Schedule II (Timeline: 6/00, 12/00, 6/01, 12/01, 6/02, 12/02, 6/03, 12/03.) Tasks: Funct. Mut.; Policy Frameworks; Camouflage; Schedulers; Hard. Apps.; Integration; Demos.
Open Issues • Cost/benefit analysis of CR • how much protection do we provide if the attacker knows what we’re trying to do? • how much is performance affected by message load, active replication, etc.? • Potential integration with other OASIS projects • complementary with system-hardening technology (e.g., Dependable Intrusion Tolerance)