Scalability, Accountability and Instant Information Access for Network Centric Warfare

Yair Amir (PI), Claudiu Danilov, John Lane, Jonathan Shapiro, Ciprian Tutu, Department of Computer Science, Johns Hopkins University • Cristina Nita Rotaru, Department of Computer Sciences, Purdue University • http://www.cnds.jhu.edu

Network Centric Warfare Environments • Wide area network settings. • C3I systems usually span large geographical distances. • Communication between sites is conducted over unreliable channels. • Timely decisions based on available information. • Required update semantics are not general in many cases • Critical information is often not large. • Source uniqueness.

Network Centric Warfare Environments • Wide area network settings. • Timely decisions based on available information. • Intermittent network connectivity • Results in high latency for propagation and for consistent replication of updates. • Decisions may have to be made promptly. • Based on the best currently available information. • Required update semantics are not general in many cases • Critical information is often not large. • Source uniqueness.

Network Centric Warfare Environments • Wide area network settings. • Timely decisions based on available information. • Required update semantics are not general in many cases. • Weaker update semantics may suffice. • Common operational picture: • Commutative update semantics. • Timestamp resolution (most recent update wins). • Critical information is often not large. • Source uniqueness.
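The timestamp resolution rule above can be illustrated with a minimal sketch. The `Update` record, its field names, and the tie-break on source id are assumptions made for this example; they do not come from the slides. The point it shows is that a "most recent update wins" merge gives the same state regardless of delivery order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str          # hypothetical: the item being reported, e.g. a unit id
    source: str       # the unique source that initiated the update
    timestamp: float  # time assigned by the source
    value: str        # hypothetical payload, e.g. a reported position


def merge(a, b):
    """Keep the more recent update; ties fall back to source id (an assumption)."""
    return max(a, b, key=lambda u: (u.timestamp, u.source))


def apply_all(updates):
    """Fold updates into a per-key state using the commutative merge."""
    state = {}
    for u in updates:
        state[u.key] = merge(state[u.key], u) if u.key in state else u
    return state


if __name__ == "__main__":
    ups = [
        Update("unit-7", "A", 1.0, "grid 10"),
        Update("unit-7", "A", 3.0, "grid 12"),
        Update("unit-9", "B", 2.0, "grid 40"),
    ]
    # Delivery order does not change the resulting state.
    print(apply_all(ups) == apply_all(list(reversed(ups))))
```

Because the merge is commutative, associative, and idempotent, replicas that receive the same set of updates converge without first agreeing on a total order.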

Network Centric Warfare Environments • Wide area network settings. • Timely decisions based on available information. • Required update semantics are not general in many cases • Critical information is often not large. • Compared with current hardware capabilities. • Location of friendly forces and enemy forces. • A few plans. • Allows storing all updates throughout the duration of engagement (several months). • Source uniqueness.

Network Centric Warfare Environments • Wide area network settings. • Timely decisions based on available information. • Required update semantics are not general in many cases • Critical information is often not large. • Source uniqueness. • Every input (update) is initiated by one unique source.


Malicious Insider Threats • Insiders: participants with legitimate access, or those that bypassed the protection mechanisms, who exhibit arbitrary (malicious) behavior. • The insider attack has traditionally been a primary threat to computer systems (http://csrc.nist.gov). • The explosion of the Internet made things worse: insiders commit about 80% of all computer and Internet related crime (www.intergov.org; CSI/FBI 2003 Computer Crime and Security Survey).

Dealing with Insider Threats • Detection: use intrusion detection systems; however, they are not perfect (high false positive rate). • Prevention: use access control, firewalls, proactive security; but vulnerabilities still exist (OS bugs, buffer overflows, covert channels, etc.). • Mitigation (tolerate/cope): use mechanisms that provide service to correct participants while under attack, even if several participants are compromised. • The above methods do not exclude each other.

Outline • Network centric warfare environments. • Peer Byzantine replication limitations. • Research approach. • Scaling wide area intrusion tolerance replication via hierarchy • Local Byzantine replication within sites. • Fault tolerant replication on the wide area. • Client accountability. • Accountability graph. • Snapshots for fast regenerations. • Exploiting application semantics. • Next steps. • Technology transitioning. • Summary.

A Distributed Systems Service • Message-passing system. • Clients issue requests to servers, then wait for answers. • Replicated servers process the request, then provide answers to clients. [Figure: a site, with clients connected to server replicas 1, 2, 3, ..., 3f+1.]

State Machine Replication • Requests must be ordered in a consistent manner by all servers. • Usually one server manages the ordering process based on information from the other participants, then informs everybody about what was decided. • If the leader dies, a new leader must be selected to ensure progress. • Benign faults: Paxos [Lam98,Lam01]: must contact f+1 out of 2f+1 servers and uses 2 rounds to allow consistent progress. • Byzantine faults: BFT [CL99]: must contact 2f+1 out of 3f+1 servers and uses 3 rounds to allow consistent progress.
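The server counts, quorum sizes, and round counts quoted above can be tabulated with a small sketch. The function names and the dictionary layout are hypothetical; the overlap calculation is ordinary arithmetic on the quoted sizes and is not a description of either protocol's internals.

```python
def replication_parameters(f, byzantine):
    """Return the server count, quorum size, and round count quoted on the slide."""
    if f < 0:
        raise ValueError("f must be non-negative")
    if byzantine:
        # BFT [CL99]: contact 2f+1 out of 3f+1 servers, 3 rounds.
        return {"servers": 3 * f + 1, "quorum": 2 * f + 1, "rounds": 3}
    # Paxos [Lam98, Lam01]: contact f+1 out of 2f+1 servers, 2 rounds.
    return {"servers": 2 * f + 1, "quorum": f + 1, "rounds": 2}


def min_quorum_overlap(f, byzantine):
    """Smallest possible intersection of any two quorums."""
    p = replication_parameters(f, byzantine)
    return 2 * p["quorum"] - p["servers"]


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, replication_parameters(f, False), replication_parameters(f, True))
```

For benign faults any two quorums share at least one server; for Byzantine faults they share at least f+1, so at least one shared server is correct.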

A Replicated Server System • Maintaining consistent servers [Sch90] : • To tolerate f benign faults, 2f+1 servers are needed. • To tolerate f malicious faults: 3f+1 servers are needed. • Responding to read-only clients’ request [Sch90] : • If the servers support only benign faults: 1 answer is enough. • If the servers can be malicious: the client must wait for f +1 identical answers, f being the number of malicious servers.
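The read rule above, waiting for f+1 identical answers when servers can be malicious, can be sketched as follows. The `(server_id, value)` reply format and the function name are assumptions made for this example.

```python
from collections import Counter


def accept_reply(replies, f):
    """Return a value once f+1 distinct servers have reported it, else None.

    replies: iterable of (server_id, value) pairs in arrival order.
    With at most f malicious servers, f+1 matching answers include
    at least one from a correct server.
    """
    seen_servers = set()
    counts = Counter()
    for server_id, value in replies:
        if server_id in seen_servers:
            continue  # count each server at most once
        seen_servers.add(server_id)
        counts[value] += 1
        if counts[value] >= f + 1:
            return value
    return None


if __name__ == "__main__":
    # f = 1: one malicious reply cannot outvote two matching correct replies.
    print(accept_reply([(1, "x"), (2, "forged"), (3, "x")], f=1))
```

Setting f to 0 reduces this to the benign case, where a single answer is enough.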

Peer Byzantine Replication Limitations • Limited scalability due to multiple all-peer exchange. • 3-round all-peer exchange. • Very costly on high latency wide area links. • Not very scalable. • Strong connectivity is required. • Construct consistent total order. • Focus is solely on replica protection.

Peer Byzantine Replication Limitations • Limited scalability due to multiple all-peer exchange. • Strong connectivity is required. • 2f+1 (out of 3f+1) to allow progress and f+1 to get an answer. • Partitions are a real issue. • Clients depend on remote information. • Bad news: Provably optimal. • We need to pay something to get something else. • Construct consistent total order. • Focus is solely on replica protection.

Peer Byzantine Replication Limitations • Limited scalability due to multiple all-peer exchange. • Strong connectivity is required. • Construct consistent total order. • Agreement is achieved on the order of updates before applying them. • Very useful - supports general update semantics. • Maybe sub-optimal for C3I applications that need only commutative semantics. • Focus is solely on replica protection.

Peer Byzantine Replication Limitations • Limited scalability due to multiple all-peer exchange. • Strong connectivity is required. • Construct consistent total order. • Focus is solely on replica protection. • Compromised clients can inject wrong (though valid) input through authorized channels. • Wrong input will be consistently replicated to all servers.

Local Byzantine Replication Within a Site • No trust between participants in a site. • A site acts as one unit that can only crash if the assumptions are met. • How to make sure that one server cannot manipulate the order? • Threshold cryptography seems a good direction. • Use BFT-like [CL99, YMVAD03] protocols and threshold cryptography to guarantee that any valid message leaving the site is correct.

Fault Tolerant Replication Engine [AT02] [Figure: state machine of the replication engine. States: Reg Prim, Trans Prim, Exchange States, Exchange Messages, Construct, Non Prim, Un, No. Transition labels include Update (Green), Update (Yellow), Update (Red), Reg Memb, Trans Memb, Last State, Last CPC, Recover, Possible Prim, and No Prim or Trans Memb.]

Fault Tolerant Experiments over Wide-Area Network • A real experimental network (CAIRN). • Was also modeled in the Emulab facility. [Figure: network topology with sites Boston, Delaware, Virginia, San Jose, and Los Angeles; hosts MITPC, UDELPC, TISWPC, ISEPC, ISEPC3, ISIPC, ISIPC4; link labels 4.9 ms / 9.81 Mbits/sec, 3.6 ms / 1.42 Mbits/sec, 1.4 ms / 1.47 Mbits/sec, 38.8 ms / 1.86 Mbits/sec, and two local links at 100 Mb/s / <1 ms.]

Throughput Comparison (WAN) [ADMST02]

Hierarchical Architecture • Each site acts as a logical unit that can crash. • Fault-tolerant protocols between sites. [Figure: a site, with clients and server replicas 1, 2, 3, ..., 3f+1.]

Hierarchical Architecture Details [Figure: two local sites connected by a wide area network. In each site, clients reach server replicas 1, 2, ..., 3f+1 over a local area network. Each replica is drawn with layers labeled Byzantine Replication and Fault Tolerant Replication over Secure Spread; Monitor components also appear. In each site one replica is labeled wide area representative and the others wide area standby.]

Payment & Potential Gain • Protects against f Byzantine faults in each site for the price of having 3f+1 replicas in every site. • Box numbers / a total site compromise. • Read queries are limited to the local site. • On a network with diameter of 50 ms: • It takes at least 300 milliseconds to complete the 3 wide area round trips used by peer Byzantine replication methods. • FT Replication engine was shown to achieve 5 times the performance of 2PC. • Goal: > factor of 3 compared with a peer system.
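The 300 ms figure above follows from simple arithmetic, assuming "diameter" means the one-way latency between the two most distant sites, so that one wide area round trip costs twice the diameter. The function name below is hypothetical, and the sketch ignores processing and local area delays.

```python
def wide_area_latency_ms(diameter_ms, round_trips):
    """Lower bound on protocol latency from wide area round trips alone.

    Assumes one round trip costs 2 * diameter_ms (out and back across
    the network diameter).
    """
    return 2 * diameter_ms * round_trips


if __name__ == "__main__":
    # 3 round trips across a 50 ms diameter network.
    print(wide_area_latency_ms(50, 3))
```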

Alternative Scalable Architecture • Use physical trusted nodes assumed to be working under a weaker adversary: they can crash and recover, but cannot be compromised. • Take advantage of the trusted nodes to run an optimized Byzantine replication algorithm, potentially reducing the number of rounds. • Use protocols where communication over the WAN only takes place between trusted nodes, thus avoiding high latency. • Similar approaches: [CLNV02, Ver03, SurS03]

What About Corrupted Clients? • We cannot detect corrupted clients without external information (can take advantage of detection mechanisms). • Can we bring the system to a “clean” state if we have external information about compromised clients? • Proposed solution: accountability graph (A-DAG).

Client Accountability Graph • A directed acyclic graph of updates. • Each update links to previous updates modifying data it read (causal predecessors). [Figure: client updates plotted over time.]

Client Accountability Graph • Limits adversary power: • Adversary can inject updates only as a compromised client. • Once a compromised network avoids delivering an update, it cannot deliver causally following updates. • Useful for risk assessment. [Figure: updates over time, with the corrupted update marked X; legend: corrupted update, suspicious update, clean update.]
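A minimal sketch of how such a graph could be used for risk assessment follows. It assumes that an update is "corrupted" when it was issued by a client known to be compromised, and "suspicious" when it causally depends, directly or transitively, on a corrupted or suspicious update; the slides give only the three legend labels, so this reading, the `UpdateNode` type, and the `classify` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateNode:
    update_id: str
    client: str
    # Ids of earlier updates whose data this update read (causal predecessors).
    predecessors: list = field(default_factory=list)


def classify(updates, compromised_clients):
    """Label each update as 'corrupted', 'suspicious', or 'clean'.

    updates must be listed in causal order: every predecessor appears
    before the updates that link to it.
    """
    status = {}
    for u in updates:
        if u.client in compromised_clients:
            status[u.update_id] = "corrupted"
        elif any(status[p] != "clean" for p in u.predecessors):
            status[u.update_id] = "suspicious"
        else:
            status[u.update_id] = "clean"
    return status


if __name__ == "__main__":
    graph = [
        UpdateNode("u1", "alice"),
        UpdateNode("u2", "mallory", ["u1"]),
        UpdateNode("u3", "bob", ["u2"]),
        UpdateNode("u4", "carol", ["u1"]),
    ]
    print(classify(graph, {"mallory"}))
```

Updates that never read tainted data stay clean, which is what keeps the damage from a compromised client bounded.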

Enabling Fast Regeneration Using Snapshots • Periodic snapshots limit state regeneration calculation. • For our application domain, it seems feasible to maintain continuous information over a long period of time. [Figure: updates over time, with the most recent snapshot and the corrupted update (X) marked; legend: corrupted update, suspicious update, clean update.]
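One way to read the snapshot idea is sketched below: to regenerate a clean state, start from the most recent snapshot taken before the first bad update and replay only the remaining clean updates. The log format, the key/value state, and the `regenerate` function are hypothetical, and the sketch assumes an initial empty snapshot at index 0 always exists.

```python
def regenerate(snapshots, log, bad_ids):
    """Rebuild state while skipping corrupted and suspicious updates.

    snapshots: list of (index, state) pairs, where state reflects log[:index].
    log: list of (update_id, key, value) tuples in applied order.
    bad_ids: ids of updates to exclude from the regenerated state.
    """
    # Position of the first update that must be excluded.
    first_bad = next(
        (i for i, (uid, _, _) in enumerate(log) if uid in bad_ids), len(log)
    )
    # Most recent snapshot that does not include any bad update.
    base_index, base_state = max(
        (s for s in snapshots if s[0] <= first_bad), key=lambda s: s[0]
    )
    state = dict(base_state)
    # Replay only the updates after the snapshot, skipping the bad ones.
    for uid, key, value in log[base_index:]:
        if uid not in bad_ids:
            state[key] = value
    return state


if __name__ == "__main__":
    log = [("u1", "a", 1), ("u2", "b", 2), ("u3", "a", 9), ("u4", "c", 3)]
    snapshots = [(0, {}), (2, {"a": 1, "b": 2})]
    print(regenerate(snapshots, log, {"u3"}))
```

The later the usable snapshot, the fewer updates need to be replayed, which is the sense in which periodic snapshots bound the regeneration work.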

Overall Architecture [Figure: two local sites connected by a wide area network. In each site, clients reach server replicas 1, 2, ..., 3f+1 over a local area network. Each replica is drawn with an A-DAG layer above Byzantine Replication and Fault Tolerant Replication over Secure Spread; Monitor components also appear. In each site one replica is labeled wide area representative and the others wide area standby.]

Risks and Challenges • Interface the Byzantine-tolerant replication and Fault-tolerant replication components. • Investigate the impact of threshold digital signatures on performance and complexity. • Interface Byzantine-tolerant replication with the client accountability graph. • Use of application semantics to optimize protocols. • Design optimizations to make the cost of the architecture very small when no faults occur. • Take into account confidentiality under corrupted servers model.

Scalability, Accountability and Instant Information Access for Network-Centric Warfare • New ideas: • First scalable wide-area intrusion-tolerant replication architecture. • Providing accountability for authorized but malicious client updates. • Exploiting update semantics to provide instant and consistent information access. • Impact: • Resulting systems with at least 3 times higher throughput, lower latency and high availability for updates over wide area networks. • Clear path for technology transitions into Military C3I systems. • Schedule (milestones June 04, Dec 04, June 05, Dec 05): • System integration & evaluation. • Component analysis & design. • Component evaluation. • Component implementation. • C3I model, baseline and demo. • Final C3I demo and baseline evaluation. • http://www.cnds.jhu.edu/funding/srs/
