Intrusion Tolerant Distributed Systems – Algorithms and Architectures

Software Systems Research Seminar, March 21, 2003
Angelo Corsaro & Venkita Subramonian
DOC Group, Washington University

Security: State of the Art
• Most secure systems today are built by trying to prevent attacks
• Several techniques and tools have been developed to make systems more secure, detect system weaknesses, and protect systems
  • New programming languages
  • Software tools such as code analyzers and system profilers
  • New hardware/software components
• Yet system security keeps being compromised
• Pervasive interconnectivity introduces further security challenges
• The lesson learned in securing systems is that this brute-force approach does not work
• Experience has led to the key observation that it is not practical or feasible to build 100% secure systems

Classical Secure Distributed Systems
• Classical secure distributed systems assume that part of the system is trusted
• The basic, recurring idea is to connect distributed components together so as to form a global secure infrastructure
• This approach requires large trusted parts on every computer on the network

Kerberos
• Kerberos is one of the most widely used and deployed distributed security systems
• It was designed and implemented at MIT as part of the Athena project
• The core assumptions behind Kerberos's design are the following:
  • Client workstations are entirely under the control of the user, i.e., they cannot be trusted
  • Remote services can be accessed only via an authentication service
  • Servers are trusted and physically protected
  • The servers are under the complete control and responsibility of the administrator
  • The master server is replicated on passive slaves, which can replace the server when it fails

Kerberos
(Figure, built up over six slides: message flow between the Client, Kerberos, the Ticket Granting Service (TGS), and the Server)
1. Request for a TGS ticket (Client to Kerberos)
2. Ticket for TGS (Kerberos to Client)
3. Request for Server Ticket (Client to Ticket Granting Service)
4. Server Ticket (Ticket Granting Service to Client)
5. Request for Service (Client to Server)

Kerberos's Security Problems
• The security administrator can misuse his privileges to perform unauthorized actions
• Replicas (Kerberos uses passive replication) can also leak information to intruders if not well protected
• If the Kerberos server fails, the latest database changes are lost
• Nothing is done to prevent "covert" channels
• There is a single point of failure

Security: New Trends
• Eliminating the flaws that make systems insecure is not feasible (especially for legacy systems)
• Currently adopted solutions for distributed systems security have quite a few problems
• How about building systems that can continue critical operations in the face of attacks?
• Can we build systems that tolerate attacks instead of trying to prevent them?

Architectures for Intrusion Tolerance

Intrusion Tolerance: The Idea
• Intrusion-tolerant systems are designed to tolerate a bounded number of misuses
• If one or more intruders bypass the protection mechanisms, and the number of misuses they commit is below a given threshold, the security properties of the system are still ensured:
  • Confidentiality
  • Integrity
  • Availability
• The key observation underlying intrusion-tolerant systems is that an intrusion can be thought of as a Byzantine fault

Types of Intrusion Tolerance
• Confidentiality: read access to a subset of confidential data gives no information about the data
• Integrity: changing a subset of the data does not change the data perceived by legitimate users
• Availability: changing or deleting a subset of the data or of a server does not produce a denial of service for legitimate users
• For each property P a threshold Tp is defined
• Property P is preserved when an intruder reads, modifies, or destroys a part X of the data or server D such that |X| < Tp

Intermezzo

Data Intrusion Tolerance
• Data intrusion-tolerance techniques have existed for a long time
• Confidentiality can be ensured by cryptographic tools such as the threshold scheme
  • The data is split into shadows, each shadow being stored on one security site
  • A number of shadows equal to the threshold is sufficient to rebuild the data
  • This scheme also ensures availability and integrity
• To prevent denial of service, the servers are replicated
• Different sites cannot take decisions independently; they must agree by communicating data and local decisions
• This last point requires replication and agreement
(Figure: File A, File B, and File C distributed across Site X, Site Y, and Site Z)

Intrusion Tolerant Security Service

Intrusion Tolerant Security Server
• The goal of an intrusion-tolerant distributed security server is to provide a trusted service out of a set of potentially untrusted computers
• This way, the intrusion of one or some of the computers will not compromise the security of the global system
• All the sites that are part of the security service, called security sites, have to provide a series of services:
  • Registration
  • Authentication
  • Sensitive Data Management
  • Audit and Recovery Service

Registration Service
• Registration permits a user to be registered by the system for future access to secured services
• This operation must be carried out independently on each security site, to prevent a single site from using the information to impersonate the user
• The operation is done under the control of the security administrator of each site

Authentication Service
• The role of this service is to verify the claimed identity of a subject
• In a distributed system with several authentication servers, each server must independently authenticate the subject
• Note that the security sites are untrusted, and one site could fake the authentication information
• An agreement protocol is used to make sure that the user is authenticated only if a majority of the servers succeed
• Upon authentication the server sends the user some session information, such as a session id and a key

Authorization Service
• The role of the authorization service is to check that access to a secured service by a subject is authorized according to the subject's access rights
• Access rights could be implemented in a UNIX-like manner
• The authorization service is made intrusion tolerant by implementing it on the security servers
• Authorization phases are:
  • The client asks the security server for permission to access a secured service
  • The access rights stored on the security sites determine whether the client has the proper rights
  • The security sites vote to decide if the access is authorized
  • If the sites agree to permit access, they send one ticket to the client and another to the server
  • Using the ticket, the client can now open a session with the server

Sensitive Data Management Service
• The role of this service is to store, manage, and retrieve the sensitive information on the security servers
• The data management service must enforce the three main security properties:
  • Confidentiality
  • Integrity
  • Availability
• Integrity is provided by a modification-detection mechanism such as cryptographic signatures
• Replication can be used to ensure availability, while threshold techniques could be used for confidentiality and availability

Sensitive Data Management Service
• If data is replicated on N sites, then
  • With respect to availability, up to N-1 replicas can be lost
  • With respect to confidentiality, one replica is sufficient to observe the data
• If one data item is shared on N security sites using a threshold of T, then
  • With respect to availability, N-T shadows can be lost
  • With respect to confidentiality, T shadows are necessary and sufficient to observe the data
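The N and T trade-off can be made concrete with a small sketch. The slides do not say which threshold scheme is meant; the code below assumes Shamir's polynomial scheme over a prime field, and the names `PRIME`, `split`, and `rebuild` are hypothetical choices for illustration only.

```python
import secrets

# Assumption: Shamir's threshold scheme over a prime field (not named in the slides).
PRIME = 2**127 - 1  # a Mersenne prime; secrets must be smaller than this


def split(secret, n, t):
    """Split `secret` into n shadows such that any t of them rebuild it."""
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n + 1)]


def rebuild(shadows):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shadows):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shadows):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shadows = split(123456789, n=5, t=3)   # N = 5 sites, threshold T = 3
recovered = rebuild(shadows[2:])       # N - T = 2 shadows lost, 3 remain
```

With N = 5 and T = 3, any three shadows rebuild the data, so two sites can be lost, while an intruder who reads fewer than three shadows cannot reconstruct it.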

The Audit and Recovery Service
• The role of this service is to audit the security information sent by the services
• There are two kinds of information:
  • Authorized operations
  • Attempted or successful intrusion or misuse
• Note that it is not the role of this service to determine what constitutes an intrusion or a misuse
• Analysis of the audit is done offline by security administrators
• The recovery service acts as an error recovery mechanism to correct certain modified data

Voting Algorithms for Intrusion Tolerance

Need for voting algorithms
• Authentication
• Authorization

FT Node architecture
(Figure: clusters Cluster1, Cluster2, …, Cluster3, each containing processors P1, P2, P3 and a Bus Controller; a local broadcast medium)

Distributed Voting
• Two phases
  • Local computation
    • Compute results locally and broadcast results
  • Majority reconciliation
    • Determine if a majority exists
    • Initiate fault diagnostics if necessary
• Distributed algorithm for both phases
• Coordinator commits the majority vote

Phase 2 (1/2)
Distributed algorithm that runs on every voter

Receive result from all voters
If my result is the same as all other results
    we have a unanimous vote
    commit vote
Else if more than 50% of the results are the same
    we have a majority
    if I am the coordinator and my result is NOT the same as the majority result
        select a new coordinator from among the "majority processors"
    commit vote
    if I am the coordinator
        initiate fault recovery in minority nodes
(continued…)

Phase 2 (2/2)
Else
    we do not have a majority
    start local diagnostics
    if my status = "okay"
        select new coordinator from among "okay" processors
    repeat voting process
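The three-way decision in Phase 2 (unanimous, majority, no majority) can be sketched as a single function. This is only an illustration of the classification step, assuming exact, hashable results; the name `reconcile` and the returned labels are hypothetical, and coordinator handling, commit, and diagnostics are left out.

```python
from collections import Counter


def reconcile(results):
    """Classify one voting round.

    `results` maps voter id -> result. Returns (outcome, value, voters), where
    `voters` lists the minority voters for a majority outcome, or every voter
    when no majority exists and local diagnostics are needed.
    """
    counts = Counter(results.values())
    value, freq = counts.most_common(1)[0]
    if freq == len(results):
        return "unanimous", value, []
    if 2 * freq > len(results):  # strictly more than 50% agree
        minority = sorted(v for v, r in results.items() if r != value)
        return "majority", value, minority
    return "no_majority", None, sorted(results)


outcome = reconcile({1: "grant", 2: "deny", 3: "grant"})
```

Here voter 2 ends up in the minority list, which is where the coordinator would initiate fault recovery.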

Choosing a new coordinator
• New coordinator chosen from a processor set
• Candidate processor set
  • could be all processors, when there is no majority
  • or the set of processors belonging to the majority

Check local node status
If status = "okay"
    broadcast status to other processors
    wait until broadcasts from other processors arrive
    if my node has the largest node id among "okay" processors
        I declare myself new coordinator
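The election rule (largest node id among the "okay" processors in the candidate set) is small enough to write down directly. The sketch below assumes the status broadcasts have already been collected into a dictionary; `elect_coordinator` and its parameters are hypothetical names.

```python
def elect_coordinator(status_by_node, candidates=None):
    """Return the largest node id among candidates reporting "okay", or None.

    `status_by_node` maps node id -> status string as received by broadcast.
    `candidates` restricts the election (e.g. to the majority processors);
    when omitted, all processors are candidates.
    """
    pool = status_by_node if candidates is None else candidates
    okay = [node for node in pool if status_by_node.get(node) == "okay"]
    return max(okay) if okay else None


statuses = {1: "okay", 2: "faulty", 3: "okay", 4: "faulty"}
new_coordinator = elect_coordinator(statuses)
```

Because every "okay" node applies the same deterministic rule to the same broadcast statuses, each reaches the same answer without a further message round, provided all nodes see the same set of broadcasts.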

Committing a Vote
• Coordinator responsible for committing the majority vote

If I am the coordinator
    broadcast result to majority
    wait for ack from all processors in majority
Else
    wait for result from coordinator
    send ack to coordinator

Problems with 2 Phase protocol
• What if the coordinator fails right before committing the majority vote?
  • The user (client) will receive a bad result
  • The probability is very low
  • Within acceptable risk parameters
• But transient faults could have an adverse effect on security
  • An attacker could control what result a user sees
  • The majority does not matter any more

Security and transient faults
• Transient faults could hamper security
  • Illuminating a single transistor in an IC using a laser
  • Serious threat to smartcard technology
  • Attack invented and perfected by Sergei Skorobogatov, Cambridge University
• “Sergei's work will trigger a generation change in smartcard technology. The immediate effect of his work is that many attacks on computer systems that were developed as theoretical possibilities by the research communities in the 1990s have suddenly become practical” (EE Times, May 2002)

A Solution
• Algorithm by Castro and Liskov
(Figure: a Client and three voters, with message steps numbered 1 to 3)
• Pros
  • Commit done by all voters as opposed to just one coordinator, hence more secure than the 2-Phase algorithm
• Cons
  • Does not scale well, since the client has to wait for f+1 replies
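The client-side rule mentioned under "Cons" can be sketched as follows. The slides do not define f; the sketch assumes it is the maximum number of faulty voters tolerated, so that f+1 matching replies from distinct voters must include at least one from a correct voter. The function name `accept_reply` is hypothetical.

```python
from collections import Counter


def accept_reply(replies, f):
    """Return the first result backed by f+1 distinct voters, else None.

    `replies` is an iterable of (voter_id, result) pairs in arrival order.
    Assumption: at most f voters are faulty.
    """
    seen = set()
    counts = Counter()
    for voter, result in replies:
        if voter in seen:  # a voter's repeated replies are counted once
            continue
        seen.add(voter)
        counts[result] += 1
        if counts[result] >= f + 1:
            return result
    return None


accepted = accept_reply([(0, "ok"), (1, "forged"), (2, "ok")], f=1)
```

The client keeps collecting replies until some result reaches f+1 matches, which is the waiting cost the slide refers to.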

Other algorithms
• More algorithms in the literature
  • Reiter, M., “The Rampart Toolkit for Building High-Integrity Services,” Theory and Practice in Distributed Systems, Lecture Notes in Computer Science 938, pp. 99-110.
  • Malkhi, D., Reiter, M., “Byzantine Quorum Systems,” Proceedings of the 29th ACM Symposium on Theory of Computing, May 1997.
  • Kihlstrom, K., et al., “The SecureRing Protocols for Securing Group Communication,” Proceedings of the 31st Hawaii International Conference on System Sciences, Vol. 3, pp. 317-326, Jan 1998.
  • Deswarte, Y., et al., “Intrusion Tolerance in Distributed Computing Systems,” Proceedings of the 1991 IEEE Symposium on Research in Security and Privacy, pp. 110-121, May 1991.

Inexact voting
• Drawbacks of the previous algorithms
  • They assume state machine replication in all voters
    • Two different non-faulty voters will produce the same result
  • There are use cases where this assumption does not hold
    • E.g., sensor values
• Inexact voting
  • Values that fall within a range of tolerance are considered equal
  • Equivalence classes
• Algorithms can be modified to handle inexact voting
  • BUT the performance overhead is large, since multiple inexact comparisons are needed to determine the majority
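One simple way to see where the overhead comes from is a pairwise tolerance comparison. The sketch below is an assumption-laden illustration, not one of the cited algorithms: numeric results are "equal" when they differ by at most `tolerance`, each value is tried as the centre of a group, and the largest group wins if it is a strict majority. The name `inexact_majority` is hypothetical.

```python
def inexact_majority(values, tolerance):
    """Return (representative, group) for the majority group, or None.

    Every value is compared against every other one, so n voters cost
    on the order of n*n comparisons.
    """
    best, best_group = None, []
    for candidate in values:
        group = [v for v in values if abs(v - candidate) <= tolerance]
        if len(group) > len(best_group):
            best, best_group = candidate, group
    if 2 * len(best_group) > len(values):  # strict majority required
        return best, best_group
    return None


readings = [100, 101, 99, 150]  # e.g. four sensor readings, one outlier
result = inexact_majority(readings, tolerance=1)
```

Note that "within tolerance" is not transitive (99 and 101 are each within 1 of 100 but not of each other), which is why the sketch groups values around a chosen representative rather than building true equivalence classes.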

Proposed Algorithm Assumptions
• Network with
  • Atomic broadcast capability
  • Bounded message delay
  • Fair sharing of the broadcast medium
• No voter will commit an answer until all voters are ready
  • Enforced using application-dependent thresholds
  • Any commits before this threshold are considered invalid
• A majority of voters must be fault-free for the system to work reliably
• Each voter can vote only once
  • Enforced by the User Interface module

Proposed Algorithm (1/2)
(Figure: three voters, the Interface Module, and the Client, with arrows numbered to match the steps listed here)
1. Commit, if not committed already
2. Compare with committed result
3. Timer expires, send result to client

Proposed Algorithm (2/2)
(Figure: three voters, the Interface Module, and the Client, with arrows numbered to match the steps listed here)
1. Commit, if not committed already
2. Compare with committed result
3. Dissent, if no match
4. Commit new vote
5. Reset timer expiry

Uniqueness of this algorithm
• Security increased
  • No specific coordinator node, hence reduced vulnerability
  • Even if the first commit to the User Interface module is compromised, it gets invalidated by dissenting voters
  • “Denial of Service” (vote-rigging) eliminated, since a vote from an already committed voter is ignored
• Fault-tolerance properties maintained as before
  • Result still based on majority
• Concerns about the User Interface module
  • Single point of failure
  • BUT this module is very simple, with very little computation
  • The User Interface module can be isolated from the voter complex
• Less intensive computation on the client
  • Does not have to reconcile all results from voters

Authentication
• Voters must be authenticated by the User Interface module before it accepts their commits
  • This should not increase the complexity of the module
  • Strong authentication with minimal interaction between voters and the interface module is preferred
• Example mechanism
  • Use S/KEY authentication

S/KEY authentication scheme
(Figure: the Voter sends successive votes to the Interface Module, accompanied by the values f^n(R), f^(n-1)(R), …, f(R); labels R … R’)
f is a one-way function
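The hash-chain idea behind this scheme can be sketched briefly. The code assumes the usual S/KEY arrangement, in which the verifier initially holds f^n(R) and each later token must hash to the previously accepted one; SHA-256 stands in for the one-way function f, and `chain` and `Verifier` are hypothetical names, not part of any S/KEY implementation.

```python
import hashlib


def f(x: bytes) -> bytes:
    """One-way function; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(x).digest()


def chain(seed: bytes, n: int):
    """Return [f(seed), f^2(seed), ..., f^n(seed)]."""
    out, cur = [], seed
    for _ in range(n):
        cur = f(cur)
        out.append(cur)
    return out


class Verifier:
    """Interface-module side: stores only the last accepted chain value."""

    def __init__(self, anchor: bytes):
        self.last = anchor  # initially f^n(R)

    def accept(self, token: bytes) -> bool:
        if f(token) != self.last:
            return False
        self.last = token  # the next token must hash to this one
        return True


links = chain(b"R", 4)            # voter side: f(R) .. f^4(R)
module = Verifier(links[-1])      # module is given f^4(R) at setup
first_ok = module.accept(links[-2])   # voter reveals f^3(R) with a vote
replay_ok = module.accept(links[-2])  # replaying the same token fails
```

The verifier needs one hash computation and one stored value per voter, which fits the requirement that authentication should not complicate the interface module.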

Distributed voting in WAN
• Centralized voting is not appropriate in a WAN setting
  • Multiple hops for a vote to reach the coordinator from a voter
  • Link failures could partition the network
  • Network congestion in the vicinity of the coordinator
  • Inexact voting could be computationally very intensive
    • Sensor data from a vast coverage area
  • A single coordinator is a target for malicious attack

Assumptions
• Reliable transport
• Messages are digitally signed and subject to verification before delivery to the upper layer
  • Unverifiable messages are discarded
• Presence of a public-key infrastructure
  • Every voter knows the public key of every other voter

Secure voting
(Figure: three voters exchanging messages, with arrows numbered to match the steps listed here)
1. Send signed vote to the other voters, hash the result and save it
2. Verify the signature and compare with own result
3. Hash the sender's result, sign it, and send the endorsement back
4. Verify the endorsement and compare it with the value saved in step 1

Performance
• Time complexity
  • Each voter signs its result and broadcasts it: O(1)
  • Each voter waits to receive one signed vote from every other voter: O(n)
  • Each voter does vote comparison: O(1)
  • Each voter receives an endorsement from every other voter: O(n)
  • Overall complexity is O(n)
• Number of messages
  • Each voter sends its vote to every other voter: n(n-1) messages
  • Each voter sends an endorsement to every other voter: n(n-1) messages
  • Overall O(n^2)
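For a concrete feel of the quadratic growth, the two message counts can be tallied for a given number of voters n. The helper name `message_count` is hypothetical; it simply adds the n(n-1) votes to the n(n-1) endorsements.

```python
def message_count(n: int) -> int:
    """Total messages for one round of secure voting among n voters."""
    votes = n * (n - 1)         # each voter sends its vote to every other voter
    endorsements = n * (n - 1)  # each voter endorses every other voter's vote
    return votes + endorsements


totals = {n: message_count(n) for n in (4, 8, 16)}
```

Doubling the number of voters roughly quadruples the traffic, which is the practical meaning of the O(n^2) bound.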

Concluding Remarks
• The intrusion tolerance mechanisms described provide a much more robust way of enforcing security than traditional techniques
• The intrusion tolerance mechanism based on fragmentation-scattering ensures confidentiality and integrity of data and availability of services
• Efficient and secure voting algorithms are an essential part of intrusion-tolerant systems
• More research is needed to make intrusion tolerance a “real” technology
• There is scope for further research overlapping security and fault tolerance

Fault tolerance vs Security
