Byzantine Replication: Performance Under Attack

Byzantine Replication Under Attack Yair Amir, Jonathan Kirsch, John Lane Johns Hopkins University Brian Coan Telcordia Technologies

Motivation • Society depends on large-scale, distributed computer systems for critical infrastructure. • Insider attacks are a real threat, even for systems designed with security in mind. • Byzantine replicationprovides fault tolerance by protecting against partial system compromises. • Attacker must compromise more than some threshold fraction of the system to cause inconsistency or prevent the system from functioning. • Systems perform well infault-free or benign fault runs. • What about performance when under attack?

The Downside of Asynchrony • Existing correctness criteria: safety and liveness • Safety: servers remain consistent. • Liveness: each update is eventually executed. • Protocols are designed to be safe in all executions. • Do not rely on synchrony for safety! • Guarantee liveness only when the network is sufficiently stable. • Real systems are not completely asynchronous. • Systems can satisfy much stronger performance guarantees than liveness during stable periods. • Consequence: Performance attacks! • An attacker can exploit the gap between what is promised during stable periods (liveness) and what is possible.

Performance Attacks:A First-Hand Look • Red-team attack on Steward [DSN 06]. • Goal was to violate safety or liveness. • Steward survived all of the attacks! • Most did not affect performance. • The system was slowed down in one experiment. • Speed of update ordering was slowed down by a factor of 5. • Big problem: • A better attack could slow the system down by a factor of 100. • But the system is still considered live! • Liveness is a necessary but insufficient correctness criterion for practical systems on wide-area networks.

Byzantine Performance Failures Previously Considered Byzantine Failures • If the adversary cannot violate safety and liveness, the next best thing is to slow down the system beyond usefulness. • Performance failures: send correct messages slowly but without triggering timeouts.

A New Problem: Performance Under Attack • Existing systems are vulnerable to performance attacks. • A small number of faulty servers can cause the system to make progress at an extremely slow rate -- indefinitely! • Leader-based protocols are vulnerable to performance attacks by a malicious leader. • Problem is magnified in wide-area networks, where it is difficult to predict the performance that should be expected of the leader. • Main challenges: • Developing meaningful performance metrics for evaluating Byzantine replication protocols. • Designing protocols that perform well according to these metrics, even when the system is under attack.

Outline • Motivation • Byzantine Performance Failures • Relevant Prior Work • Case Study: BFT Under Attack • The Prime Replication System • Bounded-Delay • Protocol Overview • Experimental Results • Summary

Relevant Prior Work • Leader-based Byzantine replication • BFT [Castro, Liskov 99] • Separating agreement from execution [Yin et al. 03] • Fast Byzantine Consensus [Martin, Alvisi 05] • Zyzzyva [Kotla et al. 07] • Randomized Byzantine replication • SINTRA [Cachin, Portiz 02] • RITAS [Moniz et al. 06] • Quorum-based Byzantine replication • Q/U [Abd-El-Malek et al. 05] • HQ [Cowling et al. 06]

request pre-prepare prepare commit reply Client (Leader) 0 1 2 3 Case Study: BFT Under Attack [Castro and Liskov 99] • Attack 1: Pre-Prepare Delay • Malicious leader can add delay into the ordering path by withholding its Pre-Prepare. • Non-leaders maintain a FIFO queue of pending updates. • Use timeouts to monitor the leader. • Timeout placed on execution of first update in queue. • Malicious leader can stay in power by ordering one update per queue per timeout period!

request pre-prepare prepare commit reply Client (Leader) 0 1 2 3 Case Study: BFT Under Attack [Castro and Liskov 99] • Attack 2: Timeout Manipulation • Timeout doubles every time the leader is replaced. • Use a denial of service attack to increase the timeout, then stop on a malicious leader. • Each update is eventually executed, but performance is much worse than if there were only correct servers.

The Prime Replication System • Performance-Oriented Replication in Malicious Environments • Leader-based protocol providing Bounded-Delay, a stronger guarantee than liveness, when the network is stable. • System components: • Prime Ordering Protocol (Preordering phase, Global ordering phase) • Suspect-Leader Protocol for detecting malicious leaders. • Main Ideas: • Resources needed by the leader to do its job are bounded and independent of system throughput. • Leader has “no excuse” for not sending timely messages. • Non-leader servers compute a threshold level of acceptable performance that the leader should meet. • Upper-bounded by a function of the latency between correct servers after the network stabilizes.

Bounded Delay • Prime-Stability: There is a time after which the following condition holds for a set of at least 2f+1 correct servers (the stable servers): • For each pair of stable servers r and s, there exists a value Min_Lat(r,s), unknown to the servers, such that if r sends a message to s, it will arrive with delay , where • Bounded-Delay: There exists a time after which the update latency for any update initiated by a stable server is upper-bounded.

Prime: Ordering Protocol PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L L = Leader O = Originator = Aggregation Delay No Attack • Preordering (PO) Phase: • Each server, o, disseminates its updates to the other servers (PO-Request). • Agreement protocol binds update u to preorder identifier (o, i), where u is the ith update originated by server o (PO-ACK). • Each server cumulatively acknowledges the updates it preorders (PO-ARU). O

3 2 1 3 Prime: Ordering Protocol PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L L = Leader O = Originator = Aggregation Delay No Attack O Preordering Protocol ua, ub, uc (1, 1, ua), (1, 2, ub), (1, 3, uc) Server 1 ud, ue (2, 1, ud), (2, 2, ue) Server 2 (3, 1, uf) uf Server 3 ug, uh, ui (4, 1, ug), (4, 2, uh), (4, 3, ui) Server 4 PO-ARU

Prime: Ordering Protocol PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L L = Leader O = Originator = Aggregation Delay No Attack • Global Ordering Phase: • Similar to BFT (Pre-Prepare, Prepare, Commit) • Leader periodically sends a Pre-Prepare containing a proof matrix (vector of PO-ARU messages). • Each globally ordered Pre-Prepare maps to a batch of preordered updates based on contents of proof matrix. • Final total order is obtained by deterministically ordering the updates in each batch based on preorder identifier. O

PO-ARU1’ PO-ARU1 PO-ARU2’ PO-ARU2 PO-ARU3 PO-ARU3’ PO-ARU4’ PO-ARU4 Prime: Ordering Protocol PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L L = Leader O = Originator = Aggregation Delay No Attack O Pre-Prepare 1 Pre-Prepare2 Global Ordering Protocol PP1 PP2 … Final Total Order

Attack Analysis PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L L = Leader O = Originator = Aggregation Delay No Attack • Key Points: • Preordering phase for updates sent by correct servers cannot be slowed down by faulty servers. • Once all correct servers receive a Pre-Prepare, global ordering cannot be slowed down by faulty servers. • Possible Attacks: • 1. Leader sends its Pre-Prepare to only some correct servers. • 2. Leader sends a Pre-Prepare with out-of-date PO-ARUs. • 3. Leader delays its Pre-Prepare. O

Addition 1: Pre-Prepare Flooding PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L L = Leader O = Originator = Aggregation Delay No Attack • Intuition: 1. The leader must withhold the Pre-Prepare from all correct servers to significantly impact latency. 2. If we can force the leader to send timely, up-to-date Pre-Prepares to at least one correct server, we can ensure timely ordering! O PO REQUEST PO ACK PO ARU PRE PREPARE PREPARE COMMIT L Attack O

Addition 2: Proof Matrix Messages PO REQUEST PO ACK PO ARU PROOF MATRIX PRE PREPARE PREPARE COMMIT L • Each server periodically sends a Proof-Matrix message, containing the latest PO-ARU messages it has received, to the leader. • A correct server expects a leader to include, in its next Pre-Prepare, PO-ARU messages that are at least as up-to-date as those in the Proof-Matrix message. • Why is this expectation justified? • A correct leader can simply adopt any PO-ARU messages that are more up to date than what it currently has. Attack O

Key Idea: Turn-Around Time PO REQUEST PO ACK PO ARU PROOF MATRIX PRE PREPARE PREPARE COMMIT L • Turn-around time • Time between sending a Proof-Matrix message, PM, and receiving a Pre-Prepare “covering” all of the PO-ARU messages in PM. • Key Observation: • The resources required by the leader to send a Pre-Prepare (bandwidth, CPU) are bounded and independent of system throughput. • We can use turn-around time as a measure by which to judge the leader! • Intuition: Force the leader to be timely by ensuring that it provides a fast enough turn-around time to at least one correct server. Attack O

Suspect-Leader Protocol • Protocol Strategy: • Dynamically determine an acceptable turn-around time based on roundtrip measurements (TAT_acceptable). • Use turn-around times measured in the current view to compute a measure of the current leader’s performance (TAT_leader). • Suspect the leader if TAT_leader >TAT_acceptable. • Design Challenges: • Malicious servers can lie to try to lower expectation of acceptable performance. • Leader could remain in power while going slowly. • Malicious servers can lie to make a correct leader look bad. • Would lead to continuous view changes.

= Maximum delay between correct servers = Aggregation delay Suspect-Leader: Key Properties • Any server that retains a role as leader must provide a TAT to at least one correct server that is no more than • Maximum update latency: • There exists a set of at least f+1 correct servers that will not be suspected by any correct server if elected leader. • Aggressive but not overly aggressive. Bounded-Delay!

Experimental Results • 7 servers (f = 2) • Symmetric network • 50ms diameter, 10 Mbps links • Leader performs just well enough to stay in power. • BFT: aggressive timeout (300ms) • BFT: Pre-Prepare delay • Prime: • Leader adds as much delay as possible. • Non-leader servers force as much reconciliation as possible.

Summary • Existing leader-based Byzantine replication protocols are vulnerable to performance attacks. • Liveness is not a meaningful performance metric for evaluating Byzantine replication protocols. • Bounded-Delay: a new performance metric. • Can we provide stronger guarantees? • Can we guarantee a minimum throughput? • Prime: a new Byzantine replication protocol. • Achieves Bounded-Delay when the network is sufficiently stable.

Questions?

Byzantine Replication: Performance Under Attack

Byzantine Replication: Performance Under Attack

Presentation Transcript

Buddhism Under Attack:

CS5412: The cloud Under Attack!

Control Systems Under Attack !?

THE FAMILY UNDER ATTACK

New Deal Under Attack

Under attack!

Depressurization of vessels under fire attack.

CHURCH HISTORY The Church Under Attack!

!! Are we under attack !!

Indian Christians under attack

Distributed Selfish Replication under Node Churn

A Modern President Under Attack:

Patent Eligibility Under Attack

The New Deal Comes Under Attack

General Aviation: Under Attack

The New Deal Under Attack

Paris under attack

Illinois under attack canine assault

!! Are we under attack !!

We are Under Attack