Lecture 8. Partial synchrony and the CAP theorem.

HW:

Problem 1. (10 points)

Recall the partially synchronous model from lecture: for an a priori known bound ∆ and an unknown (adversarially chosen) time GST, a message sent at time t is guaranteed to arrive by time max{t, GST} + ∆. (The description of a protocol can depend on ∆ but not GST.) In effect, the network is asynchronous until time GST, after which it is synchronous (with maximum message delay ∆). One intuition we mentioned in lecture is that an ideal consensus protocol should adapt automatically to message delays, operating at close to the network speed. One formalization of this idea is: we’d like a protocol that works (i.e., satisfies safety and liveness) whenever it is run in the synchronous model with some adversarially chosen (but finite) maximum message delay ∆. (In this “unknown ∆” variant of partial synchrony, the protocol description cannot depend on ∆.)

(a) (5 points) Prove that if there is a Byzantine agreement protocol that guarantees agreement, validity, and (eventual, post-GST) termination in the original “GST version” of the partially synchronous model with f (Byzantine) faulty nodes, then there is also such a protocol in the “unknown ∆ version” (i.e., one that satisfies agreement, validity, and termination, no matter what ∆ is).

(b) (5 points) Prove that if there is a Byzantine agreement protocol that guarantees agreement, validity, and termination in the “unknown ∆ version” of the partially synchronous model with f (Byzantine) faulty nodes (no matter what ∆ is), then there is also such a protocol in the original “GST version” (i.e., one that satisfies agreement, validity, and eventual termination).

Problem 2. (16 points)

This problem continues our study of Byzantine agreement protocols in the partially synchronous model (the original GST version).

(a) (proved in class) Prove that with n = 3 nodes and at most f = 1 Byzantine node, no deterministic BA protocol satisfies agreement, validity, and eventual termination. Your proof should apply even assuming the availability of PKI.

(b) (6 points) Extend the impossibility result in (a) to all n ≥ 3 and f ≥ n/3.

(c) (5 points) Compare this impossibility result to the one from Lecture 3. In what way is it stronger?

(d) (5 points) Prove that this impossibility result extends to the SMR problem, in the following sense: if f ≥ n/3, then no deterministic protocol for the SMR problem satisfies consistency (at all times) and eventual liveness (required only after GST).

Problem 3. (15 points)

This problem is the same as the previous one, except with a crash-fault adversary.

(a) (10 points) Prove that with n = 2 nodes and at most f = 1 faulty node (with a crash-fault adversary), no deterministic BA protocol satisfies agreement, validity, and eventual termination.

(b) (5 points) Extend the impossibility result in (a) to all n ≥ 2 and f ≥ n/2. (Remark: there is also a matching positive result, meaning there exists a deterministic BA protocol that, in the partially synchronous model and with f < n/2 crash-fault nodes, satisfies agreement, validity, and eventual termination.)

The story so far

Permissioned setting, by network model:

  • Synchrony: ✅ PKI, any f<n ⇒ BB protocol; ❌ no PKI, f≥n/3 ⇒ no BB protocol
  • ? (between the two): ?
  • Asynchrony: ❌ f=1 ⇒ no BA protocol

Synchronous model (Lec 5 + 6)

  • shared global clock + a priori known bound ∆ on max message delay
  • good news: strong positive results (Dolev-Strong ⇒ BB + SMR, no matter what f is)
  • bad news: assumptions are too strong (need to account for outages + DoS attacks)

Asynchronous model (Lec 7)

  • no global clock, no assumption on message delays (other than eventual delivery)
  • good news: weak assumptions ⇒ any positive results are automatically impressive + useful
  • bad news: FLP ⇒ no positive results possible! (even if f=1, because the assumptions are too weak)

Idea: outages/attacks end eventually, right?

The Partially Synchronous Model

Idea:

  • “normal conditions” ⇒ synchronous model
  • “attack” ⇒ asynchronous model (once the attack stops, we want the protocol to quickly resume normal operation)
  • The model starts in the asynchronous setting and then transitions to the synchronous one. More generally, one can “stack” this model (alternating asynchronous and synchronous periods) to imitate real life, and the same results carry over.
  • Credits: C. Dwork, N. Lynch, and L. Stockmeyer, Consensus in the Presence of Partial Synchrony, Journal of the ACM, 1988.

Assumptions:

  • Shared global clock (OR one can assume “bounded drift” of clocks; they showed this is basically equivalent to shared global clock)
  • Known bound ∆ on max msg delay in normal conditions
  • Unknown transition time GST (“global stabilization time”) from the asynchronous to the synchronous model (think of it as the time at which the hypothetical DoS attack or outage ends). Since GST is unknown to the protocol, the protocol must cope with the transition to the synchronous setting on its own.
  • Scheme: |t=0|____Async phase____|t=GST, unknown|____Sync phase, ∆ msg delay ______
  • More formally, promises on message delivery: (1) a msg sent at time t ≤ GST arrives by time GST+∆; (2) a msg sent at time t ≥ GST arrives by time t+∆. Equivalently, a msg sent at time t arrives by time max{t, GST}+∆.
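The two delivery promises collapse into a single formula, which the following minimal sketch evaluates. The function name `delivery_deadline` and the sample values are illustrative assumptions, not part of the lecture.

```python
def delivery_deadline(sent_at: float, gst: float, delta: float) -> float:
    """Latest possible arrival time of a message sent at time `sent_at`
    in the partially synchronous model: max{t, GST} + delta."""
    return max(sent_at, gst) + delta


# Sent during the asynchronous phase: only guaranteed to arrive by GST + delta.
early = delivery_deadline(sent_at=3, gst=10, delta=2)
# Sent after GST: guaranteed to arrive within delta of being sent.
late = delivery_deadline(sent_at=15, gst=10, delta=2)
print(early, late)
```

Note that the protocol itself could never call such a function, since GST is unknown to it; the formula only describes what the adversary is allowed to do.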

Goals:

  • To develop a protocol, i.e. specify msgs to send as a function of node’s private input + received msgs + current time step. It should satisfy:
  • Safety OR liveness during async phase
  • Safety AND liveness during sync phase

Remark. There is a second, roughly equivalent version of the partially synchronous model in which there is no asynchronous phase: the network is synchronous throughout with maximum message delay ∆, but ∆ is not known a priori (so the protocol description cannot depend on it).

Goals for a Consensus Protocol

Note: the FLP impossibility result does not immediately apply, because the adversary controlling msg delivery is fundamentally weaker here: it cannot arbitrarily choose the order of msg delivery (except during the asynchronous phase).

Traditional goals:

  1. Not long after GST, safety + liveness both hold (for SMR, BA, BB problems safety and liveness mean slightly different things)
  2. Safety holds always (even in asynchronous phase) [longest chain protocols instead favor liveness over safety!]

Big result: [DLS] (THE result cited in many, many blockchain whitepapers) 1. + 2. achievable ⇔ f < n/3 [i.e., n ≥ 3f+1]

  • (⇒) is the impossibility result: if f ≥ n/3, then no consensus protocol satisfies 1. and 2.
  • (⇐) is achieved by the Tendermint protocol, which we will study in a future lecture.

Intuition for Impossibility (f ≥ n/3 ⇒ protocol cannot satisfy 1. + 2.)

Fact: unlike in Lecture 6, here the PKI assumption won’t really matter (even though the threshold is the same)

Therefore, impossibility result must be driven by the threat of unbounded msg delays, not the simulation of honest nodes by Byzantine nodes (as in the hexagon proof).

Intuition:

  1. Can only wait to hear from n-f nodes before taking action [by termination + the fact that Byzantine nodes may never send msgs, even after GST]. Issue: the missing f nodes might be delayed rather than Byzantine (if pre-GST), so f of the n-f nodes you’ve heard from might still be Byzantine.
  2. To avoid getting tricked, need a majority of those n-f nodes to be honest (else, whom to believe?) ⇒ need f < (n-f)/2, i.e., 2f < n-f, or equivalently, f < n/3.
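As a sanity check on the arithmetic in this intuition, the sketch below confirms by brute force that "honest nodes outnumber Byzantine nodes among the n-f responders" is the same condition as 3f < n. The function name `majority_of_responders_honest` is a hypothetical helper introduced only for this illustration.

```python
def majority_of_responders_honest(n: int, f: int) -> bool:
    """True if, among the n - f nodes heard from, honest nodes strictly
    outnumber Byzantine ones even when all f Byzantine nodes responded."""
    responders = n - f
    honest_responders = responders - f  # worst case: f responders are Byzantine
    return honest_responders > f


# The condition coincides with f < n/3 (equivalently, n >= 3f + 1).
for n in range(1, 50):
    for f in range(0, n + 1):
        assert majority_of_responders_honest(n, f) == (3 * f < n)

print(majority_of_responders_honest(4, 1), majority_of_responders_honest(3, 1))
```

The smallest interesting cases are n = 4, f = 1 (condition holds) and n = 3, f = 1 (condition fails), the latter being the case treated in the proof.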

Proof of Impossibility (f ≥ n/3 ⇒ protocol cannot satisfy 1. + 2.)

We will state it for BA problem, but it will automatically apply to SMR

Theorem. In the partially synchronous model with f ≥ n/3, there is no protocol for Byzantine agreement satisfying agreement, and eventually (after GST) validity and termination.

Proof: [for the n=3, f=1 case; general case is similar]

Setup: three nodes A, B, C. A is honest with input 1, C is honest with input 0, and B is Byzantine.

Assume, for contradiction, that the requisite protocol exists. Strategy of the adversary:

  • Delay msgs between A and C until after time max(T_1, T_2) (see below for T_1, T_2). The adversary can do this because it gets to choose GST, e.g., GST = max(T_1, T_2) + 1.
  • B interacts with A as if it is an honest node with v=1
  • B interacts with C as if it is an honest node with v=0

From the perspective of A:

  • A cannot distinguish between the following two situations: (1) B is Byzantine and msgs from C simply have not arrived yet; (2) B is honest and C has crashed forever (as a Byzantine node).
  • Because the protocol must satisfy termination, A must eventually output something, without being able to tell whether it is in situation (1) or (2).
  • In situation (2), termination + validity force A, hearing msgs from B with v=1 (and having input 1 itself), to output 1. By indistinguishability, A also outputs 1 in situation (1).
  • Say A outputs 1 at time T_1

From the perspective of C:

  • The same argument shows that C outputs 0 at time T_2

We arrive at a contradiction: A and C are both honest, yet they output different values, so the protocol does not satisfy agreement.

qed
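To see the adversary's strategy in action, here is a small sketch that runs it against one hypothetical toy decision rule. The rule `toy_decide` is an assumption made up for this illustration (it is not a protocol from the lecture); the proof above shows that every deterministic protocol fails, while this only demonstrates the failure for one concrete candidate.

```python
def toy_decide(own_input, heard):
    """Hypothetical toy rule for n=3, f=1: decide on your own input as soon as
    one other node reports the same value (n - f = 2 nodes, counting yourself)."""
    for value in heard.values():
        if value == own_input:
            return own_input
    return None  # keep waiting


# Adversary's schedule before GST: msgs between A and C are delayed,
# and Byzantine B tells each honest node what it wants to hear.
heard_by_A = {"B": 1}  # B poses as an honest node with input 1
heard_by_C = {"B": 0}  # B poses as an honest node with input 0

out_A = toy_decide(own_input=1, heard=heard_by_A)
out_C = toy_decide(own_input=0, heard=heard_by_C)
print(out_A, out_C)  # the two honest nodes disagree
```

Making the toy rule wait for all three nodes would avoid this disagreement, but then a silent Byzantine node would block termination forever, which is exactly the tension the proof exploits.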

The CAP Theorem (Brewer conjectured, Gilbert/Lynch proved, early 2000s)

  • C for “consistency” [distributed system’s behavior indistinguishable from a centralized system]
  • A for “availability” (~liveness) [every command issued by a client eventually is carried out]
  • P for “partition tolerance” [properties C and A should hold even when there’s a network partition]

Network partition: bunch of nodes [A] ←all msgs blocked→ bunch of nodes [B] (e.g., due to a DoS attack on nodes B)

CAP Theorem: forced to pick 2 out of 3. (regardless of f)

⇒ during a network partition, must choose between availability and consistency.

Proof idea:

Initially, x=0.

Client issues a command “x:=1” to a node i that is in [A].

All future clients issue command “return x”.

⇒ if i ever answers “1”, this violates consistency (because nodes in [B] all return 0)

⇒ if i always answers “0”, this violates availability=liveness (because Client’s command “x:=1” was not executed)

qed.
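The dichotomy in the proof idea can be replayed with a minimal two-replica sketch. The class `Replica`, the function `run_partition`, and the flag `prefer_availability` are hypothetical names chosen for this illustration; the sketch assumes one replica on each side of a partition that never heals.

```python
class Replica:
    """A node holding a copy of the variable x, initially 0."""

    def __init__(self):
        self.x = 0


def run_partition(prefer_availability: bool):
    """One replica in [A], one in [B]; no messages cross the partition.
    Returns (write_executed, reads_agree)."""
    a, b = Replica(), Replica()
    # Client sends "x := 1" to replica a in [A].
    if prefer_availability:
        a.x = 1  # a applies the write without hearing from b
        write_executed = True
    else:
        write_executed = False  # a refuses to apply the write while partitioned
    # Later clients issue "return x" on both sides.
    reads_agree = a.x == b.x
    return write_executed, reads_agree


print(run_partition(prefer_availability=True))   # available, not consistent
print(run_partition(prefer_availability=False))  # consistent, not available
```

Neither setting of the flag yields both properties, which matches the conclusion that availability and consistency cannot both be kept during a partition.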

Examples:

  • Google (with its many databases) prefers A to C, of course.
  • A bank (with its many databases) prefers C to A, of course.

FLP Theorem vs CAP Theorem

Common takeaway: when under attack (asynchrony / network partition), need to choose between safety/consistency and liveness/availability.

Why, then, is the FLP theorem so much harder to prove than the CAP theorem? Here are the distinctions:

CAP

  1. network partition can last forever
  2. adversary restricted to network partition
  3. applies with all honest nodes [only adversary is msg delivery]

FLP

  1. every msg eventually delivered
  2. adversary can do whatever (subject to eventual delivery)
  3. needs at least one faulty node (though one crash fault suffices, which resembles infinite message delay)

(1) ⇒ adversary is stronger in CAP than in FLP.

(2) ⇒ adversary is stronger in FLP than in CAP, but this turns out to be not important.

Conclusion: the essence of the FLP proof is that a single crash fault already captures enough of the power of infinite msg delays to trigger the same conclusion as in the CAP theorem.