Lecture 8. Partial synchrony and the CAP theorem.

Resources:

HW:

The story so far

	Synchrony	?	Asynchrony
Permissioned	✅ PKI, any f<n ⇒ BB protocol; ❌ no PKI, f≥n/3 ⇒ no BB protocol	?	❌ f=1 ⇒ no BA protocol

Synchronous model (Lec 5 + 6)
- shared global clock + a priori known bound ∆ on max message delay
- good news: strong positive results (Dolev-Strong ⇒ BB + SMR, no matter what f is)
- bad news: assumptions are too strong (need to account for outages + DoS attacks)

Asynchronous model (Lec 7)
- no global clock, no assumption on message delays (other than eventual delivery)
- good news: weak assumptions ⇒ any positive result is automatically impressive + useful
- bad news: FLP ⇒ no positive results possible! (even if f=1, because the assumptions are too weak)

Idea: outages/attacks end eventually, right?

The Partially Synchronous Model

Idea:

“normal conditions” ⇒ synchronous model
“attack” ⇒ asynchronous model (once the attack stops, want the protocol to quickly resume normal operation)
The model starts in the asynchronous setting and then transitions to the synchronous one. More generally, one can “stack” copies of this model (alternating attack periods and normal periods) to imitate real life, and the same results carry over.
Credits: C. Dwork, N. Lynch, and L. Stockmeyer, Consensus in the Presence of Partial Synchrony, Journal of the ACM, 1988.

Assumptions:

Shared global clock (OR one can assume “bounded drift” of clocks; they showed this is basically equivalent to shared global clock)
Known bound ∆ on max msg delay in normal conditions
Unknown transition time GST (“global stabilization time”) from the asynchronous to the synchronous model (this is the time at which the hypothetical DoS attack or outage ends). GST is unknown to the nodes, so the protocol must work correctly without being told when the transition happens.
Scheme: |t=0|____Async phase____|t=GST, unknown|____Sync phase, ∆ msg delay ______→
More formally, promises on message delivery:
(1) msg sent at time t ≤ GST ⇒ arrives by time GST+∆
(2) msg sent at time t ≥ GST ⇒ arrives by time t+∆
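
The two delivery promises collapse into a single deadline, max(t, GST) + ∆. A minimal Python sketch of that rule follows; the function names `delivery_deadline` and `is_legal_delivery` are made up for illustration and do not come from the DLS paper.

```python
def delivery_deadline(sent_at: float, gst: float, delta: float) -> float:
    """Latest time by which a msg sent at `sent_at` must arrive.

    Promise (1): sent at t <= GST  =>  arrives by GST + delta.
    Promise (2): sent at t >= GST  =>  arrives by t + delta.
    Both cases are covered by max(t, GST) + delta.
    """
    return max(sent_at, gst) + delta


def is_legal_delivery(sent_at: float, arrives_at: float,
                      gst: float, delta: float) -> bool:
    """True if an adversarially chosen arrival time respects the promises."""
    return sent_at <= arrives_at <= delivery_deadline(sent_at, gst, delta)


print(delivery_deadline(3, gst=10, delta=2))   # sent pre-GST  -> 12
print(delivery_deadline(15, gst=10, delta=2))  # sent post-GST -> 17
```

Before GST the adversary may pick any arrival time up to GST+∆, which is what lets it stall msgs for as long as it likes by pushing GST far out.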

Goals:

To develop a protocol, i.e. specify msgs to send as a function of node’s private input + received msgs + current time step. It should satisfy:
Safety OR liveness during async phase
Safety AND liveness during sync phase

Remark. There is a second, roughly equivalent version of the partially synchronous model: there is no async phase, only a sync phase with max msg delay ∆, except that ∆ is unknown a priori.

Goals for a Consensus Protocol

Note: the FLP impossibility result does not immediately apply, because the adversary controlling msg delivery is fundamentally weaker here: it cannot delay msgs or choose their delivery order arbitrarily (except during the asynchronous phase).

Traditional goals:

1. Not long after GST, safety + liveness both hold (for SMR, BA, BB problems safety and liveness mean slightly different things)
2. Safety holds always (even in the asynchronous phase) [longest-chain protocols instead favor liveness over safety!]

Big result: [DLS] (THE result cited in many many blockchain whitepapers) 1. + 2. achievable $\iff$ f < n/3 [i.e., n ≥ 3f+1]

$\implies$ is the impossibility result: if f ≥ n/3, then there is no consensus protocol satisfying 1. and 2.
$\impliedby$ is the Tendermint protocol we will be studying in the future

Intuition for Impossibility (f ≥ n/3 ⇒ protocol cannot satisfy 1. + 2.)

Fact: unlike in Lecture 6, here the PKI assumption won’t really matter (even though the threshold is the same)

Therefore, impossibility result must be driven by the threat of unbounded msg delays, not the simulation of honest nodes by Byzantine nodes (as in the hexagon proof).

Intuition:

Can only wait to hear from n-f nodes before taking action [by termination + the fact that Byzantine nodes may never send msgs, even after GST]. Issue: the missing f nodes might be delayed rather than Byzantine (if pre-GST), so f of those n-f nodes you’ve heard from might still be Byzantine.
To avoid getting tricked, need to have the majority of those n-f nodes honest (else, whom to believe?) ⇒ need $f < \frac{1}{2}(n-f)$, or equivalently, $f < n/3$.
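
The equivalence between $f < \frac{1}{2}(n-f)$ and $f < n/3$ is one line of algebra (2f < n-f ⇔ 3f < n). As a sanity check, here is a small Python sketch that compares the two conditions over a range of n and f; both function names are illustrative only.

```python
def honest_majority_in_quorum(n: int, f: int) -> bool:
    """After hearing from n - f nodes, up to f of them may be Byzantine.

    Returns True if the Byzantine nodes are a strict minority of that
    quorum, i.e. f < (n - f) / 2, written without division.
    """
    quorum = n - f
    return 2 * f < quorum


def below_one_third(n: int, f: int) -> bool:
    """The threshold from the [DLS] result: f < n/3."""
    return 3 * f < n


# The two conditions agree for every n and every f < n in this range.
for n in range(1, 100):
    for f in range(n):
        assert honest_majority_in_quorum(n, f) == below_one_third(n, f)

print(honest_majority_in_quorum(4, 1))  # n = 3f+1 is enough
print(honest_majority_in_quorum(3, 1))  # n = 3f is not
```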

Proof of Impossibility (f ≥ n/3 ⇒ protocol cannot satisfy 1. + 2.)

We will state it for BA problem, but it will automatically apply to SMR

Theorem. In the partially synchronous model with f ≥ n/3, there is no protocol for Byzantine agreement satisfying agreement, and eventually (after GST) validity and termination.

Proof: [for the n=3, f=1 case; general case is similar]

Assume, for contradiction, that the requisite protocol exists. Call the nodes A, B, C, where B is Byzantine, A has input 1, and C has input 0. Strategy of the adversary:

Delay all msgs between A and C for a long time ($\max(T_1, T_2)+1$; see below for $T_1$, $T_2$). It can do this because the adversary gets to choose GST.
B interacts with A as if it is an honest node with v=1
B interacts with C as if it is an honest node with v=0

From the perspective of A:

A cannot distinguish between the following two situations: (1) B is Byzantine and msgs from C simply have not arrived yet; (2) B is honest and C is Byzantine and stays silent forever.
Because the protocol must satisfy termination, at some point A has to output something, without knowing whether it is in situation (1) or (2).
Termination + Validity ⇒ A, hearing msgs from B with v=1, outputs 1 [taking A’s own input to be 1, in situation (2) both honest nodes A and B have input 1, so validity forces the output 1; A’s view is the same in situation (1), so it outputs 1 there too]
Say A outputs 1 at time $T_1$

From the perspective of C:

Same argument shows that C outputs 0 at time $T_2$

We arrive at a contradiction: the honest nodes A and C output different values, so the protocol doesn’t satisfy agreement.

qed
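
To see the adversary’s strategy in action, here is a minimal Python sketch for n=3, f=1. It runs the attack against one hypothetical decision rule (“output v once n-f votes for v are in hand, counting your own”), which is an assumption made up for this illustration; the theorem itself rules out every protocol, not just this one.

```python
from collections import Counter

N, F = 3, 1
QUORUM = N - F  # a node cannot afford to wait for more than n - f votes


def toy_decide(votes):
    """Hypothetical rule: output v once QUORUM votes for v are in hand."""
    counts = Counter(votes.values())
    for value, count in counts.items():
        if count >= QUORUM:
            return value
    return None  # not enough matching votes yet, keep waiting


def run_attack():
    inputs = {"A": 1, "C": 0}  # inputs of the two honest nodes
    # Votes each honest node holds, keyed by sender; own vote is counted.
    received = {name: {name: v} for name, v in inputs.items()}
    # Pre-GST: the adversary holds back every msg between A and C,
    # so neither ever sees the other's vote before deciding.
    # Byzantine B equivocates: it plays "honest with v=1" toward A
    # and "honest with v=0" toward C.
    received["A"]["B"] = 1
    received["C"]["B"] = 0
    return {name: toy_decide(votes) for name, votes in received.items()}


outputs = run_attack()
print(outputs)  # {'A': 1, 'C': 0}: agreement is violated
```

Raising the threshold to 3 votes would block this attack, but then a single silent Byzantine node would prevent termination after GST, which is the trade-off the proof exploits.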

The CAP Theorem (Brewer conjectured, Gilbert/Lynch proved, early 2000s)

C for “consistency” [distributed system’s behavior indistinguishable from a centralized system]
A for “availability” (~liveness) [every command issued by a client eventually is carried out]
P for “partition tolerance” [properties C and A should hold even when there’s a network partition]

Network partition: bunch of nodes [A] ←all msgs blocked→ bunch of nodes [B] (e.g., due to a DoS attack on nodes B)

CAP Theorem: forced to pick 2 out of 3. (regardless of f)

⇒ during a network partition, must choose between availability and consistency.

Proof idea:

Initially, x=0.

Client issues a command “x:=1” to a node $i$ that is in [A].

All future clients issue command “return x”.

⇒ if $i$ ever answers “1”, this violates consistency (because nodes in [B] all return 0)

⇒ if $i$ always answers “0”, this violates availability=liveness (because Client’s command “x:=1” was not executed)

qed.
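
The proof idea can be replayed with a toy two-group store. This is a sketch under simplifying assumptions: one value per group, instant replication when there is no partition, and a `prefer` switch (a name invented here) that says which property the system gives up during a partition.

```python
def serve(partitioned: bool, prefer: str):
    """Toy store with node groups [A] and [B].

    `prefer` is "availability" or "consistency" (labels made up for
    this sketch). Returns (write_executed, (read_from_A, read_from_B)).
    """
    x_a, x_b = 0, 0  # initially x = 0 everywhere

    # A client sends "x := 1" to node i in group [A].
    if partitioned and prefer == "consistency":
        write_executed = False  # refuse: [B] cannot be kept in sync
    else:
        x_a = 1
        if not partitioned:
            x_b = 1  # replication reaches [B] only without a partition
        write_executed = True

    # All later clients send "return x", one to each group.
    return write_executed, (x_a, x_b)


for partitioned in (False, True):
    for prefer in ("availability", "consistency"):
        print(partitioned, prefer, serve(partitioned, prefer))
```

Without a partition both choices execute the write and both groups return 1. Under a partition, preferring availability yields reads (1, 0), which breaks consistency, while preferring consistency leaves the write unexecuted, which breaks availability.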

Examples:

Google (with its many databases) prefers A to C, of course.
A bank (with its many databases) prefers C to A, of course.

FLP Theorem vs CAP Theorem

Common takeaway: when under attack (asynchrony / network partition), need to choose between safety/consistency and liveness/availability.

Why, then, was the FLP thm so much harder to prove than the CAP thm? Here are the distinctions:

CAP

(1) network partition can last forever
(2) adversary restricted to network partition
(3) applies with all honest nodes [only adversary is msg delivery]

FLP

(1) every msg eventually delivered
(2) adversary can do whatever (subject to eventual delivery)
(3) needs at least one faulty node (though one crash fault suffices, which resembles infinite message delay)

(1) ⇒ adversary is stronger in CAP than in FLP.

(2) ⇒ adversary is stronger in FLP than in CAP, but this turns out to be not important.

Conclusion: the essence of the FLP proof is that a single crash fault already captures enough of the power of infinite msg delays to trigger the same conclusion as in the CAP theorem.