Manufacturing Consensus: An Overview of Distributed Consensus Implementations
by Devashish Purandare · edited by Sohum Banerjea and Lindsey Kuper
Consensus protocols, as the name suggests, are a class of techniques by which some number of distributed participants can agree on a single value – for instance, to resolve conflicting values at different replicas of a data object in a replicated data store. Consensus protocols are some of the most complex algorithms in distributed systems. How do we get replicas to agree?
Don’t do it!
The first rule of consensus is “avoid using it wherever possible”. Consensus is expensive in terms of performance, and it is notoriously hard to reason about and implement. If you can avoid it, you should. There are several ways to avoid running a full-blown consensus algorithm to resolve conflicts between replicas:
- Skip coordination!: The CALM principle shows us that if we can design our program so that “adding things to the input can only increase the output”, we can reliably guarantee eventual consistency as long as all the updates are sent to each replica at least once. The Bloom programming language makes use of this idea.
- Exploit commutativity and idempotence!: CRDTs (Conflict-free Replicated Data Types) are built around the notion of monotonic growth with respect to a lattice. For instance, if you maintain a replicated set that can only grow by means of the set union operation, every replica of the set will eventually converge. There are several challenges with implementations of CRDTs, including “state explosion” (the state maintained for each replica can grow exponentially, to the detriment of storage and performance), complexity, and garbage collection. Austen’s previous blog post goes into more depth about the specifics.
- Talk is cheap; when in doubt, ask around! Gossip or epidemic protocols are widely used to resolve conficts between replicas. Introduced in a 1987 Xerox PARC paper, gossip protocols make progress by sharing update information between replicas. Replicas contact other, randomly chosen replicas and compare contents. Any differences found are fixed by comparing timestamps, or using an application-specific conflict resolution approach. Rumor-mongering is used to maintain hot data changes as special values, until a certain number of replicas reflect them.
Reasons to avoid coordination and consensus when possible
- It is slow: Consensus requires a significant overhead of message passing, locking, voting, and the like until all replicas agree. This is time- and resource-intensive, and can be complicated by failures and delays in the underlying network.
- Partial failures: One of the biggest issues with distributed systems is that partial failures can happen, leading to arbitrary message drops, while the system continues to work. Partial failures are notoriously hard to detect, and can compromise the correctness property of any protocol not designed explicitly to deal with them.
- It is impossible!!: In their Dijkstra-Award-winning paper “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch, and Patterson proved that in an asynchronous network environment, it is impossible to ensure that distributed consensus protocols will succeed – the famous FLP result. A failure at a particular moment can void the correctness of any consensus algorithm. As disheartening as this seems, there’s a way around it. FLP assumes deterministic processes, and by making our algorithms non-deterministic (as in “Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols”) can guarantee that a solution is generated “as long as a majority of the processes continue to operate.”
Then why do we need consensus?
If consensus is so terrible, why are we still using complex techniques to achieve it?
Sometimes, eventually consistent isn’t good enough! Strong consistency conditions such as linearizability and serializability require not only that replicas’ final states agree, but that clients cannot observe different intermediate states on the way there. Ensuring that these conditions hold sometimes leaves us with no choice but to use coordination and consensus protocols.
How do consensus protocols work?
It is really hard! Ask the experts! The abstract of Paxos Made Simple (rather infamously) states:
The Paxos algorithm, when presented in plain English, is very simple.
We will discuss why it can be hard to implement nevertheless.
Listen and Learn: Most consensus protocols involve similar steps:
- In an initial phase, replicas pick a Leader that they are going to listen to. The Leader is picked by a majority of nodes using an election-like process.
- Leaders propose values and changes.
- These values are broadcast to Listeners, who select a correct value based on heuristics, such as a minimum quorum.
- The rest of the replicas do not play part in picking a winner; they are told the winning value and update themselves to reflect it.
We’ll discuss two popular consensus protocols, Paxos and Raft.
Leslie Lamport’s paper “The Part-Time Parliament” introduced the world to the Paxos algorithm – or it would have, had anyone understood it. Lamport followed up with “Paxos Made Simple” after realizing that the original paper was a little too Greek for the target audience. Since then, there have been numerous variations on, and optimizations of, the original Paxos algorithm.
Paxos Made Simple
The “Paxos Made Simple” paper states three invariants for correctness:
- Only a value that has been proposed may be chosen,
- Only a single value is chosen, and
- A process never learns that a value has been chosen unless it actually has been.
The algorithm breaks down processes into Proposers (P), Acceptors (A), and Learners (L). The following rules govern how acceptors accept proposals:
- Each acceptor must accept the first value v that it gets.
- After that, the acceptor can only accept any proposal n as long as it has value v and n is greater than its current proposal number.
- Acceptors in the majority with the highest proposal number are picked winners.
Proposers start with a prepare request with number n. An accepted prepare request means that the acceptor will never accept any other request with value less than n. The latest accepted value is communicated along with this response to the proposer. Once a majority has accepted a value, it can be communicated as an accept request to the acceptors as well as the Learners.
The paper’s described system has a distinguished proposer and Learner elected by the processes from among themselves. Lamport’s Paxos is simpler when there is a single P; issues arise when there are multiple proposers. But the paper is short on implementation details, and implementing Paxos in practice is nontrivial, as the next paper we’ll look at shows.
Paxos Made Live
“Paxos Made Live”, originally an invited talk at PODC ‘07 from an engineering team at Google, offers us insight into Google’s Chubby system, which implements distributed locking over the GFS filesystem using Paxos. Chubby uses Multi-Paxos, which repeatedly runs instances of the Paxos algorithm so that a sequence of values, such as a log, can be agreed upon. The implementation is similar to the one we just discussed: elect a coordinator, broadcast an accept message, and if accepted, broadcast a commit message. Coordinators have ordering, and restrictions on values, to ensure that everybody settles on a single value.
Because Multi-Paxos can leave behind some replicas (say, in a network partition), they design a catch up mechanism, using a technique not too different from write-ahead logging.
If you can ensure that the leader doesn’t crash and network doesn’t cause arbitrary delays, the authors suggest chaining updates across the Paxos instances, as there’s only a small chance of network changes between subsequent transfers. This can eliminate doing the propose (prepare) phase for each Paxos instance, saving a lot of time.
You can add bias to the coordinator-picking process to prefer a single coordinator, and skip the election process or avoid having competing coordinators in a majority of cases, saving time.
My favorite part of “Paxos Made Live” is the engineering challenges the authors encountered while implementing Paxos on a real system for Chubby, including:
- Hard disk failures: These can be addressed by converting the coordinator into a non-voter and using the logging mechanism to track everything until the next instance of Paxos picks up.
- Who’s the boss?: The replicas can elect a new coordinator without notifying the present coordinator, causing issues. They solve the problem by adding leases on each coordinator’s duration, during which other coordinators cannot be elected.
“Paxos Made Live” glosses over their handling of the group membership problem:
While group membership with the core Paxos algorithm is straightforward, the exact details are non-trivial when we introduce Multi-Paxos, disk corruptions, etc. Unfortunately the literature does not spell this out, nor does it contain a proof of correctness for protocols related to group membership changes using Paxos. We had to fill in these gaps to make group membership work in our system. The details – though relatively minor – are subtle and beyond the scope of this paper.
How did they solve this? We may never know.
Paxos Made Moderately Complex
Much like “Paxos Made Simple”, “Paxos Made Moderately Complex” implements a much more complex system than what the name implies. The authors implement Multi-Paxos and go into great depth about their implementation and optimizations to it, both as pseudocode and as C++/Python code (which is available online). The authors discuss several Paxos variants, concluding that Paxos is more like a family of protocols rather than a single implementation.
The authors introduce a new term, slots, to reason about
Paxos implementation. Slots are commands for each replica to drive its state
towards the right direction. In case different operations end up in the same
slot across different replicas, Paxos can be used to pick a winner. The slots
are not dissimilar to a write-ahead log discussed in other approaches. Confirmed
operations in slots can then be put in
slot\_out to be communicated to other
replicas. The system then describes several invariants over the slots and
replicas to ensure the Multi-Paxos algorithm remains safe and live.
The authors also introduce the notions of commanders and scouts, commanders being analogous to coordinators. The idea of scouts is interesting: scouts just act during the prepare phase, ensuring enough responses before passing the data back to their leaders. Leaders spawn scouts and wait for an adopted message from them, entering a passive mode loop while scouts try to get a majority response. Once in the active mode, the leaders get consensus to decide which commands are agreed upon in which slot (only one per slot) and use the commanders to dispatch the commands in the second phase to all the replicas.
This paper goes in-depth discussing the original Synod protocol — i.e., the subprotocol of Paxos that implements consensus, as opposed to handling state machine replication — and how it is limited by practical considerations.
The authors of “Paxos made Moderately Complex” also have a great resource which goes in depth about Paxos: paxos.systems.
Paxos Made Pragmatic
The authors of “Paxos Made Moderately Complex” note that satisfying the aforementioned invariants leads to rapid growth in the state that has to be maintained, as we keep all incoming messages at each slot along with the winner. The authors provide a few ways to optimize the implementation and actually make it practical:
- Reduce state: The system can reduce state maintained by keeping only the accepted value for each slot. This can cause issues if there are competing leaders or crashes, as agreed-upon values may be different for the same slot.
- Garbage collection: Once values are agreed upon by a majority of replicas, it is not necessary to keep the old values of applied slots. However, just deleting them might cause an impression that they are empty and can be filled. The authors address this by adding a variable to the acceptor state to track which slots have been garbage-collected.
- Flushing to disk: Periodically, the log can be truncated and flushed to disk, as keeping the entire state in memory is expensive and can be lost in a power failure.
- Read-only operations do not change the state, yet need to get data from a particular state. Treating such operations differently can help us avoid expensive operations.
riak_ensemble: Implementing Vertical Paxos in Erlang
The riak_ensemble consensus library implements Vertical Paxos (i.e., Paxos that allows hardware reconfiguration even if it is in the middle of agreement process) in Erlang to create consensus ensembles on top of key-value stores. riak_ensemble borrows ideas from other implementations, such as the leader leases described in “Paxos Made Live”, and allows leadership changes, tracking metadata with the clever use of Merkle trees. This allows creation of ensembles: consensus groups that implement Multi-Paxos on a cluster. The authors discuss their implementation in this video.
Raft is often touted as a simpler alternative to Paxos. Winner of a Best Paper award at Usenix ATC 2014, Raft promises a way to navigate through the dangerous and scary waters of consensus protocols. The original paper, in fact includes an in-depth user study at Stanford to provide evidence that Raft is easier to understand and explain than Paxos. While it is an open question whether Raft achieved its simplicity goals, it is widely used in a lot of large-scale systems. In this section we will discuss the Raft protocol and some implementations and contrast it with Paxos.
Raft performs leader elections to pick leaders for each round. The condition for leadership is stronger in Raft than in Paxos, as leaders are the only nodes that handle reads and writes to the log as well as log replication to all the replicas. Conflicting entries in followers’ logs can be overwritten to reflect the leader. The leader election protocol works with the help of a timeout mechanism for sending heartbeats to collect votes. The election phase itself comes with a timeout to reduce conflicts.
Formal verification and implementation of Raft with
etcd is a lightweight key-value store implemented using Go. It uses the Raft protocol for distributed consensus to manage its replicated log. etcd’s Raft library is also used by other big projects, like “Kubernetes, Docker Swarm, Cloud Foundry Diego, CockroachDB, TiDB, Project Calico, Flannel, and more”, and is a feature-complete implementation of Raft in Go, including some optional enhancements.
The verification project Verdi base their own key-value
vard, on the etcd implementation of Raft. They implement the Raft protocol
in the Verdi framework and verify it using Coq, a popular
formal verification tool; from there, it can be extracted to OCaml and is available
Raft in CockroachDB
CockroachDB is an open source alternative to Google’s Spanner, a highly available distributed store which uses TrueTime, a globally synchronized clock system, to support ACID properties on top of a distributed data store. The big advantage of these stores is that despite being scalable across continents, they allow linearizability and serializability, along with referential integrity. Raft is used extensively in CockroachDB to ensure that replicas remain consistent.
CockroachDB implements Multi-Raft on top of the Raft protocol to allow better scalability. This involves certain changes to how Raft works. It divides replicas into ranges, which locally implement Raft. Each range performs leader elections and other Raft protocol operations. Ranges can have overlapping memberships. Multi-Raft converts each node’s associated ranges into a group for Raft, limiting the heartbeat exchange to once per tick.
As we’ve seen, consensus protocols can be hard to understand and implement. Could programming language support for expressing these protocols help? In my next post, we’ll consider languages that are specifically designed for the task of implementing distributed protocols such as consensus protocols, and compare implementations in those languages to the general-purpose language implementations discussed here.