# An Overview of Distributed Systems and the Consensus Problem

## Distributed Systems

For the purposes of this overview, we'll define a distributed system as any single system that comprises of multiple separate processes that exists on different machines that communicate with one another over some network. In principle, the machines that run these processes could just as well be physically located on different continents as they could be located in the same closet. To account for this, we tend to make minimal assumptions about the properties of the network. Specifically, we allow for the possibility that messages can be lost or delayed indefinitely.Knowledgeable readers might recognize this as the shared-nothing asynchronous system model definition where "asynchronous" refers to the lack of timing assumptions we can make.

This broad definition means that a lot of software systems of varying inherent complexity can be thought of as distributed systems. This includes everything from global financial services such as bank systems and payment processors to data-hungry online ad targeting platforms that collect unfathomable amounts of clickstream data to any run-of-the-mill CRUD app that makes sure to store a backup copy of the app data on a separate machine.

As such, it stands to reason that understanding the fundamental challenges and advantages inherent in the design of such systems is an important element of understanding the field of software system design as a whole.

### Fundamental Challenges of Distributed Systems

#### Performance

As compared to all the other places that a CPU can fetch data from or push data to,Registers, cache, memory and disk, in order from closest to furthest away. a remote machine (even one within the same datacenter) is exceptionally distant. Considering the fact that relativity dictates that information can only ever travel as fast as the speed of light, this physical distance manifests itself in relatively dog-slow latency when communicating over the network.

Furthermore, that isn't necessarily the worst of it. In practice, it can be far worse!

As we'll see later on, there are two types of distributed coordination protocols: those that require a response from every process before proceeding (i.e. full quorum protocols) and those that require a response from the majority of processes before proceeding (i.e. majority quorum protocols). Let us assume that the round-trip times from all processes are i.i.d. random variables that are normally distributed with mean $$\mu$$ and standard deviation $$\sigma$$;.

For majority quorum protocols, the overall latency is bounded by the median round-trip latency. It can be shown that the expectation of the median of i.i.d. random variables is the median of the underlying distribution of the random variables. This seems to be as good as one might hope.

However, for full quorum protocols, it can be shown that the overall latency for a system with $$N$$ processes is upper bounded by $$\mu + \sigma \sqrt{2log(N)}$$.The linked derivation is for the special case where $$\mu = 0$$, but if you add an extra $$\mu t$$ additive term to the cumulant expression, then you get an extra $$\mu$$ additive term falling out in the final result (in accordance with natural intuition). This means that the greater the variability in the round-trip time ($$\sigma$$) or the greater the number of processes in the system ($$N$$), the worse you can expect performance to be.

Furthermore (as we'll also see later on), both full quorum protocols and majority quorum protocols require at least 2 round-trips before completion. Therefore, even majority quorum protocols involve taking a performance hit, even if they happen to scale well for a large number of processes.

In summary, it pays to respect the principle of locality of reference and distributed systems are as non-local as it gets. For a case study in this, see this article from Adam Drake where he managed to get a 235 times performance boost from rewriting a Hadoop job to a CLI one-liner. As Frank McSherry said in his blog post about scalability:

1. If you are going to use a big data system for yourself, see if it is faster than your laptop.
2. If you are going to build a big data system for others, see that it is faster than my laptop.
To put it in my own words, Hadoop and similar tools are not a substitute for regular-ass parallel processing.

#### Debuggability

Performance isn't the only place where relativity rears its head.

Let's say you want to record the ordering of events $$E_1$$ and $$E_2$$ for debugging purposes.Certainly, being able to step through the sequence of events that occurred in your system is an extremely useful property for debugging purposes. This is, in fact, how every single debugger works, in some form or fashion. Assume, then, that some observer observes event $$E_1$$ occurring at time $$t_1$$ and location $$L_1$$ and this same observer observes event $$E_2$$ occurring at some later time $$t_2$$ (i.e., $$t_2 > t_1$$) and location $$L_2$$. For our purposes, there are two different scenarios we are concerned with:

• $$\Delta L \leq c \Delta t$$Where $$c$$ is the speed of light.: That is, it is possible for a beam of light to have traveled between location $$L_1$$ and location $$L_2$$ in the time between the observer's observations of $$E_1$$ and $$E_2$$. We call events that have this relationship timelike separated. Every single observer that could possible exist observed event $$E_1$$ occurring at some time before $$E_2$$.
• $$\Delta L > c \Delta t$$: That is, it is impossible for a beam of light (and therefore any sort of information at all) to have traveled between location $$L_1$$ and location $$L_2$$ in the time between the observer's observations of $$E_1$$ and $$E_2$$. We call events that have this relationship spacelike separated. In the case of spacelike separated events, it can be shown that there could always exist some observer that observed event $$E_2$$ occurring before event $$E_1$$. Therefore, you cannot apply a total order on spacelike separated events.

Given that you cannot apply a total order on spacelike separated events and given the nature of distributed systems where processes can be running on different machines located in different parts of the world, it is then the case that the fundamental laws of physics don't necessarily admit a total order of events across a distributed system in the general case. This seems to throw a major wrench in our ability to debug distributed programs.

However, one may reasonably counter that the machines in their system are closely located enough to obviate the possibility of spacelike separated events for all practical purposes. There is another, more mundane reason to not rely on ordering produced by different clocks: the concept of clock drift.In practice, a clock drift of 1 second per day can be used as a rough baseline.

While time-based ordering is not very robust, especially if you're dealing with a large number of events and/or a large number of different machines in your system, all hope is not lost! We could, instead of using time-based clocks, use so-called logical clocks to try sequence our events. Logical clocks are data structures that preserve the causal relationship between events, without explicitly making use of the concept of time. The most common type of logical clock used in practice is a so-called vector clock.

In a vector clock, we maintain an array $$[t_1, t_2, ..., t_N]$$ of $$N$$ counters, one for each process in our system. Note that not only do we store a vector of counters for each process in our system, but each process also stores a copy of the logical clock vector (or timestamp) as well (i.e. for $$N$$ processes, each of the $$N$$ processes would keep track of its own $$N$$-dimensional timestamp). Whenever a machine sends or receives a message or does some work, it updates its vector clock timestamp according to the following rules:

• Whenever a machine does work, increment the vector clock timestamp value corresponding to that machine (e.g. process 5 would increment the 5th counter in the timestamp)
• Whenever a machine sends a message, include the full vector clock timestamp
• When a message is received:
• update each element in the vector clock to be $$max(local, received)$$
• increment the vector clock timestamp value corresponding to that machine

To get a sense of the usefulness of this, suppose we again have events $$E_1$$ and $$E_2$$ and we would like to understand the relative ordering of them. Let $$VC_{E_1}$$ be the vector clock timestamp at event $$E_1$$ and let $$VC_{E_2}$$ be the vector clock timestamp at event $$E_2$$. Define the relationship $$VC_{E_1} < VC_{E_2}$$ to mean that the values of vector $$VC_{E_1}$$ are less than or equal the values of vector $$VC_{E_2}$$ for all indices with at least one of the indices being strictly less than.

Using the above definitions, we can say:

• $$E_1$$ and $$E_2$$ are causally related if and only if $$VC_{E_1} < VC_{E_2}$$ or $$VC_{E_2} < VC_{E_1}$$. Specifically, for $$VC_{E_1} < VC_{E_2}$$ then we can definitively say that $$E_1$$ occurred before $$E_2$$ (and vice versa).
• $$E_1$$ and $$E_2$$ are concurrent if and only if neither $$VC_{E_1} < VC_{E_2}$$ nor $$VC_{E_2} < VC_{E_1}$$.

While this doesn't give us a total ordering of all events in our system (which relativity tells us would be an artificial construct anyway, as explained above), it does give us a reliable partial ordering of the events in our system. Specifically, the sequence of all causal histories in our system is reliably preserved.

#### Understandability

Suppose we're interested in writing some new data into our distributed system and then reading that same data out at some later point. Furthermore, let's suppose that while all of this is happening, a network partitionA network partition is when subsets of the network become completely isolated from one another for some period of time. happens across our system splitting it into subsystems $$S_1$$ and $$S_2$$ that are unable to communicate with one another. If our write request is sent to a machine in $$S_1$$ and our later read request is sent to a machine in $$S_2$$, then how should our system respond?

We would ideally like for the distributed nature of our system to be abstracted away so we can interface with it in ways that are most familiar. That is, it would be nice if all reads and writes on our system worked exactly like reads and writes on some data object held in local memory. This means we would like for our read request (sent to $$S_2$$) to return the value that we wrote (to $$S_1$$). However, for $$S_2$$ to be able to return the value written to $$S_1$$, $$S_1$$ would had to have sent that information to $$S_2$$ which violates the definition of a network partition. Therefore, our dream of being able to view our system as just another data object is dead in the water in the face of network partitions.One might object to the scenario described above, however, by saying that the whole idea of a network partition is a largely contrived concept anyway. Does it even make sense to care about network partitions in practice? The answer seems to be "Yes". In fact, Aphyr enumerates a myriad of case studies that suggest that network partitions are less contrived then one might be tempted to believe. This suggests that the practice of designing systems with this failure mode in mind is indeed sound in practice.

Given the reality that this is something that we must deal with, how should our system respond in the face of a network partition? As it turns out, there are two valid choices one could make. Deciding which one to choose is highly context-dependent.

1. One choice is to allow our write request to $$S_1$$ succeed and then let our read request to $$S_2$$ return a stale value. A system designed in such a way to do this is said to choose availability over consistency. We say it's available because we try our best to allow all read and write calls to succeed. We say it isn't consistent because, in allowing the write call to $$S_1$$ to succeed (which is not propagated to $$S_2$$), we no longer have a single snapshot of the data that exists across all processes in the system. This is why our read request returns a stale value.
2. Another choice is to let the write request to $$S_1$$ fail because it cannot coordinate with $$S_2$$ and then let our read request to $$S_2$$ return an up-to-date value. A system designed in such a way to do this is said to choose consistency over availability. We say it's consistent because we don't allow any write which would violate the principle that each process shares the exact same copy of the data. We say it isn't available because, in order to achieve this consistency, we let requests fail if our coordination protocol fails.

The tradeoff described above is known commonly as the CAP theorem which states that you can only choose at most 2 out of 3 from consistency, availability and partition tolerance.There also exists an extension of the CAP theorem known as the PACELC theorem. The PACELC theorem states that in case of a Partition, choose between Availability and Consistency. Else, choose between optimizing Latency and Consistency. The reason why this is the case is that the coordination protocols that enforce consistency necessarily induce a latency penalty, as described in the Performance section.

### Fundamental Advantages of Distributed Systems

With all that being said, there are some good properties that distributed systems have over single process systems.

#### Resilience

If we were to replicate data redundantly across multiple machines in our system, then it becomes possible for the complete failure of one or more machines to have little or no impact on the overall system. Resilience via redundancy is a commonly applied principle in engineering in general, especially in safety-critical systems like spacecrafts and aircrafts. This allows us to be robust to entirely different classes of failures than are possible with single process systems, such as a complete failure of a hard diskThough the likelihood of this can be obviated by using an appropriate RAID scheme. or even complete datacenter failures.

#### Content Serving Latency

We mentioned before when discussing disadvantages of distributed systems how physical proximity plays a role in latency. While this suggests we want to keep any sort of internal computation as local as possible, it also suggests a way to be clever when distributing content to outside clients. Instead of having one central location that all clients must retrieve data from, we can store replicas of the data across multiple locations and have each client retrieve the data they want from the geographically closest replica.Indeed, this is exactly how CDNs work. This type of setup can make content retrieval multiple times faster than having content served from one central location.

#### Cost

Finally, there is the dollars and cents view of things. Commodity level hardware tends to have more favorable price per performance characteristics than high-end hardware. Therefore, if you need to accomplish some computational task at a large scaleWhether that's large scale in terms of memory or CPU cycles or disk space or IOPS., it can be drastically more economical to perform the computational task over a distributed system rather than building out one high-end machine to run a single process. In fact, some back-of-the-envelope calculations from Barroso and Holzle (page 32), suggest that commodity hardware can be up to 20 times more cost efficient than high-end hardware. Barroso and Holzle (page 35) also show that the performance gap between a cluster of high-end machines and a cluster of commodity grade machines decreases as the size of the cluster increases and the amount of communication overhead decreases.

There are two main paradigms under which we can model distributed systems: weak consistency and strong consistency.

We can define weak consistency to mean that different processes can have differing copies of the data at any point. Analogously, we can define strong consistency to mean that all processes must always share a single canonical view of the data.Note that we've discussed this distinction before when describing the CAP theorem. The system that accepted writes during network partition in our thought experiment could be said to exhibit weak consistency while the system that rejected writes during network partition could be said to exhibit strong consistency.

These definitions are somewhat limiting, though, in their ability to help us understand distributed systems. In particular, under our definition, weak consistency isn't really a guarantee so much as it is the absence of a guarantee. In fact, a system where every process accepts reads and writes completely independently of one another could be said to exhibit weak consistency for the given definition, which underlines its practical uselessness.

For this reason, we'll choose to focus our attention specifically on the notion of eventual consistency, which is a subset of weak consistency. Under eventual consistency, different processes may have differing copies of the data at any point, but they will eventually agree on a single canonical view of the data.

### Eventual Consistency Protocols

In order to achieve eventual consistency, there are two main (non-mutually exclusive) modes of coordination: coordination that happens at read time (i.e. read repair) and coordination that happens asynchronously in the background on a constant basis (i.e. asynchronous repair). Additionally, there are certain computations that you can formulate in a way that is eventually consistent without performing any coordination at all.

In read repair, when the client asks for a piece of data, the data is retrieved from $$R$$ of $$N$$ machines where $$R$$ is a configurable value.Note that if it is the case that, when writing, data must be written to $$W$$ of $$N$$ processes where $$R + W > N$$ then every write will seen by a subsequent read. This is because every read quorum (i.e. set of $$R$$ processes you must read from) must intersect with every write quorum (i.e. set of $$W$$ processes you must write to) at at least one process. Along with these $$R$$ copies of the data, each machine sends a vector clock timestamp of the last time that piece of data was updated. The vector clock timestamps allow us to determine which copy is the most recent copy. In the case where there is a tie for which copy is the latest between two or more copies (i.e. the copies have concurrent vector clock timestamps), the client must have some logic to determine which copy to choose. With the most recent copy of the data across the $$R$$ machines now known, that copy is written to the machines that did not previously have it.

#### Asynchronous Repair

We'll discuss two different methods of asynchronous repair: primary/backup replication, which has a single point of failure, and gossip protocols, which do not.

##### Primary/Backup Replication

In primary/backup replication (or primary/backup log shipping), we designate one process as the "primary" (which is our single point of failure) and the other process(es) as the "backup". What this means is that whenever a write request gets sent to the system, the request immediately executes on the primary and then the primary sends a log of the corresponding changes to the backup. The backup process(es) then execute these changes once they receive them from the primary. The time between the request executing on the primary and the request executing on the backup is referred to as the replication lag.

One thing to note about primary/backup replication is that it is susceptible to what is known as "split-brain". Split-brain happens when the primary is unavailable for some temporary period which initiates failover which promotes one of the backup processes to be the primary. When the initial primary becomes available again, it still believes itself to be the primary meaning. This means that there are now two different processes that believe themselves to be the primary in this scenario.

##### Gossip Protocols

Gossip protocols are a general term for peer-to-peer protocols meant to disseminate information across members of some network. As such, there are many different gossip protocol implementations, but they mostly can be described simply as a protocol such that every $$t$$ seconds, each member of the network synchronizes with some other randomly selected member of the network with probability $$p$$. The values $$t$$ and $$p$$ are chosen so that the latency involved with synchronization doesn't overwhelm the members of the network.

#### No Coordination

##### CRDTs

CRDTs (i.e. convergent replicated data types) are data structures that can be replicated across different processes in a distributed system. They have the property that if the CRDT stored in one process receives the same set of messages (regardless of message order or message redelivery) as the CRDT stored in another process, then the two copies of the CRDT will resolve to the same exact value with any direct coordination.

Concretely, if the computation you are performing satisfies the following properties:

• Associativity: $$a+(b+c) = (a+b)+c$$ (i.e. invariant with respect to grouping)
• Commutativity: $$a+b = b+a$$ (i.e. invariant with respect to order)
• Idempotency: $$a+a = a$$ (i.e. invariant with respect to duplication)

then the computation can be formulated in terms of a CRDT.The term for mathematical constructs that have some such operation defined over them is a semilattice.

The set of computations that can be formulated using CRDTs is somewhat limited, but it is an active field of research.Note that there is a deeply related construct also called CRDTs (commutative replicated data types) that obeys associativity and commutativity but not idempotency. Such data structures rely on the underlying network to enforce idempotency through exactly-once delivery of all messages.

##### CALM Theorem

The CALM (consistency as logical monotonicity) theorem defines the specific conditions under which a distributed program has a consistent result without the need for any coordination.As such, it can be seen as a generalization of the overall premise behind CRDTs. Specifically, it states that a distributed program has a consistent result without the need for any coordination if and only if it is monotonic.

We say that a program is monotonic if it can be formulated in terms of monotonic logic. Monotonic logic is a logical framework through which it is impossible to invalidate any premise based on new information. Accordingly, non-monotonic logic is a logical framework through which it is possible to invalidate a premise based on new information.

Thinking concretely in terms of the sorts of SQL/Datalog queries that correspond to monotonic logic, any query whose results can only get more accurate with more information would qualify as being logically monotonic.For example, retrieving the set all of the records that have some property $$P$$ would only get more accurate as you process more records. Meanwhile, you can think of any query whose results don't necessarily get more accurate with more information as being logically non-monotonic.For example, when aggregating the sum of some numeric property $$P$$ across a set of records, it may be the case that the retrieval of a new record might take your computed aggregated value further away from the "true" value.

Much like CRDTs, the practical use cases of this are limited at the time of writing, but it is an active field of research.Notably, the BOOM group at Berkeley developed the Bloom language to take advantage of CALM theorem. It's unclear if it's still under development but it appears as if the Fluent project from the RISE group at Berkeley is its spiritual successor.

### Strong Consistency Protocols

We can divide the set of strong consistency protocols into two categories:

• Protocols that are not robust to network partitions
• Protocols that are robust to network partitions
We'll now describe the two-phase commit protocol (which fits into the first category) and the Paxos protocol (which fits into the second category).

#### Two-Phase Commit

In the two-phase commit protocol, there is one process that acts as the coordinator (analogous to the primary from primary/backup replication) with the other processes acting as its peers (analogous to the backup from primary/backup replication). Furthermore, the two-phase commit protocol has (unsurprisingly) two phases: a voting phase and a commit phase.

In the voting phase, the coordinator takes a write request and sends a message to each of its peers asking them if it is okay for them to commit this write request. Each peer writes the request to temporary storage and sends a reply message back to the coordinator telling it whether or not the peer can commit that request. The coordinator waits for every single peer to respond. This concludes the voting phase.

If every single peer responds that it is okay to commit the write request, then the coordinator sends a message to each of its peers telling them to commit. If even one of the peers responds that it is not okay to commit the write request, then the coordinator sends a message to each of its peers telling them to rollback the request. Each peer correspondingly either commits the request from temporary storage to permanent storage or removes the request from temporary storage and then sends an acknowledgment back to the coordinator. Once the coordinator receives an acknowledgment from every peer, then it sends a response to the client that issued the request.

As mentioned before, two-phase commit is not robust to network partitions and a keen reader would be able to immediately see why. In both of the phases, the coordinator must wait for a response from every single peer. If any one of those peers are behind a partition boundary, then the protocol will be waiting for an indefinitely long period of time (i.e. until the partition is resolved).

#### Paxos

Paxos is a coordination protocol that solves the problem of distributed consensus. What that means in practical terms is that Paxos gives us the ability for distributed processes to agree on some single valueWhen speaking about "values" to agree upon, you can think of these "values" as just being write requests. These values can be more general, though, and we'll see some examples of that later on in this article. even in the face of network partitions.

In Paxos, our processes are split into two main roles:

• Proposers which propose values
• Acceptors which accept values

Firstly, in the absolute simplest scenario where we have only one acceptor and no failure or message loss of any kind, we want to be able to guarantee that a value is decided upon. To facilitate this, we specify that an acceptor must accept the first proposal that it receives, no questions asked.

In the simple scenario described above, the sequence of actions is very simple:

• The proposer proposes some value $$V$$ to the acceptor
• The acceptor receives that proposal (due to there being no message failure of any kind)
• The acceptor accepts the value $$V$$
• We say that $$V$$ has been decided
This then leaves us having to figure out how to solve for all of the more difficult scenarios. For example, what should we do in the case where we have multiple acceptors? The fundamental nature of consensus is that we must decide on one and only one value. How then should we determine the condition for saying that a single value has been decided across all of the acceptors?

Assuming all acceptors can only accept one value, then it follows that if some set of acceptors $$S$$ which forms a majority of acceptors all accept some value $$V$$ then there can exist no other set of acceptors $$S'$$ which forms a majority of acceptors that accepts a value $$V' \neq V$$ (i.e. there can only ever be one possible winner of a strict majority vote). Therefore, we'll say that if a majority of acceptors accept some value $$V$$, then we can consider the value $$V$$ to be decided.

This poses a problem, though, in the case where there are multiple competing values. What are we to do in the circumstance where one-third of the acceptors accept value $$V_1$$, one-third of the acceptors accept value $$V_2$$ and one-third of the acceptors accept value $$V_3$$? We need to introduce some kind of mechanism that allows acceptors to accept more than one value. In order to facilitate this, we introduce the concept of assigning a proposal number to each proposal. You can think of proposal numbers as being equivalent to rounds of voting, where the rounds of voting that the acceptors go through are indexed by the proposal number.

Now what remains to be enforced is that if some value $$V$$ is decided by the acceptors at proposal number $$n$$, then the same value $$V$$ must be decided by the acceptors at all proposal numbers $$m > n$$. If this condition didn't hold, then it would violate the fundamental principle that we must decide on one and only one value.

In order to enforce this, we want to be able to say that if value $$V$$ is decided at proposal number $$n$$, then every proposal with proposal number $$m > n$$ accepted by any acceptor has the value $$V$$. However, we've already specified that our acceptors simply accept the first proposal that they see, so we can't enforce this constraint on the acceptor side of things. We need to enforce this constraint on the proposer side of things. That is, what we actually want to be able to say is that if value $$V$$ is decided at proposal number $$n$$, then every proposal with proposal number $$m > n$$ proposed by any proposer has the value $$V$$. Enforcing this constraint is where things get a bit tricky.

Let's say that value $$V$$ has been decided for proposal number $$n$$ and a proposer is looking to propose a value for proposal number $$n+1$$. From what we've said, the proposer must end up proposing the value $$V$$. How can we ensure that the proposer knows this? What we can do is have the proposer send out a request (called a prepare request) to the set of acceptors asking them to respond with the value and proposal number of the highest-numbered proposal (with proposal number less than $$n+1$$) they have accepted, if they have accepted any such proposal. If they have not accepted any such proposal, then send back a response acknowledging so. The proposer then waits for a majority of acceptor to respond before determining what value to propose. There are 2 possible scenarios:

1. The responding acceptors all have not accepted any proposal numbered less than $$n+1$$. Under this condition, the proposer is free to propose any value that it wants for proposal number $$n+1$$.
2. Some responding acceptor has accepted a proposal numbered less than $$n+1$$. Under this condition, the proposer must propose the value $$V$$ for proposal number $$n+1$$ where $$V$$ is the value of the highest numbered proposal reported to be accepted by the responding acceptors.

From what we've seen, this scheme seems promising. However, there is somewhat of a subtle issue with it.

Let's again say we have decided on value $$V$$ at proposal number $$n$$. From what we have implemented, it must be the case that if we later decide on a value for proposal number $$m > n$$, then the decided value must be $$V$$. However, we have not implemented any mechanism to prevent us from later deciding on value $$V' \neq V$$ at proposal number $$n-1$$, which would violate the principle that we must decide on only one value. In order to implement this, as part of an acceptor responding to a prepare request for some proposal at proposal number $$n$$, we make the acceptor promise not to accept any proposal with proposal number less than $$n$$.

We now have a complete description of phase 1 of our complete consensus algorithm. The proposer sends out prepare requests to the acceptors asking them to not accept any more proposals with a proposal number less than $$n$$. Upon receving a prepare request for proposal number $$n$$, an acceptor will not respond if it has already promised to not accept any proposals with number $$n$$. Otherwise, it will respond back to the proposer acknowledging that it has promised not to accept any proposals less than $$n$$. Furthermore, it will also send the proposer the value and the proposal number of the highest numbered proposal that it has accepted. The proposer waits for a majority of acceptors to respond so that it knows what value it is allowed to propose for proposal number $$n$$.

Phase 2 is relatively straightforward. The proposer sends out proposals to the acceptors and waits for the acceptors to send back an acknowledgement that they have accepted the proposal. Note that if an acceptor has promised to not accept a proposal of the given number in the meantime between phase 1 and phase 2, then it simply will not accept the proposal or respond in phase 2 (alternatively, it can inform the proposer of this promise; neither decision would violate correctness). If the proposer receives an acceptance acknowledgement from a majority of acceptors, then it sends out a message to all acceptors alerting them of the value that has been decided.

## Applications of Consensus

As it so happens, we can use consensus as the main tool to solve a number of interesting problems in distributed systems. Here we go through a brief overview of a few such problems.

### Mutual Exclusion

In the mutual exclusion problem, we want to make sure that only one process has access to some resource at a given time. Typically, the reason for this is to avoid race conditions where two different processes try to edit a single resource concurrently.

For example, let's say Dave has $1050 in his bank account and he tries to purchase two different items simultaneously: item $$A$$ that costs$1000 and item $$B$$ that costs $100. Then let's say the process for withdrawing from his account looks something like this: def withdraw(account, amount): initial_balance = account.balance() if amount > initial_balance: raise Exception("Insufficient funds") account.set_balance(initial_balance - amount) return account.balance() Without applying some synchronization mechanism to the withdrawal process, it's possible for the execution of the purchase of item $$A$$ and the purchase of item $$B$$ to be interleaved such that the purchases of both items succeed and Dave has either$50 or even \$950 in his bank account.

Instead, what we would want to do is grant each withdrawal process mutually exclusive access to the bank account object. If a withdrawal process is started while another withdrawal process has mutually exclusive access to the bank account object, then it waits until the other withdrawal process is done before starting. This would look something like this: def withdraw(account, amount): mutex.acquire() initial_balance = account.balance() if amount > initial_balance: raise Exception("Insufficient funds") account.set_balance(initial_balance - amount) new_balance = account.balance() mutex.release() return new_balance

With this synchronization mechanism in place, only one of the purchases of item $$A$$ or item $$B$$ would succeed (i.e. whichever one was processed first).

If we let the "values" that we're voting on in the consensus protocol be which process owns mutually exclusive access to some resource, then we can directly solve mutual exclusion via consensus.

In the leader election problem, we want to designate one process from a group of processes as the "leader" or the "coordinator" in some sense. An example of when we would want to do this is when implementing Paxos.

It wasn't mentioned in the description of the Paxos protocol because it doesn't affect the correctness of the protocol, but it's always a good idea to elect one proposer as the leader through which all proposals get routed. The reason for this is that if you allow multiple proposers to directly propose values to the acceptors, then it's possible for two proposers to deadlock one another by constantly sending prepare requests for higher and higher proposal numbers before the acceptors ever get the chance to even accept a value at all. Electing a single leader guarantees that progress can be made.

If we consider the role as leader to be a "resource", then we can immediately see that leader election is simply a special case of mutual exclusion. Therefore, leader election is also directly solvable via consensus.

In the atomic broadcast (or total order broadcast) problem, we want to ensure that all processes in our system process the exact same set of messages in the exact same order. We can again fit this directly into the framework of a consensus protocol by saying that the "value" that we're trying to agree on in our consensus protocol is a sequence of messages. Therefore, atomic broadcast is yet another problem that is directly solvable via consensus.

### Fast Generalized Paxos

See a toy implementation of the Fast Generalized Paxos algorithm as described in the recent paper A Generalised Solution to Distributed Consensus by Heidi Howard and Richard Mortier.