
Eventual Consistency is no silver bullet for the CAP Theorem
Eventual Consistency is often stated as the solution to the consistency/performance tradeoff in distributed systems (aka the CAP Theorem), but it's just one (weak) solution of many.
There is a fundamental tradeoff in distributed systems between consistency and performance, coarsely defined by the CAP Theorem. This tradeoff pops up whenever you have more than one copy of any given piece of data, which is frequent. I have seen a common narrative play out over my career when builders and stakeholders are confronted with this tradeoff. Inevitably, when data is distributed, there comes a point when lack of consistency or performance becomes noticeable enough to be classified as a defect. Someone points out that you can’t have one without losing the other. And then someone replies with the line I have heard several times - “we don’t need real-time consistency, Eventual Consistency is good enough” - mic drop.
The appealing idea here is that if you can accept a period of staleness between two copies of data, then you don’t have to suffer the performance cost of synchronizing the copies on write. You can queue updates and propagate them to the other copies in the background. At a high level, it seems like an easy tradeoff. People are accustomed to a bit of staleness, and it seems a small cost to pay to avoid blocking write operations on updating all copies - blocking that also means writes fail when you can’t reach the other copies. Often it’s these failures in one service cascading into failures in another that bring the issue to a head.
The problem with this idea is that Eventual Consistency is not just a bit of staleness. Eventual Consistency is not the silver bullet solution to this problem (spoiler - there is no silver bullet). In fact, much of the literature on consistency in distributed systems excludes Eventual Consistency altogether because it is a very weak consistency model, and there are “stronger” models that perform just as well¹.
The primary weakness of Eventual Consistency is that it provides absolutely no ordering guarantees, and therefore changes may be applied out of order, resulting in intermediate anomalous states. Values may appear in replicas that would never have been seen in the primary. Let’s say you have a distributed data object with two operations, increment (+) and decrement (-). There is a primary data store and a single read replica. The data value starts at 0, and a user issues several increment and decrement commands in fast succession. Because the communication between the client and server is synchronous, these are applied to the primary in the correct order - let’s say +, -, +, -, +. The data value ends up as 1 in the primary. Concurrently with the operations mutating our data value, the primary service enqueues the increment and decrement events to propagate to the read replica.
Eventual Consistency offers no guarantees that these events are published, consumed, or made visible in order. The read replica could process them in the order -, -, +, +, +. Eventually, the value converges to 1, which is good. However, in the interim the value reads -1 and -2, states the primary never entered. There may be business invariants that specify that this value should always be positive. How does an application gracefully handle these anomalous states that violate business invariants? You can probably think of cases where these anomalies would be acceptable, and cases where they would not. That highlights the point of this article - Eventual Consistency is not the solution to the consistency/performance tradeoff of distributed systems.
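The scenario above can be simulated in a few lines. This sketch just replays the same operations in the two orders described, recording every intermediate state; note that both sequences converge to 1, but only the replica’s reordering passes through the anomalous negative values:

```python
def apply_ops(start, ops):
    """Apply +1/-1 operations in order, recording every intermediate state."""
    states = [start]
    for op in ops:
        states.append(states[-1] + (1 if op == "+" else -1))
    return states

primary_order = ["+", "-", "+", "-", "+"]
replica_order = ["-", "-", "+", "+", "+"]  # a reordering Eventual Consistency permits

print(apply_ops(0, primary_order))  # [0, 1, 0, 1, 0, 1] - never negative
print(apply_ops(0, replica_order))  # [0, -1, -2, -1, 0, 1] - anomalous -1 and -2
```

If “always non-negative” is a business invariant, the replica briefly violates it even though the primary never did.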
There are, in fact, many other solutions that are more or less appropriate in different contexts where the tradeoffs have different dimensions and weights. Eventual Consistency is but one Consistency Model. Another consistency model I quite like, because it performs just as well as Eventual Consistency but offers logical ordering guarantees, is Causal Consistency. Causal Consistency is basically Eventual Consistency plus causal ordering, meaning that the protocol for replicating changes has to include some data structure for tracking the causality relation that orders events. Events that can have a causal effect on other events have to be applied (or “visible”) before their dependents. In the example above, the out-of-order decrement events just wouldn’t be visible until the increment events they depend on had been applied.
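Here is a deliberately simplified sketch of that buffering idea - not any particular protocol, and real implementations track causality with vector clocks or similar structures. Each event names the event it causally depends on, and the replica holds back any event whose dependency hasn’t been applied yet:

```python
def causal_apply(events, start=0):
    """events: (event_id, depends_on, delta) tuples in arrival order."""
    applied = {None}                # None means "no dependency"
    value = start
    states = [value]
    pending = list(events)
    progress = True
    while pending and progress:
        progress = False
        still_pending = []
        for event_id, depends_on, delta in pending:
            if depends_on in applied:   # dependency satisfied: apply now
                value += delta
                applied.add(event_id)
                states.append(value)
                progress = True
            else:                       # dependency missing: buffer it
                still_pending.append((event_id, depends_on, delta))
        pending = still_pending
    return states

# A decrement arrives before the increment it depends on; it is buffered.
arrivals = [("e2", "e1", -1), ("e1", None, +1)]
print(causal_apply(arrivals))  # [0, 1, 0] - the decrement waits for its increment
```

The replica still converges asynchronously, but it only ever passes through states the primary could have been in.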
While Causal Consistency strikes a great balance of consistency and performance, the tracking of causal ordering and logic to make visible only those events whose dependencies are satisfied does add complexity. For that reason, I have rarely seen Causal Consistency “in the wild” - it’s hard to get everyone aligned on a complex solution, especially across team boundaries, which is typically where data gets replicated (see Conway’s Law - a single team is unlikely to distribute their data in a way that requires complex synchronization). I think the implementation of Causal Consistency is actually not excessively complex, but understanding why you would care about an abstract concept like “causality” requires understanding of a complex problem - which is why I’m trying to spread that understanding.
Another aspect to consider is that both Eventual and Causal Consistency are “single-object” Consistency Models, which means they are only relevant for modeling consistency of single operations on individual distributed data objects, and not distributed transactions that may span multiple operations on multiple distributed data objects. If you have a transactional operation that requires multiple operations - possibly across multiple distributed objects - to commit or roll back atomically, then you have to use a different strategy. There are transactional Consistency Models, for example Serializable Consistency, which can be implemented with a 2- or 3-phase commit algorithm. You may be familiar with some of these in the form of transaction Isolation Levels in SQL DBMSs (e.g. Read Committed, Read Uncommitted).
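To make the two-phase commit idea concrete, here is a bare-bones sketch: a coordinator asks every participant to prepare, and commits only if all of them vote yes. The class and method names are illustrative, and a real implementation also has to handle timeouts, crashes, and durable logging:

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):            # phase 1: vote yes or no
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):     # phase 2: commit or roll back
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    decision = all(votes)                         # commit only if unanimous
    for p in participants:                        # phase 2: broadcast decision
        p.finish(decision)
    return decision
```

If any single participant votes no, every participant rolls back - which is exactly the atomicity that single-object models like Eventual or Causal Consistency don’t give you.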
Another nuance to consider when shifting to an eventually consistent model is that you cross the boundary from synchronous to asynchronous communication. This is a qualitatively different low-level detail that leaks all the way up the stack. If you were to, for example, use a CQRS-style architecture where writes go to one store and reads happen in an eventually (or causally) consistent replica, then you must account for the latency between writes and reads in your product design. As a perhaps archaic-seeming example in 2024, if you are issuing a write and then reading the result of that write in a full request/response cycle, like submitting a form and then rendering a view of the form as the response to the form submission, you may render an old view. While old school, request/response is pretty darn simple, and therein lies its value. CQRS and eventually consistent stores also have value, but it’s important to think about the door you walk through when you change from a synchronous to an asynchronous communication protocol.
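The read-after-write pitfall fits in a toy sketch (names are illustrative, not any real framework): the write lands on the primary synchronously, replication happens later in the background, and an immediate read from the replica - which is where a CQRS read side would look - returns the stale value:

```python
class Store:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.queue = []            # pending replication events

    def write(self, key, value):   # synchronous write to the primary
        self.primary[key] = value
        self.queue.append((key, value))

    def read(self, key):           # reads are served by the replica
        return self.replica.get(key)

    def replicate(self):           # runs later, in the background
        while self.queue:
            key, value = self.queue.pop(0)
            self.replica[key] = value

store = Store()
store.write("form", "submitted")
print(store.read("form"))   # None - the replica hasn't caught up yet
store.replicate()
print(store.read("form"))   # 'submitted'
```

Rendering the response between `write` and `replicate` is exactly the stale-view problem described above.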
All this is to say that Eventual Consistency is not always the answer to consistency and performance woes. It certainly can be, but there are complex tradeoffs to consider. Too often, I think, Eventual Consistency is offered as a solution without a full understanding of what it really means - which is not just “a little delay in consistency”. It’s fairer to characterize Causal Consistency that way. Eventual Consistency is relatively simple, but still more complex than fully synchronous consistency. It is a weak model that allows anomalous intermediate states that may violate business invariants. And it may not be appropriate for replicating transactional writes. Those downsides may very well be outweighed in some contexts, but it’s important to be aware of them and explicitly accept them, or choose another solution.